An interesting dual-site ScaleIO Configuration (probably unsupported)

ScaleIO is a member of the new class of scale-out storage systems that let you grow your storage by adding nodes, either in a hyperconverged configuration with the storage components running as VMs on your hypervisors or as a bare-metal storage cluster.

I have been a fan of this type of architecture since it removes many of the limitations of traditional scale-up SANs and offers (potentially) a new degree of portability and, at long last, an end to the fork-lift upgrade cycle.

However, the latest version of ScaleIO makes some odd design choices that can be problematic in smaller and mid-sized environments. Specifically, it now enforces a minimum of three fault sets (should you decide to use them). A fault set is a group of nodes that are likely to fail together because of some common dependency, typically power to a rack. For data protection, whenever a block is written a second copy is written to another node in the cluster; with fault sets in play, that second copy must land on a node outside the fault set holding the original block, so the data remains available if the whole fault set goes down.

The problem with ScaleIO’s new enforcement of the three fault set model is that you can no longer easily build a dual-room configuration for availability, which is pretty much the standard highly available design in small and medium-sized environments (and in quite a number of large ones). With this limitation in mind, and knowing a bit about how ScaleIO places its data paths and metadata, I decided to see whether this was really a hard limit or whether there was a way to work around it and build a more traditional dual-site configuration with the 2.0 release.

Cluster configuration

To ensure a minimum level of viability when one site is offline, I set up a test bed with a cluster of two fault sets of three nodes each. The nodes all have three 100 GB disks (yes, these are virtual machines). There is also a third fault set containing a single node with the minimum 100 GB of storage assigned to it.
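For reference, the layout described above was declared with scli along these lines. The protection domain, storage pool, fault set and SDS names, as well as the IP addresses, password and device paths, are placeholders from my lab, and the exact flags may differ slightly between ScaleIO releases, so take this as a sketch rather than a copy-and-paste recipe.

    # Log in to the MDM (credentials are placeholders).
    scli --login --username admin --password 'Password123!'

    # Declare the three fault sets inside the protection domain.
    scli --add_fault_set --protection_domain_name pd1 --fault_set_name fs-room-a
    scli --add_fault_set --protection_domain_name pd1 --fault_set_name fs-room-b
    scli --add_fault_set --protection_domain_name pd1 --fault_set_name fs-witness

    # Register an SDS node in its fault set with one of its 100 GB devices;
    # the other nodes are added the same way, and the remaining disks can
    # go in afterwards (for example with --add_sds_device).
    scli --add_sds --sds_name sds-a1 --sds_ip 10.0.10.11 \
         --protection_domain_name pd1 --fault_set_name fs-room-a \
         --storage_pool_name sp1 --device_path /dev/sdb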

There is a shared L2 network across the entire cluster for storage services, so this is equivalent to having a stretched VLAN across the two rooms.

On the MDM side of things, I used the five-node cluster configuration with the primary MDM in one fault set and the standby in the second fault set.
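Once the MDM cluster is up, a quick sanity check from the CLI confirms which node holds the master role and where the standby and tie-breaker members sit (output omitted here):

    # Verify the MDM cluster mode and the role of each member.
    scli --query_cluster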

These are attached to a three-node vSphere cluster to generate load and test connectivity with a half-dozen Linux VMs.

Operations

Once all of the ScaleIO nodes are online, I can use the CLI or the vSphere plugin to create volumes and map them from the cluster to the SDCs on the ESXi hosts. Here there is no problem. ScaleIO does raise an alert reporting that the fault sets are not balanced, but the only consequence is that data is distributed across the fault sets by percentage of capacity used rather than equally by volume. Otherwise, the cluster is fully operational. At this stage I have all of the VMs running nicely and am using bonnie++ to generate read and write load across the cluster.
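From the CLI side, the volume operations look roughly like this; the volume name, size and SDC addresses are placeholders from my lab, and the same thing can be done through the vSphere plugin:

    # Create an 80 GB volume in the storage pool and map it to the ESXi SDCs.
    scli --add_volume --protection_domain_name pd1 --storage_pool_name sp1 \
         --size_gb 80 --volume_name vol-vmfs01
    scli --map_volume_to_sdc --volume_name vol-vmfs01 \
         --sdc_ip 10.0.10.51 --allow_multi_map
    scli --map_volume_to_sdc --volume_name vol-vmfs01 \
         --sdc_ip 10.0.10.52 --allow_multi_map

Inside each Linux guest, the load is nothing more sophisticated than bonnie++ looping against a filesystem that lives on the ScaleIO-backed datastore (the mount point and sizes are specific to my VMs):

    # 4 GB of sequential and random I/O per pass, looping to keep the cluster busy.
    while true; do
        bonnie++ -d /mnt/testdata -s 4096 -r 2048 -u root
    done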

At this point I politely take the single node in the third fault set offline using the delete_service.sh script in /opt/emc/scaleio/sds/bin.
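Taking the node offline is just a matter of running the bundled script on the host itself; I ran it as root with no arguments:

    # On the single node in the third fault set.
    cd /opt/emc/scaleio/sds/bin
    ./delete_service.sh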

This has the expected result of triggering a rebuild operation to re-protect the blocks that were stored on the 100 GB of the third fault set. Since there is relatively little data involved, this goes fairly quickly.

At this point, the storage is still available and operational to the SDCs and everything is running. However, there is one limitation: I cannot modify the structure of the cluster while the third fault set is offline. That is to say, I can’t create or delete volumes presented to the SDCs. In steady-state operation this is not a big deal, since I don’t modify the volumes on a daily basis.

Once the rebalance has finished, I have my desired state: a dual-site setup with data being written across the two fault sets that are online. Now for the “disaster” test. Here I brutally power off all three nodes in one of the remaining fault sets and observe the results. The storage remains available to the SDCs and the VMs keep running and generating read/write traffic, so we have a reasonable DR test for a single-site failure.
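For the record, the “disaster” is nothing more subtle than an immediate power-off of the three nodes (since these are VMs, a hard power-off from the hypervisor works just as well), followed by a look at the cluster state from the MDM; the hostnames are placeholders from my lab:

    # Pull the plug on an entire fault set (room B in my lab).
    for host in sds-b1 sds-b2 sds-b3; do
        ssh root@${host} 'poweroff -f'
    done

    # Check the overall cluster state and capacity afterwards.
    scli --query_all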

Now for the fail-back: I bring the nodes in the failed fault set back online and the expected rebuild operation kicks off, re-establishing the two-fault-set cluster with blocks distributed across both fault sets.

Summary

ScaleIO is an impressively robust and resilient system that allows for things the designers probably didn’t have in mind. That said, a simple dual-room setup based on two fault sets with a minimum number of nodes per fault set should be part of the standard configuration options, given how common this type of design is and to put ScaleIO on level competitive ground with the dual-site HA offerings available from HP, Huawei, DataCore and others.

And to finish, I would also recommend separating the MDM roles from the SDS roles onto completely different systems, perhaps in VMs pinned to local storage at each site, for a clear separation of responsibilities. For those getting started with ScaleIO, the fact that the two roles can cohabit on the same servers can lead to some confusion when you’re not yet clear on the dependencies.