Sigh. This kind of stuff is really annoying. I'm in the process of building up a storage system using some of the latest kit from Dell and just ran into some very interesting problems.
The setup is two Dell R900s coupled to MD1000s with the latest PERC 6/E SAS controllers. Our initial benchmarks on the system are really quite impressive. Now I've moved on to the acceptance tests to validate how the system reacts to various types of failure and how it recovers.
I'm using SANMelody on the systems to form a high availability SAN, and I have a set of standard failure tests that I've run on various similar setups using HP and IBM equipment, mostly MSA30 and MSA50 disk bays and various other JBODs. One test is a brutal crash of the bay that's acting as the primary storage. The reaction is just as expected: all of my servers fail over gracefully to the second server, even under extremely high IO load from multiple ESX Servers with VMs running IOMeter. When the machine comes back up, the two servers agree that they can't trust the data on the crashed system, so the mirrors are cleaned and automatically resynchronised. It takes a while, but even with the IOMeter load hammering the backup server it puts everything back and goes merrily along its way.
All of my other standard tests (gracefully stopping the SANMelody service, cutting the replication link, etc.) work as expected. Very smooth, and everyone is happy.
Then I get to the power failure of a disk bay and my day goes to hell. Cut the power to the bay, let it sit for a while, and watch SANMelody reroute IO to the other server; still no interruption of service. So far so good. I power the MD1000 back up and watch the SANMelody console, waiting for the volumes to come online. Wait. Wait. Wait. That's not good. Every other disk system I've tested this on brings the disks back online automatically.
Rescan the disks - still nothing from the disk bay. Open up the Dell OpenManage web console and see that it has identified the disks as being a foreign configuration. That's bad. I import the foreign configuration from the disks using the OpenManage console and the volumes start coming back online. There are a couple of things here that are really not good. I shouldn't need a manual intervention to bring my disks back. On top of that, OpenManage has created a phantom hot spare in slot 22 of a 15-disk bay. I have no idea what happens to my bay if it tries to rebuild a RAID from this imaginary disk, and I don't think it will be pretty.
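As an aside, the same check-and-import can be driven from the OMSA command line instead of clicking through the web console, which matters later when this has to be done remotely. The sketch below is only that, a sketch: the omreport/omconfig invocations are from memory of OpenManage Server Administrator, controller ID 0 is an assumption, and the output parsing is deliberately naive.

```python
#!/usr/bin/env python
# Rough sketch: detect a foreign configuration on a PERC controller via the
# OMSA CLI and import it. Command names and flags are assumed from OpenManage
# Server Administrator; controller=0 is an assumption, adjust for your box.
import subprocess
import sys

CONTROLLER = "0"  # assumed controller ID


def run(cmd):
    """Run an OMSA command and return its text output."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


def has_foreign_config():
    # omreport flags foreign disks in the controller report
    report = run(["omreport", "storage", "controller", "controller=" + CONTROLLER])
    return "foreign" in report.lower()


def import_foreign_config():
    # Equivalent of the "Import Foreign Configuration" action in the console
    out = run(["omconfig", "storage", "controller",
               "action=importforeignconfig", "controller=" + CONTROLLER])
    print(out)


if __name__ == "__main__":
    if has_foreign_config():
        print("Foreign configuration detected on controller %s, importing..." % CONTROLLER)
        import_foreign_config()
    else:
        print("No foreign configuration reported.")
        sys.exit(0)
```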
Going back to classic troubleshooting techniques, it's time to reboot and see if things get better. They don't. First off, the controller really isn't convinced that it can use the disk configuration since it hasn't rescanned all of the RAID volumes, so the boot sequence stops and waits for confirmation to import the foreign configuration again. Once it's up, the system is busy doing a background initialisation. Well, not really, but the UI needs to clarify the difference between a background initialisation and a background validation. And my phantom disk is still visible.
Hello? Dell support? (after waiting 25 minutes to get to talk to someone). Explain situation. Response: that’s the way it’s supposed to work. If the controller sees the disks go offline, it refuses to bring them back online without a manual intervention to import the “foreign” configuration. I still have the ticket open on the phantom disk.
Now perhaps I'm being stupid here, but if it thinks it's a foreign configuration, wouldn't that mean it doesn't match what's in the PERC? Since nothing has changed in the configuration, how could it be different? At the very least it should be able to read the configuration from the disks, compare it to the controller's last known configuration, and decide that it can remount the volumes. Older controllers from Dell used to let me set a switch telling the system how to react, so I could specify whether to always use the disk configuration, always use the card's configuration, or wait for user input. I'd really, really like that option back, as sketched below.
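In pseudo-code terms, the policy I'm asking for is nothing exotic. Here's a sketch of the decision I'd like the firmware to make; the field names (array_guid, config_generation) are invented for illustration and obviously aren't the PERC's actual metadata format:

```python
# Illustration only: the boot-time policy I'd like the controller to apply.
# Field names (array_guid, config_generation) are invented for the example.

POLICY = "prefer_disk"   # or "prefer_nvram", "ask_user" -- the old-style switch


def should_auto_import(disk_meta, nvram_meta):
    """Auto-import the on-disk config when it matches what the controller last knew."""
    same_array = disk_meta["array_guid"] == nvram_meta["array_guid"]
    not_stale = disk_meta["config_generation"] >= nvram_meta["config_generation"]
    return same_array and not_stale


def on_bay_reappears(disk_meta, nvram_meta):
    if should_auto_import(disk_meta, nvram_meta):
        return "bring volumes online"        # nothing actually changed
    if POLICY == "prefer_disk":
        return "import foreign config"
    if POLICY == "prefer_nvram":
        return "keep controller config, mark disks foreign"
    return "wait for user input"             # today's behaviour, always
```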
Now I wouldn't be that upset overall, since I can manually import the foreign configuration without restarting the server (so even if it's 3AM, I can VPN in and access the console), but it then requires a background initialisation before it changes the state of the disks from foreign to online. And that hammers the disks and degrades my IO on the bay. On an MD1000 with fifteen 750 GB SATA drives, I'm good for a few days of validation.
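Since the disks only go from foreign back to fully online once that background pass finishes, I'd rather poll for it than babysit the console. Again, just a sketch: it assumes controller 0 and that omreport prints a "State" line per virtual disk, which may vary by OMSA version.

```python
#!/usr/bin/env python
# Sketch: poll OMSA for virtual disk state so you know when the background
# pass is done and IO on the bay is back to normal. Assumes controller 0 and
# that "omreport storage vdisk" prints a "State : ..." line per virtual disk.
import subprocess
import time

CONTROLLER = "0"      # assumed controller ID
POLL_SECONDS = 600    # check every 10 minutes


def vdisk_states():
    out = subprocess.run(
        ["omreport", "storage", "vdisk", "controller=" + CONTROLLER],
        capture_output=True, text=True, check=False).stdout
    return [line.strip() for line in out.splitlines()
            if line.strip().lower().startswith("state")]


if __name__ == "__main__":
    while True:
        states = vdisk_states()
        print(time.strftime("%Y-%m-%d %H:%M"), states)
        # Stop polling once nothing is initialising or validating any more
        if states and all("ready" in s.lower() for s in states):
            print("All virtual disks back to Ready.")
            break
        time.sleep(POLL_SECONDS)
```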
I'm beginning to think that there are some serious problems with the current generation of Dell SAS controllers, since I have another client that's getting grief with random loss of the RAID configuration on their ESX boot volumes. It's the standard internal RAID 1 SAS setup (PERC 6/i) that you see everywhere, and for no apparent reason some of the machines will lose the configuration and stop accessing the drives while the server is running. This plays royal hell with everything, since it's a malfunction that does not trigger the ESX HA function: the OS is still alive (albeit on life support), but you can't ask the server to do anything.
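Because HA only reacts when the host itself drops off the network, the only way I've found to catch this state is an external check that actually exercises the boot volume and treats a hang as a failure. The sketch below runs from a separate monitoring box over SSH; the host names, test path and timeouts are all placeholders, and it assumes key-based SSH access to the service console.

```python
#!/usr/bin/env python
# Sketch: detect a host whose boot volume has gone away even though the OS
# still answers pings. Runs from a separate monitoring box; hostnames, the
# test path and the timeouts are placeholders, and key-based SSH to the ESX
# service console is assumed.
import subprocess

HOSTS = ["esx01.example.local", "esx02.example.local"]   # placeholder names
TIMEOUT = 30   # seconds; a healthy host answers in well under this


def boot_volume_alive(host):
    """Touch a file on the boot volume; a hang or an error counts as a failure."""
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host,
           "touch /var/tmp/disk_probe && sync"]
    try:
        result = subprocess.run(cmd, timeout=TIMEOUT, capture_output=True)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False   # command hung: storage is probably gone


if __name__ == "__main__":
    for host in HOSTS:
        if not boot_volume_alive(host):
            # Hook your real alerting in here (mail, Nagios passive check, ...)
            print("ALERT: %s is up but its boot volume is not responding" % host)
```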
Anyone else seeing odd behaviour from Dell SAS controllers?
Note: Yes, everything is using redundant power supplies connected to separate electrical feeds in a battery/generator-backed data center, but sh*t happens, so you have to be prepared and know how things are going to react.