This is a subject that I’ve been spending a lot of time with lately. I’ve been reaching out through the intertubes a lot, trying to get a decent feel for where things are going and what the viable solutions going forward will be.
Fundamentally, I’m being confronted with a couple of issues, mostly around consolidation via virtualisation and the conflicting limitations of disk capacity vs IO latency.
In the beginning
For a quick tour of the issues regarding virtualisation, I’ll start with the general structure of storage consolidation. Originally, we had SANs in order to consolidate storage. Basically you put all the disks into a big basket (I’m generalising and oversimplifying), and then put them together in various little groups that were assigned to the servers that needed them. Given the cost and size limitations of the disks at the time, this often meant that you put the disks together in a RAID group and then presented the resulting surface to a server. You started with a one-to-one relationship and everything was tuned to the IO needs of the server. So a standard database server would often be assigned one group of disks for the database itself, best practice usually being RAID 5 so that you maximised the surface available for the investment, and the IO profile of the database was reasonably well suited to this type of storage. Your logs would get a separate RAID group, hopefully RAID 10, in order to maximise its ability to handle incoming sequential writes.
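As a rough illustration of that surface trade-off, here’s a back-of-the-envelope sketch (the drive size and spindle counts are purely illustrative, not taken from any particular array):

```python
# Rough usable-surface comparison for the same spindle budget.
# Drive size and counts are illustrative only.

def raid5_usable_tb(spindles, disk_tb):
    # RAID 5 loses one disk's worth of capacity to parity.
    return (spindles - 1) * disk_tb

def raid10_usable_tb(spindles, disk_tb):
    # RAID 10 mirrors everything, so half the raw capacity is usable.
    return (spindles / 2) * disk_tb

spindles, disk_tb = 6, 0.3   # six 300 GB drives, say
print(f"RAID 5 : {raid5_usable_tb(spindles, disk_tb):.1f} TB usable")   # 1.5 TB
print(f"RAID 10: {raid10_usable_tb(spindles, disk_tb):.1f} TB usable")  # 0.9 TB
```

Same spindles, noticeably more surface out of the RAID 5, which is why it tended to be the default for the data files while the mirrored group was reserved for the write-heavy logs.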
Virtualisation happened
In the meantime, a few things happened. Virtualisation and VMware’s VMFS filesystem happened, for one. It’s a very nice arrangement, optimised for holding very large files that represent disks for virtual machines, clusterable between many servers, with file level locking and such. However, this changed the way that a lot of people looked at storage. VMFS gave us the ability to put the disk workload of many machines on the same RAID group or LUN. In practice it works very well most of the time, since the reality of most virtual machines is that they’re not really doing a whole lot most of the time, and sharing out the IOs available on the SAN is OK. But when you run into machines that are more demanding, or you have too many on the same volume, then you start running into performance issues that are very difficult to identify.
By putting multiple machines on the same LUN, you have basically ensured that your IO pattern has become random, since each machine has its own vmdk file and its own IO pattern that is completely independent of the other machines on the same LUN. The bottom line is that you are very likely in the worst case scenario for determining maximum IO load. The generally accepted wisdom (from what I’ve been able to glean) is that you can expect an average of 180 IOps for a 15K disk, 120 for a 10K and 80-100 for a 7200 RPM disk when dealing with a random IO workload. Now all of these disks are obviously capable of a whole lot more when dealing with sequential workloads; heck, even my MacBook’s 2.5" 5400 RPM drive has been noted with peaks of over 400 IOps for some kinds of operations.
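If you want to turn those per-spindle figures into a rough ceiling for a shared LUN, a simple (and admittedly naive) sum is a reasonable first pass. This little sketch ignores controller cache and the RAID write penalty, so treat the result as an optimistic upper bound:

```python
# Naive random-IOps ceiling for a RAID group, using the per-spindle
# figures quoted above. Ignores controller cache and the RAID write
# penalty, so this is an optimistic upper bound.

RANDOM_IOPS = {"15k": 180, "10k": 120, "7200": 90}

def lun_random_iops(spindles, disk_type):
    return spindles * RANDOM_IOPS[disk_type]

# e.g. a 4+1 group of 15K drives shared by ten virtual machines
ceiling = lun_random_iops(5, "15k")   # 900 IOps for the whole group
per_vm = ceiling / 10                 # ~90 IOps each if the load spreads evenly
print(ceiling, per_vm)
```

Ninety-odd IOps per machine is fine for an idle web server, but it doesn’t take many busier guests on that LUN before somebody notices.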
The old solution
The first way to ease this issue is to go back to old fashioned LUN management and put a virtual machine’s data on its own LUN, and you often see this coupled with people using Raw Device Mapping, which is basically a pass-through to the virtual machine. This will give you good performance, and often leads people to believe that the VMFS overhead was the issue, when in fact it’s simply the dedicated disk allocation that’s making the difference. This is not an ideal long term solution, for more than a few reasons.
First off, you are adding management overhead to your storage administration. If every time someone needs a virtual machine they need to request new LUNs, your storage administrator is going to be doing an awful lot of work, especially since the demand for virtual machines usually outstrips the number of physical deployments.
Big disks
Secondly, we start butting up against the issue of disk capacity growth, which has outstripped IOps growth. Inside the SAN, I’m seeing more and more installations with larger and larger disks, often in such a manner that a RAID group’s physical capacity exceeds the needs of a single machine. The result? You create a large RAID group and then carve many LUNs out of this space, thus bringing us back to the same problem we saw with VMFS. You have basically imposed a random IO workload on your RAID group, since it’s supporting the IO traffic from multiple sources, each with their own workload.
This gets even worse when you start dealing with very large capacity SATA drives. A RAID 5 set of 4+1 1TB drives gives you 4TB of useful surface, but since the majority of the OSs currently deployed can’t deal with physical volumes larger than 2TB, you have to cut it up into two volumes in order to use the space. Going back to a 2+1 RAID is not really very interesting since you start losing a lot more disk to parity data vs useful surface, plus you have all of the inconveniences of RAID 5 in terms of write performance with much less of the potential performance gain from striping the IO across multiple physical spindles.
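Here’s the same arithmetic as a quick worked example (assuming the 2TB volume limit mentioned above; the helper function is just for illustration):

```python
import math

# Worked example of the RAID 5 capacity arithmetic above (illustrative only).

def raid5(data_disks, disk_tb=1.0):
    usable_tb = data_disks * disk_tb
    parity_fraction = 1 / (data_disks + 1)   # share of raw capacity lost to parity
    return usable_tb, parity_fraction

usable, parity = raid5(4)                # a 4+1 set of 1 TB drives
volumes = math.ceil(usable / 2)          # OS limit of 2 TB per physical volume
print(f"{usable:.0f} TB usable, split into {volumes} volumes")            # 4 TB, 2 volumes

_, parity_small = raid5(2)               # a 2+1 set instead
print(f"parity overhead: {parity:.0%} (4+1) vs {parity_small:.0%} (2+1)") # 20% vs 33%
```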
This is often why, when you deal with enterprise iron folks, they do all of their sizing calculations based on the random IO profile: they’re fairly sure that you don’t have the resources to hand-optimise every single workload.
Up next - a few modern solutions