Managing Thin Provisioning

This question has come to me via a number of different channels over the last few days. Thin provisioning is a really nice feature that gives you some additional flexibility in managing storage usage. But things have gotten more than a little confusing lately, since you can implement it at different levels, each with its own issues.

The biggest fear is what I refer to as a margin call. You have declared more storage to your servers than you really have, and at some point you exceed your physical capacity and everything grinds to a halt. We’ve already seen similar issues with hypervisor snapshots, where the delta file grows unattended until the LUN fills up and can no longer accept any new writes.

In practical terms, I have a couple of different approaches, mostly depending on the environment you’re working in.

Production

You don’t want to take any chances in production. But this still doesn’t mean that thin provisioning is a bad idea. I strongly recommend that VMware’s thin provisioning not be used in production, since there is a noticeable performance impact. However, there are still good reasons to use it on the storage system:

  • Modern storage systems tend to use some kind of pooling technology to stripe IO across many disks. With fixed provisioning you run a higher risk of hot and cold spots, and you might be limiting your performance.

  • Unexpected demands can arrive, and with fixed provisioning your reaction time may involve waiting on new hardware purchases.

So my general policy on production systems is to use thin provisioning, but never to overprovision. If an unexpected project arrives and needs space, I can allocate it quickly and start the purchasing process to ensure that my overprovisioned state is temporary. The key is to tie the new demand to getting that purchase order approved, so the risk exposure is minimized.
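As a back-of-the-napkin guard for that policy, here is a minimal sketch using pyVmomi, assuming a reachable vCenter; the hostname, credentials and the never-overprovision threshold of 1.0 are placeholders. For each datastore it compares provisioned space (used space plus uncommitted thin allocations) against physical capacity.

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()          # lab only
    si = SmartConnect(host="vcenter.example.org", user="admin",
                      pwd="secret", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.Datastore], True)
        for ds in view.view:
            s = ds.summary
            # provisioned = what VMs could consume if every thin disk filled up
            provisioned = s.capacity - s.freeSpace + (s.uncommitted or 0)
            ratio = provisioned / s.capacity
            if ratio > 1.0:                         # my rule: never overprovision
                print(f"WARNING {ds.name}: {ratio:.0%} provisioned")
        view.Destroy()
    finally:
        Disconnect(si)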

Test, Dev, Integration, Qualification, …

In these environments the lifecycle is very different from production. Many of the choices here depend on how you use these types of environments.

Much of the time the work is exploratory, with unexpected demands for additional machines and storage as problems are identified and new test cases appear. In these environments I tend more towards fixed allocation for a given project, but give the developers and testers the autonomy to deploy into these environments themselves. Thus, the logical choice is to lean more towards thin provisioning at the VM level.
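For reference, here is a minimal sketch of what thin provisioning at the VM level looks like through the API, assuming pyVmomi and a vm object you have already looked up; the controller key and unit number are assumptions for a simple single-disk VM.

    from pyVmomi import vim

    def add_thin_disk(vm, size_gb):
        backing = vim.vm.device.VirtualDisk.FlatVer2BackingInfo(
            diskMode="persistent",
            thinProvisioned=True)                  # the thin part
        disk = vim.vm.device.VirtualDisk(
            backing=backing,
            capacityInKB=size_gb * 1024 * 1024,
            controllerKey=1000,                    # first SCSI controller (assumed)
            unitNumber=1)                          # next free slot (assumed)
        change = vim.vm.device.VirtualDeviceSpec(
            operation=vim.vm.device.VirtualDeviceSpec.Operation.add,
            fileOperation=vim.vm.device.VirtualDeviceSpec.FileOperation.create,
            device=disk)
        return vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[change]))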

However, to maintain maximum flexibility it can be useful to continue to use thin provisioning on the storage system as well. But in this case we have a different issue: how do you efficiently reclaim disk in an environment where machines are created and deleted frequently? The problem is that a deletion only updates the allocation table; the blocks that held the deleted VM have been written to and are thus still allocated on the storage system.

Reclaiming thin provisioned storage today remains a PITA. Basically, we need to send some kind of command to clear the contents of the freed blocks (zeroing them out) and then instruct the storage system to reclaim them, which generally involves a pretty brute-force approach of reading everything to see what can be reclaimed.
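The zeroing half is usually done from inside the guest, with tools like sdelete on Windows or dd on Linux. A minimal sketch of the same idea in Python, assuming a Linux guest and a hypothetical mount point; the array-side reclaim still has to be triggered afterwards.

    import os

    def zero_free_space(mount_point, chunk_mb=64):
        """Fill the filesystem's free space with zeros, then delete the file.

        Leave headroom in real life: filling a disk to 100% can upset
        running services.
        """
        path = os.path.join(mount_point, "zerofill.tmp")
        chunk = b"\0" * (chunk_mb * 1024 * 1024)
        try:
            with open(path, "wb") as f:
                while True:
                    f.write(chunk)
                    f.flush()
                    os.fsync(f.fileno())   # force the zeros down to storage
        except OSError:
            pass                           # disk full: free blocks are now zeroed
        finally:
            os.remove(path)

    zero_free_space("/mnt/data")           # hypothetical guest mount point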

To get around this issue I have adopted a rolling approach where long-lived test and development environments are renewed quarterly (or monthly, depending on volatility). This involves scripting the following actions:

  • Create a new LUN
  • Format the LUN as a datastore
  • svMotion the VMs from the source datastore to the new datastore
  • Unmap the old LUN
  • Delete the old LUN
  • Possibly rename the new LUN (or use a minimal date stamp in the name)

This results in a freshly thin provisioned datastore with only the current VMs’ storage allocated. Any thin provisioned blocks on the original source LUN are freed up by deleting the LUN itself.
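The VMware half of this can be scripted with pyVmomi; a minimal sketch follows, where the vCenter details and datastore names are placeholders and the LUN create/unmap/delete steps stay on the array side (vendor CLI or API).

    import ssl
    import time
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    def find_datastore(content, name):
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.Datastore], True)
        try:
            return next(ds for ds in view.view if ds.name == name)
        finally:
            view.Destroy()

    def wait_for(task):
        while task.info.state not in (vim.TaskInfo.State.success,
                                      vim.TaskInfo.State.error):
            time.sleep(2)
        if task.info.state == vim.TaskInfo.State.error:
            raise task.info.error

    ctx = ssl._create_unverified_context()            # lab only
    si = SmartConnect(host="vcenter.example.org", user="admin",
                      pwd="secret", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        src = find_datastore(content, "dev-2024q1")   # old, bloated datastore
        dst = find_datastore(content, "dev-2024q2")   # datastore on the new LUN
        for vm in list(src.vm):                       # svMotion each VM across
            wait_for(vm.RelocateVM_Task(vim.vm.RelocateSpec(datastore=dst)))
        # src is now empty: unmount it, then unmap and delete the old LUN
        # on the array
    finally:
        Disconnect(si)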

Of course, you could always just use NFS backed by ZFS and let the system do its thing.

Other issues can come into play depending on your internal operating procedures, such as whether you do internal billing for allocated storage. In that case, the question of how to bill for thin provisioned storage is an ongoing debate.
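To make the debate concrete, here is a toy comparison with made-up numbers: billing the same datastore on allocated versus consumed capacity produces very different invoices, and neither answer is obviously right.

    RATE_PER_GB_MONTH = 0.05     # hypothetical internal rate

    provisioned_gb = 2000        # what the project asked for (and sees)
    consumed_gb = 750            # what the array actually stores

    print(f"billed on allocation:  {provisioned_gb * RATE_PER_GB_MONTH:7.2f}")
    print(f"billed on consumption: {consumed_gb * RATE_PER_GB_MONTH:7.2f}")

Billing on allocation discourages oversized requests; billing on consumption tracks actual cost but makes the invoice fluctuate with usage.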