In the world of information technology, nothing is static or lasts forever, least of all best practices. I’ve been pointing out to clients for a while now that backups need to be rethought in terms of the “jobs to be done” philosophy, and no longer treated as “the thing that happens overnight when files are copied to tapes”.
Historically, backups served two purposes:
- Being able to go back in time and retrieve data that is no longer available
- Serving as the basis for disaster recovery
Fundamentally, backups should really only serve the first purpose. We have better tools and mechanisms for handling disaster recovery and business continuity. Which brings me to snapshots. I have always told people that snapshots are not backups, even though they meet the first criterion of being able to go back in time.
The hiccup is that snapshots that depend on your primary storage system should be considered fragile, in the sense that if your primary storage goes away (disaster), you no longer have access to the data or the snapshots. However, just about every storage system worth its salt today includes the ability to replicate data to another system based on, or including, the snapshots themselves. This is a core feature of ZFS and one I rely on regularly. Many of the modern scale-out systems also include this type of functionality, some of them, like the SimpliVity implementation, even more advanced than ZFS.
When are snapshots backups?
They become backups once you have replicated them to another independent storage system. This satisfies the two basic criteria: being able to go back in time, and being on a separate physical system so that the loss of the primary does not preclude access to the data. They become part of your disaster recovery plan when the second system is physically distant from the primary.
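As a concrete illustration, here is a minimal sketch of snapshot-based replication using ZFS send/receive driven from Python. The dataset, snapshot and host names are placeholders, and it assumes passwordless SSH to the backup host plus an existing common baseline snapshot on both sides.

```python
#!/usr/bin/env python3
"""Sketch: take a ZFS snapshot and replicate it to an independent system."""
import subprocess
from datetime import datetime

DATASET = "tank/files"               # hypothetical primary dataset
REMOTE = "backup-host"               # hypothetical second, independent system
REMOTE_DATASET = "backuppool/files"  # hypothetical target dataset


def snapshot(name: str) -> str:
    """Create a snapshot of the primary dataset and return its full name."""
    snap = f"{DATASET}@{name}"
    subprocess.run(["zfs", "snapshot", snap], check=True)
    return snap


def replicate(prev_snap: str, new_snap: str) -> None:
    """Send only the blocks changed between two snapshots to the backup host."""
    send = subprocess.Popen(
        ["zfs", "send", "-i", prev_snap, new_snap],
        stdout=subprocess.PIPE,
    )
    # Pipe the incremental stream over SSH into `zfs receive` on the backup system.
    subprocess.run(
        ["ssh", REMOTE, "zfs", "receive", "-F", REMOTE_DATASET],
        stdin=send.stdout,
        check=True,
    )
    send.stdout.close()
    if send.wait() != 0:
        raise RuntimeError("zfs send failed")


if __name__ == "__main__":
    previous = f"{DATASET}@hourly-previous"  # assumed to exist on both sides
    current = snapshot(datetime.now().strftime("hourly-%Y%m%d%H"))
    replicate(previous, current)
```

The incremental send is what keeps ongoing replication cheap: only the blocks that changed between the two snapshots ever cross the wire.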
Disk to Disk to Tape
We’ve already seen the traditional backup tools adopt this model in response to the performance issues of coping with the ever-growing volume of file data: data trickles over to a centralized disk store that is directly connected to tape drives, which can then be fed at full speed. Exploiting snapshot-based replication permits the same structure, but assigns responsibility for the disk-to-disk portion to the storage system rather than to the backup software.
The question I ask in most cases here is whether the volume of data involved justifies the inclusion of tape as a backup medium. According to the LTO Consortium, LTO6 storage is as low as 1.3 cents per GB, but this only takes into account the media cost. The most bare-bones of LTO drives runs around $2,200, which bumps up the overall cost per GB rather dramatically.
Assuming a configuration where we store 72TB of data on tape (12 tapes), at the $80 cost per tape cited by the LTO Consortium plus the cost of the drive, this works out to about 4.3 cents/GB. At current street prices, the 6TB WD Red drives run about $270, which converts to 4.5 cents per GB, not taking into account the additional flexibility of disks, which permit compression and deduplication. Note that the 6TB per tape cited for the LTO numbers already includes compression, whereas the 6TB disk capacity is raw, before compression and deduplication.
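For anyone who wants to rerun the arithmetic with their own prices, here is the back-of-the-envelope calculation behind those figures; the numbers are simply the ones cited above.

```python
# Back-of-the-envelope cost-per-GB comparison using the figures cited above.
GB_PER_TB = 1000

# Tape: 12 x LTO6 cartridges (~6 TB each with compression) plus one basic drive.
tape_count = 12
tape_capacity_gb = 6 * GB_PER_TB
tape_price = 80.0        # per cartridge, LTO Consortium figure
drive_price = 2200.0     # bare-bones LTO drive

tape_total_gb = tape_count * tape_capacity_gb  # 72,000 GB (72 TB)
tape_cost_per_gb = (tape_count * tape_price + drive_price) / tape_total_gb

# Disk: 6 TB WD Red at the street price cited, raw capacity before
# compression or deduplication.
disk_price = 270.0
disk_capacity_gb = 6 * GB_PER_TB
disk_cost_per_gb = disk_price / disk_capacity_gb

print(f"Tape: {tape_cost_per_gb * 100:.1f} cents/GB")  # ~4.4 cents/GB with these round figures
print(f"Disk: {disk_cost_per_gb * 100:.1f} cents/GB")  # ~4.5 cents/GB
```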
Tape does have some inherent advantages in certain use cases, particularly long-term offline storage, and it does cost less to operate in terms of power consumption, but for many small to medium sized environments, the constraints of using it as a primary backup medium (especially when it is also the primary restore medium) are far outweighed by the flexibility, performance and convenience of a disk-based system for daily operations.
Operational convenience of disk over tape
Tape is a great medium for dumping a full copy of a dataset, but compared with the flexibility of a modern disk-based system it falls far behind. A good example that I use is the ability to prune snapshots from a data set to reorganize the space utilisation. In many systems, I use hourly snapshots in order to give users a decent amount of granularity to handle errors and issues during the day. This also means that the unit of replication on a given filesystem is relatively small, permitting me to recover from intersite communication failures without having to resend huge data sets whose transfer was interrupted. Then, on the primary system, I prune out the hourly snapshots after a week, leaving one daily instance that is retained for 2 weeks. A similar process is applied to weekly and monthly snapshots.
Where this gets interesting is that I do not have to apply the same policy on the primary and backup storage systems. My backup storage system is designed for capacity and will retain a month of daily snapshots, 8 weekly snapshots and 12 monthly snapshots. Pruning data from a set like this is effectively impossible with tape technology, so tape is used for an archival copy that needs to be retained beyond the yearly cycle.
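To make that retention policy concrete, here is a minimal sketch of the pruning decision in Python. The snapshot naming scheme and the weekly/monthly windows are illustrative assumptions, not the output of any particular tool.

```python
from datetime import datetime, timedelta

# Retention windows from the text: hourlies kept for a week, dailies for two
# weeks. The weekly and monthly values are illustrative assumptions, as is
# the snapshot naming scheme.
PRIMARY_POLICY = {
    "hourly": timedelta(days=7),
    "daily": timedelta(days=14),
    "weekly": timedelta(weeks=8),    # assumption
    "monthly": timedelta(days=365),  # assumption
}


def snapshots_to_destroy(snapshots, policy=PRIMARY_POLICY, now=None):
    """Return the snapshots that have aged out of their retention window.

    `snapshots` is an iterable of names like 'tank/files@hourly-2024051509',
    where the label prefix selects the policy bucket and the suffix encodes
    the time the snapshot was taken.
    """
    now = now or datetime.now()
    doomed = []
    for snap in snapshots:
        label = snap.split("@", 1)[1]      # e.g. 'hourly-2024051509'
        bucket, stamp = label.split("-", 1)
        taken = datetime.strptime(stamp, "%Y%m%d%H")
        if now - taken > policy.get(bucket, timedelta.max):
            doomed.append(snap)            # candidate for `zfs destroy`
    return doomed
```

Running the same function with a more generous set of windows on the backup system expresses the capacity-oriented policy there, which is exactly the point: the two sides do not have to agree.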
Files vs virtual machines
The approach noted above works equally well for file servers and for storage systems hosting virtual machines, especially if we are using a file-based protocol for hosting the VMs rather than a pure block protocol like FC or iSCSI. In the world of virtual machines, backup tools are considerably more intelligent about the initial analysis of the data to be backed up. Traditional file server backup is based on a two-phase process of scanning the contents of the source, matching this against an index of data known to be backed up, and then copying the missing bits. This presents a number of practical issues:
- the time to scan continues to grow with the number of files
- copying many individual files is a slower process with more overhead than block-based differentials
By applying the snapshot and replication technique, we can drastically reduce the backup window, since only the blocks modified between two moments in time need to be copied. In fact, there is no longer a backup window at all, since these operations run continuously in the background on the file server.
Virtual machines in the VMware world maintain tracking journals of modified blocks (Changed Block Tracking, or CBT), which enables the backup software to ignore the filesystem representation of the data and simply ask for the blocks modified since the last backup transaction. But again, if we are transmitting snapshots from the underlying storage system, we don’t even need to do this. It is, however, useful to issue VSS snapshots inside Windows virtual machines to ensure that any in-flight data in caches is flushed to disk before creating the storage-layer snapshot.
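Here is a minimal sketch of that quiesce-then-snapshot sequence, assuming a vSphere environment reachable through the pyVmomi library and a ZFS dataset backing the datastore. The vCenter address, credentials, VM and dataset names are all placeholders, and error handling is kept to a minimum.

```python
import ssl
import subprocess

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

# Placeholder connection details and names; adjust for your environment.
VCENTER, USER, PASSWORD = "vcenter.example.org", "backup", "secret"
VM_NAME = "fileserver01"
DATASET = "tank/vmstore"  # hypothetical ZFS dataset backing the NFS datastore

ctx = ssl._create_unverified_context()  # lab use only (self-signed certs)
si = SmartConnect(host=VCENTER, user=USER, pwd=PASSWORD, sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == VM_NAME)

    # Quiesced snapshot: VMware Tools triggers VSS inside the Windows guest,
    # flushing in-flight data to disk before we snapshot the storage layer.
    task = vm.CreateSnapshot_Task(
        name="pre-zfs", description="quiesce before storage snapshot",
        memory=False, quiesce=True)
    WaitForTask(task)

    # Now the storage-layer snapshot captures a consistent image.
    subprocess.run(["zfs", "snapshot", f"{DATASET}@hourly"], check=True)

    # The hypervisor snapshot is only needed momentarily; remove it again.
    WaitForTask(vm.snapshot.currentSnapshot.RemoveSnapshot_Task(
        removeChildren=False))
finally:
    Disconnect(si)
```

The quiesced hypervisor snapshot exists only long enough for the storage-layer snapshot to be taken, so it never accumulates delta files.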
The biggest issue with backing up virtual machines is the granularity of the restore operation. With only a simple replication, the result is a virtual machine with no visibility into the contents of its internal file systems. This is where the backup tools show their value, in being able to back up a virtual machine at the block level and yet still permit file-level restores by peeking inside the envelope to look at the contents of the file systems therein.
The last mile
There are still issues with certain types of restore operations that require a high level of integration with the applications. If you want to restore a single email out of a backed up Exchange or Notes datastore, you need a more sophisticated level of integration than simply having a copy of the virtual machine.
But for the majority of general-purpose systems, and particularly file services, the replicated-snapshot approach is simpler and more effective, from both a cost and an operational perspective.