Data Protection in 2020

Zerto presented at Tech Field Day 21 and helped bring into focus a number of thoughts I’ve been having about the different approaches to data protection, what they actually entail in our current context of computing, and how Zerto’s approach fits into this changing world.

Disaster Recovery vs Backup vs Archive

Each of these activities falls under the umbrella term of Data Protection, but they have very different functions, requirements and even terminology.

The first one that I want to address is the notion of Archiving. Historically, this was the process of taking older (presumably rarely consulted) paper records and moving them to another physical location in order to free up space in the office filing cabinets, which was by definition limited. In some cases where the physical volume was significant, like newspapers, we would apply some kind of technical process to reduce the physical space required, like photographing the documents and converting them to microfiche (basically an analog version of compression) and destroying the originals once converted.

This practice continued in the digital era, where we were once again confronted with limited amounts of primary storage that we needed to free up for current data, while cold data would be removed from the environment and pushed out to other media with a lower cost/GB: traditionally some kind of tape that could be stored for long periods of time, now moving to big disks and object storage. But the key facet of archiving in both cases is that we removed data from the primary systems and put it on a (relatively) inaccessible island that required significant effort to retrieve from.

This approach is becoming less and less applicable for a number of reasons:

  • Primary storage is no longer limited in the same ways it used to be; witness solutions like Vast Data

  • With the advent of machine learning systems, historical data is just as valuable as current data and needs to be accessible at reasonable speeds

  • Data is often dependent on applications to render it usable, and archiving tools that can interact natively with all of your applications are even harder to find and tend to be fiddly

Turning to Disaster Recovery and Backup systems, the key difference is that here we’re talking about making copies of data, while archiving is all about moving a unique instance from one environment to another.

At one end of the spectrum, we have the Disaster Recovery tools that are optimised for minimising RTO and RPO but don’t require significant historical data, since the objective is to get the current or very recent state back into operation as quickly as possible. Unlike archiving, the objective is to ensure that we have an autonomous copy with zero dependencies on the original system. This means that the destination for this data has to be storage that can be considered primary storage from a performance perspective, or at least close to it.
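As a quick refresher on those two metrics, RPO is how much recent data you can afford to lose and RTO is how long you can afford to be down. Here is a minimal illustration of how they are measured, using made-up timestamps rather than anything from a real incident:

    from datetime import datetime

    # Hypothetical incident timeline (illustrative values only)
    last_replicated_write = datetime(2020, 9, 1, 10, 59, 55)  # newest write already present at the DR site
    failure_time = datetime(2020, 9, 1, 11, 0, 0)             # primary site goes down
    service_restored = datetime(2020, 9, 1, 11, 4, 30)        # workloads running again at the DR site

    rpo_achieved = failure_time - last_replicated_write   # data lost: roughly 5 seconds of writes
    rto_achieved = service_restored - failure_time        # downtime: roughly 4.5 minutes

    print(f"RPO achieved: {rpo_achieved}, RTO achieved: {rto_achieved}")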

Zerto is one of the leaders in the space of Disaster Recovery solutions with its continuous data protection model, which is based on journaling all write activity in close to real time without imposing the constraints and additional latency of synchronous array replication. Moving the data movement from the storage array level up to the hypervisor gives significantly more insight into the context of the I/O, as well as more flexible options for redirecting the output that don’t require homogenous storage systems. One specific benefit is that I now have per-VM journals, whereas a storage array in most cases can only see undifferentiated flows to a volume. Once we know which writes belong to which VM, we can start studying and understanding their contents and do interesting things like identifying what a database commit I/O operation or a filesystem snapshot looks like, in order to annotate the journal with useful pointers to the moments when the VM’s application and filesystem are in a coherent state. So we no longer need all of the complex interactions of taking VM-level snapshots and coordinating them with the OS and applications inside the VMs.
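To make the journaling idea a bit more concrete, here is a minimal sketch of a per-VM write journal annotated with coherence checkpoints. The names and structure are my own invention for illustration, not a reflection of how Zerto actually implements this:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class WriteRecord:
        timestamp: float      # when the write was intercepted at the hypervisor
        vm_id: str            # which VM issued the write
        offset: int           # location on the virtual disk
        data: bytes           # the written payload

    @dataclass
    class Journal:
        """Per-VM journal of intercepted writes, annotated with coherence checkpoints."""
        records: List[WriteRecord] = field(default_factory=list)
        checkpoints: List[int] = field(default_factory=list)   # indexes into `records`

        def append(self, record: WriteRecord, looks_coherent: bool) -> None:
            self.records.append(record)
            # If the write pattern looks like a database commit or a filesystem
            # snapshot, remember this point as a safe place to recover to.
            if looks_coherent:
                self.checkpoints.append(len(self.records) - 1)

        def latest_coherent_point(self) -> int:
            """Index of the most recent application/filesystem-consistent state, or -1 if none."""
            return self.checkpoints[-1] if self.checkpoints else -1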

Despite having disaster recovery tools that are actively making copies of your data, there is still a requirement for a backup solution alongside them, since its role is slightly different. Backups are there to permit you to go back in time beyond the immediate past and to act as the last-ditch stop-gap for disaster recovery. Since the requirement is to start keeping older data that won’t be actively solicited in the way that data copied for disaster recovery purposes is, we now have a market segment of storage for “secondary data”. This kind of storage has a performance profile that is generally less demanding than primary storage and is usually optimised for cost/TB, since you’re going to have a lot of it.

Back to the archive vs backup question. Products that actively archive tend to be quite complicated, so people are moving more and more towards simply using their backups with very long retention periods to fill the archive niche, without the complexity of actually removing anything from primary storage outside of regular clean-up and data purging. The advent of virtualisation has simplified a lot of this, since a virtual machine is a self-contained entity that includes the OS and the applications required to interpret the contents of the system in a useful way.

The drawback to having both a backup system and a disaster recovery system is that they will both be hitting your primary storage at some point in order to make the necessary copies. Not to mention dual management, operations and licence costs.

In recent versions of Zerto, they’ve expanded the scope and methods of their data retention system to enable one product to fill these different niches without any additional overhead on the primary storage, eliminating much of the complexity along the way. Since they already have a copy of all of the current data on the DR site/system, instead of throwing data away once it ages out of the DR retention window, it is now copied off of the DR primary storage to secondary storage and is pruned as time goes on to match the data retention policy. This process scales to long-term retention, so you get DR, Backup and Archive copies of your data, all from the original write operations that are siphoned off of production.
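As a rough sketch of what pruning restore points to match a retention policy can look like, here is a generic grandfather-father-son style schedule. The tiers and durations are hypothetical and the logic is mine, not Zerto’s actual policy engine:

    from datetime import datetime, timedelta

    # Hypothetical tiered retention policy: everything inside the DR journal window,
    # then dailies, weeklies and monthlies kept on secondary storage.
    DR_WINDOW = timedelta(days=1)           # short-term journal on DR primary storage
    KEEP_DAILY = timedelta(days=30)         # daily restore points for a month
    KEEP_WEEKLY = timedelta(days=365)       # weekly restore points for a year
    KEEP_MONTHLY = timedelta(days=365 * 7)  # monthly restore points for seven years

    def keep(point_time: datetime, now: datetime) -> bool:
        """Decide whether a restore point survives this pruning pass."""
        age = now - point_time
        if age <= DR_WINDOW:
            return True                       # still inside the DR journal window
        if age <= KEEP_DAILY:
            return True                       # assume points have already been thinned to one per day
        if age <= KEEP_WEEKLY:
            return point_time.weekday() == 6  # keep only the Sunday point each week
        if age <= KEEP_MONTHLY:
            return point_time.day == 1        # keep only the first-of-month point
        return False                          # older than the archive horizon: prune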