Vast Data came out of stealth mode at Storage Field Day 18 with a number of surprises, including an innovative architecture that takes advantage of the newest hardware advances. My first thought was to write up an explanation of the architecture myself, but working from the same public information I have access to, Glenn Lockwood has already written this excellent description, which covers everything I was going to (and then some).
If you’ve spent any time in and around the storage industry you probably have at least a passing familiarity with the taxonomy defined by Chad Sakac during his time at EMC. He posits (and until now I agreed completely) that all storage systems basically fall into four architectural designs:
- Type 1: Clustered scale-up
- Type 2: Tightly coupled scale-out
- Type 3: Loosely coupled scale-out
- Type 4: Distributed shared nothing
Side note: If you’ve heard of this taxonomy but haven’t read the article, I highly recommend spending some time on it. It’s dense but highly informative.
Vast has leveraged a number of technological advances in high-speed persistent storage (3D XPoint) and high-speed networking (RoCE, NVMe-oF) to define a new architecture that can be summarized as “scale-out shared everything”, or Disaggregated Shared Everything (DASE) in Vast Data’s naming, which appears to be a melding of the best features of Type 2 and Type 4 (Type 5 maybe?). Underlying a Type 2 design is the idea that you need a custom internal fabric, with a fixed network structure, to guarantee access to shared memory between all of the controllers at a fixed, predictable latency. The downside of this design is that it generally sets an upper limit on the number of interconnects available, and thus on the number of controllers you can add to the fabric.
That design is predicated on the historical requirement for cache memory on all of the controllers in order to get acceptable performance out of persistent storage that is significantly slower than the memory used for caching, plus the general limitations of the available network technologies. While disk is the obvious culprit, even SAS- and SATA-connected flash storage has protocol-level bottlenecks and CPU overhead that limit the performance of the underlying flash media, mitigated by the use of DRAM caching.
Vast has gone to the logical next step of eliminating the requirement for shared memory, and the associated work of keeping cache state consistent across controllers, by getting rid of this layer entirely and replacing it with direct access to a very fast persistent storage layer. As a result, every Vast IO Server (think NFS or S3 gateway) “sees” every persistent device directly, so the data it reads is always the actual data rather than a cached copy. Since the interface protocol here is NVMe over RoCE or InfiniBand, the performance impact is practically invisible. After a few decades of always thinking in terms of traditional architectures, this pretty much broke my brain for a while.
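To make the contrast concrete, here’s a minimal toy sketch in Python of the difference between a shared-nothing layout, where each node only owns a slice of the devices, and the DASE model, where every stateless IO server can address every NVMe-oF device directly. Everything here (device names, the read path, the error handling) is my own illustration, not Vast’s implementation.

```python
# Toy model contrasting shared-nothing vs. disaggregated shared-everything.
# Purely illustrative: device names and the read path are assumptions, not Vast's code.

DEVICES = {f"nvme{i}": f"data-on-nvme{i}" for i in range(8)}  # the NVMe-oF enclosure

class SharedNothingController:
    """Type 4 style: each controller only sees the devices it owns."""
    def __init__(self, owned):
        self.owned = {d: DEVICES[d] for d in owned}

    def read(self, device):
        if device not in self.owned:
            raise LookupError(f"{device} lives on another node; the request must be forwarded")
        return self.owned[device]

class DaseIOServer:
    """DASE style: every (stateless) IO server sees every device over the fabric."""
    def read(self, device):
        return DEVICES[device]  # direct access, no cache coherence to manage

ctrl_a = SharedNothingController(owned=["nvme0", "nvme1"])
io_server = DaseIOServer()

print(io_server.read("nvme5"))   # any IO server can reach any device
try:
    ctrl_a.read("nvme5")         # a shared-nothing node cannot
except LookupError as err:
    print(err)
```

The real system obviously does far more than this, but the architectural point is that the DASE read path needs no ownership map and no cache coherence protocol between servers.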
The component that permits this elimination of the cache layer is the 3D XPoint (Optane) write buffer, where writes are accumulated and analysed in order to do a few things (see the sketch after this list):
- Build up a wide stripe of highly similar data that can be pushed down to cheap flash
- Optimize the data structures written to get the best lifespan out of the cheap flash
- Eliminate the risk of holding data on a volatile medium, permitting the construction of very wide stripes
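Here’s a minimal sketch of that staging idea, assuming a simplified model where the persistent buffer just accumulates acknowledged writes and destages them to cheap flash as full, wide stripes; the stripe geometry and block size are placeholders of mine, not Vast’s actual layout.

```python
# Simplified staging model: writes land in a persistent (3D XPoint style) buffer
# and are only destaged to cheap flash as full, wide stripes.
# Stripe geometry and grouping policy are illustrative assumptions.

STRIPE_WIDTH = 16   # blocks per stripe pushed down to QLC flash (assumed)
BLOCK_SIZE = 4096   # bytes (assumed)

class WriteStagingBuffer:
    def __init__(self):
        self.pending = []          # blocks sitting safely in the persistent buffer
        self.flushed_stripes = []  # wide stripes written to the flash layer

    def write(self, block: bytes):
        # Acknowledge immediately: the buffer is persistent, so there is no
        # volatile-cache risk while the stripe is being built.
        self.pending.append(block)
        if len(self.pending) >= STRIPE_WIDTH:
            self._flush_stripe()

    def _flush_stripe(self):
        # Destage one full-width stripe; writing only whole stripes avoids
        # read-modify-write cycles and is kinder to the flash's endurance.
        stripe, self.pending = self.pending[:STRIPE_WIDTH], self.pending[STRIPE_WIDTH:]
        self.flushed_stripes.append(stripe)

buf = WriteStagingBuffer()
for i in range(40):
    buf.write(bytes([i % 256]) * BLOCK_SIZE)
print(len(buf.flushed_stripes), "stripes flushed,", len(buf.pending), "blocks still staged")
```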
Remember that the actual storage devices are not controllers in the traditional sense, and that all of this data management is the (CPU-bound) role of the Vast IO Servers. So when more IO is required, it’s just a matter of spinning up additional IO Servers or allocating more CPU to the existing ones. Vast Data has jumped on the container bandwagon here, which makes this a trivial task.
There’s still the issue of write amplification that comes with any clustered design (I need to write to multiple failure domains to properly protect the data), but between the raw speed of the components and the parallelism offered by NVMe, this seems to be mostly a non-issue.
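To put rough numbers on why very wide stripes matter here, a quick back-of-the-envelope comparison of protection overhead for a few N data + K parity geometries (the geometries are my own examples, not Vast’s published layout):

```python
# Back-of-the-envelope protection overhead for N data + K parity stripes.
# The geometries below are illustrative examples, not Vast's published layout.

def overhead(n_data: int, n_parity: int) -> float:
    """Fraction of raw capacity consumed by parity."""
    return n_parity / (n_data + n_parity)

for n_data, n_parity in [(4, 2), (10, 2), (100, 4)]:
    print(f"{n_data}+{n_parity}: {overhead(n_data, n_parity):.1%} of raw capacity spent on protection")

# 4+2  : 33.3% overhead (a typical narrow stripe)
# 10+2 : 16.7%
# 100+4:  3.8% -- only practical if the stripe can be assembled safely,
#                 e.g. in a persistent buffer, before being written out
```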
But back to the shared everything bit of the puzzle: this allows them to do some interesting math (and I’m hoping to see a technical deep dive on this at a future Storage Field Day) permitting flexible deduplication, compression and data protection that brings the usable cost/GB back into cheap, slow hard disk territory. We didn’t go too deep into the details here, but this is the stuff that will really make or break the value proposition. Renen was saying during the presentation that they expect a minimum of 1 petabyte usable in their standard 2U JBOF (Just a Bunch Of Flash) enclosure, and in many cases they should be able to get considerably better data reduction, pushing the usable capacity up to “considerably more”…
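As a purely hypothetical illustration of how that data reduction math plays out, here’s the arithmetic with some guessed numbers; the raw capacity and reduction ratios below are my assumptions, not figures Vast quoted:

```python
# Hypothetical effect of data reduction on a 2U JBOF.
# Raw capacity and reduction ratios are assumptions for illustration only.

raw_flash_tb = 675  # assumed raw QLC capacity in the enclosure
for reduction_ratio in (1.5, 2.0, 3.0):
    usable_pb = raw_flash_tb * reduction_ratio / 1000
    print(f"{reduction_ratio}:1 reduction -> ~{usable_pb:.2f} PB usable")
```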
No more disks
One of the messages put forth by Vast was the goal of eliminating the need for spinning rust completely. On pure cost/GB, cheap 3.5” SATA drives will always win. There’s always the argument that all of the fancy compression technologies used by modern flash systems could be applied to data stored on cheap disks, but that reintroduces latency and performance variability into the system. As I’ve noticed on my internal ZFS systems, compression can be a big win for squeezing more performance out of spinning disks since it can reduce the number of physical IOs required, but it’s still a problem for primary tier-0 and tier-1 workloads that demand fixed, or close to fixed, latency. Clearly Vast is targeting these primary workloads, which are constantly growing in capacity requirements and can’t accept the variability of having disks in the IO path.
That said, I see a second product opportunity here: leverage the overall architecture, but replace the flash media with the latest SMR drives on the back-end and limit the client interfaces to object storage, where the IO performance requirements can be a little more forgiving. But with a new generation of QLC flash starting to hit the market, I think there will be lots of discussions and arguments over the value proposition of disks vs. QLC, especially around density, since it’s possible to put a lot more flash storage in the same number of cubic centimetres as a hard drive; witness the 100 TB Nimbus drive. SMR permits higher density than PMR, but we’re still a long way from a 100 TB hard disk. I suspect we’ll be seeing efficiency calculations that account for watts per TB per cm³ in the future as we aim for ever higher density and performance per watt.
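Just to sketch what that kind of efficiency comparison might look like, here’s a rough calculation; all of the capacities, power draws and dimensions are approximate assumptions of mine, not vendor specifications:

```python
# Rough volumetric and power efficiency comparison, dense HDD vs. very dense SSD.
# All figures are approximate assumptions for illustration, not vendor data.

def cm3(l_mm, w_mm, h_mm):
    return (l_mm * w_mm * h_mm) / 1000.0

drives = {
    # name: (capacity TB, active watts, length, width, height in mm) -- all assumed
    "3.5in SMR HDD":   (20,   8.0, 147.0, 101.6, 26.1),
    "3.5in 100TB SSD": (100, 14.0, 147.0, 101.6, 26.1),
}

for name, (tb, watts, *dims) in drives.items():
    vol = cm3(*dims)
    print(f"{name}: {tb / vol:.3f} TB/cm3, {watts / tb:.2f} W/TB")
```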
Back to the future
There were some interesting digressions into ideas that this architecture enables going forward, and one of them was moving away from the JBOF model to SSDs connected directly to Ethernet, similar to the Coraid EtherDrive concept, but this time around using NVMe-oF instead of legacy protocols.
I’m really hoping that there’s some movement in this area, since this is the logical path that would allow them to scale down effectively. Right now this is a tool for the big players, remembering that the base JBOF is expected to provide at least a petabyte of usable storage and scales up from there. But there’s a whole other market underneath that petabyte floor that would love to ditch their disks for what appears to be a future-proofed architecture…
Holy grail?
So has Vast really achieved the holy grail of an all-flash, all-workloads solution at disk economics? It’s looking promising, and I’m really curious to see how this plays out in the market.