Minio at SFD23

Minio gave a very interesting presentation at Storage Field Day 23 (SFD23). Beyond the product itself, it sparked some interesting discussion around a bigger question: what do words mean, anyway?

The message that Minio is putting forward is that in a cloud world, object storage is going to be the place where most data lives. On top of that, now that there are great object storage solutions (like theirs) that can be deployed on-premises, this applies to the datacenter as well. The phrase that was used frequently was “primary storage”. In the storage industry these two words together have a very specific definition used to describe market segmentation, promulgated by industry analysts like Gartner who use the following definition:

Then we come to the question: what is object storage? What distinguishes object storage is the protocol. It can be a publicly accessible one like S3 from Amazon or B2 from Backblaze (note that Backblaze also offers S3-compatible storage). But there are also private implementations of object storage, like the one used in VMware vSAN. In the file space we have pretty much standardized on the SMB and NFS protocols, which are presentation or access protocols, in much the same way that S3 is the de facto standard for object storage.

But in all cases, object storage is backed by some kind of block storage at the end of the day. As with most software-defined solutions, the software provides the access protocol and manages the resilience, so you usually use the simplest and dumbest storage available, generally SSDs or HDDs in JBODs. Which brings up another quibble: one object store might serve user-facing applications or AI/ML workloads that are highly demanding, while another might hold colder data like backups and archives, with a very different class of performance. Same protocol, different performance profile.

In the enterprise storage community, “Primary Storage” is not a definition of volume of usage, but rather a definition of relative performance. Whether access is via a file sharing protocol or a block protocol, the differentiation is all about performance.

So Minio’s use of “object storage as primary storage” provoked some confusion. Minio running on the right hardware can certainly be as performant as many other high-end storage systems, but the point they were trying to make was that object storage is “the primary storage” in the sense of “where is most of the stuff stored?”

So in this case, I think that it would be more appropriate to say that object storage is/will be the dominant protocol for storing, accessing and managing data, in the same way that NVMe will probably be the dominant protocol for SSD block storage in the future. (There’s a whole other vaguely related debate around structured or unstructured data that I’m not going to get into today.)

But don’t forget that under the hood, object storage is sitting on some block storage with a filesystem in between.

The product

Back to the product for a little bit. Minio is a flexible software implementation of the S3 protocol that can be deployed in a variety of ways depending on the use case. At its simplest, you can put it in front of a file system or a network share and map objects to files. This allows applications that consume S3 objects to work on data coming from other processes that don’t “speak” S3 but deposit files onto a file system. So we may have things like video cameras or other data collection systems that deposit files onto an NFS share, which are then consumed via S3 by an ML workload that interprets the contents of these files/objects.
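
To picture the object-to-file mapping described above, here is a small Python sketch. It is not Minio's implementation: the function name object_path, the /mnt/nfs/cameras root, and the bucket and key names are all hypothetical. The only assumption is that a bucket corresponds to a directory under the share and an object key to a relative path beneath it.

```python
from pathlib import PurePosixPath


def object_path(root: str, bucket: str, key: str) -> PurePosixPath:
    """Map an S3-style bucket/key pair to a file path under root (illustrative only)."""
    parts = PurePosixPath(key).parts
    # Reject empty keys, absolute keys and ".." segments so that a key
    # can never point outside of its bucket directory.
    if not parts or key.startswith("/") or ".." in parts:
        raise ValueError(f"unsafe object key: {key!r}")
    return PurePosixPath(root) / bucket / PurePosixPath(*parts)


# Hypothetical example: a camera writes a clip to the NFS share,
# and an S3 client would ask for the same data as bucket "lobby".
print(object_path("/mnt/nfs/cameras", "lobby", "2024/06/01/clip-0001.mp4"))
```

The point is simply that the same bytes have two names: a path for the process writing files, and a bucket plus key for the application reading objects.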

For scaling up and adding features like immutability, governance policies, etc., Minio can also be deployed as a cluster over multiple disks on a single server or multiple servers, using erasure coding for resilience. As it is a software solution deployed on Linux, these servers can be physical or virtual as required. This makes it viable for on-premises deployments on generic server hardware. In these cases, there are a few things to note:

clusters are made of pools
clusters are scaled by adding pools of (preferably) identical capacity & configuration

So you need to be aware of the step function involved. If you start with a 10-server pool, your growth will be done by adding another pool of 10 servers. From a design perspective in this case, you might want to initially deploy two pools of 5 servers so that the growth can be handled in chunks of 5 rather than 10. For those used to working with ZFS, you can think of a pool as a vdev from an architecture design perspective. The scope of erasure coding resilience is the pool.
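
The step function is easy to see with a few lines of Python. This is just arithmetic on the example above, assuming every added pool is identical to the existing ones; growth_steps is a made-up helper name, not anything from Minio.

```python
def growth_steps(initial_pools: int, servers_per_pool: int, expansions: int) -> list[int]:
    """Total server count after each expansion, growing one whole pool at a time."""
    return [(initial_pools + n) * servers_per_pool for n in range(expansions + 1)]


# One pool of 10 servers: every expansion is a jump of 10.
print(growth_steps(1, 10, 3))  # [10, 20, 30, 40]

# Two pools of 5 servers: same starting size, but growth in chunks of 5.
print(growth_steps(2, 5, 3))   # [10, 15, 20, 25]
```

Both designs start at 10 servers, but the second one lets you buy hardware in smaller increments.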

Then we have the Minio Operator, which is available directly on all of the major cloud platforms or can be deployed on internal cloud platforms using Kubernetes. It’s a web and API front end that enables the deployment and management of clusters based on the available resources of the cloud where it’s running. So you choose from various server configurations with the appropriate CPU and storage options, and it deploys all of the necessary pieces to get Minio clusters up and running. This would be the point of entry for most people using Minio in the cloud: using the operator to deploy and manage your clusters.

It also goes the extra mile by automatically deploying and integrating with external tools like Prometheus for visualizing usage, performance, etc.

Why Minio instead of AWS S3?

The question from many people is: why should I use Minio instead of the native S3 options available? Since Minio is an independent software layer, you can deploy it anywhere and leverage the built-in replication features to get additional resilience, with the ability to replicate from your Minio instance hosted on Amazon over to another instance hosted in the Google cloud or even on-premises.

From that point, DR can be handled by pointing your applications at the currently available S3 endpoint regardless of where it’s hosted. If you have cloud native applications that deploy using Kubernetes, it’s just a matter of changing the path to the application’s bucket and some DNS entries for the application, and you’re back in business.
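
As a rough illustration of "point at whichever endpoint is currently available", here is a minimal Python sketch using only the standard library. The endpoint URLs and the helper names (http_alive, first_available) are hypothetical, and the probe assumes the /minio/health/live liveness URL, so check that against the documentation for your version. In a real deployment this decision would more likely live in DNS or a load balancer than in application code.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def http_alive(endpoint: str, timeout: float = 2.0) -> bool:
    """Probe the endpoint's liveness URL; any failure counts as 'down'."""
    try:
        # Assumed health check path; verify it for your deployment.
        url = f"{endpoint}/minio/health/live"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def first_available(
    endpoints: Iterable[str],
    is_alive: Callable[[str], bool] = http_alive,
) -> Optional[str]:
    """Return the first endpoint that answers its health check, in priority order."""
    return next((e for e in endpoints if is_alive(e)), None)


# Hypothetical endpoints, listed in order of preference.
endpoints = [
    "https://s3.aws.example.com",
    "https://s3.gcp.example.com",
    "https://s3.onprem.example.com",
]

# Simulate an outage of the first site without touching the network.
down = {"https://s3.aws.example.com"}
print(first_available(endpoints, is_alive=lambda e: e not in down))
```

Because every site speaks the same S3 protocol, the application does not need to care which one answered.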

As always, remember that egress fees are a thing, so depending on your use case, you might prefer to host core applications on-prem with replication to the cloud so that the bulk of the data flow is towards the cloud rather than from the cloud.
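
A back-of-the-envelope Python sketch shows why the direction of replication matters. The numbers are invented for illustration: the 0.09 per-GB price and the monthly volumes are placeholders, not any provider's actual rates, and the sketch assumes that data flowing into the cloud is not billed.

```python
def monthly_egress_cost(gb_leaving_cloud: float, price_per_gb: float) -> float:
    """Egress bill for one month: only data leaving the cloud is charged (assumption)."""
    return gb_leaving_cloud * price_per_gb


PRICE_PER_GB = 0.09      # hypothetical egress price, per GB
REPLICATED_GB = 10_000   # hypothetical volume replicated each month
RESTORED_GB = 200        # hypothetical volume occasionally pulled back on-prem

# Primary copy in the cloud, replicating out to on-prem: all replication is egress.
cloud_primary = monthly_egress_cost(REPLICATED_GB, PRICE_PER_GB)

# Primary copy on-prem, replicating into the cloud: only restores are egress.
onprem_primary = monthly_egress_cost(RESTORED_GB, PRICE_PER_GB)

print(f"cloud-primary:  {cloud_primary:.2f} per month")
print(f"onprem-primary: {onprem_primary:.2f} per month")
```

Plug in your own provider's pricing and data volumes before drawing any conclusions.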

My experience with Minio

I’ve deployed Minio in a number of different environments and configurations and have found that it does exactly what it says it does. It’s a performant (you choose the hardware) S3 service with all of the hot button features like immutability and object locking for backup repositories, encryption, versioning, retention policies, access auditing, …

One issue for many people new to Minio is that development is proceeding at a breakneck pace, and sometimes it’s hard to tell whether the documentation or tutorial you’re looking at is talking about a current version or not. But they are busy updating the official documentation, so start there if you plan to do any bare metal installs yourself.