Jerome Petazzoni
- Running stateful workloads on Kubernetes used to be challenging, but the technology has matured. Today, up to 90% of companies believe that Kubernetes is ready for production data workloads
- OpenEBS provides storage for stateful applications running on Kubernetes, including dynamic local persistent volumes and replicated volumes backed by various “data engines”
- Local PV data engines provide excellent performance but at the risk of data loss due to node failure
- For replicated engines, there are three options available: Jiva, cStor, and Mayastor. Each engine supports different use cases and needs
- OpenEBS can address a wide range of applications, from casual testing and experimentation to high-performance production workloads
When I deliver a Kubernetes training, one chapter invariably comes at the end of the training and never earlier. It’s the chapter about stateful sets and persistent storage – in other words, running stateful workloads in Kubernetes. While running stateful workloads on Kubernetes used to be a no-go, up to 90% of companies now believe that K8s is ready for data. The last lab of that chapter involves running a PostgreSQL benchmark (that keeps writing to disk) in a pod, and then breaking the node that runs that pod and showing various mechanisms involved in failover (such as taint-based evictions). Historically, I’ve been using Portworx for that demo. Recently, I decided to give OpenEBS a shot.
In this post, I’ll give you my first impressions of OpenEBS: how it works, how to get started with it, and what I like about it.
OpenEBS provides storage for stateful applications running on Kubernetes, including dynamic local persistent volumes (like the Rancher local path provisioner) and replicated volumes using various “data engines”. Prometheus can be deployed on a Raspberry Pi to monitor the temperature of your beer or sourdough cultures in your basement, but it can also be scaled up to monitor hundreds of thousands of servers. Similarly, OpenEBS can be used for simple projects and quick demos, but also for large clusters with sophisticated storage needs.
OpenEBS supports many different “data engines”, and that can be a bit overwhelming at first. But these data engines are precisely what makes OpenEBS so versatile. There are “local PV” engines that typically require little or no configuration, offer good performance, but exist on a single node and become unavailable if that node goes down. And there are replicated engines that offer resilience against node failures. Some of these replicated engines are super easy to set up, but the ones offering the best performance and features will take a bit more work.
Let’s start with a quick review of all these data engines. The following is not a replacement for the excellent OpenEBS documentation; it is simply my way of explaining these concepts.
Local PV data engines
Persistent Volumes using one of the “local PV” engines are not replicated across multiple nodes; OpenEBS uses the node’s local storage. Multiple variants of local PV engines are available. OpenEBS can use local directories (used as HostPath volumes), existing block devices (disks, partitions, or otherwise), raw files, ZFS file systems (enabling advanced features like snapshots and clones), or Linux LVM volumes (in which case OpenEBS works similarly to TopoLVM).
The obvious downside of the local PV data engines is that a node failure will cause the volumes on that node to be unavailable, and if the node is lost, so is the data that was on that node. However, these engines feature excellent performance: since there is no overhead on the data path, read/write throughput will be the same as if we were using the storage directly, without containers. Another advantage is that the host path local PV works out of the box when installing OpenEBS, without requiring any extra configuration, similarly to the Rancher local path provisioner. That is extremely convenient when I need a storage class “right now” for a quick test!
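To make the “storage class right now” idea concrete, here is a minimal sketch that builds a PersistentVolumeClaim manifest as JSON, which `kubectl apply -f` accepts as well as YAML. The storage class name `openebs-hostpath` is an assumption about what a default installation creates (check with `kubectl get storageclass` on your cluster), and the claim name and size are hypothetical.

```python
import json


def make_pvc(name, size, storage_class="openebs-hostpath"):
    """Build a PersistentVolumeClaim manifest as a plain dict.

    The default storage class name is an assumption; adjust it to
    whatever `kubectl get storageclass` reports on your cluster.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            # Local volumes live on one node, so a single-node access mode fits.
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical claim name and size, for illustration only.
pvc = make_pvc("quick-test-data", "1Gi")
print(json.dumps(pvc, indent=2))
```

Saving the output to a file and applying it should give a pod something to mount, provided the storage class exists under that name.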
OpenEBS also offers multiple replicated engines: Jiva, cStor, and Mayastor. I’ll be honest, I was quite confused by this at first: why do we need not one, not two, but three replicated engines? Let’s find out!
The Jiva engine is the simplest one. Its main advantage is that it doesn’t require any extra configuration. Like the host path local PV engine, the Jiva engine works out of the box when installing OpenEBS. It replicates data across nodes: with the default settings, each time we provision a Jiva volume, three storage pods will be created, using a scheduling placement constraint to ensure that they get placed on different nodes. That way, a single node outage won’t take out more than one volume replica at a time. The Jiva engine is simple to operate, but it lacks the advanced features of other engines (such as snapshots, clones, or adding capacity on the fly), and the OpenEBS docs mention that Jiva is suitable when “capacity requirements are small” (such as below 50 GB). In other words, that’s fantastic for testing, labs, or demos, but maybe not for that giant production database.
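The placement argument above can be checked with a toy model. This is not OpenEBS or Kubernetes scheduler code. It is a sketch that assumes one replica per node (standing in for the placement constraint) and uses hypothetical node names to show that any single node failure leaves at least two of the three replicas.

```python
def place_replicas(nodes, replicas=3):
    """Pick one distinct node per replica (a stand-in for pod anti-affinity)."""
    if len(nodes) < replicas:
        raise ValueError("need at least as many nodes as replicas")
    return list(nodes[:replicas])


def surviving_replicas(placement, failed_node):
    """Return the replicas that are still reachable after one node fails."""
    return [node for node in placement if node != failed_node]


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
placement = place_replicas(nodes)

for failed in nodes:
    alive = surviving_replicas(placement, failed)
    # With one replica per node, a single failure removes at most one copy.
    print(f"{failed} down -> {len(alive)} of {len(placement)} replicas left")
```

If two replicas were allowed to share a node, the same loop would report a case with only one replica left, which is exactly what the placement constraint is there to prevent.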
Next on the list is the cStor engine. That one brings us the extra features mentioned earlier (snapshots, clones, and adding capacity on the fly) but it requires a bit more work to get it going. Namely, you need to involve the NDM, the Node Disk Manager component of OpenEBS, and you need to tell it which available block devices you want to use. This means that you should have some free partitions (or even entire disks) to allocate to cStor.
If you don’t have any extra disk or partition available, you may be able to use loop devices instead. However, since loop devices carry a significant performance overhead, you might as well use the Jiva provisioner instead in that case, because it will achieve similar results but will be much easier to set up.
Finally, there is the Mayastor engine. It is designed to work tightly with NVMe (Non-Volatile Memory express) disks and protocols (it can still use non-NVMe disks, though). I was wondering why that was a big deal, so I did a little bit of digging.
In old storage systems, you could only send one command at a time: read this block, or write that block. Then you had to wait until the command was completed before you could submit another one. Later, it became possible to submit multiple commands, and let the disk reorder them to execute them faster; for instance, to reduce the number of head seeks using an elevator algorithm. In the late 90s, the ATA-4 standard introduced TCQ (Tagged Command Queuing) to the ATA spec. This was considerably improved, later, by NCQ (Native Command Queuing) with SATA disks. SCSI disks had command queuing for a longer time, which is also why they were more expensive and more likely to be found in high-end servers and storage systems.
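To see why letting the disk reorder queued commands helps, here is a small sketch comparing first-come-first-served order with a simplified elevator pass (sweep upward from the current head position, then back down). The block numbers and head position are hypothetical, and real firmware is far more sophisticated; the point is only the difference in head travel.

```python
def head_travel(start, order):
    """Total distance the head moves when serving blocks in the given order."""
    travel, position = 0, start
    for block in order:
        travel += abs(block - position)
        position = block
    return travel


def elevator_order(start, requests):
    """Serve requests at or above the head going up, then the rest going down."""
    upward = sorted(b for b in requests if b >= start)
    downward = sorted((b for b in requests if b < start), reverse=True)
    return upward + downward


start = 50                        # hypothetical current head position
queue = [95, 10, 60, 20, 80, 30]  # hypothetical pending block numbers

fifo = head_travel(start, queue)
scan = head_travel(start, elevator_order(start, queue))
print(f"FIFO travel: {fifo}, elevator travel: {scan}")
# prints "FIFO travel: 330, elevator travel: 130"
```

Without a queue, the disk has no choice but to serve commands in arrival order; with one, it can pick the cheaper path.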
Over time, the queuing systems evolved a lot. The first standards allowed queuing a few dozen commands in a single queue; now we’re talking about thousands of commands in thousands of queues. This makes multicore systems more efficient, since queues can be bound to specific cores, which reduces contention. We can now have priorities between queues as well, which can ensure fair access to the disk between queues. This is great for virtualized workloads, to make sure that one VM doesn’t starve the others. And importantly, NVMe also reduces the CPU usage related to disk access, because it’s designed to require fewer round trips between the OS and the disk controller. While there are certainly many other features in NVMe, this queuing business alone makes a great deal of difference, and I understand why Mayastor would be relevant to folks who want to design storage systems with the highest performance.
If you want help figuring out which engine is best for your needs, you’re not alone, and the OpenEBS documentation has an excellent page about this.
Container attached storage
Another interesting thing in OpenEBS is the concept of CAS, or Container Attached Storage. The wording made me raise an eyebrow at first. Is that some marketing gimmick? Not quite.
When using the Jiva replicated engine, I noticed that for each Jiva volume, I would get 4 pods and a service:
- a “controller” pod (with “-ctrl-” in its name)
- three “data replica” pods (with “-rep-” in their names)
- a service exposing (over different ports): an iSCSI target, a Prometheus metrics endpoint, and an API server
This is interesting because it mimics what you get when you deploy a SAN: multiple disks (the data replica pods) and a controller (to interface between a storage protocol like iSCSI and the disks themselves). These components are materialized by containers and pods, and the storage is actually in the containers, so the term “container attached storage” makes a lot of sense (note that the storage doesn’t necessarily use copy-on-write container storage; in my setup, by default, it’s using a hostPath volume; however this is configurable).
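The layout described above is easy to verify from a list of pod names. The sketch below relies only on the “-ctrl-” and “-rep-” markers mentioned earlier; the pod names themselves are hypothetical, and on a real cluster you would feed it the output of `kubectl get pods`.

```python
from collections import Counter


def pod_role(pod_name):
    """Guess a Jiva pod's role from the name markers mentioned above."""
    if "-ctrl-" in pod_name:
        return "controller"
    if "-rep-" in pod_name:
        return "replica"
    return "other"


# Hypothetical pod names; real names will carry generated identifiers.
pods = [
    "pvc-1234-ctrl-abcde",
    "pvc-1234-rep-fghij",
    "pvc-1234-rep-klmno",
    "pvc-1234-rep-pqrst",
]

layout = Counter(pod_role(name) for name in pods)
print(dict(layout))
```

For a healthy volume with default settings, I would expect one controller and three replicas, mirroring the “one controller, several disks” shape of a SAN.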
I mentioned iSCSI above. I found it reassuring that OpenEBS was using iSCSI with cStor, because it’s a solid, tested protocol widely used in the storage industry. This means that OpenEBS doesn’t require a custom kernel module or anything like that. I believe that it does, however, require some userland tools to be installed on the nodes. I say “I believe” because on my Ubuntu test nodes with a very barebones cloud image, I didn’t need to install or configure anything extra anyway.
After this quick tour of OpenEBS, the most important question is: does it fit my needs? I found that its wide range of options meant that it could handle pretty much anything I would throw at it. For training, development environments, and even modest staging platforms, when I need a turnkey dynamic persistent volume provisioner, the local PV engines work great. If I want to withstand node failures, I can leverage the Jiva engine. And finally, if I want both high availability and performance, all I have to do is invest a minimal amount of time and effort to set up the cStor engine (or Mayastor if I have some fancy NVMe devices and want to stretch their performance to the max). Being both a teacher and a consultant, I appreciate that I can use the same toolbox in the classroom and for my clients’ production workloads.