Presentation: Kubernetes as a Foundation for Infrastructure Control Planes

Daniel Mangum

Article originally posted on InfoQ.

Transcript

Mangum: My name is Dan Mangum. I’m a Crossplane maintainer. I’m a staff software engineer at Upbound, which is the company behind the initial launch of the Crossplane project, which is now a CNCF incubating project. I’m going to be talking to you about Kubernetes as a foundation for infrastructure control planes. I’m going to cheat a little bit, I did put an asterisk on infrastructure, because we’re going to be talking about control planes in general. It’s going to expand a little bit beyond infrastructure specifically. That is going to serve as a great use case for control planes here.

Outline

When I give talks, I like to be really specific about what we’re going to cover, so you can know if there’s going to be value for you in this talk. I usually go through three stages, motivation. That’s going to be, what is a control plane, and why do I need one? Then explanation. Why Kubernetes as a foundation. That’s the title of our talk here. Why does Kubernetes serve as a good foundation for building control planes on top of? Then finally, inspiration. Where do I start? Now that I’ve heard about what control planes are, why Kubernetes is useful for them, how do I actually go about implementing one within my organization with some tangible steps?

What Is a Control Plane?

What is a control plane? In the Crossplane community, we like to describe a control plane as follows: a declarative orchestration API for really anything. We’re going to be specifically talking about two use cases for control planes. The first one is infrastructure. Infrastructure is likely the entry point that many folks have had to a declarative orchestration API. What does that look like? The most common use case is with the large cloud providers, so AWS, Azure, and GCP. There’s countless others. Essentially, what you’re doing is you’re telling them, I would like this state to be a reality, whether that’s a database or a VM, or something like that. They are reconciling that and making sure that that is a state that’s actually reflected, and then giving you access to that. You don’t actually worry about the implementation details or the imperative steps to make that state a reality. There’s also a bunch of other platforms that offer maybe specialized offerings, or something that has a niche take on what a product offering should look like. Here we have examples of bare metal, we have specific databases, we have a message bus solution, and also a CDN. There’s a lot of smaller cloud providers and platforms that you may also use within your organization.

The other type of control plane that you may be familiar with, and is used as a declarative orchestration API is for applications. What are the platforms that offer declarative orchestration APIs for applications? Once again, the large cloud providers have higher level primitives that give you the ability to declare that you want an application to run. Whether that’s Lambda on AWS, or Cloud Run on GCP. These are higher level abstractions that they’ve chosen and defined for you, that you can interact with to be able to run an application and perhaps consume some of that infrastructure that you provisioned. There’s also layer-2 versions of this for applications as well. Examples here include static site generators, runtimes like Deno or something, and a quite interesting platform like Fly.io that puts your app close to users. These are basically opinionated offerings for application control planes, where you’re telling them you want an application to run, and they’re taking care of making that happen.

Why Do I Need a Control Plane?

Why do I need a control plane? We’ve already made the case that you have a control plane. These various platforms, I would hazard a guess that most people have had some exposure or are currently using at least one of these product offerings. Let’s take it for granted that you do have a control plane. The other part we need to focus on is I, or maybe we, what are the attributes about us that make us good candidates for a control plane? QCon likes to call its attendees and speakers, practitioners. I like to use a term similar to that, builders. In my mind, within an organization, there’s a spectrum between two types of builders. One is platform, and the other is product. On the platform side, you have builders who are building the thing that builds the thing. They’re giving you a foundation that the product folks build on top of to actually create the business value that the organization is interested in. When we talk about these two types of personas, we frequently view it as a split down the middle. There’s a single interface where platform teams give a set of APIs or a set of access to product folks. In some organizations that may look like a ticketing system. In a more mature DevOps organization, there may be some self-service workflow for product engineers.

In reality, the spectrum looks a lot more like this. There’s folks that exist at various parts of this. You may have someone in the marketing team, for instance, who wants to create a static site, and they have no interest in the underlying infrastructure. They’re very far to the product side. You may have someone who is a developer who knows a lot about databases, and wants to tune all the parameters. Furthermore, you may have an organization that grows and evolves over time. This is something I’m personally experiencing, working at a startup that’s growing rather quickly. We’ve gone from having a small number of engineers to many different teams of engineers, and our spectrum and where developers on the product and platform side sit on that, has changed quite a bit over time. Those abstractions that you may be getting from those cloud providers or platforms that you’re using may not fit your organization in perpetuity.

Control Plane Ownership

We said that you already have a control plane. What’s the issue? You just don’t own it yet. All of these control planes that you’re using are owned by someone else. Why do they have really great businesses around them? Why would you want to actually have some ownership over your control plane? Before we get into the benefits of ownership, I want to make it very clear that I’m not saying that we shouldn’t use these platforms. Folks at AWS, and GCP, and Azure, and all of those other layer-2 offerings, they’ve done a lot of work to give you very compelling offerings, and remove a lot of the operational burden that you would otherwise have to have yourself. We absolutely want to take advantage of all that work and the great products that they offer, but we want to bring that ownership within your organization.

Benefits to Control Plane Ownership

Let’s get to the benefits. The first one is inspection, or owning the pipe. I’ve heard this used when folks are referring to AWS. I like to use Cloudflare actually as a great example of this. It’s a business that’s built on having all of the traffic of the internet flow through it, and being able to build incredible products on top of that. You as an organization currently probably have someone else who has the inspection on your pipe. That may be AWS, for instance. They can give you additional product offerings. They can see how you’re using the product. They can grow and change it over time. If you move beyond a single platform, you lose the insight and you may not be the priority of the person who owns a pipe that you’re currently using. When you have that inspection, you can actually understand how folks within your organization are using infrastructure, deploying it, consuming it. Everything goes through a central place, even if you’re using multiple different platforms.

The next is access management. Folks who are familiar with IAM on AWS, or really any other platform, know how important this is and how complicated it can be. With a control plane that sits in front of these platforms, you have a single entry point for all actors. You’re no longer saying, I want to give a single developer permission to AWS, even if you’re giving them the ability to create a database on AWS. We’ll actually define a higher level concept, and even if it maps one to one as an abstraction on top of a concrete implementation, you’re going to have a single entry point where you can define all access for users, no matter what resources they’re accessing behind the scenes.

Next is the right abstractions. Unless you’re the largest customer of a cloud platform, you are not going to be the priority. There may be really great abstractions that they offer you. For instance, in my approximation, Lambda on AWS has been very successful, lots of folks are building great products on top of it. However, if your need for abstraction changes, and maybe your spectrum between platform and product folks evolves over time, you need to be able to change the different knobs, and maybe even for a specific project offer a different abstraction to developers, and not be locked into whatever another platform defines as the best for you.

The last one is evolution. Your organization is not going to be stagnant, and so the API that you present to developers may change over time. Your business needs may change over time, so the implementations behind the API of your control plane can also evolve over time without the developers actually even needing to know.

Why Kubernetes as a Foundation?

We’ve given some motivation for why you would need a control plane within an organization. We’re going to make that really concrete here by looking at an explanation of why Kubernetes serves as a strong foundation for a control plane. Really, what I’m talking about here is the Kubernetes API. Kubernetes has become the de facto container orchestration API. In reality, it’s really just an API for orchestrating anything. Some may even refer to it as a distributed systems framework. When I gave this presentation at QCon in London, I referred to it as the POSIX for distributed systems. While it may have different abstractions than an operating system, it is critical to start to think about Kubernetes and distributed systems as something we program against, just like the APIs an operating system gives us. What Kubernetes allows us to do is extend APIs and offer our own. Let’s get into how that actually works.

Functional Requirements of a Control Plane

What are the functional requirements for a control plane? Then let’s try to map those back to what Kubernetes gives us. First up is reliable. I don’t mean that this is an assertion that a pod never restarts or something like that. What I’m saying is we can’t tolerate unexpected deletion of critical infrastructure. If you’re using ephemeral workloads on a container orchestration platform, this isn’t as big of a deal. You can stand for nodes to be lost, and that sort of thing. We can still stand for that to happen when dealing with infrastructure. What we can’t have happen is for the control plane to arbitrarily go and delete some of that infrastructure. We can withstand it not reconciling it for a period of time.

The next is scalable: you get more value at higher adoption of a control plane. We talked about that inspection and the value that you get from that. You must be able to scale to meet the increased load that comes from having higher adoption within your organization, because that higher adoption is what brings the real value to your control plane. The next is extendable. I’ve already mentioned how Kubernetes does this a little bit, and we’ll go into the technical details. You must be able to evolve that API over time to match your evolving organization. It needs to be active and responsive to change. We’re not just saying that we want this to exist, and after it exists, we’re done. We’re going to have requirements that are changing over time. The state that we desire is going to evolve over time. There’s also going to be external factors that impact that. I said earlier that we don’t want to give up all of the great work that those cloud providers and platforms have done, so we want to take advantage of those. You may have a rogue developer come in and edit a database outside of your control plane, which you don’t want, because that’s going outside of the pipe. Or you may just have an outage from the underlying infrastructure, and you want to know about that. You need a control plane that constantly reconciles your state, and makes sure that what’s represented externally to the cluster matches the state that you’ve put inside of it.

If you’ve seen this before, it’s because I took most of these directly from the Kubernetes website. These are the qualities that Kubernetes says it provides for container orchestration. It also offers them for really any other distributed system you want to build on top. There’s a number of others that we’re not going to talk specifically about, but are also useful. One is secrets management, which is useful when you need to connect to these external cloud providers, as well as potentially manage connection details to infrastructure you provision. There’s network primitives. There’s access control, which we touched on a little bit already. Last and certainly not least, is a thriving open source ecosystem. You have a foundation that if you need to actually change the foundation itself, you can go in and be a part of driving change. You can have a temporary fork where you have your own changes. You can actually be part of making it a sustainable project over time, which lots of organizations as well as individuals do today.

You Can Get a Kubernetes

Last on the list of things that make Kubernetes really useful is what we like to say: you can get a Kubernetes. What I mean by this is Kubernetes is available. There is a way to consume Kubernetes that will fit your organization. In fact, there are 60-plus distributions listed on the CNCF website that you can run on your own infrastructure, either yourself or by paying a vendor to do it for you. There are 50-plus managed services where you can actually just get the API without worrying about any of the underlying infrastructure. Kubernetes, no matter what platforms you use, is going to be there for you, and you can always consume upstream as well.

What’s Missing?

What’s missing from this picture? If Kubernetes gives us everything that we need, why do we need to build on top of it? There are two things that Kubernetes is missing, and I like to describe them as primitives and abstraction mechanisms. Kubernetes does offer quite a few primitives. You’ve probably heard of pods, deployments, ingress. There’s lots of different primitives that make it very easy to run containerized workloads. What we need to do is actually start to take external APIs and represent them in the Kubernetes cluster. For example, with an AWS EC2 instance, we want to be able to create a VM on AWS alongside something like a pod or a deployment, or maybe just in a vacuum. There are also things outside of infrastructure, though. At Upbound, we’re experimenting with onboarding employees by actually creating an employee abstraction. We’ll talk about how we do that. That is backed by things like a GitHub team that automatically makes sure an employee is part of a GitHub organization. There can also be higher level abstractions, like a Vercel deployment that you may want to create from your Kubernetes cluster for something like a Next.js site.

The other side is abstraction mechanisms. How do I compose those primitives in a way that makes sense for my organization? Within your organization, you may have a concept of a database, which is really just an API that you define that has the configuration power that you want to give to developers. That may map to something like an RDS instance, or a GCP Cloud SQL instance. What you offer to developers is your own abstraction, meaning you as a platform team have the power to swap out implementations behind the scenes. I already mentioned the use case of an employee, and a website that maps to a Vercel deployment. The point is, you can define how this maps to concrete infrastructure, and you can actually change it without changing the API that you give to developers within the organization. Lastly, how do I manage the lifecycle of these primitives and abstraction mechanisms within the cluster? I’m going to potentially need multiple control planes. I’m going to need to be able to install them reliably. I might even want to share them with other organizations.

Crossplane – The Control Plane Tool Chain

I believe Crossplane is the way to go to add these abstractions on top of Kubernetes, and build control planes using the powerful foundation that it gives us. If you want the one-liner, it’s essentially the control plane tool chain. It takes those primitives from Kubernetes and it gives you nice tooling to be able to build your control plane on top of it by adding those primitives and abstraction mechanisms that you need. Let’s try and take a shot at actually mapping those things that we said are missing from Kubernetes to concepts that Crossplane offers.

Primitives – Providers

First up is providers. They map to primitives that we want to add to the cluster. A provider brings external API entities into your Kubernetes cluster as new resource types. Kubernetes allows you to do this via custom resource definitions, which are actually instances of objects you create in the cluster that define a schema for a new type. The easiest way to conceptualize this is thinking of a REST API and adding a new endpoint to it. After you add a new endpoint in a REST API, you need some business logic to actually handle taking action when someone creates, updates, or deletes at that endpoint. A provider brings along that logic that knows how to respond to state change and drift in those types that you’ve added to the cluster. For example, if you add an RDS instance to the cluster, you need a controller that knows how to respond to the create, update, or delete of an RDS instance within your Kubernetes cluster.
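
To make that concrete, here is a rough sketch of what a managed resource for an RDS instance can look like once an AWS provider is installed. The API group, version, and field names vary between providers and releases, so treat the specifics as illustrative rather than exact:

```yaml
# Illustrative managed resource; the exact schema depends on the provider version.
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: example-db
spec:
  forProvider:
    region: us-east-1            # where the instance should be created
    dbInstanceClass: db.t3.small
    engine: postgres
    allocatedStorage: 20
  # Crossplane writes the generated connection details to this secret.
  writeConnectionSecretToRef:
    name: example-db-conn
    namespace: crossplane-system
  providerConfigRef:
    name: default                # which cloud credentials to use
```

Once applied, the provider’s controller continuously reconciles this object against the real RDS instance, creating it if it doesn’t exist and correcting drift if it does.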

The important part here is that these are packaged up as an OCI image for ease of distribution and consumption. All of these things are possible without doing this, but taking the OCI image path for being able to distribute these means that they are readily available. OCI images have become ubiquitous within organizations, and every cloud provider has their own registry solution as well. These are really easy to share and consume. You can also compose the packages themselves (providers are one type of Crossplane package), and start to build a control plane hierarchy out of these packages, which we’ll talk about a little when we get to composition.
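
Installing a provider is itself just another Kubernetes object. A minimal sketch, with the package reference and version as placeholders:

```yaml
# Installing a provider package: Crossplane pulls the OCI image, installs the
# CRDs it contains, and starts the controller that reconciles them.
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: crossplane/provider-aws:v0.24.1   # example package reference
```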

Abstraction Mechanisms – Composition

Abstraction mechanisms are the second thing that we said is missing from Kubernetes. Crossplane offers abstraction mechanisms via composition. Composition allows you to define abstract types that can be satisfied by one or more implementations. This is done via a composite resource definition, which is different from the custom resource definition that Kubernetes offers. Crossplane brings along the composite resource definition, or as we like to call it, an XRD, as opposed to the Kubernetes CRD. The XRD allows you to define a schema just like a CRD, but it doesn’t have a specific controller that watches it. Instead, it has a generic controller, which we call the composition engine, which takes instances of the type defined by the XRD and maps them to composition types. A composition takes those primitives, and basically tells you how the input on the XRD maps to the primitives behind the scenes.
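
As a minimal sketch, assuming a hypothetical database abstraction with size and region fields (the group and kind names here are made up for illustration), an XRD might look roughly like this:

```yaml
# Hypothetical XRD defining the developer-facing database abstraction.
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xdatabases.example.org      # must be <plural>.<group>
spec:
  group: example.org
  names:
    kind: XDatabase
    plural: xdatabases
  claimNames:                       # the namespaced type developers create
    kind: Database
    plural: databases
  versions:
  - name: v1alpha1
    served: true
    referenceable: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              size:
                type: string
                enum: [small, medium, large]
              region:
                type: string
            required: [size, region]
```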

Let’s take an example, and we’ll keep using that same one that’s come up a number of times in this talk, a database. In your organization, you have a database abstraction. Really, the only fields a developer needs to care about, let’s say, are the size of the database, and the region that it’s deployed in. On your XRD, you define those two fields, and you might have some constraints on the values that could be provided to them. Then you author, let’s say, three compositions here. One is for AWS, one is for GCP, and one is for Azure. On the AWS one, let’s say you have some subnets, a DB subnet group and an RDS instance. On the GCP one, you have, let’s say a network and a Cloud SQL instance. On Azure, you have something like a resource group, and an Azure SQL instance. When a developer creates a database, it may be satisfied by any one of these compositions, which will result in the actual primitives getting rendered and eventually reconciled by the controllers. Then the relevant state gets propagated back to the developer in a way that is only applicable to what they care about. They care about that their database is provisioned. They don’t care about if there is a DB subnet group and a DB behind the scenes.
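
A sketch of what one of those compositions, say the AWS one, might look like against the XRD above. It’s simplified to a single RDS instance rather than the full set of subnets and a DB subnet group, and the kinds and patches are illustrative:

```yaml
# Hypothetical AWS implementation of the XDatabase abstraction.
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xdatabases-aws
  labels:
    provider: aws                  # used later for composition selection
spec:
  compositeTypeRef:
    apiVersion: example.org/v1alpha1
    kind: XDatabase
  resources:
  - name: rdsinstance
    base:
      apiVersion: database.aws.crossplane.io/v1beta1
      kind: RDSInstance
      spec:
        forProvider:
          engine: postgres
          dbInstanceClass: db.t3.small
          allocatedStorage: 20
    patches:
    # Copy the developer-facing region onto the concrete resource.
    - fromFieldPath: spec.region
      toFieldPath: spec.forProvider.region
    # A real composition would also map spec.size to an instance class and
    # storage amount, and compose the networking resources described above.
```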

These are actually also packaged as OCI images, that’s the second type of Crossplane package. A really powerful ability you have with these packages is to say what you depend on. You can reference other images as dependencies, which means that if you install a configuration package with the XRD and compositions that we just talked about, you can declare dependencies on the AWS, GCP, and Azure providers, and Crossplane will make sure those are present. If not, will install them for you with appropriate versions within the constraints that you define. This allows you to start to build a DAG of packages that define your control plane and make it really easy to define a consistent API and also build them from various modular components.
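
A sketch of the package metadata (typically a crossplane.yaml at the root of the configuration) that declares those dependencies; the package names and version ranges are placeholders:

```yaml
# Hypothetical package metadata for the configuration that ships the XRD and
# compositions above, declaring its provider dependencies.
apiVersion: meta.pkg.crossplane.io/v1
kind: Configuration
metadata:
  name: platform-databases
spec:
  crossplane:
    version: ">=v1.6.0"            # minimum Crossplane version required
  dependsOn:
  - provider: crossplane/provider-aws
    version: ">=v0.24.0"
  - provider: crossplane/provider-gcp
    version: ">=v0.20.0"
  - provider: crossplane/provider-azure
    version: ">=v0.18.0"
```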

Making this a little more concrete, I know this is difficult to see. If you want this example, it’s from the actual Crossplane documentation. Here, on the left side, we’re defining an XRD that is very similar to the one that I just described. This one just takes a storage size. On the right side, we’re defining a Postgres instance implementation that is just an RDS instance that maps that storage size to the allocated storage field in the AWS API. Something that’s really important here is that we strive for, and always have, high fidelity representations of the external API. That means that if you use the AWS REST API, and there’s a field that’s available for you, you’re going to see it represented in the same way as a Kubernetes object in the cluster. This is a nice abstraction, but it’s quite simplistic: you’ll see the publiclyAccessible field is set to true, which basically means it’s public on the internet. Not a common production use case.

Without actually changing the XRD or the API that the developer is going to interact with, we can have a much more complex example. This example here is creating a VPC, three subnets, a DB subnet group, some firewall rules, and the RDS instance, which is a much more robust deployment setup here. The developer doesn’t have to take on this increased complexity. This really goes to show that you may have lots of different implementations for an abstraction, whether on the same platform or on multiple, and you can, just like you do in programming, define interfaces, and then implementations of that interface. Because what we’re doing here is programming the Kubernetes API and building a distributed system, building a control plane on top of it.

This is what the actual thing the developer created would look like. They could, as it’s a Kubernetes object, include it in something like a Helm chart, or Kustomize, or whatever your tooling of choice is for creating Kubernetes objects. They can specify not only the configuration that’s exposed to them, but potentially also select the actual composition that it matches, based on selectors that you’ve defined, such as the provider and VPC fields here. They can also define where the connection secret gets written to. Here, they’re writing it to a Kubernetes secret called db-conn. We also allow for writing to External Secret Stores, like Vault, or KMS, or things like that. Really, you get to choose what the developer has access to, and they get a simple API without having to worry about all the details behind the scenes.
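
Following the fields described here, a developer’s claim might look roughly like this (the group, kind, and label names are carried over from the hypothetical example above):

```yaml
# Hypothetical claim a developer creates: only the abstraction's fields are
# exposed, plus composition selection and the connection secret name.
apiVersion: example.org/v1alpha1
kind: Database
metadata:
  name: my-app-db
  namespace: team-a
spec:
  size: small
  region: us-east-1
  compositionSelector:
    matchLabels:
      provider: aws                # pick the AWS implementation
  writeConnectionSecretToRef:
    name: db-conn                  # where connection details are written
```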

Where to Start

Where do I start? How do we get to this inspiration? There’s two different options. One is pick a primitive. You can take a commonly used infrastructure primitive that you use in your architecture, maybe it’s a database we’ve been talking about, maybe it’s a Redis Cache, maybe it’s a network. Everyone has things that have to communicate over the network. If you identify the provider that supplies it, and go and install that provider, without actually moving to the abstraction mechanisms, you can create the instances of those types directly. You can create your RDS instance directly, or your Memcached instance directly, and include that in your Kubernetes manifest, and have that deployed and managed by Crossplane.

The next option is picking an abstraction. This is the more advanced case that we just walked through. You can choose a commonly used pattern in your architecture, you can author an XRD with the required configurable attributes. You can author compositions for the set of primitives that support it, or maybe multiple sets of primitives that support it. You can specify those dependencies. You can build and publish your configuration, and then install your configuration. Crossplane is going to make sure all those dependencies are present in the cluster. The last step then is just creating an instance, like that Postgres database we saw.
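
Installing the published configuration is then one more manifest, and Crossplane resolves the declared dependencies for you; the package reference here is a placeholder:

```yaml
# Install the configuration package; its provider dependencies are installed
# automatically if they aren't already present in the cluster.
apiVersion: pkg.crossplane.io/v1
kind: Configuration
metadata:
  name: platform-databases
spec:
  package: registry.example.com/platform/platform-databases:v0.1.0
```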

These are some examples of the abstractions that the community has created. Some of these are really great examples here. Platform-ref-multi-k8s, for example, allows you to go ahead and create a cluster that’s backed by either GKE, EKS, or AKS. There’s lots of different options already. I hope this serves as inspiration as well, because you can actually take these and publish these publicly for other folks to install, just like you would go and install your npm package. Maybe that’s not everyone’s favorite example, but install your favorite library from GitHub. You can start to actually install modules for building your control plane via Crossplane packages, and you can create high level abstractions on top of them. You can create different implementations, but you don’t have to reinvent the wheel every time you want to bring along an abstraction. A great example we’ve seen folks do with this is create things like a standard library for a cloud provider. For example, if there is some common configuration that is always used for AWS, someone can create an abstraction in front of an RDS instance that basically has sane defaults. Then folks who want to consume RDS instances without worrying about setting those defaults themselves can just have the options exposed to them that make sense.

Questions and Answers

Reisz: This is one of those projects that got me really excited. I always think about like Joe Beda, one of the founders of Kubernetes, talking about defining like a cloud operating model, self-service, elastic, and API driven. I think Crossplane proves that definition of building a platform on top of a platform. I think it’s so impressive. When people look at this, talk about the providers that are out there and talk about the community. Obviously, people that are getting started are dependent at least initially on the community that’s out there. Hopefully always. What’s that look like? How are you seeing people adopting this?

Mangum: The Crossplane community is made up of core Crossplane, which is the engine that provides the package manager and composition, which allows you to create those abstractions. Then from a provider perspective, those are all individual projects with their own maintainers, but they’re part of the Crossplane community. In practice, over time, there’s been a huge emphasis on the large cloud providers, as you might expect: AWS, GCP, Azure, Alibaba. Those are more mature providers. However, about five months ago or so, the Crossplane community came up with some new tooling called Terrajet, which essentially takes Terraform providers, for which there’s already a rich ecosystem, and generates Crossplane providers from the logic that they offer. That allowed us and the community to have a large increase in the coverage of all clouds. At this point, it’s, number one, very easy to find a provider that already exists with full coverage, and number two, if it doesn’t exist, to actually just generate your own. We’ve actually seen an explosion of folks coming in generating new providers, and in some cases putting their own spin on it to make it fit their use case, and then publishing that for other folks to use.

Reisz: You already got all these Terraform scripts that are out there. Is it a problem to be able to use those? It’s not Greenfield. We have to deal with what already exists. What’s that look like?

Mangum: With the cloud native ecosystem, and the cloud in general, having existed for quite a while, almost no one who comes to Crossplane is coming in with a greenfield thing. There’s always some legacy infrastructure. There are a couple of different migration patterns, and there are two really important concepts. One is just taking control of resources that already exist on a cloud provider. That’s fairly straightforward to do. Cloud providers always have a concept of identity of any entity in their information architecture. If you have a database, for instance, the cloud may allow you to name that deterministically, or it may be non-deterministic, where you create the thing and they give you back a unique identifier, which you use to check in on it. Crossplane uses a feature called the external-name, which is an annotation you put on all of your resources, or if you’re creating a fresh one, Crossplane will put it on there automatically. That basically says, this is the unique identifier of this resource on whatever cloud or whatever platform it actually exists on. As long as you have those unique identifiers, let’s say like a VPC ID or something like that, you could just create your resource with that external-name already there. The first thing the provider will do is go and say, does this already exist? If it does, it’ll just reconcile from that point forward. It won’t create something new. That’s how you can take control of resources on a cloud provider.
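
A sketch of importing an existing resource with that annotation, using a made-up VPC ID; the resource kind and fields are illustrative and depend on the provider:

```yaml
# Hypothetical import of an existing VPC: the external-name annotation tells
# the provider which cloud resource this object corresponds to, so it observes
# and reconciles that resource instead of creating a new one.
apiVersion: ec2.aws.crossplane.io/v1beta1
kind: VPC
metadata:
  name: imported-vpc
  annotations:
    crossplane.io/external-name: vpc-0123456789abcdef0   # existing VPC ID
spec:
  forProvider:
    region: us-east-1
    cidrBlock: 10.0.0.0/16        # should match the existing VPC's configuration
  providerConfigRef:
    name: default
```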

If you have Terraform scripts and that sort of thing, there are a couple of different migration patterns. A very simple one, but one we don’t recommend long term, is that we actually have a Terraform provider (provider-terraform), where you can essentially put your Terraform script into a Kubernetes manifest, and it’ll run Terraform for you, giving you that active reconciliation. There’s also some tooling to translate Terraform HCL into YAML, to be more native in Crossplane’s world.
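
As a rough sketch of that pattern, assuming the community provider-terraform (the exact API group and version vary by release, and the inline module here is a trivial placeholder):

```yaml
# Hypothetical Workspace wrapping an inline Terraform module; the provider
# runs the module and keeps reconciling its state.
apiVersion: tf.crossplane.io/v1alpha1
kind: Workspace
metadata:
  name: example-workspace
spec:
  forProvider:
    source: Inline
    module: |
      resource "random_id" "example" {
        byte_length = 4
      }
  providerConfigRef:
    name: default
```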

Reisz: What are you all thinking out there, putting these XRDs, putting these CRDs onto Kubernetes clusters to maintain on and off-cluster resources? I’m curious what questions everybody has out there in the audience. For me, the operator pattern is so popular with Kubernetes, and applying it on and off-cluster just seems so powerful to me. This is one of the reasons why I’m so excited about the project and the work that you all are doing.

Mangum: One of the benefits of that model is just the standardization on the Kubernetes API. One thing we like to point out, I’m sure folks are familiar with policy engines like Kyverno, or OPA, or that sort of thing, as well as like GitOps tooling, like Argo, or Flux, because everything is a Kubernetes manifest, you actually get to just integrate with all of those tools more or less for free. That’s a really great benefit of that standardization as well.

Reisz: Talk about security and state, particularly off-cluster. On-cluster is one thing, you have access to the Kubernetes API. You can understand what’s happening there. What about off-cluster? Is that like a lot of polling, what does that look like? How do you think about both managing state and also security when you’re off-cluster?

Mangum: Generally, we talk about security through configuration. Setting up security in a way where you’re creating a type system that doesn’t allow for a developer to access something they’re not supposed to. That can look like never actually giving developers access to AWS credentials, for example. You actually just give them Kubernetes access, kind of like that single access control plane I was talking about, and they interact through that. There’s never any direct interaction with AWS. The platform is in charge of interacting with AWS.

In terms of how we actually get state about the external resources and report on them, there is no restriction on how a provider can be implemented. In theory, let’s say that some cloud provider offered some eventing API on resources, you could write a provider that consumed that. Frequently, that’s not an option, so we do use polling. In that case, what we really try to do is offer the most granular configuration you need to poll at a rate that is within your API rate limits, and also matches the criticality of that resource. For instance, in a dev environment, you may say, I just want you to check in on it once a day. It’s not critical infrastructure, I don’t need you to be taking action that often. If it’s production, maybe you want it more frequently.

Reisz: What do you want people to leave this talk with?

Mangum: I really encourage folks to go in and try authoring their own API abstractions. Think about the interface that you would like to have. If you could design your own Heroku or your own layer-2 cloud provider, what would that look like to you? Look at the docs for inspiration, and then think about writing your own API, maybe mapping that to some managed resources from cloud providers. Then offering that to developers in your organization to consume.
