Article originally posted on InfoQ.
Transcript
Van Lohuizen: A little over 20 years ago, fresh out of university, I moved to the Bay Area to work at a startup where I continued working on natural language processing technology, the subject of my PhD. These were turbulent times, the dot-com crash was in full swing, which meant many closed establishments, empty office buildings, and parking lots. Traffic was even easy. The startup I joined would soon be a victim of this crash as well. These were also exciting times, Mac OS X 10.0 was just released, which was exciting for me as a NeXTSTEP fan. Also, the first Apple Stores were open. What was also exciting was the technology I worked on at the startup. As I later realized, we were essentially by today’s standards, in configuration heaven. All the words, grammars, ontologies that were maintained by this company were specified declaratively in this beautifully tailor-made system. It was worked on by multiple teams spread across universities and companies. Teams consisted of engineers and non-engineers, linguists. This was all at the scale that rivals the largest configurations I’ve seen at Google, my then future and now former employer. Later, I even realized that the properties of these configurations are actually strikingly similar to today’s cloud configuration. There didn’t seem to be any problem with this approach at all. It was fast, scalable, and manageable. It was even a joy to work with, and sometimes even with bouts of ecstasy. I don’t know about you, but I don’t know many people that speak of configuration that way nowadays.
2002 – 2005 (Google)
After the startup fell prey to the dot-com crash, I started at Google. I made some resource-hungry changes to the search engine for my starter project there. I needed bigger and more machines for testing. My best bet was to use a pool of machines for batch processing that were sitting idle, owned by my team. These machines were very different from the production machines, and I needed to configure everything from scratch. I even needed to adapt the code for these servers to be able to run on these machines at all. Also, I wasn’t a big fan of the existing production configuration setup, nobody was really. I thought this was a good opportunity to introduce the configuration techniques I learned from my previous company. What I ended up with was essentially the first Kubernetes-like system I’m aware of. The system rapidly became popular within Google, and was even, to my horror, used to launch some beta, but still production, services. This was not production ready. As a result of all this, though, I became part of the Borg team, which later inspired Kubernetes, with a goal to build something very similar but production ready. I was enthusiastic about configuration and wanted to replicate the experience I had at my previous company. Then, also, based on past experience at Google, I was told that the system would have to not allow complex code, and only allow at most one layer of overrides. I thought, that shouldn’t be too hard. The system I used to work with had no overrides at all and no code, so it should be easy. I ended up creating GCL, which is something like Jsonnet. Initially, it had some of those properties, but really, very soon, it didn’t. Needless to say, we did not end up in this configuration Nirvana at all. Clearly, we took some wrong turns along the way.
Background
At the end of 2005, I moved to work for Google, Switzerland. Here I did all sorts of things, including management, a 10-year stint on the Go team, and doing research as part of the SRE group into configuration related outages. During that time, I always kept an eye on configuration. It always bugged me that I wasn’t able to replicate that great experience in my previous line of work. In my mind, I was already pretty convinced where I made mistakes. Over the years, I saw validation that my suspicions were indeed correct. At some point, I started working on CUE to incorporate my lessons learned as well as those from others in the field. Since last year, I left Google and started working on CUE full time. My name is Marcel van Lohuizen.
The Need for Validation
Overall, I think there’s a lot of improvement to be gained in what we call configuration engineering. This is what we aim for with CUE. Why is it important at all? Who cares, one might ask? You don’t like how writing configuration is today, and you don’t like it as much as you used to. Life is hard, deal with it. Research from CloudTruth, and others as well, shows that for many companies, over 50% of outages are configuration related. We see that configuration at scale is often cracking at the seams. Where did we go wrong? There are actually many lessons to be learned. I want to go in-depth on a few of those. One of those is the lack of validation, or testing, of configuration. Here’s a quote from my former colleague Jaana, stressing the importance as well.
Let’s dive into an example to show the need for validation. We have a service that we want to run in both staging and production. We have two self-contained YAML files that define how to run them on Kubernetes. They’re defined as a deployment, which is a Kubernetes concept of how to run on a server. In this example, we only want to vary the number of replicas between the two environments. Everything else, which can actually be quite a bit, is not shown here, and it’s identical between these two files. Updating common values between these two files can get old very quickly. As we know from both software engineering and database design, redundancy, aside from being tedious, is also error prone. How can we fix this? A common pattern is to have a base file which contains all common fields, and then to have the two specific environments derived from that. There are various ways to accomplish this. Here we see a somewhat more detailed, but still greatly simplified, deployment. As a reminder, we only care about replicas here. One approach is to templatize the base file using variables. This can work quite well. There are some limitations in scalability, and the variable pattern may suffer from the typical issues associated with parameterization. There are some good publications on that topic, but it certainly can work for moderate setups.
For our example, though, we want to focus on another common approach: overrides. Examples of override approaches are those used by kustomize, GCL, and Jsonnet. For this example, we’ll use kustomize. Kustomize gives you a way to customize base configurations into environment-specific versions using only YAML files. There are two kinds of YAML files: those that follow the structure of the actual configuration, and the metafiles that explain how to combine these files together. This is a typical kustomize directory layout. The base directory contains all the configurations on which our environment-specific kustomizations are based. The kustomization YAML file defines which files are part of that configuration. Then there is the overlays directory, which contains a subdirectory for each of the environment-specific tailorings.
Let’s see what that looks like. At the top, you see the content files for base and production, while at the bottom you see the corresponding metafiles describing how these files are combined. Notice that the metafiles only describe how to combine files, not the specific objects. Kustomize is specifically designed for Kubernetes and uses knowledge of the meaning of fields to know how to combine objects. This is why you see that some key information about the deployment still needs to be repeated in the patch file. The result of applying the patch is the base file where replicas is modified from 1 to 6. This is really inheritance. File overlays are just another form of override inheritance. In that sense, it’s no different from the inheritance that is used in languages like Jsonnet and GCL. A problem with inheritance is that it may be very hard to figure out where values are coming from. As long as you keep it limited to one level, things are usually ok.
So far, so good. Now let’s look at another example. Let’s assume that the SRE team introduced a requirement that all production services have Prometheus scraping enabled. This could be a requirement for health checks as well, for example, which might make more sense. This is a very simple example. For the sake of simplicity, let’s stick with this. Someone in a team enforced this requirement in the base deployment so that it will be enabled in every environment automatically. You could also imagine that there was another base layer or another layer that provides the default for all deployments within a group or setup, not just for frontends. Now later, somebody else in a team explicitly turns Prometheus scraping off in the prod file. As before, this can be done by overriding the value like this. As we said before, this was a requirement. Clearly, this should not be allowed. Under what circumstances could you expect this to happen? Somebody could have added this to debug something and forgot to take it out. We all have fiddled with configurations or code for that matter to try to get things working, so this is not too unthinkable. Another reason could be that a user was simply relying on a tooling to catch errors. As you see, there’s no formal way to distinguish between a default value or a requirement. This can easily be overlooked. Another more sneaky way how this could happen is that somebody already put this Prometheus scrape false in before it became a requirement. This could happen, for example, when somebody turned it on and then turned it off, or maybe there was a default just before without it being a requirement. Then, it was made a requirement later down the line. You can imagine how this failure will go unnoticed then. Really, what’s missing here is a formal way of enforcing the soundness of a configuration.
What Else May We Want to Enforce?
We have shown a simple example, but the types of things we may want to enforce are really myriad. For instance, one could require that images must be from a specific internal container registry. We could set a maximum number of replicas so that resource usage will not get out of hand. Containers might be required to implement a health check, or we might want to enforce particular use of labels, or implement some API limitations, check quota limits, or authorization. One could, of course, move all these enforcements into the system that consumes the configuration, so wherever we pass the configuration, and reject it if it’s incorrect. That’s certainly more secure, and we should do that anyway. There’s no way for users to override it in that case. Even if we do that, we still want to know about failures ahead of deployment. Failing fast and early is always a good idea. We’ve learned the hard way that testing or validation is as important for data and configuration as it is for code. In the majority of cases where I’ve seen rigorous testing of data and configuration introduced, a whole host of errors was uncovered. What is testing in this context? You could think of unit tests, writing tests for your configuration as actual code. Another approach is assertions. GCL and Jsonnet do this, for instance. Really, any contract that checks invariants would do. Any method that accomplishes that would work. One problem with both unit tests and assertions is that we’ll end up with lots of duplication. In the example above, it would mean repeating the scraping requirement in a separate test or assertion.
CUE Crash Course
This is where CUE comes in. Let’s take a look at how CUE would solve the problem. CUE provides a data, schema, validation, and templating language that provides an alternative to override-based approaches. The approach that CUE is based on recognized already 30 years ago that there is a great overlap between types, templating, and validation. In a moment, I will return to how to use that to solve the above issue. Before I give CUE solutions to this problem, let me do a quick five-slide crash course on CUE. The CUE language is all of these: a data language like JSON, a schema validation language like JSON Schema, and a templating language like HCL. Let’s see what that looks like. As a data language, CUE is strictly an extension of JSON. Think of JSON with syntactic sugar. You can drop quotes around field names, in most cases. You can drop the outer curly braces and trailing commas. There are more human-readable forms of numbers. There may appear to be some similarities to YAML. A big distinction with YAML, though, is that CUE is not sensitive to indentation, making copying and pasting of CUE a lot easier, and especially a lot less error prone. CUE also allows defining schema. Here we see a schema defined in Go, and its equivalent in CUE. In CUE, schemas look very much like data where the usual string and number literals are replaced by type names. This already hints at an important concept in CUE: types are treated just like values. CUE also allows validation expressions, where validations are, like types, just values. Let’s take a look at this JSON Schema. Field ID is just defined as a string. Field arch is an enum which can only be one of the two shown values. For RAM, we define that a machine in this data center must have at least 16 gigabytes of RAM. Similarly, we also require that all disks must be at least 1 terabyte in size. You can already see one benefit of treating types and validations as values, namely that the resulting notation is quite compact.
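The slides are not part of this transcript, so as a rough sketch of the machine schema just described: the definition name `#Machine`, the two `arch` values, and the sample data are assumptions made for illustration only.

```cue
// Schema: written like data, with types in place of concrete values.
// Validation constraints are values too and combine with types using &.
#Machine: {
	id:   string
	arch: "amd64" | "arm64" // enum: one of two (hypothetical) values
	ram:  int & >=16Gi      // at least 16 GiB of RAM
	disks: [...{size: int & >=1T}] // every disk at least 1 TB
}

// Data: JSON minus some punctuation, with readable number suffixes.
machine: #Machine & {
	id:   "m-0042"
	arch: "amd64"
	ram:  32Gi
	disks: [{size: 2T}, {size: 4T}]
}
```

Changing `ram` to `8Gi`, or `arch` to a value outside the enum, would make the unification with `#Machine` fail, which is the validation behavior the talk describes.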
CUE can also be used for templating. Take a look at this HCL, for example. Just like HCL, CUE has references and expressions, key elements of templating. CUE doesn’t have the notion of variables per se, but any field can be made a variable by adding the tag annotation. This is really a tooling feature, not a language feature. CUE has a very rich set of tooling built around it to make use of these kinds of things. If you want to put constraints on such variables, you just use the validation constraints we saw before on the value itself. More specifically for templating, CUE also supports defaults. Anytime you have an enum in CUE, you can mark a default with an asterisk. Defaults are really a kind of override mechanism. It’s the only override mechanism that CUE allows. It’s CUE’s answer to boilerplate removal. It’s a very powerful construct, actually, ensuring that the depth of inheritance stays limited while making boilerplate removal quite granular and effective. Comments are first-class citizens in the CUE API. It is common to note special case comments if you otherwise don’t need to. This is used in API generation, one of CUE’s capabilities. In summary, CUE uses the same structure for data, schema, validation, and templating. There’s a lot of overlap between these, and combining them in one framework is a very powerful notion. Another important thing to know about CUE is that it can compose arbitrary pieces of configuration about the same object in an order-independent way, meaning that the result is always the same, no matter in which order you combine them. This in and of itself is a key factor in how CUE gets a grip on configuration complexity. Override approaches do not have this property, even when just using file-based patches.
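A small sketch of these templating features, under the assumption of a made-up `service` object and `example.com` host names (none of these names come from the talk):

```cue
// Any field can be made an injectable variable with a @tag attribute,
// for example: cue export -t env=prod
// The disjunction doubles as validation; the asterisk marks the default.
env: *"dev" | "staging" | "prod" @tag(env)

service: {
	name: "frontend"
	// References and interpolation: derived from other fields.
	hostname: "\(name).\(env).example.com"
	// Defaults apply only when nothing more specific is given.
	protocol: *"http" | "grpc"
	replicas: *1 | int
}

// A second declaration about the same object. It is unified with the
// one above, and the result is the same whichever comes first.
service: replicas: 6
```

Here `replicas` resolves to 6 because the default of 1 is dropped once a more specific value is supplied. A plain `replicas: 1` without the asterisk would instead conflict with 6.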
Demo of Redoing the Example in CUE
Now back to our example. Here we see a possible CUE layout for the same setup as above. At the top we see a cue.mod directory. Much like the Go language, CUE can treat an entire directory structure in a holistic way with predictable and default build behavior. This makes it easy for configurations to be treated dramatically within a context. This is what a module file would look like. It really just declares a unique identifier for the module, not unlike Go. The cue.mod directory more or less serves the same purpose as the kustomization YAML files in kustomize, and defines how to combine files. If the module file is all it takes to specify how to combine files, how does CUE know how to combine all the objects? If it’s not based on the file name, and as CUE is not Kubernetes specific, it’s also not based on the fields, so how do we do that? One part of the answer is to rely on the directory structure itself. All files in a directory and its parent directories within a module that are declared within the same package are automatically merged. All CUE files in prod are automatically merged with all files in base, for instance, as long as they have the same package name declared. That doesn’t quite answer everything. In the kustomize setup, different files describe different objects at the same level. Kustomize combines them by matching object types and names based on Kubernetes-specific fields. CUE is not Kubernetes aware, so how does it know how to combine them? We said that CUE can combine configuration aspects about the same object in any order. All we need to know, really, is which configuration aspects belong to which object. We do this by assigning a unique address or path for each distinct object within a namespace. Think of it like a RESTful path. Here we see an example of such a possible path structure and a specific instance from our frontend deployment.
A big advantage of declaring all objects in a single namespace is that we can define validation as well as boilerplate removal that spans multiple objects, such as automatically deriving a Kubernetes service from a deployment, for example.
What does this all look like? Let’s start with our production tailoring of the frontend deployment. Here, the first line represents the package name I mentioned earlier, which is used to know how to combine files. This will be included in all the files that we’ll show. The second line is the adjustment, which just sets the number of replicas to 6. Let’s compare it to our original kustomization example. You can see it contains roughly the same information. One could notice, though, that many of the fields have been omitted. This is possible because this information is now included in the path, and the path uniquely describes that object. The identifying fields are therefore no longer needed, as they’re already specified in the base template. Let’s now take a look at the base template. The base applies to both prod and dev, which is reflected in the path. We really mean any environment here, though, so usually we write this using the any symbol. All the fields are mixed in automatically with the prod file that we saw earlier, causing the frontend deployment to be completely defined here. There’s an interesting difference to note compared to the original kustomize template. You can see that it’s quite similar in structure. One noticeable difference, though, is that we no longer set replicas to 1. This is because CUE doesn’t allow overrides, so setting it to 1 would conflict with the value of 6 in the prod file and just cause an error. We do set it to int, though, to indicate that we at least expect a value. There’s really no need to set a default value here, as all concrete instances already specify replicas explicitly. Also, it’s often good not to have a default value specified, to force users to think about what value is really appropriate. However, if one really wishes to set a default, we can do so using the asterisk approach and enums as we saw before, so you can see that here.
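As a sketch of the base template and the production tailoring described above: the package name `kube`, the path layout `env.<environment>.deployment.<name>`, the image name, and the file names in the comments are all assumptions, since the slides are not reproduced here. In a real layout each file would start with the same `package` clause; the declarations are shown together for brevity.

```cue
package kube

// Would live in a base file. The pattern [string] plays the role of
// the "any" symbol, so this applies to every environment.
env: [string]: deployment: frontend: {
	apiVersion: "apps/v1"
	kind:       "Deployment"
	metadata: name: "frontend"
	spec: {
		// No override possible, so no concrete value here:
		// int only states that a value is expected.
		replicas: int
		template: spec: containers: [{
			name:  "frontend"
			image: "registry.example.com/frontend:v1" // hypothetical image
		}]
	}
}

// Would live in a prod file: the path identifies the object, so none
// of the identifying fields need to be repeated.
env: prod: deployment: frontend: spec: replicas: 6

// Would live in a dev file.
env: dev: deployment: frontend: spec: replicas: 1
```

To offer a default instead, the base could declare `replicas: *1 | int`. Declaring a plain `replicas: 1` in the base would conflict with the 6 in prod and produce an error.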
Setting Our Scraping Requirement
We have replicated our original kustomize setup. How do we now introduce our scraping requirement? To show the flexibility of CUE, we define this rule in a separate file in the top directory named monitoring. It follows the same approach as before: we specify a path to which the configuration aspect belongs, along with the desired tailoring. Because true is not a default value here, it just becomes a requirement imposed on the frontend job in any environment. If a user wanted to set this to false in any of the environments, they would first have to modify this file. Note that this is not unlike how this would work if you had unit tests. There, too, you would have to change the test first to make it work. The key difference here, though, is that this rule functions as both the template and the requirement, so you don’t need to write a unit test anymore. There’s no duplication, but all the convenience and safety are there. Now, if you wanted to be a bit more lenient, and say, only require this setting for prod but not for other environments, we could write this as shown here. Here, for any environment, the scraping value is defined as either the string true, which is the default, or the string false. This has the additional benefit of validating that the value is actually either the string true or false, and that anything else is an error, for example, Boolean true or false. One could easily imagine that a user would inadvertently write this as a Boolean instead of a string. This is another good example of how validation and templating overlap. Also note that the second rule also applies to prod. That’s fine. The only value that is allowed by both is true, and the second rule simply has no effect for prod. In general, a nice property of CUE is that you can determine that Prometheus scraping must be true by just looking at the first rule.
No other rule can ever change this, so you don’t even have to look at the production deployment file to check for it. Based on experience, this is actually an immensely helpful property for making configurations more readable and reliable. We could also easily generalize this rule beyond our frontend job. All we need to do, again, is to replace the frontend field with the any operator we saw before.
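The lenient variant of the scraping rule might look roughly like the following. The annotation location, the path layout, and the package name are assumptions for illustration, not taken from the slides.

```cue
package kube

// Requirement for prod: a plain value with no default marker, so no
// other declaration can change it.
env: prod: deployment: frontend: spec: template: metadata: annotations: {
	"prometheus.io/scrape": "true"
}

// Any environment: defaults to "true", may be set to "false", and
// rejects everything else, including the Booleans true and false.
env: [string]: deployment: frontend: spec: template: metadata: annotations: {
	"prometheus.io/scrape": *"true" | "false"
}
```

Replacing `prod` with `[string]` in the first rule gives the strict version that applies to every environment, and replacing `frontend` with `[string]` extends either rule to all deployments. With the first rule in place, a prod file that sets the annotation to `"false"` fails with a conflict instead of silently overriding the requirement.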
What Is CUE?
What is CUE? CUE is not just a language, but also has a rich set of tooling and APIs to enable a configuration ecosystem. It’s really not specific to any application, but rather aims to be a configuration Swiss Army Knife, allowing conversions to and from, and composition of, different configuration formats, refactoring configuration, and integrating configuration into workflows, pipelines, and scripting. It’s designed really with next-gen GitOps in mind. CUE itself is generic and not specific to Kubernetes. There are projects, tools, and products in the CUE ecosystem, like KubeVela, a CNCF open source project, that build on top of CUE and add domain-specific logic. CUE itself is application agnostic. That said, CUE has some tools to make it more aware of a certain context, like Kubernetes in this case. Let’s take a look at how that might work. All we need to do to make it more aware of Kubernetes, really, is to import schemas for Kubernetes defined elsewhere, and assign them to the appropriate paths. For instance, here we say that all paths that correspond to a deployment are of a specific deployment type. From that moment on, all deployments will be typed as expected. Now if you type the number of replicas as a string, for example, or even a fractional number instead of an integer, CUE will report an error. Where does the schema come from? You don’t need to write it by hand, in most cases. It may almost seem a little bit like magic, but you can get it from running the shown command. How does this work? The source of truth for Kubernetes schema is Go. CUE knows how to extract schema from Go code. That’s really all there is to it. In the example we showed, we had a single configuration that spanned an entire directory tree. Modules and packages can also be used to break the configuration up into different parts linked by imports. This gives really a lot of flexibility in how you want to organize things with CUE.
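The command on the slide is not reproduced in this transcript. As a sketch, assuming the schema was extracted from the Kubernetes Go types with something like `cue get go k8s.io/api/apps/v1` (the approach used in the CUE Kubernetes tutorial), and assuming the same hypothetical path layout as before, assigning the type could look like this:

```cue
package kube

// Assumes definitions generated from the Go package k8s.io/api/apps/v1
// are available inside the module.
import apps "k8s.io/api/apps/v1"

// Every object at a deployment path must be a valid Kubernetes
// Deployment; the path layout itself is an assumption.
env: [string]: deployment: [string]: apps.#Deployment
```

With this in place, a `replicas` value such as `"6"` or `1.5` no longer unifies with the integer type in the imported schema, so CUE reports an error.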
What Really Causes Outages Nowadays?
We’ve seen some of the benefits of validating configuration. Is that really all there is to preventing outages? Far from it. Earlier, we mentioned that research shows that for many companies, more than 50% of outages are related to configuration. This concurs with my experience. Is this really all caused by the simple validation-related rules of a single system that we showed before? Indeed, I’ve seen many outages actually related to such a failure. The more mature a company becomes, the less likely that will be the case. On the other hand, the more a company matures, the more configuration also tends to grow in complexity. As a result, this 50% figure seems to hold up over time, even as the simple cases get almost completely eliminated. A clue to this is shown by this outage reported by Google. I would indeed classify this as a configuration-related outage, just not of the simple kind that we addressed before. There are a handful of very common patterns that one can observe in configuration-related outages. One of them is an application defining a configuration that’s valid in principle, but violates some more specific rules or policy of a system to which its configuration is communicated upstream. If these specific rules are not known and tested against at the time of deployment, a launch of such a system can fail in production, and often unnecessarily so. This is a case of not failing early due to a lack of sharing. You want to fail early, as we mentioned earlier.
This is one of the things that went wrong in that outage. Sharing configuration and validation rules or policy, and using them pre-deployment, can greatly help in these cases. How does one do that? What is configuration even, really? To answer this, let’s see where configuration lives in this very simple service. Here we have a single Pong Service that listens to ping requests and replies with pong if the request is authorized. It also logs requests to a Spanner database. The low-level infrastructure is set up by Terraform, in this case, and authorization requests are checked by OPA. Can you spot the configuration? Let’s see. The most obvious one, perhaps, since this track focuses on infrastructure as code, is the Terraform configuration. In this case, it’s used to deploy the VM and the Spanner database. Also, our server operates based on settings. Here we show a JSON configuration file, but really command line flags and environment variables are all configuration artifacts. Thirdly, we have a schema definition of the database. Why do we call this configuration? Really, a database table definition is also configuration. You can see here that the database tables really can be a combination of schema and constraints or validation. In other words, the database schema defines a contract of how the database can be used. This is really configuration in our view. As a litmus test, you can see the translation of the schema to CUE on the right-hand side. You see that it’s mostly a schema, but has some validation rules associated with it as well.
Let’s continue with search for configuration. We already mentioned that the Pong Server needs to be configured. Also, data types within the code that are related to communication with other components can be seen as configuration. Let’s look at one of these types here, audits, for example. You can see there’s redundancy with the database schema defined earlier. It’s basically the same schema, but in Go, it drops many of the constraints that were defined in the database schema before. It only partially encodes the contract with the database. None of these constraints are really included in the Go code, so this can result in runtime errors that could have been prevented pre-deployment. This is a nice example of that. It’s like using a dynamically typed language without unit tests.
You will only discover such errors when things are running. Let’s continue our search. Our Pong Service is friendly enough to publish an OpenAPI spec of its interface. Really, this is configuration as well. It does overlap with other parts of the system, for instance, regarding what types of requests are allowed. We’re not done yet. We haven’t touched our authorization server yet. Aside from the configuration that is needed to launch that server, the policies that it executes and checks are configuration as well.
Let’s take a look. Here we have a very simple Rego policy that specifies that only GET methods are allowed. This is not a restriction of the system per se, but rather an additional restriction enforced by this policy. As this is a static policy, there’s really no reason not to include this restriction in the OpenAPI spec published by the Pong Server. Indeed, it does. The problem is, though, that in the current setup, it is maintained manually. This is error prone. On the right-hand side, you see a possible equivalent in CUE of the Rego on the left-hand side. We’ve taken a bit of a different approach here with CUE. We could have used a Boolean check, but we don’t. We’re making use of the fact that CUE is a constraint-based language here. Rather than defining allow as a Boolean, we compose it with the input, where a successful composition means allow, and a failure means deny. Here, it doesn’t make much of a difference. For larger policies, specifying the policy in terms of constraints this way tends to be quite compact and readable when done in CUE. We see an important role for CUE in policy for this reason.
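The constraint-style policy could be sketched as follows. The names `#Allow` and `request` and the `path` field are assumptions for illustration; the slide’s actual CUE is not part of this transcript.

```cue
// The policy is a constraint on the input rather than a Boolean rule.
#Allow: {
	method: "GET" // only GET requests are permitted
	... // other request fields are left unconstrained
}

// Evaluating a request means unifying it with the policy: success
// means allow, a conflict means deny.
request: #Allow & {
	method: "GET"
	path:   "/ping"
}
```

A request with `method: "POST"` would fail to unify with `#Allow`, which corresponds to a deny. Because the policy is ordinary CUE, the same constraint could also be reused when generating the published OpenAPI spec instead of being maintained by hand.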
The CUE Project
As we have hopefully shown, configuration is everywhere. Most of you will even carry some in your pocket, like your settings on your smartphone are configuration. We can see a lot of overlap and redundancy in the configuration of the Pong Server. This is a small server, but things really don’t get any better for larger services. Can we address that with CUE? The CUE approach is to consolidate all configuration, removing all redundancy and generating everything necessary from a single source of truth. Using CUE like this ensures that all contracts are known throughout the system as much as possible. Eliminating redundancy also increases consistency. All this helps to fail early and prevent outages. This tweet from Kelsey captures nicely what is going on. We need clear contracts between components and we need visibility of contracts, configuration, and state even throughout pipelines. What we’ve also shown is that this is not exclusive to infrastructure, this goes beyond infrastructure.
Here’s another quote from the Google Cloud website. It specifically emphasizes that contracts are often lost in code; dealing with configuration calls for a declarative approach, really. This is exactly what CUE is about. We talked a bit about what CUE is, but let me share a bit about where the CUE project is at. A key part of CUE is pinning down the precise meaning of configuration. This allows it to define adapters for accurate representations of a number of formats. The ability to morph any configuration into different formats really makes CUE great for GitOps-style deployment. The set of adapters is certainly not complete, as you can see here, but things are moving pretty fast. A lot can really already be done with what exists. For example, CUE’s own CI runs on GitHub Actions. The way we do that is we define our workflows in CUE, making use of templating and other CUE features. Then we import publicly available JSON Schema definitions for GitHub workflows to validate these workflows. We then export YAML and feed it to GitHub. We also have a lot of users already. Here’s a small and by no means complete selection of companies, projects, and products. Some of these are using CUE as a basis for systems or tools they are building, not unlike what we saw in the demo. Some of these are actually exposing CUE to their users as a frontend, or are leveraging the composable nature of CUE as well as the rich CUE toolset that is available. We’re just getting started. Aside from me, we have Paul, Roger, Daniel, and Aram, who all have strong backgrounds in the Go community, working on CUE development. Carmen, also ex-Google and ex-Go team, oversees a redo of our documentation and learning experience, as well as user research, among other things. Dominik is responsible for project ops and also user research.
Conclusion
To reach five nines reliability, we need to get a handle on configuration. We believe this is done by taking a holistic approach to configuration. This is not an easy task, by any means, but this is the goal we’ve set out to achieve. Configuration has become the number one complexity problem to solve in infrastructure. We need a holistic approach that goes beyond just the configuration languages used. We need tooling, APIs, and adapters that enable an ecosystem of composable and interoperable solutions. We believe that CUE will be able to support such a rich configuration ecosystem, and that it will reduce outages and increase developer productivity, while making configuration delightful.
Questions and Answers
Andoh: Why don’t you just use a general-purpose programming language for configuration?
Van Lohuizen: The general structure of configuration, especially as it gets a bit larger, and that actually happens quite quickly, is that you have a lot of regularity in all the variations of configuration, but there are a lot of irregular exceptions within it. As long as you don’t have those exceptions, as long as you have a lot of regularity, then it’s quite easy to write a few for loops and generate all the variations of this configuration. If you do have them, then expressing that in code is very verbose and very hard to maintain. Whereas with a more declarative, logic programming-based approach, it actually becomes much more manageable in that case. That’s really the main reason. You could say that up to medium scale, programming languages work fine, but it’s really for the larger configurations where it starts to break down, generally speaking.
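As a rough illustration of the point about regularity and exceptions (not taken from the talk), here is a minimal Python sketch. The region names, field names, and values are all hypothetical. The regular part fits in one loop, while every irregular exception becomes another branch inside the generator.

```python
# Hypothetical regions and settings, for illustration only.
REGIONS = ["us-east", "us-west", "eu-west", "asia-east"]


def generate(region):
    """Generate the configuration for one region."""
    # Regular part: every region gets the same shape and values.
    cfg = {"name": f"pong-{region}", "replicas": 3, "tls": True}

    # Irregular exceptions: each one adds a branch to the generator,
    # and the final value of a field can no longer be read off in one place.
    if region == "eu-west":
        cfg["replicas"] = 5
    if region == "asia-east":
        cfg["replicas"] = 2
        cfg["tls"] = False
    return cfg


configs = {region: generate(region) for region in REGIONS}
```

With two exceptions this is still readable. The difficulty described above shows up when the branches outnumber the regular lines and start to interact with each other.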
Andoh: How does CUE enable better testing and validation?
Van Lohuizen: Of course, CUE is a constraint-based language, so it’s fairly easy to define assertions in CUE, as restrictions on your data. Really, a key part of what makes it so powerful, though, is that the same mechanism you use for testing and validation, you can also use for templating and generation. For instance, suppose you have a field that says the number of replicas should always be 10. You can use it as a templating feature: if you don’t specify replicas, the value 10 is automatically inserted in your configuration. You can also see it as validation: if the user specifies 10, it’s fine, but if the user specifies 9, these two things clash. Validation and templating are really two sides of the same coin. This makes things a lot easier. Yext is a company that has done this, for example. Just as I mentioned that configuration is where you need to create a lot of small variations in setups, this is often also the case with test sets. What we’ve seen people do is actually use CUE itself to generate test sets that cover a whole variety of cases. CUE as a language not only makes it easier to test, it also makes it easier, just as with configuration, to generate test data.
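The replicas example can be sketched in a few lines of Python. This is a heavily simplified stand-in for unification, not CUE’s actual implementation: it handles only nested dictionaries and concrete values, and the names `unify`, `Conflict`, and `policy` are made up for this illustration.

```python
class Conflict(ValueError):
    """Raised when two values for the same field cannot both hold."""


def unify(a, b, path=""):
    """Combine two configuration values; the argument order does not matter."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = {}
        for key in sorted(a.keys() | b.keys()):
            child = f"{path}.{key}" if path else key
            if key in a and key in b:
                # Both sides say something about this field: they must agree.
                merged[key] = unify(a[key], b[key], child)
            else:
                # Only one side mentions the field: its value is filled in.
                merged[key] = a[key] if key in a else b[key]
        return merged
    if type(a) is type(b) and a == b:
        return a
    raise Conflict(f"{path}: conflicting values {a!r} and {b!r}")


# Assumed constraint: the number of replicas should always be 10.
policy = {"replicas": 10}

# Templating: the user omits replicas, so 10 is inserted.
filled = unify(policy, {"name": "pong"})

# Validation: the user specifies 10, which is fine.
accepted = unify(policy, {"name": "pong", "replicas": 10})

# Validation: the user specifies 9, which clashes with the constraint.
try:
    unify(policy, {"name": "pong", "replicas": 9})
    clash = None
except Conflict as err:
    clash = str(err)
```

The same `unify` call does both jobs. Whether it acts as a template or as a check depends only on what the user left out or filled in.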
I also think using a programming language for config creates a big temptation to start embedding complex logic into config files. I think that’s not a good pattern, and programming languages may give you too much rope to hang yourself.
Van Lohuizen: That’s absolutely true. This was one of the design points for GCL in the early days, that basically a configuration language should really not do computation. If you really need any computation at the configuration layer, you should shove it to the server. Especially back in the days at Google, we could do that because we had control over the entire internal ecosystem. Even there, in practice, that would actually fail, because even though it’s one company controlling all of that, there are still different teams. If one team wants to configure it in some different way, the other team that’s controlling the binary might not just want to add that logic into their binary. Plus, there are different release schedules, and for all kinds of practical reasons it doesn’t work out. It’s inevitable that you will have some computation at the configuration layer. What you often see happening is that these DSLs then ultimately evolve into, basically, general-purpose programming languages, which, of course, are very hard to use, and it becomes a complete mess. The way we try to fix that in CUE is that the composable nature of CUE also allows you to combine externally computed data into CUE. CUE has a scripting layer where you can basically alternate between CUE evaluation and shelled-out computation. What that allows you to do is to basically take configuration in CUE, get some values, and shell out to some other computation, like some binary or something else. We’re working on a Wasm extension as well. Then you take that data and insert it again in the declarative configuration. That does make it a little bit impure. At least what it allows you to do is to truly separate the computation that needs to be in the configuration layer, use a general-purpose language for that, and unit test it. Then anything that needs to be modified quickly and easily, that is easy to read and can be expressed as data, you can keep in CUE itself. You can compare it a little bit to spreadsheets.
I often say that CUE is a spreadsheet for data, so mostly you are just specifying numbers, you can specify some validation rules. If you really need to do some computation, you use these functions in Google Sheets, or Excel, or whatever that you can program in Visual Basic, or JavaScript, or what have you. Really, you keep the code separate from the configuration. I think that’s a good compromise for these cases where the computation really needs to live in the configuration layer.
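The alternation between declarative data and shelled-out computation can be sketched as follows. This is not CUE’s scripting layer or its API, only a Python stand-in for the idea. The field names, the `shell_out` helper, and the external one-liner are hypothetical.

```python
import json
import subprocess
import sys

# Declarative part: plain data that is easy to read and edit
# (hypothetical fields).
config = {"service": "pong", "port": 8080, "replicas": 3}

# Computation kept outside the configuration. Here it is a tiny external
# program that reads JSON on stdin and prints a derived value on stdout.
EXTERNAL = "import json, sys; c = json.load(sys.stdin); print(c['port'] + 1000)"


def shell_out(values):
    """Run the external computation on selected configuration values."""
    done = subprocess.run(
        [sys.executable, "-c", EXTERNAL],
        input=json.dumps(values),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(done.stdout)


# Step 1: take some values from the configuration.
# Step 2: shell out to the external computation.
# Step 3: insert the result back into the configuration as plain data.
result = dict(config, metrics_port=shell_out({"port": config["port"]}))
```

The external program can be written and unit tested in any general-purpose language, while the configuration itself stays pure data.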
Andoh: Another thing that you said when I asked about why not just a general-purpose programming language was that after a certain scale, you want a configuration language. Does that mean that CUE is really only best for large-scale configurations? What about small ones?
Van Lohuizen: Even though it’s designed for very large scale, and our experience is with extremely large configurations, we still need to do some performance improvements to make that work in the general case. We also recognize that, generally speaking, configurations start very small. It should always be the goal to keep configuration small. We wanted to have something very simple that really already works from the very beginning. Think of it a little bit like Go, for example, which is a language that you can use for very simple programs. It’s for quick development, but actually, it scales fairly well to larger systems. One of the design principles there is to really make it data driven, and make it look like JSON, so as to make it look as familiar as possible. Then if you know how to write JSON, you can already start using CUE, and then you can start using syntactic sugar and grow into the language. That was a big part of it. Also, regarding the scale at which we are seeing these problems: even though CUE works at very large scale, you can often see the problems coming even with fairly small configurations. If we’re talking about hundreds of lines, you can already see these problems occurring. Sure, with tens of thousands or hundreds of thousands of lines, you’re almost guaranteed to run into them. It can happen at smaller scales, too.
Andoh: In one of your slides, we saw four different tools being used to do all the work of CUE, and I think I looked at the logos and they were JSON Schema, OpenAPI, YAML, and JSON. Then you talked about how CUE can also do schemas, validation, and things like that. Since this talk is about going beyond infrastructure, can you talk about what CUE can do beyond infrastructure?
Van Lohuizen: We’ve had some really unexpected use cases for CUE. First of all, people come to us asking, do you know you’re solving the composable workflow problem? You see that many of the uses of CUE are actually going in that direction, where full CI/CD pipelines and things like that are being defined in CUE. The same goes for artificial intelligence and ML pipelines: how do you compose the results? How do you set it up? It’s a very similar problem if you think about it. Also, just lower-level data validation. We have companies that are managing their documentation in CUE, which, if you think about it, is also a configuration problem. We’re seeing it branching out at all these different levels. That’s more horizontal, to some extent. Also, if you look at the layering of configuration, it’s not just data and types, but also policy. It’s quite hard to specify policies well. I think it can only be done in logic programming, like formulas. That’s why you see the success of Rego, and all these sorts of things. They’re all based on that principle. CUE actually was designed as a reaction to Prolog- and Datalog-like approaches, which are not very easy to understand for a lot of people. Originally, the biggest users writing in CUE’s predecessors were not software engineers. We also think that this is quite a good tool. If you have to use logic programming, this is quite an approachable way to start going into the policy realm, and all these things. That’s why we’ve seen that much demand. It’s really nice to have one tool and one way of specifying all these different things. We think that is quite useful.
Andoh: Where can you learn more and get involved in CUE and the CUE community?
Van Lohuizen: There’s a website called cuelang.org. You can find links to the community there. We have a very active Slack community. There’s also GitHub Discussions for more Stack Overflow-like questions that will just stay there, and where people can get help. That’s a good place to start. We’re working on new documentation that might make it a little bit easier to read. Most of the documentation we have was really written more for language designers, and not yet to get people started. We’re working on getting that going. We think CUE is quite simple, actually, but if you read the documentation, it might not look like that.