Month: June 2023


DynamoDB is a managed NoSQL database service offered by Amazon Web Services (AWS). It has become a popular choice for highly available, scalable, and reliable data storage in cloud infrastructures. As a fully managed service, it eliminates the need for AWS customers to manage their own database servers and provides considerable flexibility in data access, storage, and management.
What is DynamoDB?
DynamoDB is a popular NoSQL database offering from AWS that provides high availability and scalability. It eliminates the need for manual scaling, managing, and monitoring of data infrastructure, and provides an efficient, reliable data storage service in a managed environment. DynamoDB can handle very large datasets, letting users store and retrieve any amount of data with automatic capacity provisioning and consistent performance.
The platform is designed to serve adaptable workloads, and its flexibility to handle varying data demands makes it ideal for real-time applications. DynamoDB supports JSON-based document storage and semi-structured data models. It enables users to access and manage their data across multiple regions globally and provides secure and fast access to data with advanced features like encryption, backup, and restore capabilities.
DynamoDB Architectural Overview
The DynamoDB architecture allows for partitioned scaling, load balancing, and multi-region support. Items in a table are automatically distributed across multiple partitions, and the partitions are spread uniformly to balance storage operations and cache utilization. The platform uses a leader-follower replication configuration, with writes directed to a leader node that replicates the data to its follower nodes.
Write propagation covers both local and global scenarios: writes are committed on local nodes before being distributed globally to all affected nodes. This design helps reduce response latency for local and regional reads and writes. DynamoDB offers high-performance querying via its indexing system, employing Local Secondary Indexes (LSIs) and Global Secondary Indexes (GSIs) to support fast query operations.
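As a rough illustration of querying through a GSI, the sketch below uses the AWS SDK for JavaScript v3; the table name (Orders), index name (CustomerIndex), and attribute names are hypothetical.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({ region: "us-east-1" }));

// Query a hypothetical "Orders" table through a GSI keyed on customerId, so
// lookups by customer avoid a full table scan.
async function ordersForCustomer(customerId: string) {
  const result = await doc.send(new QueryCommand({
    TableName: "Orders",                        // hypothetical table
    IndexName: "CustomerIndex",                 // hypothetical GSI
    KeyConditionExpression: "customerId = :c",
    ExpressionAttributeValues: { ":c": customerId },
  }));
  return result.Items ?? [];
}
```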
Features of DynamoDB
Scalability and High Availability
DynamoDB is a highly available, scalable, and distributed data storage service, making it an ideal solution for enterprises with fluctuating workloads. It was designed for scale from its inception and automatically scales up or down with business demand. Scaling is performed through partitioning, a technique that divides data into smaller segments and allocates storage and processing resources to each partition. The platform provides multiple built-in features, such as auto scaling, on-demand capacity, and read/write provisioning, all aimed at providing scale and elasticity.
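For example, a table created in on-demand mode needs no capacity planning up front. A minimal sketch with the AWS SDK for JavaScript v3, using a hypothetical table name:

```typescript
import { DynamoDBClient, CreateTableCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({ region: "us-east-1" });

// Create a table in on-demand mode: no read/write capacity to provision up
// front; DynamoDB scales partitions with the traffic and you pay per request.
async function createCustomersTable() {
  await client.send(new CreateTableCommand({
    TableName: "Customers",                      // hypothetical table name
    AttributeDefinitions: [{ AttributeName: "customerId", AttributeType: "S" }],
    KeySchema: [{ AttributeName: "customerId", KeyType: "HASH" }],
    BillingMode: "PAY_PER_REQUEST",              // on-demand capacity mode
  }));
}
```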
Flexible Data Models
DynamoDB supports a flexible data model that allows users to store unstructured, semi-structured, and structured data. Its data models can be categorized as key-value and document. The platform supports data types such as numbers, strings, binary data, sets, and document formats, including lists and maps. Users can choose any of these data models and formats based on their data needs and access or modify their data in real time to suit their use cases.
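A minimal sketch of what a single item mixing these types might look like, again with the AWS SDK for JavaScript v3 and a hypothetical Customers table:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({ region: "us-east-1" }));

// One item mixing scalar, set, list, and map attributes. Only the key attribute
// (customerId) is fixed by the table schema; the rest is free-form per item.
async function saveCustomer() {
  await doc.send(new PutCommand({
    TableName: "Customers",                       // hypothetical table name
    Item: {
      customerId: "C-1001",                       // partition key (string)
      name: "Riley",                              // string
      visits: 12,                                 // number
      tags: new Set(["prime", "wishlist"]),       // string set
      recentOrders: ["O-1", "O-2"],               // list
      address: { city: "Seattle", zip: "98101" }, // map (nested document)
    },
  }));
}
```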
Security and Availability
DynamoDB provides an enhanced level of security and availability for users and their data. It offers automated backups, point-in-time recovery, replication across multiple copies of data, and encryption of data at rest and in transit. These features provide data protection and privacy, making it a good fit for businesses with regulatory and compliance requirements. AWS also provides tools to manage and monitor access to data and network traffic in real time, including access key management, data encryption, and data access control.
Low Latency and High Performance
DynamoDB provides low read and write latency through its global, multi-region availability, partitioning, and load balancing features. It ensures that read and write actions are performed quickly and efficiently, irrespective of the volume or changing patterns of traffic. DynamoDB also supports caching and indexing, which lets applications easily store and retrieve frequently used data records. Its caching feature helps reduce the overall response time for frequently accessed records, leading to better performance and lower costs.
Use Cases for DynamoDB
Internet of Things Sensors and Devices
DynamoDB can handle IoT data such as sensor readings, device telemetry, and more. IoT devices generate massive amounts of data in real time, which may require immediate processing, querying, and analysis to identify anomalies, optimize performance, and reduce downtime. DynamoDB is a good fit for this kind of storage, offering high availability, capacity, and fast access speeds to support IoT device data management and analytics.
Gaming
DynamoDB provides game developers with a scalable, efficient, and high-performance solution for managing their data, including user profiles, game data, and game metadata. The platform is designed to handle high traffic and sudden spikes in usage, providing low-latency, high-throughput reads and writes with automatic scaling and capacity provisioning.
High-Speed and Scalable Web Applications
DynamoDB is well suited to high-traffic web applications, chat applications, and social media networks. It is designed to deliver fast response times and high throughput, providing low-latency read and write operations with high scalability. Its support for multiple data models, flexible schemas, and rich querying options makes it an ideal solution for web applications with varied data requirements.
Real-Time Analytics
DynamoDB also works well for real-time analytics in the cloud. It can store and process large datasets and gives developers a flexible, cost-effective, and highly available foundation for running large-scale data analytics and machine learning workloads. Its serverless model, support for various data models, and built-in indexing make it a good platform for real-time data processing and querying.
DynamoDB is a powerful managed NoSQL database service designed to handle workloads and data models of any size with high scalability, availability, and performance. It eliminates the need for manual database management, provisioning, scaling, and monitoring, allowing users to focus on their business logic and application development. It provides a robust solution for multiple use cases, including IoT, gaming, web applications, and analytics, and has advanced security and data protection features.

MMS • Brandon Byars
Article originally posted on InfoQ.

Transcript
Byars: I want to talk about the journey of evolving an API without versioning. I like to start this with a little bit of a story. Just for background context, when I say API, a lot of folks immediately jump to something like a REST or a GraphQL API, and those certainly fit the criteria. So do Java APIs that operate in-process, or event-based APIs; they still have some agreement that needs to be made between their consumers and the producers. A lot of the talk is dedicated to the complexity of that agreement. This is based on a real story, but it has been changed, because the story itself isn't public.
The context, we want to book a cruise, ourselves and a partner for a vacation. This sequence diagram is an overly simplified view of what happens from a systems standpoint, when you book a cruise. It might be that we have a reservation system. It’s basically a two-phase commit. You hold the room. You pre-reserve the room until payment has been collected through a payment gateway. Then the reservation itself is confirmed in the underlying reservation system. When this works well, everybody is happy. What happens in this real scenario that I’ve abstracted is there was a significant spike in load. What that spike in load did is it forced the reservation system to send back an unexpected error on that last step, at confirming the reservation. An error that the team that developed the booking service had never seen before. Normally, you run into unexpected errors in systems, you get some unpredictable behavior. In this case, the unpredictable behavior was fairly catastrophic for the organization. Because what they had done is they’d built a retry loop around unexpected errors in the booking service for the entire workflow.
Under load, at peak volume, I might try to book a cruise. I'd pre-reserve the room, my credit card would be charged, and then the error would come back on confirmation. Retry: pre-reserve another room, charge my credit card, get the error, pre-reserve another room, charge my credit card again, and so on. That loop continued until either a customer had pre-reserved every room on the ship, or they had maxed out their credit card and the payment gateway itself returned an error, or there was some fraud-based alert from the payment gateway. That was obviously a big PR disaster for the organization. It caused a lot of consternation. It was a very visible, CNN-headline kind of incident. All based on the fact that the agreement between the API producer of the reservation system and the API consumer of the booking service did not completely cover the surface area of responses available.
Background
My name is Brandon Byars. I am head of technology for Thoughtworks North America. This talk is based on a significant amount of experience that I've had in API development throughout my career. I've written an open source tool called mountebank that will be the baseline of this talk; I've followed its API for nearly a decade. I've also written a few articles on martinfowler.com. This talk is based on one that I haven't yet published — it's long been in my queue to finish, and this talk is a little bit of a forcing factor for me to do that. This is all based on real-world experience. I have led a number of platform engagements; we consider these API platforms a really good way of scaling development inside organizations. One of those articles, on enterprise integration using REST, is quite dated, maybe 10 years old at this point.
This adaptation of Jamie Zawinski's quote on regular expressions is something that I wrote in that article: "Some people, when confronted with a problem, think, I know, I'll use versioning. Now they have 2.1.0 problems." Versioning is often seen as the de facto standard approach to evolving an API in a way that makes the agreement on backwards-incompatible changes — on breaking changes — explicit, forcing the consumers to upgrade for the new functionality, but in a very managed way. That's a very architecturally sound strategy; there's a reason it's used so widely. You see here on the left an adaptation of the old famous Facebook slogan of moving fast and breaking things. As you see from the quote that I put on the right, I like to challenge the idea of versioning as the default strategy, because I think it does cause a lot of downstream implications. The fact that all consumers have to upgrade is itself a point of inconvenience for many of those consumers. This talk is really dedicated to exploring alternative strategies that produce more or less the same results, but with different tradeoffs for the consumers and different tradeoffs for the producer as well.
Contract Specifications as Promises
When we talk about APIs — again, REST APIs, GraphQL, Java APIs, events, it doesn't matter — we have something like a contract, the specification. Of course, in the REST world, OpenAPI tends to be the 800-pound gorilla. There are of course alternatives, but it is a pretty widely used one. It's easy to fall into the trap as technologists of thinking of that specification as a guarantee. I really like the word promise. Mark Burgess came up with promise theory. He was big in the configuration management world — CFEngine, and Puppet, and Chef, and so forth — that led to the infrastructure-as-code techniques that we use today. He has a mathematical basis for his promise theory in the infrastructure configuration management world. For more lay audiences, he wrote a book on promise theory, and this quote came out of it: "The word promise does not have the arrogance or hubris of a guarantee, and that's a good thing." Promises fundamentally are expressions, communication patterns that demonstrate an intent to do something, but promises can be broken. As we saw in the reservation system example, promises can sometimes be broken in unexpected ways that lead to cascading failures.
Following the Evolution of a Complex API
I'd like to explore that idea of making best-effort attempts to solve customers' needs through some storytelling. I mentioned this open source product that I've managed for nine years now called mountebank. It's a service virtualization tool. If you are familiar with mocks and stubs that you might use to test your Java code, for example, service virtualization is a very similar construct; it just exists out-of-process instead of in-process. If your runtime service depends on another service — if you're building the booking service, you depend on the reservation service — and you want to have black-box, out-of-process tests against your booking service in a deterministic way, where you're not relying on the test data being set up in the reservation system, you can virtualize the reservation system. Mountebank allows you to do that. It opens up new sockets that listen for requests matching certain criteria and respond in the way that you, the test designer, set up. It's a very deterministic way of managing your test data.
There's more to it than this picture on the bottom. In the book that I wrote, I had to draw a number of diagrams that described how mountebank worked. This one covers more or less the back part of the process, generating the response. Mountebank gets a call; it's a virtual service that needs to respond in a specific way, returning the test data relevant to your scenario. What it does is grab a response. There are multiple ways of generating a response; we'll look at a couple. Then the bulk of the storytelling is going to be around this behaviors box. Behaviors are post-processing transformations on those responses. We'll look at some examples, because there has been a significant evolution of the API, sometimes in backwards-incompatible ways, in that space — all done without versioning. Then the core construct of mountebank is virtual servers; mountebank just calls them imposters, but it's the same idea as a virtual stub.
Evaluating Options from a Consumer’s Perspective
As we look at some of the different options, where versioning wins hands down is implementation complexity. When you version an API, you can simply delete all of the code that was there to support a previous version. You can manage the codebase in a more effective way if you are the API producer. I’m not going to look at implementation complexity as a decision criteria because I’ve already experienced that versioning wins on that front. Instead, as I look through a number of alternatives to versioning, I’m going to look at them from the consumer’s perspective. These three criteria are the ones I’m going to focus on. When I say obviousness, think the principle of least surprise. Does it do what you expect it to in a fairly predictable way? Does it match your intuitive sense of how the API should work? Elegance is another proxy for usability. When I think elegance, is it easy to understand? Does it use the terms and the filters in a consistent way? Is the language comprehensible? Does it have a relatively narrow surface area because it’s targeted to solve a cohesive set of problems? Or does it have a very broad surface area and therefore hinder the ramp-up to comprehension, because it’s trying to solve a number of different problems in an infinitely configurable way? Then stability is, how often do I as the API consumer have to change to adapt to the evolution of the API?
Evolution Patterns – Change by Addition
A few patterns, all real-world patterns that came out of my experience maintaining mountebank. This snippet of JSON is an example of how you might configure a response from the virtual service. This is HTTP; mountebank supports protocols outside of HTTP, but this is, I think, a pretty good one. All this is doing is saying that we're going to return a 500 with the text that you see in the body. You can also set up things like headers, for example. One of the first requests for a feature extension after releasing mountebank was that somebody wanted to add latency to the response. They wanted to wait half a second or three seconds before mountebank responded. The easiest thing in the world would have been to add that quite directly — some latency in the JSON — which is pretty close to what I did. I added this behaviors element with a little bit of a clumsy underscore, because I was trying to differentiate the behaviors from the types of responses. The is response type represents generation of a canned response, like you see here. There are two others: there's a way of doing record and replay, called proxy, and there's a way of programmatically configuring a response, called inject. Since those are not underscore prefixed, I thought I would do the underscore on the behaviors.
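A rough reconstruction of the kind of configuration being described, posted to mountebank's admin API; the port and body text are made up, and the underscore-prefixed _behaviors object reflects the older contract the talk describes rather than necessarily the current one. Shown as TypeScript using the global fetch available in recent Node versions:

```typescript
// A stub that returns a canned 500 and, via the underscore-prefixed behaviors
// object, waits 500 ms before responding (the "change by addition").
const imposter = {
  port: 4545,                       // made-up imposter port
  protocol: "http",
  stubs: [{
    responses: [{
      is: { statusCode: 500, body: "Server error" },
      _behaviors: { wait: 500 },    // latency in milliseconds
    }],
  }],
};

// mountebank's admin API listens on port 2525 by default; POST /imposters
// creates the virtual service described above.
async function createImposter(): Promise<void> {
  await fetch("http://localhost:2525/imposters", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(imposter),
  });
}
```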
More importantly, I thought that having a separate object, even though I only had one use case for it right now with this latency, was a point of extension. I think that's just a foundational topic to bring up. We talk a lot about backwards compatibility, but there is a little bit of forward thinking that allows us to cover at least some forward compatibility concerns. We can do something as simple as ensuring, for example, that our API doesn't respond with a raw array, because as soon as you need to add paging information and you have to add an object wrapper, you've made a breaking change. Adding an object for extensibility is a pretty popular forwards compatibility pattern. This is an example of that, even though I wasn't quite sure what I would use it for when I wrote it. This works pretty well. This was just a simple addition to an API. This is Postel's Law: you should be able to evolve an API in a way that doesn't change or remove elements and only adds to them. When I think about how that fits against the rubric that I mentioned earlier, I think this is as good as it gets. We should always feel comfortable as API producers adding new elements, being a little bit thoughtful about how to do that in a forwards-compatible way. This covers obviousness, elegance, and stability quite well.
Evolution Patterns – Multi-typing
That worked great. Then somebody said, I want the latency to be configurable. I mentioned that mountebank has this inject response type, which lets you programmatically configure a response. I thought maybe I would take advantage of that same functionality to let you programmatically configure the latency. What I did is I kept the wait behavior, but I had it accept either a number or a string that represents a JavaScript function. I call that multi-typing. It worked well enough. It allowed me to stay within the same intention of adding latency, with two different strategies for how to resolve that latency: a number of milliseconds or a JavaScript function. It's not as obvious. It's not as elegant. I have not done this since that initial attempt. If I were to run into the same problem today, I'd probably add a separate behavior, something like wait dynamic. That's a little bit less elegant because it expands the surface area you have to understand in the API, but it's a bit more obvious. Obviousness makes it easy, for example, to build a client SDK that doesn't need some weird translation layer — different subclasses, or functions, or properties that describe the API in a way that then gets translated to how the API actually works, because it's polymorphic in sometimes unhelpful ways. It works. I wouldn't recommend it. It certainly avoided having to release a new version to fix the API itself.
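A small sketch of the multi-typed wait being described; both shapes target the same behavior element, and the function body here is invented for illustration:

```typescript
// Fixed latency: a plain number of milliseconds.
const fixedLatency = {
  _behaviors: { wait: 500 },
};

// Dynamic latency: a stringified JavaScript function that the virtual service
// evaluates to compute the delay at response time (the "multi-typing").
const dynamicLatency = {
  _behaviors: { wait: "function () { return Math.floor(Math.random() * 1000); }" },
};
```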
Evolution Patterns – Upcasting
This third pattern is really my favorite: upcasting. It's a pretty common pattern — you see it a lot in the event-driven world, for example — but it really works for a number of different kinds of APIs. A subsequent behavior that was added to the list was this one around shellTransform. The idea was, mountebank has created this response, this status code, this body, but sometimes I want to post-process that JSON to change it, to add some dynamic information. I want to be able to use a shell program because I don't want to pass in a JavaScript function; maybe I want to use Ruby, in this example, to do something dynamic. It was relatively easy to build that. Then what people asked for was: actually, I want a pipeline of shell programs. I want very small, targeted shell programs that each do one thing, and to be able to compose multiple of them to generate the post-processed response. What I had to do was change shellTransform, originally a string, into an array. It would execute each of those shell programs in order in the array. This one, assuming that both the string and the array can be parsed, is a little bit less obvious, because it does have some components of that multi-typing that we just looked at, but it's managed in a much more productive way. I think this is a very elegant and very stable approach, and it's one of the first approaches I generally reach for when I try to evolve an API without breaking its consumers. Let me show you how it works.
First of all, just to acknowledge, this is a breaking change. We changed the API from a string to an array. The new contract, the new specification of the API, lists only the array. It does not advertise that it accepts a string. I could have simply released a new version, changed the contract to the array, and asked any consumers who had the string version to update themselves. That would have been at their inconvenience. The upcasting allows me a single place in the code that all API calls go through. I have this compatibility module, and I call the upcast function on it, passing in the JSON that the consumer is sending in the request. You can see the implementation of that upcast function, or at least a portion of it, down below. I have this upcastShellTransformToArray, and there's a little bit of noise in there; it's basically just looking for the right spot in the JSON and then seeing if it is a string. If it is, it wraps the string with an array so it's an array of one string. It is managing, on the producer side, the transformation that the consumers would otherwise have had to do. It adds a little bit of implementation complexity, although quite manageable because it's all in one spot in the code, as the cost of not having to inconvenience any consumers.
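A sketch of what such an upcast step might look like, in the spirit of the function being described; the field names mirror the ones in the talk, while the types and the exact traversal are simplifications rather than mountebank's actual code:

```typescript
type Behaviors = { shellTransform?: string | string[]; [name: string]: unknown };
type StubResponse = { _behaviors?: Behaviors };

// If a consumer still sends shellTransform as a bare string (the old contract),
// wrap it in a one-element array so the rest of the code only sees the new shape.
function upcastShellTransformToArray(response: StubResponse): void {
  const behaviors = response._behaviors;
  if (behaviors && typeof behaviors.shellTransform === "string") {
    behaviors.shellTransform = [behaviors.shellTransform];
  }
}

// Old contract in, new contract out.
const legacy: StubResponse = { _behaviors: { shellTransform: "ruby transform.rb" } };
upcastShellTransformToArray(legacy);
// legacy._behaviors.shellTransform is now ["ruby transform.rb"]
```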
Another reason I really like the upcasting pattern is that it works a bit like Russian dolls: you can nest them inside of each other. This is another example. Over time, the behaviors — these post-processing transformations of the response — added a bit more functionality. You see several here. Wait, we mentioned, adds 500 milliseconds. ShellTransform is now a list of shell programs that can operate on the JSON and the response. Lookup also has a list. Copy has a list. Decorate is just a string transformation that you can run. Then there's this repeat directive that allows you to return the same response to the same request multiple times in a row. Normally it works like a circular buffer, rotating through a series of responses, but you can ask it to hold back for three times on the same response before cycling to the next one.
I wanted to do this in a much more composable way, because it allows the consumer to specify the exact order of each transformation, which isn't possible on the left. On the left, there's an implicit order encoded inside mountebank — not published, not advertised. While some transformations operate at most one time, like decorate or wait, some can operate multiple times, like shellTransform and lookup. Repeat, it turns out, doesn't really belong there, because it's less of a transformation on the response and more a directive on how to return responses when there's a list of them, from mountebank's standpoint. What I wanted was a list where every single element is a single transformation, and you can repeat the transformations as much as you want. If you want to repeat the wait transformation multiple times, that's on you; you can do it. It's very consistent. This actually allowed me to make the API, in my opinion, more elegant and more obvious, because it works more like consumers would expect it to work rather than just exposing the accidental evolution of the API over the years. I rank this one quite high, but like all non-versioning approaches, it does require a little bit of implementation complexity.
The good news is that the implementation complexity for nested upcasting is trivial. I have the exact same hook in the pathway of requests coming in and being interpreted by mountebank — it calls this compatibility module — and all I have to do is add another function for the additional transformation after the previous one. As long as I execute them in order, everything works exactly as it should. We did upcastShellTransformToArray, which took the string and made an array. For the next instance, all I have to do is make the other transformation. If you have a very old consumer that only has the original contract, it'll upcast it to the next internal version of that contract. Then upcastBehaviorsToArray will update it to the published contract as it exists today in mountebank. The implementation was pretty trivial. It was just looking for the JSON elements in the right spot and making sure that if there was an array, it would unpack each element of the array in order. If it was a string, it would keep it as is, but it would make sure that every single element in the behaviors array had a single transformation associated with it.
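A sketch of how those two upcasts might chain, repeating the shellTransform step from the previous sketch so this stands on its own; the shapes follow the talk's description, not mountebank's actual source:

```typescript
type LegacyBody = {
  _behaviors?: Record<string, unknown>;
  behaviors?: Array<Record<string, unknown>>;
};

// Oldest contract -> intermediate: a bare shellTransform string becomes an array.
function upcastShellTransformToArray(body: LegacyBody): void {
  const st = body._behaviors?.shellTransform;
  if (typeof st === "string") {
    body._behaviors!.shellTransform = [st];
  }
}

// Intermediate -> current: the _behaviors object becomes an ordered behaviors
// array with one transformation per element; array-valued entries are unpacked
// element by element, scalar entries become single transformations.
function upcastBehaviorsToArray(body: LegacyBody): void {
  if (!body._behaviors) {
    return;
  }
  const list: Array<Record<string, unknown>> = [];
  for (const [name, value] of Object.entries(body._behaviors)) {
    const values = Array.isArray(value) ? value : [value];
    for (const v of values) {
      list.push({ [name]: v });
    }
  }
  body.behaviors = (body.behaviors ?? []).concat(list);
  delete body._behaviors;
}

// The single hook in the request pipeline: apply the upcasts in historical order,
// stepping the oldest contract through each intermediate shape to the current one.
export function upcast(body: LegacyBody): void {
  upcastShellTransformToArray(body);
  upcastBehaviorsToArray(body);
}
```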
Evolution Patterns – Downcasting
The next instance of a breaking change managed without a version was far more complex. This one is going to take a little bit of a leap of faith to understand — I don't want to dive too deep into how to use mountebank or its internal mechanics, but this one does require a little more context. I mentioned that mountebank allows you, as we've already seen, to represent a canned response that it'll return; for HTTP, we had the 500 status code and the body text. An alternative to is is this way of programmatically generating a JSON response: with inject, you pass in a JavaScript function as a string, as you see here. The function at first just took the original request that the system under test made to mountebank as a virtual service. There was a way of keeping state, so that if you were programmatically generating the response and wanted to track how many times you'd done that, you could keep a counter and attach its value as part of the response that you generated, and a logger. That was the original definition of the JavaScript function you could pass in. Pretty early on, people wanted to be able to generate the response in an asynchronous way — maybe they wanted to look something up from a database or make a network hop — so I had to add this callback. Then, a little bit later, it turned out that the way I'd implemented state was too narrowly scoped. Somebody made a very good pull request to add a much better way of managing state. It was certainly inelegant, because I had these two state variables in the JavaScript function. While I tried my best in the documentation to explain it, that certainly did not aid comprehension for a newcomer to the tool; it required following along with the accidental evolution of the tool.
Anybody who's done a lot of refactoring in dynamic languages — or languages in general — knows that one of the most effective ways to simplify that type of interface is to use this idea of a parameter object. As your parameters start to explode, you can replace them with a single object that represents the totality of the parameters. That also makes a very easy extension point, because if I need to add a sixth parameter down the line, it's just a property on that config object. This is the new published interface for mountebank. Again, a breaking change, because people who passed in the JavaScript function on the left now have to transform it into the JavaScript function on the right. However, if mountebank can do that transformation for you, through this technique called downcasting, it's a pretty elegant way of managing the complexity in the producer instead of passing it on to the consumers. It's not quite as obvious, because there is a little bit of magic that happens underneath the hood. It's not quite as elegant, because you do have this legacy of old parameters that somehow have to be passed around. If done well, it can be very stable.
Here is what it looked like in this instance in mountebank. What we basically did was take the new parameter object, this config being passed in, and continue to pass the subsequent parameters, even though we don't advertise them; we don't call them out explicitly in the contract. You can't go to the mountebank documentation today and see that these parameters are being passed in. The only reason they are is for consumers on the old contract who have never updated to the published contract. Those older parameters will still be passed in. That solves everything beyond the first parameter, the parameter object. It doesn't solve what happens with the parameter object itself, because that still needs to look like the old request that used to be passed in. That's why we have this downcastInjectionConfig call down here. That takes us back to the compatibility module. All of my transformations that manage breaking changes in the contract I can centralize in this compatibility module; I can go to one place and see the history of breaking changes through the API. When I say breaking changes, they are breaking changes to the published contract, but mountebank will manage the transformation from old to new for you. The consumer doesn't have to.
In this case, the config parameter object had to have state, the logger, and the done callback on it so that, for people using the new interface, it would work as expected. For people using the old interface, it had to look like the old request. That's what this bolded code down below is doing. There's a little bit of internal mechanics that I mentioned: mountebank has multiple protocols, and there's method and data, which are ways of sensing whether, in this case, it's HTTP or TCP. What it does is take all of the elements of the request — none of which, I knew, conflicted with the names of state and logger and the done callback; I had to have that expert knowledge as the person who architected the code to know I wasn't going to run into any naming conflicts — and add all of those elements, like the request headers, the request body, the request query string, onto the config object. While it was a parameter object that only had state and the logger and callback for most consumers, if your code happened to use the old function interface, it would also have all the HTTP request properties on it as well. It continued to work. That way, it was downcasting the modern contract to the old version in a way that would support both old and new, in a way that was guaranteed not to run into any naming conflicts.
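A sketch of the downcast being described, under the assumption that the injection function receives a single config object whose request fields are flattened onto it for old-style consumers; names like downcastInjectionConfig follow the talk, but the types and invocation are simplified guesses:

```typescript
type InjectionConfig = {
  request: Record<string, unknown>;          // the incoming virtual-service request
  state: Record<string, unknown>;            // user-managed state shared across calls
  logger: { info(message: string): void };
  callback: (response: object) => void;      // for asynchronous response generation
  [key: string]: unknown;
};

// Flatten the request's properties (headers, body, query, method/data, ...) onto
// config, so a function written against the old contract, which treated its first
// argument as the request, still finds everything where it expects it. This relies
// on request field names never colliding with state, logger, or callback.
function downcastInjectionConfig(config: InjectionConfig): void {
  for (const [key, value] of Object.entries(config.request)) {
    config[key] = value;
  }
}

// New consumers read everything off config; old consumers still receive the
// trailing positional arguments, which are no longer part of the published contract.
function runInjection(
  injection: (...args: unknown[]) => unknown,
  config: InjectionConfig
): unknown {
  downcastInjectionConfig(config);
  return injection(config, config.state, config.logger, config.callback);
}
```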
Evolution Patterns – Hidden Interfaces
This next pattern is, I think, where things get really interesting and really explore the boundaries of what is a contract and what is a promise, which I hinted at earlier. Getting back to the shellTransform: I gave a brief description of it. It allows you to build a shell program, written in the language of your choice, that receives the JSON-encoded request and response and spits out a JSON-encoded response. It allows programmatic transformation. If you were writing this in JavaScript, for example, the way it was originally published, your code would look something like this. The request and the response would be passed as command-line arguments to your shell program, quoted the right way, and you would have to interpret those in your code. That had all kinds of problems, especially on Windows, to do with the maximum length of the command line, which actually varies between operating systems and shells more than I understood when I wrote this code. On Windows it's quite limited — maybe 1048 characters or something like that. Of course, you can have very heavyweight HTTP requests or responses. If you are inputting that JSON and it's a 2000-character body, you've already exceeded the limit on the shell. That's the character limit itself.
There were also a number of quoting complexities: quoting the JSON the right way and escaping internal quotes for the different shells. I figured it out on Linux-based shells. The variety of quoting mechanisms on Windows-based shells — because there's more than one: you have PowerShell, you have cmd.exe, you have the Linux Cygwin-type ports — was more complexity than I realized when I went with this approach. What I had to do was have mountebank, as the parent process, put these things in environment variables that the child process can read. Very safe, very clean. I don't know why I didn't start there from the beginning, but I didn't. That's the reality of API development: you make mistakes. I wanted this to be the new published interface. Of course, I still had to leave the old one in there; I just removed it from the documentation. That's what I mean when I say a hidden interface. It's still supported, it's just no longer part of the published contract. Had it worked, I think it would have been a reasonably safe way of moving forward. I downgraded stability a little bit, and the reason is hinted at in the description I gave you of the character limitations of the shell.
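A sketch of what a shellTransform program along these lines might look like in Node, reading the environment-variable interface and falling back to the old argv one; the variable names MB_REQUEST and MB_RESPONSE are what mountebank's documentation uses, but treat the details here as illustrative rather than authoritative:

```typescript
// Run by mountebank as a child process; the transformed response goes to stdout.
const requestJson = process.env.MB_REQUEST ?? process.argv[2];   // new interface, old fallback
const responseJson = process.env.MB_RESPONSE ?? process.argv[3];

if (!requestJson || !responseJson) {
  throw new Error("expected MB_REQUEST/MB_RESPONSE env vars or two JSON arguments");
}

const request = JSON.parse(requestJson);
const response = JSON.parse(responseJson);

// Post-process the canned response, e.g. echo part of the request back as a header.
response.headers = { ...response.headers, "x-echoed-path": request.path };

process.stdout.write(JSON.stringify(response));
```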
What happened was that mountebank was still passing stuff on the command line — this code down here, this quoteForShell, was more or less the code that let me do it, managing the complexity of trying to figure out whether you're on Windows and how to quote it exactly right. Unfortunately, even if you weren't using the old interface, even if your shell program was using the environment variables, it still introduced scenarios that would break mountebank, because mountebank would put the command-line arguments as part of the shell invocation. Sometimes, in certain shells on certain operating systems, that invocation would exceed the character limit supported by the shell itself. Even though you had no intention of using them, even though you didn't know they were being passed, mountebank would throw an error, because it exceeded the shell limitation.
For a while, what I tried to do was say, let me be clever: if you're on Windows, do this; if you're on Linux, do that. It was too much complexity. I don't know that I'm smart enough to figure out how to do it all, and even if I was, you would still run into edge cases. No matter how big the character limit of the shell is, there is a limit, and it's possible to exceed that limit, especially if you're testing very large bodies for HTTP, for example. My first attempt was to truncate it only for certain shells, but pretty soon I realized that was a mistake, so I had to truncate it for everybody. This was a real tradeoff — I think probably the pivotal moment in this talk — because there was no way for me to guarantee that I could do this without a version without breaking people. I truncated it for people who were on a Linux shell that has hundreds of thousands of characters as a limit, just as I truncated it for Windows, which maybe has a 1000 or 2000-character limit. There may have been people using the old interface on Linux who, post-truncation, would get an error. I was unaware of any. I had zero feedback that that was the case. It was certainly a possibility, even if somewhat remote, because the way of passing on the command line wasn't around for very long before it switched to the environment variable approach.
Releasing a new version would have been the safest option by far to satisfy all of the constraints around stability in that scenario. However, it would have also forced consumers to upgrade. It would have been very noticeable to consumers: they would have had to read the release notes, figure out what they needed to change, and do all of the testing associated with that. Alternatively, I could take the approach that I did, which was to just truncate in all cases, publish only the environment variable approach, and rely on the fact that it was unlikely to break anybody — and if it did, the error message would specify exactly what they needed to do to fix it: switch to the environment variables. That way, I was optimizing for the masses. I was optimizing for what would support most people in a very frictionless way, with a clear path of resolution for what may have been zero people affected by the breaking change.
How To Think About API Evolution
That's uncomfortable, because it forces us to rethink API evolution away from an architectural pattern that guarantees stability, toward thinking about it as the users would think about it. I was really inspired by this thing called Hyrum's Law. Hyrum worked at Google. With a sufficient number of users of an API, it doesn't matter what you promised in the contract, because consumers will couple themselves to every observable part of the API. I remember that for a while, Microsoft Windows, when they would update, would have to add code to the updated operating system, because they would test not just the operating system itself but third-party applications using the operating system. Third-party developers had done all kinds of very creative things with unpublished parts of the Windows SDK for a long time. As Windows changed these unpublished parts of the SDK — maybe somebody was doing something clever with the eighth bit that was unused in a byte, which was a real scenario that happened sometimes — they would have to detect that and write code in the new operating system to continue supporting the same behavior, even though that was never something that was guaranteed.
Hyrum’s Law
There's a famous xkcd comic out there where a user complains because their Emacs setup took advantage of the fact that when you held the spacebar down, it overheated the computer, creating a side effect they relied on. The developer is like, no, I just fixed the overheating problem. The Emacs user is like, no, can you change it back to the old behavior? Hyrum's Law is a really humbling law for an API producer. Especially as one who has had a public API available for most of a decade now, I really relate to how frequently I find myself surprised at how people have hacked an API to do something I didn't anticipate they could do, in a way I wasn't intending to support, but which now oftentimes is supported. Mountebank is primarily a RESTful API, but some people embedded it in JavaScript, and I never really meant to support that. Some people did it because it solves the startup time — it's just part of your application instead of a separate service. Now I have this accidental complexity of supporting a JavaScript API that you can embed in an Express application as well. That's an example of Hyrum's Law. It's mentioned in this book, "Software Engineering at Google," which is why I put it there. I got a lot of value from some of the patterns of what Google has had to do to scale to 50,000 engineers.
API Evolution Is a Product Management Concern
We talk a lot about API as a product nowadays, usability, feasibility, viability being common descriptions of the tradeoffs in product management. I think that rethinking backwards compatibility evolution or breaking change evolution, from an architecture concern to a product management concern, is a much healthier position to think about how to manage the evolution of your API. I think that the tradeoffs that are represented by product thinking are more nuanced than the tradeoffs represented by architecture thinking. I think versioning is a very solid architectural pattern that guarantees stability in the case of breaking changes. There are always needs for that pattern. Mountebank itself has enough debt underneath it. One of these days, I would like to release a subsequent version that allows me to remove a lot of the cruft, a lot of the things I really no longer want to support, but have to because of some of these backwards compatible transformations that I’m doing.
If we think about viability — are we solving problems that our users have in an API context — I really like the idea of cognitive load that the authors of "Team Topologies" talk about. When I think about any product, what I really want to do is simplify the underlying complexity. I really have no idea how my phone works. I have no idea how it connects to a cell tower. I don't understand the underlying mechanics, physics, material design; I barely understand the software. It simplifies an interface for me to be able to use it. Same as driving a car: it's a couple of pedals and a steering wheel, with mirrors in the right places. I can drive without having to understand the underlying complexity of the system that I'm driving. I want my APIs to do the same thing. Usability really has been the focus of this talk: how do I manage evolution of that system, or of that interface, in a way that provides the most usable and stable experience for my users? Then, feasibility is very much an architectural concern: how do I do that in a way that is technically feasible, that protects downstream systems, and that satisfies the non-functional requirements of the overall ecosystem at large? Rethinking API evolution as product management has, for me, been a pretty profound way of understanding and empathizing with the needs of the consumers of mountebank. It's something that I'd recommend you consider as you're evolving your own API. Versioning is always an option that you can reach for; upcasting and some of these others, I think, would be valuable additions to your toolbox.
Questions and Answers
Betts: How do you hide the complexity and keep it from being too bloated?
Byars: A large part of that, in my context, was trying to centralize it. Almost all of the code in mountebank only knows how to respond to the newest interface that is documented and supported behind the contract. Most of the code doesn't have this legacy behind it. For upcasting, there's one hook in the request processing pipeline that calls the compatibility module. That's where all the transformations happen that convert from old to new. The exceptions are downcasting: a few downcast calls have to be sprinkled in certain strategic areas of your code. That is a little bit of debt that I'd love to clean up someday with a new version. For most of the transformations, it's pretty straightforward.
Betts: There was a question about returning a string instead of other data types. That made me wonder, a lot of your patterns you talked about are how you handle changes to the request to support different inputs. How do you evolve the response that you give to the consumer?
Byars: I don't think there is a path that I see for the producer managing backwards-incompatible changes on the response without a version. In fact, this is one of the driving forces behind why I would love to someday create a version for mountebank, because there are some responses that I look at now and it's like, "I wish I hadn't done that."
Betts: Sometimes these changes happen, and you have to evolve because there are just new features you want to add. Sometimes it’s a mistake in the original design. What drove you to make the change? Does that influence your decision?
Byars: Ideally, you're trying to be thoughtful in the API design so that plugging in new features is an addition. That has been the norm; it's not universal. Generally speaking, that's an easier process. Sometimes covering up mistakes requires more thought in the API design change, in my experience. There are simple ones where I really wish I hadn't created an endpoint, or accepted a PR with an endpoint, that has a specific name, because it doesn't communicate what I'm really hoping that feature communicates to users, and it actually conflicts with some future features that I want to add. That actually happened. What I did in that case was a little bit of hidden interface plus change by addition: I created a new endpoint with a name that wasn't as elegant as what I originally wanted. I just compromised on that because it was more stable. My criteria: less elegant, but more stable. I accepted that there's a tradeoff. Sometimes you can't get the API exactly the way you want because of the fact that you have real users using it. That's not necessarily a bad thing, especially if you can keep those users happy. There wasn't an attempt at deprecating the old endpoint; I created a new one that communicates what I want, while still having a little bit of compromise in the naming of fields. Then, of course, some of these other patterns that you see here are other strategies that do require more thought, more effort, than just adding a feature in most cases.
Betts: With your centralized compatibility module, how do you actually stop supporting deprecated features? With versioning you can delete the code that’s handling version, whatever of the API, as long as it’s in a separate module. Does this stuff live around forever?
Byars: Yes, I've never deprecated those features. As soon as I release something — and it's hard for an open source product sometimes to know who's using what features; I don't have any phone-home analytics, and I don't intend to add any — you have to assume that you're getting some users of that feature. The good news is that with the centralized compatibility module, especially with upcasting, which is most of what I've done, it's relatively easy to adjust. I've been able to take one of these other patterns without too much fuss. Downcasting is the hardest. One of these days — especially for the response question that you asked, because that's where I have the most debt that I haven't been able to resolve with these strategies — I would love to do a version. That would be the opportunity to do a sweep through the code that I no longer want to maintain.
Betts: I’m sure mountebank v2 will be really impressive.
Byars: The irony is I did release a v2, but it was a marketing stunt. I looked at the [inaudible 00:48:21] spec and they say, if it’s a significant release, you can use a major version. I felt pedantically validated with what they said. It was really just a marketing stunt, and I made sure in the release notes to say, completely backwards compatible.
Betts: There’s no breaking changes.

MMS • Akshat Vig
Article originally posted on InfoQ.

Transcript
Vig: How many of you have ever wanted a database that provides predictable performance, high availability, and is fully managed? What I'm going to do is talk about the evolution of a hyperscale cloud database service, DynamoDB, and walk through the lessons that we have learned over the years while building this hyperscale database. I am Akshat Vig. I'm a Principal Engineer in the Amazon DynamoDB team. I've been with DynamoDB right from its inception.
Why DynamoDB?
AWS offers 15-plus purpose-built database engines to support diverse data models, including relational, in-memory, document, graph, and time-series. The idea is that you as a customer can choose the right tool for the use case that you're trying to solve. Here we are zooming in on DynamoDB, which is a key-value database. The first question that comes to mind is: why DynamoDB? Let's go back in history. During the 2004-2005 timeframe, amazon.com was facing scaling challenges caused by the relational database that the website was using. At Amazon, whenever we have these service disruptions, one thing we do as a habit, as a culture, is COEs, which are basically correction of errors. In that COE, we ask: how can we make sure that the issue that happened does not happen again? The use case behind that particular outage was the shopping cart. One of the questions that we asked in the COE was: why are we using a SQL database for this specific use case? What SQL capabilities are actually needed? It turns out, not many. Choosing the right database technology is the key to building a system for scale and predictable performance. At that time, when we asked this question — if not a SQL database, what exactly would we use? — no other database technology existed that met the requirements that we had for the shopping cart use case.
Amazon created Dynamo. It was between 2004 and 2007 that Dynamo was created; finally, in 2007, we published the Dynamo paper after letting it run in production, used by not just the shopping cart use case but multiple amazon.com services. Dynamo was created in response to the need for a highly available, scalable, and durable key-value database for the shopping cart, and then more teams started using it. Dynamo was a software system that teams had to take and run as their own installations on resources that they owned, and it became really popular inside Amazon within multiple teams. One thing we heard from all these teams is: Dynamo is amazing, but what if you made it a service, so that all the teams trying to become experts in running these Dynamo installations have it easier? That led to the launch of DynamoDB.
Dynamo and DynamoDB are different. Amazon DynamoDB is a result of everything we have learned about building scalable, large-scale databases at Amazon, and it has evolved based on the experiences we have gained while building these services. There are differences between Dynamo and DynamoDB. For example, Dynamo was single tenant: as a team, you would run an installation, and you would own the resources used to run that service. DynamoDB is multi-tenant; it's basically serverless. Dynamo provides durable consistency; DynamoDB is opinionated about it and provides strong and eventual consistency. Dynamo prefers availability over consistency, whereas DynamoDB prefers consistency over availability. In Dynamo, routing and storage are coupled; as we'll see, routing and storage in DynamoDB are decoupled. Custom conflict resolution was supported in Dynamo; in DynamoDB we have last writer wins.
Coming back to the question of why DynamoDB: if you ask this question today, 10 years later, customers will still say they want consistent performance, they want a fully managed serverless experience, and they want high availability from the service. Consistent performance at scale has been one of the key durable tenets of DynamoDB. As DynamoDB is adopted by hundreds of thousands of customers, and as request rates increase, customers running mission-critical workloads on DynamoDB keep seeing consistent performance at scale. Proof is in the pudding: one of our customers, Zoom, in early 2020, saw unprecedented usage that grew from 10 million to 300 million daily meeting participants, and DynamoDB was able to scale with just a click of a button and still provide predictable performance to them.
DynamoDB is fully managed — what does it mean? DynamoDB was serverless even before the term serverless was coined. You pay for whatever you use in DynamoDB; you can scale down to zero, essentially. If you're not sending any requests, you don't get charged. It is built with separation of storage and compute. As a customer, in case you run into logical corruptions, where you accidentally deleted some items or deleted your table, you can do a restore. DynamoDB also provides global active-active replication for use cases where you want the data closer to the user: you can run a DynamoDB table as a global table.
On availability, DynamoDB offers an SLA of four nines of availability for a single-region setup. If you have a global table, you get five nines of availability. Just to understand the magnitude of scale: amazon.com is one of the customers of DynamoDB, and on 2022 Prime Day, amazon.com and all the different Amazon websites generated 105.2 million requests per second. This is just one customer, which can help you understand the magnitude at which DynamoDB runs. Throughout all this, they saw predictable single-digit millisecond performance. It's not just amazon.com; hundreds of thousands of customers have chosen DynamoDB to run their mission-critical workloads.
DynamoDB, Over the Years
That was an introduction to DynamoDB: how it is different from Dynamo, and what properties are the durable tenets of the service. Let's look at how it has evolved over the years. DynamoDB was launched in 2012 — working backward from the customer, that's how Amazon operates. It started as a key-value store: we first launched DynamoDB in 2012, where you as the customer can do Puts and Gets, and it scales. Foundationally, very strong. Then we started hearing from customers that they wanted more query capabilities, serverless capabilities in DynamoDB, and we added indexing. Then customers started asking about JSON documents, and we added that so they can preserve complex, possibly nested structures inside DynamoDB items. Then, in 2015, a lot of customers were asking: can you provide us materialized views? Can you provide backup and restore? Can you provide global replication? We said, let's take a step back and figure out what common building block we need to build all these different things customers are asking for. We launched DynamoDB Streams so that, until we built all these native features inside DynamoDB, customers could innovate on their own — and a lot of customers actually used the basic scan operation and Streams to do exactly that. Most recently, we launched easier ingestion of data into DynamoDB and easier export of data from DynamoDB. Over the years, the ask from customers around features, predictable performance, availability, and durability has been constant.
How Does DynamoDB Scale and Provide Predictable Performance?
How does DynamoDB scale and provide predictable performance? Let's try to understand this aspect of DynamoDB by understanding how exactly a PutItem request works. As a client, you send a request; you might be in the Amazon EC2 network or somewhere on the internet, it doesn't matter. As soon as you make a PutItem request, it lands on the request router, the first service that you hit. Like every AWS call, this call is authenticated and authorized using IAM. Once the request is authenticated and authorized, we look at the metadata to figure out where exactly we need to route the request, because the address of where the data for this particular item has to finally land is stored in a metadata service, which the request router consults. Once it knows where to route the request, the next thing it does is verify whether the table that the customer is trying to use has enough capacity. If it has enough capacity, the request is admitted; if the capacity is not there, the request is rejected. This is admission control done at the request router layer. Once all that goes through, the request is sent to the storage node. For every item in DynamoDB, we maintain multiple copies of the data. Among the DynamoDB storage nodes for an item, one is the leader storage node and the other two are follower storage nodes. Whenever you make a write request, it goes to the leader and gets written on at least one more follower before the write is acknowledged back to the client.
We don’t have just a single request router and just three storage nodes; the service consists of many thousands of these components. Whenever a client makes a request, it is routed to a request router and then sent to the appropriate storage node. Like any well-architected AWS service, DynamoDB is designed to be fault tolerant across multiple availability zones. In each region, there are request routers and storage nodes in three different availability zones, and we maintain three copies of data for every item that you store in a DynamoDB table. The request router does a metadata lookup to find out where exactly to route the request. It takes away the burden from clients of doing the routing. When I said storage and routing are decoupled, that’s what I meant: clients don’t have to know where to route the request; it is all abstracted away in the request router.
Whenever the request router gets a request, it finds out which storage nodes are hosting the data and connects to the leader storage node. The leader storage node accepts the request, replicates it to its followers, and once it gets an acknowledgment from at least one more replica, it acknowledges the write back to the client.
Data is replicated to at least two availability zones before it is acknowledged. DynamoDB uses Multi-Paxos to elect a leader, and the leader continuously heartbeats with its peers. The reason for doing this is that if a peer fails to hear heartbeats from the leader, a new leader can be elected so that availability is not impacted. The goal is to shorten failure detection and elect a new leader as quickly as possible when failures happen.
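As an illustration of the heartbeat idea, here is a toy Python sketch of lease-style failure detection. The interval, timeout, and the propose_new_leader call are invented for the example; the talk only says that peers elect a new leader quickly when heartbeats stop arriving.

```python
import time

HEARTBEAT_INTERVAL = 0.5                   # seconds, illustrative only
FAILURE_TIMEOUT = 3 * HEARTBEAT_INTERVAL   # how long before the leader is suspected

class PeerView:
    """What a follower tracks about its current leader."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def leader_suspected_down(self):
        return time.monotonic() - self.last_heartbeat > FAILURE_TIMEOUT

def maybe_start_election(view, paxos_group):
    # If heartbeats stop arriving, propose a new leader quickly so that
    # writes to this partition stay available.
    if view.leader_suspected_down():
        paxos_group.propose_new_leader()
```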
Tables
Now that we understand the scale at which Dynamo operates and how the request routing logic works, let’s look at the logical construct, the table, and how DynamoDB automatically scales as your traffic and your data size increase. As a customer, you create a table, and for each table you specify a partition key. In this particular example, each customer has a unique customer identifier, and we are storing customer information in this table, so customer ID is your partition key. You also store other customer information, like name and city, as other attributes in the item. DynamoDB scales by partitioning. Behind the scenes, whenever you make a call to DynamoDB with the customer ID, or whatever your partition key is, Dynamo runs a one-way hash on it. The reason for the one-way hash is that it results in a random distribution across the total hash space associated with that table. A one-way hash cannot be reversed; it’s not possible to determine the input from the hashed output. The hashing algorithm produces highly randomized hash values, even for inputs that are very similar.
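A small Python sketch can show what that one-way hashing looks like in spirit; the actual hash function DynamoDB uses internally is not public, so MD5 here is purely illustrative.

```python
import hashlib

def hash_partition_key(partition_key: str) -> int:
    # One-way hash: similar inputs produce very different outputs, spreading
    # items roughly uniformly across the table's hash space.
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16)          # position in a 128-bit hash space

for customer_id in ["CUST-1001", "CUST-1002", "CUST-1003"]:
    print(customer_id, hex(hash_partition_key(customer_id))[:12])
```

Even though the three customer IDs differ by a single character, their hashed values land far apart in the hash space.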
A table is partitioned into smaller segments based on the overall capacity that you have asked for or the size of the table. Each partition contains a contiguous range of the hashed key space and the key-value pairs that fall in it. For example, in this case we have a green partition that covers values roughly from 0 to 6, an orange partition that covers values from 9 to B, and a pink partition that covers values from E to F. Essentially, given the hashed value of an item’s partition key, a request router can determine which hash segment that item falls into, and from the partition metadata service it can find the three storage nodes holding copies of that item, and then send the request to that particular set of nodes.
As I explained previously, we keep three copies of data in three different availability zones. With these three partitions, we essentially have three green partitions in three different zones, three orange partitions, and three pink partitions. The metadata about where exactly these partitions live is stored in a metadata service; that particular metadata is called a partition map. A partition map is essentially the key ranges that each partition supports, plus green1, green2, green3, which are the addresses of the three storage nodes where that partition is hosted. Think about it: when Zoom comes and asks for a 10-million-read-capacity-unit table, we would essentially add more partitions. If they suddenly increase their throughput to 100 million, we would add more partitions accordingly and update the metadata. That’s how DynamoDB scales.
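The partition map lookup can be pictured as a range search over the hashed key space. The sketch below is illustrative only; the key ranges and the green/orange/pink replica names simply mirror the example above.

```python
import bisect

# Hypothetical partition map: (upper bound of hash range, replica addresses).
PARTITION_MAP = [
    (0x6FFFFFFF, ["green1", "green2", "green3"]),
    (0xBFFFFFFF, ["orange1", "orange2", "orange3"]),
    (0xFFFFFFFF, ["pink1", "pink2", "pink3"]),
]

def find_replicas(hashed_key: int):
    # Binary-search the key ranges to find the owning partition, then return
    # the addresses of its three storage node replicas.
    upper_bounds = [upper for upper, _ in PARTITION_MAP]
    return PARTITION_MAP[bisect.bisect_left(upper_bounds, hashed_key)][1]

print(find_replicas(0x3A000000))    # -> ['green1', 'green2', 'green3']
```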
Predictable Performance and Data Distribution Challenges
What challenges are there? DynamoDB is a multi-tenant system, so what challenges come into the picture that we have to solve, and what lessons have we learned to provide predictable performance? One of the common challenges in a multi-tenant system is workload isolation, because it’s not just one customer that we have; we have multiple customers, and their partitions are placed on storage nodes that are multi-tenant. If isolation is not done right, it can cause performance impact for these customers. Let’s jump into how we solve that. In the original version of DynamoDB, released in 2012, customers explicitly specified the throughput that the table required in terms of read capacity units and write capacity units. Combined, that is what is called the provisioned throughput of the table.
If a customer reads an item of up to 4 KB, that consumes one read capacity unit. Similarly, if a customer writes a 1 KB item, that consumes one write capacity unit. Recall from the previous example that the customers table had three partitions. If a customer asked for 300 read capacity units in the original version of Dynamo, we would assign 100 RCUs to each of the partitions, giving 300 RCUs in total for the table, assuming the workload is uniform, that is, traffic goes to the three partitions at a uniform rate. To provide workload isolation, DynamoDB uses the token bucket algorithm. A token bucket tracks the consumption of the capacity that a table, and each partition, has, and enforces a ceiling on it.
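As a quick aside on the capacity-unit arithmetic above, before getting into the token buckets themselves: read units are sized in 4 KB increments and write units in 1 KB increments, rounded up. A tiny Python sketch of that calculation, for strongly consistent reads (the rounding behavior is the standard published definition, not something specific to this talk):

```python
import math

def read_capacity_units(item_size_bytes: int) -> int:
    # One RCU covers a strongly consistent read of up to 4 KB, rounded up.
    return math.ceil(item_size_bytes / 4096)

def write_capacity_units(item_size_bytes: int) -> int:
    # One WCU covers a write of up to 1 KB, rounded up.
    return math.ceil(item_size_bytes / 1024)

print(read_capacity_units(3500))    # 1 RCU
print(read_capacity_units(9000))    # 3 RCUs
print(write_capacity_units(2500))   # 3 WCUs
```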
Looking at one of these partitions: in the original version of Dynamo we had token buckets at the partition level. Each second we refill tokens at the rate of the capacity assigned to the partition, which is the bucket in this case, and as read requests consume RCUs we continuously deduct tokens based on that consumption. If you do one request, we consume one token from the bucket, while the bucket keeps getting refilled at a constant rate. If the bucket is empty, we cannot accept the request and we ask the customer to try again. Overall, if a customer is sending requests and there are 100 tokens, the requests get accepted for the green partition. As soon as the consumed rate goes above 100 RCUs, in this example at 101 RCUs, the request gets rejected, because there are no tokens left in that token bucket. That is the high-level idea of how token buckets work.
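A minimal token bucket in Python, just to make those mechanics concrete; the refill granularity and API here are simplified assumptions, not the service’s implementation.

```python
import time

class TokenBucket:
    """Minimal token bucket: refill at a fixed rate, cap at a maximum size."""

    def __init__(self, refill_rate: float, capacity: float):
        self.refill_rate = refill_rate        # tokens added per second
        self.capacity = capacity              # maximum tokens the bucket holds
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def try_consume(self, tokens: float = 1.0) -> bool:
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False                          # empty: reject and ask caller to retry

# A partition assigned 100 RCUs refills 100 tokens per second.
partition_bucket = TokenBucket(refill_rate=100, capacity=100)
```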
What we found out when we launched DynamoDB is that uniform distribution of workloads is very hard. Getting a uniform workload for the full duration that your application is running and your table exists is very hard for customers to achieve, because traffic tends to come in waves or spikes. For example, say you have an application for serving coffee. You will see a spike early in the morning as traffic suddenly increases, and as most customers get their coffee and head to the office, the traffic suddenly drops. Traffic is not uniform; it changes with time. Sometimes it is spiky, sometimes there is not much traffic in the system. If you create a table with 100 RCUs and you see a spike of traffic greater than 100 RCUs, then whatever is above 100 RCUs will get rejected. That’s what I mean by non-uniform traffic over time. Maybe your traffic is distributed across all the partitions, or maybe it is concentrated on a few partitions, but it is not uniform across time. Which means that if you have provisioned the table at 100 RCUs, any request sent above the 100 RCU limit will get rejected.
Another challenge we saw was how customers reacted to getting throttled. To solve the problem of throttling, they started provisioning for the peak. Instead of asking for 100 RCUs, they would ask for 500 RCUs, which meant the table could handle the peak workload of the day, but for the rest of the day there was a lot of waste in the system. This meant a lot of unused and wasted capacity, which incurred cost to the customers. Customers asked us, can you solve this problem? We said, what if we let customers burst? What should the capacity of the bucket be? To help accommodate spikes in consumption, we launched bursting, where we allow customers to carry over their unused throughput in a rolling 5-minute window. It’s very similar to unused minutes in a cellular plan: you’re capped, but if you don’t use your minutes in the last cycle, you can move them to the next one. That’s what we called the burst bucket. Effectively, the increased capacity of the bucket helped customers absorb their spikes. This was in the 2013 timeframe: we introduced bursting, unused provisioned capacity was banked to be used later, and when you exercised those tokens, they were spent. With that, we were able to solve the problem of non-uniform workload over time.
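Bursting can be layered onto the earlier token bucket sketch by letting the bucket bank up to five minutes’ worth of unused tokens instead of only one second’s worth. The 300-second figure mirrors the rolling window mentioned above; the modeling itself is an illustrative assumption.

```python
# Reusing the TokenBucket sketch from earlier: same refill rate, but the
# bucket can now hold 300 seconds' worth of unused capacity.
provisioned_rcus = 100
burst_bucket = TokenBucket(refill_rate=provisioned_rcus,
                           capacity=provisioned_rcus * 300)

# A quiet period banks tokens; a later spike above 100 RCU/s is absorbed
# until the banked tokens run out, after which requests are throttled again.
```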
We talked about non-uniform distribution over time; let’s talk about non-uniform distribution over keys. Say you’re running a census application for Canada, and the data in the table is partitioned based on ZIP codes. You can see in this map that 50% of Canadians live below that line, and 50% live north of it. What you will see is that most of your data goes to a handful of partitions, which means the traffic on those partitions is higher than on the others. In this example, we have 250 RCUs going to the green partition and 10 RCUs going to the orange and pink partitions, so the partitions are not seeing uniform traffic. The takeaway we had from bursting and non-uniform distribution over space was that we had tightly coupled how much capacity a partition gets to where those partitions physically land. We had coupled partition-level capacity to admission control, and admission control was distributed and performed at the partition level. The result, pictorially, is that you would see all the traffic go to a single partition, and since there is not enough capacity on that partition, requests start getting rejected.
The key point to note here is that even though the customer’s table has enough capacity, for example 300 RCUs in this case, that particular partition was only assigned 100 RCUs, so the requests are getting rejected. Customers were like, I have enough capacity on my table, why is my request getting rejected? This was called throughput dilution. The next thing we had to do was solve throughput dilution. To do that, we launched global admission control. DynamoDB realized it would be beneficial to remove admission control from the partition level, move it up to the request router layer, and let all these partitions burst. We still have a maximum capacity that a single partition can serve, for workload isolation, but we moved the token buckets from the partition level to a global, table-level token bucket. In the new architecture, we introduced a new service called GAC, global admission control. It’s built on the same idea of token buckets, but the GAC service centrally tracks the total consumption of table capacity, again in terms of tokens. Each request router maintains a local token bucket so that admission decisions are made independently, and communicates with GAC to replenish tokens at regular intervals. GAC essentially maintains an ephemeral state computed on the fly from the client requests. Going back to the 300 RCU example, customers could now drive that much traffic even to a single partition, because we moved the token bucket from the partition level to a global, table-level bucket. With that, no more throughput dilution. A great win for customers.
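One way to picture GAC is as a table-level bucket that request routers draw from in batches, each keeping a small local bucket for fast, independent admission decisions. The sketch below is only an illustration of that idea; the class names, batch size, and refill behavior are assumptions.

```python
import time

class GlobalAdmissionBucket:
    """Table-level token bucket tracked centrally (the GAC idea, sketched)."""

    def __init__(self, table_rcus: float):
        self.rate = table_rcus
        self.tokens = table_rcus
        self.last = time.monotonic()

    def grab(self, requested: float) -> float:
        # Refill, then hand out at most `requested` tokens to a request router.
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        granted = min(requested, self.tokens)
        self.tokens -= granted
        return granted

class RouterAdmission:
    """Each request router admits locally, topping up from the global bucket."""

    def __init__(self, gac: GlobalAdmissionBucket, batch: float = 50.0):
        self.gac = gac
        self.batch = batch                # tokens fetched from GAC at a time
        self.local = 0.0

    def admit(self, cost: float = 1.0) -> bool:
        if self.local < cost:
            self.local += self.gac.grab(self.batch)
        if self.local >= cost:
            self.local -= cost
            return True
        return False                      # table capacity exhausted; retry later
```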
That solution was great: we had launched bursting and global admission control, and it helped a lot of use cases in DynamoDB. Still, there were cases where a customer, due to non-uniform distribution over time or space, could run into scenarios where traffic to a specific partition reached that partition’s maximum. If a partition can serve 3,000 RCUs at most and a customer wants to do more on that partition, requests beyond 3,000 RCUs would get rejected. We wanted to solve that problem as well. What we did was, as the traffic increases on a partition, we actually split the partition. Instead of throttling the customer, we started doing automatic splits. The idea behind automatic splits is to identify the right midpoint that will help redistribute the traffic between two new partitions. If customers send more traffic to one of the new partitions, we would further split that one into smaller partitions and route the traffic to them.
Now you have partitions that are balanced with respect to the traffic they see. You as a developer did not have to do a single thing: AWS is adjusting the service to fit the specific usage pattern you are generating. All of this happens to solve both problems, whether you have non-uniform traffic over time or non-uniform traffic over space. This is not something we got right from day one. As more customers built on top of DynamoDB, we analyzed their traffic, understood the problems they were facing, and solved them by introducing bursting, split for consumption, and global admission control. Going back to the picture, if the customer is driving 3,000 requests per second to the green partition, we would automatically identify the right place to split and split it so that the traffic divides between the two new partitions, 1,500 RCUs each. This was, again, a lot of heavy lifting done on behalf of customers.
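The split-point choice can be illustrated with a few lines of Python: instead of cutting the key range in half, cut where the observed consumption divides roughly in half. The traffic numbers and key format below are made up for the example.

```python
def choose_split_key(consumed_per_key: dict) -> str:
    """Pick the key at which recent consumption splits roughly 50/50."""
    total = sum(consumed_per_key.values())
    running = 0.0
    for key in sorted(consumed_per_key):
        running += consumed_per_key[key]
        if running >= total / 2:
            return key                    # keys up to here go to the left partition

# Hypothetical per-key RCU consumption within one hot partition.
traffic = {"0x1a": 200, "0x2b": 1400, "0x3c": 900, "0x9d": 500}
print(choose_split_key(traffic))          # splits near the hot keys, not mid-range
```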
One thing we were still hearing from customers was: DynamoDB has figured out a lot of things for us, but we come from a world where we have always thought in terms of servers, and now you’re asking us to think in terms of read capacity units and write capacity units. Can you simplify that further? So instead of asking customers to specify provisioning at table creation time, we launched on-demand mode, where you don’t have to specify it at all. All the innovations around bursting, split for consumption, and global admission control enabled us to launch on-demand mode, where you just create a table, start sending requests, and pay per request.
Key Lesson
The key lesson here is that designing a system that adapts to the customer’s traffic pattern is the best experience you can provide to a customer using a database, and DynamoDB strives for that. We did not get this right the first time. We launched with the assumption that traffic would be uniformly distributed, but realized that traffic distributions are actually non-uniform over both time and space. By analyzing those problems and making educated guesses, we evolved the service and solved them, so that all the heavy lifting, all the essential complexity, is moved away from the customer into the service, and customers just get a magical experience.
How Does DynamoDB Provide High Availability?
DynamoDB provides high availability; let’s look at how. DynamoDB has evolved, and a lot of customers have moved their mission-critical workloads onto it. In AWS, service disruptions do happen, and in 2015 DynamoDB had a service disruption. As I said in the beginning, whenever a service disruption happens, we try to learn from it. The goal is to make sure the impact we saw doesn’t repeat, that the system’s weaknesses are fixed, and that we end up with a more highly available service. One of the learnings from that particular COE was that we identified a weak link in the system, related to caches. These are the metadata caches in the DynamoDB system. One thing about caches is that they are bimodal: there are two routes a request can take. One is a cache hit, where requests are served from the cache; in the case of metadata, everything the request routers wanted was being served from the cache. Then there is the cache miss case, where requests go back to the underlying database. That’s what I mean by the bimodal nature of caches. Bimodality in a distributed system is a volcano waiting to erupt. Why do I say that?
Going back to our PutItem request: whenever a customer made a request to DynamoDB to put or get an item, the request router is the first service where that request lands. The request router has to find out where to route the request, that is, which storage nodes host that customer’s table and partition, so it hits a metadata service. To optimize that, DynamoDB also had a partition map cache in the request routers. The idea is that since partition metadata doesn’t change often, it’s a highly cacheable workload; DynamoDB saw about a 99.75% cache hit ratio from these caches on the request routers. Whenever a request landed on a brand-new request router, it had to go and fetch the metadata. Instead of asking for the metadata of just the specific partition, it would ask for the metadata of the full table, assuming that the next time the customer made a request for a different partition, it would already have that information. That metadata response could be as large as 64 MB.
We don’t have just one request router; as I said, we have many. If a customer creates a table with millions of partitions and starts sending that many requests, they’ll hit not just one request router but many. All those request routers then start asking for the same metadata, which means you reach a system state where a big fleet is talking to a small fleet. The fundamental problem is that if there is nothing in the caches, that is, the cache hit ratio drops to zero, you have a big fleet driving a sudden spike of traffic to a small fleet. In steady state the miss rate was 0.25%; if the caches become ineffective, traffic to the metadata fleet jumps to 100%, which is a 400x increase, and that can lead to cascading failures in the system. What we wanted was to remove this weak link so that the system always operates in a stable manner.
How did we do that? We did two things. First, as I said, in the first version of DynamoDB, whenever a request router found it had no information about the partition it wanted to talk to, it would load the full partition map for the table. The first change was that instead of asking for the full partition map, it asks only for the particular partition it is interested in. That was a simpler change, so we were able to do it faster. Second, we built an in-memory distributed datastore called MemDS. MemDS stores all the metadata in memory; think of it like an L2 cache. All the data is stored in a highly compressed manner. The MemDS processes on a node encapsulate a Perkle data structure, which can answer questions like: for this particular key, which partition does it land in? MemDS responds with that information to the request router, and the request router routes the request to the corresponding storage node.
We still did not want to impact performance by making an off-box call for every customer request, so we still wanted to cache results. We introduced a new cache on the request routers, called the MemDS cache. The one thing we did differently, and this is the critical part, is that even when there is a cache hit in the MemDS cache, we still send the lookup traffic to the MemDS system. What that does is generate a constant load on the MemDS system, so there is no case where the caches suddenly become ineffective and traffic suddenly rises; the system always behaves as if the caches were ineffective. That is how we removed the weak link of the metadata fleet being overwhelmed by requests landing from many request routers. Overall, the lesson is that designing systems for predictability over absolute efficiency improves stability. While caches can improve performance, do not allow them to hide the work that would be performed in their absence; ensure your system is provisioned to handle the load that arrives when the caches become ineffective.
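The constant-work idea can be sketched as: consult the local cache for latency, but send a lookup to MemDS on every request regardless, refreshing the cache in the background on a hit. The memds_client object and its lookup method below are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

_refresh_pool = ThreadPoolExecutor(max_workers=8)

def route_lookup(key, local_cache, memds_client):
    def refresh():
        local_cache[key] = memds_client.lookup(key)

    if key in local_cache:
        _refresh_pool.submit(refresh)     # still query MemDS on a hit: constant load
        return local_cache[key]

    refresh()                             # miss: fetch synchronously, then serve
    return local_cache[key]
```

Whether the hit ratio is 99.75% or zero, MemDS sees one lookup per request, so a cold cache cannot turn into a sudden traffic spike downstream.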
Conclusions and Key Takeaways
We have already talked through the first and second conclusions. The first: adapting to customer traffic patterns improves their experience. We saw that in the problems we solved with global admission control, bursting, and finally on-demand. Second: designing systems for predictability over absolute efficiency improves system stability. That’s the one we just saw with caches, which are bimodal, and why it’s important to make sure the system generates predictable load; failure scenarios are tamed by provisioning for the maximum load you would have to handle. The third and fourth takeaways are covered in much more detail in the paper. Third: DynamoDB is a distributed system with many storage nodes and request routers, and performing continuous verification of data at rest is necessary; that’s the best way we have found to meet our high durability goals. Last: maintaining high availability as the system evolves may mean touching the most complex parts of your system. In DynamoDB, one of the most complex parts is the Multi-Paxos protocol, and to improve availability we had to make changes at that protocol layer. The reason we could make those changes with confidence was that we had formal proofs of these algorithms, written from the original days of DynamoDB. Since we had a proof of the working system, we could tweak it and make sure the new system still ensures correctness and that all the invariants are met.
Questions and Answers
Anand: What storage system does metadata storage use?
Vig: DynamoDB uses DynamoDB for its metadata storage needs. The software that runs on the storage nodes is the same software that runs on the metadata nodes. As we have evolved, as I mentioned with MemDS, think of MemDS as an L2 cache on top of that metadata, introduced to address the scaling bottlenecks we saw with partition metadata.
Anand: You’re saying it’s just another Dynamo system, or it’s a storage system? Does the metadata service come with its own request router and storage tiers?
Vig: No, it’s just the storage node software. The request router has a client that talks to the storage nodes, and it uses that same client to talk to the metadata nodes, so whenever a request comes in, it’s exactly the same code path that runs for a production customer accessing DynamoDB.
Anand: Is MemDS eventually consistent? How does a change to a cached value, or the addition of a new value on one MemDS node, get replicated to the other nodes?
Vig: Think of MemDS like an L2 cache, and that cache is updated not as a write-through cache but as a write-around cache. Writes to the metadata fleet are very low throughput; partitions don’t change often. Whenever a write happens on the partition table in the metadata store, those writes flow through streams and are consumed by a central system, which pushes the updated values to all the MemDS nodes. This whole process happens within milliseconds. As a change comes in, it gets replicated to all the boxes. The component in the middle is responsible for making sure all the MemDS nodes get the latest values, and it doesn’t move forward until the change is acknowledged by all the nodes.
Anand: Do you employ partitioning in master-slave models there, or are there agreement protocols? Does every MemDS node have all of the data?
Vig: MemDS has all the metadata. It’s basically vertically scaled, because the partition metadata, as I said, is tiny. The main thing is the throughput we have to support, which is very high; that’s why we keep adding more read replicas. It’s not a leader-follower configuration; every node is eventually consistent. It’s a cache, scaled specifically for the reads that metadata requests generate when customers send requests, or when any other system wants to access the partition metadata.
Anand: So there’s some coordinator in the MemDS system: when a new write request comes in, that coordinator is responsible for ensuring that the entire fleet gets that data.
Vig: Exactly.
Anand: It’s the coordinator’s job; it’s not a peer-to-peer, gossip-type protocol. That coordinator is responsible for durable state and consistent state. Each node, once it gets the data, is durable, but that’s just for consistent state transfer.
Vig: Yes. Durable in the sense that it’s a cache, so when it crashes, it will just ask another node in the system to get up to speed and then start serving requests.
Anand: That coordinator, how do you make that reliable?
Vig: All these writes are idempotent. You could have any number of these guys running in the system at any time, and writing to the destination. Since the writes are idempotent, they always monotonically increase. It never goes back. If a partition has changed from P1 to P2, if you write again, it will say, “I have the latest information. I don’t need this anymore.” We don’t need a coordinator. We try to avoid that.
Anand: You don’t have a coordinator service that’s writing to these nodes?
Vig: We don’t have the leader and follower configuration there.
Anand: You do have a coordinator that all requests go to, and that keeps track of the monotonicity of the writes, like a write ahead log. It’s got some distributed, reliable write ahead log. It’s just somehow sending it.
When a partition grows and splits, how do you ensure the metadata in the MemDS cache layers gets updated consistently?
Vig: It’s eventually consistent. The workflows that do partition splits and similar operations wait until MemDS is updated before flipping the information there. The other thing is, even if the data is not yet in MemDS, storage nodes have a protocol to respond to the request router saying, “I don’t host this partition anymore, but this is the hint that I have, so maybe you can go talk to that node.” For the edge cases where this information is delayed, we still have mechanisms built into the protocol to update the coordinator.
Anand: Each partition also has some information of who else has this data?
Vig: MemDS node, yes.
Anand: Also, it’s a little easier because, unlike the DynamoDB data path, this can be eventually consistent. In DynamoDB, people care about consistency over availability and other things. Essentially, we’re saying that when you split, the metadata doesn’t have to be replicated everywhere immediately; it’s done over time.

MMS • Charity Majors
Article originally posted on InfoQ. Visit InfoQ

Transcript
Introductions [00:05]
Shane Hastie: Good day, folks. This is Shane Hastie, with the InfoQ Engineering Culture Podcast. Today I have the privilege and pleasure of sitting down, across many miles with Charity Majors. Charity is the CTO and co-founder of Honeycomb.io. Listeners to the Architecture Podcast on InfoQ will be very familiar with her voice, and people who come to the QCon conferences will have met Charity many times, but this is the first time we’ve got Charity on the Culture Podcast.
So, Charity, welcome. Thanks so much for taking the time to talk to us.
Charity Majors: How is that the first time? Oh my goodness. Thank you so much for having me.
Shane Hastie: There’s a few people out there who might not have heard of you. Who’s Charity?
Charity Majors: Well, I am an infrastructure engineer, is how I think of myself. I am co-founder, as you said, of honeycomb.io. I was the CEO for the first three years, CTO now. I live in San Francisco. I took up hand lettering and art over the pandemic as my hobby, and I’m currently in London, so I just gave a talk at WTF is SRE conference. It was super fun.
Shane Hastie: This is the Culture Podcast. You made a beautiful statement when we were chatting earlier. Culture is meaningless when somebody says your culture is broken. So how do we tease out what a good “culture” looks like?
Culture is both formal and designed and informal and emergent [01:32]
Charity Majors: Yes, it bugs me whenever people are like, culture is this, culture is that, because it can mean anything and everything. If you look it up in Google it’s everything from the rules and regulations to the practices and the habits, and the informal and the formal, and it’s just like you’re saying nothing at all when you say culture. And so I think it’s really important to get more specific.
One of the ways of differentiating here is, I think, when you’re talking about companies, it’s important to differentiate between the formal culture and the informal culture. The formal culture of the company is everything that managers are responsible for. It’s everything from how we do payroll is part of your culture, your vacation policy is part of your culture, whether or not you train your managers is part of your culture. The bounds of acceptable behavior, what people can do and not get fired for, is culture.
And then there’s the informal culture, which, I think, is much more bottoms up. It’s informal, it’s not in the employee handbook. It is often playful, and fun, and anarchic, and chaotic. It’s the in-jokes and the writing your release notes in Limerick form, or it’s all the things that you bring about your character and your personality that you bring to work with you and you play off each other. When we talk about culture at work, I think, we often think about the informal culture because it’s what jumps to mind, but in fact they’re both, I think, really important.
Shane Hastie: As a leader, how do I create the frame, perhaps, for that informal culture to be generative?
Leadership actions create culture irrespective of your intent [03:06]
Charity Majors: Yes, the thing that I think that sometimes escapes people’s minds is that if you are, I don’t really like to use the term leader as a synonym for manager because they’re not, but as managers you are being paid by the org to do certain things on behalf of the organization, and whether you like it or not, whether you think about it or not, your actions are creating culture. It’s just like how the President of the United States, everything the president says is policy for the executive branch. You can’t get around it. And whether you’re intentionally creating culture or not, you are creating culture because people look to you and they see what you accept or what you don’t.
My friend Emily Nakashima, who’s our VP of Engineering, wrote this amazing blog post, called Power Bends Light. It’s about her experience going from being an engineer to a manager, and how suddenly she started noticing, she has a weird sense of humor, suddenly as a manager people started laughing at her jokes. They had never thought she was funny before and now suddenly they thought she was funny. She was like, at first this really bugged me and then I just realized it’s how we experience power. It’s subconscious, it just is. We’re monkeys, and our power dynamics play out in these really subtle ways.
So, I think that the culture that we, as managers and formal leaders of the company, are tasked with is creating a healthy organization. Pat Lencioni, who’s the author of Five Dysfunctions of a Team and a bunch of other books, I feel like his book, called The Advantage, is the single best book I’ve ever read about this. It’s all about how we, as companies, we are so obsessed with becoming smart companies, with our strategy and our tactics and all this stuff, and we’re not nearly focused enough on becoming healthy companies. He makes the point that health trumps and often begets success, because most organizations are using just a tiny fraction of the overall intelligence and wisdom of their org, but healthy ones can tap into almost all of it.
It’s like families. You think of the super dysfunctional family. It doesn’t matter how smart they are, the kids are going to be fucked up. But a healthy family, they don’t have to be the top 10% of IQs because they learn from their mistakes. They communicate in healthy ways, they have affection and compassion for each other, and so they’re just able to be much more highly functional, and they’re likelier to be successful.
And so, I feel like the job of leadership is, number one, we have fiduciary responsibilities to make the organization successful, which means that we need to make it healthy. And an organization that is unhealthy is maybe easier to spot. It’s the ones where projects are getting canceled and people are doing political stuff, or they’re unsure where they’re going, or they’re not on the same page as the teams that are next to them, or they’re spending a lot of resources trying to argue about something. These are all super obvious visible ways that the unhealthiness of our culture manifests.
I’ve been talking for a long time so I’m going to pause there.
Shane Hastie: And I’m thoroughly enjoying listening to what you say. So, creating that space for the culture to emerge.
Charity Majors: Yes.
Shane Hastie: Actually, let’s delve into one statement you made, that is, I think, quite important. Leader manager, not synonyms. How do they show up in the workplace?
Leader and manager are not synonyms [06:27]
Charity Majors: The first blog post I ever wrote that really took off was the one called The Engineer/Manager Pendulum, and it was in 2017, I think. I wrote it for a friend, actually, who was the director of engineering at Slack at the time, and was hating his job, but really reluctant to go back to being an engineer because he liked to have the impact, and he liked to have the power and the control just to shape the technology and the team, and it felt like going back to being an individual contributor would be a huge step back in his career. The piece that I wrote, the argument that I was making, is that the most powerful technologists in our industry tend to be people who have gone back and forth, not just once or twice, but multiple times, because you accrue these skill sets of both the tech itself but also the people skills, and the merging of the two, and the navigating the organization, and you just get stronger and stronger as you go back and forth.
I feel like lots of managers who go back for the first time are shocked to see that when they’re an engineer after having been a manager, they’re treated differently and they operate differently in the workplace because they have these skills. People still look to them for this people leadership, even though they’re technically responsible only for the technical outcomes. I feel so, so strongly about this topic that your leaders are not just your managers, and in fact that management is not a promotion. Management should not be a promotion, it should just be a change of career. You’re peers with the people that you’re “managing”. Even though there is a hierarchy, there is a need for decisions to be made in an organized manner. I feel like technical contributors should be responsible for technical outcomes. Managers should be responsible for organizational outcomes. Managers are responsible for making sure that a decision gets made, but they don’t sit there making all the decisions. Good ones don’t, at least, and the ones who do rapidly find themselves without anybody who wants to be on their team.
I feel like this is a concept that is starting to really catch on and even go beyond the bounds of just engineering teams, because you really want people doing their best work to be in the part of the organization where they feel most engaged and challenged and excited. And honestly, people who just do the management track for too long really lose touch with a lot of that, with the hands-on brilliance, and the genius of making things and making things work, and how your system is built, and how it’s structured. And in engineering at least, I think, the best line managers I have ever known have never been more than five years away from writing code in production. You really either need to move up the ladder as a manager, or you need to go back to the well periodically to refresh your skillset.
Shane Hastie: That has to be a very deliberate and conscious choice, doesn’t it?
Shifting between engineer and manager and back needs to be a very deliberate choice [09:19]
Charity Majors: It does. I feel like we’re just starting to see the first generation of technologists who have intentionally done this, and the results are phenomenal because you want to retain the talent who feels compelled to be good at their jobs, who feels compelled to stay on the sharp edge, who feels compelled. And so you don’t want your managers to be like, “Well, I’ve been doing this for two or three years, I’m bored. I guess I have to leave in order to be challenged again, or in order to be hands-on again.” You don’t want that. And you also don’t want people to join the manager track because they feel like they have no power otherwise, they don’t get to make decisions otherwise, they don’t get responsibility or accountability otherwise.
On both sides of the coin, as an organization, I feel like we are just starting to acknowledge that we need to restructure the entire way we think about compensation, job ladders, the career advice that we’re giving to our top performers, or any performers. It goes along with this that you also, I think, want to lower the barriers to becoming a manager. I think anybody who’s interested in being a manager should get a chance to at least build those skills. There might not be a seat in the org for them to join, but management is not an if then, it’s not an either or. It’s just a bunch of skills. If you’re interested in management let’s hook you up with an intern. Maybe you can run our intern program, maybe you can lead some meetings, maybe you can take over while this manager goes out on a four-month maternity break. There is never any shortage of work for managers to do, and if we can just level the playing field and make it…
I feel like a lot of times it’s been like if you want to go into management you sit here crossing your fingers and hoping to be tapped from above, and just be like, “You have been chosen to be our manager.” That’s bullshit. That’s creating some artificial scarcity that just doesn’t need to exist. You should be asking everyone on your team if they’re interested in being a manager, and if they’re interested hook them up with some practice skills. Sometimes they’ll be like, “No. Whoa. Tried that. Definitely not for me.” Sometimes they’ll be interested. I feel like everybody wins when we demystify management, and then we suck the hierarchy out of it as much as we can.
Shane Hastie: For the person who’s at that three to five year cusp, they’ve maybe done the pendulum a couple of times and now they are thinking, do I want to actually go deep into that management and start to climb up the hierarchy? What do they need to do to position themselves, and to move into that space?
Progressing through organisational hierarchy [11:51]
Charity Majors: A lot of it comes down to the randomness of fate and opportunities. You need to be working at a place. I mean, if you just look at the number ratios. If you need one manager for every, let’s say, seven engineers, and if you need one director for every, say, two to five managers, and you need one VP for N directors. You can see that the opportunities get scarcer and scarcer as it goes up, so you might need to change companies. And this is something where I feel like there’s this taboo about expressing that you’re interested in it because it’s seen as a blessing, or something, and it’s almost uncouth to express too much excitement for going up the ladder. Which is again, I feel like we drain the hierarchy out of this stuff, we make it more acceptable to talk about this stuff openly.
If you’re interested, I think that one of the first steps is to ask people around you if they think you’d be good at it. One of the things that I feel like if you’re an engineer and you’re interested in being a manager, you should know if people feel like they’d like to report to you or not, because that’s a pretty decent signal whether you’ll be any good at it or not. And maybe people don’t think you will, in which case you’ve got some work to do right there. There’s that. Do people naturally want to follow your lead? There is the opportunity aspect. And if you’re, say, a director at a mid-level company and the company’s not growing and your boss doesn’t seem like they want to go anywhere, you might want to try joining a smaller company, or a different company that’s growing. Typically, if you go from a mid-level company to a small company or big to a mid-level, you can typically jump up one level. You can go from a director to a VP, or a manager to a director, so that’s a way to get it on your resume.
Managing managers is an entirely different role to managing engineers [13:32]
I know I’m just jumping all over the place here. But one thing that I think is a little mysterious to some people is that going from being an individual contributor, as I see, to a manager versus going from being a manager to a manager of managers, they’re almost as big of jumps as each other. Managing managers is an entirely different job than managing engineers. This is underappreciated, because, as managers, each of you who’s managing people has a way, it’s usually through intuition, of interacting with the world that makes people want to follow your lead. And usually this is very intuitive, and usually you don’t even know how it works really, but once you start managing managers, they all have their own unique way of doing things that makes people want to follow their lead. And so now that you’re trying to help them figure out how to debug problems you need to think about management and leadership in entirely new and different ways. It’s up a level of abstraction, and it means having to relearn the job from first principles.
Another thing that I would say about this, and then I think I’m done, is that almost everybody who becomes a manager, who starts on this ladder, almost everybody assumes that they want to go to the top. Everybody is like, “Yes, I want to be a CTO someday. Yes, I want to be VP someday. Of course I want the opportunity to be a director.” There’s nothing wrong with that. It’s very natural. We see a ladder, we want to climb it. We’re primates. But, in my experience, the overwhelming majority of people get to a point and they realize that they hated it, they didn’t want it, they don’t want to go any higher. And that’s fine too. My only advice would be to people to be sure and check in with yourself. It takes a year or two, maybe three, to really settle into the new gig. But really, after that, check in with yourself. What brings you joy? Does something bring you joy? Does anything bring you joy? What are the threads you want to pull on for your next steps that will help you lean harder into that joy?
Because I know way too many people, and so many of my friends are these ambitious types, I’m a dropout, so I don’t really get this, but the people who are fucken ladder climbers from birth to grave, and they’re just like climb, climb, climb, climb. They get to something resembling the top and then they realize that they’re miserable. While it is easy to go back and forth between manager and engineer, it is real hard to go back from, say, a VP to engineer if you haven’t done it in 10 years. And I know so many people who have literally done just that.
My friend Molly, I wrote a blog post about her, came in as our VP of Customer Success or whatever, then her husband struck it rich when his company IPO’d, and she suddenly had this real come-to-Jesus moment and realized, “I’m jealous of the software engineers. All I really want to do is sit there and write code. I hate what I do.” It took her a few years. We moved her to support. She wrote code on the side and everything. She finally managed to move back to software engineering, and she is happy as a frog in the pond. But it was rough, and it’s really hard to find those routes back. So I think it’s really important to check yourself, and remember that ladder climbing is in and of itself not actually fulfillment.
Shane Hastie: Great points there. If we can bounce topics a little bit, AI. Generative AI versus AIOps. I know you’ve got opinions, so here’s a platform.
The challenges in and around AIOPS [16:53]
Charity Majors: One or two. I’m on record as being pretty spicy about AIOps in the past, and I think a lot of people think that I’m just an AI hater. And I thought maybe I was too, but then generative AI came along and I’m like, oh no, there’s some real stuff here. This is going to change our industry in the next couple of years. It’s going to change our industry really fast. And so I’ve had to stop and think about, okay, what was it that I hated about AIOps, and what is different about this? For those who missed it, this is Friday, May 5th, and on Wednesday we shipped our own generative AI product that lets you generate and execute queries using natural language against your observability data. What’s slow about the thing that I just deployed, or where are the errors? It’s great. It’s such a democratizing leveling feature.
AIOps, okay, so I think the way that I’m thinking about it is this, the last 20, 30 years of software development have been about the difficulty of writing and building software. It’s been hard. And so it’s been hard enough that I think that it’s obfuscated or hidden from us the reality that it has actually always been harder to run, and maintain, and extend, and understand our systems, than it has been to build them. But because the upfront cost of building was so high we got to stuff that onto the ops team, or amortize those costs down the road a bit. We could stuff it under the quilt and forget about it. But now that building is getting so easy, I feel very strongly that the next five to 10 years or so are going to be all about understanding software, all about understanding what you’ve just merged, understanding what you’re writing, understanding what your tests are doing, understanding while you write, after the fact, after you write, and everything in between, understanding what your users are doing.
And so I think that my beef with AIOps has been about the fact that they so often do things or claim to be removing the need to understand. Making it so that they’re like, you don’t need to understand your software. This AI is going to understand your software, and it’s going to take action, or it’s going to do the right thing, or it’s going to alert you, or whatever. I feel like that is not only wrong, but it’s harmful. It’s harming you, not even the long term, in the midterm. It is harming you in your ability to understand your software, or explain your software, or migrate, or use, or extend your software. And I feel like there are lots of great uses for AI, but they come in form of helping us understand our software.
Liz Fong-Jones uses this example of we don’t want to build robots to do things for us. When AI goes off and does something, it is notoriously difficult, bordering on impossible, to go back and understand what it did or why. This is not a path that’s going to lead us to great things. But there are paths where, like Liz says, we’re not building a robot, we’re building a mech suit. Like a transformer suit that you can get into, with big ass limbs and everything, but you are still the brain of the thing. You’re still making decisions, because machines are great at crunching numbers and any machine can tell you if there’s a spike in this data or not, but only people can attach meaning to things. Only you can tell me if that was a good spike, or a bad spike, or a scary spike, or an unexpected spike, or what. And so it’s about giving people tools with generative AI to help them, like ours does: you know what you want to ask, but it was really complicated to use the query browser.
We help you ask the question so that you can understand it better. And here’s where a lot of CTOs out there, if you get them a little drunk, they will be like, “Yes, actually I am willing to buy things, or I want to buy things that tell me that my people don’t have to understand the systems, because,” and this freaked me out when I first heard it from someone, “because people come and go, but vendors are forever.” What the fuck? They’re literally saying, no, we don’t want to invest in making our people smarter and more informed and everything, because we know we can sign a multimillion dollar contract and that’s going to be more reliable to us than our actual people are. And while I get the logic, come on, this is not leading us down a path to success or happiness. We have to invest in making our people smarter, and better informed, and better able to make decisions and judgments, because at the end of the day, someone is going to have to understand your system at some point. They just are.
Shane Hastie: I would go back even further and say that there’s another thing we’ve not been good at. Fred Brooks said it a while back, “The hardest thing about building software is figuring out what to build.”
Charity Majors: Yes. Oh, my goodness. That is so true.
Shane Hastie: How does generative AI help us with that, or does it?
The limits and potential of generative AI in software engineering [21:46]
Charity Majors: That’s a great question. I don’t think we know that yet. I don’t think it does. I really don’t think it does, and I would be really dubious of any products they claimed to. Yes, I mean, these tools are powerful. They’re really freaking powerful. People who haven’t used them, they’re incredibly powerful. But at the end of the day, cleverness has never been the most important thing. In fact, it’s often overrated. At the end of the day, I believe that we need to build systems that are comprehensible. We need to understand who we’re building them for, why we’re building them, how we’re going to make money off of them, and how to fix them. And there’s a lot of really powerful use cases in there for generative AI and other super powerful tools, but I don’t want a machine telling me any of those things.
“A computer can never be held accountable, therefore a computer must never make a management decision.” [22:38]
Maybe 10 years from now, but I think that as far as we can see down the pipe, this is not a thing that we should look to them for. If for no other reason than, ultimately, there was this quote that Fred Hebert dug up that came from an IBM slide deck in 1979. It was, “A computer can never be held accountable, therefore a computer must never make a management decision.” I think that applies in a lot of different ways. A computer cannot be held accountable for the rise and fall of your stock, therefore a computer can’t tell you what to build. I think that we’re a long way from holding computers accountable. So, I think, that capability and accountability should go hand in hand. And so, for the foreseeable future, I think that we need to not just make the decisions, but ensure that we have the detail that we need to make good decisions, which again, is where AI can definitely help us. But being clear on the whys is, I think, at the root of everything.
Shane Hastie: And just to delve into a topic while I’ve got you, platform engineering. You made the statement, “We need a new kind of engineer.” Please expand.
Platform engineering is the grand reunification of DevOps [23:48]
Charity Majors: Yes. I’m walking a very narrow line here because there’s a company in particular out there who’s been making very inflammatory statements about how DevOps is dead, and screw those people. That’s not true. I mean, oh my God. Ugh, wait, clickbait. But I would go so far as to say that the DevOps split never really should have happened, and I understand why it did. There was too much complexity, too much surface area, et cetera. But fundamentally, and this does go back to accountability and responsibility again, because the people who are writing the code need to also run the code. It’s that fundamental. And the more complicated things get the more we’re running back to this, because the people who don’t build the system have no hope of understanding and debugging the system, and the people who don’t run the system have no hope of building a runable system.
Along those lines, I feel like what we are seeing is a grand reunification, and I think that the platform engineers are at the tip of the spear here. Every engineer should be writing code, and every engineer should be running the code that they write. And where platform engineering is, I think, really exciting, is number one, these tend to be engineers with deep background in both operations and software engineering. Number two, I think it’s an engineering team that when done correctly, when done well, isn’t actually owning reliability. It’s like the first ops team that doesn’t actually own reliability. Instead, your customers are not the customers, your customers are your internal developers. And your job is not to keep their code running at four 9s, or whatever, your job is to see how quickly and easily they can write code and own their own code in production. And then it’s their job to be responsible for however many 9s in the SLOs and the SLIs and everything. And I think this is awesome.
I also think that right now we’re at a stage where you can’t have junior platform engineers. You really have to have experienced platform engineers. But it’s also, I think, the highest leverage engineering that anyone’s ever been able to do. Because the platform engineer, you are sitting here leveraging the work of vendors who have tens, hundreds, even thousands of engineers working on this product, and you write a very few lines of code or Shell script or whatever, and make it accessible to your entire organization, which is, I jokingly call it vendor engineering, but it’s incredibly powerful. And it involves taste, I think, as much as it involves raw engineering. You need to know how to build a thin layer, or an API, or an SDK, or something that empowers everyone internally, while providing this consistent look and feel, the right conventions and everything, to leverage everything this vendor has to offer. Then I just think it’s a really exciting place to be right now.
Shane Hastie: That’s a huge cognitive load.
One of the most important jobs of platform engineering is to manage cognitive load [26:38]
Charity Majors: It is. This is why, I think, one of the number-one jobs of platform engineers is to manage that load and to manage it very consciously. I think that you have to be constantly shedding responsibilities, because otherwise you’re going to be constantly adding responsibilities. Anytime it turns into having to build a software product, it’s time to offload it to another team, because platform engineers do not have time to write and own products. They can spec them out, they can make dummies, they can prototype, but they do not have the cycles to do that.
Yes, I think it’s a very interesting, very new space, and it’s really exciting. I tweeted jokingly a couple of weeks ago, something like, “A team that you respect has just announced that they’re building a startup for platform engineering. What are they building?” And the responses were all over the map. I really think that platform engineering is probably like DevOps, in that you can’t build a platform engineering product, because it’s a philosophy, it’s a way of operating. It’s a social invention, not a technical invention. I think there are a lot of technical challenges and conventions being hammered out on the ground right now, as we speak.
Shane Hastie: Charity, great conversation, and I wish we had plenty more time, but I know it’s late for you, and we have limits on how long the podcast can be. If people want to continue the conversation where can they find you?
Charity Majors: Well, I am still on Twitter, @mipsytipsy. I plan on checking out Blue Sky this weekend, but we’ll see. There’s also my blog at charity.wtf, and of course there is the Honeycomb blog, honeycomb.io/blog, where I and others write quite a lot about things, not just concerning Honeycomb, but also observability, and platform engineering, and stuff all over the map.
Shane Hastie: And we’ll make sure to include all of those links in the show notes.
Charity Majors: Awesome.
Shane Hastie: Thank you so much for taking the time to talk to us today.
Charity Majors: Thank you for having me. This is really fun.

MMS • Robert Krzaczynski
Article originally posted on InfoQ. Visit InfoQ
The release of System.ServiceModel 6.0 provides client support for calling WCF/CoreWCF services. These NuGet packages, collectively known as the WCF client, enable .NET platform applications to interact seamlessly with WCF or CoreWCF services. Although the .NET Core 3.1 platform and later versions do not include built-in WCF server support, CoreWCF, a separate community project based on ASP.NET Core, fills this gap by providing a WCF-compliant server implementation.
The packages included in the 6.0 release are listed in the official announcement (source: https://devblogs.microsoft.com/dotnet/wcf-client-60-has-been-released/).
The latest version introduces support for named pipes, compatible with both WCF and CoreWCF implementations. NetNamedPipeBinding facilitates binary communication between processes on the same Windows machine. However, named pipes are only available on Windows and are not available on Linux or other non-Windows platforms. To fill this gap, CoreWCF is developing Unix domain socket support to offer equivalent functionality on Linux. The WCF client will be updated alongside that CoreWCF release to ensure seamless coordination.
Starting with the 6.0 release, the WCF client package no longer supports .NET Standard 2.0 and is exclusively for .NET 6.0 and later. This change enables the use of newer functionality available in .NET 6 that is not present in the .NET Framework. By focusing on the .NET 6 platform, the size and complexity of the package are reduced, simplifying the deployment process. Applications or libraries that need to maintain .NET Standard support can continue to use the System.ServiceModel.* packages in version 4.x, or use conditional references: the 4.x packages when targeting the .NET Framework, and these NuGet packages for .NET 6 and later.
Furthermore, the System.ServiceModel.Duplex and System.ServiceModel.Security packages are no longer required because their types have been merged into the System.ServiceModel.Primitives package. This change eliminates the need for the mentioned packages, as the type forwarders will always reference the version of the Primitives package used by the application.
This release raised mixed reactions in the community. Antonello Provenzano commented below the Facebook post:
WCF? It is 2023: REST, GraphQL, gRPC, AMQP… which is the legacy system using WCF that has to be updated to .NET Core?
If the goal is updating the WCF service interface, then there’s much more behind that interface that has to be updated, causing a big bang in the software of the system (presumably a monolith).
If instead, it’s a brand new implementation, why choose WCF to expose the API, when the other ones available are mature and well-supported by an entire supply chain?
Another user, Jozef Raschmann, answered:
Calm down. All our government services are based on WCF and its security extension WS-*. Still, there isn’t a standardized alternative in the REST world.

MMS • Steef-Jan Wiggers
Article originally posted on InfoQ. Visit InfoQ

AWS continues to bring new runtime versions to AWS Lambda, recently announcing support for the Ruby 3.2 runtime.
Ruby 3.2 introduces various enhancements and performance upgrades, such as advancements in anonymous argument passing, the addition of ‘endless’ methods, improved regular expressions, the introduction of a new Data class, enhanced pattern-matching capabilities in Time and MatchData, and the inclusion of ‘find pattern’ support within pattern matching.
Praveen Koorse, senior solutions architect, AWS, writes in an AWS Compute blog post that Ruby 3.2 enhances the handling of anonymous arguments, simplifying and streamlining the use of keyword arguments in code. Before this update, passing anonymous keyword arguments to a method involved using the delegation syntax (…) or employing Module#ruby2_keywords and delegating *args, &block; this approach was less intuitive and lacked clarity, particularly when dealing with multiple arguments.
If a method declaration now includes anonymous positional or keyword arguments, they can be passed to the next method as arguments. The same advantages of anonymous block forwarding apply to rest and keyword rest argument forwarding.
def keywords(**) # accept keyword arguments
foo(**) # pass them to the next method
end
def positional(*) # accept positional arguments
bar(*) # pass to the next method
end
def positional_keywords(*, **) # same as ...
foobar(*, **)
end
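To make the forwarding concrete, here is a brief usage sketch; the foo and bar implementations are hypothetical stand-ins added for illustration and are not taken from the AWS post.
def foo(name:, greeting: 'Hello') # hypothetical target method receiving forwarded keywords
  puts "#{greeting}, #{name}!"
end
def bar(a, b) # hypothetical target method receiving forwarded positionals
  puts a + b
end
keywords(name: 'Ruby', greeting: 'Hi') # forwarded to foo, prints "Hi, Ruby!"
positional(2, 3) # forwarded to bar, prints 5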
Another improvement is the “endless methods” feature introduced in Ruby 3, which empowers developers to define methods with just one statement using the syntax def method() = statement. With this syntax, there is no need for an explicit end keyword, allowing methods to be defined as concise one-liners. This enhancement simplifies the creation of basic utility methods and contributes to writing cleaner code, enhancing code readability and improving maintainability.
def dbg_args(a, b=1, c:, d: 6, &block) = puts("Args passed: #{[a, b, c, d, block.call]}")
dbg_args(0, c: 5) { 7 }
# Prints: Args passed: [0, 1, 5, 6, 7]
def square(x) = x**2
square(100)
# => 10000
To deploy Lambda functions using Ruby 3.2, developers can upload the code via the Lambda console and choose the Ruby 3.2 runtime. Alternatively, they can leverage the AWS CLI, AWS Serverless Application Model (AWS SAM), or AWS CloudFormation to deploy and administer serverless applications coded in Ruby 3.2. In addition, developers can also use the AWS-provided Ruby 3.2 base image to build and deploy Ruby 3.2 functions using a container image.
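For context, a minimal handler for such a function might look like the sketch below; the lambda_function.rb file name and lambda_handler method follow the Ruby runtime’s common default handler convention (lambda_function.lambda_handler) and are illustrative rather than taken from the AWS post.
# lambda_function.rb - a minimal sketch of a Ruby 3.2 Lambda handler
require 'json'
def lambda_handler(event:, context:)
  # The runtime passes the invocation payload as `event:` and invocation metadata as `context:`.
  {
    statusCode: 200,
    body: JSON.generate({ message: 'Hello from Ruby 3.2', input: event })
  }
end
Packaged as a zip file, this could then be created with the AWS CLI by passing --runtime ruby3.2 and --handler lambda_function.lambda_handler, along with the usual function name, role, and package arguments.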
Tung Nguyen, founder of BoltOps Cloud Infrastructure Consultancy, tweeted:
Deployed a Jets v4 demo app on AWS Lambda Ruby 3.2. It was a success! No more need to use a Custom Runtime Lambda Layer.
In addition, a respondent to a Reddit thread about Ruby 3.2 support in AWS Lambda stated:
Use containers and stop worrying what AWS supports. 🙂
With another one responding:
And manage my own runtime? Eh
Competing Function-as-a-Service (FaaS) offerings vary in their Ruby support: Azure Functions does not support any Ruby runtime version natively, although users can leverage older Ruby runtimes through a workaround. On the other hand, Google Cloud Functions does support Ruby runtimes, including 3.2, which the company recommends using.
Lastly, Ruby 3.2 is Ruby’s latest long-term support (LTS) release. AWS will automatically apply updates and security patches to the Ruby 3.2 managed runtime and the AWS-provided Ruby 3.2 base image as they become available.
More details of writing functions in Ruby 3.2 are available in the developer guide.

MMS • Ben Linders
Article originally posted on InfoQ. Visit InfoQ

Applying resilience throughout the incident lifecycle by taking a holistic look at the sociotechnical system can help to turn incidents into learning opportunities. Resilience can help folks get better at resolving incidents and improve collaboration. It can also give organizations time to realize their plans.
Vanessa Huerta Granda gave a talk about a culture of resilience at QCon New York 2023.
Often, organizations don’t really do much after resolving the impact of an incident, Huerta Granda argued. Some organizations will attempt to do a post-incident activity, traditionally a root cause analysis or 5 Why’s, and some teams will do a postmortem. Either way, it’s usually focused on figuring out a root cause and preventing it from ever happening again, she said.
Huerta Granda mentioned reasons why folks aren’t doing activities for deeper learning:
- The skills required to successfully apply resilience to your culture are not traditional engineering skills; they are communication skills, analytics, presenting information and convincing people, and getting folks to talk to each other.
- You need time and training to get good at this and organizations often don’t give their engineers the bandwidth for this.
- Many organizations will stop at the step of incident response without going further to become a learning organization.
- Some organizations are stuck in the old-fashioned pattern of thinking that all outages happen because of a single root cause, without focusing on the sociotechnical systems.
We can apply resilience throughout the incident lifecycle by taking a holistic look at the sociotechnical system, Huerta Granda said. She mentioned that we have to understand that an incident is never “release a bug – revert the bug – everything is back to normal”. Instead, think through the conditions that led to the incident happening the way it did. What did people think was happening? What tools did they have available? How did they collaborate and communicate? This paints a fuller picture of our systems and helps us in the future, Huerta Granda said.
Resilience can help folks get better at resolving incidents, better at understanding what is happening and how to more effectively collaborate with each other, Huerta Granda mentioned.
For the organization, when folks aren’t stuck in a cycle of incidents they will have time to complete the plans the organization has in their roadmap, she said.
To foster a culture of resilience, we need to give people the time to talk to each other, to be curious to look past the technical root cause and into the contributing factors around the experience of an incident, Huerta Granda concluded.
InfoQ interviewed Vanessa Huerta Granda about learning from incidents.
InfoQ: How big can the costs of incidents be?
Vanessa Huerta Granda: It can be huge; it can erode the trust your customers have in you, and depending on the industry, companies can lose their licenses because of incidents. And then there’s the cost it has on your culture: when folks are constantly stuck in a cycle of incidents, they’re not going to have the bandwidth to be creative engineers.
InfoQ: What tips do you have for creating action items?
Huerta Granda: Some tips are:
- They need to be decided by the people actually doing them.
- Management needs to be ok with giving folks time to complete them.
- They should move the needle.
- Always have an owner and a due date (so we know they can be completed).
- It’s ok giving people an out.
Giving people an out means that action items should not be set in stone. If the owner of an item tries a fix and realises it doesn’t work or it will take way longer to complete, they can decide it’s not the best course of action. In that case, let them close the action items with an explanation of the work done.
InfoQ: How can we gain cross-incident insights?
Huerta Granda: You need to focus on individual incident insights first. When you have a body of work, then look at commonalities between your incidents; you may ask yourself, “Are the incidents that take longer to resolve related to a particular technology? Are folks aware of the observability tools that are available?” Once you have found the data you want to share, make sure to always add context to the data that you are providing; this turns “data” into “insights”.

MMS • Ben Linders
Article originally posted on InfoQ. Visit InfoQ

A way to improve developer experience is by removing time-consuming tasks and bottlenecks from developers and from the platform team that supports them. How you introduce changes matters; creating an understanding of the “why” before performing a change can smoothen the rollout.
Jessica Andersson spoke about improving developer experience in a small organization at NDC Oslo 2023.
Andersson explained that developer experience includes all the things that a software developer does in their work for developing and maintaining software, such as writing the code, testing, building, deploying, monitoring, and maintaining:
I often think of this from a product perspective where a development team is responsible for the product life cycle. A good developer experience allows a developer to focus on the things that make your product stand out against the competition and deliver customer value.
Their strategy for improving developer experience has been to remove time-consuming tasks and bottlenecks. They started out by unblocking developers: if a developer has to wait for someone outside their team in order to make progress, then they are not able to act as an autonomous team and take full ownership of their product life cycle, Andersson said.
Next, they looked at removing time-consuming tasks from the platform team. In order to be able to continue delivering a better developer experience to their developers they needed to make sure that the platform team wasn’t stuck in an endless upgrade and migration loop.
After having freed up time from the platform team, they shifted the focus to removing time-consuming tasks from the developers leading to an overall better developer experience.
Andersson mentioned that how you introduce changes matters and it’s easier to apply changes if you have created an understanding of the “why” before you do so. They introduced a quite different workflow for developers that they believed would be a great improvement, but met some resistance in adoption before the developers understood why and how it was an improvement:
In the long run, it turned into a very appreciated way of working, but the rollout could have gone smoother if we spent more effort on introducing the change before performing it.
You need to build the confidence with developers that you will deliver value to them, Andersson said. Having a good relationship with your developers is key to understanding their problems and how you can improve their daily lives, she concluded.
InfoQ interviewed Jessica Andersson about improving the developer experience.
InfoQ: What challenges did you face improving developer experience while being on a small team?
Jessica Andersson: We couldn’t do everything, and we couldn’t do it all at once. We aimed to take on one thing, streamline it and do it well, and once it was “just working” we could move on to the next thing.
We also had to be mindful of the dependencies we brought on and the tools we started using, everything needs to be kept up-to-date and there’s a real risk of ending up in a state of constant updates with no room for new improvements.
InfoQ: Can you give an example of how you improved your developer experience?
Andersson: For unblocking developers we had the context of using DNS for service discovery. DNS was handled manually and there were just two people who had access to Cloudflare, of which I was one. This meant that every time a developer wanted to deploy a new service or update or remove an existing one, they had to come to me and ask for help.
This was not ideal for how we wanted to work so we started looking into how we could handle this differently in the Kubernetes environment we were using for container runtime. We looked at the ExternalDNS project which allows for managing DNS records through Kubernetes resources.
For us it was really simple to get up and running and fairly easy to migrate the existing, manually-created DNS records to be tracked by ExternalDNS as well. Onboarding developers to the new process was quick and we saw clear benefits within weeks of switching over!
InfoQ: What benefits can a golden path or paved road provide for developers?
Andersson: It allows developers to reuse a golden path for known problems, for instance using the same monitoring solution for different applications. Another benefit is keeping the cognitive load lower; by applying the same way of doing things to different applications, it becomes easier to maintain many applications.
InfoQ: What’s your advice to small teams or organizations that want to improve developer experience?
Andersson: My strongest advice is to assess your own organization and context before deciding on what to do. Figure out where you can make an impact on developer experience, pick one thing and improve it! Avoid copying what others have done unless it also makes sense in your context.

MMS • Sergio De Simone
Article originally posted on InfoQ. Visit InfoQ

Apple has introduced a new open source package, the Swift OpenAPI Generator, aimed at generating the code required to handle client/server communication through an HTTP API based on its OpenAPI document.
Swift OpenAPI Generator generates type-safe representations of each operation’s input and output, as well as the required network calls for sending requests and processing responses on the client side, and server-side stubs that delegate request processing to handlers.
Both the client and the server code are based on a generated APIProtocol type that contains one method for each OpenAPI operation. For example, for a simple GreetingService supporting HTTP GET requests at the /greet endpoint, the APIProtocol would contain a getGreeting method. Along with the protocol definition, a Client type implementing it is also generated for use on the client side. Server-side, the package generates a registerHandlers method on the APIProtocol to register one handler for each operation in the protocol.
The generated code does not cover authentication, logging, or retrying, since this kind of logic is usually too closely tied to the business logic to allow for a general abstraction. Developers can, however, implement those features in middleware conforming to the ClientMiddleware or ServerMiddleware protocols, making them reusable across other projects based on Swift OpenAPI Generator.
The code Swift OpenAPI Generator generates is not tied to a specific HTTP framework; instead, it relies on generic ClientTransport and ServerTransport types, which any compatible HTTP framework can implement to be usable with the generator. Currently, the Swift OpenAPI Generator can be used with a few existing transport implementations, including URLSession from Apple’s Foundation framework, HTTPClient from AsyncHTTPClient, Vapor, and Hummingbird.
All protocols and types used in the Swift OpenAPI Generator are defined in its companion project Swift OpenAPI Runtime, which is relied upon by generated client and server code.
The generator can be run in two ways: either as a Swift Package Manager plugin, integrated into the build process, or manually through a CLI. In the first case, the plugin is controlled by a YAML configuration file named openapi-generator-config.yaml that must exist in the target source directory along with the OpenAPI document in JSON or YAML format. Using this configuration file, you can specify whether to generate only the client code, the server code, or both in the same target. The CLI supports the same kind of configurability through command-line options, as shown in the example below:
swift run swift-openapi-generator
--mode types --mode client
--output-directory path/to/desired/output-dir
path/to/openapi.yaml
The Swift OpenAPI Generator is still in its early stages, and while it supports most of the commonly used features of OpenAPI, Apple says it still lacks a number of desired features that are in the works.

MMS • Michael Redlich
Article originally posted on InfoQ. Visit InfoQ

Day One of the 9th annual QCon New York conference was held on June 13th, 2023 at the New York Marriott at the Brooklyn Bridge in Brooklyn, New York. This three-day event is organized by C4Media, a software media company focused on unbiased content and information in the enterprise development community and creators of InfoQ and QCon. It included a keynote address by Radia Perlman and presentations from four conference tracks, along with one sponsored solutions track.
Dio Synodinos, president of C4Media, Pia von Beren, Project Manager & Diversity Lead at C4Media, and Danny Latimer, Content Product Manager at C4Media, kicked off the day one activities by welcoming the attendees and providing detailed conference information. The track leads for Day One then introduced themselves and described the presentations in their respective tracks.
Keynote Address
Radia Perlman, Pioneer of Network Design, Inventor of the Spanning Tree Protocol and Fellow at Dell Technologies, presented a keynote entitled The Many Facets of “Identity”. Drawing on the history of authentication methods, Perlman provided a very insightful look at how the phrase “the identity problem” may not be as well understood as people assume. She maintained that “most people think they know the definition of ‘identity’…kind of.” Perlman went on to describe the many dimensions of “identity”, including: human and DNS naming; how to prove ownership of a human or DNS name; and what a browser needs to know to properly authenticate a website.
The theory of DNS is “beautiful,” as she described it, but in reality a browser search generally returns an obscure URL string. Because of this, Perlman once fell victim to a scam while trying to return her driver’s license. She then discussed how difficult it is for humans to properly follow password rules, questioned the feasibility of security questions, and recommended that people use identity providers. Perlman characterized the Public Key Infrastructure (PKI) as “still crazy after all these years” and discussed how a certificate authority, a device that signs a message saying “This name has this public key,” should be associated with the registry from which the DNS name is returned. She then described the problem with X.509 certificates: Internet protocols use DNS names, not X.500 names. “If being able to receive at a specific IP address is secure, we don’t need any of this fancy crypto stuff,” Perlman said.
She then compared the top-down and bottom-up models of DNS hierarchical namespaces, in which each node in the namespace represents a certificate authority. Perlman recommended the bottom-up model, created by Charlie Kaufman circa 1988, because organizations wouldn’t have to pay for certificates; in the top-down model, there is still a monopoly at the root level and the root can impersonate everyone. In summary, Perlman said that nothing is quite right today because names are meaningless strings and obtaining a certificate is messy and insecure. In conclusion, she suggested always starting with the question, “What problem am I solving?” and comparing various approaches. In a humorous moment early in her presentation, she remarked, “I hate computers” when she had difficulty manipulating her presentation slides. Perlman is the author of the books Network Security: Private Communication in a Public World and Interconnections: Bridges, Routers, Switches, and Internetworking Protocols.
Highlighted Presentations
Laying the Foundations for a Kappa Architecture – The Yellow Brick Road by Sherin Thomas, Staff Software Engineer at Chime. Thomas introduced the Kappa Architecture as an alternative to the Lambda Architecture, both deployment models for data processing that combine a traditional batch pipeline with a fast real-time stream pipeline for data access. She questioned why the Lambda Architecture is still popular, arguing that its underlying assumption, that stream processors cannot provide consistency, “is no longer true thanks to modern stream processors like Flink.” The Kappa Architecture has its roots in a 2014 blog post by Kafka co-creator Jay Kreps, co-founder and CEO at Confluent. Thomas characterized the Kappa Architecture as a streaming-first, single-path solution that can handle real-time processing as well as reprocessing and backfills. She demonstrated how developers can build a multi-purpose data platform that supports a range of applications on the latency and consistency spectrum using principles from a Kappa architecture. Thomas discussed the Beam Model, how to write to both streams and data lakes, and how to convert a data lake to a stream. She concluded by maintaining that the Kappa Architecture is great, but it is not a silver bullet; the same is true for the Lambda Architecture, whose dual code path makes it more difficult to manage. A backward-compatible, cost-effective, versatile and easy-to-manage data platform could be a combination of the Kappa and Lambda architectures.
Sigstore: Secure and Scalable Infrastructure for Signing and Verifying Software by Billy Lynch, Staff Software Engineer at Chainguard, and Zack Newman, Research Scientist at Chainguard. To address the rise of security attacks across every stage of the development lifecycle, Lynch and Newman introduced Sigstore, an open-source project that aims to provide a transparent and secure way to sign and verify software artifacts. Software signing can minimize the compromise of account credentials and package repositories, and checks that a software package is signed by the “owner.” However, it doesn’t prevent attacks such as ordinary vulnerabilities and build-system compromises. Challenges with traditional software signing include key management, rotation, compromise detection, revocation and identity. Software signing is currently widely supported in open-source software, but not widely used; by default, tools don’t check signatures due to usability issues and key management.
Sigstore frees developers from key management and relies on existing account security practices such as two-factor authentication. With Sigstore, users authenticate via OAuth (OIDC) and an ephemeral X.509 code-signing certificate is issued, bound to the identity of the user. Lynch and Newman provided overviews and demonstrations of Sigstore, including its sub-projects: Sigstore Cosign, signing for containers; Sigstore Gitsign, Git commit signing; Sigstore Fulcio, user authentication via OAuth; Sigstore Rekor, an append-only transparency log such that the certificate is valid if the signature is valid; Sigstore Policy Controller, a Kubernetes-based admission controller; and Sigstore Public Good Operations, a special interest group of volunteer engineers from various companies collaborating to operate and maintain the Sigstore Public Good instance. Inspired by RFC 9162, Certificate Transparency Version 2.0, the Sigstore team provides a cryptographically tamper-proof public log of everything they do.
The Sigstore team concluded by stating that there is no single, one-size-fits-all solution; software signing is not a silver bullet, but it is a useful defense; software signing is critical for any DevSecOps practice; and developers should start verifying signatures, including those on their own software. When asked by InfoQ about security concerns with X.509, as discussed in Perlman’s keynote address, Newman stated that certificates are very complex and acknowledged that vulnerabilities can still make their way into certificates. However, Sigstore is satisfied with the mature libraries available to process X.509 certificates. Newman also stated that an alternative would be to scrap the current practice and start from scratch; however, that approach could introduce even more vulnerabilities.
Build Features Faster With WebAssembly Components by Bailey Hayes, Director at Cosmonic. Hayes kicked off her presentation by defining WebAssembly (Wasm) modules as: a compilation target supported by many languages; only one .wasm file required for an entire application; and built from one target language. She then introduced the WebAssembly System Interface (WASI), a modular system interface for WebAssembly, which Hayes claims should really be known as the WebAssembly Standard Interfaces because it’s difficult to deploy modules in POSIX. She then described how Wasm modules interact with WASI via the WebAssembly Runtime and the many ways that a Wasm module can be executed, namely: plugin tools such as Extism and Atmo, FaaS providers, Docker, and Kubernetes. This was followed by a demo of a Wasm application. Hayes then introduced the WebAssembly Component Model, a proposed extension of the WebAssembly specification that supports high-level types within Wasm such as strings, records and variants. After describing the building blocks of Wasm components with WASI, she described the process of how to build a component, followed by a live demo of an application, written in Go and Rust, that was built and converted to a component.
Virtual Threads for Lightweight Concurrency and Other JVM Enhancements by Ron Pressler, Technical Lead of OpenJDK’s Project Loom at Oracle. Pressler provided a comprehensive background on the emergence of virtual threads that drew on a number of mathematical results. A comparison of parallelism vs. concurrency defined their respective performance measures as latency (time duration) and throughput (tasks/time unit). For any stable system with long-term averages, he introduced Little’s Law as L = λW (a brief worked example follows the list below), such that:
- L = average number of items in a system
- λ = average arrival rate = exit rate = throughput
- W = average wait time in a system for an item (duration inside)
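As a quick illustration of the formula, with numbers chosen here for illustration rather than taken from the talk: a stable service sustaining a throughput of λ = 1,000 requests per second with an average in-system wait of W = 0.05 seconds holds, on average, L = 1,000 × 0.05 = 50 requests in flight at any moment.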
A comparison of threads vs. async/await in terms of scheduling/interleaving points, implementation, and recursion/virtual calls identified the languages that support these attributes, namely JavaScript, Kotlin, and C++/Rust, respectively. After introducing asynchronous programming, syntactic coroutines (async/await) and the impact of context switching with servers, Pressler tied everything together by discussing threads and virtual threads in the Java programming language. Virtual threads are a relatively new feature, initially introduced as a preview in JDK 19. After a second preview in JDK 20, virtual threads will become a final feature in JDK 21, scheduled to be released in September 2023. He concluded by defining the phrase “misleading familiarity” as “there is so much to learn, but there is so much to unlearn.”
Summary
In summary, day one featured a total of 28 presentations with topics such as architectures, engineering, language platforms, and software supply chains.