QCon San Francisco 2024 Day 3: Arch Evolution, Next GenUIs, Staff+, Hardware Architectures

MMS Founder
MMS Michael Redlich

Article originally posted on InfoQ.

Day Three of the 18th annual QCon San Francisco conference was held on November 20th, 2024, at the Hyatt Regency in San Francisco, California. Key takeaways included: a debate on whether prompt engineering is a programming language or a utility; how Google Lens helps the visually impaired navigate the streets in Google Maps; challenges in migrating to a cellular architecture; and how to more properly implement high-resolution platform observability to avoid unintended consequences.

What follows is a summary of the keynote address and highlighted presentations.

Keynote Address: Prompt Engineering: Is it a New Programming Language?

Hien Luu, Sr. Engineering Manager at Zoox and Author of “MLOps with Ray,” presented his keynote address entitled, Prompt Engineering: Is it a New Programming Language? Luu kicked off his presentation by stating:

The most powerful programming language isn’t a programming language at all.

He then posed the question: “Is prompt engineering a new programming language or just word-smithing for those who write JavaScript?”

This keynote was designed to be a debate on this question, with Luu serving as debate moderator and providing arguments both for and against each side. After defining the attributes of a programming language and prompt engineering, Luu conducted an initial vote on the motion from the audience, which seemed to favor prompt engineering as a programming language.

The debate was focused on three topics:

  • Syntax & Structure
  • Skills & Expertise
  • Impact & Longevity

Luu provided attributes, examples and ChatGPT demos both for and against each of these topics. For example, the statement, “I saw a man with the telescope,” could be interpreted two ways: the observer saw a man holding a telescope, or the observer saw the man through a telescope. Luu also maintained that we rely on natural language rather than on skills and expertise.

After providing closing arguments, Luu once again polled the audience. This time, the audience seemed to favor the position that prompt engineering is not a programming language.

Highlighted Presentations: Accessibility with Augmented Reality | Slack Migration to a Cellular Architecture

Making Augmented Reality Accessible: A Case Study of Lens in Maps was presented by Ohan Oda, Senior Software Engineer at Google. Oda kicked off his presentation with two statistics: one out of four 20-year-olds will become disabled before they retire, according to the Council for Disability Income Awareness; and an estimated 1.3 billion people worldwide live with a significant disability, according to the World Health Organization. His presentation focused on individuals who are blind or have low vision.

Oda introduced Google Lens, a “camera-based experience in Google Maps that helps on-the-go users understand their surroundings and make decisions confidently by showing information in first-person perspective.” He demonstrated how to use Lens with a short video. While useful when traveling, Oda stated that Lens is not used very much in everyday situations. Of course, using Lens requires the user to hold up the phone while walking, which may cause “friction,” which Oda defined as giving surrounding pedestrians the perception of being recorded.

In his quest to improve usage and user retention, Oda attended several internal accessibility/disability inclusion (ADI) sessions at Google and solicited feedback from some of the visually impaired employees. He also attended external conferences such as XR Access, a research consortium at Cornell Tech dedicated to making virtual, augmented and mixed reality accessible to people with disabilities.

Oda discussed the challenges in improving Google Lens, including reversing the old adage that “a picture may be worth 1000 words”; as he maintained, “the user doesn’t have time to listen to 1000 words.”

Oda concluded with a video from Ross Minor, a gaming, media, and technology accessibility consultant who is dedicated to providing accessibility to those who are disabled. Minor is blind due to a traumatic event at the age of eight.

Slack’s Migration to a Cellular Architecture was presented by Cooper Bethea, Former Senior Staff Engineer and Technical Lead at Slack. Bethea kicked off his presentation with a peek behind Slack’s architecture, the web servers behind the scenes and the corresponding data store.

He discussed the goals and challenges behind building a cellular design in which an availability zone can be “drained” of traffic, and compared the two options of Siloing and Internal Managed Draining.

Bethea introduced Coordination Headwind, a concept in which organizations start to feel that accomplishing simple things becomes much slower over time. He referred to such organizations as becoming slime moulds and compared bottom-up and top-down hierarchical designs.

Bethea then introduced a project cadence that features: writing and circulating proposals; engaging deeply with high-value services; and expanding to all critical services.

The current state of the cellular design includes: siloed services are drainable in approximately 60 seconds; Vitess automation can reparent at the speed of replication; remaining critical services have roadmaps; there is a “happy path” to a silo for new services; drains can happen for incident response, rollout and even drills.

Conclusion

QCon San Francisco, a five-day event, consisting of three days of presentations and two days of workshops, is organized by C4Media, a software media company focused on unbiased content and information in the enterprise development community and creators of InfoQ and QCon. For details on some of the conference tracks, check out these Software Architecture and Artificial Intelligence and Machine Learning news items.



Presentation: Efficient Serverless Development: Latest Patterns and Practices on AWS

MMS Founder
MMS Yan Cui

Article originally posted on InfoQ.

Transcript

Cui: I’m here to share some of the patterns I’ve noticed, things that people struggle with when it comes to developing serverless applications, and some of the things you can do to make that development experience a lot easier. My name is Yan. I’ve been working with AWS for a very long time, since 2010. I went through the whole EC2 to containers to serverless journey over the years, and nowadays, I’m pretty much all fully focused on serverless. Nowadays, I spend maybe half of my time working with Lumigo as a developer advocate. Lumigo are, in my mind, the best observability platform for serverless, and they support containers as well. I haven’t really done much of containers work recently. I also do a lot of consulting work, as well as independent consultant, working with companies and helping them upskill on serverless, and also finding solutions for problems they have.

Testing

As a consultant, I see a lot of recurring problems, both for my students and for my clients, and I’ve identified maybe three main areas that you need to get right in order to have a smooth software development process when it comes to building serverless architectures. You probably have encountered some problems yourself as well in terms of, how do you have that fast feedback loop when testing your serverless applications? How do you efficiently and easily deploy your application? How do you manage your application environments across your AWS real estate? I think the problem that is by far the most important one, and the one that most people struggle with, is, how do you get a good testing workflow?

Specifically, how do you achieve a fast feedback loop of making changes and testing them, without having to wait for a deployment every single time you make a small code change? How do you make sure that everything you’re doing is well tested, so that you can find bugs before your customers do? There’s a really common myth that still persists today, that there’s just no local development experience when it comes to serverless technologies. In fact, there are quite a lot of different ways you can get a good and fast feedback loop with local development when you’re working with serverless technologies. I’m going to be telling you five of them.

Option 1: Hexagonal Architecture

Option one is using hexagonal architecture, which is a software pattern for creating loosely coupled components that can easily slot into different execution environments using abstraction layers called ports and adapters. That’s also why it’s often called the ports and adapters architecture. It’s often drawn in a diagram like this, where you have your core business domain in the middle right there, encapsulated into different modules and objects, and hosted inside your application layer. When your application layer needs to accept requests from callers and from clients, and when it needs to call out to databases and other things to read data or to write data, it exposes ports.

Then, in order to connect your application layer to the outside world, say, a lambda function or a container, you create adapters. For example, you can allow different clients to call into the application layer by creating an adapter for one of those clients, which may be a lambda function. In that case, your adapter needs to adapt Lambda’s invocation signature of an event and a context into whatever domain object your application layer requires in order to invoke the business logic that’s been captured inside the core domain modules.

With that adapter, you’ve got the lambda function execution environment figured out, and you can also allow other clients, maybe a Fargate container, to run your application by creating another adapter that adapts whatever web framework you want to run in the container to your application layer, for example converting Express.js request objects into those same domain objects that your application layer requires. This approach gives you really nice portability in terms of what execution environment you want to run in. It’s also very useful in situations where you’re just not sure whether your workload is going to run in lambda functions or in containers, or where later on you need to change your mind, because your architecture is going to have to evolve with your context.

Today, maybe you’re building something that five people are going to use, but maybe in a year’s time it gets really popular, and now suddenly you have really high throughput. Your requirements are figured out, so you no longer need to rapidly add capabilities; instead, you need to focus on efficiency and cost. You move your workload from lambda functions into containers, similar to what the Prime Video team did at Amazon and wrote a blog post about. As your context changes, this architecture pattern gives you the flexibility to make changes without having to rewrite large parts of your application.

What about when your core domain needs to read data from databases or write data to other places? You also have ports for those, and you create adapters that adapt those data fetches and writes to DynamoDB. To support local development, you can also have adapters that talk to mocks instead, so that you are able to write your application code and test it locally without having to talk to any third-party services. You can test your code, put breakpoints, and step through it. This then gives you a really nice local development experience.
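
To make the idea concrete, here is a minimal TypeScript sketch of a port with three adapters: a DynamoDB adapter, an in-memory adapter for local development and tests, and a Lambda adapter in front of the application layer. The names (RestaurantRepository, addRestaurant, TABLE_NAME) are illustrative assumptions, not code from the talk.

```typescript
import { randomUUID } from "node:crypto";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

// Port: what the application layer needs from the outside world.
export interface RestaurantRepository {
  save(restaurant: { id: string; name: string }): Promise<void>;
}

// Application/core layer: pure domain logic, no AWS types.
export const addRestaurant =
  (repo: RestaurantRepository) =>
  async (name: string) => {
    const restaurant = { id: randomUUID(), name };
    await repo.save(restaurant);
    return restaurant;
  };

// Adapter 1: DynamoDB implementation of the port.
export class DynamoDbRestaurantRepository implements RestaurantRepository {
  private readonly doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));
  constructor(private readonly tableName: string) {}
  async save(restaurant: { id: string; name: string }): Promise<void> {
    await this.doc.send(new PutCommand({ TableName: this.tableName, Item: restaurant }));
  }
}

// Adapter 2: in-memory implementation for local development and unit tests.
export class InMemoryRestaurantRepository implements RestaurantRepository {
  public items: Array<{ id: string; name: string }> = [];
  async save(restaurant: { id: string; name: string }): Promise<void> {
    this.items.push(restaurant);
  }
}

// Adapter 3: Lambda entry point translating event/context into a domain call.
const useCase = addRestaurant(new DynamoDbRestaurantRepository(process.env.TABLE_NAME!));
export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  const { name } = JSON.parse(event.body ?? "{}");
  const restaurant = await useCase(name);
  return { statusCode: 200, body: JSON.stringify(restaurant) };
};
```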

If you change your mind later, because DynamoDB was fine but your access patterns have become more complex, and supporting them means tying yourself in knots trying to make it work with DynamoDB, so it doesn’t make sense anymore and you’d rather just go to a relational database, you can just change the adapter, and the application will continue to work. That adapter is the one part of your application you have to change. A hexagonal architecture is a really good way to structure and modularize your application so that you get a lot of portability, but also a really good local development experience.

It’s something you should consider, especially when you have a more complex business domain or application. The other nice thing about it, besides portability and the local development experience, is that it’s pretty universally applicable to any environment, in terms of programming language and the development framework you’re using. It doesn’t matter if you’re using CDK, the serverless framework, SAM, SST, or whatever the latest flavor of the day is; it always works, because ultimately it’s about how you structure, organize, and modularize your code.

There’s a problem, because when we talk about serverless, we’re not just talking about lambda functions. Of course, lambda is what we consider serverless as a service, but serverless as a whole is much bigger than that. When I talk about serverless, I’m thinking about any technology that fits a number of criteria: for one, we don’t have to think about managing, patching, and configuring the servers that run our application, and it also has autoscaling built in. Importantly, it needs to be able to scale to zero. Ideally, it has usage-based pricing, so you only pay for your application when someone does something, rather than paying for it by the hour. This usage-based pricing is also going to be very important when we later talk about how you can use ephemeral environments, which is one of the most impactful practices that has evolved with serverless.

Of course, not every single service that we may consider serverless fits neatly and ticks every single one of these boxes, which is why many people consider serverless more as a spectrum. On one end you’ve got things like SNS and lambda, things that are definitely serverless. On the other end you’ve got things like EC2, which are definitely not serverless. Somewhere in the middle, you’ve got things like Fargate, which allows you to run containers without having to think or worry about managing the underlying servers that run them. You’ve got things like Kinesis, which has no server management, but doesn’t quite scale to zero and still has uptime-based pricing; that would be somewhere in the middle of that spectrum.

In the early days of serverless, people often talked about how you should use lambda functions to transform data, not to transport data. That is still very much true today: if you’ve got business logic you need to implement, put it in lambda functions, absolutely. If all you’re doing is shuffling data from one place to another, for example from a database to respond to an API call, and you don’t add any value with any custom code, then all you’re doing with the lambda function is making a call to some service using the AWS SDK, and you’re not adding any value with that lambda function. Instead, it’s better to have the service do the work for you. If you’re building a REST API with API Gateway, you can have API Gateway talk to Cognito directly to do user authentication and authorization, rather than writing some code yourself to call Cognito.

Similarly, if you’re just building a simple CRUD endpoint, and all it’s doing is taking whatever’s in DynamoDB and returning it as is without any business logic in the middle, then you can have API Gateway talk to DynamoDB directly without putting a lambda function in there, which would introduce an additional moving part, something that you need to manage, maintain over time, and also pay for. Every component that goes into the architecture has a cost. You should always be thinking about: does it make sense for me to add this other moving part? What value am I getting by having that thing in my architecture?
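
As a rough illustration of such a direct integration, here is a CDK (TypeScript) sketch that wires an API Gateway method straight to DynamoDB GetItem with no Lambda in between. The table, resource path, and mapping templates are hypothetical and only meant to show the shape of the pattern.

```typescript
import { Stack, StackProps } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as apigateway from "aws-cdk-lib/aws-apigateway";
import * as dynamodb from "aws-cdk-lib/aws-dynamodb";
import * as iam from "aws-cdk-lib/aws-iam";

export class DirectIntegrationStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const table = new dynamodb.Table(this, "Restaurants", {
      partitionKey: { name: "name", type: dynamodb.AttributeType.STRING },
    });

    // Role that API Gateway assumes to call DynamoDB directly (no Lambda in between).
    const integrationRole = new iam.Role(this, "ApiDynamoRole", {
      assumedBy: new iam.ServicePrincipal("apigateway.amazonaws.com"),
    });
    table.grantReadData(integrationRole);

    const api = new apigateway.RestApi(this, "RestaurantsApi");
    const restaurant = api.root.addResource("restaurants").addResource("{name}");

    // GET /restaurants/{name} maps straight onto DynamoDB GetItem via mapping templates.
    restaurant.addMethod(
      "GET",
      new apigateway.AwsIntegration({
        service: "dynamodb",
        action: "GetItem",
        options: {
          credentialsRole: integrationRole,
          requestTemplates: {
            "application/json": JSON.stringify({
              TableName: table.tableName,
              Key: { name: { S: "$input.params('name')" } },
            }),
          },
          integrationResponses: [
            { statusCode: "200", responseTemplates: { "application/json": "$input.json('$.Item')" } },
          ],
        },
      }),
      { methodResponses: [{ statusCode: "200" }] }
    );
  }
}
```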

Keep going: if you’re making changes to your data in DynamoDB, and you want to capture those data change events and make them available on an event bus so that other services can subscribe to them, then instead of having a lambda function, again, just shuffling that data from a DynamoDB stream to an event bus, you can nowadays use EventBridge Pipes. You can still use lambda functions to do data transformation, but you don’t have to do that data shuffling yourself. EventBridge Pipes can also be used to call third-party APIs like Stripe, or other internal services and APIs you have as well.
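
A hedged sketch of that change-capture pattern in CDK (TypeScript), using the L1 CfnPipe construct to connect a DynamoDB stream to an EventBridge bus; the table, bus, and event names are made up, and the property names follow the CloudFormation schema for AWS::Pipes::Pipe.

```typescript
import { Stack } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as dynamodb from "aws-cdk-lib/aws-dynamodb";
import * as events from "aws-cdk-lib/aws-events";
import * as iam from "aws-cdk-lib/aws-iam";
import * as pipes from "aws-cdk-lib/aws-pipes";

export class ChangeCaptureStack extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    const table = new dynamodb.Table(this, "Restaurants", {
      partitionKey: { name: "id", type: dynamodb.AttributeType.STRING },
      stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
    });
    const bus = new events.EventBus(this, "DomainEvents");

    // Role the pipe assumes: read the DynamoDB stream, put events on the bus.
    const pipeRole = new iam.Role(this, "PipeRole", {
      assumedBy: new iam.ServicePrincipal("pipes.amazonaws.com"),
    });
    table.grantStreamRead(pipeRole);
    bus.grantPutEventsTo(pipeRole);

    // Shuffle data-change events from the stream to EventBridge with no Lambda in the middle.
    new pipes.CfnPipe(this, "TableToBus", {
      roleArn: pipeRole.roleArn,
      source: table.tableStreamArn!,
      sourceParameters: {
        dynamoDbStreamParameters: { startingPosition: "LATEST", batchSize: 10 },
      },
      target: bus.eventBusArn,
      targetParameters: {
        eventBridgeEventBusParameters: { detailType: "RestaurantChanged", source: "restaurants-service" },
      },
    });
  }
}
```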

Again, there are lots of ways to build serverless architectures without actually having to write custom code and ship it into lambda functions. That creates a problem for our hexagonal architecture, because there’s no code, nothing for us to test, nothing for us to put behind ports and adapters, yet there are still a lot of things in such an architecture that you want to be able to test. The other downside of hexagonal architectures is that there is a bit of upfront work to do in terms of designing those abstraction layers and creating adapters and ports, even if you may not end up needing them later. That upfront work may make it easier for you to port your application to other environments later, but it does come with some cost.

Option 2: Local Simulation

What if we are going to use a lot of direct service integrations; how can we still get that local development experience? Another approach is local simulation, by simulating those services locally. Probably the most widely adopted tool for this is LocalStack. I started using it about four or five years ago, when it was open source, before version 1, and it was really unstable at the time, so I pretty much stayed away from it ever since. In the last maybe 12 or 18 months, it has come a really long way, becoming a proper official, commercial product. I had a chat with Waldemar Hummer, the CTO of LocalStack, recently about what’s happening in v3, and it’s now looking like a really good product and much more stable from what I can see.

Taking our example from earlier, we don’t have any lambda functions in this architecture, but you can use LocalStack to simulate the entire application. With v3, they now have support for EventBridge Pipes as well. You’re able to take your application, deploy it against LocalStack, and it should basically run end-to-end inside LocalStack. With v3, they are also adding features that provide value that would be difficult to get otherwise, even when you run your application in the real AWS environment. When you have direct integrations like this, one thing that often trips people up, myself included, is the permissions.

For example, if I have this thing set up and then I somehow misconfigure the permission between EventBridge Pipes and the EventBridge Bus, then those problems are really difficult to debug. One of the things that LocalStack v3 does now is you can enforce IAM permission checking so that when you make an API call against your local endpoint, that triggers events into EventBridge Pipes, and to EventBridge, and there’s a permission problem, you see that in your LocalStack logs, “There’s an IAM permission error.” With that, you can straight away identify, I’ve got a problem with my IAM setup for this part of my application. Whereas in the real thing it’s much more difficult for you to identify those problems.

Of course, if I do have a lambda function I want to test as well, LocalStack can also simulate lambda functions running locally on your machine. I’m not sure how good the ability to put breakpoints in your code is; I think they have that support, and I remember seeing a demo of it. The nice thing about local simulation is that it’s much broader in terms of what you can do and what it can cover; it’s not just about lambda functions. The downside is that any simulation is never going to be 100% exactly the real thing. You’re always going to run the risk of running into some service that’s not properly supported, or some API for a service you’re using that is not fully implemented, or just bugs, or worse, some subtle behavior differences that give you false positives or false negatives when you run your tests against the local simulator, but then the real thing suddenly breaks.

The problem is that, with these kinds of things, oftentimes you just need to run into one problem for a lot of your strategy to fall apart. Having said that, I think LocalStack is a really good product, you should definitely go check it out.
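
One common way to point code at LocalStack is simply to override the SDK endpoint. A minimal sketch, assuming LocalStack is running on its default edge port 4566 and toggled by a hypothetical USE_LOCALSTACK environment variable:

```typescript
import { DynamoDBClient, ListTablesCommand } from "@aws-sdk/client-dynamodb";

// Point the AWS SDK at LocalStack's edge endpoint instead of the real AWS API.
// The endpoint, region, and dummy credentials are commonly used LocalStack
// defaults; adjust them to match your local setup.
const isLocal = process.env.USE_LOCALSTACK === "true";

const dynamo = new DynamoDBClient(
  isLocal
    ? {
        endpoint: "http://localhost:4566",
        region: "us-east-1",
        credentials: { accessKeyId: "test", secretAccessKey: "test" },
      }
    : {}
);

// The same code path works against LocalStack locally and against real AWS when deployed.
export const listTables = async () => {
  const { TableNames } = await dynamo.send(new ListTablesCommand({}));
  return TableNames ?? [];
};
```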

Option 3: Lambdalith

If you want to look at something else again, there’s also the option of writing your lambda functions as lambdaliths. A lambdalith is basically when you take an existing web application framework and run it inside a lambda function, with something that adapts Lambda’s invocation signature of event and context into whatever your web framework requires. There are lots of tools that support this out of the box nowadays. AWS has the AWS Lambda Web Adapter that you can bundle with your lambda function; I think it’s available as a lambda layer, and it can do the translation for you before it invokes your code. There are also frameworks you can use to develop the application from the ground up using this approach, such as Serverless Express, and there’s Bref for PHP.

Serverless Express is mostly for JavaScript and Bref is for PHP. Then you’ve got Zappa for Python, which allows you to run Flask applications inside a lambda function and does that conversion for you. The nice thing about this approach is that you’re using familiar web frameworks, so developers know what they’re doing. They know how to test it; they know how to run an Express.js app on the local machine and test it. Also, it gives you that portability again, because it’s just an Express.js application. If tomorrow you want to move your workload from lambda to containers, you can always do a minimal amount of work to change how your Express.js app is called to handle those requests.
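
For example, a minimal lambdalith might look like the following sketch, using Express with the @vendia/serverless-express adapter; the routes are placeholders, not from the talk.

```typescript
import express from "express";
import serverlessExpress from "@vendia/serverless-express";

// An ordinary Express app: all routing lives inside this one function.
const app = express();
app.use(express.json());

app.get("/restaurants", (_req, res) => res.json([{ name: "example" }]));
app.post("/restaurants", (req, res) => res.status(201).json(req.body));

// Adapter: converts the Lambda event/context into an HTTP request for Express.
export const handler = serverlessExpress({ app });

// The same app could also run directly in a container or on a laptop, e.g.:
//   if (!process.env.LAMBDA_TASK_ROOT) app.listen(3000);
```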

On the flip side, again, it’s just about lambda functions, just about your code. You can’t test the things you’re not writing custom code for, where you’re using AWS services to do the actual work; you can’t test those as part of the application. Also, one thing to consider with lambdaliths is that because you’re running a full web application framework, which can be quite large, they can add a lot of bloat to your lambda function’s deployment package. The size of your deployment package has an impact on cold start performance, which, for applications that experience a lot of cold starts, can really hurt your user experience.

Also, all the examples I’ve talked about so far with a lambdalith are about web applications and web APIs, and most of these frameworks are tailored for that particular workload. This approach doesn’t work as well out of the box for workloads that are not an API; for the many lambda functions used for data processing or in event-driven architectures, there’s no equivalent framework for building lambdaliths. There are also security and operational considerations. Instead of having one lambda function for every single endpoint, which gives you fine-grained access control, you have just one lambda function for the entire API and do all the routing internally inside your code. That means less fine-grained access control for individual endpoints, because it’s one lambda function with one IAM role.

The same goes for monitoring and alerting. If I’ve got different lambda functions for different endpoints, when there’s an error that triggers an alert, I can straight away tell from the error which function, and therefore which endpoint, is having a problem. With just one function, one alert, and one set of metrics, when an alert shows up in my notifications channel, I can’t tell whether the whole system is broken or just one endpoint is failing. You have less granularity in understanding the telemetry you get from your system.

Option 4: Deployment Frameworks

Then there are also deployment frameworks with built-in support for some local development. AWS SAM has sam local invoke, which allows you to invoke your function locally. It also has sam sync, which basically syncs up code changes in your lambda functions, pushing your local changes to your modules and updating the lambda function directly every time you make a change, so the test loop is a bit quicker. The serverless framework has serverless invoke local as well. SST has an sst dev command which allows you to live debug a lambda function.

All of these tools make it easier for you to write some changes and have them updated in AWS, or to run your function locally. It’s like a lightweight simulator, just specific to your lambda functions. The nice thing about this approach is that it’s all done by the framework; you don’t have to write your code in a specific way, it all comes out of the box. The downside is that it’s very specific to the framework you’re using and potentially also to the language. SST mostly targets TypeScript. I think sam local only supports JavaScript, Python, and maybe something else, and the same goes for the serverless framework. It’s not universal across languages.

Also, again, it’s mostly just looking at your lambda functions, so you can’t use it to test the things you’re not writing custom code for; you should use lambda to transform data, not to transport data. Running something like sam local invoke to test your code locally is all well and good when you’re developing and working on some changes, and you can even add breakpoints to your code. It’s great for exploring your changes, but not so good for automated testing to make sure you’re not introducing regressions. You’ve got lots of different functions and lots of different use cases; you don’t want to be manually running every single input every single time you make a small change. It’s just not feasible. Even if you use these tools to get a feedback loop while developing new things or making changes, you still need a suite of automated tests that allows you to catch regressions and problems you haven’t identified earlier.

Option 5: Remocal Testing

Since you do have to write tests anyway, my favorite approach for writing tests for lambda functions is what I call remocal testing. Basically, you combine executing your code locally with calling remote services rather than mocks, hence “remocal.” When you think about local testing, you think about executing your code locally against mocks. Because you’re running code locally, you can put breakpoints, which allows you to step through the code to debug more complicated problems; you can’t just rely on checking your logs. You should definitely have a good logging and observability setup for metrics and everything else you need to help you debug problems in production, but it’s also helpful to be able to step through the code, especially if your application is fairly heavy on lambda functions and custom code.

Importantly, you can test your changes without having to wait for a full deployment cycle every single time, which gives you a faster feedback loop. The problem with using mocks is that there’s a subtle difference between testing against the real thing, which asks, “is my application actually working?”, and running tests against mocks, which asks, “is my application behaving the way I expect it to, given that whatever I’m calling gives me this particular response?” That “given” is based on your expectations and assumptions about how that other thing works, whereas when testing against the real services, you’re testing against how the real thing actually behaves.

The problem is that, oftentimes, we get our expectations or assumptions wrong. I can’t tell you how many times I’ve written or looked at code where the automated tests passed: the code calls DynamoDB, we use a mock for that, we write a test, everything passes. Then we run it in the real world, and it fails straight away because we have a typo in our DynamoDB query syntax, which is a string, so our mock doesn’t catch it; we’re not really checking that. Our assumption is that our query is right, or we’re making a request based on some documentation which turns out to be wrong, or something like that.

You’re not able to test those assumptions with your mocks, which is why using mocks and local testing in this way is prone to giving you false positives. Also, you’re not testing a lot of other things that are part of your application, things like IAM permissions. Again, I can’t tell you how many times my tests have passed, then I run against the real thing, realize I’m missing IAM permissions, and the real thing fails.

Then, ultimately, your application is more than just your code. Your customer is not going to care that something broke because of a configuration problem rather than your code; the customer doesn’t care. Your job is to make sure the application works, not just your code. With remote testing, you’re testing against the real thing, testing your application in the cloud where it’s going to be executed, using it just like your customer would. You’re going to get much more realistic test results, and you can derive higher confidence from those results than you can from local tests.

Because you’re executing your application as it is, you also cover everything along the code path, in terms of IAM permissions, your calls from lambda functions to DynamoDB, and whatnot. The problem is that, if you have to do a deployment to update your application every single time you make a small change, that’s going to be really slow feedback, and it’s going to be really painful to develop serverless applications. Now we’ve got two approaches with different tradeoffs, and if we combine them, local testing and remote testing, we get remocal testing: you execute your code locally, but you talk to the real AWS services as much as possible, at least for the happy paths. Because you’re executing your code locally, you can put breakpoints in to debug through it, and you can change your code without having to do a full deployment every single time.

Because it’s calling the real thing, any mistakes in your request to DynamoDB will surface pretty quickly, without you having to wait for a deployment to AWS. This kind of testing is useful, but it’s still only looking at your lambda functions and whatever services they integrate with. It’s not looking at anything that comes upstream of your function: the API Gateway endpoint that’s calling your function, or the EventBridge bus that’s triggering it. You can have mistakes in all of those. You can have a bug in your EventBridge pattern, so an event gets sent in but doesn’t trigger your function because you’ve got a typo in the pattern. Or maybe there are permission-related errors in the API gateway, or a bug in your VTL code, or in the JavaScript resolver code in AppSync, and so on.

This approach is great in that it’s universally applicable to different languages and frameworks. I’ve personally used it for projects in JavaScript, TypeScript, Python, and Java. Again, these are code-level patterns that are really easy to apply to different contexts, but you still can’t test direct integrations. Even if you’re writing remocal tests for your lambda functions, you probably still need a suite of tests that executes your application end-to-end as well. If I were to build something with API gateway and look at all the different things API gateway can do for me, calling lambda functions is just one part of it; it can also do things like validate my request, transform my response, or call other AWS services directly.

There are also different authentication and authorization options you can use out of the box. Anything that I’m asking the AWS service to do for me, I want a test that covers it. If I’ve got lambda functions with some domain logic, I can write unit tests that test them in isolation; that’s where I will use mocks, or structure the code in such a way that it only works with domain objects and has no external dependencies. That’s where lightweight patterns from hexagonal architectures can be really useful. The IAM permissions of my function will be covered by my end-to-end tests. If I just want to focus on testing my function’s integration with other services, that’s where I focus on writing remocal tests.

Anything that I’m asking API gateway to do for me, such as configuring it to transform responses or validate a request against a schema, I will cover with my end-to-end tests. Likewise, if I’m using API gateway to authenticate requests with Cognito, I’ll make sure those are covered in my end-to-end tests as well. The lambda authorizer is an interesting case, because it is still a lambda function, so I can test it in isolation using unit tests. If my lambda authorizer is there so that I can integrate with Auth0 or some third-party identity provider, then I can also write a remocal test that calls Auth0 with some token to validate it. The different features I’m using can be covered by different kinds of tests.

Demo

To give you a sense of how they can all fit together, I prepared a short demo, so I can show you what that may look like in practice. I deployed this earlier to a new stage, not the dev stage; you can see that the stage name is qcon. We’ve got some lambda functions: a couple of endpoints to get a list of restaurants, to create new restaurants with a POST, and to search for restaurants. This project is using the serverless framework. You can see I’ve got my function definitions here. This is the Add Restaurant endpoint, protected by AWS IAM, and this is its path and the POST method. Then I’ve got the direct integration to DynamoDB to do a get from a table, because I’m not really doing much with the response; I’m just taking it as it is from DynamoDB and returning it.

We’ve got a VTL response template to transform the response, including returning a 404 if the restaurant is not found. For the Add Restaurant function, I’ve got my table down here as well, my DynamoDB table and Cognito and whatnot. I also want to make sure that when someone calls my POST endpoint to create a new restaurant, they provide a request body that matches a schema: this Add Restaurant schema just says that you have to provide a name, which is a string. Now we’ve got a mix of different cases here. We’ve got lambda functions that I want to test. We’ve got things that are implemented only by API Gateway, which can only be tested with end-to-end tests, such as the schema checking, and we’ve got the direct integration as well, which, at least in my setup, can only be tested using end-to-end tests. I’ve prepared some test scripts that run my remocal (integration) tests as well as my end-to-end tests.

After I deploy this, I can show you my Add Restaurant test case here, which says: we generate some random restaurant name, and when we invoke the Add Restaurant endpoint or handler, we expect the response to come back as a 200, and the new restaurant should match whatever data we’re sending. After the handler runs, there should be a restaurant with that ID in the database. Notice that this test is written at a very high level; it isn’t tied to invoking the code locally. What I’ve actually done is use a small library that allows me to tag tests into different test groups. If you look at this one step, it basically toggles between invoking your function locally, using this viaHandler helper, or calling an HTTP endpoint that’s been deployed. I can run this test case as a remocal test, but once I deploy my endpoint, I can also run the same test again as an end-to-end test, and the helper signs the HTTP request and all of that.
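
A hypothetical reconstruction of what such a dual-mode test might look like is sketched below; the viaHandler/viaHttp helpers, the TEST_MODE variable, the import path, and the endpoint are illustrative only (and the request signing for IAM-protected endpoints is omitted).

```typescript
// A test that can run as a remocal test (invoking the handler in-process
// against real AWS resources) or as an end-to-end test (calling the deployed
// HTTP endpoint), toggled by an environment variable.
import { randomUUID } from "node:crypto";

const viaHandler = async (body: unknown) => {
  // Remocal mode: import the real handler and invoke it locally.
  const { handler } = await import("../functions/add-restaurant");
  const response = await handler({ body: JSON.stringify(body) } as any, {} as any);
  return { statusCode: response.statusCode, body: JSON.parse(response.body) };
};

const viaHttp = async (body: unknown) => {
  // End-to-end mode: call the deployed API.
  const res = await fetch(`${process.env.API_URL}/restaurants`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
  return { statusCode: res.status, body: await res.json() };
};

const invoke = process.env.TEST_MODE === "e2e" ? viaHttp : viaHandler;

test("adding a restaurant persists it", async () => {
  const restaurant = { name: `test-${randomUUID()}` };

  const { statusCode, body } = await invoke(restaurant);

  expect(statusCode).toBe(200);
  expect(body).toMatchObject(restaurant);
  // In both modes the data lands in the real DynamoDB table, so a follow-up
  // check against the table (not shown) could assert the item exists.
});
```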

In this case, I can use the JavaScript debug terminal and put a breakpoint in my code. For the Add Restaurant function, I can run test:int, which runs my test case, and I can put a breakpoint in my function code and step through it. If there are any problems, I can quickly debug through it and figure out where the problem is. I think that was a timeout because it took too long to respond; the default test timeout is 5 seconds. If, say, I make a mistake so that instead of returning the new restaurant data it returns something else, I can make that change and run the test again, and now the test actually fails. I forgot to raise the timeout, so it’s going to say the test failed for that reason.

Actually, let’s run this here, with less noise. Run this test; it’s going to fail because of that. Ok, now I’ve figured out the problem, or, if I’m doing TDD, I start with a failing test and then implement the actual logic, which is what I should have done. I can now make the code do what it’s supposed to do, run the test, and once the test has passed, I’m able to promote my changes. I deploy that to AWS and then run the end-to-end version of these tests instead, in this case by just setting an environment variable for my test run. This way of writing tests means that I have fewer tests to maintain over time, especially when I’m writing tests at a fairly high level.

That’s an example of what I can do with remocal testing, combined with other tests that execute my application end-to-end to cover the things that are outside of my code. With this style of testing, because your function code is talking to the real thing, you do have to do one initial deployment before you can do this. Once you’ve deployed your DynamoDB tables, you can add new functions, have them talk to those tables, and test them before you even deploy the functions and the API endpoint to AWS, but you do have to do that initial deployment first.

The question becomes: what happens if you’ve got four people all working on the same system, the same API? Who gets to deploy their half-baked changes to dev so that they can test their code changes? That’s where the use of temporary, or ephemeral, environments comes in and is really helpful. I’m jumping ahead a little bit here. Before we get to that, I want to reiterate that the myth that there’s no local development experience with serverless is just no longer true. There are so many different ways to do it now. These are just five ideas I’ve come across, and there are different ways you can combine them as well. For example, you can run remocal tests or end-to-end tests against local simulators: you can write the same end-to-end test, but instead of having to wait for a deployment to AWS, you can deploy to LocalStack and run the tests against that instead.

Deployment

Next, we want to talk about deployment. How can you make sure that your deployment is nice, simple, and smooth, so that you don’t get tangled up in it? I’ve seen a lot of clients and students get tripped up by making their deployment process much more complicated than it needs to be. I really stress this: keep your deployment as simple as it can be, but no simpler. You don’t want to oversimplify, which causes other problems elsewhere. Part of the problem here is that lambda is just no longer a simple thing. Back in 2014, maybe 2016 or ’17, lambda was a fairly simple thing.

Nowadays there are just so many additional features that many people don’t need. You’ve got lambda layers for sharing code. You can package your lambda functions as a container image; it’s not the same as running a container, it’s just using the container image as a packaging format. You can ship your own custom runtime. You can use provisioned concurrency to reduce the likelihood that you’re going to see cold starts, and use that with aliases and whatnot. All of these things are useful for different situations.

Just because they’re there doesn’t mean you have to use them, and you absolutely don’t have to use all of them. In fact, I’d go as far as to say that a lot of these features are not something you would need for 90% of use cases; they are for specific problems that you run into. They’re more like medications than supplements: things you use when you have a problem, not something you take every day because it makes you feel good.

I’ve got a pet peeve about lambda layers: a lot of people get into problems because they’re using lambda layers to share code between different functions, instead of sharing their code through package managers like npm, or crates, or whatever. With lambda layers, you’ve got a number of problems. Number one is that there’s no support for semantic versioning; the versions just go 1, 2, 3, 4, 5 onwards. There’s no semantic versioning to communicate whether it is a breaking change, a bug fix, a patch, or just new additions. Because the layer exists outside of your language ecosystem, security tools that scan your code and dependencies don’t know about it and can’t scan it properly.

You’re also limited to only five lambda layers per function. There are so many things distributed as lambda layers nowadays that you can really quickly hit that five-layer limit. Even when you put things into lambda layers, they still count towards the 250 MB size limit for your application once the ZIP file has been unzipped, so layers don’t help you work around that limit either. They also make it more difficult for you to test your application, because part of your code now exists outside of your language ecosystem, and you require specific tooling support from SAM or SST or something else to make sure that code also exists locally when you’re running your tests locally. They were also invented for supporting Python.

They practically don’t work for compiled languages like Java or .NET. I have worked with a client that used lambda layers to share their code, and the actual process of updating the layers, publishing them, and making them available to other code was more work than just using npm to begin with. There’s also no tree shaking; this only applies to JavaScript, but if part of your dependencies exists as a separate thing in a lambda layer, as opposed to being part of the language ecosystem bundle, then you can’t tree shake it properly either. There are a lot of reasons not to use lambda layers, and you definitely shouldn’t use them to share code; too many people get tripped up by this.

If you’re building a JavaScript application and you’ve got different functions in the same project that need to share some business logic, just put your shared code into another folder, and let the deployment framework handle referencing and bundling it for you. The serverless framework and SAM, by default, will just zip the whole root directory of your project, so the shared code will be included and accessible at runtime anyway.

Or you can use bundlers like esbuild so that all of that shared code is bundled and tree-shaken as well. If you share code between different projects, then just do what you have been doing before and publish that shared code to npm, so that the functions that need it can grab it from npm instead of going through lambda layers. That’s my take on lambda layers: I definitely recommend that you don’t use them for sharing code. They are useful as a way to distribute your code once it’s been packaged, but we can talk about that separately afterwards.
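
A small sketch of the folder-sharing approach, with made-up file names; the shared module is just a normal import that the bundler or the framework's zip packaging picks up:

```typescript
// --- lib/pricing.ts ---------------------------------------------------------
// Shared business logic: just an ordinary module in the repo, no Lambda layer.
export const applyDiscount = (total: number, percent: number): number =>
  Math.round(total * (1 - percent / 100) * 100) / 100;

// --- functions/checkout.ts --------------------------------------------------
// One of several handlers importing the shared module. The framework's default
// zip packaging, or a bundler such as esbuild, includes (and tree-shakes) it.
import { applyDiscount } from "../lib/pricing";

export const handler = async (event: { total: number }) => ({
  statusCode: 200,
  body: JSON.stringify({ total: applyDiscount(event.total, 10) }),
});
```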

I mentioned earlier that you can package your function as a container image. I still prefer to use zip files because it’s easier, and with zip files you’re going to be using the managed runtimes. With a container image, you have to ship the runtime yourself, which, if you look at the split of responsibilities between you and AWS, means you’re on the hook for that runtime. Lambda and other providers can give you container image versions of the runtimes that you can use, but it’s up to you to update them and make sure you’re always using the latest version, and so on.

Whereas if you just use the zip file with the managed runtimes, you can be sure that the runtime is constantly being updated behind the scenes without you having to take any action. So there are two things to take away about deployment: don’t use lambda layers to share code, and prefer zip files with managed runtimes. That’s not to say you should never use container images for your lambda functions. With container images, you can have a deployment artifact of up to 10 GB, and there are use cases where you do need that. As your deployment package gets bigger, there are also some cold start performance improvements with container images, because of optimizations that have been done for container images specifically on the Lambda platform.

Environments

The last thing I want to talk about is the environment, starting with your AWS accounts. As a minimum, you should have one account per main stage, so dev, test, staging, and production should all be separate accounts. That gives you better isolation in the event of a security breach: someone who gets into your dev account can’t access production data. Similarly, it gives you a better way to monitor the cost of the different environments, and so on. You’ve got different accounts for different main stages, and then you use AWS Organizations to help you manage the overall AWS environment. For larger organizations with lots of different teams and workloads, it’s even better to go a step further and have one account per team per stage.

Then, if you’ve got applications, or parts of your applications, that are more business critical than others, you may go further still and put those services into their own set of accounts. Or if you’ve got services with outsized throughput requirements that may impact other services, again, put those into separate accounts so that they are insulated from other things in your ecosystem. On top of that, you can use ephemeral, or temporary, environments that are brought up on demand and torn down when you don’t need them anymore. The way it works can be very simple. Say I walk into the office, I have my cup of tea, maybe a biscuit, and then I start to work on a feature. I get a ticket from Jira, and I create a new temporary environment using the serverless framework; that is just a case of running serverless deploy.

Say the stage name is going to be dev-my-feature, assuming that’s the feature name. Once I’ve created my temporary environment for the service I’m working on, I can start to iterate on my code changes and run remocal tests against it. Then, when I’m ready, I create my commit and PR, and that runs my changes through the CI/CD pipeline. When my feature is done, I can just delete the temporary environment I created at the start, so this environment can be brought up and torn down with a single command with the serverless framework. You can do the same thing with CDK and SAM as well; this practice is not framework or language specific. The nice thing about doing this is that everyone is insulated from each other, so you can have multiple people working on the same API at the same time without stepping on each other’s toes.

Everyone’s got their own isolated environment to work in. You also avoid polluting your shared environments with lots of test data, especially when you’re working on something new that’s not fully formed yet, so you’ve got lots of test data that you don’t want polluting your staging environment or whatever. And the nice thing about the usage-based pricing we mentioned earlier, one of the key parts of the definition of serverless, is that it doesn’t matter how many of these environments you have; you only pay for usage. You can have 10 of them, and if you don’t do anything, you don’t pay for them, so there’s no cost overhead to having all of these environments in your account. If you do have to use services like RDS that are charged by uptime, then there are some special things you have to do to handle that.

Another place we can use ephemeral environments is in the CI/CD pipeline: you commit your code, and CI can also just create a temporary environment, run the tests against that environment, and tear it down afterwards. It’s going to make the pipeline slightly slower, but at the same time you avoid polluting your main stages with that test data. When I say environment, it’s not a one-to-one mapping to an AWS account. Yes, you should have multiple accounts, and your main environments, dev, test, staging, production, and so on, should each have their own account. Where your developers work is going to be the dev account.

Inside that account you can run multiple temporary environments on demand. I can have one environment for myself, and someone else working on a feature may create another. What an environment consists of may depend on what tools you’re working with. It may be represented as one CloudFormation stack, or one CDK app with multiple stacks, or maybe a combination of a stack or CDK app with some other things that exist in the environment, like SSM parameters, references to other things, API keys, secrets, and all that.

The question that people often ask me is: if you’re running multiple environments in the same account, how can you avoid clashes of resource names, and therefore deployment errors? There are a couple of things you want to do. One is, don’t explicitly name resources unless you really have to. Most services don’t force you to name the resources yourself, and that way CloudFormation is going to append a random string at the end, so you know they’re not going to clash on names. If you do have to name the resources yourself, such as for an EventBridge bus, then make sure you include the environment name, say dev-yan, as part of the resource name.

Again, that helps you avoid clashing on resource names. Using a temporary environment works really well with remocal testing, because, again, you need to have some resources in an AWS account to run your code against. It also works really well when you finish a feature: you can just delete everything, so again, no polluting your environments.
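
A small CDK (TypeScript) sketch of that naming convention; passing the stage via CDK context is just one possible convention, and the resource names are made up:

```typescript
import { App, Stack } from "aws-cdk-lib";
import * as dynamodb from "aws-cdk-lib/aws-dynamodb";
import * as events from "aws-cdk-lib/aws-events";

const app = new App();
// e.g. `cdk deploy -c stage=dev-yan` to spin up an ephemeral environment.
const stage = app.node.tryGetContext("stage") ?? "dev";

class OrdersStack extends Stack {
  constructor(scope: App, id: string) {
    super(scope, id);

    // Unnamed resource: CloudFormation appends a random suffix, so parallel
    // environments in the same account never clash on the table name.
    new dynamodb.Table(this, "Orders", {
      partitionKey: { name: "id", type: dynamodb.AttributeType.STRING },
    });

    // Explicitly named resource: include the stage in the name to avoid clashes.
    new events.EventBus(this, "Bus", { eventBusName: `orders-bus-${stage}` });
  }
}

new OrdersStack(app, `orders-${stage}`);
```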

Conclusion

There are three things to think about when it comes to creating a workflow that works well for serverless development. Make sure you’ve got a good, fast feedback loop for running tests. Make sure you’ve got a smooth deployment process. And manage your environments well, complementing them with temporary environments. We also covered five different ways you can get a good local development experience when it comes to serverless development.



Presentation: Optimizing Java Applications on Kubernetes: Beyond the Basics

MMS Founder
MMS Bruno Borges

Article originally posted on InfoQ.

Transcript

Borges: Who here is a Java developer? Who’s not a Java developer, but is here because of Kubernetes? We do have a few things that are applicable to any language, any runtime. I work for Microsoft now; I worked for Oracle before. I use a Mac, and I’m a Java developer at Microsoft. I work in the Java engineering team. We have our own JDK at Microsoft, and we help internal teams at Microsoft optimize Java workloads. Can anybody give an example of where Java is used inside Microsoft? Azure has a lot of Java internally. In fact, there is a component called the Azure control plane that does a lot of messaging Pub/Sub across data centers; we use Java in that infrastructure. Office? Not much Java behind Office, except maybe Elasticsearch, which is a Java technology, for a few things. Anything else? PowerToys, not Java. Minecraft. Yes, people forget Minecraft is Java.

There is the non-Java version of it now, but the Java version has tens of millions of players around the world, and we have hundreds of thousands of JVM instances at Microsoft running Minecraft servers for the Java edition of Minecraft. Anything else? LinkedIn, yes. LinkedIn is the largest subsidiary at Microsoft, fully on Java, with hundreds of thousands of JVMs in production as well, even more than the Minecraft servers. Besides Minecraft, LinkedIn, Azure, we have a lot of big data. If you’re running Spark, if you’re running Hadoop, if you’re running Kafka, those things are written in Java. It’s the JVM runtime powering those things. Even if you’re not a Java developer, but you are interacting with those things, you are consuming Java technology at the end of the day. If you’re not a developer at all, but you are using, for some reason, Bing, you are using Java technologies behind the scenes, because we also have a lot of Java at Bing.

What is this talk about? We’re going to cover four things. We’re going to cover a little bit of the basics of optimizing Java workloads on Kubernetes. We’re going to look at the size of containers and the startup time. We’re going to look at JVM defaults and how ergonomics in the JVM play a significant role in your workloads. Then we’re going to talk about Kubernetes, how Kubernetes has some interesting things that may affect your Java workloads. Finally, this weird concept that I’m coming up with called A/B performance testing in production.

Size and Startup Time

Size and startup time. Who here is interested in reducing the size of your container images? What is the first reason why you want to reduce the size of container images? The download. I and other folks that I’ve seen online say that size is important, but not the most important part. Remember that you are running your infrastructure on a data center with high-speed internet between the VMs. Is the size really impacting the download of the image from your container registry to your VM where the container will run? If you can track the slowness of startup time back to download, back to the speed of that, then, yes, reduce the size. If storage is getting expensive, then reduce the size. I believe that security is even more important.

It's about reducing the attack surface area. It's about reducing components that are shipped in the image and may become an attack vector. It's about reducing what goes in so that patching and updating that image will be faster, and so that you reduce basically all the dependencies of your system in production. For example: where do we have Log4j in production? "We don't use Log4j, because we removed that dependency from everywhere, just to reduce the size." Yes, sure, but your primary goal was to reduce the attack vector. That's an example. Easier to audit, as well. People dealing with SBOMs, supply chain security, component governance, all of that good stuff in recent days, that is, in my opinion, the primary reason to reduce the size of an image.

How do you reduce the image? There are three main areas. David Delabassée at Oracle has presented this many times, and he breaks it down into these three areas: the base image layer, the application layer, and the runtime layer. For the base image layer, you can use slim versions of Linux distributions. You can use distroless images. Or you can even build your own Linux base image on whatever distribution you want to base it on. Alpine is a good option as well. I'll talk more about that in the next slide. The second layer is the Java application. You should only put in the dependencies that your application really needs. It's not just Java applications, but any application: Node packages, Python packages, anything that goes in, you should really be careful about what is going into the final image. There is a trick about the application layer.

You actually should break down the application layer in different layers, the layer of your container image with just the dependencies and then the layer with your application code. Why is that? Caching. You’re going to cache the dependencies. If you’re not changing the dependencies, you’re not going to build that layer again, just the application layer. If you’re a Spring developer, Spring has a plugin for that, for example. Run as a non-root user. That’s quite important. Finally, the JVM runtime, or any tech runtime. If your tech runtime has capabilities of shrinking down the runtime to something that only contains the bits required to run your application, then great. The JDK project, years ago, added that mechanism of modules. You can have a JVM runtime with only the modules of the JDK that are important to your application. If you really want to go the extra mile, GraalVM, you can build a Native Image.
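
For reference, here is a minimal sketch of the layering described above, assuming a Spring Boot fat JAR; the image tags, user name, and JAR name are illustrative, not from the talk:

```dockerfile
# Stage 1: split the fat JAR into layers with Spring Boot's layertools
FROM eclipse-temurin:21-jre AS builder
WORKDIR /build
COPY target/app.jar app.jar
RUN java -Djarmode=layertools -jar app.jar extract

# Stage 2: copy stable layers first so they stay cached between builds
FROM eclipse-temurin:21-jre
WORKDIR /app
RUN useradd --system spring
USER spring
COPY --from=builder /build/dependencies/ ./
COPY --from=builder /build/spring-boot-loader/ ./
COPY --from=builder /build/snapshot-dependencies/ ./
# Application code changes most often, so it goes last
COPY --from=builder /build/application/ ./
# On Spring Boot versions before 3.2 the launcher class is
# org.springframework.boot.loader.JarLauncher
ENTRYPOINT ["java", "org.springframework.boot.loader.launch.JarLauncher"]
```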

Here are a few examples of size differences between images. You have Ubuntu and Debian Full. Ubuntu doesn't have a slim version anymore; they just have Ubuntu Full, which is 78 megabytes, and that compares with the slim version of Debian. If you really want to cut it down, then Alpine. Alpine has some interesting things that you have to consider. It uses the musl libc library, so some libraries may not be compatible. Luckily, the JDK is compatible with musl. There were other issues in the past that got fixed, but who knows what else is there? Keep an eye on Alpine. It's a good option, but make sure you test it. Also, it's hard to get commercial support from cloud vendors for Alpine, so if you're on Amazon, Google, or Azure, you may not get that support. Spring, the application layer: this is a classic Dockerfile for a Spring application. I have my JDK, interestingly enough, coming from Alpine, but that's from the Spring Boot documentation. I have my fat JAR. I have the entrypoint, java -jar, my application. That's great, but not the best. A better version of it would be to use the Spring user or create a custom user so you don't run it as root.

Finally, an even better option is to have those dependencies in different layers, so when your application changes, only that layer gets rebuilt. That optimizes your build, optimizes the download of the image as well, and so on. If you want to automate all that, you can put it in your CI/CD. Go to the Spring documentation, just search for Spring Boot Docker, and you're going to get the Maven plugin to build that image for you. The last layer is the JVM, or the language runtime, whatever stack you have. Here's an example from modern Java: JDK 22 is 334 megabytes extracted. When you create a Java custom runtime with only the bits that are needed for your application, the JVM is only 57 megabytes. If you really want to go native, you use GraalVM Native Image, and you get to less than 10 megabytes for many applications.
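
A hedged sketch of the custom-runtime step, with a hypothetical app.jar; the module list depends entirely on what jdeps reports for your application:

```shell
# Ask jdeps which JDK modules the application actually uses
jdeps --ignore-missing-deps --print-module-deps app.jar
# e.g. java.base,java.logging,java.net.http

# Build a trimmed runtime containing only those modules
jlink --add-modules java.base,java.logging,java.net.http \
      --strip-debug --no-header-files --no-man-pages \
      --output custom-runtime

# Run the application on the trimmed runtime
./custom-runtime/bin/java -jar app.jar
```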

Let's talk about the startup time of the JVM; other languages may have similar capabilities. The JDK has a capability called class data sharing. It's basically a binary representation of all the libraries so that they get loaded into memory much faster. You can get startup time improvements of around half, roughly twice as fast. These are the JEPs that you should look into. JEP stands for JDK Enhancement Proposal. Search for those numbers, or just "class data sharing", and you're going to find the details. Some future projects on startup time are happening in the OpenJDK world: Project Leyden, led by Oracle, and Project CRaC, led by Azul Systems. CRaC stands for Coordinated Restore at Checkpoint. Here are some benchmarks that Oracle ran for Leyden on Spring Boot.
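
A minimal sketch of dynamic class data sharing (AppCDS, JDK 13+); app.jar and the archive name are placeholders:

```shell
# Training run: record the classes the application actually loads
java -XX:ArchiveClassesAtExit=app-cds.jsa -jar app.jar

# Subsequent runs: map the archive instead of re-reading class metadata
java -XX:SharedArchiveFile=app-cds.jsa -jar app.jar
```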

You can see the blue bar is the traditional default for JDK 22, and then you have all the way down to Spring Boot AOT plus premain, which is a cached version. You do a training run: you train the execution and cache the result of that execution, so the next time you start, it's a lot faster. You go from 1.2 seconds down to 0.5 seconds, a significant improvement there. Then you have checkpoint/restore, which allows you to go from nearly 4 seconds for this application down to 38 milliseconds. It's a significant improvement in startup time. Of course, this is a checkpoint/restore technology: the framework, the library, and the runtime have to be aware of checkpoints, so that the application reaches a state that can be snapshotted. Then you can take that, put it on disk, and recover it next time. Keep an eye on those projects. They will make significant changes in the Java ecosystem.

JVM Defaults

Let's go into part 1, JVM defaults. The JVM has something called default ergonomics, and almost every language runtime stack has defaults. I like to say there's always premature optimization, because the defaults tend to be a little bit conservative. They tend to mostly work for most applications, but that is, in essence, an optimization by itself: the JVM has to decide how much memory to use for the heap, how many threads to use for the JIT compiler, and all of that based on signals from the environment. Let's do a quick puzzle with JVM ergonomics. Here I have some memory puzzles. Let's look at puzzle1. I'm going to run a Java application. Let's go to processors first. I have a Java application, quite simple, public static void main, give me how many processors there are that the JVM can see. Let's run puzzle1. I'm on my local machine, so we'll just do docker run java and the application. This is a Mac with 10 processors. If I run this command, how many processors will the JVM see? It sees only eight. Why? Because I'm running Docker Desktop, and I configured it to only allow eight processors for Docker Desktop. That was a tricky puzzle. It goes to show you how settings in the environment will affect the JVM, period.

Let's look at puzzle2. In puzzle2, I'm setting two CPUs for the container. If I run this thing, it's not really a puzzle, I will get two processors. Easy. What if I have a value that is not a natural number, like a decimal? Let's look at puzzle3: 1.2 CPUs, or in the Kubernetes world, 1200 millicores. How many processors does the JVM see here? Two, because the JVM will round up. Anything above 1000 millicores is two processors, anything above 2000 millicores is three processors, and so on. Let's go to memory. Memory is tricky. I have a program here that finds out which garbage collector is running in the JVM. There's a lot of code here, because this code is actually compatible with lots of versions of the JVM. In recent versions, it's a lot easier, but I wanted to have something compatible with older versions. Let's run this program, cat puzzle1. I have one CPU with 500 megs of memory. Which garbage collector will the JVM select? There are five options in the JVM these days. And how much memory will be set for the heap? It's going to be Serial, and a 25% heap.
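
The processor puzzles can be reproduced with a few lines; the image tag here is just one possible JDK image, and the rounding behavior is the point:

```shell
cat > Processors.java <<'EOF'
public class Processors {
    public static void main(String[] args) {
        // Derived from the container CPU quota, rounded up
        System.out.println(Runtime.getRuntime().availableProcessors());
    }
}
EOF

docker run --rm -v "$PWD":/app -w /app --cpus=2   eclipse-temurin:21 java Processors.java  # 2
docker run --rm -v "$PWD":/app -w /app --cpus=1.2 eclipse-temurin:21 java Processors.java  # 2 (rounded up)
docker run --rm -v "$PWD":/app -w /app --cpus=2.1 eclipse-temurin:21 java Processors.java  # 3
```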

That goes to show that if you don't tune the JVM properly, you're going to get a really bad heap configuration. Why is that? I'll show you later. Let's go to the next puzzle. I won't spend too much time here with puzzles. We have two CPUs and 2 gig of memory. Let's go to puzzle2, you get 25%. I want to do a quick change here, because I think I made a mistake in my puzzle. Let's run this puzzle again. Let's give it 2 gig. Let's go back to puzzle1: one CPU, 2 gig. Which garbage collector, and how much heap? You get 25% and Serial. Here is a little puzzle. This one is two CPUs with 1792 megabytes; which garbage collector? You will get G1. If you reduce it by 1 megabyte, then you get Serial, because of that single megabyte. This logic is inside the source code of the JVM. That's the threshold. It goes to show how complicated things are. I remember where I got puzzle1 wrong. Puzzle1 was wrong on one thing. Let's do 200 meg. With 200 meg, the heap size will not be 25%; it will be around 50%.

This is the math of the JVM; other language runtimes may have different algorithms. The default heap size for any environment with less than 256 megabytes of memory is 50%. Then, from 256 megabytes up to 512 megabytes, you get a pretty much stable heap of 127 megabytes. Above that, the heap size is set to 25%. If you're running your application in the cloud and not configuring the heap size, these defaults apply; most people actually do configure it, so generally they don't have to worry about this much, but it matters if you're not tuning the JVM at all. This comes from a time when the JVM was designed for environments that were shared with different processes. In the container world, the JVM should take advantage of as much of the resources available in that environment as possible, but we have to inform the JVM manually that it actually has access to all those resources. The defaults of the JVM have not been substantially updated since. There are lots of projects happening now; Microsoft, Google, and Oracle are all involved in enhancing ergonomics and defaults of the JVM for container environments. At the end of the story, don't just java -jar your application, otherwise you're going to be wasting resources, and that's money.
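
You can check what the ergonomics decided for a given container size without any application code; a rough sketch:

```shell
# ~50% heap for a small container (below 256 MB)
docker run --rm -m 200m eclipse-temurin:21 \
  java -XX:+PrintFlagsFinal -version | grep -w MaxHeapSize

# ~25% heap once memory goes above 512 MB
docker run --rm -m 2g eclipse-temurin:21 \
  java -XX:+PrintFlagsFinal -version | grep -w MaxHeapSize
```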

Garbage collectors. There are lots of garbage collectors in the JVM, and you want to be aware of them. There is actually one extra garbage collector I did not put in the list. It's called the Epsilon GC. The Epsilon GC is a garbage collector that does not collect anything. It's great for benchmarking applications where you want to eliminate the GC from the equation and just benchmark your application's performance without the behavior of the GC. Does it help for production, traditional business applications? Not much. For compiler engineers and JVM engineers, it's really helpful. When you are running things in the cloud, you have to keep in mind that no matter how much memory you give to an application in a container, there are certain areas of memory that the JVM will consume in roughly the same amount regardless: the metaspace, the code cache, things the JVM needs no matter how many objects you have in the heap. That's why, when you have two containers that are considered small, they are using pretty much the same amount of non-heap memory no matter what.
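
For orientation, these are the flags that select each collector explicitly (app.jar is a placeholder):

```shell
java -XX:+UseSerialGC   -jar app.jar   # small heaps, one or two cores
java -XX:+UseParallelGC -jar app.jar   # throughput-oriented
java -XX:+UseG1GC       -jar app.jar   # balanced default on larger heaps
java -XX:+UseZGC        -jar app.jar   # low pause times
java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -jar app.jar  # no-op GC, benchmarking only
```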

Then the heap is different. You have to keep that in mind for a lot of things that we're going to talk about next. Here's how you configure the JVM. We do provide some recommendations: usually, set the heap to 75% of the memory limit. If you want to make things a lot easier for yourself, you can use memory calculators; you have Buildpacks in the Paketo project. Paketo has a memory calculator for building the container image of Java workloads. They have Buildpacks for other languages as well. Those Buildpacks usually come with optimizations for containers that the runtime most often doesn't have. For example, here we have how the heap gets calculated automatically for you, so you don't have to set -Xmx and other things. It goes on with the other areas that are important for your application to tune eventually. Check out the Paketo Buildpacks. Here you have the Java example.
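
A minimal sketch of that recommendation applied to a Kubernetes container; the names, image, and limits are hypothetical:

```yaml
containers:
  - name: app                        # hypothetical container name
    image: myregistry/app:1.0        # hypothetical image
    resources:
      requests: { cpu: "1", memory: "1Gi" }
      limits:   { cpu: "2", memory: "1Gi" }
    env:
      - name: JAVA_TOOL_OPTIONS      # picked up automatically by the JVM
        value: "-XX:MaxRAMPercentage=75 -XX:+UseG1GC"
```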

Kubernetes (YAML Land)

Part 2, Kubernetes, or as I like to call it, YAML land. Anybody here familiar with the vertical pod autoscaler? Who is familiar with the horizontal pod autoscaler? The horizontal pod autoscaler is great. It works for almost everything. It's the classic throw-money-at-the-problem approach: you just add more computing power and scale out your application. When I see keynotes, people say, we have 200 billion transactions in our system. Ok, tell me how many CPUs you have. Nobody says that. Tell me how many cores are running behind the scenes. How many cores per transaction are you actually spending? That is the true scaling of your system. From a business perspective, 200 billion transactions are great. I got a lot of revenue, but what's your margin? Nobody will tell you the CPU per transaction, because that will give people an idea of margin. For us engineers, thinking about cost and about scaling, that is actually important. The horizontal pod autoscaler is great. You can scale out based on rules, but it's not a silver bullet. I'll give you an example.

There is a story of a company in Latin America. They were complaining about Java workloads. They were saying, this is slow, it takes too much memory, too much CPU. We have to have 20 replicas of the microservice because it doesn't scale well. I was like, if you're running 20 replicas of a JVM workload, there's something fundamentally wrong in your understanding of the JVM runtime and how it behaves. If you don't give enough resources to the JVM, what happens? The parts of the JVM runtime that the developers don't touch, like garbage collectors and JIT compilers, will suffer. If those are suffering, yes, scaling out is a great solution, but it's not the most effective solution in many cases. Then this company went ahead and migrated to Rust. A great alternative to Java, but it required six months of migration work. The funny thing is, the performance issue could have been solved in a day just by properly understanding the JVM, tuning the JVM, and redistributing the resources in their cluster. But no, they chose the hard route because they actually wanted to code in Rust, because it's fun. I respect that.

The vertical pod autoscaler is an interesting technology in Kubernetes, especially now that, with Kubernetes 1.27+, you have something called InPlacePodVerticalScaling. It allows the pod to increase the amount of resources of a container without restarting the container. It's important that the runtime running inside understands that more resources were given and is able to take advantage of that; the JVM still doesn't have that capability today, but it's in the works.
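
A minimal sketch of the container-level knob involved, assuming the InPlacePodVerticalScaling feature gate is enabled (alpha as of Kubernetes 1.27); the container name is hypothetical:

```yaml
spec:
  containers:
    - name: app                            # hypothetical container name
      resizePolicy:
        - resourceName: cpu
          restartPolicy: NotRequired       # CPU can be resized in place
        - resourceName: memory
          restartPolicy: RestartContainer  # today's JVMs can't grow the heap in place
```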

An interesting companion to the vertical pod autoscaler is something that Google offers on their GitHub, which I can use on Azure as well, called kube-startup-cpu-boost. It allows a container to have access to more resources up to a certain time, up to a certain policy, up to a certain rule that you put down in your 1500 lines of YAML; magic will work, and you're going to have a JVM that, for example, has more access to CPU and memory to start up, but then you can reduce the CPU and stabilize. Because the JVM will have significant work in its first minutes and hours, when the system is being hit and the JIT compiler is working and optimizing the code over time. Then after a while, the CPU usage doesn't change much; it actually goes down. That's when you figure out how long you can give that workload a CPU boost for. Search for kube-startup-cpu-boost. You're going to have a nice experience.

What else do we have here? Java on Kubernetes. What is the main issue that people here who are running or monitoring or deploying Java on Kubernetes have? What is your main issue? CPU throttling is a major problem. You have an application that is not given enough CPU, but there is a GC in there. There's a JIT compiler in there. There are certain elements of your runtime stack that will require CPU time beyond what your business application is doing. I'll give you an example. Let's walk through it. You need to understand CPU throttling and how it impacts the JVM, and even other runtimes with garbage collectors like .NET, Node, and Go; they have garbage collectors, maybe not at the same scale as the JVM, but it does impact them. Let's say you set the CPU limit to 1000 millicores, on a system with a CFS period of 100 milliseconds, the default. The application has access to 1000 millicores for every 100 milliseconds. There are six HTTP requests in the load balancer, nginx, or whatever.

Those requests are sent to the CPU to be processed by the JVM when they come in, and then they are processed across the four processors in the node. Why is that? Because the 1000 millicores is about CPU time, not about how many processors I have access to. When those threads ran, each request consumed 200 millicores. In total, they consumed 800 millicores, and those 800 millicores were consumed within 20 milliseconds. Now I only have 200 millicores left for the remaining 80 milliseconds, until the next period when I can use 1000 millicores again. What happens in those 20 milliseconds? The garbage collector has to work, remove objects from the heap, and clean up memory. The GC work totaled 200 millicores. Now I have to wait 60 milliseconds until I can process another request. That is your CPU throttling. When you understand that, you understand that you want your application and your runtime stack to be very clear on how many resources are needed to perform all the tasks in that flow. Now new requests come in, and that's your latency going up by at least 60 milliseconds on those two requests.

We covered this: up to 1000 millicores is one processor, 1001 to 2000 millicores is two processors, and so on. There is this little flag in the JVM that you can use, called ActiveProcessorCount. Why is it an interesting flag? Because you can say my application has a 1500 millicore limit, but I'm going to give it a processor count of three. That tells the JVM it has access to three processors, so it will size thread pools internally according to that number. How many threads can I process at the same time? Three? Sure, I'll size my thread pool based on that. If you have a strongly I/O-bound application, this is where you may want to use it despite having a smaller CPU limit. Most microservices on Kubernetes are I/O bound, because it's just network: receiving requests from the load balancer, sending requests to the database, calling another REST API, receiving another REST API call; it's just network I/O. At Microsoft, we came up with this flow of recommendations, or better starting points, for Kubernetes.
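
A hedged fragment combining the two ideas, with hypothetical limits:

```yaml
resources:
  limits:
    cpu: "1500m"        # 1.5 CPUs of quota per CFS period
    memory: "2Gi"
env:
  - name: JAVA_TOOL_OPTIONS
    # Tell the JVM to size its internal thread pools for three processors,
    # useful for I/O-bound services despite the smaller CPU limit
    value: "-XX:ActiveProcessorCount=3 -XX:MaxRAMPercentage=75"
```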

It depends on how much you have in terms of CPU limits; I've found customers saying, no, our pods in Kubernetes can only have up to 2000 millicores. I find that crazy, but there are cases like that. If that is the case, here's some guidance on where to get started. Start with this instead of the defaults in the JVM. Then, from this point, start observing your workload and fine-tune the JVM to your needs. Always have a goal. Is your goal throughput? Is your goal latency? Is your goal cost, resource utilization? Always have a goal in mind when tuning the JVM, because there are garbage collectors for latency, there are garbage collectors for throughput, there are garbage collectors that are good in general, for everything, and there are garbage collectors that are good for memory. You always have to keep in mind that you have a goal to address.
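
As one plausible illustration only (these flags are an assumption, not the exact table shown in the talk), a starting point for a container with roughly 2 CPUs and 2 GiB of memory might look like this, to be adjusted after observing the workload:

```shell
java -XX:MaxRAMPercentage=75 \
     -XX:ActiveProcessorCount=2 \
     -XX:+UseParallelGC \
     -jar app.jar          # app.jar is a placeholder
```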

Resource redistribution. Remember the case of the customer with 20 or 30 replicas, just growing for the same workload. I did a benchmark years ago when I started all this research, and I came up with this chart where I put a Java workload on Kubernetes and ran four scenarios. The green line and the green bar represent six replicas of the workload. You can see that my p99 latency is really bad, and it gets even worse at p99.9. The throughput is ok, 2.5 million requests. Then I thought, what if I reduced the replicas and gave each one more CPU and memory? Technically, I'm still using the same amount of memory and CPU. I'm scheduling the same amount, but I'm reducing the number of replicas so that my language runtime has more opportunity to behave properly. The throughput went up, the latency improved. There is a clear correlation between how resources are distributed and how the runtime performs.

Then I thought, what if I could save money as well? What if I could reduce resources now that I've figured out that resource redistribution helps? I came up with the two replicas, the blue line: two replicas with two CPUs. I went down from six CPUs to four, from 6 gig to 4 gig total. I improved the performance. My throughput is better than originally. My latency is better than originally; it's not the best, but it's better. And it costs less. That goes to show that if you have systems with dozens of replicas, you have an opportunity today to do a resource redistribution on your Kubernetes cluster. This is applicable to any language. I've heard the same story from .NET folks, from Go folks, and so on.
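
Illustrative Deployment fragments for the redistribution idea; the numbers echo the scenario above, but the exact values are assumptions:

```yaml
# Before: many small replicas, each starving its GC and JIT
# (fragment of a Deployment spec)
replicas: 6
resources:
  limits: { cpu: "1", memory: "1Gi" }
---
# After: fewer, larger replicas; less total CPU and memory scheduled,
# and each JVM has room for its runtime machinery
replicas: 2
resources:
  limits: { cpu: "2", memory: "2Gi" }
```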

I'll give you some examples in practice. This is looking at an Azure Kubernetes Service cluster. This is a cluster of six VMs, six nodes. I deployed the same workload 18 times, and it cost me $840 a month to run this scenario. What should I do first? I'm going to merge a few pods, and that should give me better performance on those pods, on those nodes. Then I'm going to continue to apply this rollout to more of them. Instead of having lots of replicas of the same workload on a node, I'm going to have just one replica per node. If you are into the idea of writing a Kubernetes operator that does this thing magically for you, please be my guest. I would love to talk with Kubernetes operator experts.

Then, finally, this is performing better than originally. Maybe I should go even further. I’m going to increase my node pool to taller VMs, increase a little bit the resource limits of those pods, and have only three now. I’m still interested in resiliency and all of that good stuff, but now I have standby CPU and memory. The cost is still the same because I’m using the same type of VM, just with more CPU and memory on that type of VM. The cost is still the same, but now I have spare resources for more workloads. From a cloud vendor perspective, it’s great. You’re still paying the same. For your perspective, you can do more. That is the beauty of this approach.

A/B Performance Testing

To finish, we're going to get into the land of unproven practices. I'm still hoping that somebody will help me prove this thing. I have this concept called A/B performance testing. We see A/B testing all the time, mostly for features in a system. What if we could do A/B testing for production performance as well? What if we could say, I'm going to have a load balancer, and I'm going to route the load to different instances of my application, the same application, but configured differently. This instance here uses garbage collector A. This instance here runs garbage collector B. This instance here has higher resource limits. This instance here has lower memory limits, and so on. Then you also start considering: I'm going to have smaller JVMs with more replicas, horizontal scale, or I'm going to have taller JVMs with fewer replicas on taller nodes. You can do that easily on Kubernetes. Here's an example of how I can do the 2by2, 2by3, 3by2, 6by1. I front that with nginx, and I do round robin. I tried the least-connections pattern for nginx; it's tricky, and it really depends on your workload.

For the benchmark purposes, I used round robin. The other scenario, which is actually a lot easier to use for benchmarking, is garbage collector configuration and tuning. I have deployments of the same application, but with default JVM ergonomics, the G1 GC, and the Parallel GC. That's how I configured it. This one is actually interesting to use with least connections, because you want to see which GC works really fast and doesn't pause the application too much, so you can process more connections. For my benchmark purposes, I still used round robin. When you combine all that, you will have something like this, at least on Azure. Here I have the Azure dashboard. You can see here the inspect endpoint, which is just a JSON path; it returns me a serialized JSON. And the prime factor endpoint, which I used to emulate a CPU-bound workload. Then we can see the roles, and in the roles, that's where I can see 2by2, 2by3, and which one is performing better.
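
A minimal nginx sketch of the routing described; the upstream service names are hypothetical Kubernetes Services, one per topology or GC variant:

```nginx
upstream ab_pool {
    # round robin is the default; uncomment to try least connections instead
    # least_conn;
    server app-2by2.default.svc.cluster.local:8080;
    server app-2by3.default.svc.cluster.local:8080;
    server app-6by1.default.svc.cluster.local:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://ab_pool;
    }
}
```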

This is the last 24 hours. I can make it a bit bigger. Here I have the other profiles. I don't want to spend too much time here, because this was just to generate data to show you. If we compare these cases here, it won't clearly show that this one is better than that one. It won't be clear, because, first of all, my use case is definitely not like yours. You have to understand your scenario. I'm showing you what you can put in practice so that you can observe in production how things behave differently.

I'll give you the bread and butter, where I do a live demo, at least. Here, I have the 2by2 and 2by4 running, and I'm going to trigger this kubectl exec to get into the pod. I have a container in a cluster, and I just access that container in this bottom-right shell. I'm inside a container in this cluster, and now I'm going to run the same test as I did before, and it's going to be prime factor. I'm going to trigger this thing against nginx. nginx will route it to those different deployments, different topologies. I have 20 threads and 40 connections. Let's go to the dashboard, and let's actually use the live metrics, which is a nice, fancy thing for doing live demos. Here I have the requests coming in on the right side, the aggregated request rate and everything from this application. This application is deployed with the same instrumentation key. It's different deployments, but it's the same container image. I'm giving, of course, different environment variable names.
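
The talk doesn't name the load tool, but the 20 threads and 40 connections match the shape of a wrk invocation; the pod name and URL here are assumptions:

```shell
kubectl exec -it load-client -- \
  wrk -t20 -c40 -d60s "http://nginx/primefactor?n=1234567"
```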

When I go to role, I can see the roles of the deployments. I’m going to show you later how I define that in the code, but it’s just a parameter for Azure Monitor. Who’s performing better? Again, hypothetical. Here I have all the pods, 2by2, 2by2, 6by1, 6by1, 6by1. For this case here, it doesn’t really matter who’s performing better, because I haven’t hit the peak performance need for this workload. You can see the CPU usage is not above 50%, so it’s pretty good. It goes to show you how you can compare in production different topologies and different GCs and different JVM tuning flags, different parameters. That will give you a much better opportunity to evaluate production performance load with your application still working just fine. Sometimes I have issues with doing performance testing in the lab, just because it doesn’t mimic exactly production customer workload. This is an opportunity for the customer themselves to do that test.

What Have We Learned?

The main takeaways: reduce the size of container images, but think primarily about security, not about size itself, unless size is the problem. Track down whether downloading the image onto the node is actually a problem. We have 1 gigabit speeds, and your container registry is in the same data center as your Kubernetes cluster. Does it really matter? If it does, yes, reduce the size. Security should be the primary focus, in my opinion. Startup time: optimize the JVM for startup time. There are lots of technologies and features in the JVM these days that can make your application fly in terms of startup time. CDS, class data sharing, is the main thing for most versions. If you're on Java 11, Java 17, or Java 21, class data sharing is available in those versions. Take advantage of that. Evaluate Project CRaC and Project Leyden as you think about modernization in the near future. JVM tuning: understand your runtime.

For any language, understand your runtime defaults, understand your runtime capabilities, and take advantage of them. Observe as much as possible: observe memory, CPU, garbage collection, JIT compilation. All of those things can be measured in production. It's fairly easy. Understand the impact of resource constraints on your runtime stack, and make sure that you are giving the runtime enough to behave properly. Horizontal scaling is not a silver bullet; it's just throwing money at the problem. Take advantage of vertical scaling as well. Finally, A/B performance tuning in production: it's going to be the next big thing after AI. Consider that as well, especially in staging. If you have a nice staging or pre-production environment, that's a great opportunity. If you're interested in what else Microsoft is doing for Java, visit developer.microsoft.com/java.

Questions & Answers

Participant 1: I'm curious if Microsoft is looking to add CRaC support in Microsoft's distribution of OpenJDK.

Borges: We are researching that. We are working with internal teams. I actually just emailed a follow-up on my conversation with the teams to see the status of their research into which projects they want to test CRaC with. CRaC is a nice feature that Azul is working on. There is one little thing that can complicate matters, especially if you're not using a standard framework like Spring: it does require code to be aware of the checkpoint/restore flow. I'm going to checkpoint; I'm going to restore. You have to shut down a lot of things: thread pools, database connections, in-flight transactions, and all of that, before you can do a checkpoint snapshot. Then, when you restore, you have to start those objects again. Spring has that implemented for you, and I think Quarkus is working on it in the meantime. If the framework has that capability, great. We know how creative enterprise customers are in coming up with their own in-house frameworks. CRaC will require at least that the framework team builds that capability into the application framework. We are looking into it.

Participant 2: About the current JSRs that are in progress, which one do you think will most affect Java performance in production?

Borges: Not JSRs specifically. JSR stands for Java Specification Request, and there are no JSRs specifically for enhancing the JVM for these problems. There are projects and conversations in place by Google, Microsoft, and Oracle to make the heap of the JVM dynamic, growing and shrinking as needed. That capability will allow the InPlacePodVerticalScaling capability, in particular, to be taken advantage of by the JVM. Oracle is working on ZGC. That's the other complication: because the JVM has lots of garbage collectors, it's up to the garbage collector, not the JVM, to define memory areas and how to manage them. Oracle is working on adaptable heap sizing in ZGC. Google has done some work on the G1 GC. We at Microsoft are looking into the Serial GC for that idea.



Thoughtworks Technology Radar Oct 2024 – From Coding Assistance to AI Evolution

MMS Founder
MMS Aditya Kulkarni

Article originally posted on InfoQ. Visit InfoQ

Thoughtworks recently published their Technology Radar Volume 31, providing an opinionated guide to the current technology landscape.

As per the Technology Radar, Generative AI and Large Language Models (LLMs) dominate, with a focus on their responsible use in software development. AI-powered coding tools are evolving, necessitating a balance between AI assistance and human expertise.

Rust is gaining prominence in systems programming, with many new tools being written in it. WebAssembly (WASM) 1.0’s support by major browsers is opening new possibilities for cross-platform development. The report also notes rapid growth in the ecosystem of tools supporting language models, including guardrails, evaluation frameworks, and vector databases.

In the Techniques quadrant, notable items in the Adopt ring include 1% canary releases, component testing, continuous deployment, and retrieval-augmented generation (RAG). The Radar stresses the need to balance AI innovation with proven engineering practices, maintaining crucial software development techniques like unit testing and architectural fitness functions.

For Platforms, the Radar highlights tools like Databricks Unity Catalog, FastChat, and GCP Vertex AI Agent Builder in the Trial ring. It also assesses emerging platforms such as Azure AI Search, large vision model platforms such as V7, Nvidia Deepstream SDK and Roboflow, along with SpinKube. This quadrant highlights the rapid growth in tools supporting language models, including those for guardrails, evaluations, agent building, and vector databases, indicating a significant shift towards AI-centric platform development.

The Tools section underscores the importance of having a robust toolkit that combines AI capabilities with reliable software development utilities. The Radar recommends adopting Bruno, K9s, and visual regression testing tools like BackstopJS. It suggests trialing AWS Control Tower, ClickHouse, and pgvector, among others, reflecting a focus on cloud management, data processing, and AI-related database technologies.

For Languages and Frameworks, dbt and Testcontainers are recommended for adoption. The Trial ring includes CAP, CARLA, and LlamaIndex, reflecting the growing interest in AI and machine learning frameworks.

The Technology Radar also highlighted the growing interest in small language models (SLMs) as an alternative to large language models (LLMs) for certain applications, noting their potential for better performance in specific contexts and their ability to run on edge devices. This edition drew a parallel between the current rapid growth of AI technologies and the explosive expansion of the JavaScript ecosystem around 2015.

Overall, the Technology Radar Vol 31 reflects a technology landscape heavily influenced by AI and machine learning advancements, while also emphasizing the continued importance of solid software engineering practices. Created by Thoughtworks' Technology Advisory Board, the Technology Radar provides valuable insights twice yearly for developers, architects, and technology leaders navigating the rapidly evolving tech ecosystem, offering guidance on which technologies to adopt, trial, assess, or approach with caution.

The Thoughtworks Technology Radar is available in two formats for readers: an interactive online version accessible through the website, and a downloadable PDF document.



AWS Amplify and Amazon S3 Integration Simplifies Static Website Hosting

MMS Founder
MMS Steef-Jan Wiggers

Article originally posted on InfoQ. Visit InfoQ

AWS recently announced a new integration between AWS Amplify Hosting and Amazon Simple Storage Service (S3), enabling users to deploy static websites from S3 quickly. This integration streamlines the hosting process, allowing developers to deploy static sites stored in S3 and deliver content over AWS’s global content delivery network (CDN) with just a few clicks according to the company.

AWS Amplify Hosting, a fully managed hosting solution for static sites, now offers users an efficient method to publish websites using S3. The integration leverages Amazon CloudFront as the underlying CDN to provide fast, reliable access to website content worldwide. Amplify Hosting handles custom domain setup, SSL configuration, URL redirects, and deployment through a globally available CDN, ensuring optimal performance and security for hosted sites.

Setting up a static website using this new integration begins with an S3 bucket. Users can configure their S3 bucket to store website content, then link it with Amplify Hosting through the S3 console. From there, a new “Create Amplify app” option in the Static Website Hosting section guides users directly to Amplify, where they can configure app details like the application name and branch name. Once saved, Amplify instantly deploys the site, making it accessible on the web in seconds. Subsequent updates to the site content in S3 can be quickly published by selecting the “Deploy updates” button in the Amplify console, keeping the process seamless and efficient.

(Source: AWS News blog post)

This integration benefits developers by simplifying deployments, enabling rapid updates, and eliminating the need for complex configuration. For developers looking for programmatic deployment, the AWS Command Line Interface (CLI) offers an alternative way to deploy updates by specifying parameters like APP_ID and BRANCH_NAME.
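
A hedged sketch of that CLI path; APP_ID and BRANCH_NAME are placeholders, and the exact parameters should be checked against the current reference for aws amplify start-deployment:

```shell
aws amplify start-deployment \
  --app-id "$APP_ID" \
  --branch-name "$BRANCH_NAME" \
  --source-url "s3://my-website-bucket/"   # assumed S3 source location
```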

Alternatively, according to a respondent on a Reddit thread, users could opt for Cloudflare:

If your webpage is static, you might consider using Cloudflare – it would probably be cheaper than the AWS solution.

Or using S3 and GitLab CI, according to a tweet by DrInTech:

Hello everyone! I just completed a project to host a static portfolio website, leveraging a highly accessible and secure architecture. And the best part? It costs only about $0.014 per month!

Lastly, the Amplify Hosting integration with Amazon S3 is available in the AWS Regions where Amplify Hosting is available; pricing details for S3 and for hosting can be found on the respective pricing pages.



Podcast: Trends in Engineering Leadership: Observability, Agile Backlash, and Building Autonomous Teams

MMS Founder
MMS Chris Cooney

Article originally posted on InfoQ. Visit InfoQ

Transcript

Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today I’m sitting down across many miles with Chris Cooney. Chris, welcome. Thanks for taking the time to talk to us today.

Introductions [01:03]

Chris Cooney: Thank you very much, Shane. I'm very excited to be here, and indeed across many miles. I think it's not quite the antipodes, right, but it's very, very close to the antipodes: an island off New Zealand is the antipodes of the UK, and we are about as far away as it gets. The wonders of the internet, I suppose.

Shane Hastie: Pretty much so, and I think the time offset is 13 hours today. My normal starting point is who is Chris?

Chris Cooney: That's usually the question. So hello, I'm Chris. I am the Head of Developer Relations for a company called Coralogix. Coralogix is a full-stack observability platform that processes data in-stream, without indexing. We are based in several different countries. I am based in the UK, as you can probably tell from my accent. I have spent the past 11, almost 12 years now as a software engineer. I started out as a Java engineer straight out of university, then quickly got into front-end engineering, didn't like that very much, and moved into SRE and DevOps, and that's really where I started to enjoy myself. Over the past several years, I've moved into engineering leadership and got to see organizations grow and change, and how certain decisions affect people and teams.

And now more recently, as the Head of Developer Relations for Coralogix, I get to really enjoy going out to conferences and meeting people, but I also get a lot of research time to find out what happens to companies when they employ observability. And I get to understand the trends in the market in a way that I never would've been able to see before as a software engineer, because I get to meet hundreds and hundreds of people every month, and they all give me their views and insights. I get to collect all those together, and that's what makes me very excited to talk on this podcast today about the various different topics that are going on in the industry.

Shane Hastie: So let’s dig into what are some of those trends? What are some of the things that you are seeing in your conversation with engineering organizations?

The backlash against “Agile” [02:49]

Chris Cooney: Yes. When I started out, admittedly 11, 12 years ago is a while, but it's not that long ago really. I remember in the first company I worked in, we had an Agile consultant come in. They came in and explained to me the principles of agility and so on and so forth, and gave me the rundown of how it all works, how it should work, how it shouldn't work, and so on. We were all very skeptical, and over the years I got to see agility become this massive thing. I've sat in boardrooms with very senior executives in very large companies listening to Agile Manifesto ideas and things like that. It's been really interesting to see that gel in. And now we're seeing this reverse trend of people almost emotionally pushing back against not necessarily the core tenets of Agile, but just the word. We've heard it so many times, there's a certain amount of fatigue around it. That's one trend.

The value of observability [03:40]

The other trend I'm seeing, technically, is this move around observability. Obviously, I spend most of my time talking about observability now. It used to be this thing that you have to have for when things go wrong, or to stop things from going wrong. And there is this big trend now of organizations moving towards questions that have less to do with what's going wrong. It's a broader question, like, "Where are we as a company? How many dev hours did we put into this thing? How does that factor into reducing mean time to recovery, that kind of thing?" They're much broader questions now, blending in business measures, technical measures, and lots more people measures.

I'll give you a great example. Measuring rage clicks on an interface is a thing now: measuring the emotionality with which somebody clicks a button. It's fascinating; I think it's a nice microcosm of what's going on in the industry. Our measurements are getting much more abstract. And what that's doing to people, what it's doing to engineering teams, is fascinating. So there's lots and lots going on.

And then, obviously, there are the technical trends around AI and ML and things like that, what they're doing to people, the uncertainty around that, and also the excitement. It's a pretty interesting time.

Shane Hastie: So let’s dig into one of those areas in terms of the people measurements. So what can we measure about people through building observability into our software?

The evolution of what can be observed [04:59]

Chris Cooney: That's a really interesting topic. I think it's better to contextualize it: we started out with basically CPU, memory, disk, network, the big four. Then we started to get a bit clever and looked at things like latency and response sizes, data exchanged with a server, and so forth. Then, as we built up, we started to look at things like marketing metrics, so bounce rates, how long somebody stays on a page, and that kind of thing.

Now we're looking at the next tier, the next level of abstraction up, which is more like: did the user have a good experience on the website, and what does that mean? So you see web vitals starting to break into this area, things like: when was the meaningful moment that a user saw the content they wanted to see? Not first paint, not first load of the template. The user went to this page; they wanted to see a product page. How long was it before they saw all the meaningful information they needed, not just how long did the page take to load? And that's an amalgamation of lots and lots of different signals and metrics.

I've been talking recently about this distinction between a signal and an insight. In my taxonomy, the way I usually slice it, a signal is a very specific technical measurement of something: latency, page load time, bytes exchanged, that kind of thing. An insight is an amalgamation of lots of different signals to produce one useful thing, and my litmus test for an insight is that you can take it to your non-technical boss and they will understand it. They will understand what you're talking about. I can say to my non-technical boss, "My insight is this user had a really bad experience loading the product page. It took five seconds for the product to appear, and they couldn't buy the thing; they couldn't work out where to do it". That would be a combination of various different measures around where they clicked on the page, how long the HTML request took, how fast the actual network was to the machine, and so on.

So that's what I'm talking about with the people experience metrics. It's fascinating in that respect, and there's this new level now, which is directly answering business questions. It's almost like we've built scaffolding up over the years, deeply technical. When someone would say, "Did that person have a good experience?" we'd say, "Well, the page latency was this, and the HTTP response was 200, which is good, but then the page load time was really slow". Now we just say yes or no, because of X, Y and Z. And so, that's where we're going, I think. This is all about that trend of observability moving into the business space: taking much broader, more encompassing measurements at a much higher level of abstraction. That's what I mean when I say "people metrics" as a general term.

Shane Hastie: So what happens when an organization embraces this? When not just the technical team, but the product teams, when the whole organization is looking at this and using this to perhaps make decisions about what they should be building?

Making sense of observations [07:47]

Chris Cooney: Yes. There are two things here, in my opinion. One is there's a technical barrier, which is making the information literally available in some way, so putting a query engine in front of your data. What's an obvious one? Putting Kibana in front of OpenSearch is the most common example. It's a way to query your data. Putting a SQL query engine in front of your database is another good example. Just doing that is the technical piece. And that is not easy, by the way, at a certain level of scale. Technically, it is really hard to make high-performance queries work for hundreds, potentially thousands, of concurrent users. That's not easy.

Let's assume that's out of the way and the organization has worked that out. The next challenge is, "Well, how do we make it so that users can get the questions they need answered, answered quickly, without specialist knowledge?" And we're not there yet. Obviously, AI makes a lot of very big promises about natural language query. It's something that we've built into the platform at Coralogix ourselves. It works great. It works really, really well. And I think what we have to do now is work out how we make it as easy as possible to get access to that information.

Let's assume all those barriers are out of the way, and an organization has achieved that. I saw something similar to this when I was a Principal Engineer at Sainsbury's, when we started to surface this; it's an adjacent example, but still relevant: the introduction of SLOs and SLIs into the teams. Before, if I went to one team and said, "How has your operational success been this month?" they would say, "Well, we've had a million requests and we serviced them all in under 200 milliseconds". Okay. I don't know what that means. Is 200 milliseconds good? Is that terrible? What does that mean? We'd go to another team and they'd say, "Well, our error rate is down to 0.5%". Well, brilliant. But last month it was 1%. The month before that it was 0.1% or something.

When we introduced SLOs and SLIs into teams, we could see across all of them, “Hey, you breached your error budget. You have not breached your error budget”. And suddenly, there was a universal language around operational performance. And the same thing happens when you surface the data. You create a universal language around cross-cutting insights across different people.
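
For readers unfamiliar with the term, here is a small illustration of what an error budget is; the numbers are generic, not from the podcast:

```java
public class ErrorBudget {
    public static void main(String[] args) {
        double slo = 0.999;                  // 99.9% availability target
        double windowMinutes = 30 * 24 * 60; // a 30-day window
        double budgetMinutes = windowMinutes * (1 - slo);
        // Roughly 43 minutes of acceptable unavailability per month
        System.out.printf("Error budget: %.1f minutes%n", budgetMinutes);
    }
}
```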

Now, what does that do to people? Well, one, it shines spotlights in places that some people may not want them shined, but it does do that. That's what the universal language does. It's not enough just to have the data. You have to have effective access to it. You have to have effective ownership of it. And doing that surfaces conversations that can initially be quite painful. There are lots of people, especially in sufficiently large organizations, that have been just getting by, flying under the radar, and it does make that quite challenging.

The other thing that it does: some people, it makes them feel very vulnerable, because they feel like they're being turned into KPIs. They're not. We're not measuring their performance by whether they miss their error budget. When I was there, no one would get fired. We'd sit down and go, "Hey, you missed your error budget. What can we do here? What's wrong? What are the barriers?" But it actually made some people feel very nervous and very uncomfortable, and they didn't like it. Other people thrived and loved it; it became a target. "How much can we beat our budget by this month? How low can we get it?"

Metrics create behaviors [10:53]

So the two things I would say about the big sweeping changes in behavior: it's that famous phrase, "Build me a metric and I'll show you a behavior". If you measure somebody, well, human behavior is what they call a type two chaotic system.

By measuring it, you change it. And it’s crazy in the first place. So as soon as you introduce those metrics, you have to be extremely cognizant of what happens to dynamics between teams and within teams. Teams become competitive. Teams begin to look at other teams and wonder, “How the hell are they doing that? How is their error budget so low? What’s going on?” Other teams maybe in an effort to improve their metrics artificially will start to lower their deployment frequency and scrutinize every single thing. So while their operational metrics look amazing, their delivery is actually getting worse, and all these various different things that go on. So that competitiveness driven by uncertainty and vulnerability is a big thing that happens across teams.

The other thing that I found is that the really great leaders, the really brilliant leaders love it. Oh, in fact, all leadership love it. All leadership love higher visibility. The great leaders see that higher visibility and go, “Amazing. Now I can help. Now I can actually get involved in some of these conversations that would’ve been challenging before”.

The slightly more, let’s say worrying leaders will see this as a rod with which to beat the engineers. And that is something that you have to be extremely careful of. Surfacing metrics and being very forthright about the truth and being kind of righteous about it all is great and it’s probably the best way to be. But the consequence is that a lot of people can be treated not very well if you have the wrong type of leadership in place, who see these measurements as a way of forcing different behaviors.

And so, it all has to be done in good faith. It all has to be done based on the premise that everybody is doing their best. And if you don’t start from that premise, it doesn’t matter how good your measurements are, you’re going to be in trouble. Those are the learnings that I took from when I rolled it out and some of the things that I saw across an organization. It was largely very positive though. It just took a bit of growing pains to get through.

Shane Hastie: So digging into the psychological safety that we’ve heard about and known about for a couple of decades now.

Chris Cooney: Yes. Yes.

Shane Hastie: We’re not getting it right.

Enabling psychological safety is still a challenge [12:59]

Chris Cooney: No, no. My experience when I first got into reading about it was with things like Google's Project Aristotle. And my first attempt at educating an organization on psychological safety was at a company that had this extremely long, extremely detailed incident management review, where if something goes wrong, they have, we're talking a 200-person, sometimes several-day, on the low end five or six hour, deep review of everything. Everyone bickers and argues and points fingers at each other. And there's an enormous document produced, it's filed away, and nobody ever looks at it again, because who wants to read those things? It's just a historical text about bickering between teams.

And what I started to do is I said, "Well, why don't we trial more of a blameless post-mortem method? Let's just give that a go and we'll see what happens". So the first time I did it, the meeting went from, they said, about six hours for the last one before it, down to about 45 minutes. I started the meeting by giving a five-minute briefing on why this post-mortem has to be blameless: the aviation industry and the learnings that came from it, that if you hide mistakes, they only get worse. We have to create an environment where you're okay to surface mistakes. Just that five-minute primer and then about a 40-ish-minute conversation. And we had a document that was more thorough, more detailed, more fact-based, and more honest than any incident review that I had ever read before that.

So rolling that out across organizations was really, really fun. But then, I saw it go the other way, where they'd start saying, "Well, it's psychologically safe", and it turned into this almost hippie love-in, where nobody's done anything wrong. There is no such thing as a mistake. And no, that's not the point. The point is that we all make mistakes, not that they don't exist. And we don't point blame in a malicious way, but we can attribute a mistake to somebody. You just can't do it by… And the language in some of these post-mortem documents that I was reading was so indirect. "The system, post a software change, began to fail, blah, blah, blah, blah, blah". Because they're desperately trying not to name anybody or name any teams or say that an action occurred. It was almost like the system was just running along and then the vibrations from the universe knocked it out of whack.

And actually, when you got into it, one of the team pushed a code change. It’s like, “No. Team A pushed a code change. Five minutes later there was a memory leak issue that caused this outage”. And that’s not blaming anybody, that’s just stating the fact in a causal way.

So the thing I learned with that is whenever you are teaching about blameless post-mortems and psychological safety, it's crucial that you don't lose the relationship between cause and effect. You have to show cause A, effect B, cause B, effect C, and so on. Everything has to be linked in that way, in my opinion, because that forces them to say, "Well, yes. We did push this code change, and yes, it looks like it did cause this".

That will be the thing, I think, where most organizations get tripped up: they really go all in on psychological safety. "Cool, we're going to do everything psychologically safe. Everyone's going to love it". And they throw the baby out with the bathwater, as it were. They miss the point, which is to get to the bottom of an issue quickly, not to avoid hurting anybody's feelings, which is a mistake that people often make, I think, especially in large organizations.

Shane Hastie: Circling back around to one of the comments you made earlier on. The agile backlash, what’s going on there?

Exploring the agile backlash [16:25]

Chris Cooney: I often try to talk about larger trends rather than my own experience, purely because anecdotal experience is only useful as an anecdote. So this is an anecdote, but I think it’s a good indication of what’s going on more broadly. When I was starting out, I was a mid-level Java engineer, and this was when agility was really starting to take hold in some of these larger companies and they started to understand the value of it. And what happened was we were all working from the Agile principles. We were regularly reading the Agile Manifesto.

We had a coach called Simon Burchill who was and is absolutely fantastic, completely, deeply, deeply understands the methodology and the point of agility without getting lost in the miasma of various different frameworks and planning poker cards and all the rest of it. And he was wonderful at it, and I was very, very fortunate to study under him in that respect because it gave me a really good, almost pure perspective of agile before all of the other stuff started to come in.

So what happened to me was that we were delivering work, and if we went even a week over budget or a week overdue, the organization would say, “Well, isn’t agile supposed to speed things up?” And it’s like, “Well, no, not really. It’s more that we had a working product six weeks ago, eight weeks ago, and you chose not to go live with it”. Which is fine, but that’s what you get with the agile process. You get working software much earlier, which gives you the opportunity to go live if you get creative with how you productionize it or turn it into a product.

So that was the first thing, I think. One of the seeds of the backlash is a fundamental misunderstanding about what Agile is supposed to be doing for you. It’s not to get things done faster, it’s to just incrementally deliver working software so you have a feedback loop and a conversation that’s going on constantly. And an empirical learning cycle is occurring, so you’re constantly improving the software, not build everything, test everything, deploy it, and find out it’s wrong. That’s one.

The other thing I will say is what I see on Twitter a lot now, or X they call it these days, is the Agile Industrial Complex, which is a phrase that I’ve seen batted around a lot, which is essentially organizations just selling Scrum certifications or various different things that don’t hold that much value. That’s not to say all Scrum certifications are useless. I did one, it was years and years ago, I forget the name of the chap now. It was fantastic. He gave a really, really great insight into Scrum, for example, why it’s useful, why it’s great, times when it may be painful, times when some of its practices can be dropped, the freedom you’ve got within the Scrum guide.

One of the things that he said to me that always stuck with me, this is just an example of a good insight that came from an Agile certification was he said, “It’s a Scrum guide, not the Scrum Bible. But it’s a guide. The whole point is to give you an idea. You’re on a journey, and the guide is there to help you along that journey. It is not there to be read like a holy text”. And I loved that insight. It really stuck with me and it definitely informed how I went out and applied those principles later on. So there is a bit of a backlash against those kinds of Agile certifications because as is the case with almost any service, a lot of it’s good, a lot of it’s bad. And the bad ones are pretty bad.

And then, the third thing I will say is that an enormous amount of power was given to Agile coaches early on. They were almost like the high priests and they were sort of put into very, very senior positions in an organization. And like I said, there are some great Agile coaches. I’ve had the absolute privilege of working with some, and there were some really bad ones, as there are great software engineers and bad software engineers, great leaders and poor leaders and so on.

The problem is that those coaches were advising very powerful people in organizations. And if you’re giving bad advice to very powerful people, the impact of that advice is enormous. We know how to deal with a bad software engineering team. We know how to deal with somebody that doesn’t want to write tests. As a software function, we get that. We understand how to work around that and solve that problem. Sometimes it’s interpersonal, sometimes it’s technical, whatever it is, we know how to fix it.

We have not yet figured out this sort of grand vizier problem: there is somebody there giving advice to the king who doesn’t really understand what they’re talking about, and the king is just taking them at their word. And that’s what happened with Agile. That, I think, is one of the worst things we could have done: starting to take the word of people as if they were these experts in Agile and blah, blah, blah. It’s ultimately software delivery. That’s what we’re trying to do. We’re trying to deliver working software. And if you’re going to give advice, you’d really better deeply understand delivery of working software before you go on about interpersonal things and that kind of stuff.

So those are the three things I think have driven the backlash. And now there’s just this fatigue around the word Agile. Like I say, I had the benefit of going to conferences and seeing the word Agile everywhere. When I first started speaking, you couldn’t go to a conference where the word Agile wasn’t there, and now it is less and less prevalent, and people talk more about things like continuous delivery, just to avoid saying the word Agile. Because the fatigue is more around the word than it is around the principles.

And the last thing I’ll say is there is no backlash against the principles. The principles are here to stay. It’s just software engineering now. What would’ve been called Agile 10 years ago is just how we build working software now. It’s so deeply ingrained in how we think that we believe we’re pushing back against Agile. We’re not. We’re pushing back against a few words. The core principles are part of software engineering now, and they’re here to stay for a very long time, I suspect.

Shane Hastie: How do we get teams aligned around a common goal and give them the autonomy that we know is necessary for motivation?

Make it easy to make good decisions [21:53]

Chris Cooney: Yes. I have just submitted a conference talk on this, and I won’t say too much at the risk of jeopardizing our submission, but the broad idea is this. I was in a position where I had 20-something teams, and the wider organization was hundreds of teams. And we had a big problem, which was that every single team had been raised on this idea of, “You pick your tools, you run with it. You want to use AWS, you want to use GCP, you want to use Azure? Whatever you want to use”.

And then after a while, obviously the bills started to roll in and we started to see that actually this is a rather expensive way of running an organization. And we started to think, “Well, can we consolidate?” So we said, “Yes, we can consolidate”. And a working group went off, picked a tool, bought it, and then went to the teams and said, “Thou shalt use this”, and nobody listened. And then we kind of went back to the drawing board and they said, “Well, how do we do this?” And I said, “This tool was never picked by them. They don’t understand it, they don’t get it. And they’re stacking up migrating to this tool against all of the deliverables they’re responsible for”. So how do you make it so that teams have the freedom and autonomy to make effective decisions, meaningful decisions about their software, but it’s done in a way that there is a golden path in place such that they’re all roughly moving in the same direction?

A project we started to build out within Sainsbury’s was completely re-platforming the entire organization. It’s still going on now. But hundreds and hundreds of developers have been migrating onto this platform. It was a team I was part of. When it started, I was in Manchester in the UK, and we originally called it the Manchester PaaS, Platform as a Service. I don’t know if you know this, but the bumblebee is one of the symbols of Manchester, so it had a little bumblebee in the UI. It was great. We loved it. And we built it using Kubernetes, with Jenkins for CI/CD, purely because Jenkins was big in the office at the time. It isn’t anymore. Now it’s GitHub Actions.

And what we said was, “For every team in Manchester, every single resource has to be tagged so we know who owns what. Every single time there’s a deployment, we need some way of seeing what it was and what went into it”. And some periods of the year are extremely busy and extremely serious, and you have to do additional change notifications in different systems. For a grocer, the Christmas period is like that: Sainsbury’s does an enormous amount of trade between, let’s say, November and January. During that period, teams have to raise additional change requests, but they’re doing 30, 40 commits a day, so they can’t be expected to fill in those forms every single time. So we wondered whether we could automate that for them.

And what I realized was, “Okay, this platform is going to make the horrible stuff easy and it’s going to make it almost invisible; not completely invisible because they still have to know what’s going on, but it has to make it almost invisible”. And by making the horrible stuff easy, we incentivize them to use the platform in the way that it’s intended. So we did that and we onboarded everybody in a couple of weeks, and it took no push whatsoever.

We had product owners coming to us about one team that had just started; the goal of their very first sprint was to have a working API and a working UI. The team delivered that, just by using our platform, because we made a lot of this stuff easy. We had dashboard generation, we had alert generation, we had metric generation, because we were using Kubernetes and Istio, so we got a ton of HTTP service metrics off the bat. Tracing was built in there.

So in their sprint review at the end of the two weeks, they showed the feature they had built. Cool. “Oh, by the way, we’ve done all of this”, and it was an enormous amount of dashboards and things like that. “Oh, by the way, the infrastructure is completely scalable, it has multi-AZ failover, there’s no productionizing left. It’s already production ready”. The plan was to go live in months. They went live in weeks after that. It changed the conversation, and that was when things really started to take off and ended up in the new project now, which is across the entire organization.

The reason I told that story is because you have to have give and take. If you try to do it like an edict, a top-down edict, your best people will leave and your worst people will try to work through it. Because the best people want to be able to make decisions and have autonomy. They want to have a sense of ownership of what they’re building. “Skin in the game” is the phrase that’s often bandied around.

And so, how do you give engineers the autonomy? You build a platform, you make it highly configurable, highly self-service. You automate all the painful bits of the organization, for example, compliance, change request notifications, data retention policies, and all that. You automate that to the hilt so that all they have to do is declare some config in a repository and it just happens for them. And then you make it so the golden path, the right path, is the easy path. And that’s it. That’s the end of the conversation. If you can do that, if you can deliver that, you are in a great space.

If you try to do it as a top-down edict, you will feel a lot of pain and your best people will probably leave you. If you do it as a collaborative effort so that everybody’s on the same golden path, every time they make a decision, the easy decision is the right one, it’s hard work to go against the right decision. Then you’ll incentivize the right behavior. And if you make some painful parts of their life easy, you’ve got the carrot, you’ve got the stick, you’re in a good place. That’s how I like to do it. I like to incentivize the behavior and let them choose.
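
As an aside (this is not from the interview itself), here is a minimal Python sketch of the kind of declarative, repository-level config Cooney describes: a team declares its ownership tag and change-freeze window once, and the platform turns each deployment event into the required change-request payload automatically. Every name, field, and date here is hypothetical.

```python
from datetime import date

# Hypothetical config a team commits to its repository; the platform reads it.
team_config = {
    "team": "team-a",
    "owner_tag": "team-a@example.com",
    "change_freeze": {"start": date(2024, 11, 1), "end": date(2025, 1, 15)},
}

def change_request_for(deployment: dict, config: dict) -> dict | None:
    """Return a change-request payload if the deployment falls inside a freeze window."""
    freeze = config["change_freeze"]
    if not (freeze["start"] <= deployment["date"] <= freeze["end"]):
        return None  # outside the busy period: no extra paperwork needed
    return {
        "requested_by": config["owner_tag"],
        "service": deployment["service"],
        "commit": deployment["commit"],
        "summary": f"Automated change request for {deployment['service']} by {config['team']}",
    }

# Example deployment event emitted by CI (hypothetical shape).
print(change_request_for(
    {"service": "checkout-api", "commit": "abc123", "date": date(2024, 12, 3)},
    team_config,
))
```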

Shane Hastie: Thank you so much. There’s some great stuff there, a lot of really insightful ideas. If people want to continue the conversation, where do they find you?

Chris Cooney: If you open up LinkedIn and type Chris Cooney, I’ve been reliably told that I am the second person in the list. I’m working hard for number one, but we’ll get there. If I don’t come up, search for Chris Cooney Coralogix or Chris Cooney Observability, anything like that, and I will come up. And I’m more than happy to answer any questions. LinkedIn is usually where I’m most active, especially for work-related topics.

Shane Hastie: Cool. Chris, thank you so much.

Chris Cooney: My pleasure. Thank you very much for having me.



AWS CodeBuild Adds Support for Managed GitLab Runners

MMS Founder
MMS Aditya Kulkarni

Article originally posted on InfoQ. Visit InfoQ

Recently, AWS CodeBuild introduced support for managed GitLab self-hosted runners, advancing its continuous integration and continuous delivery (CI/CD) capabilities. This new feature allows customers to configure their CodeBuild projects to receive GitLab CI/CD job events and execute them directly on CodeBuild’s ephemeral hosts.

The integration offers several key benefits including Native AWS Integration, Compute Flexibility and Global Availability. GitLab jobs can now seamlessly integrate with AWS services, leveraging features such as IAM, AWS Secrets Manager, AWS CloudTrail, and Amazon VPC. This integration enhances security and convenience for the users.

Furthermore, customers gain access to all compute platforms offered by CodeBuild, including Lambda functions, GPU-enhanced instances, and Arm-based instances. This flexibility allows for optimized resource allocation based on specific job requirements. The integration is available in all regions where CodeBuild is offered.

To implement this feature, users need to set up webhooks in their CodeBuild projects and update their GitLab CI YAML files to utilize self-managed runners hosted on CodeBuild machines.

The setup process involves connecting CodeBuild to GitLab using OAuth, which requires additional permissions such as create_runner and manage_runner.

It’s important to note that CodeBuild will only process GitLab CI/CD pipeline job events if a webhook has filter groups containing the WORKFLOW_JOB_QUEUED event filter. The buildspec in CodeBuild projects will be ignored unless buildspec-override:true is added as a label, as CodeBuild overrides it to set up the self-managed runner.
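
As a rough illustration (not taken from the article), the sketch below uses Python and boto3 to create such a webhook on an existing CodeBuild project. The project name is hypothetical, and the exact filter group syntax should be verified against the current AWS documentation.

```python
import boto3

codebuild = boto3.client("codebuild", region_name="us-east-1")

# Hypothetical project already configured with a GitLab source via OAuth.
response = codebuild.create_webhook(
    projectName="gitlab-runner-project",
    buildType="BUILD",
    # CodeBuild only processes GitLab CI/CD job events when a filter group
    # contains the WORKFLOW_JOB_QUEUED event filter.
    filterGroups=[
        [
            {"type": "EVENT", "pattern": "WORKFLOW_JOB_QUEUED"},
        ]
    ],
)
print(response["webhook"]["url"])
```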

When a GitLab CI/CD pipeline run occurs, CodeBuild receives the job events through the webhook and starts a build to run an ephemeral GitLab runner for each job in the pipeline. Once the job is completed, the runner and associated build process are immediately terminated.

As an aside, GitLab has been in the news since earlier this year with its plan to introduce CI Steps, which are reusable and composable pieces of a job that can be referenced in pipeline configurations. These steps will be integrated into the CI/CD Catalog, allowing users to publish, unpublish, search, and consume steps similarly to how they use components.

Moreover, GitLab is working on providing users with better visibility into component usage across various project pipelines. This will help users identify outdated versions and take prompt corrective actions, promoting better version control and project alignment.

AWS CodeBuild has been in the news as well, having added support for Mac builds: engineers can build artifacts on managed Apple M2 instances running macOS 14 Sonoma. A few weeks ago, AWS CodeBuild enabled customers to configure automatic retries for their builds, reducing manual intervention upon build failures. It has also added support for building Windows Docker images in reserved fleets.

Such developments demonstrate the ongoing evolution of CI/CD tools and practices, with a focus on improving integration, flexibility, and ease of use for DevOps teams.



Presentation: Improving Developer Experience Using Automated Data CI/CD Pipelines

MMS Founder
MMS Noemi Vanyi Simona Pencea

Article originally posted on InfoQ. Visit InfoQ

Transcript

Pencea: I bet you’re wondering what this picture is doing at a tech conference. These are two German academics. They started to build a dictionary, but they actually became famous because, along the way, they collected a lot of folk stories. The reason they are here is partly because they were my idols when I was a child. I thought there was nothing better to do than to listen to stories and collect them. My family still makes fun of me because I ended up in tech after that. The way I see it, it’s not such a big difference. Basically, we still collect folk stories in tech, but we don’t call them folk stories, we call them best practices. Or we go to conferences to learn about them, basically to learn how other people screwed up, so that we don’t do the same. After we collect all these stories, we put them all together. We call them developer experience, and we try to improve that. This brings us to the talk that we have, improving developer experience using automated data CI/CD pipelines. My name is Simona Pencea. I am a software engineer at Xata.

Ványi: I’m Noémi Ványi. I’m also a software engineer at the backend team of Xata. Together, we will be walking through the data developer experience improvements we’ve been focused on recently.

Pencea: We have two topics on the agenda. The first one, testing with separate data branches covers the idea that when you create a PR, you maybe want to test your PR using a separate data environment that contains potentially a separate database. The second one, zero downtime migrations, covers the idea that we want to improve the developer experience when merging changes that include schema changes, without having any downtime. Basically, zero downtime migrations. For that, we developed an open-source tool called pgroll. Going through the first one, I will be covering several topics. Basically, I will start by going through the code development flow that we focused on. The testing improvements that we had in mind. How we ensured we have data available in those new testing environments. How we ensured that data is safe to use.

Code Development Workflow

This is probably very familiar to all of you. It’s basically what I do every day. When I’m developing, I’m starting with the local dev. I have my local dev data. It’s fake data. I’m trying to create a good local dev dataset when I’m testing my stuff. I’m trying to think about corner cases, and cover them all. The moment I’m happy with what I have in my local dev, I’m using the dev environment. This is an environment that is shared between us developers. It’s located in the cloud, and it also has a dataset. This is the dev dataset. This is also fake data, but it’s crowdsourced from all the developers that use this environment. There is a chance that you find something that is not in the local dev. Once everything is tested, my PR is approved. I’m merging it. I reach staging.

In staging, there is another dataset which is closer to the real life, basically, because it’s from beta testing or from demos and so on. It’s not the real thing. The real thing is only in prod, and this is the production data. This is basically the final test. The moment my code reaches prod, it may fail, even though I did my best to try with everything else. In my mind, I would like to get my hands on the production dataset somehow without breaking anything, if possible, to test it before I reach production, so that I minimize the chance of bugs.

Data Testing Improvements – Using Production Data

This is what led to this question: can we use production data for testing? We’ve all received those emails sometimes that say, test email, and I wasn’t a test user. Production data would bring a lot of value when used for testing. If we go through the pros, the main thing is, it’s real data. It’s what real users created. It’s basically the most valuable data we have. It’s also large. It’s probably the largest dataset you have, if we don’t count load-test-generated data and so on. It’s fast in the sense that you don’t have to write a script or populate a database. It’s already there, you can use it. There are cons to this. There are privacy issues. It’s production data: there’s private information, private health information. I probably don’t even have permission from my users to use the data for testing. Or am I storing it in the right place? Is the storage configured with the right settings so that I’m not breaking GDPR or some other privacy law?

Privacy issues are a big con. The second thing, as you can see, is that large is also a con, because a large dataset does not mean a complete dataset. Normally, most users will use your product in the most common way, and then you’ll have some outliers which give you the weird bugs and so on. Having a large dataset while testing may prevent you from seeing those corner cases, because they are buried under the common ones. Refreshing takes time because of the size. Basically, if somebody changes the data with another PR or something, you need to refresh everything, and that takes longer than if you have a small subset. Also, because of another PR, you can get into data incompatibility. Basically, you can get into a state where your test breaks, but it’s not because of your PR. It’s because something broke, or something changed, and now it’s not compatible anymore.

If we look at the cons, there are basically two categories we can take from them. The first is related to data privacy. The second is related to the size of the dataset. That gives us our requirements. The first one would be: we would like to use production data, but in a safe way, and, if possible, fast. Since we want to do a CI/CD pipeline, let’s make it automated. I don’t want to run a script by hand or something. Let’s have the full experience. Let’s start with the automated part. It’s very hard to cover all the ways software developers work. What we did first was to target a simplification, considering GitHub as a standard workflow, because the majority of developers will use GitHub. One of the things GitHub gives you is a notification when a PR gets created. Our idea was, we can use that notification, we can hook into it. Then we can create what we call a database branch, which is basically a separate database, but with the same schema as the source branch, when a GitHub PR gets created. Then, after creation, you can copy the data into it. Having this in place would give you the automation part of the workflow.

Let’s see how we could use the production data. We said we want to have a fast copy and also have it complete. I’ll say what that means. Copying takes time. There is no way around it. You copy data, it takes time. You can hack it. Basically, you can have a preemptive copy. You copy the data before anyone needs it, so when they need it, it’s already there. Preemptive copying means I will just have a lot of datasets around, just in case somebody uses them, and then I have to keep everything in sync. That didn’t really fly with us. We can do Copy on Write, which basically means you copy at the last minute before data is actually used, so before that, all the pointers point to the old data. The problem with using Copy on Write for this specific case is that it did not leave us any way in which we could change the data to make it safe. If I Copy on Write, it’s basically the same data. I will just change the one that I’m editing, but the rest of it is the same one.

For instance, if I want to anonymize an email or something, I will not be able to do it with Copy on Write. Then you have the boring option, which is, basically, you don’t copy all the data, you just copy a part of the data. This is what we went for, even though it was boring. Let’s look at the second thing. We wanted to have a complete dataset. I’ll go back a bit and consider the case of a relational database where you have links as a data type. Having a complete dataset means all the links will be resolved inside of this dataset. If I copy all the data, that’s obviously a complete dataset, but if I copy a subset, there is no guarantee it will be complete unless I make it so. The problem with building a complete dataset by following the links is that it sounds like an NP-complete problem, and that’s because it is an NP-complete problem. If I want to copy a subset of a bigger dataset and find one of a certain size, I would actually need to find all the subsets that respect that rule and then select the best one. That would take a lot of time. In our case, we did not want the best dataset that has exactly the size we have in mind. We were happy with something around that size. In that case, we can just go with the first dataset we can construct that satisfies this completeness property, with all the links resolved, and is roughly the right size.

Data Copy (Deep Dive)

The problem with constructing this complete subset is: where do we start? How do we advance? How do we know we got to the end, basically? The where-do-we-start part is solvable if we think about the relationships between the tables as a graph and then apply a topological sort on it. We list the tables based on their degrees of independence. In this case, this is an example. t7 is the most independent, then we have t1, t2, t3, and we can see that if we remove these two, the degrees of independence for t2 and t3 immediately increase because the links are gone. We have something like that. Then we go further up. Here we have the special case of a cycle, because you can point back with links to the same table that pointed to you. In this case, we can break the cycle, because going back we see the only way to reach this cycle is through t5.

Basically, we need to first reach t5 and then t6. This is what I call the anatomy of the schema. We can see this is the order in which we need to go through the tables when we collect records. In order to answer the other two questions, things get a bit more complicated, because the schema is not enough. The problem with the schema not being enough for these cases is because, first of all, it will tell you what’s possible, but it doesn’t have to be mandatory, unless you have a constraint. Usually, a link can also be empty. If you reach a point where you run into a link that points to nothing, that doesn’t mean you should stop. You need to go and exhaustively find the next potential record to add to the set. Basically, if you imagine it in 3D, you need to project this static analysis that we did on individual rows. The thing that you cannot see through the static analysis from the beginning is that you can have several records from one table pointing to the same one in another table. The first one will take everything with it, and the second one will bring nothing.

Then you might be tempted to stop searching, because you think, I didn’t make any progress, so the set is complete, which is not true. You need to exhaustively look until the end of the set. These are just a few of the things that, when building this, need to be kept in the back of the mind, basically. We need to always allow a full cycle before determining that no progress was made. When we select the next record, we should consider the fact that it might already have been brought into the set, and we shouldn’t necessarily stop at that point.
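
To make the ordering step concrete, here is a minimal Python sketch (an illustration, not Xata’s implementation) of a topological sort over foreign-key links: tables with no outgoing links come first, and any leftover tables, such as the t5/t6 cycle, are appended once the caller decides where to break the cycle. The link map below is hypothetical.

```python
from collections import deque

# Hypothetical schema: table -> set of tables it links to via foreign keys.
links = {
    "t1": {"t7"}, "t2": {"t1"}, "t3": {"t1"},
    "t4": {"t2", "t3"}, "t5": {"t6"}, "t6": {"t5"}, "t7": set(),
}

def copy_order(links: dict[str, set[str]]) -> list[str]:
    """Order tables from most independent to most dependent (Kahn's algorithm)."""
    out_degree = {t: len(deps) for t, deps in links.items()}
    dependents = {t: set() for t in links}
    for table, deps in links.items():
        for dep in deps:
            dependents[dep].add(table)
    queue = deque(t for t, deg in out_degree.items() if deg == 0)
    order = []
    while queue:
        table = queue.popleft()
        order.append(table)
        for dep in dependents[table]:
            out_degree[dep] -= 1
            if out_degree[dep] == 0:
                queue.append(dep)
    if len(order) < len(links):
        # Remaining tables form a cycle (e.g. t5 <-> t6); append them so the
        # caller can decide where to break the cycle when collecting records.
        order.extend(t for t in links if t not in order)
    return order

print(copy_order(links))  # e.g. ['t7', 't1', 't2', 't3', 't4', 't5', 't6']
```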

We talked at the beginning about how we want to have this production data, but have it safe to use. This is the last step, one we are still working on. It is a bit more fluffy. The problem with masking the data is that, for some fields, you know exactly what they are. If it’s an email, then, sure, it’s private data. What if it’s free text, then what? If it’s free text, you don’t know what’s inside. The assumption is it could be private data. The approach here was to provide several possibilities for how to redact data and allow the user to choose, because the user has the context and should be able to select based on the use case. The idea with, for instance, a full redaction or a partial redaction is that, sure, you can apply it, but it will break your aggregations.

For instance, say I have an aggregation by username, like my Gmail address, and I want to know how many items are assigned to my email address. If I redact the username so it becomes something like **@gmail.com, then I get aggregations across every Gmail address that has items in my table. The most complete option would be a full transformation. The problem with a full transformation is that it takes up a lot of memory, because you need to keep the mapping between the initial item and the changed item. Depending on the use case, you might not need this, because it’s more complex to maintain. Of course, if there is a field that has sensitive data and you don’t need it for your specific test case, you can just remove it. The problem with removing a field is that it would basically mean you’re changing the schema, so you’re doing a migration, and that normally causes issues. In our case, we have a solution for the migrations, so you can feel free to use it.
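
As an illustration of those options (a sketch, not Xata’s actual code), the Python below shows full redaction, partial redaction that keeps only the domain, and a full transformation that keeps a mapping so per-user aggregations still line up; the mapping dict is the memory cost mentioned above.

```python
import hashlib

def full_redaction(value: str) -> str:
    """Replace the whole value; aggregations on it become meaningless."""
    return "***"

def partial_redaction(email: str) -> str:
    """Keep only the domain, e.g. 'alice@gmail.com' -> '**@gmail.com'."""
    _, _, domain = email.partition("@")
    return f"**@{domain}"

class FullTransformation:
    """Deterministically replace values, remembering the mapping (costs memory)."""
    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}

    def __call__(self, value: str) -> str:
        if value not in self.mapping:
            digest = hashlib.sha256(value.encode()).hexdigest()[:12]
            self.mapping[value] = f"user_{digest}@example.com"
        return self.mapping[value]

transform = FullTransformation()
print(partial_redaction("alice@gmail.com"))  # **@gmail.com
print(transform("alice@gmail.com"))          # stable pseudonym, so per-user counts still work
```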

Zero Downtime Migrations

Ványi: In this section of the presentation, I would like to talk about, what do we mean by zero downtime. What challenges do we face when we are migrating the data layer? I will talk about the expand-contract pattern and how we implemented it in PostgreSQL. What do I mean when I say zero downtime? It sounds so nice. Obviously, downtime cannot be zero because of physics, but the user can perceive it as zero. They can usually tolerate around 20 milliseconds of latency. Here I talk about planned maintenance, not service outages. Unfortunately, we rarely have any control over service outages, but we can always plan for our application updates.

Challenges of Data Migrations

Let’s look at the challenges we might face during these data migrations. Some migrations require locking, unfortunately. These can be table-level read/write locks, meaning no one can access the table: they cannot read, they cannot write. For high-availability applications, that is unacceptable. Other migrations rely on less restrictive locks; those are a bit better and we can live with them, but it is still something we want to avoid. Also, when there is a data change, we obviously have to update the application as well, and the new instance has to start and run basically at the same time as the old application. This means that the database we are using has to be in two states at the same time. Because there are two application versions interacting with our database, we must make sure, for example, if we introduce a new constraint, that it is enforced on both the existing records and the new data.

Based on these challenges, we can come up with a list of requirements. The database must serve both the old schema and the new schema to the application, because we are running the old application and the new application at the same time. Schema changes are not allowed to block database clients, meaning we cannot allow our applications to be blocked because someone is updating the schema. The data integrity must be preserved. For example, if we introduce a new data constraint, it must be enforced on the old records as well. When we have different schema versions live at the same time, they cannot interfere with each other. For example, when the old application is interacting with the database, we cannot yet enforce the new constraints, because it would break the old application. Finally, as we are interacting with two application versions at the same time, we must make sure that the data is still consistent.

Expand-Contract Pattern

The expand-contract pattern can help us with this. It can minimize downtime during these data migrations. It consists of three phases. The first phase is expand. This is the phase when we add new changes to our schema. We expand the schema. The next step is migrate. That is when we start our new application version. Maybe test it. Maybe we feel lucky, we don’t test it at all. At this point, we can also shut down the old application version. Finally, we contract. This is the third and last phase. We remove the unused and the old parts from the schema. This comes with several benefits.

In this case, the changes do not block the client applications, because we constantly add new things to the existing schema. The database has to be forward compatible, meaning it has to support the new application version, but at the same time, it has to support the old application version, so the database is both forward and backwards compatible with the application versions. Let’s look at a very simple example, renaming a column. It means that here we have to create the new column, basically with a new name, and copy the contents of the old column. Then we migrate our application and delete our column with the old name. It’s very straightforward. We can deploy this change using, for example, the blue-green deployments. Here, the old application is still live, interacting with our table through the old view. At the same time, we can deploy our new application version which interacts through another new view with the same table. Then we realize that everything’s passing. We can shut down the old application and remove the view, and everything just works out fine.

Implementation

Let’s see how we implemented in Postgres. First, I would like to explain why we chose PostgreSQL in the first place. Postgres is well known, open source. It’s been developed for 30 years now. The DDL statements are transactional, meaning, if one of these statements fail, it can be rolled back easily. Row level locking. They mostly rely on row level locking. Unfortunately, there are a few read, write locks, but we can usually work around those. For example, if you are adding a nonvolatile default value, the table is not rewritten. Instead, the value is added to the metadata of the table. The old records are updated when the whole record is updated. It doesn’t really work all the time. Let’s look at the building blocks that Postgres provides. We are going to use three building blocks, DDL statements, obviously, to alter the schema.

Views, to expose the different schema versions to the different application versions. Triggers and functions to migrate the old data and, on failure, to roll back the migrations. Let’s look at a slightly more complex example. We have an existing column, and we want to add a NOT NULL constraint to it. It seems simple, but it can be tricky, because Postgres does a table scan, meaning it locks the table and no one can update or read it while it goes through all of the records and checks whether any of the existing records violate the NOT NULL constraint. If it finds a record that violates the constraint, the statement returns an error, unfortunately. We can work around it: if we add NOT VALID to the constraint, the table scan is skipped. Here we add the new column, give it a NOT NULL constraint, and mark it NOT VALID, so we are not blocking the database clients.

We also create triggers that move the old values over from the old column. It is possible that some of the old records don’t yet have values, and in that case we need to add some default or backfill value, and then we migrate our app. We need to complete the migration, obviously: we clean up the trigger, the view we added so the applications could interact with the table, and the old column. Also, we must remember to remove NOT VALID from the original constraint. We can do that because the migration moved the old values over, so we know all the new values are there and every record satisfies the constraint.
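
To make that trick concrete, here is a simplified sketch of the underlying SQL, driven from Python with psycopg2. It illustrates the technique rather than pgroll’s exact implementation, and the table, column, constraint names, and connection string are hypothetical.

```python
import psycopg2

# Hypothetical connection string; each statement runs as its own short transaction.
conn = psycopg2.connect("dbname=app")
conn.autocommit = True
cur = conn.cursor()

# Expand: add the constraint without scanning existing rows, so no long table lock.
cur.execute(
    "ALTER TABLE users ADD CONSTRAINT users_description_not_null "
    "CHECK (description IS NOT NULL) NOT VALID"
)

# Backfill rows that would violate the constraint (in production, do this in batches).
cur.execute(
    "UPDATE users SET description = 'description for ' || name "
    "WHERE description IS NULL"
)

# Contract: validate once every row satisfies the constraint; this scans the table
# but takes a weaker lock, so normal reads and writes keep flowing.
cur.execute("ALTER TABLE users VALIDATE CONSTRAINT users_description_not_null")

cur.close()
conn.close()
```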

It all seemed quite tedious to do this every time, and that’s why we created pgroll. It’s an open-source command line tool, and it is written in Go, so you can also use it as a library. It is used to manage safe and reversible migrations using the expand-contract pattern. I would like to walk you through how to use it. Basically, pgroll runs against a Postgres instance, so you need one running somewhere. After you have installed and initialized it, you can start creating your migrations. You define migrations using JSON files; I will show you an example. Once you have your migration, you run the start command. It creates a new view, and you can interact with it through your new application. You can test it, and then you can shut down your old application. You run the complete command, and pgroll removes all of the leftover views and triggers for you. This is the JSON example I was just talking about.

Let’s say that we have a users table that has an ID field, a name field, and a description, and we want to make sure that the description is always there, so we put a NOT NULL constraint on it. In this case, you have to define a name for the migration; it will be the name of the view, or rather the schema, in Postgres. We define a list of operations. We are altering a column. The table is obviously named users, the column is description, and we no longer allow null values in it. This is the interesting part: the up migration. It contains what to do when we are migrating the old values. In this case, it means that if the description is missing, we add the text “description for” followed by the name; if the data is there, we just move it to the new column. The down migration defines what to do when there is an error and we want to roll back. In this case, we keep the old value, meaning, if the value was missing, it’s null, and if there was something, we keep it.
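
Here is a best-effort reconstruction of that migration file, generated from Python for illustration; the exact operation and field names should be checked against the pgroll documentation.

```python
import json

# Hypothetical reconstruction of the migration described above.
migration = {
    "name": "user_description_set_nullable",
    "operations": [
        {
            "alter_column": {
                "table": "users",
                "column": "description",
                "nullable": False,
                # up: how to migrate old values into the new column
                "up": "CASE WHEN description IS NULL "
                      "THEN 'description for ' || name ELSE description END",
                # down: what to keep if the migration is rolled back
                "down": "description",
            }
        }
    ],
}

with open("user_description_set_nullable.json", "w") as f:
    json.dump(migration, f, indent=2)
```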

Here is the start command. Let’s see in psql what just happened. We have a users table with these three columns, but you can see that pgroll added a new column. Remember, there is a migration ongoing right now. In the old description column, there are records that do not yet satisfy the constraint. In the new description column, the backfill value is already there for us to use. We can inspect what schemas are in the database. We can notice that there is this create_users_table, that’s the old schema version. The new one is user_description_set_nullable, which is the name of the migration we just provided in our JSON. Let’s try to insert some values into this table. We are inserting two values. The first one is basically how the new application version behaves: the description is not empty. In the second record, we are mimicking what the old application does: here the description is NULL. Let’s say that we succeeded. We can try to query this table.

From the old app’s point of view, we can set the search path to the old schema version and perform the following query, so we can inspect what happened after we inserted these values. This is what we get back. The description for Alice is “this is Alice”, and for Bob it’s NULL, because the old application doesn’t enforce the constraint. Let’s change the search path again to the new schema version and perform the same query. Here we can see that we have the description for Alice. Notice that Bob has a description: it is the default description from the up migration we provided in the JSON file. Then we can complete the migration using the complete command, and we can see that the old schema is cleaned up. The intermediary column is also removed, and the triggers, functions, everything is removed. Check out pgroll. It’s open source. It takes care of mostly everything. There is no need to manually create new views, functions, new columns, nothing. After you complete your migrations, it cleans up after itself. It is still under development, so there are a few missing features. For example, a few migration types are missing. We do not yet support adding comments, unfortunately, or batched migrations.

Takeaways

Pencea: Basically, what we presented so far were bits and pieces of the puzzle we want to build: the CI/CD data pipeline. What we imagined when we created this was: somebody creates a PR. Then a test environment, with a test database holding valid data that’s also safe to use, gets created for them. Then the tests are run. Everything is validated, the PR is merged. Then it goes through the pipeline, and nobody has to worry about migrations, because the schema changes are handled for them.

Ványi: The migrations are done without downtime. If your pull request is merged, it goes through the testing pipeline, and if everything passes, that’s so nice. We can clean up after ourselves and remove the old schema. If there is a test failure or something is not working correctly, we can roll back anytime, because the old schema is still kept around just in case. As we just demonstrated, there is still some work left for us, but we already have some building blocks that you can integrate into your CI/CD pipeline. You can create a test database on the fly using GitHub notifications and fill it with safe and relevant data to test. You can create schema changes and merge them back and forth without worrying about data migrations. You can deploy and roll back your application without any downtime.

Questions and Answers

Participant 1: Does pgroll take care of keeping the metadata of every migration done: whether it is started, ongoing, or finished?

Ványi: Yes, there is a migrations table. Also, you can obviously store your migration files in Git if you want to version-control them, but pgroll has its own bookkeeping for past migrations.

Participant 2: For the copying of the data from production, was that for local tests, local dev, or the dev? How did you control costs around constantly copying that data, standing up databases, and tearing them back down?

Pencea: It’s usually for something that sits in the cloud, so not for the local dev.

Participant 2: How did you control cost if you’re constantly standing up a near production size database?

Pencea: What we use internally is data branching. We don’t start a new instance every time. We have a separate schema inside a bigger database. Also, what we offer right now is copying 10k of data, it’s not much in terms of storage. We figured it should be enough for testing purposes.

Participant 3: I saw in your JSON file that you can do migrations that pgroll knows about, like setting nullable to false. Can you also do pure SQL migrations?

Ványi: Yes. We don’t yet support every migration. If there is anything missing, you can always work around it by using raw SQL migrations. In this case, you can shoot yourself in the foot, because, for example, in case of NOT NULL, we take care of the skipping of the table scan for you. When you are writing your own raw SQL migration, you have to be careful not to block your table and the database access.

Participant 4: It’s always amazed me that these databases don’t do safer actions for these very common use cases. Have you ever talked to the Postgres project on improving the actual experience of just adding a new column, or something? It should be pretty simple.

Ványi: We’ve been trying to have conversations about it, but it is a very mature project, and it is somewhat hard to change such a fundamental part of this database. Constraints are like the basic building block for Postgres, and it’s not as easy to just make it more safe. There is always some story behind it.

Pencea: I think developer experience was not necessarily something that people were concerned about, up until recently. I feel like sometimes it was actually the opposite, if it was harder, you looked cooler, or you looked like a hacker. It wasn’t exactly something that people would optimize for. I think it’s something that everybody should work towards, because now everybody has an ergonomic chair or something, and nobody questions that, but we should work towards the same thing about developer experience, because it’s ergonomics in the end.

Participant 5: In a company, assuming they are adopting pgroll, all these scripts can grow in number, so at some point you have to apply all of them, I suppose, in order. Is there any sequence number, any indication, like how to apply these. Because some of them might be serial, some of them can be parallelized. Is there any plan to give direction on the execution? I’ve seen there is a number in the script file name, are you following that as a sequence number, or when you’re then developing your batching feature, you can add a sequence number inside?

Ványi: Do we follow some sequence number when we are running migrations? Yes and no. pgroll maintains its own table for bookkeeping, where it knows what the last migration was and what is coming next. The number in the file name is not only for pgroll, but also for us.

Participant 6: When you have very breaking migrations using pgroll, let’s say you need to rename a column, or even change its type, where you basically create a new column and then copy over the data, how do you deal with very large tables, say, millions of rows? Because you could end up having performance issues copying these large amounts of data.

Ványi: How do we deal with tables that are basically big? How do we make sure that it doesn’t impact the performance of the database?

For example, in case of moving the values to the new column, we are creating triggers that move the data in batches. It’s not like everything is copied in one go, and you cannot really use your Postgres database because it is busy copying the old data. We try to minimize and distribute the load on the database.

Participant 7: I know you were using the small batches to copy the records from the existing column to the new column. Once you copy all the records, only then you will remove the old column. There is a cost with that.



Using DORA for Sustainable Engineering Performance Improvement

MMS Founder
MMS Ben Linders

Article originally posted on InfoQ. Visit InfoQ

DORA can help to drive sustainable change, depending on how it is used by teams and the way it is supported in a company. According to Carlo Beschi, getting good data for the DORA keys can be challenging. Teams can use DORA reports for continuous improvement by analysing the data and taking actions.

Carlo Beschi spoke about using DORA for sustainable improvement at Agile Cambridge.

Doing DORA surveys in your company can help you reflect on how you are doing software delivery and operation as Beschi explained in Experiences from Doing DORA Surveys Internally in Software Companies. The way you design and run the surveys, and how you analyze the results, largely impact the benefits that you can get out of them.

Treatwell’s first DORA implementation in 2020 focused on getting DORA metrics from the tools. They set up a team that sits between their Platform Engineering team and their “delivery teams” (aka product teams, aka stream-aligned teams), called CDA – Continuous Delivery Acceleration team. Half of their time is invested in making other developers’ and teams’ lives better, and the other half goes into getting DORA metrics from the tools:

We get halfway there, as we manage to get deployment frequency and lead time for changes for almost all of our services running in production, and when the team starts to dig into “change failure rate”, Covid kicks in and the company is sold.

DORA can help to drive sustainable change, but it depends on the people who lead and contribute to it, and how they approach it, as Beschi learned. DORA is just a tool, a framework, that you can use to:

  • Lightly assess your teams and organisation
  • Play back the results, inspire reflection and action
  • Check again a few months / one year later, maybe with the same assessment, to see if / how much “the needle has moved”

Beschi mentioned that teams use the DORA reports as part of their continuous improvement. The debrief about the report is not too different from a team retrospective, one that brings in this perspective and information, and from which the team defines a set of actions, that are then listed, prioritised, and executed.

He has seen benefits from using DORA in terms of aligning people on “this is what building and running good software nowadays looks like”, and “this is the way the best in the industry work, and a standard we aim for”. Beschi suggested focusing the conversation on the capabilities, much more than on the DORA measures:

I’ve had some good conversations, in small groups and teams, starting from the DORA definition of a capability. The sense of “industry standard” helped move away from “I think this” and “you think that”.

Beschi mentioned the advice and recommendations from the DORA community on “let the teams decide, let the teams pick, let the teams define their own ambition and pace, in terms of improvement”. This helps in keeping the change sustainable, he stated.

When it comes to meeting the expectations of senior stakeholders: when your CTO is the sponsor of a DORA initiative, there might be “pushback” on teams making decisions, and expectations regarding the “return on investment” of doing the survey, aiming to have more things change, quicker, Beschi added.

A proper implementation of DORA is far from trivial, Beschi argued. The most effective ones rely on a combination of data gathered automatically from your system alongside qualitative data gathered by surveying (in a scientific way) your developers. Getting good data quickly from the systems is easier said than done.

When it comes to getting data from your systems for the four DORA keys, while there has been some good progress in the tooling available (both open and commercial) it still requires effort to integrate any of them in your own ecosystem. The quality of your data is critical.

Start-ups and scale-ups are not necessarily very disciplined when it comes to consistent usage of their incident management processes, and this significantly impacts the accuracy of your “change failure rate” and “response time” measures, Beschi mentioned.
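
As a rough illustration of what getting this data from your systems involves (not from Beschi’s talk), the Python sketch below derives two of the four keys, deployment frequency and lead time for changes, from deployment records; the record shape is hypothetical, and real pipelines need far more care about data quality.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment records exported from a CI/CD system.
deployments = [
    {"deployed_at": datetime(2024, 11, 4, 10), "commit_at": datetime(2024, 11, 3, 16)},
    {"deployed_at": datetime(2024, 11, 5, 9),  "commit_at": datetime(2024, 11, 4, 11)},
    {"deployed_at": datetime(2024, 11, 12, 14), "commit_at": datetime(2024, 11, 11, 9)},
]

def deployment_frequency(records: list[dict], days: int = 30) -> float:
    """Average deployments per day over the window ending at the last deployment."""
    end = max(r["deployed_at"] for r in records)
    start = end - timedelta(days=days)
    in_window = [r for r in records if r["deployed_at"] >= start]
    return len(in_window) / days

def lead_time_for_changes(records: list[dict]) -> timedelta:
    """Median time from commit to deployment."""
    return median([r["deployed_at"] - r["commit_at"] for r in records])

print(f"Deployment frequency: {deployment_frequency(deployments):.2f}/day")
print(f"Lead time for changes: {lead_time_for_changes(deployments)}")
```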

Beschi mentioned several resources for companies that are interested in using DORA:

  • The DORA website, where you can self-serve all DORA key assets and find the State of DevOps reports
  • The DORA community has a mailing list and bi-weekly video calls
  • The Accelerate book

In the community you will find a group of passionate and experienced practitioners, very open, sharing their stories “from the trenches” and very willing to onboard others, Beschi concluded.



.NET Aspire 9.0 Now Generally Available: Enhanced AWS & Azure Integration and More Improvements

MMS Founder
MMS Robert Krzaczynski

Article originally posted on InfoQ. Visit InfoQ

.NET Aspire 9.0 is now generally available, following the earlier release of version 9.0 Release Candidate 1 (RC1). This release brings several features aimed at improving cloud-native application development on both AWS and Azure. It supports .NET 8 (LTS) and .NET 9 (STS).

A key update in Aspire 9.0 is the integration of AWS CDK, enabling developers to define and manage AWS resources such as DynamoDB tables, S3 buckets, and Cognito user pools directly within their Aspire projects. This integration simplifies the process of provisioning cloud resources by embedding infrastructure as code into the same environment used for developing the application itself. These resources are automatically deployed to an AWS account, and the references are included seamlessly within the application.

Azure integration has been upgraded in Aspire 9.0. It now offers preview support for Azure Functions, making it easier for developers to build serverless applications. Additionally, there are more configuration options for Azure Container Apps, giving developers better control over their cloud resources. Aspire 9.0 also introduces Microsoft Entra ID for authentication in Azure PostgreSQL and Azure Redis, boosting security and simplifying identity management.

In addition to cloud integrations, Aspire 9.0 introduces a self-contained SDK that eliminates the need for additional .NET workloads during project setup. This change addresses the issues faced by developers in previous versions, where managing different .NET versions could lead to conflicts or versioning problems. 

Aspire Dashboard also receives several improvements in this release. It is now fully mobile-responsive, allowing users to manage their resources on various devices. Features like starting, stopping, and restarting individual resources are now available, giving developers finer control over their applications without restarting the entire environment. The dashboard provides better insights into the health of resources, including improved health check functionality that helps monitor application stability.

Furthermore, telemetry and monitoring have been enhanced with expanded filtering options and multi-instance tracking, enabling better debugging in complex application environments. The new support for OpenTelemetry Protocol also allows developers to collect both client-side and server-side telemetry data for more comprehensive performance monitoring.

Lastly, resource orchestration has been improved with new methods like WaitFor and WaitForCompletion, which help manage resource dependencies by ensuring that services are fully initialized before dependent services are started. This is useful for applications with intricate dependencies, ensuring smoother deployments and more reliable application performance.

Community feedback highlights how much Aspire’s development experience has been appreciated. One Reddit user noted:

It is super convenient, and I am a big fan of Aspire and how far it has come in such a short time.

Full release details and upgrade instructions are available in the .NET Aspire documentation.
