Presentation: Pump It Up! Actually Gaining Benefit from Cloud Native Microservices

MMS Founder
MMS Sam Newman

Article originally posted on InfoQ. Visit InfoQ

Transcript

Newman: Welcome to this virtual presentation of me talking about the pitfalls associated with getting the most out of microservices, and the cloud, or cloud native microservices, as we now call them. I’ve written books about microservices. You probably don’t care about that, because you’re all here to hear about interesting pitfalls, and also some tips about how to avoid the nasty problems associated with adopting microservices on the cloud. Also, some guidance about the good things that you can do to get the most out of what should be a fantastic combination. This combination of cloud native and microservices. Microservices which offer us so many possibilities. Many of you are making use of microservices to enable more autonomous teams able to operate without too much coordination and contention. Teams that own their own resources can make decisions about when they want to release a new version of their software. Reducing the release coordination in this way helps you go faster. It can improve your sense of autonomy, which in turn has benefits in terms of motivation, and even employee morale. This allows us to ship software and get it into the hands of the users more quickly, and get fast feedback about whether or not these things are working. These models nowadays are shifting towards a world where we’re looking for our teams to have full ownership of their IT assets: you build it, you run it, you ship it. It’s amazing how developers that have to also support their own software end up creating software that tends to page the support team much less frequently.

Why Have Cloud Native and Microservices Not Delivered?

Cloud native and microservices should have been a wonderful combination. It should have been brilliant. It should have been like Ant and Dec, or Laurel and Hardy, or Crypto and Bros. Instead, it’s ended up being much more like Boris and Brexit. Something that’s been really well marketed, but unfortunately has failed to deliver on some of its promises, has ended up costing us a lot more money than we thought, and left many of us feeling quite sick. Why have cloud native and microservices often not delivered on some of the promises that we might have for them? Many of these big enterprise initiatives you may have been involved with around these transformations have somehow failed to deliver. Part of this, I think, is about confusion. Let’s look at the CNCF landscape; it’s easy to poke a degree of fun at this. This is a sign of success, really. This is all the different types of tools and products in the different sectors that the CNCF manages. This is a sign of success, but it’s a bewildering sign of success. How do you navigate this to pick out which pieces of the big CNCF toolkit you could use? It’s ok, because help is at hand in the form of an interactive website where you can navigate this space and try to find the thing you want. Although whenever I read this disclaimer, I have some other view in my head. Namely, this is going to cost me a lot of money, isn’t it? The CNCF landscape also talks about money quite a bit. It talks about how much money has been pumped into this sector. If we just look at the funding of $26 billion, where does all that money go? Obviously, a lot of that money goes into creating excellent and awesome products. A lot of that money also goes into marketing, which leads to dilution of terms, misunderstanding of core concepts, and increased confusion.

Undifferentiated Heavy Lifting – AWS (2006)

Back in 2006, AWS launched in a way which was interesting. At the time, we didn’t really realize what was going to come, because at this point, we were dealing with our own physical infrastructure, we were having to rack and cable things ourselves. Along came AWS and said, no, we can do that for you. You can just rent out some machines. When you actually spoke to the people behind AWS about why they were doing this and why they thought it was beneficial, this term, undifferentiated heavy lifting, would come back again and again. This describes how Amazon thought about their world internally. We want autonomous teams to be able to focus on the delivery of their product, and we don’t want them to have to deal with busy work. The heavy lifting is still heavy work, but your ability to do that work, to rack up servers, doesn’t differentiate what you’re doing compared to anybody else. Can we create something whereby you can offload a lot of that heavy lifting to other people who can do it better than you can? If you think about what AWS gave us, it gave us, hopefully, these capabilities, in the same way that Google Cloud does, or Azure, or a whole suite of other public cloud providers.

APIs

Really, when you distill down the important part of a lot of this, it was around the APIs they gave us. A few years ago, I went up to the north of England. It’s not necessarily quite as cold and inhospitable as this, but it definitely has better beer than the south. I was chatting to some developers at a company up there. I was doing a bit of a research visit, which I do every now and then. I was talking to them about what improvements they wanted to see happening in terms of the development environment that would make them more productive. This was just, I’m interested in what developers want. Also, I was going to feed some information back to CIOs, so they could get a sense of key things they could do to improve. These developers said, what we want is we’d love access to the cloud, we’d like to get onto the cloud, so that we don’t have to keep sending tickets. I went and spoke to the CIO, and the CIO says, we already have the cloud. I went back to the developers, I said, developers, you already have the cloud. They were confused by this.

I dug a bit deeper, and it turned out that this company had embraced the cloud in a rather interesting manner. What happens is, if a developer wanted to access a cloud resource, or spin up a VM, what they would actually do would be to raise a ticket using Jira. That Jira ticket would be picked up by someone in a sysadmin team, who would then go and click some buttons on the AWS console, and then send you back the details of your VM. Although the sysadmin side of this organization had adopted the public cloud, for some value of the word adopted, from the point of view of the developers, nothing had changed. This is an odd thing to do. Using public cloud services, which give us such great capabilities around self-service, in this fashion is really bizarre. It’s like putting a wheel clamp on a hypercar. You could do it. Should you really do it?

Why Do Private Clouds Fail?

I’m a great fan of cherry picking statistics and surveys that confirm my own biases. I was very glad to find this survey done by Gartner at their data center conference several years ago. This is from people who went to a data center conference run by Gartner. That’s already an interesting subsection of the overall IT industry. They were asking, what is going wrong with your private clouds? Are your private cloud installs working, or are they not working? They found that only 5% of people thought it was going quite well. Most people found significant issues with how they were implementing private cloud. Really interestingly, the biggest issue amongst people who went to a data center conference run by Gartner was accepting that they’d been focusing on the wrong things. By implementing a private cloud, they had been focusing on cost savings, rather than focusing on agility. Also, as part of that, not changing anything around how they operate, or how the funding models work. Thinking back to that previous example, this seemed to tie up. You adopted the public cloud, but didn’t really want to change any of the behaviors around it.

Pitfall 1: Not Enabling Self-Service

This leads to our very first pitfall around this whole story, and that’s not enabling self-service. If you want to create an organization, where teams have more autonomy, have the ability to think and execute faster, you need to give them the ability to self-service provision things like the infrastructure they’re going to run on. You need to empower teams to make the decisions and to get things done. AWS’s killer feature wasn’t per hour rented managed virtual machines, although they’re pretty good. It was this concept of self-service. A democratized access via an API. It wasn’t about rental, it was about empowerment. You only saw that benefit if you really changed your operating models.
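To make that self-service idea concrete, here is a minimal sketch (not from the talk) of a team provisioning its own VM through the provider’s API, using the AWS SDK for Go v2. The AMI ID is a placeholder, and in practice a team would more likely drive this through Terraform, CloudFormation, or an internal platform API rather than calling it directly.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()

	// Credentials and region come from the environment or ~/.aws/config.
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}

	client := ec2.NewFromConfig(cfg)

	// Self-service: the team provisions its own VM through the API,
	// instead of raising a ticket for someone else to click the console.
	out, err := client.RunInstances(ctx, &ec2.RunInstancesInput{
		ImageId:      aws.String("ami-0123456789abcdef0"), // placeholder AMI for illustration
		InstanceType: types.InstanceTypeT3Micro,
		MinCount:     aws.Int32(1),
		MaxCount:     aws.Int32(1),
	})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("launched instance:", aws.ToString(out.Instances[0].InstanceId))
}

The point is not the specific call, but that the whole provisioning loop happens in the team’s own hands, in seconds, through an API they can script and audit.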

Top Tip: Trust Your People

A lot of the reasons I think that people don’t do this and don’t allow for self-service really come down to a simple issue: you need to trust your people. This is difficult, because many of us have come from a more traditional IT structure, where you have very siloed organizations with very narrowly defined roles. You had to live in your box: the QA did the QA things, the DBA did the DBA things, the sysadmins did the sysadmin things. Then there’d be some yawning chasm of distrust over to the business who we were creating the software for. In this world of siloed roles and responsibilities, the idea of giving people more power is a bit odd. It doesn’t really fit. We’re moving away from this world. We’re breaking down these silos. We’re moving to more poly-skilled teams that can have full ownership over the software they deliver.

Stream Aligned Teams

I urge all of you to go read the book, “Team Topologies,” which gives us some terms and vocabulary around these new types of organizations in the context of IT delivery. In the “Team Topologies” book, they talk about the idea of stream aligned teams. Instead of thinking about short-lived project teams, you create a long-lived, product-oriented team that owns part of your domain. They are focused on delivering a valuable stream of work. Their work cuts across all those traditional layers of IT delivery. They’re not thinking about data in isolation from functionality, or about a separate UI from the backend; they’re thinking holistically about the end-to-end delivery of functionality to the users of your software. This long-lived view is really important, because you get domain expertise in terms of what you own. This is why microservices can be so valuable, because you can say, these microservices are owned by these different streams. That strong sense of ownership is also what gives you the ability to make decisions and execute without having to constantly coordinate with loads of other teams.

The issue, obviously, with this model is that there are lots of other things that need to happen. There’s other work that needs to be done to help these teams do their jobs. Traditionally, those kinds of roles would be done by separate siloed parts of the organization, and there was often a functional handoff. I need a security review. I need a separate Ops team to provision my test environment or deploy into production. That work still needs to be done, and we can’t expect these teams to do all that work as well. Where does all that extra work go?

Enabling Teams

This is where the “Team Topologies” authors introduced the idea of enabling teams. You’ve got your stream aligned teams, and they’re your main path to focusing on delivery of functionality. What we need to do is to create teams that are going to support these stream aligned teams in doing their jobs, so we might have a cross-cutting architecture function, maybe frontend design, maybe security, or maybe the platform. These enabling teams exist to help the stream aligned teams do their job. We want to do whatever we can to remove impediments to help them get things done. This isn’t about creating barriers or silos. It’s about enablement. At this point, many of you who have got microservices are sitting there going, but that’s ok, because I’ve got a platform.

Amazon Web Services (2009)

Let’s go back in time a bit further. Back in 2009, I found myself accidentally involved in helping create the first ever training courses for AWS. At this point, AWS itself was just saying, here’s our stuff, use it. They weren’t putting any effort into helping you use these tools well. I remember myself and Nick Hines, a colleague at the time, having a chat with them. The AWS view was, “We’re like a utility. We just sell electricity.” Nick turned to them and said, “Yes, that’s all well and good, but your customers are constantly getting electrocuted.” Without good guidance about how to use these products well, you can end up making mistakes, and you may then end up blaming the tool itself. It’s amazing that still in 2022, I meet people that run entirely out of one availability zone, for example. To be fair to AWS, they spotted this gap and have plugged it very well. If you go along to the training and certification section that they run, and this is the same story for GCP, and Azure as well, you’ll see a massive ecosystem of people able to give you training on how to use these tools well. Because all of these vendors recognize that without that training and guidance, you won’t get the best out of them.

Pitfall 2: Not Helping People Use the Tools Well

This is a lesson that we need to learn for our own tools that we provide to our developers. Having cool tools is not enough. You’ve got to help people use them. This is another common pitfall I see, people bring in all these tools and technology, here is some Kubernetes, and here is some Prometheus. Unless you’re doing some work to help people get the most out of those tools, you’re not really enabling them. If you’re somebody working on the platform team, your job isn’t just to build the platform. It’s actually to create something that enables other people to do their job. Are you spending time working with the teams using your platform? Are you listening to their needs? Are you giving them training when they need it? Are you actually spending time embedded with them on day-to-day delivery to understand where the friction points are? Because if you’re not doing these things, the platform can end up just becoming another silo.

Top Tip: Treat Your Microservices Platform like a Product

This leads me to my next tip, you should treat your platform like a product. Any good product that you create is going to involve outreach, is going to involve chatting to your customers, understanding what they need and what they want. This is the same thing with a platform. Inside an organization, talk to your developers. Talk to your testers. Talk to your security people. What is it that you need to do in whatever platform you deliver to help them do their job? This is all about creating a good developer experience. Although maybe the term developer experience should be delivery experience, because obviously, there are many more stakeholders than just the developers. Yes, that’s right, developers, there are people other than you out there. Think about that delivery experience if you’re the person helping drive the development of the platform team. I actually think this is a great place to have a proper full time product owner. Have a person who has got product management experience or product owner experience. Have them head that team up, and drive your roadmap based on the needs of your customers.

If you are the person providing that platform into your organization, it is your job to navigate this mess. Again, I don’t mean to beat up the CNCF at all. Absolutely not. This is a sign of success, but it is bewildering, trying to navigate this world. If we go and look at the public cloud vendors, we also see a huge array of new products and services being released all the time. AWS is in many ways the worst culprit. Again, a sign of success. It’s interesting that the easiest way to keep up with all the different things that AWS are launching is actually to go to a third party site. This particular site tracks the various different service offerings that are out there. I screen-capped this probably about three months ago. The number of products offered by AWS is constantly growing. As of 2021, they had as many as 285 individual services. That’s 285 individual services, many of which overlap with each other. Each of those services can have a vast array of different features. How are you supposed to navigate this landscape?

Top Tip: It’s Ok To Provide a Curated Experience

There is this idea that if I’m going to allow you to use a public cloud provider, like Azure, GCP, or AWS, and I want to let you do that in a self-service fashion, then I should just throw you in at the deep end and say, just go for it. No, you should train and support people in using those tools well. It is also true that it’s ok to provide a curated experience. If you’re somebody working as one of the experts in that platform team who’s providing these services to the people building your microservices, you are the person who should be responsible for helping them navigate this landscape and curating the right platform for them. If you’re going to build your own private cloud, which is something I don’t tend to advise you to do, you might start with Kubernetes as your core. Then you’re going to have to pick the right pieces to create the right platform for your team, for your organization.

Governance

Another potential issue around the platform, though, comes in the shape of governance. Governance often gets a bad name. Partly, I feel, not because of what governance is, but because of how governance is often implemented. We see governance as a barrier. On the face of it, governance is an entirely acceptable and sensible thing that we should be doing. Governance is, simply put, deciding on how things should be done, and making sure that they are done. I think this is totally fine. In many small teams, you’re doing governance without ever realizing it. In a large scale organization, you have different governance that needs to take place. This is all governance is. How should things be done, and making sure that they are done. The problem comes when people who are more familiar with centralized command and control models are dragged kicking and screaming into the world of independent autonomous teams, and they’re like, how do I use my skills to operate in this environment? Then they see the platform, and they go, I can use the platform to decide what people can do.

Pitfall 3: Trying To Implement Governance through Tooling

This leads to another pitfall. Because if you say, what we’re going to do is basically use the platform as a way of enforcing what people do. We’re going to limit what you do. We’re going to stop you doing things. We’re going to do that through the platform, through the tools that we give you. The issue with that mindset is that the moment you say that, the next thing follows, which is you say, if you use the platform, we know you’re doing the right thing, therefore, you must use the platform. Because if we enforce that you use the platform, we know you’ll be doing the right thing, and my job is done. This is another real problem with people who are adopting platforms. Enforcing the platform and its tooling in this way really undermines the whole mindset around this. If you force people to use a platform, it’s not about enablement, it’s about control. Here’s the reality. If you make it hard for people to do their jobs, people are either going to bypass your controls into the world of shadow IT, or they’re going to leave. There are also the other people that will just put up with it. Often, those people have already been beaten down by other problems and issues in the organization.

I remember Werner Vogels telling a story many years ago. He was going into these Fortune 500 companies, in the early days of AWS, and they were trying to encourage these big U.S. companies to take AWS seriously. The CIOs would say things like, we’re never going to use a bookseller to run our compute. He said, “You already are.” He pulled out a list of all the people that worked at that company that were already being billed by AWS, paying for things on their corporate credit cards. People used AWS as a way of getting their job done without having to go through the traditional corporate controls. A lot of people are horrified by this idea. This is what we now call shadow IT. IT that isn’t centrally provisioned and isn’t centrally managed. Shadow IT is massive now and is growing. People want to get their jobs done, so they find a way to get their job done. It’s almost like a logical manifestation of the world of SaaS. SaaS has made it so easy to provision software services that people are cutting out the middlemen.

In an environment like this, you can try to say, you’ve got to use the platform, and use the platform as a way of stopping people doing stuff. Is it really going to stop people anyway? Because here’s the thing, those people that actually bypass your controls to get the job done, they’re motivated. They’ve gone out of their way to find a better solution to solve the problems that they’ve got, because they want to do the job. They’re motivated. These are the people you probably want to promote, or at the very least listen to and help. You don’t want to sideline them or make it so difficult for them to do their job that they become so demotivated they just go somewhere else. Governance is about talking about what should be done and making sure it is done. Part of that is being clear and communicating as to why you’re doing things in a certain way. If you explain why you want things done in a certain way, it’s going to be much easier for people to make the right decisions. It’s also completely appropriate for you to make it as easy as possible to do the right thing.

Top Tip: Provide a Paved Road

I love this metaphor of the paved road. Creating an experience for your delivery teams that makes doing the right thing as easy as possible. We’ve laid a path out in front of you; if you just do these things, everything’s going to be absolutely fine and rosy. If you realize that the path isn’t quite right for you, you can head off into the woods yourself, but you’re going to have to do a bunch of work yourself. You’re still obligated to follow what should be done. You’re still accountable for the work you do, but you’re going to be a bit more on your own. On the paved road, on the other hand, it’s all going to be gravy. In some niche situations you can justify going off into the woods, and that’s ok.

Provide that paved road. Create an experience that makes it easy for people to do the right thing. If you make it easy for people to do the right thing and explain why those things are done in a certain way, you’ll also find that when people do need to go off that beaten path, they can be much more aware of what they’re doing and how that fits in. When you identify that people aren’t using your paved road, that becomes feedback back into your platform. What was it about what we gave them that didn’t help them do their job? Why did they go to this third party service? Is that actually something we’re not bothered about, and actually, their situation is niche enough that we don’t have to worry? Or does that speak to a gap in what we’re offering our customers? Our customers in this case being our fellow developers, and testers, and sysadmins, and everything else.

There are some great examples of companies doing things like this, creating not only a paved road, but also something that doubles as an educational tool. Sarah Wells has spoken before about the use of BizOps inside the “Financial Times,” which on the face of it looks almost like a service registry, but it goes further than that. It talks about certain levels of criticality, and what microservices and other types of software products need to deliver to reach those levels of criticality. If you want to be a platinum service you have to do this and this. A lot of those checks are automated. There are links you can go to, to find out how to solve those problems. At a glance, you can see what you should be doing, what you are doing, and get information about how to address those discrepancies. This isn’t the big stick. This is the paved road with guidance, with a map.

Top Tip: Make the Platform Optional

This leads us on to maybe one of the more controversial tips that I’m going to give you. This is actually a piece of advice that comes straight from the “Team Topologies” book. Many of you who run a platform might be quite worried by this idea, but it’s this, you should make the platform you give to your microservice teams, optional. This is scary. Why would I make it optional? Partly, because of reasons I’ve already talked about. We don’t want to just put arbitrary barriers in front of people. If we really want to create independent, autonomous teams that are focused on delivering their functionality, they might have real needs that aren’t met by the platform. If we force them to use the platform, and all aspects of the platform, we’re actually effectively undermining their ability to make decisions that are best for them. That’s part of it. Absolutely.

There’s another thing which is a little bit more insidious. If you make the platform optional, it means that the owners of the platform are going to be focused on ease of use. If everyone has to use a platform, and that’s mandated, it’s very easy for the platform team to stop caring about what it’s like to use, about that delivery experience. Whereas if the platform is optional, then one of the key things that’s going to drive how successful your platform is viewed as being is how many people are using it. By making it optional, you will go out of your way to make sure that it’s easy to use and easy to adopt. It triggers you into doing that outreach, aside from also, of course, enabling this self-service where it’s warranted.

Summary (Pitfalls)

The big pitfall we started off with was giving people these awesome tools, these awesome platforms, without enabling self-service. The second, once you’ve given people tools that do allow for self-service, is not helping people use those tools well. Then, finally, we talked about the challenges of trying to implement governance, or enforcement around governance, through the tooling, and all the pitfalls associated with that.

Summary (Top Tips)

Firstly, and really importantly, you’ve got to learn to trust your people. This might be difficult, but fundamentally, this is where a lot of the journey starts from: trust your people. I didn’t say verify. Verification is also useful. You should start with trusting. You need to treat your platform like a product. You need to treat the people that use that product like users. Understand what they want, do the outreach. As part of that platform, it is ok to deliver a curated experience. Make it easy for people to navigate their world. This lines up really nicely with the idea of the paved road. The paved road helps deliver the things that people need most of the time. Finally, and maybe controversially, make the platform optional. Making the platform optional signals that it is ok to use alternative products where warranted. It also makes sure the team that builds the platform is going to be focused on making that platform as easy to use as possible.

If we need to distill all of it down, when I’ve looked back at these different companies I’ve worked at and people I’ve chatted to, it still feels that so many people are using the cloud without really using it. Many of us have bought the hypercar and stuck the wheel clamp on it. Taking that wheel clamp off all starts with trusting your people.

Resources

There’s more information about what I do over at my website, https://samnewman.io, including information about my latest book, the second edition of, “Building Microservices.”



Article: The Most Common Developer Challenges That Prevent a Change Mindset—and How to Tackle Them

MMS Founder
MMS Shree Mijarguttu

Article originally posted on InfoQ. Visit InfoQ

Key Takeaways

  • Encourage a culture of being hungry to learn: When guiding an engineering team, you’ve got to encourage constant growth and learning, like exploring different tech stacks.
  • Larger companies and corporations may not expect developers to master several tech stacks—but startups do. What can we learn from those environments?
  • Give yourself and your team a window of time to work out certain problems, then move on. Developers work for long periods of time on difficult tasks with slow progress.
  • Believe in an approach where everyone in the organization is responsible for developing and maintaining a culture of opportunity.
  • Build a business-impact-first mindset, which means treating technology as an enabler, not a driving force.

Some 83% of developers say they are experiencing burnout, and that’s because many aspects of their jobs are preventing a flexible mindset.

The truth is, being a software developer is synonymous with the continuous cycle of tight deadlines. Programming is a creative profession, yet much of our time is spent working on difficult tasks that slow down productivity: It takes away the joy from day-to-day projects, resulting in frustration.

There’s often pressure to solve last-minute defects and stick to set timeframes, but another stress is the ever-growing digital world.

New technologies entering the space are becoming essential for software developers overnight. So, to succeed in this fast-paced industry, they better stay on top of the latest tools.

This is where chief technology officers (CTOs) or senior leaders overseeing developer teams can help shift old patterns, ignite a growth mentality, and teach developers to be flexible and agile. In other words, a change mindset.

This buzzword is used across many industries and spoken about in publications like HBR and Forbes, among others.

The idea is that we have to stop waiting to be forced into change, and instead be ahead of the game, when we live in a rapidly evolving world. In his book Think Again, Adam Grant writes:

“… We need to spend as much time rethinking as we do thinking.”

So, here’s what obstacles software engineering teams face and what to do to cultivate a change mentality.

The common pitfalls

A phrase that particularly rings true with software engineering when talking about mindset is:

“Your team is only as strong as its weakest member.”

One negative team member slacking can throw a spanner in the works, demoralizing and demotivating the rest of the team. Watch out for these team members who make excuses for not getting work done, don’t accept their inability, and make others doubt their own work – it could quickly lower productivity.

It’s also common knowledge that the more automation a company has, the more likely it will free up developers’ time to focus on digital innovation that benefits customers. This is one of the ideas behind BOS Framework, a product that I am proud to be one of the engineers of.

Still, today, developers spend forever doing manual tasks, especially when troubleshooting. There are multiple tools out there that enterprises are slow to embrace, like code editors, defect trackers, and even Kubernetes, an open-source system helping automate applications’ deployment and operations.

And another factor is unrealistic deadlines, which impact software engineers’ work-life balance, and their motivation can plummet as a result. The best investment for developers is to invest in their own development, especially in this ever-evolving technical landscape. But when working overtime and bogged down with manual tasks, it’s no wonder developers lack the motivation to take up new skills.

All these aspects impact dev teams’ change mindset. So what can team leads do about it?

Encourage a culture of being hungry to learn

Satya Nadella, CEO at Microsoft, initiated the “tech giant’s cultural refresh with a new emphasis on continuous learning” to change employees’ behavioral habits. He called it moving from being a group of “know-it-alls” to “learn-it-alls.”

When fresh talent appears every second, those “oldies but goodies” in the industry with 10+ years of developer experience may get left behind if they don’t up their game too.

Therefore, when guiding an engineering team, you’ve got to encourage constant growth and build learning into the daily routine. Start with quizzes, guides, and games and end with offering developers opportunities to work on projects with different tech stacks – a forcing mechanism to expand their knowledge.

For example, if an employee says they have become an AWS-certified architect, that’s wonderful. At our company, we suggest they also look at other cloud platforms to be more grounded in their architectural ability. This, of course, needs to be a mutual agreement between the two parties and not forced on the employee.

Ultimately, if your team focuses too much on certain tools without looking at other options, you could quickly become outdated and stop your company from innovating. Companies that build a culture of hunger for learning ensure that their developers are always ahead of the curve, finding the best solutions for users and stakeholders.

Learn from startup environments

Larger companies and corporations may not expect developers to master several tech stacks simultaneously – but startups do. So what can we learn from those environments?

Steve Jobs was already ahead of the game by organizing Apple like a startup for improved collaboration and teamwork. He referred to Apple as the “biggest startup on the planet,” where you could trust colleagues to come through on their side of the bargain without watching them all the time or having corporate committees.

The biggest advantage of working in a startup or smaller company is that you can wear many hats and learn about every aspect of the company. I joined my current job right out of college as a Software Engineering Intern and have been with them since. Over the past 10 years, I worked with many technologies, from web and mobile development to databases, IoT applications, and data science projects. From this experience, I quickly built a base of knowledge for various technology stacks and noticed the nuances of each one.

Developers often get attached to specific tools and want to apply them to everything. But if you just have a hammer, you treat everything like a nail. Working at a startup is an exciting opportunity for developers to build from scratch, pitch crazy ideas, and be part of a team attempting to solve complex problems.

It’s not like being a regular employee; you are part of something bigger and crafting a company’s future. Developers in these environments can often stop being lone rangers and learn to communicate effectively across various departments.

Therefore, when building a change mindset in a dev team, look to startups for inspiration and view startup experience on a CV as a plus. And if you already are a startup, learning is limitless – so look to competitors.  

Work on problems for a certain time

Give yourself and your team a window of time to work out certain problems, then move on to something new. Developers’ change mindset is often inhibited as they work for hours and hours on difficult tasks or get stuck on bugs with slow progress.

“It’s part of the job,” they’d say. However, I have a clear rule: If your team spends more than four hours on a problem, get them to take a walk. Then, they can come back to it with a mindset of change. If they spend another four hours on something with little traction, it’s time for them to reach out to their peers; somebody is bound to have an answer. If developers learn to open themselves up to other team members’ advice, it will stimulate – yet again – a change in mindset.

This is also why developers and engineers should never fall in love with only one programming language or technology; they’ve got to keep moving. The sky is the limit if they always investigate alternative solutions – that’s the basic expectation of engineers.

Believe in an approach where everyone is responsible

Everyone in a dev team should believe in the mission and vision of the organization where they work. But they should also be equally concerned about building and maintaining a culture of opportunity and transformation. This is a shared responsibility that relies on every single developer’s contribution.

The unofficial agreement is that companies are responsible for giving employees opportunities on a plate. I’ll always be thankful to my seniors for pushing me to test myself and try new things, like training airmen in software development and data science at the Department of Defense. There are always appropriate moments to force incredible opportunities onto team members.

However, a top-down approach to building culture doesn’t always work: I believe in creating an organizational culture where everyone shares accountability. Culture should be adaptable, not just established by leaders. And employees must also show they are willing to explore and be adventurous; this is the difference between a good employee and an outstanding one.

Before going remote, every Friday at the workplace, we would pick a team member to choose a topic – anything in the field of software development, technology, cloud, and CI/CD – and educate the entire team. It put developers out of their comfort zone and helped other team members learn something new every week.

Tech team leaders shouldn’t just expect developers to follow a culture set in stone. Instead, they should allocate resources to ensure employees understand it, vet it, uphold its principles, and add to it. That way, developers are encouraged to think for themselves and criticize, boosting that much-needed change mindset.

Build a business-impact-first mindset

Change in an engineering team also comes from cultivating a business-impact-first frame of mind – one of the main pillars of success at BOS Framework and the philosophy on which I was trained. In other words, it means an ingrained culture where developers speak both business and engineering languages, viewing technology as a tool to achieve business outcomes.

This is because engineers with an entrepreneurial mindset will love to get it right while getting it done. Obviously, a stark over-exaggeration, but engineers are mostly perfectionists, while entrepreneurs don’t have time to overthink, are better at delegating, and learn things just in time. Engineering teams must not forget that their projects should have a commercial end.

Engineering leads must build a business case for every project that allows non-technical and business stakeholders to weigh in and inform decisions. Meanwhile, developers must start communicating with business stakeholders, peers, and other departments in corporate jargon to deliver tech-enabled business impact. Bridging the gap between product stakeholders and development teams helped me progress in my career and adapt to new roles more easily.

Developers can get stuck in an endless loop where they can’t remember the last time they learned new things at their job or impacted company culture. That’s why building a change-ready mindset in your company is essential to help break career plateaus: It puts team members at the core of processes, says no to staying with one tech stack for too long, and encourages constant progression.



TriggerMesh Introduces an Open-Source AWS EventBridge Alternative with Project Shaker

MMS Founder
MMS Steef-Jan Wiggers

Article originally posted on InfoQ. Visit InfoQ

Recently TriggerMesh, a cloud-native integration platform provider, announced Shaker, a new open-source AWS EventBridge alternative project that captures, transforms, and delivers events from many out-of-the-box and custom event sources in a unified manner.

The Shaker project provides a unified way to work with events using the CloudEvents specification. It can be used with event sources and targets in AWS, Azure and GCP, Kafka, or HTTP webhooks. In addition, it includes a transformation engine based on a simple DSL, which can be controlled via code-based transformations if necessary.
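Because Shaker standardizes on CloudEvents, anything that can emit a CloudEvent over HTTP can act as a custom source. As a rough, illustrative sketch (not taken from the TriggerMesh documentation), the following uses the CloudEvents Go SDK to send an event to an HTTP event sink such as a locally running broker; the endpoint URL and event attributes are invented for the example.

package main

import (
	"context"
	"log"

	cloudevents "github.com/cloudevents/sdk-go/v2"
)

func main() {
	// Build a CloudEvent; the source, type, and payload are illustrative only.
	e := cloudevents.NewEvent()
	e.SetID("0001")
	e.SetSource("example/orders")
	e.SetType("com.example.order.created")
	if err := e.SetData(cloudevents.ApplicationJSON, map[string]string{"orderId": "42"}); err != nil {
		log.Fatal(err)
	}

	c, err := cloudevents.NewClientHTTP()
	if err != nil {
		log.Fatal(err)
	}

	// Assumes an HTTP event sink (for example, a locally running broker) on port 8080.
	ctx := cloudevents.ContextWithTarget(context.Background(), "http://localhost:8080")
	if result := c.Send(ctx, e); cloudevents.IsUndelivered(result) {
		log.Fatalf("failed to send event: %v", result)
	}
}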

With TriggerMesh’s command line interface, tmctl, developers can create, configure and run TriggerMesh on a machine with Docker. Next, they optionally leverage TriggerMesh’s Bumblebee transformation component and route to an external target.

 
Source: https://docs.triggermesh.io/get-started/quickstart/

Jonathan Michaux, a Product Manager at TriggerMesh, explains in a blog post:

TriggerMesh is designed to be cloud-agnostic, in fact, it works brilliantly to connect different clouds as well as on-premises together. It can run anywhere because all the functionality is provided as containers that can be declaratively configured. tmctl makes it easy to run these containers on Docker, and the TriggerMesh CRDs and controllers will run them natively on any Kubernetes distribution. This means Shaker is easy to embed into existing projects, such as internal developer platforms or commercial SaaS software, and can be operated in the same way as any other containerized workloads.

With Shaker, the company is aiming at DevOps engineers, SREs, and platform engineers looking for a one-stop shop to produce and consume events to build real-time applications. It is similar in capabilities to AWS EventBridge; however, it is open-source and can run anywhere that supports Docker or Kubernetes. In addition, it is designed to capture events from all cloud providers and SaaS or custom applications.

TriggerMesh co-founder and CEO Sebastien Goasguen told InfoQ:

As AWS announced AWS EventBridge Pipes for point-to-point integration, TriggerMesh Shaker provides an open-source alternative that can run on GCP, Azure, or Digital Ocean with a set of event producers and consumers from each of those major Cloud providers.

In addition, Kate Holterhoff, an analyst at RedMonk, said in a press release:

The paradigm of event-driven architecture has become increasingly important to the process of application development and enterprise modernization. The Shaker project from TriggerMesh is an open-source solution for developers and platform teams to unify events across disparate sources and connect their own platforms to new event sources.



Grafana Labs Announces Trace Query Language TraceQL

MMS Founder
MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

Part of the upcoming Grafana Tempo 2.0, TraceQL is a query language aiming to make it simple to interactively search and extract traces. This will speed up the process of diagnosing and responding to root causes, says Grafana.

Distributed traces contain a wealth of information that can help you track down bugs, identify root cause, analyze performance, and more. And while tools like auto-instrumentation make it easy to start capturing data, it can be much harder to extract value from that data.

According to Grafana, existing tracing solutions are not flexible enough when it comes to searching traces if you do not know exactly which traces you need, or if you want to reconstruct the context of a chain of events. That is the reason why TraceQL has been designed from the ground up to work with traces. The following example shows how you can find traces corresponding to database insert operations that took longer than one second to complete:

{ .db.statement =~ "INSERT.*"} | avg(duration) > 1s

TraceQL can select traces using spans, timing, and durations; aggregate data from the spans in a trace; and use structural relationships between spans. A query is built as a set of chained expressions that select or discard spansets, e.g.:

{ .http.status = 200 } | by(.namespace) | count() > 3

It supports attribute fields, expressions including fields, combining spansets and aggregating them, grouping, pipelining, and more. The next example shows how you can filter all traces that crossed two regions in a specific order:

{ .region = "eu-west-0" } >> { .region = "eu-west-1" }

TraceQL is data-type aware, meaning you can express queries in terms of text, integers, and other data types. Additionally, TraceQL supports the new Apache Parquet-compatible storage format in Tempo 2.0. Parquet is a columnar data file format that is supported by a number of databases and analytics tools.

As mentioned, TraceQL will be part of Tempo 2.0, which will be released in the coming weeks, but it can also be previewed in Grafana 9.3.1.



Presentation: The State of APIs in the Container Ecosystem

MMS Founder
MMS Phil Estes

Article originally posted on InfoQ. Visit InfoQ

Introduction

Estes: This is Phil Estes. I’m a Principal Engineer at AWS. My job is to help you understand and demystify the state of APIs in the container ecosystem. This is maybe a little more difficult a task than usual, because there’s not one clear, overarching component when we talk about containers. There’s runtimes. There’s Kubernetes. There’s OCI and runC. Hopefully, we’ll look at these different layers and make it practical as well, so that you can see where APIs exist, and how vendors and integrators are plugging into various aspects of how containers work in runtimes and Kubernetes.

Developers, Developers, Developers

It’s pretty much impossible to have a modern discussion about containers without talking about Docker. Docker came on the scene in 2013, with a huge increase in interest and use in 2014, mainly around developers. Developers loved the concept and the abstraction that Docker had put around this set of Linux kernel capabilities, and fell in love really with this command line, the simplicity of docker build, docker push, docker run. Of course, again, command lines can be scripted and automated, but it’s important to know that this command line has always been a lightweight client.

The Docker engine itself is listening on a socket, and clearly defines an HTTP based REST API. Every command that you run in the Docker client is calling one or more of these REST APIs to actually do the work: to start your container, to pull or push an image to a registry. Usually, this is local. To many early users of Docker, you just assumed that your docker run was instantly creating a process on your Linux machine or cloud instance, but it was really calling over this remote API. Again, on a Linux system this would be local, but it could be remote over TCP, or, a much better way that was added more recently, tunneled over SSH if you really need to be remote from the Docker engine. The important fact here is that Docker has always been built around an API. That API has matured over the years.

APIs are where we enable integration and automation. It’s great to have a command line, developers love it. As you mature your tooling and your security stack and your monitoring, the API has been the place where there have been other language clients created, Python API for Docker containers, and so on. Really, much of the enablement around vendor technology and runtime security tools, all these things have been enabled by that initial API that Docker created for the Docker engine.
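As a small illustration of that API-first design (not part of the talk), the sketch below lists containers through the Docker Engine’s Go client, which wraps the same REST endpoints the docker CLI calls. It assumes a local engine reachable via the usual environment settings, such as the default Unix socket or DOCKER_HOST.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	// Connect to the engine the same way the docker CLI does.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Equivalent to `docker ps`: a GET against /containers/json on the engine API.
	containers, err := cli.ContainerList(context.Background(), types.ContainerListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	for _, c := range containers {
		fmt.Printf("%s  %s  %s\n", c.ID[:12], c.Image, c.Status)
	}
}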

What’s Behind the Docker API?

It will be good for us to understand the key concepts that were behind that API. There are three really key concepts that I want us to start to understand, and we’ll see how they affect even higher layer uses via other abstractions like Kubernetes today. The first one is what I’m going to call the heart of a container, and that’s the JSON representation of its configuration. If you ever use the Docker inspect command, you’ve seen Docker’s view of that. Effectively, you have things like the command to run, maybe some cgroups resource limits or settings, various things about the isolation level. Do you want its own PID namespace? Do you want the PID namespace of the host? Are you going to attach volumes, environment variables? All this is wrapped up in this configuration object. Around that is an image bundle. This has image metadata, the layers, the actual file system.

Many of you know that if you use a build tool or use something like docker build, it assembles layers of content that are combined at runtime, usually with a copy-on-write file system, into what you think of as the root file system of your image. This is what’s built and pushed and pulled from registries. This image bundle has references to this configuration object and all the layers, and possibly some labels or annotations. The third concept is not so much an object or another representation, but the actual registry protocol itself. This is again separate from the Docker API. There’s an HTTP based API to talk to an image registry to query or inspect or push content to a remote endpoint. For many, in the early days, this equated to Docker Hub. There are many implementations of the distribution protocol today, and many hosted registries from effectively every cloud provider out there.
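To give a feel for that registry protocol, here is a sketch (an illustration, not from the talk) that resolves a digest and fetches a manifest over the OCI distribution API using the go-containerregistry crane package, one of several client libraries; a plain HTTP client against the registry’s /v2/ endpoints would work just as well. The image reference is only an example.

package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/crane"
)

func main() {
	ref := "docker.io/library/alpine:latest" // example image reference

	// Resolve the manifest digest via the distribution API.
	digest, err := crane.Digest(ref)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("digest:", digest)

	// Fetch the raw manifest JSON, which references the config object and the layers.
	manifest, err := crane.Manifest(ref)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(manifest))
}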

The Open Container Initiative (OCI)

The Open Container Initiative was created in 2015 to make sure that this whole space of containers and runtimes and registries didn’t fragment into a bunch of different ideas about what these things meant, and to standardize effectively around these concepts we just discussed, rooted in the Docker API and the Docker implementation. That configuration we talked about became the runtime spec in the OCI. The image bundle became the core of what is now the image spec. That registry API, which wasn’t part of the initial charter of the OCI, has more recently been formalized into the distribution spec. You’ll see that even though there are many other runtimes than Docker today, almost all of them are conformant to these three OCI specifications.

There are ways to check and validate that. The OCI community continues to innovate and develop around these specifications. In addition, the OCI has a runtime implementation that can parse and understand that runtime spec and turn it into an isolated process on Linux. That implementation, many of you would know as runC. runC was created out of some of the core underlying operating system interfaces that were in the Docker engine. They were brought out of the engine, contributed to the OCI, and became what is runC today. Many of you might recognize the term libcontainer; most of that libcontainer code base is what became runC.
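For a concrete sense of what that runtime spec configuration contains, here is a minimal sketch built with the OCI’s published Go types. A real bundle, for example one generated by runc spec, would also carry mounts, namespaces, and other Linux-specific settings.

package main

import (
	"encoding/json"
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	// A stripped-down config.json: the command to run, its environment,
	// and the root filesystem of the bundle.
	spec := specs.Spec{
		Version:  specs.Version,
		Hostname: "demo",
		Process: &specs.Process{
			Cwd:  "/",
			Args: []string{"sh", "-c", "echo hello from a container"},
			Env:  []string{"PATH=/usr/local/bin:/usr/bin:/bin"},
		},
		Root: &specs.Root{Path: "rootfs", Readonly: true},
	}

	out, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}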

What about an API for Containers?

At this point, you might say, I understand about the OCI specs and the standardization that’s happened, but I still don’t see a common API for containers. You’d be correct. The OCI did not create a standardized API for container lifecycle. The runC command line may be a de facto standard. There have been other implementations of the runC command line, therefore allowing someone to replace runC at the bottom of a container stack and have other capabilities. That’s not really a clearly defined API for containers. So far, all we’ve seen is that Docker has an API, and we now have some standards around those core concepts and principles that allow there to be commonality and interoperability among various runtimes. Before we try and answer this question, we need to go a bit further in our journey and talk a little bit more than just about container runtimes.

We can see that Docker provided a solid answer for handling the container lifecycle on a single node. Almost as soon as Docker became popular, the use of containers in production showed that at scale, users really needed ways to orchestrate containers. Just as fast as Docker had become popular, there were now a bunch of popular orchestration ideas, everything from Nomad, to Mesos, to Kubernetes, with Docker even creating Docker Swarm to offer their own ideas about orchestration. Really, at this point, we have to dive into what it means to orchestrate containers, and not just talk about running containers on a single node.

Kubernetes

While it might be fun to dive in and try and talk about the pros and cons of various ideas that were hashed around during the “orchestration wars,” effectively, we only have time to discuss Kubernetes, the heavyweight in the room. The Cloud Native Computing Foundation was formed around Kubernetes as its first capstone project. We know the use of Kubernetes is extremely broad in our industry. It continues to gain significant amounts of investment from cloud providers, from integrations of vendors of all kinds. The CNCF landscape continues to grow dramatically, year-over-year. Our focus is going to be on Kubernetes just given that and the fact that we’re continuing to dive into what are the common APIs and API use around containers.

When we talk about orchestration, it really makes sense to talk about Kubernetes. Since we’re talking about APIs, there are two key aspects that I’d like for us to understand. One, coming from the client side, is the Kubernetes API. We’re showing one piece of the broader Kubernetes control plane known as the API server. That API server has an endpoint that listens for the Kubernetes API, again, a REST API over HTTP. Many of you, if you’re a Kubernetes user, would use it via the kubectl tool. You could also curl that endpoint or use other tools, which have been written to talk to the Kubernetes API server.

At the other end of the spectrum, I want to talk a little bit more about the kubelet, this node-specific daemon that’s listening to the API server for the placement of actual containers and pods. We’re going to talk about how the kubelet talks to an actual container runtime, and that happens over gRPC. Any container runtime that wants to be plugged into Kubernetes implements something known as the container runtime interface.

Kubernetes API

First, let’s talk a little bit more about the Kubernetes API. This API server is really a key component of the control plane, and of how clients and tools interact with the Kubernetes objects. We’ve already mentioned, it’s a REST API over HTTP. You probably recognize, if you’ve been around Kubernetes, or even gone to a 101 Kubernetes talk or workshop, that there are a set of common objects, things like pods, and services, and daemon sets, and many others; these are all represented in a distributed database. The API is how you handle operations: create, update, and delete. The rest of the Kubernetes ecosystem is really using various watchers and reconcilers to handle the operational flow for how these deployments or pods actually end up on a node. The power of Kubernetes is really the extensibility of this declarative state system. If you’re not happy with the abstractions given to you, some of these common objects I just talked about, you can create your own custom resource objects; they’re going to live in that same distributed database. You can create custom controllers to handle operations on those.
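As a quick illustration of that client side (not from the talk), the sketch below uses client-go to list pods in a namespace, hitting the same REST endpoints kubectl does; the kubeconfig path and namespace are placeholders.

package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials the same way kubectl does (the path is a placeholder).
	config, err := clientcmd.BuildConfigFromFlags("", "/home/user/.kube/config")
	if err != nil {
		log.Fatal(err)
	}

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Equivalent to `kubectl get pods -n default`: a GET against
	// /api/v1/namespaces/default/pods on the API server.
	pods, err := clientset.CoreV1().Pods("default").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	for _, p := range pods.Items {
		fmt.Printf("%s  %s\n", p.Name, p.Status.Phase)
	}
}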

Kubernetes: The Container Runtime Interface (CRI)

As we saw in the initial diagram, the Kubernetes cluster is made up of multiple nodes, and on each node is a piece of software called the kubelet. The kubelet, again, is listening for state changes in the distributed database, and is looking to place pods and deployments onto the local node when instructed to do so by the orchestration layer. The kubelet doesn’t run containers itself, it needs a container runtime. Initially, when Kubernetes was created, it used Docker as the runtime. There was a piece of software called the dockershim, part of the kubelet, that implemented this interface between the kubelet and Docker. That implementation has been deprecated and will be removed in the upcoming release of Kubernetes later this month. What you have left is the container runtime interface, created several years ago as a common interface so that any compliant container runtime could serve the kubelet.

If you think about it, the CRI is really the only common API for runtimes we have today. We talked earlier about how Docker had an API. Containerd, the project I'm a maintainer of, has a Go API as well as a gRPC API to its services. CRI-O, Podman, Singularity, there are many other runtimes out there across the ecosystem. CRI is really providing a common API, although truly, the CRI is not really used outside of the Kubernetes ecosystem today. Instead of being a common API endpoint that you could use anywhere in the container universe, CRI really tends to be used only in the Kubernetes ecosystem and pairs with other interfaces like CNI for networking and CSI for storage. If you do implement the CRI, say you're going to create a container runtime and you want to plug into Kubernetes, it's not enough to just represent containers; there's also the idea of a pod and a pod sandbox. These are represented in the definition of the CRI gRPC interfaces. You can look those up on GitHub and see exactly what interfaces you have to implement to be a CRI-compliant runtime.
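As a hedged sketch of what talking to that gRPC interface looks like from Go, here is a client calling the CRI's Version RPC. The socket path is an assumption (containerd's default CRI endpoint); CRI-O and other runtimes listen on their own sockets.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// Connect to a CRI runtime over its Unix socket; the path is an assumption.
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Version is one of the RPCs every CRI-compliant runtime implements.
	resp, err := client.Version(ctx, &runtimeapi.VersionRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.RuntimeName, resp.RuntimeVersion)
}
```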

Kubernetes API Summary

Let’s briefly summarize what we’ve seen as we’ve looked at Kubernetes from an API perspective. Kubernetes has a client API that reflects this Kubernetes object model. It’s a well-defined API that’s versioned. It uses REST over HTTP. Tools like kubectl use that API. When we talk about how container runtimes are driven from the kubelet, this uses gRPC defined interfaces known as the container runtime interface. Hearkening back to almost the beginning of our talk, when we actually talk about containers and images that are used by these runtimes, these are OCI compliant. That’s important because fitting into the broader container ecosystem, there’s interoperability between these runtimes because of the OCI specs. If you look at the pod specification in Kubernetes, some of those flags and features that you would pass to a container represent settings in the OCI runtime spec, for example. When you define an image reference, how that’s pulled from a registry uses the OCI distribution API. That summarizes briefly both ends of the spectrum of the Kubernetes API that we’ve looked at.

Common API for Containers?

Coming back to our initial question, have we found that common API for containers? Maybe in some ways. If we're talking in the context of Kubernetes, the CRI is that well-defined common API that abstracts away container runtime differences. It's not used outside of Kubernetes, and so we still have other APIs and other models of interacting with container lifecycles when we're not in the Kubernetes ecosystem. However, the CRI API is providing a valuable entry point for integrations and automation in the Kubernetes context. For example, tools from Sysdig, or Datadog, or Aqua Security, or others can use that CRI endpoint, similar to how in the pre-Kubernetes world they might have used the Docker Engine API endpoint, to gather information about what containers are running or to provide other telemetry and security information, perhaps combined with eBPF tools or other things those agents run on your behalf. Again, maybe we're going to have to back away from the hope that we would find a common API that covers the whole spectrum of the container universe, and go back to a moniker that Docker used at the very dawn of the container era.

Build, Ship, Run (From an API Perspective)

As you well know, no talk on containers is complete without the picture of a container ship somewhere. That shipping metaphor has been used to good effect by Docker throughout the last several years. One of those monikers that they’ve used throughout that era has been build, ship, and run. It’s a good representation of the phases of development in which containers are used. Maybe instead of trying to find that one overarching API, we should think about for each of these steps in the lifecycle of moving containers from development to production, where do APIs exist? How would you use them? Given your role, where does it make sense? We’re going to take that aspect of APIs from here on out, and hopefully make it practical to understand where you should be using what APIs from the container ecosystem.

Do APIs Exist for Build, Ship, and Run?

Let’s dive in and look briefly at build, ship, and run as they relate to APIs or standardization that may be available in each of those categories. First, let’s look at build. Dockerfile itself, the syntax of how Dockerfiles are put together, has never been standardized in a formal way, but effectively has become a de facto standard. Dockerfile is not the only way to produce a container image. It might be the most traditional and straightforward manner, but there’s a lot of tooling out there assembling container images without using Dockerfiles. Of course, the lack of a formal API for build is not necessarily a strong requirement in this space, because teams tend to adopt tools that match the requirements for that organization.

Maybe there’s already a traditional Jenkins cluster, maybe they have adopted GitLab, or are using GitHub Actions, or other hosted providers, or even vendor tools like Codefresh. What really matters is that the output of these tools is a standard format. We’ve already talked about OCI and the image format and the registry API, which we’ll talk about under ship. It really doesn’t matter what the inputs are, what those build tools are, the fact that all these tools are producing OCI compliant images that can be shipped to OCI compliant registries is the standardization that has become valuable for the container ecosystem.

Of course, build ties very closely to ship, because as soon as I assemble an image, I want to put it in a registry. Here, we have the most straightforward answer. Yes, the registry and distribution protocol is an OCI standard today. We talked about that, and how it came to be, coming out of the original Docker registry protocol. Pushing and pulling images and related artifacts is standardized, and the API is stable and well understood. There are still some unique aspects to this around authentication, which is not part of the standard, but at least the core functionality of pushing an image reference and all its component parts to a registry is part of that standard.
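For a feel of that distribution API, here is a small Go sketch that fetches a manifest via the standard `GET /v2/<name>/manifests/<reference>` endpoint. The registry host and image name are placeholders, and most real registries also require a bearer token via an auth flow that sits outside the core spec.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Fetch a manifest using the OCI distribution API. Host and repository
	// are placeholders for illustration only.
	url := "https://registry.example.com/v2/myorg/myapp/manifests/latest"
	req, _ := http.NewRequest(http.MethodGet, url, nil)
	// Ask for an OCI image manifest rather than a Docker-specific media type.
	req.Header.Set("Accept", "application/vnd.oci.image.manifest.v1+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```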

When we talk about run, we’re going to have to really talk in two different aspects. When we talk about Kubernetes, the Kubernetes API is clearly defined and well adopted by many tools and organizations. When we step down to that runtime layer, as we’ve noted, only the formats are standardized there, so the OCI runtime spec and image spec. We’ve already noted the CRI is the common factor among major runtimes built around those underlying OCI standard types. That does give us commonality in the Kubernetes space, but not necessarily at the runtime layer itself.

Build

Even though I just said that using a traditional Dockerfile is not the only way to generate a container image, this use of base images and Dockerfiles, and the workflow around that, remains a significant part of how people build images today. This is encoded into tools like Docker build; BuildKit, which is effectively replacing Docker build with its own implementation but is also used by many other tools; Buildah from Red Hat; and many others that continue to provide and enhance this workflow of Dockerfiles, base images, and adding content. The API in this model is really that Dockerfile syntax. BuildKit has actually been providing revisions of the Dockerfile syntax, in effect its own standard, and adding new features. There are interesting new innovations that have been announced even in the past few weeks.

If you’re looking for tools that combine these build workflows with Kubernetes deployments and development models, they’re definitely more than the few ones in the list. You can look at Skaffold, or Tekton, or Kaniko. Again, many other vendor tools that integrate ideas like GitOps and CI/CD with these traditional build operations of getting your container images assembled. There are a few interesting projects out there that may be worth looking at, Ko. If you’re writing in Go, maybe writing microservices that you just want static Go binaries on a very slim base, ko can do that for you, even build multi-arch images, and integrates push, and integrates with many other tools.

Buildpacks, which has been contributed to the CNCF, coming out of some of the original work in Cloud Foundry, brings interesting ideas about replacing those base layers without having to rebuild the whole image. BuildKit has been adding some interesting innovations, and there's actually a recent blog post about a very similar idea using Dockerfile. Then there's dagger.io, a new project from Solomon Hykes and some of his early co-founders from Docker, which is looking at providing some new ideas around CI/CD, again integrating with Kubernetes and other container services, and providing a pipeline for build, CI/CD, and update of images.

Ship

For ship, there’s already a common registry distribution API and a common format, the OCI image spec. Many build tools handle the ship step already by default. They can ship images to any OCI compliant registry. All the build tools we just talked about support pushing up to cloud services like ECR or GCR, an on-prem registry or self-hosted registry. The innovations here will most likely come via artifact support. One of the hottest topics in this space is image signing. You’ve probably heard of projects like cosign and sigstore, and the Notary v2 efforts.

There’s a lot of talk about secure supply chain, and so software bill of materials is another artifact type that aligns with your container image. Then there’s ideas about bundling. It’s not just by image, but Helm charts or other artifacts that might go along with my image. These topics are being collaborated on in various OCI and CNCF working groups. Therefore, hopefully, this will lead to common APIs and formats, and not a unique set of tools that will all operate slightly differently. Again, ship has maybe our clearest sense of common APIs, common formats, and it continues to do so even with some of the innovations around artifacts and signing.

Run – User/Consumer

For the run phase, we're going to split our discussion along two axes, one as a user or a consumer, and the other as a builder or a vendor. On the user side, your main choice is going to be Kubernetes, or something else. With Kubernetes, you'll have options for additional abstractions or not: whether you depend on a managed service from a cloud provider or roll your own, and even higher-layer abstractions, PaaSes like Knative, or OpenFaaS, or Cloud Foundry, which is also now built around Kubernetes.

No matter your choice here, the APIs will be common across these tools, and there'll be a breadth of integrations that you can pick from because of the size and scale of the CNCF and Kubernetes ecosystem. Maybe Kubernetes won't be the choice based on your specific needs. You may choose some non-Kubernetes orchestration model, maybe one of the major cloud providers' offerings, Fargate, or Cloud Run, or maybe cycle.io, or HashiCorp's Nomad: ideas that aren't built around Kubernetes but provide some of those same capabilities. In these cases, obviously, you'll be adopting the API and the tools and the structure of that particular orchestration platform.

Run – Builder/Vendor

As a builder or vendor, again, maybe you’ll have the option to stay within the Kubernetes or CNCF ecosystem. You’ll be building or extending or integrating with the Kubernetes API and its control plane, again, giving you a common API entry point. The broad adoption means you’ll have lots of building blocks and other integrations to work with. If you need to integrate with container runtimes, we’ve already talked about the easy path within the Kubernetes context of just using the CRI.

The CRI has already abstracted you away from having to know details about the particular runtime providing the CRI. If you need to integrate at a lower point, for more than one runtime, we’ve already talked about there not being any clean option for that. Maybe there’s a potential for you to integrate at the lowest layer of the stack, runC or using OCI hooks. There are drawbacks there as well, because maybe there’ll be integration with microVMs like Kata Containers or Firecracker, which may prevent you from having the integration you need at that layer.

Decision Points

Hopefully, you’ve seen some of the tradeoffs and pros and cons of decisions you’ll need to make either as someone building tools for the space or needing to adopt a platform, or trying to understand how to navigate the container space.

Here’s a summary of a few decision points. First of all, the Docker engine and its API are still a valid single node solution for developers. There’s plenty of tools and integrations. It’s been around for quite a while. We haven’t even talked about Docker Compose, which is still very popular, and has plenty of tools built around it, so much so that Podman from Red Hat, has also implemented the Docker API and added compose support. Alternatively, containerd, which really was created as an engine to be embedded, without really a full client, now has a client project called nerdctl that also has been adding compose support and providing some of the similar client experiences without the full Docker engine.

Of course, we’ve already seen that Kubernetes really provides the most adopted platform in this space, both for tools and having a common API. This allows for broad standardization, so, tools, interoperability, used in both development and production. There’s a ton going on in this space, and I assume, and believe that will continue. It’s also worth noting that even though we’ve shown that there’s no real common API outside of the Kubernetes ecosystem for containers, most likely, as you know, you’re going to adopt other APIs adjacent even to your Kubernetes use, or container tools that you might adopt. You’re going to choose probably a cloud provider, an infrastructure platform. You’re going to use other services around storage and networking. There will always be a small handful of APIs, even if we could come into a perfect world where we defined a clear and common API for containers.

The API Future

What about the future? I think it's pretty easy to say that significant innovation around runtimes and the APIs around them will stay in Kubernetes because of the breadth of adoption and the commonality provided there. For example, SIG Node, the Kubernetes special interest group focused on the node, which includes the kubelet software and its components, together with the OCI communities, is providing innovations that cross up through the stack to enhance capabilities. There are Kubernetes enhancement proposals still in flight for user namespaces, checkpoint/restore, and swap support.

As these features are added, they drive commonality up the stack: they get exposed in the CRI and implemented by the teams managing the runtimes themselves. You get to adopt new container capabilities all through the common CRI API, and the runtimes and the OCI communities that deal with the specifications do the work to make a single interface to these new capabilities possible.

There will probably never be a clear path to commonality at the runtimes themselves. Effectively, at this moment, you have two main camps. You've got Docker, dependent on containerd and runC, and you have CRI-O, Podman, Buildah, crun, and some other tools used in OpenShift and by Red Hat customers via RHEL and other OS distros. There are different design ideologies between these two camps, which really means it's unlikely there will be an absolutely common API for runtimes outside of the layer above, the container runtime interface in Kubernetes.

Q&A

Wes Reisz [Track Host]: The nerdctl containerd approach, does it use the same build API, the Dockerfile syntax?

Estes: Yes, so very similar to how Docker has been moving to using BuildKit as the build engine when you install Docker. That’s available today using the Docker Buildx extensions. Nerdctl adopts the exact same capability, it’s using BuildKit under the covers to handle building containers, which means it definitely supports Dockerfile directly.

Reisz: You said there towards the end that there's no clear path for commonality at the runtime level, CRI-O, Podman, Buildah versus Docker, containerd. Where do you see that going? Do you see that always being the case? Do you think there's going to be unification?

Estes: I think because of the abstraction where a lot of people aren’t building around the runtime directly today, if you adopt OpenShift, you’re going to use CRI-O, but was that a direct decision? No, it’s probably because you like OpenShift the platform and some of those platform capabilities. Similarly, containerd is going to be used by a lot of managed services in the cloud, already is.

Because of those layers of platform abstraction, again, personal feeling is there’s not a ton of focus on, I have to make a big choice between CRI-O or do I use Podman for my development environment, or should I try out nerdctl? Definitely in the developer tooling space, there’s still potentially some churn there. I try and stay out of the fray, but you can watch on Twitter, there’s the Podman adherents promoting Podman’s new release in RHEL. It’s not necessarily the level of container wars as when we saw Docker and Docker Swarm and Kubernetes.

I think it’s more in the sense of the same kinds of things we see in the tooling space where you’re going to make some choices, and the fact that I think people can now depend on interoperability because of OCI. There’s no critical sense in which we need to have commonality here at that base layer, because I build with BuildKit. I run on OpenShift, and it’s fine. The image works. It didn’t matter the choice of build tool I used, or my GitHub Actions spits out an OCI image and puts it in the GitHub Container Registry. I can use that with Docker on Docker Desktop. I think the OCI has calmed any nervousness about that being a problem that there’s different tools and different directions that the runtimes are going in.

Reisz: I meant to ask you about ko, because I wasn’t familiar with it. I’m familiar with Cloud Native Buildpacks and the way that works. Is ko similar just from a Go perspective? It just doesn’t require a Dockerfile, creates the OCI image from it. What does that actually look like?

Estes: The focus was really that simplification: I'm in the Go world, I don't really want to think about base images, and whether I'm choosing Alpine or Ubuntu or Debian. I'm building Go binaries that are fully isolated. They're going to be static, they don't need to link to other libraries. It's a streamlined tool when you're in that world. They've made some nice connection points where it's not just building, but I can integrate this as a nice one-line ko build and push to Docker Hub. You get this nice, clean, very simple tool if you're in that Go microservice world. Because Go is easy to cross-compile, you can, say, throw an AMD64, an Arm, and a PowerPC 64 image all together in a multi-arch image named such and such. It's really focused on that Go microservice world.

Reisz: Have you been surprised or do you have an opinion on how people are using, some might say misusing, but using OCI images to do different things in the ecosystem?

Estes: Daniel and a few co-conspirators have done hilarious things with OCI images. At KubeCon LA last fall, they wrote a chat application that was using layers of OCI images to store the chat messages. By taking something to the extreme, showing an OCI image is just a bundle of content, and I could use it for whatever I want.

I think the artifact work in OCI, if people haven't read about that, search on artifact working group or OCI artifacts, and you'll find a bunch of references. The fact is, it makes sense that there is a set of things an image is related to. If you're thinking object oriented, you know this object is related to that one. A signature is a component of an image, or an SBOM, a software bill of materials, is a component of an image. It makes sense for us to start to find ways to standardize this idea of what refers to an image.

There’s a new part of the distribution spec being worked on called the Refers API. You can ask a registry like, I’m pulling this image, what things refer to it? The registry will hand back, here’s a signature, or here’s an SBOM, or here’s how you can go find the source tarball for, if it’s open source software, and it’s under the GPL. I’m definitely on board with expanding the OCI, not the image model, but the artifact model that goes alongside images to say, yes, the registry has the capability to store other blobs of information. They make sense because they are actually related to the image itself. There’s good work going on there.

Reisz: What’s next for the OCI? You mentioned innovating up the stack. I’m curious, what’s the threads look like? What’s the conversation look like? What are you thinking about the OCI?

Estes: I think a major piece of that is the work I was just talking about. The artifact and Refers API are the next piece that we’re trying to standardize. The container runtime spec, the image spec, as you expect, like these are things that people have built whole systems on, and they’re no longer fast moving pieces. You can think of small tweaks, making sure we have in the standards all the right media types that reference new work, like encrypted layers, or new compression formats. These are things that are not like, that’s the most exciting thing ever, but they’re little incremental steps to make sure the specs stay up with where the industry is. The artifacts and Refers API are the big exciting things because they relate to hot topics like secure supply chain and image signing.

Some of the artifact work is about how people are going to build tools, and that's already happening. You have security vendors building tools. You have Docker, which released a new beta of their SBOM generator tool. The OCI piece of that will be, ok, here's the standard way that you're going to put an SBOM in a registry, and here's how registries will hand that back to you when you ask for an image's SBOM. The OCI's piece will again be standardizing and making sure that, whichever of the handful of security vendors and tools out there you use, they'll hopefully all use a standard way to associate that with an image.



Presentation: Profiles, the Missing Pillar: Continuous Profiling in Practice

MMS Founder
MMS Michael Hausenblas

Article originally posted on InfoQ. Visit InfoQ

Pros and Cons of a Distributed System

Hausenblas: Welcome to Profiles, the Missing Pillar: Continuous Profiling In Practice. What are we talking about? If you have a distributed system, you might have derived it from a monolith, breaking up the monolith into a series of microservices that are communicating with each other and the external world. Then you have a couple of advantages. For example, feature velocity, so you can iterate faster, because the teams can work independently on different microservices and are not blocked by each other. You can also use the programming language or a datastore that's best suited for your microservice, so you have a polyglot system in general. You also have partial high availability for the overall app: just because one part, for example the shopping basket microservice, is not available, your customers can still browse and search.

However, there are also a number of limitations or downsides to microservices, be they containerized or not. That is, you're now dealing with a distributed system. More often than not, you have network calls between the different microservices, and complexity increases. Most importantly for our conversation, observability is not an optional thing anymore. If you are not paying attention to observability, you are very likely to fly blind. We want to avoid that.

What Is Observability?

What is observability? In this context, I will define observability as the capability to continuously generate and discover actionable insights based on certain signals from the system under observation. It might be, for example, a Kubernetes cluster or a bunch of Lambda functions. This system under observation emits certain signals. The goal is that either a human or a piece of software consumes those signals to influence the system under observation. On the right-hand side, you can see the typical flow: the telemetry part gathers signals from the different sources, typically through agents. Signals land in various destinations, where they are consumed by humans and/or software.

Signal Types

Talking about different signal types, we all know about logs: a textual payload, usually consumed by humans, properly indexed. Then there are metrics, which are numerical signals and can capture things like system health; they are also already widely used. Probably a little bit less widely used than logs and/or metrics are distributed traces, which are all about propagating an execution context along a request path in a distributed system.

Profiles

That’s it? No, it turns out, there are indeed more than three signal types. Let’s get to profiles and continuous profiling. We’ll work our way up from the very basics to how certain challenges in continuous profiling is solved, to concrete systems specifically with the open source aspect of it. Profiles really represent certain aspects of the code execution. It’s really always in the context of code, you have some source code and you have some running process, and you want to map and explore these two different representations. The source code has all the context in there, what is the function call? What are the parameters? What are return parameters? The process has all the runtime aspects, so resource usage. How much does a process spend on CPU, or how much memory does it consume? What about the I/O? There are instructions on the application level, and then there are also operating system related, so-called syscalls usually. The function calls of interest, what we’re interested here are typically the frequency. How often a certain function is called, and the duration, so how much time is spent in a certain function.

Representing Profiles

That was a little bit abstract, maybe, so let's have a look at a concrete example. I use Go here. It doesn't have a full implementation, but you get the idea. You have the main function that calls two functions, shortTask and longTask. One way to represent that in a very compact manner would be what is shown on the right-hand side, main;shortTask 10 and main;longTask 90, meaning that 10 units are spent in shortTask and 90 units are spent in longTask. If you've ever heard anything about profiles or continuous profiling, you've probably come across the term flame graphs, something Brendan Gregg coined. That's essentially the idea of showing that execution as stacked bars, where the width represents something like, in this case, the time spent in a function. It gives you a very quick way to get an idea of how your program is behaving, and you can also drill down. There is, in our context, another version of that called icicle graphs. The only difference is that flame graphs grow, like flames do, from the bottom up, whereas icicle graphs grow from the top down, but otherwise they are interchangeable.
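Here is a minimal sketch of the kind of program described above, with the durations chosen purely for illustration. Both functions burn CPU for different amounts of time, so a sampling CPU profiler would attribute roughly 10% of samples to main;shortTask and 90% to main;longTask.

```go
package main

import "time"

// spin busy-loops for the given duration so that a CPU profiler actually
// sees samples here (sleeping would not consume CPU).
func spin(d time.Duration) {
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
	}
}

func shortTask() { spin(10 * time.Millisecond) }

func longTask() { spin(90 * time.Millisecond) }

func main() {
	for i := 0; i < 100; i++ {
		shortTask()
		longTask()
	}
}
```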

How to Acquire Profiles

The first question really is, how do we get those profiles? Because remember, it's about the running process on a certain machine. One set of approaches is built around programming-language-specific profilers. There are profilers in various languages; for example, in Go, you have pprof. That is something people have been utilizing for a while already, typically in a rather static, one-off session, using pprof to capture the profiles as a one-off. A different approach to the acquisition of profiles is based on eBPF. eBPF is effectively a Linux feature. It's a Linux syscall that essentially implements an in-kernel virtual machine that allows you to write programs in user space and execute them in the kernel space, extending the kernel functionality. This is something that can, for example, be used for observability use cases, in our case to capture certain profiles.
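For the Go case, that one-off pprof session can look like the sketch below: the runtime samples the call stack while the profile is running (on the order of 100 times per second by default) and writes a pprof-format file you can inspect with `go tool pprof cpu.out`. The file name and the workload are assumptions for illustration.

```go
package main

import (
	"os"
	"runtime/pprof"
)

func doWork() {
	// Stand-in for the code you actually want to profile.
	sum := 0
	for i := 0; i < 100_000_000; i++ {
		sum += i
	}
	_ = sum
}

func main() {
	// One-off CPU profile, the classic static pprof session.
	f, err := os.Create("cpu.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	doWork()
}
```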

Sampling Profiling

What we will focus on for the rest of the presentation is what is called sampling profiling. This is a way to continuously get profiles from a set of processes running on a machine by periodically capturing the call stack, which could happen tens or hundreds of times per second, with very low overhead. You might be talking about one, two, three percent of overall usage, and maybe some megabytes of memory. In general, the idea is you can do that all the time. You can do it in production. It's not something you can only do in development, although it has its use cases there. It's something that is always on because it has such low overhead.

Motivation and Use Cases

You might be asking yourself, why are we doing that? What's there that the logs and the metrics and the traces cannot already answer? With metrics, for example, you can very easily capture what the overall latency for a service is and how it evolves over time, under load, for example. But if you ask yourself, where do I need to go into the code to tweak something to make it even faster? Or there might be a memory leak, or you might have changed something in your code, and you're asking yourself, did that change make my code faster or not? Then, ultimately, what you need is to be able to go to a specific line of code. Metrics are great. They give you this global view, where global in this context might be a microservice, but they don't really tell you it's on line 23 in main.go. There are a couple of things that continuous profiling enables us to do in combination with the other signal types. That's something I really like to highlight. It's not about replacing all the other signal types. It's one useful tool in the toolbox.

Evolution of Continuous Profiling

Really, if we step back a bit, the continuous profiling field is not something that's just been around for a couple of weeks or months; it has been around for more than a decade. In fact, there is a very nice seminal paper in IEEE Micro, 2010, by a bunch of Googlers, with the title, "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers," where they really make the case that, yes, you can continuously capture these profiles in production and make them available. As you can see from the figure in the paper, some of the components, like symbolization done in a MapReduce fashion, give away what period of time it is from. Overall, they made that point in an academic context already in 2010, and you can imagine that it had presumably been in use internally for some time before that.

In the past 10 years or so, we have seen a number of cloud providers and observability providers offer these typically managed offerings. Not very surprisingly, there's Google Cloud, where nowadays it's the Google Cloud operations suite, which has the Cloud Profiler. Datadog has the Continuous Profiler. Dynatrace has the Code Profiler. Splunk has the AlwaysOn Profiler. Amazon has the CodeGuru Profiler. The few snapshots here on the right-hand side give you an idea of what you will find there: over again, these icicle graphs, or flame graphs in some cases. That has been around for quite some time. The APM vendors, the application performance monitoring vendors, have been using that, and the name gives it away a bit, in a very specific context and very specific domain, and that is performance engineering. You want to know how your application is doing. You want to improve latency. You want to improve resource usage. Those kinds of tools allow you to do that.

Conceptual CP Architecture

One thing has changed in the last couple of years, and that is that a number of things came together in this cloud native environment. Now we have a whole new set of open source based solutions in that space. I'm going to walk you first through a conceptual, very abstracted architecture, and then through a couple of concrete examples. Given that this is a conceptual architecture, it's very broad and vague. Starting on the left-hand side, you can instrument your app directly and expose certain profiles, or you have an agent that, by whatever method, gets the profiles for you. Then you have some ingestion, which typically takes profiles, cleans them up, normalizes them, and so on. Profiles are really at the core of the solution, but you always need symbolization, so effectively mapping machine addresses to human-readable labels, function names. You typically have one part that's concerned with the frontend; it allows you to visualize things in a browser. And there is some storage, where at the current stage you often have mostly in-memory storage, not necessarily long-term, permanent storage. The projects are working on that. Keep that in mind when we review the concrete examples.

CP Challenges

There are a couple of challenges in the context of continuous profiling. All the projects that I'm going to walk you through in a moment go about them slightly differently, though there are some overlaps. More or less, if you want to get into the business of continuous profiling, either contributing to a project or writing your own, you should be aware of these. One of the areas is symbolization. You have to have a strategy in place that allows you to get and map those symbols from your runtime environment, your production environment, where very often, for security and other reasons, you don't have this debug information in a running process. You have different environments. You might have Linux and Windows. You might have Java virtual machines. Coming up with a generic way to represent and map all these symbols is a very big task. Different projects go about it differently.

Then, because it is a continuous process, you're capturing, and obviously have to store and index, a huge volume of those profiles, which means you want to be very efficient in how you store them. There are a number of ways to go about that. One thing that you find very often is the XOR compression initially published and suggested by Facebook engineers. It's widely used and, for this time-series data, essentially exploits the fact that between different samples, not much is changing. The XOR in there makes it so that the footprint is really small. Column storage is another thing that you find very often in these open source solutions. Another challenge is that, once indexed, you want to offer your users, the human sitting in front of the CP tool, a very powerful way to query, to say, "I'm interested in this process, on this machine. I'm interested in CPU, give me this time range." That means you need to support very expressive queries. Last but not least, while for the first three there's already enough evidence out in the open, there could be more done in the correlation space.
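As a toy illustration of that XOR idea (a heavy simplification of the Gorilla-style scheme from the Facebook paper, not any project's actual storage code), consecutive samples that barely change XOR to mostly zero bits, which can then be stored in very few bits:

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

func main() {
	// Consecutive time-series samples are often identical or very close.
	samples := []float64{42.0, 42.0, 42.5, 42.5}
	prev := math.Float64bits(samples[0])
	for _, s := range samples[1:] {
		cur := math.Float64bits(s)
		xor := prev ^ cur
		// Many leading and trailing zero bits mean the delta can be encoded
		// compactly; real implementations add control bits on top of this.
		fmt.Printf("value=%v leading=%d trailing=%d\n",
			s, bits.LeadingZeros64(xor), bits.TrailingZeros64(xor))
		prev = cur
	}
}
```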

Open-Source CP Solutions – Parca

Let’s have a look at three open-source continuous profiling solutions that you can right away, go and download and install and run, try out yourself. The first one up is Parca, which is a simple but powerful CP tool. You can either use in a single node Linux environment or in a Kubernetes cluster. It’s sponsored by PolarSignals. It’s a very inclusive community. They are on Discord. They have biweekly community meetings, very transparent, and can encourage you to contribute there. I consider myself a casual contributor. I really like how they go about things. [inaudible 00:18:42] really is eBPF there, so the agent uses CO-RE BTF, which means you need a certain kernel level or a certain kernel version. In general, going forward, as we all move up, this will be less of a challenge.

The overall design was really heavily inspired by Prometheus, which doesn't come as a big surprise, knowing that the founder and the main senior engineers come from Red Hat and have been working in the Prometheus ecosystem for quite some time. If you look at things like service discovery, targets, labels, and the query language, they are very much inspired by Prometheus, which also means that if you already know Prometheus, you will probably have a very quick on-ramp for Parca. It uses pprof, the Go-defined pprof format, to ingest profiles. It has some really nice dogfooding: it exposes all four signal types, including profiles, so it can monitor and profile Parca itself. It also has very nice security-related considerations in terms of how the builds are done, so that you know exactly what's in there, because the agents running in production need quite some capabilities to do their work.

At a high level, the Parca agent more or less uses eBPF to capture the profiles per cgroup and exposes them in pprof format to the Parca server. You can obviously also instrument your code yourself with pprof, which usually means Go; I believe Rust supports it now as well. As you can see, the service discovery looks very familiar from Prometheus, and you have the UI. You can let it run for a few moments and immediately have a look at the results. This is just a screenshot of the demo instance. You can go to demo.parca.dev and have a look at it yourself, slice and dice, and see what you can get out of that.
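Self-instrumenting a Go service with pprof, so that a collector such as a Parca server (or anything else that speaks pprof) can scrape it over HTTP, can be as small as the sketch below; the port is an arbitrary example.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose pprof endpoints on a side port; a pprof-speaking collector can
	// then scrape CPU, heap, and other profiles from this process.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the real application doing its work
}
```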

Pyroscope

Pyroscope is a very rich CP tool that comes with broad platform support. It's sponsored by Pyroscope Incorporated. On the client side, it depends, in general, on programming-language-specific profilers, but it also supports eBPF. The table here summarizes these two aspects, the broad platform support and the various languages and environments, quite nicely. It comes with a dedicated query language called FlameQL, which is different from what you would use in Parca, but easy enough to learn. It has a very interesting and efficient storage design, and it also comes with a Grafana plugin. As you can see from the overall setup, it's roughly comparable with what we had earlier with Parca, and also very straightforward to set up. From the UI perspective, here is a screenshot from their demo site, demo.pyroscope.io. Again, I encourage you to go there and try it out for yourself. You see the actual resource usage over time, and when you click on a certain bar there, a slice, you can actually see the rest of the profile. On the left-hand side you see a tabular version of that. Again, very easy to use, very powerful.

CNCF Pixie

Last but not least in our open-source examples is CNCF Pixie. CNCF is the Cloud Native Computing Foundation. What happened was that New Relic, the well-known observability provider, donated or contributed this project to the CNCF, and it is now a sandbox project. It also uses eBPF for data collection. It goes beyond profiles: it captures a number of things, full-body requests, resource and network metrics, profiles, even database protocols. On the right-hand side, you see in this column the protocols that are supported. You can really, beyond compute, get a nice idea of how requests flow, what is going on, and where time is spent. It's not directly comparable to Parca and/or Pyroscope in the sense that it positions itself really as a Kubernetes observability tool, not purely for CP, not purely for Linux, but really for Kubernetes. All the data is stored in-cluster, in memory, meaning that essentially you're looking at the last hour or so, not weeks or days. It comes with PxL, a DSL that's Pythonic and is used both for scripting and as a query language. There are a number of examples that come along with that. They're pretty easy to read, less easy to write, at least for me; it comes with a little bit of complexity there. At least you should be able to very easily understand what these scripts are doing. It can also integrate with Slack and Grafana. Again, I encourage you to have a closer look at that. As you can see from the architecture as well, it's definitely wider in scope compared to the previous two examples we looked at. Profiles in Pixie are really just one signal and one resource that it focuses on, but you can also use it for that with zero effort.

The Road Ahead – eBPF as the Unified Collection Method

Let’s switch gears a little bit and look at the road ahead. What can we expect in that space? What are some areas that you might want to pay attention to? One prediction in this space is that eBPF seems to become the standard, the unified collection method for all kinds of profiles. It really has wide support. Parca, Pyroscope, and Pixie already support eBPF. More compute is eBPF enabled, so from low level, from Linux machines to Kubernetes. It essentially means that it enables zero-effort profile acquisition. You don’t need to instrument, you don’t need to do anything, you just let the agent run and use eBPF to capture all these profiles. It’s also easier every day to write eBPF programs.

Emergence of Profiling Standard

In terms of a profiling standard, this is something where I'm less sure. Currently, how the profiles are represented on the wire and in memory varies between the different projects. I suppose that a standard in this space would be very beneficial and would enable interoperability: you could consume the profiles from different agents. Potentially pprof, which is already supported beyond Go, the one programming language where it came from, could be that standard. In any case, I'm definitely keeping my eye on that.

OpenTelemetry Support

Furthermore, there is the telemetry part: how to get the profiles from where they are produced, where the sources are, the machines, the containers, whatever, to the servers where they are stored, indexed, and processed. Obviously, OpenTelemetry, being this up-and-coming standard for all kinds of signals, would need to support profiles. Currently, OpenTelemetry only supports traces, metrics, and logs, in that order, because traces went GA first, metrics are on the way to becoming GA now, and logs come later this year. There is an OTEP in OpenTelemetry, enhancement proposal number 139. It definitely needs more activity, more eyes on it, more suggestions. My hunch would be that once the community is done with logs, which should be during this year, 2022, it will, also with your support, focus its attention on profiles in this space.

Correlation

Last but not least, correlation. Profiles are not useful in a vacuum; they work in combination with other signal types. Imagine this canonical example: you get paged. You look at a dashboard. You get an overall idea of what is going on. You might use distributed tracing to figure out which of the services are impacted, or which you need to have a closer look at. Then you might use the same labels that you used earlier to select certain services in your CP solution and have a closer look at what's going on. Maybe you identify some error in your code, or some leak, or whatever. Another way to go about that is to focus our attention on frontends rather than purely on the labels, which usually come from the Prometheus ecosystem. With frontends, it's a similar story. Think of Grafana as the unified interface, the unified observability frontend. You will see that, again, you can use these environments to very quickly and easily jump between different relevant signals. I wonder if there is maybe a desire for a specification here; think of Exemplars, which allow you to embed trace IDs in the metrics format and can then be used to correlate metrics with traces. Maybe something like that is necessary or desired in this space.

Summary

I hope I was successful in convincing you that there are more than three signal types, that profiles are this fourth important signal type that you should pay attention to, and that continuous profiling, CP, is a useful tool in the toolbox. It's uniquely positioned to answer certain questions around code execution that can't really be answered with other signal types. In the last two or three years, a number of open source tools have become available, in addition to the commercial offerings, which obviously are still here and improving. There is a wonderful website called Profilerpedia by Mark Hansen; I really encourage you to check that out. He's keeping track of all the good tooling and formats and what's going on there. It's a great site to look up things. The necessary components to build your own solution or to use them together are now in place. You have eBPF. Symbolization has become more accessible. The profiles are there, although a standard would be nice. All the storage groundwork has been done. This is the time. 2022 is the year continuous profiling goes mainstream.

Q&A

Abby Bangser [Track Host]: You say that profiling has been around for quite a while, but yet it’s still not maybe widely adopted. There are still a lot of people who have not yet experienced bringing this into their tech stack. What would you say is a good place for people to get started?

Hausenblas: Just to make sure we are on the same page, the actual idea of using profiles for certain specific tasks, think about performance engineering, that practice has been around. If you follow people like Brendan Gregg, for example, you will see a huge body of work and tooling. So far, it was relatively hard for mainstream use: think of performance engineers figuring out how to use these tools, maybe not the easiest. You could have used Linux namespaces and cgroups and bundled something up yourself; then Docker came along, and everyone can type docker run and docker pull and so on. That's the same with continuous profiling now. Setting up Parca, or Pyroscope, or Pixie, and getting that UI and immediately being able to see those profiles and drill into them, has become so straightforward. The barrier is so low that you really should have a look at that, and then figure it out. Identify some maybe low-hanging fruit in your environment, if you're a developer. Have a look at your ecosystem, your programming language, to see if it's supported, which is very likely the case. Give it a try. I think it's easier to really try it out and see the benefits than any theoretical elaboration like this talk, or whatever. It's really an experience worth giving 10 or 15 minutes of your time, and you will see the benefits immediately.

Bangser: Absolutely. It’s always about being hands-on and actually getting the benefit in your own experience, in your own company and tech.You mentioned a few different open-source tooling here, but we’ve got a question that just came in around the Datadog profiler. Is that something you’ve worked with or have any experience to share?

Hausenblas: I personally have not. I’ve seen that with our customers. I know Datadog is an awesome platform, end-to-end, very deeply and tightly integrated from the agent collection side to the UI. I’m a huge fan, but I personally don’t have hands-on experience with it, no, unfortunately.

Bangser: Sometimes there’s just too many tools in the industry to experience with them all. What I found with many other telemetry tools, they’ve affected not just how we deal with production issues, but all the way down to local developers. In your experience, when teams do bring in continuous profiling, how does that impact the development lifecycle?

Hausenblas: I think, especially for developers, if you look at the various open-source projects in terms of use cases, you will see they lead with that. Think of developing something in Python or Go, where you have a pretty good handle on the memory and CPU usage, so you know roughly what your program is doing. You create something new. You fix a bug, or add a feature. With continuous profiling, you can now see whether the three lines of code that you added there actually have an impact, for example, it's now using twice as much CPU or tripling the memory footprint. In a very simple manner, you can just look at the previous version and the current version, you do a diff, and you see immediately: these are the three new lines of code I inserted there, and they make this much difference. Definitely, in development, it's absolutely beneficial.

Bangser: It seems like decreasing that cost of experiment is something which allows you to make the business case for even bigger changes and all that. Can you point towards any tooling for automated alerts when the performance decreases due to massive amounts of profiling data, or that came through that profiling data? Or is that something that always needs to be manually defined out of scope?

Hausenblas: That’s an area of active development. At least the tooling that I know in the open source space is not there yet fully, like you would probably expect from, if you say, I’m going to put together a Grafana dashboard. Have an alert on that. Where you have these integrations, for example in Pyroscope, of course, you can benefit from the ecosystem. That’s definitely a huge area of future work.

Bangser: One of the things you were speaking about with the different tools on the market is that there are these different profiles, and we don't yet have a standard for profiling across them. Are there, and what are, the significant differences between the profiles created when using eBPF versus the language-specific tooling?

Hausenblas: My main argument would be that eBPF is agnostic. It doesn’t really care if you have something interpreted, like if you have something like Python, or Java virtual machine and some Java dialect, or the language on top of it, or something like Go, or Rust, or C. You get, in addition to the program language specific things, the operating system level things as well. You get syscalls in the kernel, you get your function calls, all in one view. That’s why I’m so bullish on eBPF, taking into account that we still have some way to go to enable eBPF and what is necessary there in terms of requirements that the compute environment needs to support to benefit from it.

Bangser: With the language specific ones, it’s just you’re not getting that connection to the kernel calls as well as your software quite at the same time.

Hausenblas: Right. The idea really here is that you want, in the same way that if you’re using a SaaS, or a cloud provider, or whatever, you want to immediately be able to say, is that problem or whatever in my part in the code that I own, or is it in the operating system, the server’s part, or whatever? If it’s not in your part, then the best thing you can do is keep an eye on what the provider is doing. If it’s in your part, you can immediately start trying to fix that.

Bangser: Homing in with those quick feedback loops. One of the things you mentioned was that as we improve the versions we're on, there will be fewer of these problems of trying to integrate this into our systems. There are still lots of very important systems on older software. Here's an example I'm being asked about: a piece of software that's not containerized and is written in Java 5. Are there tools on the market that support that architecture instead of the more containerized Java 8, or Java 8 plus, style?

Hausenblas: Maybe that wasn’t very clear when I mentioned it, or maybe I gave the impression that all the things that I’ve presented here only work in the context of a containerized setup, if you have Kubernetes, or something on Docker. That’s not the case. You can download, for example, the Parca server and run it as a binary directly in your environment. It’s just that in the context of containers, in the context of Kubernetes, the projects make it easy to install it and you get a lot of these integration points. You get a lot of things for free. That doesn’t mean that you can’t use it for non-containerized environments, that you can’t use it for monoliths, for example. A perfect use case, if you have something written in Java as a monolith, it’s the same idea, or the same application. It’s really just, you get in this distributed setup in the context of, for example, containerized microservices. Very often, there’s the need to correlate, the need to use other signal types. For example, you probably need something like tracing initially to figure out which of the microservice it is, and then you can use profiles to drill down into one specific one. That’s the main difference there.

Bangser: You mentioned you might use this in correlation with something like tracing. You trace to get to the right service, then you profile within that service. You were quite optimistic about the connection of continuous profiling data with other types of telemetry data. What do you think is the key missing piece at this point? There’s lots of opportunities there. One thing you feel like would really change the landscape and bring a lot more people on board and into continuous profiling.

Hausenblas: I do believe we already see the beginning there. For example, there's the OTEP in OpenTelemetry, once logs are done toward the end of this year or the beginning of next year. Then you have that end-to-end: from instrumentation, where you actually emit certain signal types such as profiles, to your telemetry agent, the OpenTelemetry Collector, for example, supporting it, to the backends being enabled to store and query the profiles properly. Again, it is early days. Yes, it's going mainstream. 2022 is definitely a year where you can perfectly well start with it. In terms of interoperability and correlation, I think more feedback from the community, from practitioners in general, is required: what are the important bits? What should be prioritized?

This is really more up to you out there. You are using it, or maybe you want to use it. As you dig into it, you will probably figure out limitations. That will then inform any standardization, which doesn't have to come from a formal body; it could be something like a CNCF working group that looks at what exactly we should be doing there, or it might be one of the open source projects that has an initiative around it and says, let's establish a standard for that, such as we have with Exemplars in the case of metrics and spans or traces.



Presentation: Vulnerability Inbox Zero

MMS Founder
MMS Alex Smolen

Article originally posted on InfoQ. Visit InfoQ

Transcript

Smolen: My name is Alex Smolen. I’m the director of security for LaunchDarkly. I’m here to talk about how our security team solved the problem, and by doing so, achieved perfect mental clarity, or at least a temporary reduction in stress. Either way, we think what we learned is worth knowing.

LaunchDarkly

First, I want to talk about where I work and what I work on so that I can put our security problems into context. I work at LaunchDarkly. LaunchDarkly is a service you can use to build feature management into your software. You can deploy code and then turn it on and off with a switch. Why would you want to have these kill switches in your software? I think it's pretty cool. As a matter of fact, I don't call it a kill switch, I call it a chill switch. Let's say you have an outage, or a security incident caused by a bad piece of code. Rather than scrambling to deploy a fix, you can flip a switch. When you're triaging an incident, the realization that you can end it with a simple flip is pretty powerful. LaunchDarkly's vision is to create a world in which software releases are safe and unceremonious. That means helping software developers around the world be more chill. The LaunchDarkly security team's vision is to help our customers' security teams chill out. We need to solve security problems and show our work so that they can trust us and use our service knowing that our security standards are as high as theirs.
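As a rough sketch of the kill-switch idea, not the LaunchDarkly SDK or its actual API, here is what wrapping a risky code path behind a runtime flag can look like in Go; the flag name and checkout example are made up for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Flag is a hand-rolled stand-in for a managed feature flag: new code paths
// are wrapped in a check so they can be turned off at runtime without a deploy.
// (Requires Go 1.19+ for atomic.Bool.)
type Flag struct {
	enabled atomic.Bool
}

func (f *Flag) Enabled() bool { return f.enabled.Load() }
func (f *Flag) Set(on bool)   { f.enabled.Store(on) }

func main() {
	newCheckout := &Flag{}
	newCheckout.Set(true) // in a real system this value would come from a flag service

	if newCheckout.Enabled() {
		fmt.Println("using the new checkout flow")
	} else {
		fmt.Println("falling back to the old checkout flow")
	}

	// Incident? Flip the switch instead of scrambling to deploy a fix.
	newCheckout.Set(false)
}
```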

Vulnerability Scanners

Seems easy enough. We actually had a security problem that was decidedly unchill. Vulnerability scanners. You know the story. You’ve run these things before. Let me give you a list of the worst things about vulnerability scanners. The first is that you have a lot of vulnerability reports that you have to triage and deal with. Second, there are a lot of vulnerability reports that come out of them, and it keeps on going from there. The real worst thing about vulnerability scanners is that there are so many of them. You’ve got network scanners, OS scanners, web app scanners, DB scanners, code scanners, container scanners, cloud scanners. At LaunchDarkly, we’re a small but mighty security team. We know that if we tried to triage every result of all of these scanners, we’d be overwhelmed. We could turn them all on, but pretty soon, we’d have to tune them all out. We’re also in the process of undergoing a FedRAMP certification, which has strict vulnerability management requirements. It’s a tough standard: at the moderate baseline alone, there are a mere 325 controls that we get audited against. That includes RA-5, which describes how we must perform vulnerability scanning at each of our different layers, and then mitigate or otherwise document our acceptance of vulnerabilities within a defined set of SLOs. How do we deal with the huge volume of scanner results that we have and the required evidence documentation, while keeping our chill level to the max?

Inbox Zero

Security can be a stressful occupation. Our minds are constantly bombarded with problems and potential problems. Our team felt the pain of this chaos. That’s when we realized we weren’t alone and we could build a shelter from the storm. Imagine the clarity of knowing that all of your systems are scanned, and all of the results are accounted for. Your mind would be empty, free. We searched for inspiration on how to achieve this highest of mental states, and that’s when we found Inbox Zero. Inbox Zero is not a new concept; it’s about email, that thing we used to send back in the early millennium. Inbox Zero is the practice of processing every email you receive until your inbox is empty, with zero messages. Why would you do this? Because attention and energy are finite. Spending your time responding to messages makes you reactive. When you have a zero inbox, you can actually focus on something; you know there’s nothing more urgent sitting and waiting for you and your dopamine-fueled need for distraction and novelty. The principles of Inbox Zero are intended to free us from the tyranny of incoming information, and let us return to our lives.

Inbox Zero as a concept was developed in 2007 by Merlin Mann in a series of blog posts and a Google Tech Talk. He was overwhelmed by the torrent of email that he was receiving, so he described how to build walls around your inbox. This means processing your email and deciding what it means to you with a series of actions that he described, the first being delete, or archive. You can go back and refer to it later, but get it out of your inbox if it’s not important. Second, you could delegate, or send the email to somebody else. You could respond, but this should be quick, less than two minutes, or else you’ll forget about the rest of your inbox. You can defer, or use the snooze feature to remind you at a more opportune time. Or finally, you could do something, or capture a placeholder to do something about it. These actions are in priority order, and the most important step for Inbox Zero is to delete email before it gets to your inbox. He described the importance of aggressive filters to keep the junk out and your attention free. Focus on creating filters for any noisy, frequent, and non-urgent items. The most important part of your Inbox Zero process should be automatically deciding whether a given message can be deleted, and doing so. Sometimes it can be tough to figure out whether you can delete an email or not; just remember, every email you read, reread, and re-reread as it sits in that big dumb pile is actually incurring mental debt on your behalf. Delete; think in shovels, not in teaspoons. Requiring an action for each email and focusing on finding the fastest and straightest path from discovery to completion is what helps us keep our inboxes clear.

What does this mean for us security professionals? Our time and attention is finite, but the demands on our time and attention are infinite. When he came up with Inbox Zero 15 years ago, Mann knew that it was about more than just email. In his blog post he name-checked Bruce Schneier, who once famously said that security is a process, not a product, and the same is true of the zero inbox. He also put in his blog post that, like digital security and sustainable human love, smart email filtering is a process. Inbox Zero is about email, but it may even be useful for digital security. Then there are these claims about sustainable human love; where did he get these revolutionary ideas? It turns out that the biggest secret to Inbox Zero is that it’s based heavily on David Allen’s “Getting Things Done” book. In this book, David Allen described an action-based system for processing any information or material that lives in an inbox so that you can clear it out. There are three questions he said you need to answer. First, what does this message mean to me, and why do I care? Second, what action does this message require of me? Finally, what is the most elegant way to close out this message and the nested action that it contains? If you’re familiar with it, though, Getting Things Done is more than just an organizational system.

This is directly from a 2007 Wired article about the proclaimed power of GTD, called Getting Things Done Guru David Allen and His Cult of Hyperefficiency. Within his advice about how to label a file folder, or how many minutes to allot to an incoming email, there is a spiritual promise. Later, it says there is a state of blessed calm available to those who have taken careful measure of their habits and made all changes suggested by reason. Maybe personal productivity can get a little culty, but I’m not trying to be like that. I am relatively sure, though, that if you want to manage your vulnerability scans and be as chill as our team, you need to Inbox Zero your reports. I’m here to show you how we did that.

It was about a year ago that we started working together on this problem. We knew we needed to scan all of our resources and respond to the results of these scans within a defined timeframe. We also knew that responding to every scanner result is a recipe for a bad time. There’s no central inbox, and there’s no way to determine if it’s zero. Just a bunch of Slack messages, emails, CSV files, Excel spreadsheets, blood, sweat, and tears. We needed a single source of truth that was only filled with items that merited our time and attention. This meant we needed processing so that we could get rid of the inevitable out-of-date code dependencies that weren’t hit, old container images that needed to be cleaned up, and out-of-date versions of curl. These things are like the spam and forwarded emails from your in-laws. The smarter our processing, the happier our responders.

Processing

This processing step was crucial. We wanted to spend our time, attention, and energy on the few findings that actually mattered. Our team got together and brainstormed what this vulnerability processing pipeline should look like. First, we knew that we wanted to automatically suppress all scan results that were known noisy items, and ignore them. Next, we would check if any of the scan results were already ticketed. Maybe we already knew about it, but hadn’t fixed it yet. If so, we could ignore them. Next, we needed someone to come in and triage the result. Is it a false positive? If so, don’t just dismiss it: write a suppression so that the next time we see it, it’s automatically ignored. Next, you would say, ok, this isn’t a false positive, so is it critical? If it’s not, file a ticket, and we’ll work on it in our regular work stream of vulnerability scan results. If it is, ring the alarm, and we’ll declare an incident.
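To make that flow concrete, here is a minimal sketch of the decision logic in Python. It is an illustration only, not LaunchDarkly's code: the function name, the rule objects with a check() method, and the is_false_positive argument (standing in for the human triage decision) are assumptions.

```python
def triage(finding, rules, ticketed_ids, is_false_positive):
    """Return the action for one finding, mirroring the pipeline described above."""
    # 1. Automatically suppress known noisy items.
    if any(rule.check(finding) for rule in rules):
        return "suppress"
    # 2. Ignore findings that are already tracked in a ticket.
    if finding["Id"] in ticketed_ids:
        return "already_ticketed"
    # 3. Human triage: false positives become new suppression rules,
    #    criticals become incidents, everything else becomes a ticket.
    if is_false_positive:
        return "write_suppression"
    if finding.get("Severity", {}).get("Label") == "CRITICAL":
        return "declare_incident"
    return "file_ticket"
```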

Our Goals

That’s what we wanted to build. We had some goals when we set about to build it. First is we wanted to have it be operationally low overhead. This meant using AWS services for us. We looked at some open source solutions here, but running someone else’s code tends to be a pretty big overhead. We also knew that we wanted the filters to be code based. We thought this would be helpful for making sure that our rules could be really expressive. It would also help us track how we were suppressing various vulnerability scan results so that we could do things like git-blame and figure out why we were suppressing a certain item with context around that change. Another goal we had was to support our FedRAMP requirements around vulnerability management, so we didn’t spend a lot of time in Excel.

Scanning Architecture

We placed this core vulnerability processing pipeline inside of our broader scan, process, and respond framework. On the left is the scanning section, with all of our scanning tools raining down results to ruin our day and our moods. In the middle is the processing section where we standardize and suppress before sending to our inbox. At the center of all this is AWS Security Hub. Then the final section on the right is for responding. This is where our team members get an alert to triage and can quickly get all the information they need to make an informed decision on how to process and spend time on this vulnerability.

We chose several tools to accomplish our scanning goals, the main ones being AWS Inspector, Trivy, Tenable, and GitHub Dependabot. Inspector is an AWS service that performs scanning of our EC2 instances for CVEs and CIS benchmark checks. Trivy is an open source scanning tool that’s used for scanning some of our container images. Tenable tests our web applications, our APIs, and our database. Then we have GitHub Dependabot, which looks at out-of-date dependencies in our code that could be exploitable. For these external scanners, we have Lambda code that runs forwarders. It takes the findings out of these external vulnerability scans, and imports them into AWS Security Hub. To do that, it has to convert them into the AWS Security Finding Format, or ASFF. It attempts to decorate them with some contextual information that might be helpful.
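As an illustration of what such a forwarder might look like, here is a hedged sketch in Python with boto3. The event shape and the to_asff field mapping are assumptions; a real forwarder needs per-scanner mapping logic, deduplication, and error handling.

```python
import datetime

import boto3

securityhub = boto3.client("securityhub")

def to_asff(result, account_id, region):
    """Map one hypothetical scanner result into the AWS Security Finding Format."""
    now = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "SchemaVersion": "2018-10-08",
        "Id": f"{result['scanner']}/{result['target']}/{result['vuln_id']}",
        "ProductArn": f"arn:aws:securityhub:{region}:{account_id}:product/{account_id}/default",
        "GeneratorId": result["scanner"],
        "AwsAccountId": account_id,
        "Types": ["Software and Configuration Checks/Vulnerabilities/CVE"],
        "CreatedAt": now,
        "UpdatedAt": now,
        "Severity": {"Label": result["severity"]},
        "Title": result["title"],
        "Description": result["description"],
        "Resources": [{"Type": "Other", "Id": result["target"]}],
    }

def handler(event, context):
    # Region and account ID can be derived from the Lambda's own ARN.
    arn = context.invoked_function_arn.split(":")
    region, account_id = arn[3], arn[4]
    findings = [to_asff(r, account_id, region) for r in event["results"]]
    # BatchImportFindings accepts at most 100 findings per call.
    for i in range(0, len(findings), 100):
        securityhub.batch_import_findings(Findings=findings[i : i + 100])
```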

Let’s take a look at security hub and how we use this in our environment. AWS Security Hub is a service designed to aggregate security findings, track the state of them, and visualize them for reporting. It integrates with a bunch of AWS services like Inspector but also GuardDuty, Access Analyzer and similar. For us, the killer feature of AWS Security Hub is that it will automatically forward and centralize Amazon Inspector results from different regions and different accounts. This allowed us to just automatically deploy the AWS Inspector agents on our EC2 hosts, and let AWS handle the routing of those findings into security hub.

Processing Architecture

The second section of our pipeline, for processing, is the most critical, and it’s where we do the actual crunching of our findings and prepare them for our response. The most important part of this is Suppressor. Suppressor, as its name implies, takes a list of new scanning results, and suppresses the noise out of our inbox. It does this by listening to EventBridge. Whenever there’s a new finding reported to Security Hub, it runs and makes sure that all of the findings go through a set of rules that recategorize and suppress known false positives. When the Suppressor finishes running, it reports the results to S3, where they’re picked up by Panther for alerting. If we dig into one of these Suppressor rules, we can see that they’re written in Python. What they look like is pretty simple: two methods. One is check, which returns a Boolean indicating whether the rule matches the finding as it comes in. Then action, which returns what we should do for that particular rule. In this rule, what we’re looking for is a particular CVE which doesn’t have a patch available from the operating system maintainer. We may have this ticketed somewhere. We essentially don’t want to receive an alert about it every time it comes up in a new scan. The ability to write these rules in Python, and the templated logic, can be really helpful. It allows us to store our entire rule pack in a GitHub repository. Sometimes this configuration as code has some drawbacks, but we have a full CI pipeline where we lint, test, and deploy our rules. That makes sure that any filters we add are hopefully always going to make things more accurate.
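Based on that description, a single Suppressor rule might look roughly like the sketch below. The class name, the CVE identifier, and the "SUPPRESS" return value are hypothetical placeholders; only the check/action shape comes from the talk.

```python
class SuppressUnpatchedCve:
    """Suppress a CVE that has no patch available from the OS maintainer."""

    CVE_ID = "CVE-2022-00000"  # hypothetical CVE, tracked in a ticket elsewhere

    def check(self, finding) -> bool:
        # Match when the finding refers to the known, currently unpatchable CVE.
        return self.CVE_ID in finding.get("Title", "")

    def action(self, finding) -> str:
        # Tell the pipeline what to do with findings this rule matched.
        return "SUPPRESS"
```

A CI job can then import every rule in the pack, run it against sample findings, and fail the build if a rule stops behaving as expected.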

Suppressor isn’t the only piece of our processing pipeline, there are some other supporting Lambdas that sit in this section. First is requeue, which makes it so that any time we update our rule sets in GitHub, we automatically requeue all of our findings and forward them back to Suppressor for reevaluation. This makes sure that even if we update rules, what’s in security hub always matches the state of what we’d expect. We also have asset inventory, which does several things. For the purposes of this diagram, what it does is provide details about our resources so that when we forward data about vulnerabilities, we can annotate it with additional information that will be helpful for responders. Lastly, we also have something called Terminator. What Terminator does is it takes care of findings associated with resources that have been terminated. We have our SIEM, Panther, which listens to CloudTrail logs, and determines when resources are no longer available. It then notifies Terminator, which removes the findings from AWS Security Hub. This can be for EC2 instances, domain names, databases, archived repositories, and so on.

Response

The final piece of our pipeline is the reporting section. Since all findings are reported to AWS Security Hub, we can use its built-in functions for visualization. These findings are then forwarded from Suppressor to our SIEM, which handles routing the actual scan findings to the correct alert destination, and assists us with deduplication of vulnerabilities across groups of hosts or similar resources. This makes sure that whether our finding was discovered on one host or 200, we only get one alert. When everything goes right, our unique process findings are sent to Slack from our SIEM, Panther, with enough contextual information for a team member to quickly triage a new scan finding. Inbox Zero is not just about sending you messages, we need to actually process these messages as human beings. For this, we have this role that we created on our team, which is rotating, called the Security Commander, as well as a Slack triage bot.

The Security Commander is responsible for being the human Inbox Zero responder. Their job is to quickly triage but not fix any new findings that come in. That means that for the Security Commander, their process flow looks a little bit like this. First, they determine, is this alert a false positive? If it is, write a suppression, upload it and make sure that that finding is removed from security hub. If the finding is legit, then determine, is it critical? If it’s not, file a ticket. If it is, then respond and potentially create an incident around this vulnerability. Since most of the time what this means is that the Security Commander is either writing suppressions or potentially filing tickets, it isn’t a particularly high overhead role and allows them to focus on their workday, while keeping us focused on having Inbox Zero.

Our Slack triage bot scans the results as they come into Slack from Panther, and makes sure that we are being responsive to all alerts as they come to us. To assist our Security Commander, this Lambda which is shared across all of our security tooling, helps keep us honest by making sure that we respond to alerts and also preparing metrics about the kinds of alerts we’re seeing and how quickly we’re responding to them. It also provides a couple of shortcut actions for the Security Commander for doing things like creating new tickets for vulnerabilities.

Asset Inventory

Inbox Zeroing your way to vulnerabilities being completely addressed is really great. How do you know that you’re actually scanning all of your important resources? We have a couple of Lambdas that look at our infrastructure APIs and code repositories and output the source of truth inventory to S3, as well as information about the resources being scanned. Like for EC2 instances, do they have Inspector running? Or for GitHub repositories, is Dependabot enabled? We additionally have an Endpoint Checker Lambda, and we use this to make sure that all of our domains are scanned to determine whether or not they’re publicly accessible. If they’re publicly accessible, they should be included in our vulnerability scanning. We do this via just a simple port scan.
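A minimal sketch of that endpoint check in Python is below; the port list and timeout are illustrative assumptions, and a real checker would also write its results back into the asset inventory.

```python
import socket

WEB_PORTS = (80, 443)  # assumed: ports that indicate a publicly reachable endpoint

def is_publicly_accessible(hostname: str, timeout: float = 3.0) -> bool:
    """Return True if any of the common web ports accepts a TCP connection."""
    for port in WEB_PORTS:
        try:
            with socket.create_connection((hostname, port), timeout=timeout):
                return True
        except OSError:
            continue
    return False

# Domains that answer should be included in the vulnerability scanning inventory:
# for domain in inventory:
#     if is_publicly_accessible(domain):
#         mark_for_scanning(domain)
```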

Lessons Learned

I wanted to share some lessons we learned while setting up this scanning and processing architecture. First, the ASFF format for Security Hub is pretty rigid, and we had to fit some of the external findings into it in a little bit of a distorted way. We found that a little bit challenging. We also found it challenging to make sure that everything had a unique finding ID, especially when resources were similar across environments or accounts. Finally, we’ve been weighing the tradeoffs between Inspector V1 and Inspector V2. Inspector V1 doesn’t work in all regions that we want it to. It doesn’t have the same integration with Security Hub. It also requires its own separate agent on EC2. The big tradeoff is that V2 currently doesn’t support CIS benchmarks. Another thing we learned is that our underlying operating system on EC2, Ubuntu, still requires restarts even with unattended updates. What we found is that tracking reverse uptime, and making sure that old hosts get rebooted pretty frequently, is important. That also means making sure that all of our infrastructure supports being rebooted and restarted. Finally, we found that Security Hub has some relatively tight rate limits. You can see here that there are some APIs which can rate-limit you relatively quickly, and so we had to rearchitect some of our pipeline to account for this.

FedRAMP POAM (Plan of Action and Milestones) Automation

I wanted to share another benefit of having a single source of truth for this vulnerability data. This one’s for the FedRAMP heads out there. We built a few Lambdas to automatically generate our monthly continuous monitoring reports. The asset inventory Lambdas go in and generate a list of cloud resources and their compliance with some of our security controls, things like disk encryption, running security agents, and so on. We then query Security Hub to ensure that all vulnerabilities in it map to vulnerabilities that are documented in Jira and associated with what are known as POAMs, or Plans of Action and Milestones. We can then automatically generate the Excel spreadsheets that we need to provide to the federal government to show that we’re ready to handle federal data.
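To illustrate the shape of that cross-check, here is a hedged sketch that pulls active Security Hub findings and flags any that are not mapped to a Jira-tracked POAM. The jira_poam_ids input and the CSV output are simplifying assumptions; the real reports are Excel workbooks in the FedRAMP-mandated format.

```python
import csv

import boto3

securityhub = boto3.client("securityhub")

def export_poam_check(jira_poam_ids: set, out_path: str = "poam_check.csv") -> None:
    """Write one row per active finding, noting whether it maps to a POAM."""
    rows = []
    paginator = securityhub.get_paginator("get_findings")
    for page in paginator.paginate(
        Filters={"RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}]}
    ):
        for finding in page["Findings"]:
            rows.append(
                {
                    "finding_id": finding["Id"],
                    "title": finding["Title"],
                    "severity": finding.get("Severity", {}).get("Label", ""),
                    "has_poam": finding["Id"] in jira_poam_ids,
                }
            )

    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["finding_id", "title", "severity", "has_poam"]
        )
        writer.writeheader()
        writer.writerows(rows)
```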

What’s Next?

Looking forward, we’re excited to make some improvements to this pipeline. First, we want to upgrade to Inspector V2 once it supports CIS benchmarks, which we’re hoping come soon. We’re also looking to add more scanners. I think this is going to be an opportunity for us to take advantage of this filtering pipeline to really make sure that when we do add scanners, we’re getting value from them. We’re hoping to be able to expire suppressions regularly so that they need to be revisited to ensure that they’re still appropriate. We also want to be able to delegate findings to teams. On our security team, we do go in and fix vulnerabilities, but we want to be able to also have the option to send them out to teams to parallelize our effectiveness. It would also be great to have team scorecards, where we could incentivize teams to go in and update their own infrastructure that has vulnerabilities that are close to being out of SLO. Finally, we wanted to take a lot of this data and put it into Snowflake or a similar data warehouse so that we could really slice and dice it, look at it with data visualization tooling. That’s an area that I’m excited for us to work on together.

Summary

Love them or hate them, vulnerability scanners aren’t going anywhere. We recommend that you embrace the avalanche and think in shovels not in teaspoons. Quit lying to yourself about when you’ll actually clean out your inbox. I hope that you’ll all join us in the practice of Inbox Zeroing your way to vulnerability scan tranquility.

Questions and Answers

Knecht: Thinking back and retrospecting, what were the biggest challenges in creating this flow at LaunchDarkly? What were maybe roadblocks you ran into as you were rolling it out?

Smolen: I think one of the biggest challenges was getting consensus as a team about what we were trying to do. I think we all recognize that we had problems, and different people, depending on their role, had a different perception on what the problem was or how to solve it. Figuring out what we wanted the end state to look like, I think was a really big step towards arriving at the solution that we did. That’s why I think vulnerability Inbox Zero, even though it’s a stretch maybe as a metaphor, I think was really helpful for us to all get agreement on what the big picture looked like. Where we were all heading towards, the North Star. That was a challenge, though, to figure that out. I think another set of challenges was related to really getting that source of truth to be high quality. When you’re dealing with data, the challenges are often in not just getting the data into a centralized place, but really making sure that that data is of high quality. Getting all the vulnerability data into security hub, or wherever else you might want to put vulnerability data, that certainly takes some effort. The challenges were really around making sure that it was up to date, that it had all the information that we needed, things like that.

Knecht: That makes sense, especially when you might be having scanners that find the same thing with slightly different words, like deduping, all of those things, definitely. It seems like a hard problem to solve.

One of the things you talked a lot about was Suppressors and not actioning things as the action plan for those vulnerabilities. What do you think about the risk of accidentally filtering out a critical true positive vulnerability, or how did you think about that or weigh that as you were designing the system?

Smolen: I think going back to what you were just talking about with respect to having scanners that are somewhat duplicative, I do think that there’s layers of defense in a security program where you identify vulnerabilities along a timeline. If a vulnerability never gets identified, then, who cares? If a vulnerability comes and goes, and no one ever exploits it, then it’s the proverbial tree falling in the forest. You can discover a vulnerability along a timeline, hopefully not as part of a postmortem of an incident. That’s the scenario you want to avoid. How do you catch it earlier? Obviously, scanners are going to be one way to do that. There may be a world where you identify the issue through a bug bounty, or through some manual penetration testing or something along those lines. Our security program, we do have those multiple layers of defense in place.

I also think that there’s a huge amount of value from what I may call threat intelligence, which in reality can sometimes be like, knowing other people who work in security teams, and hearing like, we’re hearing rumors that people are exploiting this vulnerability. Certainly there is, I think, a huge value in the collective knowledge that we share in the security community, as to whether or not a vulnerability is something that we really need to triple check that we haven’t missed. Because the reality is that the overwhelming majority of vulnerability scanner results do not refer to things that would be exploited. We would all be in trouble if they were.

Knecht: You talked about some of the future shape of what this is going to look like at LaunchDarkly. Has some of that stuff been realized? Have you had any additional or different thoughts about how that might evolve, or does that pretty much look the same as what you mentioned there?

Smolen: I mentioned this, but I think something that is really a focus for us right now is thinking about the data with respect to this pipeline, and how it relates to our asset inventory, and recognizing that as a security team, our decision making can only be as good as our data in some cases. We are running into I think a problem that a lot of software engineering teams run into, which is that, getting good data is almost an entirely separate discipline from software engineering. You see data engineering teams that specialize in capturing data and providing the analysis and visualization tools that enables the business to make good decisions. We are needing to build a similar capability internally for our security team. We’re learning a lot of lessons from the data engineering discipline to help us think about how we run our security program.

Knecht: We have similar stuff going on at Netflix actually, thinking about, what are the practices of data engineers, and quality checks, and all these things that help us to have trustworthy data. I think that’s a hard one to solve, but also an important one.



Article: Going from Architect to Architecting: the Evolution of a Key Role

MMS Founder
MMS Leigh Griffin Chris Foley

Article originally posted on InfoQ. Visit InfoQ

Key Takeaways

  • The role of an architect has evolved from a command and control to a technical coaching and mentoring role
  • Architectural considerations are everyone’s responsibility and problem to address
  • An inversion of control has occurred for the role of the architect, and where they fit into the team is now a challenge for many
  • Tooling advances have simplified key challenges, such as scaling, to which architects previously added value
  • Teams may be stuck in a transition towards making architecting a team capability due to the lower maturity of their processes

A changing world

Software could be viewed as a very young science, compared to the more traditional sciences. But even in its infancy, one of its key components, its architecture and how that is formed, has changed significantly. Gone are the architectural blueprints, the months given to producing the complete design that would solve all problems, and gone is the sole individual governing all. That paradigm shift was partly driven by the industry creating better tools, and partly by the changing behaviour of customers. Their interaction model changed from a transactional service to a consumption-driven service, moving customer behavior from systems of record to systems of engagement, where customers now have more active and timely demands. Software architecture needed to evolve to meet this new demand and embrace the available tools. Today, architecture is more about decisions than structure, more about reacting to constant change than following a master plan, more about delivering frequently than one big drop. The implications of this for the role the architect plays are profound.

In this article we will explore the cultural change of moving towards shared architecture, and the role that the architect has evolved into; from one with an air of authority and singular vision, to one in which system design issues are surfaced, which require team-wide input to resolve. This has caused an inversion of control style relationship to develop. Teams that are being coached and guided towards shared ownership might be struggling with this paradigm shift of ownership. 

We will explore our experience with this change, from our combined 25+ years’ experience of wearing multiple hats in a team, from being an engineer and a product owner, to a team coach and manager. Each of these roles has allowed us to interface with architects and has provided us with observations of the changing landscape and of how the role of the architect has tried, and at times failed, to evolve. We hope to offer guidance to those stuck in transition, as well as to those looking to further enhance and distribute their architecting.

Change Factors at Play

A changing responsibility

Traditionally, an architect had a number of fundamental responsibilities. One of those involved the scalability profile of the application. The architect needed to consider a number of different factors to ensure that the anticipated load on the system could be handled. That informed a range of decisions: what language would best handle this style of application? How do we want to handle I/O, to block or not? What is our database strategy? How many cores do we need? What about RAM? What about storage? The level of refinement here gave careful consideration to the deployment strategies and the availability of specific hardware or chipsets, right down to the location the applications would be staged from. Those decisions ultimately gave a holistic overview of the lifecycle of the application and its intended usage, and often informed the update cadence and strategy.

In a more modern context, the tooling available to the overall development team has reduced the number of considerations that an architect previously needed to think through. The capabilities of autoscaling, for example, ensure that questions surrounding the computational resources needed by the app are neutralised as a topic (notwithstanding that somebody needs to pick up the bill!). Deployment and handling of bursty load are made trivial through orchestration platforms like Kubernetes, which will manage additional instantiations of your application on demand and step them down as traffic decreases. Profiling tools, from static analysis of cyclomatic complexity to performance analysis metrics, right through to visualisations of your API’s capabilities, have now made a wealth of information available at a team-wide level. The toolkit described above now appears on standard job specifications, meaning the previously specialised knowledge of the architect is naturally distributed across the entire team, and the knowledge generation and data insights far exceed what a singular role could ever have hoped to share with the team. This means some ownership and accountability in this space have transitioned over to the team as a whole, rather than being solely owned by one individual. Shared ownership has become a thing. Now the team often decides on the tooling, influenced by industry standards, client expectations and technology alignment within the company.

Changing consumption patterns

The rapid evolution of cloud computing (or the SaaS culture and model) has prompted a move towards flexibility in how we ship, when we ship, and what we ship. The focus now is on creating a more robust onboarding of services and support, allowing the team the capability to change focus quickly. Knowing that increased functionality comes with more user accessibility of the feature set means that learning the consumption profile becomes a key decision in how features are developed and grown. That hardening element, previously a multi-month thought exercise on stability, scale and robustness, has now given way to more willingness to experiment.

Features being released under banners of technical preview, with caveats of do-not-use-this-in-production often willfully ignored, allow the safety net for the evolution of the app to grow in lockstep with your customers’ demands. This removes the vacuum effect of the customer being disconnected from the team, with that prior relationship often solely managed on behalf of the team by roles such as the business systems analyst, or more recently the product owner. Now, the team is much more customer-aware, and maybe even more aware than the architect is at times. They have exposure to how the customer interfaces with the system, and when combined with the insights derived from their telemetry applications, the pulse on what the customer needs, why they need it and how they need it is present. That brings a powerful multi-faceted view of the evolution of the app, as now the entire team, with their diverse backgrounds, skillsets and niche expertise, can contribute towards the grander vision, truly transitioning it from the role of a single individual to one the team largely drives, in collaboration with the customer’s needs.

The actual impact of this, at a code structure level, is best seen through the rise of microservices. With this change of ownership and the change in demand, the application as a whole needs to be in a position to evolve independently; to allow some services to try something different, to test out a feature, to be able to switch on and off functionality for some or all customers. This has created a virtuous cycle wherein this development approach has spawned a suite of supporting tools and services, ranging from API gateways to organise your service contracts, to messaging systems like Apache Kafka, through to microservice-enabling frameworks such as Spring Boot, Flask and other language-specific frameworks. The availability and maturity of this tooling has in turn made it easier for teams to self-select microservices as an architectural style to adopt, further driving the investment in the tooling.

The architect can no longer design the architecture with a master blueprint in mind. Modern consumption patterns demand more agility. The architect must constantly adapt their system design to address fast-flowing customer needs; they must facilitate the evolution of the architecture.

Mindset changes, opportunities, challenges and new skills to be mastered by today’s architect

After seeing the different control factors at play, we believe that there are things that need to change for the modern architect; challenges to address and opportunities to take advantage of, and new skills to be attained and practised.

Software architecture is constantly evolving

A fundamental principle of today’s software architecture is that it’s an evolutionary journey, with varying routes and many influences. That evolution means we change our thinking based on what we learn, but the architect has a key role in enabling that conversation to happen. Here are two quotes from a chief architect we interacted with at Red Hat, which capture some of the thoughts and concerns that face today’s architects:

A critical subject is the impact of Conway’s law: the architecture of your system reflects the structure of the organisation building it.

Another aspect of architecture conversation is surfacing issues that either nobody was aware of, or that everyone was aware of but were not willing to talk about. Putting these to the fore for discussion is essential. Not with a sole focus to solve and implement, but more so to discuss them such that everyone knows there is a path forward or that we can initiate appropriate adjustment as the design evolves. Sometimes one of the most important outcomes of these discussions is the realisation that the problem is not a problem at that point in time. The clarity that this can provide is essential for everyone. It allows people to focus on upcoming tasks without some feeling of a shadow looming over them, in other words, it removes an unspoken burden. — Emmanuel Bernard, distinguished engineer and architect, Red Hat

The presumption is that the majority would tend to agree with those thoughts, but has their architectural decision process evolved to match this thinking? Are they taking into account org structure? Are they forgoing up front design to introduce up front conversations? The first step with any change is awareness and then acceptance.

One of the primary influences on product/service changes is customer interaction and feedback, and even understanding that your customer may be internal teams as well as paying customers. The feedback loop is continuous in modern markets, and the architect must fully embrace this opportunity. In conjunction with continuous feedback, there is an expectation of frequent, or as close to continuous as possible, delivery. This introduces challenges for the architect, their team and their organisational structure, as rarely can continuous delivery be achieved in isolation; it often requires an organisational movement to achieve it successfully.

Small iterations resulting in frequent deliveries can be ideal for functionality that fits into such a window. However, when introducing potentially bigger pieces of functionality (e.g. architectural refactoring), it may not be as simple as one would expect. This challenges the architect and their team to deliver component parts on a frequent basis, while still ensuring the service runs satisfactorily and that SLAs and quality expectations are abided by. More importantly, surfacing a resolution path early in the development, before a forced evolution occurs, brings a sense of ownership to an enforced change.

In a lot of cases the timing of design decisions can be governed by business needs. The marrying of business needs, in a timely fashion, with system design decisions is a real challenge that the architect and his/her team need to address. The traditional focus of creating an architecture primarily oriented towards the end goal has changed to “what’s needed next to adequately address the business needs of the upcoming months”. This may lead to decisions which get changed further down the line, but they are the correct decisions at that point in time. As we learn by building and our customers learn by interacting with our products, it provides the tight feedback loop that makes any changes informed in the here and now. It’s the natural evolution of the system architecture, and knowing that ahead of time and building in strategies to handle best-case and worst-case scenarios is a key skill that the team needs to develop. Architecture used to be perceived as a straight road, but it isn’t, and arguably never should be again. The evolution of an architecture has many twists and turns along its way, with each one bringing a learning opportunity.

That’s not to say that the architect must ignore the perceived end goal of the architecture, something that used to be their only focus. In the current climate, the product owner becomes a very important collaborator for the architect, meaning that end goal and vision become a shared experience. Debating, collaborating and agreeing with the product owner on the vision of the product/service is essential for the architect to ensure that there is a direction, even if it may not be crystal clear all the time. Conversely, that debate allows a sense of realism to the product owner’s vision, hopefully allowing a compromise that reflects what’s achievable and what a realistic timeline looks like. This vision provides guide rails and becomes another important input to the technical decision-making process.

Architecture is no longer an individual sport

The architect no longer plays a solo role in the software development game. Architecting a system is now a team sport. That team is the cross-functional capability that now delivers a product, and is made up of anyone who adds value to the overall process of delivering software, which still includes the architect. Part of the reason behind this, as discussed earlier, is that the software development ecosystem is a polyglot of technologies, languages (not only development languages, but also business and technical), experiences (development and user) and stakeholders. No one person can touch all bases. This change has surfaced the need for a mindset shift for the architect; working as part of a team has great benefits, but in turn has its challenges.

To be able to lean on others and hand over responsibility is an important skill that needs to be attained. This comes down to building trust between the architect and team members. They must now share ownership of the technical direction and trust colleagues to own and drive certain aspects or components of the system. The ecosystem of how architecture evolves has changed from a single top down commander to a conglomerate of peer contributors, all with diverse perspectives.

For trust to be bi-directional, there are times when judgment needs to be held in reserve, and new ideas and opinions allowed to flourish. The architect needs to be a leading figure in establishing psychological safety for his/her team. Meritocracy of ideas must be nourished in order to allow the architecture to evolve in the most optimal way.

Failures at the team level, rather than being perceived as incompetence, must be seen as opportunities to get things right. More frequent delivery cadence aids this approach as action can be taken quickly.

The architect needs to accept that architecting has evolved from being an exclusive individualized responsibility, to that of a shared team responsibility. Through this acceptance, they can avail of the array of benefits a team environment can generate. For people who have worn the more traditional architect’s hat, the perceived lowering of themselves to be a team member can most definitely be a struggle. We have unfortunately seen this first hand, where the architect took on a more challenging tone to every idea, suggestion or improvement that the team was bringing up in their innovation phase.

This led to an impasse: the team slowly became silent over time, knowing their suggestions were not being taken at face value, and the architect, unable to reconcile their own limitations, was unable to chart a path forward. In turn, this becomes as much a battle over the title as a reluctance to relinquish control and acknowledge gaps in one’s own knowledge and capability. In our team, this behaviour continued with the architect insisting on being a deliberate reviewer of all proposed changes, undermining the growth of more than capable team members and slowing down the pace of delivery. While we have seen this pattern in several companies, across several industries, the fast-paced evolution of technical stacks in a cloud-centric environment is making this an even bigger challenge.

This means that the architect is no longer the singular authority on a technical area: a constant drive to improve tooling, stacks and approaches is happening within every developer and, more importantly, across the industry, making it a technical arms race. If the architect is unwilling to trust and empower those around them, it will inadvertently cause a failure to support and enable that tech stack progression. This becomes a losing battle for a team: the architect feels they need to catch up their own knowledge to make a decision; however, the developers are actively coding, debugging, experimenting and learning day in, day out, and at a far deeper level than the architect can reach. If that knowledge gap is never bridged, it leads to attrition and a fundamental loss of psychological safety within teams. Strong leadership is required to address this and, more importantly, strong support, from coaching to mentoring, to help the architect overcome the fears they are experiencing. What both the architect and the overall team need to be aware of is that what’s best for the team may not be best for the individual, but will often reap the most benefits (for the betterment of the product or service being built).

An example of this is slowing down a release cadence to ensure common understanding. In a prior company we had a subject matter expert (SME) who held the architect position while also possessing the technical ability to drive the implementation of a feature from their design almost solo. Their vision for the architecture of this particular product was to refactor the existing component-oriented design pattern towards a more modular, plugin-based strategy. This came from their own vast experience of architectural best practices and from deep interlocks with customers. Their vision was presented as a robust Proof of Concept, undertaken solo, a well-formed email explaining their rationale, and a suggestion that the refactor could be in place before the next scheduled customer drop. While the vision, passion and capability were undeniably things we (Chris as an engineer, Leigh as an engineering manager) wanted to tap into, an air of unease existed within the team. It was an architectural shift, with new technology that the team as a whole had minimal awareness of. The SME in this case was due to lead another project roughly three months after the next customer drop, and a delivery was already contractually expected, meaning a cost of delay would be incurred.

A decision was taken to support the SME’s vision, but to engage them to elaborate more on the benefits (better interoperability, smoother customer integrations, easier debugging) and then negotiate our release commitments with both customers. This gave a safe footing to slow down the eagerness of the individual, while supporting the idea and, more importantly, allowing the team as a whole to get comfortable with the tech and the changes. The result was a sustainable architecture, but more importantly sustainable skills cultivated within the five other developers who, longer term, would own the evolution of this product. This undoubtedly put pressure on the sales team and the customer side, but the payback over the coming 18 months was clear for the development team as they negotiated the rapid business requests to add functionality. This had a happy ending, but that wider buy-in required honest conversations from the engineering leadership to create the safety for both the team’s and the product’s evolution.

Technical acumen is no longer sufficient

Technical knowledge has always been, and will always be, a prerequisite for the architect. Business acumen and market understanding have most definitely increased in importance. But the big change the architect needs is to use coaching skills across all the people involved in the software lifecycle. This may sound overly simplistic, but it is so important for the architect in the ever faster-paced software industry. Their ability to listen to and consume the business perspective, the technical needs and wants of the engineers on the ground, and the management’s need to deliver fast becomes essential. The architect needs to become competent in the use of “powerful open questions” as an important mechanism to provoke deeper thinking and elicit varying perspectives.

Things like asking “Why” questions can imply judgment, e.g. why did you take that approach? Changing to “What made you decide on that approach?” will prompt the responder to explain their thinking rather than justify their decision, as the decision may have seemed to be deemed incorrect if asked using “Why?” This simple change and the use of open, curious language can go a long way towards creating inclusivity across the group and, more importantly, creating a supportive atmosphere rather than a perceived challenging one.

The architect has become a polyglot; their traditional technical language has been accompanied by the language of business and the language necessary to extract the best views and ideas from engineering teams.

Practical Tips

Having distilled down the challenge, here are six practical tips for architects and six practical tips for teams who might be struggling with this transition.

For architects:

  1. Become a mentor to the team’s growing understanding of architecture, rather than an impediment. Share your knowledge openly and proactively. 
  2. Seek coaching to help you get beyond inner challenges that only you feel but the team may not be aware of. Don’t suffer alone, and guided support can aid your role evolution.
  3. Welcome being challenged by customers, the team and your environment. That feedback loop can be exhausting, but channeled in the right way can be hugely rewarding.
  4. Use your experience to guide conversations towards challenges that your expertise tells you will be encountered. 
  5. Gain an understanding of the dynamics of your team, their strengths and weaknesses, their knowledge of the tooling, and their reality of building an application day in day out. Help to structure your input on where it can add the most value, at the right time.
  6. Become a relationship builder. Develop your soft skills to build a network, from the sales team to the product owner, from the engineering manager to the tech SME. Nurture and foster those relationships daily.

For the team:

  1. Distill down your experience of using a tool and its advantages for non-domain experts. Bring them on a journey of understanding.
  2. Use the architects’ vast experience to gain insight on a thought, challenge or idea you might have. They are part of your team now.
  3. Present your ideas, their benefits and drawbacks transparently and simply; be prepared for open and challenging feedback. Help embody psychological safety.
  4. Leave titles and egos at the door; embrace the team environment and learn from everyone in the room. The scope for you to influence direction is very real in how software is designed today. 
  5. Grow your presentation, communication and mentoring skills and use them daily, information exchange in fast paced teams is crucial.
  6. Do everything in your power to retain your architect. That deep rooted knowledge and expertise is invaluable to grow and empower your team; don’t make them feel isolated, make them feel part of the team and part of the future solution.

Conclusion

The role of the architect has fundamentally changed for the better, for the software industry and more importantly for our customers. How we engage with customers, and how we build, ship, release and support our software, has changed. That change has empowered the overall development team and roles previously considered support roles, such as quality engineering/quality assurance. Now, every person has a voice, an opinion and a valid input into how a system grows and is supported over time. That has been complemented by two independent but related changes. Firstly, the change in end-user expectations, where the demand for more rapid feedback and more imperfect services now exists in order to guide what they need and when they need it. Secondly, a suite of tooling has emerged to support and enable developers in their day-to-day work. This brings about solutions to problems that were previously a consideration for architects only, and allows more insight on performance, scale and design to naturally percolate among the teams. The result is a change in the foundational role that the architect needs to play. Their years of experience, combined with their vast knowledge of best practices, now need to be reimagined into the daily flow of the team. This is a chance to level up the entire team’s experience, creating a more diverse view of how we build our software. That change can be difficult; it requires support from management and the team. It also requires a willingness from the person in the role to evolve, to offer more value than ever before and to relinquish the prestige of the title for the betterment of the team, the product and the customers.




Microsoft’s Distributed Application Framework Orleans Reaches Version 7

MMS Founder
MMS Edin Kapic

Article originally posted on InfoQ. Visit InfoQ

Microsoft Orleans, a .NET framework for building scalable distributed cloud applications, has been updated for .NET 7 and released as Orleans 7.0.0 on November 8th, 2022. The improvements in this release include better performance, simplified development dependencies, and a simplified identification schema for grains, the units of execution in Orleans.

Orleans started in 2010 as a project inside Microsoft Research around virtual actors, an abstraction over the actor model of computation. It was then used as the technology of choice for building the Microsoft Azure back-end for the popular game franchise Halo. The core technology behind Orleans was transferred to 343 Industries, a Microsoft Xbox Game Studios subsidiary, and it was made available as an open-source project on GitHub in 2015. Another actor-based programming framework, comparable to Orleans, is Akka.

In Orleans, the desired distributed functionality is modelled as a grain, an addressable unit of execution that can send and receive messages to other grains and maintain its own state if necessary. The grains are virtual actors, persisted to durable storage and activated in memory on demand, in the same sense as virtual memory is an abstraction over a computer’s physical memory.

The grains had to inherit from the Grain base class in the previous versions of Orleans. Now the grains can be POCO objects. To get access to the code previously available only inside the Grain class, they can now implement the IGrainBase interface instead.

The Orleans runtime takes care of activating/deactivating and finding/invoking grains as necessary. It also manages clusters of silos, the containers in which grains execute. Communication with the Orleans runtime is done using the client library.

The last Orleans major version before 7.0 was version 3.0, released in 2019. The planned 4.0 release was later ported to .NET 7 and renamed to 7.0 to match the broader .NET 7 ecosystem launches.

Version 7.0.0 claims significant performance benefits over version 3, between 40% and 140%, across different machine configurations and scenarios. The Orleans source code includes a benchmarking application that Microsoft used to measure those improvements.

The development experience is improved by reducing the number of NuGet packages that need to be referenced, leaving three major rolled-up packages: one for client projects, one for server projects, and one for the abstractions and SDK. These packages then reference the individual Orleans packages they need. For comparison, NuGet lists 77 Orleans packages at the moment.

Another major improvement in Orleans 7.0.0 is the simplified identification schema. In previous versions of Orleans, grains could have a compound grain key. For example, one grain could use the long data type as its key while another grain could use a Guid combined with a string. That made code that calls grains by their identifiers cumbersome. In 7.0.0, grains have an identity in the form of type/value, where both type and value are strings.

The new identification schema and the new, more version-tolerant serialisation mechanism are the reasons why Orleans 7.0.0 hosts won’t be able to coexist with Orleans 3.x hosts in the same cluster. Microsoft claims that the changes were needed to simplify and generalise some cumbersome aspects of Orleans, and that the version jump to .NET 7 was the perfect opportunity to make those tough choices. The official recommendation is to deploy a new cluster and gradually decommission the old one.

For a deeper understanding of what an Orleans application looks like, there is a sample text adventure game on GitHub that is designed for scale and showcases how to model a game in Orleans and how to connect an external client (a game client) to the Orleans cluster. While most of the Orleans samples are modelled around games, Orleans can be used for various distributed computation problems, from managing an IoT network of connected water heaters to removing the database as a bottleneck in calculations.




How Defining Agile Results and Behaviors Can Enable Behavioral Change

MMS Founder
MMS Ben Linders

Article originally posted on InfoQ. Visit InfoQ

Specifying and measuring behavior within a certain organisational context can enable and drive behavioral change. To increase the success of an agile transformation, it helps if you link the desired behaviors to the expected results. This way you set yourself up to be able to reinforce the behavior you want to see more of in order to reach your results.

Evelyn van Kelle and Chris Baron spoke about behavior change for adopting agile at Better Ways 2022.

If you want to change something, whether it’s a process or a new way of working, it usually means that someone, or a group of people, needs to do something different from what they were used to, Baron explained:

In an agile transformation people need to act differently, so new behavior (and therefore behavioral change) is mandatory.

Specifying and measuring behavior is really important, and it’s very hard to pinpoint and get right, van Kelle mentioned:

Specifying behavior seems easy but it’s actually really hard. “You could be more supportive/transparent/open” is a typical example of something people classify as behavior, but it isn’t. Can you show me what being more supportive looks like? Easily said: if something doesn’t pass this “show me” test, it’s not behavior.

Baron suggested starting with specifying your results within a certain context. After this, you can specify behaviors that will get you these results:

In an agile transformation, I often see that behavioral change, change of culture, or having a different mindset is the focus, but if you cannot link it to results, you can’t measure it. So how would you know if your agile transformation is successful? Behavioral change is much more about measuring than people might think.

It’s crucial to start by defining results, then behaviors, and to do that within a specific context, van Kelle concluded.

InfoQ interviewed Evelyn van Kelle and Chris Baron about behavioral change.

InfoQ: What is it that makes human behavior important in agile transformations?

Evelyn van Kelle: Agile transformations are super complex. Expectations are high, and they usually require a change in behavior from people. A new agile mindset is needed, leaders are expected to act as servant leaders, and people are supposed to “live and breathe” the new (Agile) values. After a while, resistance becomes a recurring topic in many conversations: people are not engaged, committed or motivated enough.

What do you see people do that you classify as resistance? What behaviors does that agile mindset entail? We usually have a gut feeling, but specifying what the behavior really looks like is hard. If we want transformations to be successful, we have to stop oversimplifying these topics and unravel their complexity. That starts with specifying behavior properly.

InfoQ: How would you define results and behavior, and how are they related?

Chris Baron: Identifying what context you’re in and what results you are looking for in that context is a great start. You then have to ask yourself: which behavior(s) will get me these results, or improve them? Having an overview of the context, the results and the behaviors provides the insight you need to decide whether behavioral change is needed and, if so, which behaviors need to change.

As an example, you can start identifying the organisational context of a Scrum team by asking yourself a couple of questions:

  • Is this Scrum team able to deliver value to the customer on its own, or does it have dependencies on other teams?
  • Who are their stakeholders?
  • What is their way of working?

Once you have covered that, you can start identifying the results you want to achieve within that context; maybe you want to deliver faster to the customer. But how much faster, in what timeframe, and what exactly is it that the team delivers?

Now that you have made clear what you want to achieve, you can focus on how to achieve it, and on which behaviors will help you deliver faster. Maybe you want people to deploy code as early as possible instead of waiting until the sprint is finished. This might mean that behavioral change is needed in order to deliver faster.

Van Kelle: We often see that a lot of work has been put into new company values, from defining them to communicating them via all-hands sessions and beautiful posters on the walls. And then nothing changes…

There are several relevant explanations here, but one of them is that these values rarely define what behavior is desired: boldness, honesty, integrity and ownership, to name just a few, are all equally promising and important, but they don’t say anything about behavior, nor do they relate to a specific result. And they probably imply different behaviors in different contexts.
