Podcast: Trends in Engineering Leadership: Observability, Agile Backlash, and Building Autonomous Teams
Transcript
Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today I’m sitting down across many miles with Chris Cooney. Chris, welcome. Thanks for taking the time to talk to us today.
Introductions [01:03]
Chris Cooney: Thank you very much, Shane. I’m very excited to be here, and indeed across many miles. It’s not quite the antipodes, but it’s very, very close; an island off New Zealand is the antipodes of the UK, so we’re about as far away from each other as it gets. The wonders of the internet, I suppose.
Shane Hastie: Pretty much so, and I think the time offset is 13 hours today. My normal starting point is who is Chris?
Chris Cooney: That’s usually the question. So hello, I’m Chris. I’m the Head of Developer Relations for a company called Coralogix. Coralogix is a full-stack observability platform that processes data in-stream, without indexing. We’re based in several different countries; I’m based in the UK, as you can probably tell from my accent. I’ve spent the past 11, almost 12 years as a software engineer. I started out as a Java engineer straight out of university, quickly got into front-end engineering, didn’t like that very much, and moved into SRE and DevOps, and that’s really where I started to enjoy myself. Over the past several years I’ve moved into engineering leadership and got to see organizations grow and change, and how certain decisions affect people and teams.
And now, more recently, as the Head of Developer Relations for Coralogix, I really enjoy going out to conferences and meeting people, but I also get a lot of research time to find out what happens to companies when they employ observability. I get to understand the trends in the market in a way I never could as a software engineer, because I meet hundreds and hundreds of people every month, and they all give me their views and insights. I get to collect all of those together, and that’s what makes me very excited to talk on this podcast today about the various topics going on in the industry.
Shane Hastie: So let’s dig in. What are some of those trends? What are some of the things you’re seeing in your conversations with engineering organizations?
The backlash against “Agile” [02:49]
Chris Cooney: Yes. When I started out, admittedly 11, 12 years ago, which is a while but not that long ago really, the first company I worked in had an Agile consultant come in. They explained to me the principles of agility and gave me the rundown of how it all works, how it should work, and how it shouldn’t. We were all very skeptical, and over the years I’ve watched agility become this massive thing. I’ve sat in boardrooms with very senior executives in very large companies listening to Agile Manifesto ideas and things like that, and it’s been really interesting to see that gel in. Now we’re seeing this reverse trend of people almost emotionally pushing back, not necessarily against the core tenets of Agile, but against the word. We’ve heard it so many times, there’s a certain amount of fatigue around it. That’s one trend.
The value of observability [03:40]
The other trend I’m seeing, on the technical side, is this move around observability. Obviously, I spend most of my time talking about observability now. It used to be this thing you had to have for when things went wrong, or to stop things from going wrong. Now there’s a big trend of organizations asking questions that have less to do with what’s going wrong and are much broader: “Where are we as a company? How many dev hours did we put into this thing? How does that factor into reducing mean time to recovery?” They’re much broader questions now, blending business measures, technical measures, and lots more people measures.
I’ll give you a great example. Measuring rage clicks on an interface is a thing now: measuring the emotion with which somebody clicks a button. I think it’s a nice microcosm of what’s going on in the industry. Our measurements are getting much more abstract, and what that’s doing to people and to engineering teams is fascinating. So there’s lots and lots going on.
And then, obviously, there are the technical trends around AI and ML, what they’re doing to people, the uncertainty around that, and also the excitement. It’s a pretty interesting time.
Shane Hastie: So let’s dig into one of those areas in terms of the people measurements. So what can we measure about people through building observability into our software?
The evolution of what can be observed [04:59]
Chris Cooney: That’s a really interesting topic. I think it’s better to contextualize it. We started out with basically CPU, memory, disk, and network, the big four. Then we started to get a bit clever and looked at things like latency, response sizes, and data exchanged with a server. Then, as we built up, we started to pull in some marketing metrics: bounce rates, how long somebody stays on a page, that kind of thing.
Now we’re looking at the next tier, the next level of abstraction up, which is more like: did the user have a good experience on the website, and what does that mean? You see web vitals starting to break into this area, things like: when was the meaningful moment that a user saw the content they wanted to see? Not first paint, not first load, not the loading template. The user went to this page wanting to see a product page. How long was it before they saw all the meaningful information they needed, not just how long the page took to load? And that’s an amalgamation of lots and lots of different signals and metrics.
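To make that concrete, here is a minimal sketch of how a front end might capture those web-vitals signals using the open-source web-vitals library. The reporting endpoint and payload shape are illustrative assumptions, not any particular product’s API.

```typescript
// A minimal sketch of capturing "meaningful moment" signals in the browser
// using the open-source web-vitals library. The endpoint is hypothetical.
import { onLCP, onCLS, onINP, type Metric } from 'web-vitals';

function report(metric: Metric): void {
  const body = JSON.stringify({
    name: metric.name,     // e.g. "LCP"
    value: metric.value,   // milliseconds for LCP/INP, unitless for CLS
    rating: metric.rating, // "good" | "needs-improvement" | "poor"
  });
  // sendBeacon survives page unloads, so late-arriving metrics still get sent.
  navigator.sendBeacon('/telemetry/web-vitals', body); // hypothetical endpoint
}

onLCP(report); // Largest Contentful Paint: when the meaningful content appeared
onCLS(report); // Cumulative Layout Shift: visual stability while loading
onINP(report); // Interaction to Next Paint: responsiveness to user input
```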
I’ve been talking recently about the distinction between a signal and an insight. In my taxonomy, the way I usually slice it, a signal is a very specific technical measurement of something: latency, page load time, bytes exchanged, that kind of thing. An insight is an amalgamation of lots of different signals to produce one useful thing, and my litmus test for an insight is that you can take it to your non-technical boss and they will understand it. They will understand what you’re talking about. I can say to my non-technical boss, “My insight is this user had a really bad experience loading the product page. It took five seconds for the product to appear, and they couldn’t buy the thing; they couldn’t work out where to do it”. That would be a combination of various different measures: where they clicked on the page, how long the initial HTML request took, how fast the network connection to their machine was, and so on.
So that’s what I’m talking about with people-experience metrics. And there’s a new level now, which is directly answering business questions. It’s almost like we’ve built scaffolding up over the years, deeply technical. Someone would ask, “Did that person have a good experience?” And we’d say, “Well, the page latency was this, and the HTTP response was 200, which is good, but the page load time was really slow”. Now we just say yes or no, because of X, Y and Z. That’s where we’re going, I think. This is the trend of observability moving into the business space: taking much broader, more encompassing measurements at a much higher level of abstraction. That’s what I meant by more people metrics, as a general term.
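Here is a sketch of that signal-versus-insight taxonomy in code: several raw technical signals amalgamated into one statement a non-technical boss could understand. The thresholds and field names are illustrative assumptions, not a prescribed model.

```typescript
// Several raw signals combined into one human-readable insight.
// Thresholds and field names are illustrative assumptions.
interface PageSignals {
  lcpMs: number;      // Largest Contentful Paint, milliseconds
  rageClicks: number; // rapid repeated clicks on the same element
  httpStatus: number; // status code of the main document request
  purchased: boolean; // did the session end in a purchase?
}

function toInsight(s: PageSignals): string {
  if (s.httpStatus >= 400) {
    return 'The user could not load the product page at all.';
  }
  if (s.lcpMs > 2500 && !s.purchased) { // 2.5 s is the common "good LCP" ceiling
    return `The user had a bad experience: the product took ` +
           `${(s.lcpMs / 1000).toFixed(1)}s to appear and they did not buy.`;
  }
  if (s.rageClicks > 3) {
    return 'The user seems frustrated: rage clicks suggest they could not find what they wanted.';
  }
  return 'The user had a good experience on this page.';
}

// Example: latency, click, and purchase signals become one insight.
console.log(toInsight({ lcpMs: 5000, rageClicks: 1, httpStatus: 200, purchased: false }));
```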
Shane Hastie: So what happens when an organization embraces this? When not just the technical team, but the product teams, when the whole organization is looking at this and using this to perhaps make decisions about what they should be building?
Making sense of observations [07:47]
Chris Cooney: Yes. There are two things here, in my opinion. One is the technical barrier, which is making the information literally available in some way by putting a query engine in front of it. Putting Kibana in front of OpenSearch is the most common example; putting a SQL query engine in front of your database is another. It’s a way to query your data. Just doing that is the technical hurdle, and it is not easy, by the way. At a certain level of scale, making high-performance queries work for hundreds, potentially thousands, of concurrent users is really hard.
Let’s assume that’s out of the way and the organization has worked that out. The next challenge is: how do we make it so that users can get the questions they need answered, answered quickly, without specialist knowledge? We’re not there yet. Obviously AI makes some very big promises about natural language query. It’s something we’ve built into the Coralogix platform ourselves, and it works really, really well. I think what we have to do now is work out how to make it as easy as possible to get access to that information.
Let’s assume all those barriers are out of the way and an organization has achieved that. I saw something similar when I was a Principal Engineer at Sainsbury’s, when we started to surface data. It’s an adjacent example, but still relevant: the introduction of SLOs and SLIs into the teams. Before, if I went to one team and asked, “How has your operational success been this month?” they would say, “Well, we’ve had a million requests and we serviced them all in under 200 milliseconds”. Okay. I don’t know what that means. Is 200 milliseconds good? Is it terrible? We’d go to another team and they’d say, “Well, our error rate is down to 0.5%”. Brilliant. But last month it was 1%, and the month before that it was 0.1%, or something.
When we introduced SLOs and SLIs into teams, we could see across all of them, “Hey, you breached your error budget. You have not breached your error budget”. And suddenly, there was a universal language around operational performance. And the same thing happens when you surface the data. You create a universal language around cross-cutting insights across different people.
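As a small sketch of the arithmetic behind that universal language, here is an error-budget calculation; the SLO targets, names, and numbers are illustrative assumptions, not Sainsbury’s actual setup.

```typescript
// Error-budget arithmetic: teams with very different raw numbers
// become directly comparable. All values are illustrative.
interface SloWindow {
  target: number; // e.g. 0.999 means 99.9% of requests must succeed
  total: number;  // requests observed in the window
  failed: number; // requests that breached the SLI
}

function errorBudget(w: SloWindow): { breached: boolean; burnedPct: number } {
  const allowedFailures = w.total * (1 - w.target); // the error budget
  return {
    breached: w.failed > allowedFailures,
    burnedPct: Math.round((w.failed / allowedFailures) * 100),
  };
}

console.log(errorBudget({ target: 0.999, total: 1_000_000, failed: 600 }));
// => { breached: false, burnedPct: 60 }
console.log(errorBudget({ target: 0.99, total: 50_000, failed: 700 }));
// => { breached: true, burnedPct: 140 }
```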
Now, what does that do to people? Well, one, it shines spotlights in places where some people may not want them shined. That’s what a universal language does. It’s not enough just to have the data; you have to have effective access to it and effective ownership of it. Doing that surfaces conversations that can initially be quite painful. In sufficiently large organizations there are lots of people who have been getting by by flying under the radar, and it makes that quite challenging.
The other thing it does: some people feel very vulnerable, because the measures feel like KPIs. They’re not. We weren’t measuring personal performance on whether a team missed its error budget; nobody would get fired. We’d sit down and go, “Hey, you missed your error budget. What can we do here? What’s wrong? What are the barriers?” But it made some people very nervous and uncomfortable, and they didn’t like it. Other people thrived and loved it. It became a target: “How much can we beat our budget by this month? How low can we get it?”
Metrics create behaviors [10:53]
So the big thing I would say is the sweeping change in behavior. It’s that famous phrase, “Build me a metric and I’ll show you a behavior”. Human behavior is what they call a type-two chaotic system: by measuring it, you change it, and it’s chaotic in the first place. So as soon as you introduce those metrics, you have to be extremely cognizant of what happens to dynamics between teams and within teams. Teams become competitive. Teams look at other teams and wonder, “How the hell are they doing that? How is their error budget so low? What’s going on?” Other teams, in an effort to artificially improve their metrics, will start to lower their deployment frequency and scrutinize every single thing, so while their operational metrics look amazing, their delivery is actually getting worse. All these various things go on. That competitiveness, driven by uncertainty and vulnerability, is a big thing that happens across teams.
The other thing I found is that the really great leaders love it. In fact, all leadership loves it; all leadership loves higher visibility. The great leaders see that higher visibility and go, “Amazing. Now I can help. Now I can actually get involved in some of these conversations that would’ve been challenging before”.
The slightly more, let’s say, worrying leaders will see it as a rod with which to beat the engineers, and that is something you have to be extremely careful of. Surfacing metrics and being very forthright about the truth is great, and it’s probably the best way to be. But the consequence is that a lot of people can be treated not very well if you have the wrong type of leadership in place, leadership who see these measurements as a way of forcing different behaviors.
And so, it all has to be done in good faith, on the premise that everybody is doing their best. If you don’t start from that premise, it doesn’t matter how good your measurements are; you’re going to be in trouble. Those are the learnings I took from rolling it out and some of the things I saw across an organization. It was largely very positive, though. It just took some growing pains to get through.
Shane Hastie: So digging into the psychological safety that we’ve heard about and known about for a couple of decades now.
Chris Cooney: Yes. Yes.
Shane Hastie: We’re not getting it right.
Enabling psychological safety is still a challenge [12:59]
Chris Cooney: No, no. My first experience came from reading about things like Google’s Project Aristotle. And my first attempt at educating an organization on psychological safety was at a company with an extremely long, extremely detailed incident-management review. If something went wrong, they would hold a deep review of everything, we’re talking 200 people, sometimes several days; on the low end it was five or six hours. Everyone bickers and argues and points fingers at each other. An enormous document is produced, it’s filed away, and nobody ever looks at it again, because who wants to read those things? It’s just a historical text about bickering between teams.
So I said, “Well, why don’t we trial more of a blameless post-mortem method? Let’s just give it a go and see what happens”. The first time I did it, the meeting went from, they told me, about six hours for the previous one, to about 45 minutes. I started the meeting with a five-minute briefing on why this post-mortem has to be blameless: the aviation industry, and the learnings that came from it, that if you hide mistakes, they only get worse, so we have to create an environment where it’s okay to surface mistakes. Just that five-minute primer, then about a 40-minute conversation. And we produced a document that was more thorough, more detailed, more fact-based, and more honest than any incident review I had ever read before.
So rolling that out across organizations was really, really fun. But then I saw it go the other way, where people would say, “Well, it’s psychologically safe”, and it turned into this almost hippie love-in where nobody’s done anything wrong and there’s no such thing as a mistake. No, that’s not the point. The point is that we all make mistakes, not that they don’t exist. We don’t point blame in a malicious way, but we can attribute a mistake to somebody. You just can’t do it by… And the language in some of these post-mortem documents I was reading was so indirect: “The system, post a software change, began to fail, blah, blah, blah”. They’re desperately trying not to name anybody, or name any team, or say that an action occurred. It was almost as if the system was just running along and the vibrations from the universe knocked it out of whack.
And actually, when you got into it, one of the teams pushed a code change. So say that: “Team A pushed a code change. Five minutes later there was a memory leak issue that caused this outage”. That’s not blaming anybody; that’s just stating the facts in a causal way.
So the thing I learned is that whenever you’re teaching blameless post-mortems and psychological safety, it’s crucial that you don’t lose the relationship between cause and effect. You have to show cause A, effect B; cause B, effect C; and so on. Everything has to be linked that way, in my opinion, because it forces people to say, “Well, yes, we did push this code change, and yes, it looks like it did cause this”.
That’s where I think most organizations get tripped up: they go all in on psychological safety. “Cool, we’re going to do everything psychologically safe. Everyone’s going to love it”. They throw the baby out with the bathwater, and they miss the point, which is to get to the bottom of an issue quickly, not to avoid hurting anybody’s feelings. That’s a mistake people often make, I think, especially in large organizations.
Shane Hastie: Circling back around to one of the comments you made earlier on. The agile backlash, what’s going on there?
Exploring the agile backlash [16:25]
Chris Cooney: I often try to talk about larger trends rather than my own experience, purely because anecdotal experience is only useful as an anecdote. So this is an anecdote, but I think it’s a good indication of what’s going on more broadly. When I was starting out, I was a mid-level Java engineer, and this was when agility was really starting to take hold in some of these larger companies and they were starting to understand the value of it. We were all in on the Agile principles; we were regularly reading the Agile Manifesto.
We had a coach called Simon Burchill who was, and is, absolutely fantastic: he completely, deeply understands the methodology and the point of agility without getting lost in the miasma of frameworks and planning poker cards and all the rest of it. He was wonderful at it, and I was very fortunate to study under him, because it gave me a really good, almost pure perspective on Agile before all the other stuff started to come in.
So what happened was we were delivering work, and if we went even a week over budget or a week overdue, the organization would say, “Well, isn’t Agile supposed to speed things up?” And it’s like, “Well, no, not really. It’s more that we had a working product six weeks ago, eight weeks ago, and you chose not to go live with it”. Which is fine, but that’s what you get with an agile process: working software much earlier, which gives you the opportunity to go live if you get creative about how you productionize it or turn it into a product.
So that’s the first thing. One of the seeds of the backlash is a fundamental misunderstanding of what Agile is supposed to do for you. It’s not to get things done faster; it’s to incrementally deliver working software so you have a feedback loop and a constant conversation, and an empirical learning cycle is occurring. You’re constantly improving the software, not building everything, testing everything, deploying it, and finding out it’s wrong. That’s one.
The other thing I’ll say is something I see a lot now on Twitter, or X as they call it these days: the “Agile Industrial Complex”, a phrase that gets batted around a lot, which essentially means organizations just selling Scrum certifications or various other things that don’t hold much value. That’s not to say all Scrum certifications are useless. I did one years and years ago, I forget the name of the chap now, and it was fantastic. He gave really great insight into Scrum: why it’s useful, why it’s great, times when it may be painful, times when some of its practices can be dropped, the freedom you’ve got within the Scrum Guide.
One thing he said always stuck with me, and it’s a good example of an insight that came from an Agile certification: “It’s a Scrum guide, not the Scrum Bible. The whole point is to give you an idea. You’re on a journey, and the guide is there to help you along that journey. It is not there to be read like a holy text”. I loved that insight. It really stuck with me, and it definitely informed how I applied those principles later on. So there is a backlash against those kinds of Agile certifications because, as with almost any service, a lot of it’s good and a lot of it’s bad. And the bad ones are pretty bad.
And then, the third thing I will say is that an enormous amount of power was given to Agile coaches early on. They were almost like the high priests and they were sort of put into very, very senior positions in an organization. And like I said, there are some great Agile coaches. I’ve had the absolute privilege of working with some, and there were some really bad ones, as there are great software engineers and bad software engineers, great leaders and poor leaders and so on.
The problem is that those coaches were advising very powerful people in organizations. And if you’re giving bad advice to very powerful people, the impact of that advice is enormous. We know how to deal with a bad software engineering team. We know how to deal with somebody that doesn’t want to write tests. As a software function, we get that. We understand how to work around that and solve that problem. Sometimes it’s interpersonal, sometimes it’s technical, whatever it is, we know how to fix it.
We have not yet figured out this grand vizier problem: somebody giving advice to the king who doesn’t really understand what they’re talking about, while the king just takes them at their word. That’s what happened with Agile, and I think one of the worst things we could have done was start taking people at their word as if they were these experts in Agile. It’s ultimately software delivery. That’s what we’re trying to do: deliver working software. And if you’re going to give advice, you’d better deeply understand delivery of working software before you go on about interpersonal things and that kind of stuff.
So those are the three things I think have driven the backlash. And now there’s just this fatigue around the word Agile. Like I say, I have the benefit of going to conferences, and when I first started talking, the word Agile was everywhere. You couldn’t find a conference where it wasn’t. Now it’s less and less prevalent, and people talk more about things like continuous delivery, just to avoid saying the word Agile. The fatigue is more around the word than around the principles.
And the last thing I’ll say is that there is no backlash against the principles. The principles are here to stay; it’s just software engineering now. What would’ve been called Agile 10 years ago is simply how we build working software today. It’s so deeply ingrained in how we think that we believe we’re seeing a backlash against Agile. We’re not. We’re seeing a backlash against a word. The core principles are part of software engineering now, and they’re here to stay for a very long time, I suspect.
Shane Hastie: How do we get teams aligned around a common goal and give them the autonomy that we know is necessary for motivation?
Make it easy to make good decisions [21:53]
Chris Cooney: Yes. I’ve just submitted a conference talk on this, and I won’t say too much, at the risk of jeopardizing our submission, but the broad idea is this. Let’s say I’m in a position where I have 20-something teams, and the wider organization is hundreds of teams. And we had a big problem, which was that every single team had been raised on this idea of, “You pick your tools, you run with it. You want to use AWS, you want to use GCP, you want to use Azure? Whatever you want to use”.
Then after a while, obviously, the bills started to roll in and we saw that this was actually a rather expensive way to run an organization. So we thought, “Can we consolidate?” We said yes, and a working group went off, picked a tool, bought it, went to the teams and said, “Thou shalt use this”, and nobody listened. So we went back to the drawing board and asked, “Well, how do we do this?” And I said, “This tool was never picked by them. They don’t understand it, they don’t get it, and they’re stacking migration to this tool up against all the deliverables they’re responsible for”. So how do you make it so that teams have the freedom and autonomy to make effective, meaningful decisions about their software, but in a way that there’s a golden path in place such that they’re all roughly moving in the same direction?
What we started to build out within Sainsbury’s was a project to completely re-platform the entire organization. It’s still going on now, and hundreds and hundreds of developers have been migrating onto this platform. It started with a team I was part of. I was in Manchester in the UK, and we originally called it the Manchester PaaS, Platform as a Service. I don’t know if you know this, but the bumblebee is one of the symbols of Manchester, so it had a little bumblebee in the UI. It was great. We loved it. We built it using Kubernetes, and we used Jenkins for CI/CD, purely because Jenkins was big in the office at the time. It isn’t anymore; now it’s GitHub Actions.
And what we said was, “Every team in Manchester, every single resource, has to be tagged so we know who owns what. Every single time there’s a deployment, we need some way of seeing what it was and what went into it”. And some periods of the year are extremely busy and extremely serious, when you have to file additional change notifications in different systems. For a grocer like Sainsbury’s, an enormous amount of trade happens between, let’s say, November and January. During that period every single team has to raise additional change requests, but they’re doing 30, 40 commits a day, so they can’t be expected to fill in those forms every single time. So we wondered whether we could automate that for them.
And what I realized was: this platform is going to make the horrible stuff easy, and it’s going to make it almost invisible. Not completely invisible, because teams still have to know what’s going on, but almost. By making the horrible stuff easy, we incentivize them to use the platform the way it’s intended. So we did that, we onboarded everybody in a couple of weeks, and it took no push whatsoever.
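As an illustration of that kind of automation, here is a sketch of a deployment hook that files the peak-trading change request automatically. The ticketing endpoint, fields, and freeze window are hypothetical; the point is that the platform, not the team, fills in the form.

```typescript
// A sketch of automating change-request paperwork on each deployment
// during the busy trading period. All names and endpoints are hypothetical.
interface Deployment {
  team: string;
  service: string;
  commitSha: string;
  timestamp: Date;
}

function inPeakTrading(d: Date): boolean {
  const m = d.getMonth(); // 0-indexed: 10 = November, 11 = December, 0 = January
  return m === 10 || m === 11 || m === 0;
}

async function onDeploy(dep: Deployment): Promise<void> {
  if (!inPeakTrading(dep.timestamp)) return; // no extra paperwork off-peak

  // File the change request the team would otherwise fill in by hand,
  // 30-40 times a day during the busy period.
  await fetch('https://change.example.internal/api/requests', { // hypothetical
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      owner: dep.team,
      summary: `Automated change record for ${dep.service} @ ${dep.commitSha}`,
      risk: 'standard',
    }),
  });
}
```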
We had product owners coming to us about one team that had just started; the goal of their very first sprint was to have a working API and a working UI. Just by using our platform, the team got a huge amount for free, because we’d made a lot of this stuff easy: dashboard generation, alert generation, metric generation. Because we were using Kubernetes and Istio, we got a ton of HTTP service metrics off the bat, and tracing was built in.
So in their sprint review at the end of the two weeks, they’d built the feature. Cool. “Oh, by the way, we’ve done all of this”, and there was an enormous number of dashboards and things like that. “Oh, and by the way, the infrastructure is completely scalable, with multi-AZ failover. There’s no productionizing left; it’s already production-ready”. The plan had been to go live in months. They went live in weeks. It changed the conversation, and that was when things really started to take off, which led to the current project that spans the entire organization.
The reason I told that story is that you have to have give and take. If you try to do it as a top-down edict, your best people will leave and your worst people will try to work through it. The best people want to be able to make decisions and have autonomy; they want a sense of ownership of what they’re building. “Skin in the game” is the phrase that often gets bandied around.
And so, how do you give engineers autonomy? You build a platform, you make it highly configurable and highly self-service. You automate all the painful bits of the organization, for example compliance, change-request notifications, and data retention policies. You automate it to the hilt so that all a team has to do is declare some config in a repository and it just happens for them. Then you make the golden path, the right path, the easy path. That’s it. That’s the end of the conversation. If you can deliver that, you’re in a great space.
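As a sketch of what “declare some config in a repository” might look like, here is a hypothetical team-level declaration; the schema and field names are invented for illustration, not the actual Sainsbury’s platform.

```typescript
// A hypothetical golden-path declaration committed to a team's repository.
// The platform pipeline reads this and generates ownership tagging,
// dashboards, alerts, and retention enforcement automatically.
interface TeamPlatformConfig {
  team: string;            // ownership tag applied to every resource
  service: string;
  alerting: {
    errorRatePct: number;  // alert when the error rate exceeds this
    latencyP99Ms: number;  // alert when p99 latency exceeds this
  };
  retentionDays: number;   // data retention policy, enforced by the platform
}

const config: TeamPlatformConfig = {
  team: 'groceries-checkout',
  service: 'basket-api',
  alerting: { errorRatePct: 1, latencyP99Ms: 500 },
  retentionDays: 30,
};

export default config; // the platform does the rest; the easy path is the right path
```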
If you try to do it as a top-down edict, you will feel a lot of pain and your best people will probably leave. If you do it as a collaborative effort, so that everybody’s on the same golden path and every time they make a decision the easy decision is the right one, and it’s hard work to go against the right decision, then you’ll incentivize the right behavior. And if you make the painful parts of people’s lives easy, you’ve got the carrot and the stick, and you’re in a good place. That’s how I like to do it: incentivize the behavior and let them choose.
Shane Hastie: Thank you so much. There’s some great stuff there, a lot of really insightful ideas. If people want to continue the conversation, where do they find you?
Chris Cooney: If you open up LinkedIn and type Chris Cooney, I’ve been reliably told that I’m the second person in the list. I’m working hard for number one; we’ll get there. If I don’t come up, search Chris Cooney Coralogix, or Chris Cooney observability, anything like that, and I will come up. I’m more than happy to answer any questions. LinkedIn is usually where I’m most active, especially for work-related topics.
Shane Hastie: Cool. Chris, thank you so much.
Chris Cooney: My pleasure. Thank you very much for having me.