Sarah Wells, Courtney Kissler, Ann Lewis, Nick Caldwell
Article originally posted on InfoQ.
Transcript
Wells: I’m Sarah Wells. I’m a Technical Director at the Financial Times business newspaper. I’ve been there for nearly 11 years. I lead a group called engineering enablement, which is about seven different engineering teams that are building things for other engineers within the FT. Our customers are other engineers, and we are focused on trying to make sure that they can build products as quickly and easily as possible.
Caldwell: I’m VP of Engineering at Twitter, leading the consumer organization. Consumer, that’s just a fancy way to say the Twitter app, the website, and all the products that keep all the above safe. Previously was VPE at Reddit, and Chief Product Officer at Looker.
Lewis: My name is Ann Lewis. I am the Senior Advisor for technology in the Biden-Harris administration, currently embedded in the Small Business Administration government agency. I was previously the Chief Technology Officer of MoveOn.
Kissler: Courtney Kissler. I’m CTO at Zulily, an online retailer. Prior to that worked at Nike, Starbucks, and Nordstrom.
Measuring Success from Improving Flow
Shoup: The title is Engineering Leadership Lessons for Improving Flow. I thought one way to start is, how do we know when we’re successful? When we improve flow, how do we measure that we’re successful.
Wells: Watching your talk, I realized that you’re really focused on the delivery. The Accelerate metrics, the DORA metrics are really relevant there. What I only just realized is that actually, at the FT, we got to the point where we were able to make small changes quite quickly a few years ago, and particularly for my part of the organization, the bit we’re most focused on is all the other stuff. If product engineering teams are having to spend a lot of time laboriously working out how to create a Lambda and connect something to something else, or to deploy a container into a Kubernetes platform or whatever, that also slows them down and stops them from making change. I’m not entirely sure where some of my metrics would be, which is why I’m interested to hear what other people think. Because once you improve some of your metrics, it’s very hard to prove that you should continue to invest, because the metric doesn’t change. Apart from asking developers, is your life better because of the things we’ve built, I’m not entirely sure where else to go. I think when you start, the Accelerate metrics are absolutely great. If you are not releasing code as you finish it, with it going to production in hours, you’ve got a lot of benefit to gain from that. What happens after that?
Kissler: Plus 1000 to the Accelerate metrics and the DORA metrics. The one that doesn’t get touched upon as much but I think is critical to continuously understanding the health of the entire system is employee Net Promoter Score. Understanding how your teams feel. Maybe this goes to what you were saying Sarah, like we’ve done some improvements, we’ve seen the metric get better. Do we know if it’s really making a difference to our teams? I think that can be a way to really understand and learn. Because sometimes the underlying technology is important for minimizing burden and improving those metrics. There might be other things like what I have uncovered are ways of working, or something else has changed in the system and it doesn’t always become visible through those DORA metrics unless you’re also adding in employee Net Promoter Score.
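To make the employee Net Promoter Score concrete, here is a minimal Python sketch. It assumes the conventional NPS scoring on a 0 to 10 survey scale (9 to 10 counts as a promoter, 0 to 6 as a detractor); the function name and the sample survey responses are hypothetical and are not taken from the panel.

```python
def employee_nps(scores):
    """Return eNPS: percentage of promoters minus percentage of detractors.

    Assumes the conventional 0-10 scale: 9-10 promoter, 7-8 passive,
    0-6 detractor. The result lies between -100 and 100.
    """
    if not scores:
        raise ValueError("no survey responses")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


# Hypothetical survey responses before and after a platform improvement.
before = [9, 7, 6, 10, 5, 8, 6, 9]
after = [9, 9, 7, 10, 6, 8, 8, 10]

print(employee_nps(before), employee_nps(after))
```

Tracking the score per team over time, alongside the DORA metrics, is one way to see whether an improvement in the delivery numbers is also felt by the people doing the work.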
Wells: It’s really interesting, because once you improve some things, people have different expectations. I was at the FT where it would take 20 minutes to build your code, and you went to production once a month. My newer colleagues at the FT are not happy when it takes them half an hour to release something. I think it is important to keep asking.
Caldwell: If you have a big enough company, or big enough team, they’ll have different metrics and different goals. Also, if you’re a platform team versus more of an edge team, it will be different. It’s all situational. There’s stuff I think that uniformly people care about when it comes to flow. If you need to invest in this, go ask your engineer about their release times, or how long it takes them to get a code review done. There are some real low hanging fruit that are measurable and can make a big impact on people’s lives. I think where it starts to get squishy, and I don’t know if anyone’s really solved this at scale, but when you start to aggregate the work of multiple developers, and then you’re trying to make measurements on team level output, it starts to get a little bit more squishy. We try and have tried to tackle this in lots of different ways. I think ultimately, when I’m thinking about flow, I’m thinking about for major, medium big rock items, what is the wall clock time from the inception of that idea to the delivery of the idea, maybe the first experiment that was run? Can we measure that? Then at the various stages in release, I go back and optimize it.
I find that sometimes people are getting tripped up on code. More often than not, people are getting tripped up on processes or maybe vestigial bureaucratic things that were meaningful at some point in the past, but no longer apply. You’re like trying to prune your process in order to increase your velocity. We don’t do that for every single project. I don’t know anyone who really has done that well, because it turns into like you’re clocking hours. You don’t want any engineer to clock hours, but you have to have some way to understand every phase of your release cycle and measure it. Then I just tie that to a few major projects and use them as exemplars for others.
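One way to picture the wall clock measurement described above is to record a timestamp at each milestone of a major project and look at the gaps between them. The Python sketch below is only an illustration: the milestone names and dates are hypothetical, and a real organization would pull them from its own planning and release tooling.

```python
from datetime import datetime


def phase_durations(milestones):
    """Given (name, timestamp) pairs in chronological order, return a list of
    (phase label, elapsed days) for each consecutive pair of milestones."""
    phases = []
    for (start_name, start), (end_name, end) in zip(milestones, milestones[1:]):
        days = (end - start).total_seconds() / 86400
        phases.append((f"{start_name} -> {end_name}", days))
    return phases


# Hypothetical milestones for one "big rock" project.
milestones = [
    ("idea", datetime(2024, 1, 2)),
    ("spec approved", datetime(2024, 1, 16)),
    ("code complete", datetime(2024, 2, 5)),
    ("security review done", datetime(2024, 3, 4)),
    ("first experiment", datetime(2024, 3, 8)),
]

phases = phase_durations(milestones)
total_days = sum(days for _, days in phases)
slowest = max(phases, key=lambda p: p[1])

for label, days in phases:
    print(f"{label}: {days:.0f} days")
print(f"total: {total_days:.0f} days, slowest phase: {slowest[0]}")
```

In this made-up example the longest gap is the review step rather than the coding step, which is the kind of process bottleneck the measurement is meant to surface.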
Lewis: My answer would definitely have been different a year ago. Learning more about government bureaucracies, I think, in this context, where a great deal of implementation work is vendored out, and then the in-house career staff at an agency spend a lot of time trying to manage these vendors. Typically, these are folks who are guardians of huge amounts of budget and not subject matter experts in technology. I think one metric that I’ve started tracking at a high level is, can we actually implement any late breaking policy change without program disruption? Typically, a vendor will set up some agile system and then spend inordinate amounts of time trying to teach people what the user story is. They give up after a while, then they try to proceed forward. There’s some communication feedback loops between the decision makers and the implementers, if they can establish enough shared language. There’s also a concept of flow for who, so the vendor will try and manage their own internal flow that government liaisons are constantly disrupting. If you can get into a sense of shared flow, like at all, there’s any communication loop that doesn’t feel disruptive when you try and use it, then that’s a win for giant bureaucracies.
How to Get Buy-In From Engineers That Want To Re-Architect the System
Shoup: How do you get buy-in from engineers that want to re-architect the system because they believe that will make it faster?
Kissler: This is where I think leveraging a value stream map can be really powerful. Because if you do that, and you’re able to see where all the problems are, because I’ve been in this scenario where engineers say, “Just let us engineer our way out of this. We’re not moving fast enough, I know the answer. If we re-architect or we automate all the things, magic is going to happen.” In reality, what you might learn, is there are different bottlenecks in the system. It’s not always needing to move to microservices as the answer. In the case where I’ve applied value stream mapping to what many might call legacy technology, although I say that with a little bit of sarcasm, because in some cases, things get branded as legacy and they’re really not, like a mainframe. Some businesses run on a mainframe, and they’re not legacy technology. They’re a primary technology. You can apply these techniques regardless.
What you’ll learn is sometimes that it’s process related, or there’s insufficient information to go fast. You learn things that are not always just re-architecture. I’ve had engineers think that microservices is the answer. Then they go through a value stream mapping workshop, and they go, I had no idea that the real bottleneck is we’re asking for this information from our customer, and they don’t have it. Why are we asking for it? Or there’s a delay in our testing cycle, and so we need to do something different in how we’re doing unit tests, code coverage, or something. It’s not about the architecture of the system.
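A value stream map can be reduced to a small calculation: for each step, how long is work actively being done and how long does it sit waiting. The Python sketch below assumes that simple model; the step names and hours are hypothetical and chosen to echo the examples above (waiting on customer information, a delay in the testing cycle).

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    process_hours: float  # time spent actively working on the item
    wait_hours: float     # time the item sits idle before or during the step


def flow_efficiency(steps):
    """Fraction of total lead time that is active work (0.0 to 1.0)."""
    total_process = sum(s.process_hours for s in steps)
    total_lead = sum(s.process_hours + s.wait_hours for s in steps)
    return total_process / total_lead


def biggest_delay(steps):
    """Return the step with the most waiting time."""
    return max(steps, key=lambda s: s.wait_hours)


# Hypothetical value stream for one change request.
steps = [
    Step("gather customer info", process_hours=2, wait_hours=70),
    Step("develop", process_hours=16, wait_hours=8),
    Step("test", process_hours=6, wait_hours=40),
    Step("deploy", process_hours=1, wait_hours=1),
]

print(f"flow efficiency: {flow_efficiency(steps):.0%}")
print(f"largest wait: {biggest_delay(steps).name}")
```

With numbers like these, development is a small share of the elapsed time, so a re-architecture that only speeds up development would barely move the total.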
Caldwell: Microservices probably is not the default answer you should go to. The spirit of it is getting your teams to operate independently. Microservices can be a way to enable that, but can also introduce lots of other challenges that you have to account for. Then to the broader question of getting engineers to buy in to major re-architecture. I think I had actually the opposite problem, which is, engineers are always proposing some major re-architecture or wanting to spend five quarters on tech debt, or things like that. Getting them to treat those sorts of ideas and projects with the same level of strategic deliberation that we put into a product proposal, is what I tell them to do. That usually works out really well. If you can get a team to sit down and say, it’s not just that moving off of our old web development framework is cool. Can you map to how that will enable better customer value, or increase velocity? Really think through what the strategic bet would be. Either they’ll dissuade themselves of the idea or they’ll convince everybody. I’ve seen that happen multiple times. When I was at both Microsoft and Looker, we had a complete rewrite of the frontend tech stack. Reddit, the same thing, a complete rewrite of the frontend tech stack that came in part from the Eng team just explaining the velocity benefits that we would get. Think about it strategically. Sometimes there are fundamentally new technologies available like GCP, or ML is one now. They do need to be considered as strategic bets, as opposed to addressing tech debt or re-architecture.
Lewis: It’s helpful to dig into, why does the team want to re-architect? What problem are they trying to solve? Also, help everyone understand how to share ownership of the costs associated with it and make sure the cost of doing so justifies the value of whatever problem they’re trying to solve. There are a bunch of antipatterns in there. Sometimes engineers want to re-architect a system because someone else wrote it, and they want to do it their way. That’s usually not a good investment. Sometimes they do have a maintenance or sustainability or architecture problem that is worth solving. Sometimes engineers like to do that, because it’s a way of dealing with uncertainty to try and control something they actually can control, which is how a system is architected. I think it’s only generally worth it to re-architect when you need to rebalance what I like to think of as your complexity budget. You have a budget of time in your engineering capacity, budget of money of things you can spend money on. Then also, there’s always going to be some maximum amount of complexity across code bases that teams can reliably maintain. It’s easy to just build until you can only support things like 20% as well as everyone would like to. Then it’s good to resize, and that’s often a good opportunity to clean up tech debt, try out new architectures, and also get everyone aligned on why you’re doing this.
Wells: I was an engineer. This is me saying something that’s true of me. If you’ve come across the Spotify Team Health Check, the idea is you do a traffic light on a bunch of different categories, one of the categories is the health of your code base. If I had a team that was reliably green on that, I would think we were actually in trouble, because developers are never happy with the health of their code base. They always want to improve it. We talk sometimes about making sure we’re not building something in CV++. Is it there because you want it on your resume? We’ve tried two different things for this, and they’re the opposite, actually. I like Nick’s suggestion of, can you write a proposal about why we should do this, bring it to a forum. We have a pretty lightweight tech governance group, if you can bring something there and explain it, then, yes, we can endorse it. The other thing we did was having 10% time where people could scratch whatever itch, and I was leading the team. Anytime someone said to me, we don’t like Docker, we should move to Rocket. We could say, fine, do a 10% day thing on it. You could do whatever you liked, you had to present the next day on what you found. Very often people would say, it wasn’t as exciting as I thought it was going to be.
Connecting Code Quality and Delivery Speed
Shoup: Do you have any way to connect the quality of code with the speed to deliver, or are we mostly relying on the DORA metrics for that?
Kissler: It’s absolutely connected. The thing that’s awesome about the DORA metrics is that you’re not compromising quality, if you’re looking at that balance point of view. If your lead time is not good, sometimes that’s an indicator that quality is like you’re doing rework, or you’re finding defects late. The other one is the percent change failure rate. If every time you’re deploying to production, you’re having to roll it back, or you are uncovering an issue once you go to production, that’s an indicator of quality. I believe that you can use those to understand quality in addition to flow of value, and speed.
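The two metrics mentioned here, lead time and change failure rate, can be computed from a simple log of deployments. The Python sketch below is a minimal illustration under stated assumptions: the `Deployment` record, its field names, and the sample data are hypothetical, and lead time is taken as commit-to-deploy time, summarized by the median.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool  # True if it was rolled back or caused a production issue


def median_lead_time_hours(deploys):
    """Median time from commit to production deploy, in hours."""
    return median(
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
    )


def change_failure_rate(deploys):
    """Fraction of deployments that failed in production."""
    return sum(1 for d in deploys if d.failed) / len(deploys)


# Hypothetical deployment log.
deploys = [
    Deployment(datetime(2024, 5, 1, 9), datetime(2024, 5, 1, 11), failed=False),
    Deployment(datetime(2024, 5, 2, 10), datetime(2024, 5, 2, 14), failed=False),
    Deployment(datetime(2024, 5, 3, 8), datetime(2024, 5, 4, 14), failed=True),
    Deployment(datetime(2024, 5, 6, 9), datetime(2024, 5, 6, 15), failed=False),
]

print(f"median lead time: {median_lead_time_hours(deploys):.1f} h")
print(f"change failure rate: {change_failure_rate(deploys):.0%}")
```

Reading the two numbers together is the point: a long lead time can hint at rework or late defect discovery, and a high failure rate shows quality problems surfacing in production.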
Wells: For me, the key thing is the small changes. If you’re doing lots of changes, those small changes, you can look at what actually changed and understand the difference between here’s a commit that’s going live and I can read it, versus here’s 4 weeks’ worth of work in one big release. When it goes wrong, we’re trying to work out which of those changes went wrong. It inevitably improves quality.
Lewis: Plus-plus to DORA metrics, but also overlaid with a sense of how decoupled the code bases for frontline systems are or need to be. It’s rolled up in some of the DORA metrics. One of the biggest ways to empower your teams to move faster is to make sure that their systems are decoupled in the right ways.
How Delivery Speed Is Affected by Changes in People and Process vs. Changes in Tech
Shoup: Based on your experiences, how much of the benefit of going faster is derived from changes relating to people and process and how much is derived from changes in technology?
Caldwell: Almost all of it is getting rid of red tape processes. There have been a few instances where it was tech. I’ll give you very concrete examples: they tend to be more generational changes, like the move to GCP and big data. If you’ve never deployed Kubernetes as part of your Ops system, everywhere I’ve used that has resulted in an immediate transformative step-up. Certain modern JavaScript frontend libraries, similar sorts of things. In general, if I’m going to debug a team that seems to be going slower, getting stuck in the mud, it’s things like, we’ve got a 50-page security compliance thing and there’s no easy way to checklist through it. The person whose job it used to be to steward people through that thing has left, and now they’re stuck, or things of that nature. It’s definitely awesome to have good tools and good technology, and occasionally you do get step function improvement here. If you’re talking about, what do I spend most of my time doing? It’s not that. It’s cutting red tape and helping people find a safe path through that maze of past decisions, so that they can try out a new course and create problems for some future person.
Wells: I tend to agree, it’s about the people and process, almost always, with the exception that there are certain foundational things about the technology you use. It’s not what technology. It’s, is it a decoupled architecture where people can make changes in one part of your system without affecting the other? The reason for things like that is we couldn’t do zero downtime deployments when I first worked at the Financial Times, which means you have to do deployments at a point where there is no news happening. Luckily, we’re a business newspaper, so we can do that on a Saturday quite often. That change to having an architecture where we can deploy to multiple instances sequentially, and we’re not doing big schema changes on a relational database that could take you out for several hours, that is a technology change. It does stop you from going fast. It’s probably easier to change than all of the stuff around people and process, and the fact that people think a change advisory board is essential, someone has to sign this off. The reporting, the fact that the Accelerate book says, change advisory boards don’t mean you have less failure, it just means it takes longer to fail. That stuff is really useful.
Lewis: I agree that it’s mostly people and process. Unfortunately, even though we want it to be the beautiful, shining quality of the code that we all write, it’s more about people being able to work within a system with enough shared language to be able to move quickly. I think in medium-sized organizations, there’s also an aspect of measuring onboarding time to new systems and trying to drive that down. Some of that is the system and some of that is the teams. Typically, teams that have established what Google calls psychological safety are faster at that for people and process reasons. Plus one to that. Government is a great example of how red tape intentionally slows processes down. If you were to try and measure that, measure the number of people who hold jobs where their only role or output they’re tracked against is controlling the throughput of other people for a bunch of reasons, some of which are good reasons. The way that bureaucracies work, sometimes there’s some strategic advantage to trying to slow some particular team and program down, however terrible that sounds. Most of the time, it makes everything slower, and it should be avoided. Everyone’s job should be to deliver on outcomes and not control other people.
Kissler: I talk about deploy on day one, which I think is related to your speed of onboarding, which sometimes will highlight technology investment required in order to achieve that. What I’ve seen that is a trap is just focus on getting a tool, figure out how to get CI/CD working, do automation, do pipeline automation, versus looking at what is it really taking for us to deploy? Where do we have opportunities? Then that typically shows, is it really a tech problem or not? If so, then we can focus on that, but leading with the people and process part, I think gets to better outcomes for sure.
Experimentation in Value Stream Mapping to Get Org Buy-In
Shoup: If an organization is hesitant to invest a lot into value stream mapping, is there a way to get the ball rolling with a smaller effort and demonstrate the goodness?
Caldwell: I think my approach is to always allow for experimentation of changes to process, tech, culture, anything you might imagine, if your company is large enough to support it. I assume that at any given moment, my culture is only 85% right, and to close the gap on the other 15%, I’d better have either a smart acquisition strategy, or just allow people within the company to experiment with different things. If they catch fire, I’ll pick them up and try and distribute them more broadly. You never want to fall into a state of being stagnant or just assuming that you’re doing such a good job you don’t need to improve. This applies to everything: management, technology, processes. All of it, you should try and disrupt yourself. You should set up situations whereby the smartest people within your organization who tend to be on the lowest levels of the org chart are put in positions of power to allow for that disruption.
Org Structure Change to Improve People and Process
Shoup: To improve the people and process, did you have to change the organizational structure, such as merging the product department with the engineering department?
Wells: Yes, you do have to change organizational structure. It’s the bingo square, Conway’s Law: you ship your organizational structure. That means that you have to look at where there is something, a boundary that makes the wrong split in the organization. It’s when your culture change needs to have that. Sometimes it’s big, so we moved away from having separate operations and development. You’re doing it everywhere, potentially. Sometimes it’s smaller. The easiest way to change culture is to change the structure of some part of your organization to do that. Recently, at the Financial Times, we moved some teams around, so that we ended up with my group being every team that focused in on engineers. Previously, some of the teams were in a group where there were some people building stuff for FT staff on the whole. You want that focus on a customer, it’s really helped us to know what we’re trying to do.
Lewis: Wholeheartedly agree with that, especially the Conway’s Law bit. I think when you’re changing structures, you’re trying to solve for not how to take these teams and smash them together, or pull them apart again if they’re complaining about the kinds of interactions they’re having with each other, but really to figure out how to create the right kinds of ownership. Sometimes you need product to sit closer to tech, because tech is building and prioritizing problems that affect tech and not necessarily your user base. Sometimes it’s helpful to split them apart, so that product can run ahead on user research that will be built into the next generation of your product that your tech team is not in place to be able to deliver on yet. Thinking through, how does ownership work right now? What’s working? What’s broken? What do you want to fix? I think that’s a good first step before thinking about org chart evolution.
Caldwell: I’m going to hit the Conway’s bingo square again. Conway’s Law, it tells you right up front, you will ship your org chart. You then need to think through, do I have the right org chart? You should be continually changing your org chart to match the strategy for whatever you’re trying to deliver to the world. Embrace Conway’s Law. You can have some agility and flexibility. If you can bake that into the culture of your team, that we’re going to be continually moving around to best shape ourselves to suit whatever our business need is or whatever strategy we’re trying to pursue. If you can get people comfortable with that, you end up with knowledge sharing, you end up with a more effective ability to deliver for your customers. Lots of good things happen if you’re able to build out that culture. I also acknowledge that it’s very hard to do that. People don’t like being moved around a lot.
Kissler: I think that one of the additional critical components of structure changes is also having a system to validate that the org structure changes are actually getting you a better outcome. Because I’ve been in scenarios where organizations just say, if we just reorganize, things will be better, and sometimes don’t take that real critical lens on what is broken. What I’ve found often is, what tends to be broken is lack of alignment on shared outcomes. Even if you change your org structure, you might still need that. Because often, even when you’re trying to create as much autonomy as possible and ownership, there’s often dependencies outside of that org structure in order to get work done. If you can get to alignment on shared outcomes, and the structure to stay aligned on those, I think then orgs should be fluid, and often need to be. In the absence of having the system that also creates the right ownership model, I think it can break down and then it just becomes whiplash.
How to Deal with Services and Repos without Ownership
Shoup: Jorge asked a question about, his company went through a bunch of reorgs in the last year. Sometimes the bottleneck is collaborating in repos and services owned by another team or without any owner, and other teams might not be able to help. Any special way we can deal with that? That’s a very real problem.
Lewis: I think it helps to make it explicit, which teams are supporting frontline products and which teams are supporting other teams? Because the team supporting other teams are the ones who tend to get overloaded most quickly, and then can’t help the n+1 request or have these orphan code bases. I’d do resource planning with them and capacity planning with them and try and establish some shared language about how do the frontline teams ask the service teams for requests, and what visibility is there into when one of those teams runs out of capacity. Because typically, when that happens it sometimes means more budget, and in government, everyone’s solution is we need more resources. Then that sometimes helps the problem. Sometimes it just reinforces the problem. Just giving everyone leverage of negotiation, and turning that into a shared problem can be helpful too. Because if 10 frontline teams are all asking the same scarce resource team for something that they can’t all get, maybe it means they need something other than the request that they’re asking for.
Wells: On the subject of ownership of stuff, I think you have to get agreement that everything should be owned, and should be owned by a team. We’ve got a central registry of all of our systems, and the system is linked to a team. It might not be that that team really knows it. If their name is against it, and they’re the ones that will get called if it breaks, it makes people think, yes, that could be a problem. It’s still difficult, but at least you have someone to talk to, and at least they have a sense of I have got these 20 systems, I ought to have some idea of how they work.
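The central registry idea, every system linked to an owning team, lends itself to a simple check for orphans. The Python sketch below is a hypothetical illustration, not the FT’s actual registry: the system names, team names, and the dictionary layout are all assumptions.

```python
from collections import Counter

# Hypothetical registry: system name -> owning team (None or "" means no owner).
registry = {
    "content-api": "content",
    "image-resizer": None,
    "login": "accounts",
    "paywall": "accounts",
    "legacy-feed": "",
}


def unowned_systems(registry):
    """Return the sorted names of systems with no owning team recorded."""
    return sorted(name for name, team in registry.items() if not team)


def systems_per_team(registry):
    """Count how many systems each team has its name against."""
    return Counter(team for team in registry.values() if team)


print("unowned:", unowned_systems(registry))
print("per team:", dict(systems_per_team(registry)))
```

Running a check like this regularly gives a list of systems to assign and shows which teams carry the most systems, which is a starting point for the ownership conversation.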
Caldwell: I don’t know how you build your backlog, or how you determine what you’re going to work on for the quarter. One thing I’ve seen work for these support teams that get a lot of dependencies is to break the dependencies into two classes, like one is things that should be converted into platform features, or reusable components. Those can take a longer time to land, because in the long run they’ll enable everyone to move quickly. Then things that are more directly like, unless we hit this dependency, we won’t be able to ship some new feature within that quarter. Then if you tease those two apart, you can set clear expectations on timelines. For a long running platform component, for example, you should never have an edge team take a dependency on that work, even if they might benefit from it in the long term, because it will slow everybody down. Knowing that it is a long running dependency, you can then say, here’s a mitigating step for the short term. Then we’ll commit to picking up the conversion to the shared component or wherever it might be in the long term. This is a long-winded way of saying that I treat teams that we know are going to get a ton of dependencies during a special time period during our planning process, because that is usually where all the hiccups in terms of our predictability show up.
What’s Coming Next?
Shoup: It feels like a lot of what we’ve heard is the same over the last few years. What’s new and exciting to you that’s coming next?
Caldwell: I have a very abstract answer to it. I think collaboration tools have radically improved during COVID. We were just talking a lot about org charts and structure. If we enter in a world where our collaboration becomes order of magnitude easier, do we have to lean as much as we did previously on an org chart and managerial hierarchy? Then in correspondence to that, someone else was talking about microservice architecture, and we spend a lot of time talking about removing dependencies. In a world where the technology itself also supports more independent teams, where do we end up in the future? Are we going to start chopping levels out of the org chart? How are we going to get teams to trust each other? I think it’s a very interesting world that we’re moving toward, because flatter, less hierarchy, with better tooling should allow us all to move with better flow more quickly. We’re figuring it out in real time, like how it’s supposed to work, which is exciting, I think for managers.
Wells: It’s really interesting to think about the change from when I first started as an engineer, when you basically had monolithic applications and slow release cycles, and considerably less complexity operationally, because you deployed your app onto a Tomcat server, and it had an Apache in front of it. Now we’ve got empowered autonomous teams, and everything seems more complicated. People have to know a lot more, and it takes a lot longer to get things going. That might just be partly where I’m working. Things have got to get simpler. I don’t think everyone should have to run something like a Kubernetes platform. For most companies, that is overkill. I would like it to be really easy to write some code and just give it to someone. You look at companies like Spotify or Monzo, where they’ve got that framework, just starting to go back to engineers just being able to concentrate on writing the code. Whereas I think in the meantime, it’s got very broad. I’m going to say this, because it’s my job: working on the stuff that provides a platform and enables other engineers to the point where they don’t even have to think about it, seems to me to be something where we could invest and get a benefit for our companies.
Lewis: I’m going to go out on a limb with maybe an antagonistic answer, which is that process optimization should never be exciting, especially in government, but probably in general. I think what’s exciting about this kind of engineering work at scale is the impact. Government tech, very boring, using at least 10-year-old tools, sometimes 30-year-old tools, but the agency I’m embedded within delivered a trillion dollars of economic aid into the economy at a time when we were facing economic disasters. Who cares if no one’s ever heard about Kubernetes and will never accept cloud into their heart? The impact still matters.
Kissler: My excitement is around, I truly believe that the focus on improving flow, and even today, it’s like we’re all in engineering leadership roles. My passion is, this is not an engineering or a technology problem. This is a business/company problem. Trying to bring the broader organization along for the value of improving flow is where I tend to get excited, because I think we’re getting to the point now, and some organizations are already there, where their business partners are in. I think many organizations still think that this is a problem for technology to solve. For me, the next evolution is to bring the broader company and business partners along.