Presentation: Talk Like a Suit: Making a Business Case for Engineering Work

David Van Couvering

Article originally posted on InfoQ.

Transcript

Van Couvering: This is David Van Couvering. I'll be talking to you about the important skill of being able to talk like a suit. Why do we want to talk like a suit? Actually, no one wears suits anymore. It's just a fun term for the people you work with who are focused on the business goals. Often, we talk about how engineering is focused on how to build something, whereas our business partners are focused on what to build. Often, we're struggling with some tech debt, some issues in our code. Many of us feel that we can never spend time on that. We never seem to be able to put time into cleaning up tech debt because we always have to build more features. It can be very frustrating. This talk is to help you figure out how to talk to your business partner in a way that makes sense to them, and helps them understand the value of investing in improving the code and reducing tech debt.

How Engineers Often See Tech Debt and Legacy Code

Often, engineers see tech debt and legacy code through these kinds of experiences: it's really difficult to work with. I don't know what's going on here. It keeps breaking in ways I don't expect. It takes a really long time to get anything done. It's difficult to debug. It's ugly, frustrating. It's not cool or fun. When we talk to our business partners about legacy code, we say we really need to clean this up, for these reasons. Sadly, what our business partners often hear is: tech debt, legacy, engineers like to complain, they always want to keep things nice and clean. Yes, it's a problem, but I've also got this feature to deliver. They're not seeing it through a lens that makes sense to them, and so they just don't give it priority. I'm not saying this is true for all business partners; many product owners and managers I've worked with really get this. But I know there's a lot of frustration in the industry where people feel like their business partners don't get it, and feel like they're not being heard.

It's important to understand that our business partners are focused on the business, and they aren't really recognized for this vague concept of reducing tech debt. Everyone knows we should reduce tech debt in a general way, but when push comes to shove for a particular sprint, the feature is almost always still more important. Why is that? Because they really are recognized for consistently shipping high-quality features on time and under budget, and for delivering on the OKRs they've set up for the quarter or the year. Yes, they are responsible for making sure the employees, including us engineers, have high engagement and satisfaction. Is that enough reason to invest an entire sprint or two in tech debt versus shipping more features? Usually not, from their perspective. There's this blah, blah, blah syndrome, where we as engineers feel like we're not being heard, and so we get frustrated and checked out. Also, business opportunities get missed, and business risks increase as the tech debt increases. There are real consequences to this.

Flip Side: The Risks of Engineering-Driven Work

The other thing is, I've also been in organizations that are very engineering driven, meaning the engineering leads and engineers call the shots. There's a risk there too, where you end up building things that are fun to build and interesting, or, "I know there's an open source version of this out there, but I think we can do better." All sorts of reasons that are maybe not the best reasons, but they're driven by our own motivations as engineers, and we're not really thinking enough about the business. When we don't talk like a suit, we often build things that maybe we shouldn't be building, just because we're wearing the glasses of an engineer, as it were, when we're thinking about what we want to put time into.

Talking Like a Suit

Let's talk about how we can talk like a suit. There are many ways people address this. The framework that I've found really helps came from the book "Building a Story Brand," by Donald Miller. It basically helps you construct a good story where the business partner is the hero. I just found the structure really simple, and a really great framework for thinking about how to reframe what I'm trying to say so that it really clicks with my business partner and helps them understand the value of what it is we want to do.

Structure of a Good Story

In this book, they say there's a structure to a good story, and this structure repeats itself in many stories. I'm not sure it's always exactly like this, but it's definitely a common pattern. It's very useful because stories are engaging for a reason: they bring us in, help us go through an experience, and we get caught up in it. Here's the structure of a good story: a character has a problem, meets a guide, who gives them a plan and calls them to action, which helps them avoid failure and ends in success. I'm not going to try to give you examples of actual stories where this plays out. I will walk through each piece of this in more detail and apply it to telling a story about tech debt that makes sense to a business partner.

Examples: Refactoring, and Complex Client/Service Communication

I want to do a couple of specific examples to help you think about this with real, actual problems. I find it really helpful. One is: we've got some nasty package, or module, or component that is just really difficult to work with. Another example I've seen fairly frequently is: we've built a client/service communication pattern that's really complex. Let's say we have a web client, an iPhone client, and an Android client, and they all need to talk to all these backend services that we've built in a nice microservice way. Now all these clients have to call these services, orchestrate the results, merge them together, get the results they want, and present them to the user. This, as we'll get into more, has a set of problems.

What Is the Problem?

How about we describe this problem? A first pass might be: the spaghetti package, look, every time we try to change something it breaks in unexpected ways. It takes forever to figure out what we need to change. Pull requests are huge, and it's hard to find reviewers. For the client/service problem we might say: every time we need to build a new feature, we have to duplicate the code in each client. Service teams have to build these specialized APIs. The clients have to pull down all these results to pick what they need. It's really frustrating. When we describe the problem in this way, we've fallen into the trap of making ourselves the hero of the story. This is not about us as engineers. That's not who we're trying to talk to. We have to be very careful not to fall into this trap, where we describe the problem from our perspective versus from the business partner's perspective.

Communicating the Problem

Before I go into how you might describe this for the business partner, I want to talk a little more about the structure of communicating the problem, based on this book. It's always good to have a good villain, and you identify one that's relatable, one where everyone says, "Yes, I get it." It's singular: you don't want to have five villains, that's confusing, just have one. I'll talk more about how you can focus on that. Make it real; don't make up a villain that doesn't exist. Then there are three levels of conflict. There's the external problem. There's the internal conflict for the actual hero. Then there's the philosophical problem, like, people should be nice to each other, or it shouldn't be this hard. It's a general feeling of, it shouldn't be this way. You want a singular villain, a singular problem. There are usually multiple problems, and you want to identify the one that is the biggest concern. One thing you might do is come up with some ideas about what external and internal problems might exist. Then you might talk to your business partner and see which one they relate to the most. Or you look at the data and see which one is causing the most problems in terms of productivity or value.

Spaghetti Package Problem Summary

Let's go through an actual example of defining the problem using these pieces of a problem statement: villain, external, internal, and philosophical, for the spaghetti package problem. Of course, there can be other reasons why this package is a problem, but let's just pick one. Let's say the one that's most concerning to our business partner is productivity: being able to deliver features quickly. The villain in this case is this system, you can almost imagine some gremlin, that just seems to slow down feature delivery, and it's very frustrating. The external problem is: I want to be able to deliver features quickly. The internal problem, something that's more about me as a business partner than the external problem, is: I really want to impress my management, do a good job, and increase my chances of career success by demonstrating that we're delivering lots of features, that we have high productivity and high reliability. Philosophically: really, why is it so hard? It shouldn't be so hard to roll out new features. Why is it such a struggle? That's how you might explain the problem, taking it from the perspective where the business partner is the hero.

Client/Service Complexity Problem Summary

For the client/service complexity problem, there are all sorts of reasons why this can be a problem. There's the performance concern, maybe. Again, it could be productivity: it's so hard to get a feature out the door. Let's say the biggest frustration for our business partner is that they really want to do lots of user experience innovation, and they've got this villainous system that makes it really hard to do that. The external problem is: I want to quickly experiment with new user experiences. Internally, maybe it's: I just want to feel inspired and creative, and feel like we're doing a good job here. Philosophically, it's: I've been at other companies where it's just really hard to innovate and experiment on the UI, and here it's just so hard too; it shouldn't be this hard. That's a problem statement from the perspective of the business partner in a suit.

Back it Up with Data

When you're presenting this problem statement, you really want to see if you can demonstrate it with data, because otherwise it may start feeling like you're making something up just to get something you want done. They may not be experiencing the problem as much, or they may not be noticing it's a problem. If they're not noticing, you can show them how it's a problem. Or they're saying, "Yes, it's really slow. I don't know why," and you can show them one of the core reasons why things are slow. For example, maybe you can show, from start to finish, how long it takes to build features, and notice: whenever a feature touches this particular package, the time it takes to build it doubles. Or maybe you can have a number of engineers share how they have to go back and forth between the frontend and the backend every time they want to change the UI.
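To make that kind of evidence concrete, here is a rough sketch of the analysis you might run; the ticket records, field names, and the `billing/` package are all invented for illustration (in practice this data would come from your issue tracker and version control):

```python
from statistics import median

# Hypothetical ticket data: lead time in days, plus the files each change touched.
tickets = [
    {"id": "T-101", "lead_time_days": 3, "files": ["api/users.py"]},
    {"id": "T-102", "lead_time_days": 9, "files": ["billing/spaghetti.py", "api/users.py"]},
    {"id": "T-103", "lead_time_days": 4, "files": ["web/home.js"]},
    {"id": "T-104", "lead_time_days": 11, "files": ["billing/spaghetti.py"]},
    {"id": "T-105", "lead_time_days": 2, "files": ["api/search.py"]},
]

def split_lead_times(tickets, package_prefix):
    """Median lead time for tickets that touched the suspect package vs. those that didn't."""
    touched, untouched = [], []
    for t in tickets:
        hit = any(f.startswith(package_prefix) for f in t["files"])
        (touched if hit else untouched).append(t["lead_time_days"])
    return median(touched), median(untouched)

touched_median, untouched_median = split_lead_times(tickets, "billing/")
print(f"Median lead time touching the package: {touched_median} days")
print(f"Median lead time elsewhere:            {untouched_median} days")
```

If the medians split like this, you have a concrete number to anchor the problem statement, instead of "it feels slow."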

One thing I've noticed can be challenging to communicate is when everything's fine now, but as we look forward into the future, for whatever the business is planning, there are going to be real problems. Sometimes you have to say: looking at our business goals, for example, we want to be able to go into new markets, or we want to double the number of engineers by next year, or we really need to integrate with third parties. Identify what these longer-term business goals are, and then point out how the situation as it is today may be fine, but very soon it's going to become a real blocker for the goals of the business, and could be a serious risk. Sometimes what you're communicating as an engineer is a risk that's coming, versus a problem that exists today. An important part of our job is learning how to express those risks in business terms. If you can't find the data, and this has definitely happened to me, you need to ask yourself: is this really a problem, or is it just me wanting to be a perfectionist engineer? If it's not really causing a huge problem, maybe step back and ask whether it's something worth investing time in.

You Are the Guide

What is our role as engineers in this story? Not the hero; we're the guide. We're the one who has empathy and authority, and we're trusted. We're the one who's going to help them solve this problem, and we're going to give them a plan. You want to have empathy. You give the problem statement and you share how, "I really get that this is the concern. I don't want it to be this way either. I can really see why this is a problem." It may just be the way you talk to your business partner, but really take on that feeling of being someone who's there to help them. Demonstrate your competence, showing how you've refactored a system like this before and seen great success. Cite examples from the past. If you don't actually have experience with this particular problem, then you should probably find a trusted advisor, maybe a more senior engineer who's mentoring you, who is going to help with this project, so your business partner feels comfortable that you have what it takes to make the necessary change in a reasonable amount of time.

Give Them a Plan

Then, you give them a plan. You want to show them a path that gives them confidence and reduces their anxiety that we don't know how to solve this problem. I've seen this plan in the past, a bad plan: "We need two years, give us the best engineers, and in two years, we're going to completely change everything." It's too much of a black box, too long a time period. You're not going to be able to maintain momentum; it will probably get defunded halfway through. You really want to break things down into demonstrable steps that deliver value at each step. This is the agile-lean way. It's important to do it not just with features, but also with tech debt. This approach builds trust and it builds momentum.

Example Plans

Some example plans. For the spaghetti package, say: "We just want to analyze and write up all the responsibilities of the package. Identify which responsibility gets the most activity. Maybe factor that out first. Evaluate what difference it's making. If it's making a difference, keep going until we're not seeing much value for each level of effort." Or you could say we're going to time box the effort to two or three sprints. Again, you're focusing on: let's do one small thing, see how it goes, and keep going. For the client/service complexity, maybe we say we want to identify a couple of the best approaches; maybe we want to go to GraphQL, maybe we want to build our own BFF (backend for frontend). Then we're going to pilot these approaches with one service, or maybe we'll decide we think it's going to be GraphQL, let's try it out. Then evaluate again. Then, if that's looking really good and you're seeing good results, start rolling it out to the other services and keep evaluating whether the time to build new features in the frontend is getting better.
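As a very rough sketch of the BFF idea, here is what a single aggregating endpoint might look like. The service functions are stubbed out and all the names and data shapes are invented for illustration; a real BFF would make network calls to the actual backend services:

```python
from concurrent.futures import ThreadPoolExecutor

# Stubbed backend services; in a real BFF these would be HTTP or gRPC calls.
def fetch_user(user_id):
    return {"id": user_id, "name": "Alice"}

def fetch_orders(user_id):
    return [{"order_id": 1, "total": 42.0}]

def home_screen(user_id):
    """A single BFF endpoint: fan out to the backend services, merge the
    results, and return exactly the shape the client needs, so each client
    no longer orchestrates and filters raw service responses itself."""
    with ThreadPoolExecutor() as pool:
        user_future = pool.submit(fetch_user, user_id)
        orders_future = pool.submit(fetch_orders, user_id)
        user = user_future.result()
        orders = orders_future.result()
    return {"greeting": f"Hello, {user['name']}", "recent_orders": orders[:3]}

print(home_screen(7))
```

The design point is that the orchestration moves server-side once, instead of being duplicated across the web, iPhone, and Android clients.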

Call Them to Action

Then you need to call them to action. What actually do you need from them? “We need three engineers for three months for phase one, and then we reevaluate,” that’s going to be for a big project. Or, “Just give us two people for two weeks, and we’ll do an analysis and come back with a detailed plan.” Or, “We just need to constantly carve out time each quarter for ongoing maintenance and tech debt.” This is just something we need to make sure we allocate time for.

Avoiding Failure and Reaping Success

Then the last step is giving them a vision of the improved future. How do they avoid failure? How will they reap success? Examples of avoiding failure: we're not going to get bogged down every time we need to touch this package; we're not going to get outpaced by our competition because they're innovating so much faster on their UI than we are. Visions of success: as we clean this up, we'll be able to deliver more features faster with higher quality. Or, as we move to a better client/service architecture, we'll be able to quickly iterate on our UX and build a much more compelling experience.

Have Relevant, Measurable Success Metrics

As you're making these changes, as much as possible, try to have relevant, measurable business success metrics. Have these metrics be the business outcomes of what you're doing: quality, productivity. Whatever it is you're saying you're going to improve, how do you measure that? How do you show that it's getting better? If things aren't improving, be willing to have an honest discussion about it. Part of building trust and momentum is showing that you're actually improving, so we need to be able to prove that that's happening.

Putting It All Together – Elevator Pitch

Just putting it all together, a quick summary. You'd probably go into more detail when you're presenting to your business partner, but here's how you might say it very quickly in an elevator. "I know how important it is to you to be able to reliably and rapidly deliver results." You're explaining what their problem is, and showing that you get it, as a guide. "I've noticed that every time we have to touch this one package, we get bogged down in complexity, and it takes us much longer to deliver. It shouldn't take this long to ship seemingly simple features." That's the philosophical statement. You're basically helping them see that you understand the problem and you know what's important to them.

Now you describe how you have the expertise needed to give them confidence that you can fix this. "In my years as an engineer, I have regularly seen a little refactoring effort pay off with big results." Then comes the call to action: "Let's allocate two people for two to three sprints to clean up this package." Then you talk about how this helps them avoid failure and reap success: "By doing this, we should stop running into so many unexpected bottlenecks and see a noticeable difference in our speed and consistency at delivering features." You can see how that's a much better story for a business partner than, "We engineers are really frustrated with this package. It's so hard to work with. We don't understand it. We can't figure it out." Those are real problems, but they're not necessarily going to motivate your business partner as much as a story where they're the hero.

We Should Be Thinking Like Suits Anyway

Even if we don't need to convince our business partner, and they're giving us carte blanche to do whatever we want because they really trust us, we should be thinking like suits anyway. It's really easy for us as engineers to get hypnotized by shiny new things, the latest technology that's come out. We all do it. We tend to be overly optimistic about how much work is involved, especially when we're excited about something. We can often get a little cocky and create these really big projects that don't deliver incremental results, a boil-the-ocean project. That can ultimately have really poor results. We're often not fully in tune with our own customers' needs; we have our head so far in the engine of the car that we're not thinking about what it means to drive the car. Stepping back and putting on a suit, as it were, when we're thinking about tech debt projects helps us start to identify better with our customers.

Suit-Talk Is an Engineering Superpower

I really think that the suit talk is an engineering superpower as you’re growing in seniority as an engineer, and you’re trying to communicate more with business partners, and you’re trying to have more impact and influence. Being able to think and talk this way is really a powerful skill. I highly recommend practicing and trying to get better at it.

Resources

If you want to learn more, here are a number of really great books around this. There's "Building a Story Brand," of course. Randy Shoup had a great talk at QCon Plus 2020 on a similar theme, called "Communicating Effectively with Business Partners." Then there are these books about tech debt, measuring outcomes, how to best demonstrate value, how to calculate value, all sorts of things like that. Lots of great books here for you to take a look at.

Questions and Answers

Schuster: Is a pilot, in your mind, the same as a proof of concept? Is it something that you build and throw away, or is it something that you integrate into the actual product?

Van Couvering: A proof of concept is another great tool for reducing risk and building confidence in what you're offering, by saying, let's just try this as a spike or a proof of concept. Usually, a proof of concept is something that you throw away; you just put it together to demonstrate that it looks like it should work. Whereas for me, a pilot is something where you actually work with a particular team, or say, let's just do this for one example. If it's unsuccessful, you might roll it back. If it's successful, it's something you would keep.

Schuster: Some of the steps you mentioned, building a proof of concept, for instance, take time. How do you sell that investment in time to your fellow suits? How do you say this is worthwhile, that gathering the data is worthwhile? How do you do that?

Van Couvering: I think in order to even make your case, you might have to spend time gathering data. Yes, it's true, you can't spend, like, a full sprint of your time just gathering data. You probably need to time box it somewhere, and if it takes longer than that, then maybe work with what you've got. Saying, "Look, I'm seeing trends here. I think what we're seeing is that we're having a lot of trouble delivering features, because it looks like this particular package is a real problem. Can you give me a little time to dig into that a bit more?" can be really effective. Again, it's all about cost versus benefit: if I can give you a lot of benefit for this much cost, then usually they're willing to do it. It is a tricky one, because it's the cart before the horse, and how do you get bootstrapped?

Schuster: If your company has a 20% program, 20% spend on your own things, would you use that for that, or should that 20% type of thing be used for something else?

Van Couvering: If you can actually get the work done with the 20%, that's great. You have 20% of your time, and you probably have way more than 20% worth of things you could be working on. For me, personally, I like to give myself a business justification to help me prioritize what I want to put my time into, even if I don't need to get external approval. Or, if there's something that's going to require more than 20%, then you might use your 20% to make the case: we need to do this big initiative to move from monolith to microservices, which is a much bigger effort than 20% would cover, I've seen. Again, in general, I like to convince myself that what I'm doing is delivering the most value.

There's another part of 20% time, which is: just have fun. Do stuff that is just fun, because that's what engineers like to do. Often, you discover stuff that ends up being really useful, mostly for engineers. That's the part I like about hack days and 20% time, just doing whatever you want. Maybe what you want to do is something bigger, in which case you gather the data and make a good case. That's more for when you haven't been given time to play, where you need to actually get approval for time.

Schuster: Often, things like optimizations are particularly interesting for that, because you have so much better data: I made this 10 times faster. Whereas refactoring might be a bit harder to sell.

Do you have any advice on how to progressively convince engineering teams that gathering and showing data to convince the suits is actually valuable, not a waste of time? How do you convince them it's not going to be a waste of time?

Van Couvering: I can imagine people saying, it doesn't even matter if I make a good case, they're still not going to let me do it. It could be you're in an environment like that, where there's just that kind of relationship with product: they don't have a lot of trust or respect for engineers, and don't trust them to know what to build. I would love to see the particular situation, because to me, often, the first thing you need to do is establish a better relationship. Maybe you can start with a small thing where you demonstrate, "I get you. I understand where you're coming from. You gave me just two days, and I've delivered this thing that made a big difference for you." Then that trust starts building. I can totally get it where, if over many years you've felt like it's hopeless, you've checked out; for those engineers, it may be harder to convince them. For me, then, I'd probably want to demonstrate: I was able to do this, so maybe you want to try it too. Yes, getting that flywheel started can be hard. I totally get that.

Schuster: As a developer who was once a young developer many years ago, you have to overcome the fear of the suits in a way. Because: I'm a developer, I don't deal with suits. You yourself may have thought like that.

Van Couvering: This is about becoming a better developer, so when you’re ready to take that next step, this is a muscle that you need to start growing.

Schuster: I particularly like your focus on the data. Again, having been a younger developer once, I had different motivations for doing things. Switching to that fancy new language was the most important thing; I didn't realize the cost. Actually, getting the data keeps you honest, basically.

Van Couvering: Then the thing is, now you establish trust with the business and they let you do more things.

Schuster: As an engineer, if I identify a problem, when do you think it's a good idea to get PMs involved, and how do you split the workload accordingly?

Van Couvering: I've found it works best as a conversation. I've done this in backlog grooming, but it can happen elsewhere. Part of this is the "give them a plan" piece, because it's scary to say to the PM, "There's this big thing we need to do, and it's going to take a lot of time." It's better to say, "Here's a plan, so that we can be sure this is the right path. We're going to deliver incremental value. Here are the first few steps we're going to take." You don't need to break down the entire plan, maybe just the first few steps, and say, ok, then we'll reevaluate. Then you offer that as, "Here's how I'm thinking we break it up. What do you think?" They can give you some feedback. Often, it's good to come with a proposed plan but be willing to get their input on how you might change it, versus coming with a blank slate and saying, what's the plan?

Schuster: Project estimates can be from small to big depending on the project. How do you separate out cost savings from refactoring from just normal project cost in a way that is believable?

Van Couvering: You’re saying that there’s this overhead of projects that have a cost, and so how can you demonstrate that the refactoring has made a difference?

Project overhead costs? You're trying to demonstrate that we did something that added value. How do you know, when it could just be lost in the noise of the overall project cost? That's a tricky one. Hopefully, the refactoring has a significant impact, so you actually see the difference. Again, there's cost versus value. With a refactoring, you're usually identifying a particular thing. Usually the biggest problem when you've got a big, ugly piece of code that needs refactoring is that everything in that particular piece of code takes a long time. If you look at all the different tickets that went through the system, you can separate out which ones touched this bad code and which ones didn't, and demonstrate a difference in how long they take. Then when you do the refactoring, you can show that the stories that touch this code now take much less time. That will be the same regardless of whether there's project overhead or not. That's how I'm thinking of it. That's one idea I have: isolate the one variable and show the difference when you change that variable.

What are some techniques to create bridges between business and technology on a continuous basis to make the job of making business case easier?

I've found that the best way to create a bridge is this thing of being a guide, someone who understands their perspective. Not just when you're making proposals, but in day-to-day conversations. When you're talking about a feature, you can say, if we did it this way, I think it would bring this much more value to the business because X, Y, or Z. Starting to think about the metrics and measures that are important to the business helps you become someone they trust more, someone who gets it from their perspective. Then, of course, the other thing I've noticed is that, in general, when engineering continuously helps the business by delivering things that make a big difference, whether it's tech debt or not, they start trusting you more. They know that you have the other part of being a guide, which is capability. They can trust you to do it and get it done. Those are the two things that come to mind. To me, establishing trust is about listening, reflecting, understanding. The talk on psychological safety is very similar: mirroring, basically helping them feel that they don't have someone to fight against, but someone who's their partner.

There's also a battle here against learned helplessness, where your efforts didn't work before, and that makes it less likely you'll try again. That's really hard to counteract.

I think you need to get to small, incremental progress. That's why I talk about building momentum. When you can deliver value in small increments, trust starts building, you get momentum, and they're more willing to let you keep going. Learned helplessness is a really tricky one, because the engineers are like, "Just tell me what to do. I give up." And the leader is like, "These engineers always need to be told what to do, so how can we trust them to do anything?" It takes work to get out of that cycle.



Presentation: Prod Lessons – Deployment Validation and Graceful Degradation

Anika Mukherji

Article originally posted on InfoQ.

Transcript

Mukherji: My name is Anika Mukherji. I'm going to be presenting lessons from production for SRE. I'm going to be talking about two particular projects, or learnings, from my time at Pinterest that dramatically reduced the number of incidents we saw. The strategies and approaches I'm going to talk about helped us prevent incidents in a very systematic way, and I think other people will find them a useful way to think about solving problems.

I’ve been at Pinterest for almost three years now, and within SRE at Pinterest for almost two years. Within SRE, I am primarily focused on the Pinner experience. I’m the SRE embedded in our Core Pinner API team, our Core Pinner web team, and our traffic infrastructure team. When I think about incidents and solving problems, I’m really focused on the user experience and how the Pinners interact with our product.

Nature of Incidents

First, I want to talk a little bit about the nature of incidents in general. Why do we have incidents? The reason we have incidents is that changes are introduced into the system by people, and people make mistakes. That means that as the number of people at our company grows, the number of human errors is going to grow as well, meaning we’re going to have more incidents. That is a natural consequence of growth, of product development, of the complexity of our systems. You’re going to see an increase in incidents as the company grows. It’s not going to be quite as linear as the dotted line here; it’s probably going to be more of a curve, but it’s going to trend upwards. When thinking about tracking incidents, we really want to be thinking about how we reduce the number of incidents per person, or per team, or however you want to set that denominator. How do we flatten this curve?

Key Insights

The two key insights, which correspond directly to the two initiatives I’ll talk about today, are these. First, make changes safely; in other words, safe deployment practices. How do we take the changes that engineers have made locally and safely deploy them to our end users in a way that helps us catch errors proactively and reduces toil for engineers themselves? Then, on the other side, we have: make the right thing the easy thing. What I mean by this is that developers generally want to develop safely. Engineers want to do the right thing; we don’t want to develop in an unsafe or unreliable way. As SREs, one big lever we have is to make the default experience for developers the safe option. How do we help them do the right thing by making it the easiest thing?

Making Deployments Safe

Onto the first section here, making deployments safe. At Pinterest, we recently adopted, and have been working on, something called ACA, or Automated Canary Analysis. The basic concept of canary analysis is that you have a canary fleet of a certain size that takes a small percentage of production traffic, and a control fleet that’s serving the current production build, which is a verified healthy build. These two fleets are receiving an equal amount of traffic, an equal distribution of traffic. They’re identical in all respects, except for the build that they’re serving. The canary is like the experimental treatment, and the control is the control. Because these two fleets have exactly the same traffic patterns and the exact same configuration, we are really able to compare metrics in an apples-to-apples way between these two fleets. How this works is you can build a canary-control comparison into your deployment practice. Before you roll something out to production, you would run the canary-control comparison as a required step, and it has to pass that validation before you deploy to production. There are quite a few blog posts from Netflix about this, actually.
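To make the comparison concrete, here is a minimal sketch of the apples-to-apples check described above. The function name, threshold, and metric shape are illustrative assumptions; the real analysis is considerably more sophisticated than a mean comparison.

```python
# Hypothetical sketch of a canary-vs-control metric comparison.
# compare_fleets and max_relative_drop are illustrative, not Pinterest's API.

def compare_fleets(canary_values, control_values, max_relative_drop=0.05):
    """Fail validation if the canary's mean metric (e.g. success rate)
    drops more than max_relative_drop below the control's mean."""
    canary_mean = sum(canary_values) / len(canary_values)
    control_mean = sum(control_values) / len(control_values)
    if control_mean == 0:
        return True  # nothing meaningful to compare against
    relative_drop = (control_mean - canary_mean) / control_mean
    return relative_drop <= max_relative_drop

# Identical traffic makes the two time series directly comparable.
healthy = compare_fleets([0.990, 0.991, 0.989], [0.992, 0.990, 0.991])
broken = compare_fleets([0.90, 0.88, 0.91], [0.99, 0.99, 0.99])
```

Because both fleets see the same traffic distribution, even a simple relative-drop check like this catches regressions that absolute thresholds would miss.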

ACA at Pinterest

At Pinterest, we employ ACA by using Spinnaker, which is also a Netflix maintained open source tool. Spinnaker serves to orchestrate our deploys. It essentially allows us to create a deployment DAG, where we can deploy to certain stages in a certain order and run jobs or validation in between those deployments. Separately, we have an observability framework that allows anyone to set up dashboards and graphs to help them monitor the health of their system. This is pretty standard at a lot of companies. Maybe it’s Datadog, maybe it’s some other third party tool. We just have something in-house that contains all of our metrics, and allows engineers to set up graphs and set alerts on those graphs.

When we were thinking about canary analysis, we didn’t want to have to build a new metrics framework. We wanted to be able to use the entire suite of metrics that we already have at our disposal. The way we went about this was setting up a configuration system where engineers could set up a config, in code, in YAML, with a list of dashboards. Those dashboards have sets of graphs that allow engineers to measure different things like success rate and CPU. The YAML also has other configuration, like duration and a fast-fail threshold, different things about the validation that we want to run. Committing this YAML essentially exposes an endpoint. This endpoint allows us to query whether any of the graphs on the dashboards specified in the YAML are alerting, meaning they have hit some critical threshold, or, in other words, are failing validation.
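As a rough illustration, such a config might look something like the following. The schema and field names here are hypothetical, reconstructed from the description above; they are not Pinterest’s actual format.

```yaml
# Hypothetical shape of an ACA validation config; field names are
# illustrative assumptions, not Pinterest's actual schema.
aca_validation:
  duration_minutes: 60
  fast_fail_threshold: 3        # consecutive critical polls before failing early
  dashboards:
    - name: core-pinner-api-success-rate
    - name: core-pinner-api-cpu
```

The key design point is that the dashboards referenced here are the same ones engineers already use for monitoring, so no second metrics system has to be built or kept in sync.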

What we did in Spinnaker is that we have one Spinnaker node in the stack that actually queries this API and polls it for the duration that has been specified. If at any point this API returns that any of the graphs are in critical mode, or the threshold has been hit, then we fail validation. If we fail validation, we basically page somebody, and then we enter a manual judgment stage. When we’re in manual judgment, this means that our pipeline is paused, and somebody has been paged and is actively looking into the problem. They can then decide that, yes, this is a true positive and this is a problem, in which case, in the manual judgment stage, they will make a selection like roll back, or pause the pipeline completely so we can look into it more, some defensive action. Or they’ll say, this is a false positive, a noisy alert, we need to fix this asynchronously, and then they’ll promote the build to production.
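The polling stage described above can be sketched roughly like this. The function names and return shapes are assumptions for illustration; the real stage runs inside Spinnaker and pages an on-call engineer into manual judgment rather than just returning a status.

```python
import time

# Illustrative sketch of the validation stage described above.
# query_alert_status is a stand-in for the endpoint the YAML config exposes;
# it is assumed to return True when any configured graph is critical.

def run_canary_validation(query_alert_status, duration_s, poll_interval_s=30,
                          sleep=time.sleep):
    """Poll the alert endpoint for the configured duration.
    Returns 'passed' or 'failed'; a failure would page someone and
    pause the pipeline in manual judgment in the real system."""
    elapsed = 0
    while elapsed < duration_s:
        if query_alert_status():  # True -> a graph hit its critical threshold
            return "failed"
        sleep(poll_interval_s)
        elapsed += poll_interval_s
    return "passed"

# Example: the endpoint reports healthy once, then critical.
responses = iter([False, True])
result = run_canary_validation(lambda: next(responses),
                               duration_s=120, poll_interval_s=30,
                               sleep=lambda _: None)
# result == "failed"
```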

As a result of this, we’ve basically been able to go from a system where we deploy from staging directly to production with a question mark, no idea whether that build is healthy, and then have to roll back all of production, causing bad user experiences, to a system where staging deploys to canary, we run metric validation, take some defensive action if needed, and go back to staging to make fixes before rolling out to production. That way, we have a lot more confidence that the build we’re deploying to production is actually healthy. The other thing this helps us do is get a much quicker time to recovery, MTTR, because production is very big. For example, our Core Pinner fleet is thousands of hosts; canary is only around 20 to 30. Rolling back canary, or deploying fixes to canary, is much quicker than rolling back all of production.

Then we risk that the fix we’ve put in place might not even be the right fix. The other big benefit is that after we’ve detected a problem in canary, taken the defensive action, and put out a fix, whether that’s a revert or a fix forward, we can use the canary metrics again to validate that the problem has been fixed as we’d expect. This not only improves user experience dramatically, but also reduces the amount of engineering time that’s spent on mitigating production incidents and rolling back big fleets, which can take quite a while. At Pinterest, this system has saved us tons of hours in incident time. It has also reduced the number of production incidents that we’ve seen by over 30, a half, at least the last time I checked. We’ve seen a really big return on investment from this initiative of investing in deployment practices.

Systemic Graceful Degradation

Next, I’m going to talk a little bit about making the safe option the easiest option. I’m going to talk about a very specific problem that I was presented with when I first became an SRE for our Core Pinner API. Our Core Pinner API is a REST API. Essentially, the clients make a request to the API, and the API returns some set of models: pin models, board models, user models. By default, those models have a very basic set of fields, a basic set of metadata, on them. The clients, obviously, are product rich, and they sometimes need extra data from the API, maybe some shopping metadata, or maybe some extra links, things like that. The way we do this is the client can specify in the query string, I want these fields, comma separated. Those fields are then hydrated through batch fetching functions that we call field dependency functions. They’re very similar to Data Loader functions, if you’re familiar with GraphQL. These Data Loader functions were essentially a huge fanout to all sorts of different services in our infrastructure, which all had varying availability guarantees, and that was the crux of the issue. A small outage, or any outage, on any of these upstream systems would propagate the error up to the API. We had no graceful degradation there. We had no error handling. We would return errors for basically all of Pinterest. We had tons of outages because of this specific piece of infrastructure.

Over time, we saw this pattern in the incidents that we were experiencing. We sat down and we thought about, how can we think about this in a smarter way? Because not all of the data that we were returning to the client was critical, in that the client might request some shopping metadata or comment information about a pin. That’s not actually extremely critical to the Pinner experience, in that we could gracefully degrade that information if it’s not available, and still give the user a reasonable Pinterest experience.

The way we thought about this was looking at systemic graceful degradation. One of the big realizations here was that most engineers copy and paste their code. If you have a common code structure, a lot of the time engineers, myself included, are going to just copy something that we know works, change the function signature, and then change the logic inside until it does what we want it to do. This was exactly what was happening here: there were tons of functions that all followed the same exact pattern, but were all written in a very unsafe way. The approach we took was creating a standard decorator, written in Python, that did the error handling for us. We wrote this field dependency decorator, which you can see in this screenshot, that took in the data type, the expected return type. Within this decorator, we did all sorts of error handling. If we saw any Thrift exception returned from the function, we would gracefully degrade and return an empty data structure based on the provided parameter. We made it possible to opt out of this: there’s another argument that you can put in the field dependency decorator that allows you to opt out. The fact that the function signature is so concise makes it really nice for people to copy and paste.
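A minimal sketch of the decorator idea might look like this. The names, signature, and caught exception type are illustrative assumptions, not the actual Pinterest implementation shown in the screenshot.

```python
import functools
import logging

# Hypothetical sketch of a field-dependency decorator. Names and signatures
# are assumptions; the real implementation catches upstream (e.g. Thrift)
# errors and records metrics as well.

def field_dependency(return_type, gracefully_degrade=True):
    """Wrap a batch-fetching function so upstream failures return an
    empty instance of `return_type` instead of propagating the error."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if not gracefully_degrade:
                    raise  # critical data (e.g. images): let the error surface
                logging.exception("degraded field dependency %s", fn.__name__)
                return return_type()  # e.g. {} or [] for auxiliary metadata
        # Mark the function so a build-time check can verify it is wrapped.
        wrapper.is_field_dependency = True
        return wrapper
    return decorator

@field_dependency(dict)
def fetch_shopping_metadata(pin_ids):
    raise TimeoutError("upstream shopping service is down")

# The outage degrades to empty metadata instead of failing the whole request.
```

The concise signature is the point: anyone copying an existing field dependency gets graceful degradation by default, and has to opt out explicitly for truly critical data.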

Of course, we needed to backfill everything, so we went through all of the field dependencies that already existed, upwards of 100 of them. I worked with all the client engineers and figured out, what is critical data, what’s necessary to the Pinterest product, and what is auxiliary, what can we gracefully degrade? We found that most things we could really gracefully degrade. There were just a few things that either broke the Pinterest product in a really bad way, like images, for example, or where the client would actually crash because of missing data. Client crashes are the worst user experience we can provide, so we try to avoid those at all costs.

After doing this exercise with all the client teams, we marked everything with this field dependency decorator. After that, we had no problems. Basically, everybody copies and pastes something that already exists, changes the arguments, and gets all the error handling for free. As a bonus, we also instrumented a set of metrics within this decorator that gave us insight into things we had no idea about before: latency, success rate, QPS, which endpoints are actually using this data. That gave us tons of useful information about how these functions were actually being used, and allowed us to monitor them in a much more sustainable way. We also took inspiration from this idea and added ownership to all these functions, because they all fetch some data. Then we could understand which teams at Pinterest own the data behind which parts of the product, so we could mark it accordingly, tag our metrics, and escalate to the correct team.

As a result of all this, we were able to prevent tons of incidents. We have metrics on all the errors that are gracefully degraded and all the errors that are actually raised; we have prevented hundreds of incidents, and thousands of days of user downtime, as a result of this effort. Another thing I should mention is that we also did some type checking to really enforce that people use this decorator, in that we made sure that every Data Loader function that was getting mapped to a field was wrapped in the field dependency decorator. We did that by setting an attribute on the function object itself, so that we could do some checking; if someone did not wrap their function in the field dependency decorator, it would actually fail at build time. Any time you can take a common mistake that people make, check for it, and make sure they can’t make that mistake at the time when they’re actually writing the code and submitting the PR, that’s where you gain tons of leverage over preventing incidents.
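The build-time enforcement could work along these lines. The field map, the attribute name, and the check itself are illustrative assumptions, not the actual Pinterest CI code.

```python
# Sketch of the attribute-based build-time check described above.
# FIELD_MAP and is_field_dependency are hypothetical names: the decorator is
# assumed to set the attribute, so an unwrapped function is easy to detect.

def fetch_comments(pin_ids):  # author forgot the decorator
    return {}

FIELD_MAP = {"comments": fetch_comments}

def check_field_dependencies(field_map):
    """Fail the build if any field-mapped function lacks the decorator."""
    missing = [name for name, fn in field_map.items()
               if not getattr(fn, "is_field_dependency", False)]
    if missing:
        # In CI, this prevents the PR from landing at all.
        raise AssertionError(f"fields missing @field_dependency: {missing}")
```

Running `check_field_dependencies(FIELD_MAP)` here would raise, which is exactly the point: the mistake is caught before the code ships, not during an incident.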

Key Takeaways

Key takeaways from these two initiatives. These two projects alone have prevented hundreds of incidents a year for Pinterest. Particularly investing in safe deployment practices really pays off. It might not seem like it at first, because writing good canary analysis or writing good metric analysis is hard. Especially when it comes to low QPS metrics and things like that, it can be very difficult to write the correct alert. There will be toil, there will be iteration in terms of noisy alerts, false positives, getting the framework set up correctly. Over time, the investment will 100% pay off. The way we identified these two projects was by looking at trends in incidents. We really looked at the incidents that Pinterest was seeing and saw that, a lot of these are induced by deployments. A lot of these are related to the Data Loader functions. For SRE, a big recommendation that I have is to really take the time and dig into the incident data and categorize it by root cause, by responsible component, by team, like all of the axes, really. You want to get cuts into the data that help you identify where the biggest return on investment is. Because you may go to an incident post-mortem, and this incident maybe only happens once a year, and you have all of these remediation items as a result, and it really may not be worth the investment. If you look over time, you might be able to find repeated things that are every couple months, and that’s where you can really get your time back in terms of the work that you’re doing.

Then, lastly, I’ve said this a couple of times, but make the paved path the easiest path. Whether this is code in your framework itself, abstractions the framework provides to help people write their code, or whether it’s runbooks or scripts that help them set up capacity and configurations. Any time you can remove manual labor, do it, because any time things are done manually, mistakes will be made. Any time you can script something, automate it, or build it into the framework, and make it part of the paved, obvious path, that’s where you’re going to get the return on investment for the project, prevent incidents, and save engineering time. That’s where you’ll have the most success.

What Canary and Control Serve

Canary and control are serving legitimate production requests. It’s basically just a separate piece of capacity, a separate ASG. We keep it static, so it doesn’t autoscale like our production fleet does. We keep it large enough that we get a strong enough signal. It’s just part of the regular production server set, so requests are routed there with some probability, depending on how big canary and control are. It serves read and write requests. The reason we allow it to serve read and write requests is that the build has already gone through preliminary testing, through extensive unit tests and integration tests. It’s gone through the code review process and everything, so we feel confident that the build can serve production requests at that point. The reason that we don’t do shadow traffic, which I agree would be awesome, is that we don’t have a great way of knowing what is a read request and what is a write request, as was alluded to in the channel. So if we had canary or some other pre-production fleet serving shadow traffic, we don’t exactly know what the consequences of those double writes would be, because our tech stack is very complicated. We are working on a dev-prod separation effort, such that we would be able to serve those write requests safely. That is currently in flight.

Questions and Answers

Sombra: Please elaborate on the difference between the canary and the control?

Mukherji: Canary and control are identical in all respects. They should have the same exact configuration, run in the same container. The only difference is the build that they’re serving. The canary has the new “unverified build.” It’s the canary in the coal mine. We’re testing our new build, the new code on this canary. The control build has the verified production build. Canary and control are equal sizes, so we can do a pretty good apples to apples comparison between them in terms of finding anomalies in success rate, QPS, traffic patterns, anything that we would not expect to change when new code is being deployed. That way, we can detect if things do change and take appropriate action.

Sombra: I don’t know anything about canary and control; I just like the analysis and everything. It actually does signal to me that you need a certain level of organizational scaffolding to be able to have nice things. Can you tell me a little bit more about the size of the team that maintains this infrastructure? Also, what’s your advice for folks that don’t have that? My suspicion is that it’s large, that it takes human effort and energy to be able to have more guarantees and more assurances. What’s your opinion, or what’s your intuition, about all of that?

Mukherji: Pinterest is a large org. We have over 1000 engineers working full time, and we have a team dedicated to CD, or continuous deploy. That team is just a handful of people at the moment. Each engineering team does take a non-trivial amount of responsibility for their own deployments, in that the CD team provides the scaffolding and the software itself. They’re the ones who maintain Teletraan, which is our internal capacity configuration site; it’s actually open source, it’s on GitHub. Spinnaker as well, which is Netflix’s. They’re the ones who maintain those services. Then the teams are actually responsible for setting up their own configurations, using the tools, updating their metrics analysis, that kind of thing. My advice is, it doesn’t need to be so fancy. You can definitely apply similar approaches without having the full canary-control paradigm set up. You can have a canary without a control, and look for anomalies in your canary metrics and compare them against prod, for example, or look for changes over time. Any way you can build in pre-production testing and have a test suite that’s taking production traffic. The reason why production traffic is really important is that synthetic traffic is never going to mirror the full [inaudible 00:26:19]. Our integration tests definitely don’t catch everything. We found that using production traffic is really the only way of verifying.

Sombra: Can we do a short recap of the base tools that you use in your deployment process? Zoom in on the ones that you would want the audience to walk away with, like, you need one of these, one of these. Or, this is what we use?

Mukherji: I think you need capacity configuration software, some internal site, some way of configuring ASGs. It depends what cloud provider you’re using; we use AWS, but you could do it directly through your cloud provider’s console. You need some way of configuring ASGs of different sizes, basically. Then you need an orchestrator. I assume most people have some way of deploying, because otherwise, what are we doing? You need some way of orchestrating the deployments to the different ASGs. That can be something more fully fledged like Spinnaker, or it could just be a simple graph execution engine. Then just some metrics analysis tools. After setting all this up, we’ve really worked to automate as much as possible, so that ACA fails automatically, we pause automatically, and that stuff. That doesn’t need to happen right away to gain the benefit. You can do it manually as the first iteration, and then work to improve it from there.

Sombra: The next question is about detecting drift. Can you detect, or is there a process to detect, functions or teams that have opted out of the decorator?

Mukherji: We’ve actually set up a check at CI time such that we don’t let you land your code if you’re not using the decorator. That has probably, single-handedly, made the most difference. We did an ownership effort for our API. When we were adding ownership, at some point we wanted to make sure that all new endpoints had owners, so we wrote a script that did some parsing, and we basically enforced that when you submit your PR. We don’t let you land your PR unless you have all the things that the framework requires.

Sombra: How does your approach play with feature flags? Is it complementary?

Mukherji: At Pinterest, we call feature flags deciders. A decider can take a value between 0 and 100, given at runtime, and it gates that percentage of people into the feature. This is used in conjunction with the decorator, in that people will have feature flags within the function that the decorator is decorating to turn on the feature. The decorator is mostly just providing safety measures, error handling, metrics, that stuff. It is possible that your decorator could have a feature flag built into it. We let teams decide what feature flag gating they want. Besides deciders, we also run experiments, which are a more fully fledged comparison of an experiment group and a control group. That’s a more statistically significant comparison of with-feature versus without-feature, as opposed to deciders, which are just a quick way of turning something on.
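A decider as described above could be sketched roughly like this. The hashing scheme and names are assumptions for illustration, not Pinterest’s actual implementation; the key property is that a value from 0 to 100 deterministically gates that percentage of users in.

```python
import hashlib

# Illustrative decider sketch: hash each user into a stable bucket 0-99,
# and gate the user in if their bucket falls below the decider value.
# is_gated_in and the md5 bucketing are illustrative assumptions.

def is_gated_in(user_id: str, decider_value: int) -> bool:
    """Return True if this user is in the feature at this decider value."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < decider_value

# decider_value=0 gates everyone out; 100 releases to everyone.
assert not is_gated_in("user-42", 0)
assert is_gated_in("user-42", 100)
```

Because the bucket is derived from the user id rather than random per request, ramping the value from 0 to 100 grows the gated-in population monotonically, and the same user sees a consistent experience.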

Sombra: How far out do you go when it comes to the concept of a canary. It’s simple when you’re dealing with your own application, but in complex deployments, my application can bring down a different system. How far out do you go in this evaluation where it’s just like we have the coal mine as well as the bird?

Mukherji: That is really tricky. The way that we’ve gone about it is using our error budget framework, in addition to a tiering system. We tier our services by, essentially, whether they’re allowed to take down the site. Tier-0 and tier-1 services are allowed to take down the site, and tier-2 and tier-3 are not, more or less. Based on that, we can intelligently and gracefully degrade services at different levels, and build that in. If we see a bug and we know that it’s in a tier-3 service, we can have a very quick remediation, where we try/except it and save the experience. I care all about the Pinner experience, so any way that I can gracefully degrade the product and save Pinners from an error page, I’m going to go that way.

Sombra: The tiers are expressed based on the customer impact, or the customer’s ability to notice?

Mukherji: That is one axis of what can constitute the categories of the tiers. Other things that we take into account are auto healing behavior: tier-0 services are expected to be able to auto heal. What we were talking about, the impact on the product, really only applies to online serving, the Pinner-facing stuff. We have a whole offline data stack for which it doesn’t apply, so they have their own set of criteria for what constitutes tier-0 through tier-3.

Sombra: What about a feature that needs to be synchronized between more than one service, can you toggle those and how?

Mukherji: Yes. This goes back to deciders and experiments. Our decider and experiment “Configuration Map” is actually synced to all hosts at the same time, so that all hosts across Pinterest have, more or less, the same view of decider values and experiment allocations at the same time. If I made a change and said, I want to ramp this decider to 100% and release my feature to all of Pinterest, and I was using that decider in multiple services, then I could do that. I could just switch it to 100, and it would get updated to 100 in all the services. There is some aspect where it can take longer for the config map to deploy to different services, so there can be periods of inconsistency. If your feature is really susceptible to problems with inconsistency, then I’d probably recommend using a different feature flag architecture, like a new parameter, some field that lets one service tell another service that the feature is on.

Sombra: Folks would like you to expand on what you mean by auto healing.

Mukherji: Auto healing can have a lot of different meanings, but one example is our GSLB and CDN. If a CDN goes down, we will automatically route traffic to our other CDNs. That’s built into the algorithms and the configurations that we’re using. That would be an example of auto healing behavior.

Sombra: It’s like the system itself continues to make progress.

Mukherji: The idea is that the system will try to fix itself or take a remediating action without manual intervention.

Sombra: Do you have any information about how much time you spend in manual verifications versus how much time you save from incidents?

Mukherji: We actually have done a lot of work around this, because when I was working on it, I really wanted to focus on how I could show the value it brought. Because a lot of the time, as SREs, we bring these processes to teams, and everyone conceptually knows it’s a good thing, but we don’t have good numbers on it. For a false positive ACA alert, a noisy alert, we’ll be able to promote the build within 5 minutes, basically; it’s fairly obvious that it’s a false positive. If it’s a true positive, it can take 45 minutes or so to mitigate, which involves figuring out which commit broke the canary and reverting that diff, or making the remediating change, and then making sure that that build gets out. Forty-five minutes until we can resume our pipelines.

Incidents, if that were to go to production: if we were to have a success rate drop on some endpoint go to production, it would take at least 30 to 45 minutes for us to roll back prod itself. Our API fleet is over 4000 hosts, so it’s extremely large, and it takes a long time to roll back. It can take upwards of an hour for us to return to a healthy state. Then, at the same time, we’re going through the process of finding the problem and mitigating it. We save quite a lot of time. We catch about 30 true problems in canary, a half, the last time that I did the numbers.
