

Presentation: How to Get Tech-Debt on the Roadmap

Ben Hartshorne

Article originally posted on InfoQ. Visit InfoQ


Hartshorne: I’m here to talk about a time in Honeycomb’s life. As a company, we had been around for several years and the business was slowly growing. We were getting more customers. As the traffic grew, some scaling issues came up, and they were managed. This is great. We could keep growing, and a different thing came up, and it was managed. As the pace of growth goes along, these things happen. Each time, there are side effects that push out a little bit further. What you wind up with is a system that is still there and working well, and just creaking. It starts to push its boundaries more frequently in different ways. There’s no clear pattern to it, it’s rough. As a business, we wanted to grow. We realized that what got us to that part in our company’s life would not get us to the next one. I would like to talk a little bit about how we managed that.


My name is Ben Hartshorne. I’m a Principal Engineer at Honeycomb. I worked on this talk with Jess Mink. They joined Honeycomb as a senior director of platform engineering when we were facing this. They started, and we said, “We would like to double the size of our business, please help.” We needed to get a lot of work that was on the periphery here, somehow scheduled in the face of growing the business and increasing the product, and everything else. The thesis is relatively straightforward. A compelling business case is what will get your work scheduled. We need to understand, what does a compelling business case look like? How do we get there? How do we ensure that we can do this for technical work? Because the company has been growing and the product organization has been growing, and there’s more sophistication around deciding what we should build. We need to match that.

Understanding Priorities

I want to start by thinking about what does it mean to understand priorities, just a little bit. It’s a truism in the startup world and probably in all of the business world, that there’s always more work than you can possibly do. There’s this herd of features out there, they are beautiful. They are product ideas. They’re things you can do, new things you can build, ways you can improve. They all must go through this gate of your engineering capacity. You can only do so much work. We’re starting from this perspective of there’s no way we can do everything. The engineer’s view on this herd is on the inside of this fence. There’s this line of beautiful work coming in at you, it is full. You have your roadmap. You have all of your work queued up. You have the new products to build, the existing ones to improve. You look at this and you think, how can I possibly do the work that I see that we need to do in order to maintain this business? When do I schedule this? Can we close the gate for a little bit so we can sweep the entrance? That’s hard. Take for a moment the other perspective, the product managers, the ones who are creating this curated list of things for you to do. This is their view. They see this enormous pile of things they would like to build, ways the business can improve, and they’re not going anywhere. Why isn’t engineering doing our work? They ask. They see this as an enormous opportunity, and need to figure out, engineering has the capacity that they do. We need to understand which sheep to offer them, which are going to make it through the gate first. This is a little bit about what they do.

Let’s imagine this sheep in the front here. This sheep in the front will reduce your hosting costs by 20%. Very good sheep. We should do this. Any project that might show up has to be considered in the context of what the business is doing, what’s currently going on everywhere. Imagine we are a primary-colored company that does some web search, and we have the opportunity to reduce our hosting costs by 20%. This is a career-making move. This is going to happen. It doesn’t really matter how much it costs. Well, it matters a little bit. This is a project that is absolutely going to happen. Picture this middle group here, classic startup scene, early in their business. They have the same opportunity to reduce their hosting costs by 20%. Their hosting cost is what, like $300 a month? No, you’re not going to do that. Then there are all of the states in between, where your business might be in a merger, or it might be in a reorg, or there might be something going on. You have this obvious benefit, reduce hosting costs, but it’s really not clear whether you should do it. If the reorg isn’t complete, and the entire product that’s being hosted there is going to be disbanded, no, don’t go spend time reducing the costs to host it. The same project, the same decision, the same idea may get a very different result depending upon what’s going on in the rest of the business.

What’s the Prioritization Process?

Let’s look a little bit more at how to choose which of these things to do. It’s not always quite so simple as an enormous cost reduction. Product management has come up in this track before. There are sophisticated tools to help product managers take this enormous mountain of ideas and input and form it into the things that they’re going to present for engineering, to understand whether or not one of these things should be done. This input comes in many forms. There’s feedback from customers. There’s ideas sprouted from anywhere, really. There’s appreciation for a specific aspect of your business. All of these come together. They come in separately, but they need to be understood to be representations of a thing that the business does, a service that it provides, a problem that it’s solving. You can look at each of these different bits, whether it’s a piece of customer feedback that comes in, and try and reframe it so that it is shaped like a problem that the business is fulfilling. This process is managed for every one of those sheep we saw. What we wind up with is this enormous corpus of data about all of the different problems that we are confronting as a business, which ones affect our customers in one way or the other. Some of them are blocking new customers. Some of them are allowing the expansion of existing customers. Each one has this depth to it that has been accumulated by the product organization.

Once you get an understanding of these ideas, you can try and tease them apart into a couple of categories. There’s the obvious good, the obvious bad, the big, squishy middle. I particularly love this graph. I first came across it in a workshop by a guy named Jeff Patton. When I went to look for the image that he used, I found he credited it to a graph called Constable’s truth curve, which was in fact presented here at QCon in 2013. On the left there you have the size of a bet you’re willing to make for this particular piece of work. On the top, going across is how much you expect to get from that. Let me give you an example. I have a new feature that is going to make you several millions of dollars as a business. It’s going to take quite a lot of time. It’s up there with the green dollar sign. You ask me, “So you’re proposing this thing. We recognize all development is a bet. How confident are you? You say it’s going to bring in this much money? What’s the likelihood you’re wrong?” I say, “I think it is, and if I’m wrong, I’ll buy you a cup of coffee.” Maybe don’t take that bet. The green is equally clear. You have some relatively easy task that has a very clear return. You’re very confident, there’s no question about whether this would benefit the business. I’m willing to give you my car if I’m wrong. Yes, we should do that. That’s in the nice green section up there. The squishy middle is like most of everything else. We need tools to help us explore these ideas and better understand where they fit, in hopes of pushing them out of that squishy middle, into one of the other edges. There are a couple of good thought-organizing frameworks out there. These are just two samples, an opportunity canvas and a learning canvas. They’re both ways of taking your understanding of the problem, of the people it will help, of the people it might not help, of the cost, and structuring it in a way that facilitates some of this conversation.

Ultimately, the goal of this process is to clarify the ROI, the return on investment of any given piece of work. We need to find out who this work will influence, who it will impact, how much, and how much those users care about this particular thing. If it makes a big difference to some aspect that they never use, maybe not so important. A big part of the process of product management is understanding all of these questions. I’ve spent a while talking about this, it seems like a lot of work. We’re here to talk about tech debt. The thing is, yes, it is expensive, and software is more expensive. This has come up a couple of times in this track too. Engineers are expensive. Engineering efforts are expensive. I don’t want to just focus on us as individuals doing this work. You’ll hear somebody say, this is about one engineer’s worth of salary for a year or two for this project. If it’ll take us three months, we should do it, that sort of thing. That’s not enough. There’s a metric out there published by a number of public companies, it represents revenue per employee. We’re beyond engineering here: it’s not revenue per engineer, it’s revenue per employee. This is what each employee of the company represents in terms of value that they’re building in order to bring the company to a good place. If we do a little quick math, engineering is usually maybe 30%, 35% of the company. Look at those numbers a little bit. We’re talking about a million dollars per engineer, in terms of the revenue they’re expected to generate through their work. For high performing companies, this is even higher, $2 million to $5 million.
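The quick math above can be sketched in a few lines of Python. This is only an illustration: the revenue-per-employee figure is a made-up assumption (the talk's slide values are not in the transcript), and it assumes the conversion the speaker has in mind is simply dividing revenue per employee by engineering's share of headcount.

```python
def revenue_per_engineer(revenue_per_employee: float, engineering_fraction: float) -> float:
    """Revenue attributable to each engineer when engineers are a fraction of all staff.

    total revenue = revenue_per_employee * headcount
    engineers     = engineering_fraction * headcount
    so revenue per engineer = revenue_per_employee / engineering_fraction
    """
    if not 0 < engineering_fraction <= 1:
        raise ValueError("engineering_fraction must be in (0, 1]")
    return revenue_per_employee / engineering_fraction


# Hypothetical company: $350k revenue per employee, engineering is 35% of staff.
estimate = revenue_per_engineer(350_000, 0.35)
print(f"Roughly ${estimate:,.0f} of revenue per engineer")
```

The point of the exercise is the order of magnitude, not the exact figure: an engineer-quarter spent on any project is being weighed against that much expected revenue.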

Our time is expensive, it is worth a lot. We need to recognize that in order to value the work that the product organization has been putting into how they’re curating all of those sheep and to choose which ones they’re going to let through this gate. They are putting an enormous amount of respect on our teams, in order to choose the most impactful projects for us to build. When we think about tech debt, we think about a thing that we need to do that’s outside of this regular stream of work. We think that our piece of work should jump the queue. “This database upgrade is important. We have to do it.” Do we? Maybe not. What’s the business value? What’s the return? How can we measure this piece of work and this feature in the same language, the same framework as all of this effort that’s been put towards the rest of the work that is coming to us from the rest of the organization? Hang on to that question for a moment.

Recap (Understanding the Priorities)

We talked about how, obviously, we can’t do all the work, and how the prioritization is going to be based strongly on a business case: both the return we get for doing the work and the cost of the investment we put into doing the work, and then the balance between them. That balance must exist in the context of the business and what’s important for the business at that time. That covers the cost of the work.

Selling the Impact of Your Project

Let’s talk about what we get back. What are we going to get for this project that we need to do? We’ve seen our work queues look like this. We need to upgrade the thing. We need to move the other bit. We need to refactor this. You all know what this is. This is tech debt. We can tell. We look at it, we see, yes. In order to focus it back on the business, I want to think a little bit more about what tech debt is. It’s hard to say what tech debt is, so let’s answer the easier question instead. What do we do here? This is a wonderful graph that the VP of Engineering at Honeycomb put up in a blog post about understanding what goes into engineering work. On the left side, we have all the things that are customer facing, the features we actually build. Also, customer escalations and incident response, bug fixes. On the right side is all of the stuff that’s behind the scenes. Customers only see that part when something goes wrong: toolchains, refactors, training, testing, compliance, upgrades. Let’s split it the other way, top and bottom. We think engineers are responsible for writing code, that’s true. The stuff up top, this is in code. This is adding new features, removing features. It’s also the upgrades. On the bottom, we have all of the attributes of running a system in production that are not in code. This is going to be part of your CI/CD system and your deploys, your training, your QA. Also, your incident response, the things you need to do that are part of owning your work in a modern SaaS system. All of this is part of an engineer’s regular job.

Which one is tech debt? I think of tech debt normally in terms of refactoring, of dependency upgrades, of work that you have to do in order to reform the product to fit some new business shape. Really, every category outlined in red here can be described as technical debt. I loved one of the examples from yesterday about doing a feature flag triage and leaving your feature in there just long enough so that you can reuse your feature flag when you’re putting in the next one. You delay removing some code. It’s not just the stuff in the back. It’s not just the stuff that you think of in terms of refactoring. There’s a problem here: when we’re using a term so consistently, and we all have different definitions, it really muddies the conversation. When I’m talking to somebody, and I say, here’s this piece of tech debt, we need to spend some time on it, they hear something that’s in a completely different portion of the circle. It does not aid the communication. Perhaps we can just stop using the term. We’re going to need to talk about work in a business context anyway. The goal of this workflow is the same as getting it on the roadmap.

Language is Important

As engineers, one of the things that we do is translate language that we see into understanding how that will affect us and our systems. There’s an observability company out there whose website has a big banner on the front, “Find your most perplexing application issues.” When I look at that, when we as engineers look at that, we’re like, how? I want to fit that into my product. I want to understand how I can rely on it, where I can rely on it. Where it will help me, where it won’t. How do you do this? What is it? We’ll make a translation in our head: “Honeycomb is a tracing product.” I believe that is a good example of what goes on in our heads. This translation is very useful when we’re building systems. When we’re trying to explain business value, it leads us to talk about, we need to upgrade the database. Not so helpful. We’re here to flip that around a little bit. As we are trying to understand how to prioritize and schedule the work that we see within engineering, we need to understand how to switch from talking about what it is that we need to do, upgrade the database, and start talking about why we need to do it so that we can connect it back to this business case. Sometimes that takes a couple of tries. We talked about needing to upgrade this database. Or maybe not. Do we, really? Let’s ask the team, why do we need to upgrade this database? They tell us, actually, it’s hitting its end of life. It’s an EoL database, we need to get off of it. Do we, really? It’s still serving our queries. In this mythical world, even that isn’t good enough. Let’s say that our hosting provider says, we will not run EoL software, so you have four months, and then we are shutting it off. Now we see, ok, what happens when they shut off our database? The entire product stops working. We need to upgrade that database, because we want the product to continue. If you go to your sales reps and say, would you like our product to stop working next September? They will say, no, please schedule the work you need to do. I would like you to fix that, how can I help? Who should I talk to? I will help you get this work scheduled, because it is clearly connected to ensuring that the business can continue. Hosting providers usually don’t go in there and flip off your database just because you’re running an EoL version of MySQL.

I want to tell another story. This one’s a little different, because I didn’t actually get the business case until after it had happened. We were running an old version of a LaunchDarkly library. LaunchDarkly does feature flagging. It was fine. It worked. One team had some extra time and a thing they wanted to try, so they upgraded it. There was a shift in the model of this library from thinking about a user as the thing that manages a flag, to thinking about a context. Intentionally more broad. The context pulls in all sorts of other attributes of where this flag is being executed. This team used that to flag on their feature for a single node and get a very nice comparison graph. This was awesome. Within just a short period, many other teams were doing the same thing and getting to see, ok, I can do a percentage rollout, rollback. I can flip it on to these. We were used to only flipping flags on by user, and being able to flag on some of the data that is more relevant to our application opened up a whole new window. We saw, we can safely canary changes in a way that we couldn’t. We can do A/B testing in a way that we couldn’t. This is amazing. If we had been able to see that business case, that we can make our engineering teams able to deploy their software more safely, more quickly, with more confidence around the change, and with better understanding of the impact of the change, we would have been able to prioritize that work. It was lucky that it was not so much effort and that one team had the space to make it happen. This idea of finding the business case in order to do the work, sometimes it takes some effort to understand what is going to happen with this. That can feel a little squishy.
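To make the user-versus-context distinction concrete, here is a minimal, generic sketch of context-based flag evaluation in Python. It is not LaunchDarkly's SDK or API; the function name, the rule format, and attributes such as `node` are all hypothetical, chosen only to show how targeting a single node and doing a percentage rollout can share one evaluation path.

```python
import hashlib
from typing import Dict, Mapping, Set


def flag_enabled(
    flag_key: str,
    context: Mapping[str, str],
    match_rules: Dict[str, Set[str]],
    rollout_percent: int = 0,
) -> bool:
    """Decide whether a flag is on for a given evaluation context.

    context         -- attributes describing where the flag is evaluated
                       (hypothetical examples: "key", "node", "team")
    match_rules     -- attribute name -> values that force the flag on
    rollout_percent -- share (0-100) of remaining contexts that get the flag
    """
    # Explicit targeting: any matching attribute turns the flag on.
    for attribute, allowed_values in match_rules.items():
        if context.get(attribute) in allowed_values:
            return True

    # Deterministic percentage rollout: hash a stable key into a 0-99 bucket
    # so the same context always gets the same answer.
    bucket_source = f"{flag_key}:{context.get('key', '')}".encode()
    bucket = int(hashlib.sha256(bucket_source).hexdigest(), 16) % 100
    return bucket < rollout_percent


# Canary on a single node, with nobody else enrolled yet.
canary = {"key": "request-123", "node": "node-7"}
other = {"key": "request-456", "node": "node-8"}
print(flag_enabled("new-write-path", canary, {"node": {"node-7"}}))
print(flag_enabled("new-write-path", other, {"node": {"node-7"}}))
```

Because the context is just a bag of attributes, the same mechanism supports the single-node comparison described above and a later gradual rollout, simply by raising `rollout_percent`.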

Back It with Data

We are engineering teams here. We have a magical power. We can create data. We can find data. Our systems give us data. If they don’t have the data we need, we can make changes to those systems. We can commit code to get more data. We can use this to back up all of our arguments. When we get one of these conversations like, I think it’s going to do this, and someone else says I think it’s going to do that. We can find out. SLOs, top of the list there. They’re my favorite tool for mapping business value back to a technical metric, because in their formation, they are representing the experience of the user. They’re encoding, what does it mean for a given experience to be successful or not? How many of those experiences are we allowed to have fail? Over what period of time before we hit a threshold where our users and our customers get angry and leave?
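The questions an SLO encodes (what counts as success, how many failures are allowed, over what window) map directly onto a small calculation. The following is a minimal sketch, not Honeycomb's implementation; the class name, fields, and the example target are assumptions made for illustration.

```python
from dataclasses import dataclass
from fractions import Fraction
from math import floor


@dataclass(frozen=True)
class SLO:
    name: str
    target: Fraction   # share of events that must succeed, e.g. 999/1000
    window_days: int   # rolling period the target applies to

    def allowed_failures(self, total_events: int) -> int:
        """Error budget: how many events may fail in the window."""
        return floor(total_events * (1 - self.target))

    def budget_remaining(self, total_events: int, failed_events: int) -> Fraction:
        """Fraction of the error budget still unspent (negative if overspent).

        Returns 0 when the budget itself is zero events.
        """
        allowed = self.allowed_failures(total_events)
        if allowed == 0:
            return Fraction(0)
        return 1 - Fraction(failed_events, allowed)


# Hypothetical SLO: 99.9% of ingest requests succeed over 30 days.
ingest = SLO(name="ingest success", target=Fraction(999, 1000), window_days=30)
print(ingest.allowed_failures(1_000_000))
print(float(ingest.budget_remaining(1_000_000, failed_events=250)))
```

Exact fractions are used instead of floats so that the budget does not drift by one event through rounding. A burn-down of `budget_remaining` over time is the kind of number that can be put next to a business case.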

I would love to spend the rest of the time talking about SLOs. Liz Fong-Jones did a talk, it’s published on InfoQ, on production excellence with a fantastic segment on SLOs. They’re not alone in our sources of data, we have all of our metrics showing various limits of our services. We get to do chaos experiments. When we have a service, and we’re not quite sure what that limit is, and we need to know more, we can artificially constrain that service in order to push its boundaries and find out what we would otherwise find out in an incident, six months down the road. Of course, there are those incidents. They really are two sides of the same coin. I don’t want to leave out qualitative surveys. Often a piece of technical engineering work that fits into this category of tech debt may feel like its target is not a specific amount of money for the business, but facilitating one’s experience in a certain part of the code base. There’s this idea of haunted graveyards, the section of the code where people fear to tread, because it’s dangerous, it’s complicated. Cleaning those up is valuable, to allow the business to move forward. Some qualitative surveys can really lend credence to this as you ask your engineers, how was it to build this new feature? How was your feature flag experience? With all of this data, it’s easy to fall back into those old habits. We need to upgrade the database because the CPU is too high. Who cares? You got to bring it back in business context, take that language about the CPU is too high and translate it to, what effect will that have on our ability to run our business? We’re back to our list here. We want to get this work prioritized. To do so, we understand what is important to the business here, now. We take the projects we need to do that we see are obvious and understand how to translate that into something that’s based on impact, rather than the technical change that we wish to make. We find the data we need to back it up.

I want to take you back to the Honeycomb experience of a year and a half ago, with this system that has been growing, and it’s starting to creak and fail in other interesting ways. Let’s run this playbook. Let’s see what happens. Start with understand the priorities. Honeycomb is a SaaS business. We are software as a service. ARR rules everything, this is annual recurring revenue. This is how much people are paying us, whether it’s monthly, or yearly. It includes people that decide to leave, and end their contract with Honeycomb. It is a big deal. ARR is this number that we have on our fundraising decks that we talk to the board about, that we make projections for. This is what we are organizing our business around, saying we’re going to get to this amount of revenue and we’re going to do it in this amount of time. I know there are many talks here about reducing costs and expanding revenue. For many SaaS companies earlier in their life, the balance is heavily skewed towards revenue. ARR is the thing, so let’s use it.

We get a dollar number of our ARR target, but that doesn’t translate to upgrading a database. By talking to the sales and product teams, we get more detail. Because this number is such a big deal, there’s a lot of research and work that goes into building it. It’s not just a number. It’s not just throwing a dart or watching a curve and saying, we’re going to make it there. No, they know we’re going to get this many customers of this type and we’re going to see this many upgrades and we’re going to see this many folks leave. All of this comes together to form that top line number. It’s wonderful that we have descriptions of who we’re going to get. That still doesn’t help, because I don’t know how to work with dollars, but they have more answers. I ask, ok, you’re getting 10 more enterprise customers, what does that mean? Which of our current customers can I model this after? How are they going to use our product? They came back with answers. They’re like, ok, this group of customers, they’re going to be focused more on ingestion and they’re going to use fewer of these resources. These other ones over here, they’re going to be more of a balance. Those ones there are going to be very heavy on the API. Now we’re getting onto my turf. We’re getting to numbers that I can see, numbers that I have a grasp of. I can now use this information to connect it back to the business value that’s going to come from each of those things. We need to take this data and map it to our infrastructure. Honeycomb has a macroservice architecture. We have a smallish number of relatively large services, they’re all named after dogs. We’re applying these to our infrastructure here.

How Does This Service Scale?

For each of these teams that have a dog under their care, we ask these teams, please spend some time understanding your service, writing about your service, and reflecting upon how it fits in here. Start with just an overview of the service. Then think about where the bottlenecks are, how the service scales. Then look at the sales targets that we have these numbers for, and write down which ones of those are applicable to your service. Maybe ingestion matters, maybe it doesn’t, depending on what the service consumes, who it talks to. Some are important, some aren’t. This is where we involved all of engineering. Every team did this. It didn’t take all that long. Our SRE team is already embedded in each of them, and they helped by guiding them through some of these processes to write down and give a unified format to these reports. As an example, our primary API service, this is the front door of Honeycomb’s ingestion. It accepts all of the telemetry from our customers. It does some stuff to it, authentication, authorization, validation, transformation. It sends it along to the downstream service, which is a Kafka cluster, for transmission to our storage engine. It has some dependencies. It depends upon a database for authentication tokens. It depends upon a Redis cluster for some rate limiting and things. The upstream services are our customers. The downstream service is Kafka. The sales target affecting the service is clear. It’s the telemetry that our customers are sending us. When thinking about how it scales, it is a standard horizontally scaling service, stateless. These are the canonical scaling things. It can grow as wide as it wants, almost. Through some past experience, both planned and unplanned, we’ve discovered that as the number of nodes grows, the load on the database grows. When we get to about 100, the database starts to slow down, which means you need more nodes, which makes the database slow down, and it goes boom. That is clearly a scaling limit. We have it written down.
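The unified report format described above could be captured as a small data structure. This is a sketch under assumptions: the class and field names are invented here, not Honeycomb's actual template, and the second scaling limit in the example is a hypothetical placeholder added to show how an unmeasured limit would be recorded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScalingLimit:
    description: str
    # Where the limit bites, if known. None means "not measured yet".
    threshold: Optional[str] = None


@dataclass
class ServiceScalingReport:
    service: str
    overview: str
    dependencies: List[str] = field(default_factory=list)
    upstream: List[str] = field(default_factory=list)
    downstream: List[str] = field(default_factory=list)
    sales_targets: List[str] = field(default_factory=list)
    scaling_limits: List[ScalingLimit] = field(default_factory=list)

    def unknown_limits(self) -> List[ScalingLimit]:
        """Limits without a measured threshold: candidates for further testing."""
        return [limit for limit in self.scaling_limits if limit.threshold is None]


# The API service example from the talk, filled into the hypothetical format.
api_report = ServiceScalingReport(
    service="primary API service",
    overview="Front door for ingestion: authenticates, validates, transforms, forwards telemetry.",
    dependencies=["database (authentication tokens)", "Redis cluster (rate limiting)"],
    upstream=["customers"],
    downstream=["Kafka cluster"],
    sales_targets=["telemetry volume sent by customers"],
    scaling_limits=[
        ScalingLimit("database load grows with node count", threshold="about 100 nodes"),
        # Hypothetical entry, not from the talk: an unmeasured limit.
        ScalingLimit("rate-limiting capacity of the Redis cluster"),
    ],
)

for limit in api_report.unknown_limits():
    print(f"Needs testing: {limit.description}")
```

Keeping every service's report in the same shape is what makes it possible to line the limits up against the sales targets later.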

Now comes the fun question. We found the scaling limit. That’s great. We have to fix it, obviously. Maybe? Let’s take these sales targets and say, where are we going to be on this curve of our sales target? When are we going to hit this scaling limit? Is it within the next year? Do we need to fix it now? Is it a problem for future us? Let’s decide. For this one, as it turns out, we expect it to hit that volume by mid-year. The database thing was one that we had to focus on. We got this whole report, it did highlight the database load that we will hit before reaching our sales targets, also called out a couple other ones, and some that we don’t really know about. Those are a cue for running some chaos experiments, or other ways of testing our limits.
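Placing a known limit on the sales-target curve can be approximated with a simple projection. The sketch below assumes node count grows at a constant compounding monthly rate derived from the sales targets; the starting node count and the growth rate are made-up numbers, and only the limit of roughly 100 nodes comes from the talk.

```python
from typing import Optional


def months_until_limit(
    current_nodes: float,
    node_limit: float,
    monthly_growth: float,
    horizon_months: int = 24,
) -> Optional[int]:
    """First month in which the projected node count reaches the limit.

    Returns None if the limit is not reached within the planning horizon,
    meaning the problem can be left for later.
    """
    nodes = current_nodes
    for month in range(1, horizon_months + 1):
        nodes *= 1 + monthly_growth  # compounding growth, an assumption
        if nodes >= node_limit:
            return month
    return None


# Hypothetical inputs: 60 nodes today, 10% growth per month, limit near 100 nodes.
hit = months_until_limit(current_nodes=60, node_limit=100, monthly_growth=0.10)
if hit is None:
    print("Limit not reached within the horizon: a problem for future us")
else:
    print(f"Limit reached in about {hit} months: schedule the fix now")
```

A `None` result is as useful as a number: it is the evidence for deciding that a known limit does not need work this year.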

The Business Case

The business case came down to this one. It was very clear. We could go back to our sales team and say, we need to schedule some work to mitigate the load on this database as the service scales out. If we don’t, we will not hit the sales targets that you gave us. The answer is very easy. Yes, please schedule that work. How do we need to do it? We did this for all of our services and came up with a lovely Asana board showing some things to fix, some things to care about now. Some things that are probably relevant in the year. Some that aren’t clearly like yes or no, but might extend the runway, so we can keep them in our back pocket and understand that if we see something creeping up, that’s a thing we can do to make it work. There’s one tag, it’s called explody. You’ll see the label explody on some of these.

As an organization, there’s this fun side effect to this exercise that we got to talk about a couple of things that we don’t always cover in the regular day-to-day engineering. Explody is one of them. That’s a description of this API service horizontal scaling thing that I was talking about. It grows and you don’t really see much until you get to that threshold and then boom, everything comes crashing down. Those problems are scary and huge. Also, you don’t usually have a lot of time to react between once that cascade starts and when you hit the boom. It helps you understand a little bit about how to schedule that kind of work, the non-explody ones. As an example, if you have a service that runs regularly scheduled jobs, once per minute you run a job. Over time, that job takes longer, until it takes like 65 seconds. When it’s overlapping, the next one doesn’t start. That’s not explody. Some jobs won’t run, most will. It will degrade, but it will degrade slowly and it will degrade in a way that you can go back to those SLOs, and use that to help you figure out when to schedule it. Equally important problem, slightly different method of scheduling.
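The slow degradation of the once-per-minute job can be illustrated with a short simulation. This is a sketch, assuming a scheduler that simply skips a tick when the previous run is still in progress; the run durations are invented.

```python
from typing import Sequence


def skipped_runs(durations_s: Sequence[float], interval_s: float = 60) -> int:
    """Count scheduled ticks that are skipped because the previous run is still going.

    durations_s[i] is how long the job would take if it started at tick i.
    """
    busy_until = 0.0
    skipped = 0
    for tick, duration in enumerate(durations_s):
        start = tick * interval_s
        if start < busy_until:
            skipped += 1  # previous run overlaps this tick, so it does not start
            continue
        busy_until = start + duration
    return skipped


# Hypothetical hour of runs: mostly 55 seconds, occasionally creeping to 65.
durations = [55, 65, 55, 55, 65, 55] * 10
missed = skipped_runs(durations)
print(f"{missed} of {len(durations)} runs skipped "
      f"({100 * (len(durations) - missed) / len(durations):.0f}% ran)")
```

The share of runs that actually execute falls gradually as more runs exceed the interval, which is exactly the kind of ratio an SLO can track to signal when the work needs to be scheduled.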

The last bit here, I don’t know how many of you have had a conversation with somebody when designing a system. We look at this diagram and this architecture and say, “No, that can’t scale. No, we can’t do it this way, that won’t scale.” It’s a struggle, because, yes, maybe it won’t, but do we care? Will the system work for us for now? There’s the classic adage of don’t design things the way Google does, because Google has 100 million times your traffic, and then you will never encounter those problems. If you follow that design, you’re doing yourself a disservice. These scaling problems, which ones to ignore, this mapping of sales targets to scaling limits is a very strong way to have that conversation and make it very clear that from the business perspective, we can wait for that one.

Close the Loop

We’ve got ourselves a strong business case using the priorities of the business, time to fix some of these problems, and the choice to put some of the other ones off. We are now done. We have gotten this work scheduled. Almost. There’s one more step here, which isn’t so much about getting the work scheduled in the first place but is about making our jobs as an engineering team, especially on a platform team, a little bit easier the next time around. We need to celebrate. We need to show the success of this work that we’re doing. When we find one of these projects, and we finish it, we need to make it very clear that we put engineering effort towards a business problem in a way that has given us the capacity to grow the business in the way that we need to. Then, when we have finished it and it’s gone on, and we pass that threshold that was previously impossible, we can call it out again and say, remember this? Yes, we couldn’t have gotten here without it. This stuff is really important. It lends credence to our discussions in the future about what’s going to be happening when we have more technical work to schedule.

The last one, this was a fun experience from a near miss incident. There were some capacity issues, and we were struggling to grow the capacity of the service quickly enough. During the course of this near miss, one engineer bemoaned, it takes 3 whole minutes to spin up new capacity for this service. This engineer had been at Honeycomb for a couple of years, but that was still recent enough that they didn’t remember the time before we were in a Kubernetes cluster, when new capacity took 15 minutes to spin up. We put up a big project here. We’re going to change the way we’re doing our capacity and the way we run our services. We’re going to set a target of 10 minutes for new capacity, and new services to come up. No, let’s be aggressive, let’s set it to 5. Here we were recognizing it takes 3 whole minutes? This is great. This is amazing material because it gives us an opportunity to reflect on where we’ve come. For a lot of the platform work, for a lot of the tech debt work, that work is almost invisible. We need to find ways to bring it to light so that we can see as an organization the effect that this work is having on our longevity, on our speed, on our ability to deliver high quality software.


This is the recipe for getting technical debt work onto the roadmap. Understand what your business needs. Understand the impact of the projects you need to do to reduce that technical debt. Own the success of that business. Frame it in a way that communicates that priority, not just the technical work. Use our power of data to back it up. Remember to talk about the effect it had once you’re done and that effect has been realized, when the business is exceeding the limits that it previously saw.

Questions and Answers

Participant 1: You didn’t bring one thing that was establishing a shared language or a shared vocabulary or framework for presenting the business benefits and things. What about scenarios where you have potential security issues? A scenario where I need to upgrade the library, because there is the potential for every bonus point on x. At which point now you’re just projecting what the potential damage should be. How do you solve that?

Hartshorne: How do we express security concerns in the language of business impact? I will admit, our security team is very good at that. I am not so good at it. There are a couple of ways. For larger businesses, there are frameworks that they’re required to operate under for security. There are agreements that one makes with both sub-processors and with customers around the response time to certain types of issues. You can use that to build an argument. If we are contracted to respond to a security incident within 30 days of being notified, we have to. You were asking about upgrading libraries because of potential flaws. That is definitely fuzzier. I was talking about how the projections that you’re going to make are bets. How much are you willing to bet that this is going to have that impact? Is it a cup of coffee? Is it a house? Framing security in terms of bets, it carries a little bit of a different feel to it. Yet, that is what’s going on. None of our services are 100% secure. We are doing risk analysis to understand which risks we are willing as a business to undertake and which we’re not. Do we have the recourse of the legal system if this particular risk manifests? Our business, and I think most businesses, do have large documents describing their risk analysis for the current state of the world. The risks from inside, the risks from outside, the risks from third-party actors and so on. I think all of those are tools one can use to better assess the risk of a particular change. Those all come straight back again to business value.

Participant 2: I’ve often found that one of the hardest types of tech debt problems is to get to move on, to get buy-in. It’s the kind of problem where the benefit you’re going to get from solving it is freeing up your engineers to improve the velocity. You’re not solving tasks that really should be automated, but the amount of work to do and to automate it is going to take a couple of weeks. We don’t do it, and in exchange for that sometimes we do it manually once every couple of weeks, and it takes us a few days to do it. It’s like an invisible cost from outside of the engineering organization, because you know that it’d free up a whole lot more engineering bandwidth to work on features and other things as a part of [inaudible 00:45:51]. I’m wondering if you have any thoughts on how to frame that as a business objective that would actually [inaudible 00:46:00].

Hartshorne: How do we schedule work that is toil that our engineers are doing successfully, because the job is getting done? There are a couple of ways. The first I want to highlight is the product philosophy of incremental improvements, shipping the smallest thing. There’s a wonderful diagram of how to build a car. One path starts with a set of wheels, and then a frame, and then some shell, and then the engine. At the end of it, you have a car, which, great, you can’t do anything until you get all the way to the end. The other says, maybe start with a skateboard, graduate to a bicycle, maybe a motorcycle, and finally, you’ll make your way to the car. I reference this, because a lot of those types of toil tasks can have small improvements that incrementally transform the task. As an example, you might start with a person that needs to roll the cluster by hand, and they go through and they kick each server in turn, and on they go. Simply writing that down in clear language, forming a checklist for it makes the next version faster. You can take two of those checkboxes and transform them into a script and it makes it faster. Each one of those is significantly easier to find the time to do than the entirety of the toil. That’s the first one.
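The progression from written checklist to partial script can be sketched in a few lines of Python. This is a minimal illustration, not Honeycomb’s tooling: the `Step` type, the `run_checklist` function, the step descriptions, and the `drain` helper are all hypothetical names invented for the example. Each step starts out as a written instruction a person confirms by hand, and any single step can later be given an `action` without touching the rest.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One line of the written checklist (hypothetical structure)."""
    description: str
    # None means the step is still done by hand; a callable means it is scripted.
    action: Optional[Callable[[str], None]] = None


def run_checklist(steps: List[Step], server: str,
                  confirm: Callable[[str], object] = input) -> List[str]:
    """Walk the checklist for one server.

    Scripted steps run automatically; manual steps ask the operator to
    confirm. Returns the descriptions of the steps still done by hand.
    """
    still_manual = []
    for step in steps:
        if step.action is not None:
            step.action(server)
        else:
            confirm(f"[{server}] {step.description} (press Enter when done) ")
            still_manual.append(step.description)
    return still_manual


def drain(server: str) -> None:
    # Placeholder for whatever really drains a server; assumed for the example.
    print(f"draining {server}")


# Version 2 of the checklist: only the first checkbox has become a script.
steps = [
    Step("Drain traffic from the server", action=drain),
    Step("Restart the service"),
    Step("Check the health endpoint"),
]

# A non-blocking confirm keeps this demo runnable without a terminal;
# in real use the default (input) makes the operator acknowledge each step.
remaining = run_checklist(steps, "server-1", confirm=print)
print(f"{len(remaining)} of {len(steps)} steps are still manual")
```

The list returned by `run_checklist` doubles as a to-do list: each remaining manual step is a small, separately schedulable piece of work.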

The second is to bring up again the idea of qualitative surveys. We ask our on-calls, every time they start a shift, and every time they finish a shift, a number of questions. It’s a light survey. Did you feel prepared for your shift? Do you think the level of alerting we have is too little, too much, or just about right? What came up during this shift? Did you feel stressed about any particular bit? What was the hardest part of this shift? Collecting those on a quarterly basis gives us some really good and unexpected insight into some of the types of toil that are coming up. Maybe it’s a flappy alert. Maybe it’s some service that needs to get kicked and needs some improvements in its operational controls. Those are data. We can use those to say, our on-calls are spending 40% of their time chasing down false positives. This is wasted effort. That brings it back to expensive engineering time, and is, I think, a pretty good road into a compelling argument to spend some time improving the situation.
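As a rough sketch of how survey answers could turn into a number like that, the Python below aggregates a quarter of end-of-shift responses. It assumes the survey also asks on-calls to estimate hours spent on alerts and hours spent on false positives; that question is not in the list above, and the `ShiftSurvey` fields, the function name, and the sample figures are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ShiftSurvey:
    """One end-of-shift response (hypothetical fields)."""
    engineer: str
    hours_on_alerts: float           # self-reported estimate
    hours_on_false_positives: float  # subset of hours_on_alerts


def false_positive_share(surveys: Iterable[ShiftSurvey]) -> float:
    """Fraction of alert-handling time that went to false positives."""
    surveys = list(surveys)
    total = sum(s.hours_on_alerts for s in surveys)
    if total == 0:
        return 0.0  # no alert time reported, so nothing was wasted
    wasted = sum(s.hours_on_false_positives for s in surveys)
    return wasted / total


# Illustrative numbers only, not real survey data.
quarter = [
    ShiftSurvey("a", hours_on_alerts=10, hours_on_false_positives=4),
    ShiftSurvey("b", hours_on_alerts=5, hours_on_false_positives=2),
    ShiftSurvey("c", hours_on_alerts=5, hours_on_false_positives=2),
]

share = false_positive_share(quarter)
print(f"{share:.0%} of alert time went to false positives")
```

Multiplying that share by total on-call hours and a loaded hourly cost is one way to express the same finding as engineering time lost.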
