Presentation: Deconstructing an Abstraction to Reconstruct an Outage

Chris Sinjakli

Article originally posted on InfoQ.

Transcript

Sinjakli: You’ve spent your day writing some code. You’ve committed your work and pushed it up to GitHub. You’ve created a pull request, and you’re going to wait for someone to review it. While you wait for that review, you decide, why not refill my coffee, today is going well. Then you get a call from your least favorite person. You quickly acknowledge the page and go back to your desk. When you open the dashboard for the relevant system, you’re greeted by something like this: your lovely functioning system has fallen on the floor. In fact, you’re not serving any API traffic anymore. When you look at your error tracker, your heart sinks: none of your app servers can connect to the database, because the database is down. We’re going to take an incident just like that, and put it under the microscope.

My name is Chris. During the day, I work as an infrastructure engineer, which is a mixture of software development and systems work. In particular, databases and distributed systems are the areas I find most interesting in computing. It probably doesn’t surprise you that I work at a database as a service company called PlanetScale. We build a managed MySQL database platform that scales horizontally through built-in sharding. The events that I’m going to talk about happened at my previous job, when I was working in the infrastructure team at a payments company called GoCardless. The idea of today’s talk is that we’re going to look behind the scenes of something that we often take for granted: the database. We’ll explore the aftermath of a complex outage in a Postgres cluster, and dive through the layers of abstraction that sit beneath it, all with the goal of being able to reliably reproduce the failure so we can fix it.

Cluster Setup

Before we can dive into the outage, it’s going to be useful for you to have a high-level understanding of what that cluster looked like. What we had was a fairly standard setup. Our API backends would talk to a Postgres server. The data from that Postgres server would be replicated across to two other nodes. As well as Postgres itself, we ran a piece of software called Pacemaker on those nodes. It’s an open source cluster manager, which handles promoting new primaries in the event of a failure. It also managed the placement of a virtual IP address, which clients would connect to. Whenever the primary moved, it would move that virtual IP address so that the clients would know where they should connect. Let’s say that the primary fails: Pacemaker would demote it, turn one of the replicas into a new primary, and move the virtual IP address across. The application would see a short interruption while all of that happened, and then reconnect to the new node. After some time, you’d bootstrap a replacement machine, and it would replicate from the new primary. Something to note about this configuration is that we always had one synchronous replica. What that means is that before we acknowledged a write to a user, we had to make sure that it was on at least one of those two replicas as well as the primary. This is important for data durability, and it’s going to be relevant later on in the talk.
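
To make the synchronous part concrete, here is roughly how that requirement is expressed on the primary in postgresql.conf. This is a sketch using Postgres 10+ syntax and made-up replica names, not our actual configuration:

    # Don't acknowledge a commit to the client until at least one of the two
    # named replicas confirms it has that commit's WAL safely on disk.
    synchronous_commit = on
    synchronous_standby_names = 'ANY 1 (replica_a, replica_b)'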

Cluster Automation Failure

Unfortunately, as you may have guessed by the existence of this talk, on the day of the incident, none of what I just described worked. On that day, at 9 minutes past 3:00 in the afternoon, the RAID controller on the primary Postgres node failed. It reported a simultaneous loss of three disks, which is enough to break the array and stop all reads and writes going to it, which meant that the primary was down. No worries. That’s where the automation should kick in, except it didn’t. We made many attempts to get it to work. We told it to rediscover the cluster state. We told it to retry any failed operations again and again. We even powered off the node with the failed RAID array, in case that was somehow confusing Pacemaker. None of it helped. All the while, our API was down. After an hour, we ended up calling time on trying to fix that cluster automation and fell back on configuring a primary and replica by hand. One of the team logged into the synchronous replica, told Pacemaker to stop managing the cluster, promoted that node to primary, and configured replication across to the remaining node by hand. They pushed the config change to the API servers so that they would connect to that newly promoted primary. Lastly, they bootstrapped another replica so that we were back to three nodes in case something else failed. In the end, the outage lasted two hours.

Recreating Outage Away from Prod

For now, we were safe. We were up and serving traffic to our customers again. But we were only one failure away from downtime. While our cluster automation had let us down that day, there was a reason that we had it, and now we were running without it. In the event of another hardware failure, we’d need one of our engineers to repeat that manual failover process. How could we fix the issue with the clustering software and regain confidence in it? At this point, we had one single mission: recreate the outage away from our production environment, so that we could come up with an appropriate bug fix and re-enable the clustering software. I’m about to introduce the main factors that we focused on in that recreation. There’s a fair bit of complexity here. Don’t worry, we’ll get into it step by step.

First off, there was the thing that kicked it all off: the RAID array losing those disks. As a result, that array couldn’t serve data: it couldn’t read, it couldn’t write. Second, we saw that the kernel marked the file system on that RAID array as read only. At this point, even reading was wishful thinking, because the array was gone. Third, we saw that Pacemaker detected the failure of the primary node. It did detect it, but for some reason it wasn’t able to promote one of the replicas. Fourth, we noticed that a subprocess of Postgres crashed on one of those replicas. This is a quirk of Postgres: it’s actually made up of many different subprocesses, and if any one of them crashes unexpectedly, the postmaster restarts the whole thing so that you’re back in a known good state. That happened, and Postgres came back up after that crash on the synchronous replica, but for some reason Pacemaker still didn’t promote that replica. The fifth and last thing we noticed was another suspicious log line on the synchronous replica. This error message really caught our eye. It’s a little bit weird. Something just seemed suspicious about this whole ‘invalid record length’ at some offset, so we added it to the list of possible contributors to the outage.

Looking at these five factors, we potentially had a lot of work to do. How would we choose a starting point? We made an educated bet that we could set aside the RAID array failure itself and the corresponding action taken by the kernel. While this was the initial cause of the incident, we had a hunch that we could achieve similar breakage through something easier to control; it’s not easy to faithfully recreate a very strange failure of a piece of hardware. We focused instead on how we could cause the same end impact to the cluster. While it seemed interesting enough to come back to, we also put point 5 on hold for a while. The reason is that points 3 and 4 were lower-hanging fruit; they were easier to reproduce, and we could come back to point 5 if we needed it. For points 3 and 4, we turned to everyone’s favorite fault injection tool. It’s something you’re all probably very familiar with, and have probably run in the last week: the Unix kill command. It is the simplest and most frequently used tool for injecting faults into systems. To make it easier and quicker to experiment, we ran a version of this stack in Docker on our laptops. This let us very quickly set the whole thing up, play different actions against different nodes to see if we could break it, and then tear it down and spin it up again, which made things very quick to iterate on. To simulate that hard failure of the primary, we used SIGKILL, or kill -9. Then we sent a SIGABRT to the relevant subprocess on the synchronous replica, matching what we’d seen in production. We ran the script with just those two faults being injected, and we didn’t get our hopes up; it seemed unlikely that this was going to be enough. We were right. With just those two kill commands, the cluster didn’t break in the same way that it did in production.
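
Concretely, the two injected faults amounted to something like the following. This is a sketch rather than our actual tooling: the container names are made up, and which Postgres subprocess you target depends on what you saw crash in your own logs.

    # Hard-kill the primary's postmaster, simulating the sudden loss of the node.
    docker exec pg-primary sh -c 'kill -9 "$(head -1 /var/lib/postgresql/data/postmaster.pid)"'

    # Crash one Postgres subprocess on the synchronous replica with SIGABRT; the
    # postmaster notices and restarts the whole instance. The walreceiver is used
    # here purely as an example of a subprocess to pick on.
    docker exec pg-sync-replica sh -c 'kill -ABRT "$(pgrep -f walreceiver)"'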

What Do We Mean By Log?

We went back to our list of what we’d seen go wrong in the incident. We decided that the next thing to focus on was that suspicious log message that we’d seen on the synchronous replica. This is where we start to dive into Postgres internals. We need to understand exactly what these two log lines mean in order to recreate them. I’d like to start by talking about what we mean by log in the context of a database. What we normally mean by log is something like this. These are the kinds of logs that you might see on a web server that’s hosting a blog. Someone’s visited the blog, logged in, posted something. It’s actually a blog post about this talk, which someone then visits. When we’re talking about databases, we mean a different thing by logs: we mean binary logs. What are those? Let’s take a really simple example of inserting some users into a database. At some point, that data has to end up on persistent storage, which could be a hard disk or, more likely, unless you’ve time traveled back to 2005, a solid-state drive. The explanation I’m about to give of these binary logs is a little bit simplified, but not in a way that distorts the truth. Whenever you run a query that adds or modifies data in Postgres, like an insert or an update, the first thing that the database does is write a record of that change onto disk. This is before it updates anything in the actual table data, or any indexes that reference it. In a lot of databases, these are known as write ahead logs, and Postgres is one of the databases that uses that term. The reason for that term is that you write every change into them ahead of writing it to the database. Why bother doing that? Why have these write ahead logs in the first place? The reason that they exist is that databases need to be crash safe. They need to preserve data written to them even if they crash. That crash can happen at any time; you don’t get to control it.

Let’s go back to our table and let’s say that it’s got this ID column, which is the primary key. Being a primary key, it’s got an index with an entry for each record present. When we go to insert our third row, let’s say the database crashes right then, before adding it into the index. What data are you going to see if you query the table now? Postgres is probably going to use the index to do this lookup. The value is not in the index, so the database will assume that the value has not been inserted. This third record, turing, is effectively just invisible, even though we told the user that it was inserted. The reason for write ahead logs is that when the database starts back up after a crash, we can go into them and replay the missing operations. When Postgres starts back up, the first thing it does is play back the last section of the logs, which puts the database back into a consistent state, and makes queries return consistent data. The other thing that write ahead logs get used for is replication. If you remember our setup from earlier, we had these two replicas being fed with data from the primary. The data which feeds those replication streams is the same data that’s in the write ahead logs.
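
If you want to see those records for yourself, Postgres ships a tool that prints them in a human-readable form. A quick sketch, with an illustrative segment file name (on versions before 10 the tool is called pg_xlogdump and the directory is pg_xlog rather than pg_wal):

    # Print the write ahead log records contained in one segment file
    pg_waldump "$PGDATA/pg_wal/000000010000000000000001"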

Why The Suspicious Log Caused a Failure

Now that we understand what we mean by write ahead logs, I’d like to take a moment to explain why we thought this might be relevant. Why might this have caused the failure of the promotion of the synchronous replica? To do that, we need to understand the first log line now. What’s going on here? Why is this synchronous replica pulling logs from an archive of some sort? That stems from something that I’ve not mentioned yet, but which is relevant to recreating this part of the incident: there’s a second mechanism to transfer those write ahead logs between nodes. The two mechanisms are known as streaming replication and WAL archival. Streaming replication is the one that we have already covered. It is the one where we’ve got database replicas connecting into the primary and receiving a constant stream of updates. This is the mechanism that, in fact, lets you keep a synchronous replica that is exactly in sync with the primary. What about that second mechanism, WAL archival? You can configure an extra setting on the primary node called archive_command. Whenever a write ahead log segment has been completed, that is, no more data will be written to it, Postgres will run this command and pass it the name of the completed file. You can run whatever command you like there to ship that segment somewhere, which could be a big storage server that you run, or an object store like Amazon S3, or wherever, just some storage. Then to consume those files, you configure a corresponding restore_command. This is often a way of seeding data into a new replica as it comes up, before it joins the cluster properly.
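
A minimal sketch of what those two settings tend to look like, with an NFS-style path standing in for whatever storage you actually use (and noting that before Postgres 12, restore_command lived in recovery.conf on the replica rather than in postgresql.conf):

    # On the primary: ship each completed WAL segment off the box.
    # %p is the path to the segment, %f is its file name.
    archive_mode = on
    archive_command = 'cp %p /mnt/wal-archive/%f'

    # On a replica: pull segments back from the archive when they're needed.
    restore_command = 'cp /mnt/wal-archive/%f %p'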

Why would you have these two different mechanisms, and why would you configure both of them in one cluster? There are two reasons, and they’re linked. First off, we can reduce the amount of storage that we need on each of the individual nodes. The reason we can do that is that we don’t need to keep as long a history of write ahead logs on each of them. Often those nodes will be provisioned with very expensive SSDs or NVMe disks. We don’t want to provision a bunch of storage that we’re effectively wasting most of the time. The other thing is, when a new node joins the cluster, if we had it pull a long history of write ahead logs from the primary, that would add load to the primary, which could impact user traffic. We want to avoid that. Stepping back into our incident, we can understand what happened here. We can understand that this first log line was about the replica restoring an archived WAL file from some server somewhere. That suggests that one of the final acts of the primary, before it was fully gone, was to ship off a broken write ahead log file into that external storage, which the synchronous replica then went and pulled in when it restarted after its crash. That gave us a hypothesis: maybe that invalid record length message was the reason that the synchronous replica couldn’t be promoted. This was a really plausible lead, but a very frustrating one for us, because we knew the synchronous replica already had all of the writes from the primary. It’s in the name: it’s the synchronous replica. But that’s not how debugging works. Just because something shouldn’t happen doesn’t mean it didn’t happen. We had to follow this lead, and either confirm or disprove it.

Incident Debugging

There was just one problem staring me personally in the face when I looked at this error message, and it was this: I had zero prior experience working with binary formats. Up until this incident, I’d been able to treat these replication mechanisms as a trusted abstraction, and that was across four or five years of running Postgres in production. I’d never had to look behind the scenes there. Even though it’s unfamiliar, we know that none of it’s magic. It’s all just software that someone’s written. If debugging an incident means leaving your comfort zone, then that’s what you’ve got to do. It was time to figure out how to work with unfamiliar binary data. We had a crumb of good news, though. Postgres is open source. What this means is that we can at least look through its source code to help us make sense of what’s going on in these write ahead logs. I want to emphasize something, which is that all of the techniques that I’m about to talk about also work on closed source software, where there’s no documentation or source code that tells you about the file format. We just have a different name for that: it’s called reverse engineering.

Thankfully, in our case, we get to search the code base for the error. If we do that, we can see three places where that log line gets generated. If we jump into the first of those, we find this reasonably small piece of code, this conditional. We can see the log line that we saw in our logs from production, and then we see that it jumps into an error handler. All we need to do to make this happen is to figure out how to make this conditional evaluate to true. We need to make total_len be less than the size of an XLogRecord. We don’t know what either of those is yet, but we’ll figure it out. SizeOfXLogRecord is pretty easy to find. It’s a constant, and it depends on the size of a struct called XLogRecord. We don’t know what one of those is yet, but it doesn’t matter, we can find that out later. Wouldn’t it be convenient if we could make total_len be equal to 0? If it’s 0, it’s definitely smaller than whatever the size of that struct is. It turns out, we don’t have to go far to find it. It’s in the same function that has that error handler. We can see that total_len is assigned from a field called xl_tot_len on an XLogRecord. It’s the same struct again. If we jump to the definition of XLogRecord, we can see it right there at the start.

I think we’ve got all of the pieces that we need, but we need to figure out how to tie them together. What was this check doing? If we go back to that conditional that we saw, it’s actually saying something relatively simple. Which is that if the record says it is smaller than the absolute smallest size that a record can possibly be, then we know it’s obviously broken. Remember those logs that I talked about earlier? All of the code we just dug through is how Postgres represents them on disk. Let’s see what they look like in practice. If we go back to our boring SQL that we had earlier, and we grab the binary logs that it produces, and we open them in a hex editor, we get hit with a barely comprehensible wall of data. What you’re looking at here is two representations of the same data. I want to bring everyone along to understand what’s going on here. On the left, we’ve got the data shown as hexadecimal numbers, base-16. On the right, wherever it’s possible, that’s converted into an ASCII character. What we’re dealing with here really, though, is just a stream of bytes. It’s just a stream of numbers encoded in some particular way. As humans, we’re generally used to looking at numbers in base-10, the decimal number system. That’s the one we’re comfortable with. But because hexadecimal aligns nicely with powers of 2, we typically look at binary data in its hexadecimal form. I’ve included the ASCII representation alongside because when you’re looking in a hex editor, that’s often a useful way to spot patterns in the data.
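
One small thing that helps when reading a dump like this: Postgres writes these header fields in the machine’s native byte order, which on the x86 boxes we were running means little-endian. So a 32-bit length of 63 shows up as the byte 3F followed by three 00 bytes. A quick sanity check in Ruby, using the byte values we’ll meet again shortly:

    # Interpret the four bytes 3F 00 00 00 as a little-endian unsigned 32-bit integer
    [0x3F, 0x00, 0x00, 0x00].pack("C*").unpack1("V")  # => 63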

The first and most obvious thing that leaps out from this view is some good news. We can see the users that we inserted in the database, which means that somewhere in all of those surrounding characters is the field we’re looking for. We’re looking for that xl_tot_len field, and it’s in there somewhere, we just need to find it. How can we make it more obvious? We’re trying to find a size field. What if we generate records that increase in size by a predictable amount? If we produce data that increases in length by one character at a time, then the length field should increase by a corresponding amount. This is the part of the binary log that contains the data that we just inserted, the ABC, ABCD, ABCDE. We can see that there on the right. If we go back to the ASCII codes I showed you earlier, I didn’t pick these ones by accident. Look out for them when we go back to the hex editor. Here’s the data we inserted, and here’s those familiar ASCII characters. We saw them earlier in the table, and they’re incrementing one at a time as we increase the length of the string. I think we might be onto something here. If we go back to the hex editor, then we can highlight their hexadecimal representations too. These are just two views over the same data. We can see here that we have the same incrementing size field and the same data that we inserted, just in hexadecimal.
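
The inserts behind that trick were nothing more sophisticated than something like this; the table and column names here are illustrative rather than the ones we actually used:

    INSERT INTO users (username) VALUES ('abc');
    INSERT INTO users (username) VALUES ('abcd');
    INSERT INTO users (username) VALUES ('abcde');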

Now that we’ve found the length field, wouldn’t it be convenient if we could make it 0 and trick that error handler? Let’s rewrite some binary logs. If we were doing this properly, if we were writing a program to do something like this that we wanted to deploy into production, then we’d probably want to import the Postgres source code, use those structs, and work with them to produce and write that binary data. We’re not doing this in production; we’re just trying to recreate an incident on our laptops. Maybe we could just write a regular expression to mangle the binary data however we want. Let’s save some time and do that. If we go back to our highlighted view of the hex editor, we need to pick one of these three records to break. For no reason other than it being easier to draw on a slide, I’ve picked the top one. Doing so is relatively simple. This is the actual script that we used to do it. It’s in Ruby, but you could write it in whatever you want. It has exactly one interesting line, and it’s this regular expression. It replaces 3F, which is one of those size bytes, with 0. It uses some other characters to anchor within the string; those don’t get replaced, they’re just a useful anchor for the regex.
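
I won’t reproduce our script byte for byte, but the shape of it was something like this. The 3F is the length byte from our hex dump; treat the anchoring on the three zero bytes as illustrative, since the real script matched on a longer run of surrounding bytes:

    # Zero out the 0x3F length byte of one WAL record, leaving everything else intact.
    segment = "00000001000000000000000A"   # illustrative WAL segment file name
    wal = File.binread(segment)

    # 3F 00 00 00 is the little-endian xl_tot_len we want to break: keep the three
    # zero bytes as an anchor (lookahead) and replace only the 3F with 00.
    wal.sub!(/\x3F(?=\x00\x00\x00)/n, "\x00".b)

    File.binwrite(segment, wal)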

Are you ready for some very exciting computation? We have just changed a single byte of data. What happens if we feed that log back into the synchronous replica through our reproduction script? We can pass it in via the restore command. We got exactly what we hoped for: exactly the error that we saw from the synchronous replica in production. Now we can produce it reliably, over and over again, in our local test setup. That feels good. Unfortunately, that success came with a caveat, which is that doing this wasn’t enough to reproduce the production outage. When we added all of it to the script and ran it, the cluster failed over just fine. Clearly, these three conditions weren’t enough. Either there was something important about that read only file system and the actual act of a weird hardware failure in the RAID card, or we’d missed something else. At this point, we were confident that we’d missed something else; it seemed weird for those first two to matter so much.

Backup Virtual IP on Synchronous Replica

Then one of our team noticed it. They compared our Pacemaker config from production and from our real staging environments with the one that we had in our local Dockerized test setup, and they found the sixth part of the puzzle. If you remember our architecture diagram from earlier, we had that virtual IP address that follows the primary node around whenever it moves. What was missing from that diagram was a second virtual IP address that we’d added not too long before the incident, known as the backup VIP. The idea behind this was that when you’re taking your snapshot backups of Postgres, ideally you want to minimize the load, or not put any load at all, on the primary while doing so. We introduced this second virtual IP so that the backup scripts could go to that instead of the primary and take their backup from there: no additional load on the primary, everything is good. We had a new lead, but this was really in ‘surely not’ territory in our minds. We didn’t have any hypothesis for why this would block promotion of a new primary, but we followed our methodical debugging process and added it into the cluster. We sat there and pressed enter on the reproduction script. We watched, and we waited with bated breath. It worked. This time, there were no caveats. For some reason, adding in this additional virtual IP address was enough to reproduce the production failure. The cluster refused to promote a new primary and sat there not serving queries. There we have it. We’d deconstructed all those abstractions that we normally get to trust and rely on. We’d recreated the production outage that caused so much trouble on that day. Hang on, no. Why would an extra virtual IP do that? Surely the cluster should still just repair itself and then move that virtual IP off to wherever it needs to be.

How Pacemaker Schedules Resources

To understand why this extra virtual IP caused so much trouble, we need to understand a little bit about how Pacemaker decides where it should run things. There are two relevant settings in Pacemaker that will make this all click into place. First, Pacemaker by default assumes that resources can move about with no penalty. This may be true for some of the other things that you can use it to run, but it’s very much not true for databases like Postgres. With databases, there is generally some cost to moving resources around, while clients reconnect or query processing is delayed. Whenever Pacemaker decides to move the primary around, we do see a little bit of disruption. It’s on the order of 5 or 10 seconds, but it is there, and we don’t want to incur it for no reason. Pacemaker has a setting to combat this called default-resource-stickiness. If you set this to a positive value, then Pacemaker will associate that cost with moving the resource around, and it will avoid doing so if there’s another scheduling solution it can come up with. The second is that, by default, Pacemaker assumes that resources can run anywhere in the cluster. We’ve already seen an example where that’s not true: virtual IPs.

Pacemaker has another setting called a colocation constraint. A colocation constraint lets you assign a score between two resources that influences whether or not they should be scheduled together. A positive score means please schedule these together. A negative score means please don’t schedule these together. These are the actual settings from our production cluster at the time. We had a default-resource-stickiness of 100. We had a colocation constraint of -inf between the backup virtual IP and the primary, which says, effectively, these shouldn’t be scheduled together at all. It was the way that we’d written this colocation constraint which bit us, and specifically that negative infinity score, which we assigned between the backup virtual IP and the primary. In an effort to ensure that it was always on a replica, we’d gone with the lowest score possible. It turns out, unfortunately, that there is a very subtle semantic difference between a very large negative number and negative infinity in Pacemaker, and this bit us really badly. If we compare a reasonably large negative number, -1000, with -inf: while -1000 means avoid scheduling these resources together, -inf means never schedule these together, under any circumstances. It is a hard constraint, and in our cluster that hard constraint left Pacemaker without a scheduling solution it was willing to accept when the primary failed. If we take that score and change it to -1000 instead, then failover works properly. We were able to take this knowledge, put it back into our cluster config, and stop this outage from ever happening again.
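
In crm shell syntax, the before and after look roughly like this. The resource names are made up and the exact syntax varies between Pacemaker versions, but the scores are the real ones:

    # Moving a resource has a cost, so don't do it without a reason.
    property default-resource-stickiness=100

    # What we had: the backup VIP must NEVER run on the same node as the primary.
    colocation backup-vip-away-from-primary -inf: BackupVIP PostgresPrimary

    # The fix: a strong preference to keep them apart, rather than an absolute rule.
    colocation backup-vip-away-from-primary -1000: BackupVIP PostgresPrimary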

There’s an awkward thing that I have to acknowledge here. Once we had the reproduction script working, we tried removing bits of it, just to see if any of them weren’t necessary. It turns out that the WAL error that we saw wasn’t essential to recreating the outage; it was a red herring. I’m sorry, because I know it was the most interesting part of debugging this incident, and it really would have been quite cool if it had been necessary for us to recreate it. But it was part of the debugging process, and the entire point of this talk was to show you that process in its full depth, not omitting details just because they ended up not being needed later. I hope that you can take away from this talk the belief that you too can dive into the depths of systems that are normally abstracted away from you. There we have it: the minimal reproduction of our Postgres outage, a hard failure of the primary, a correlated crash of the synchronous replica, and an extra virtual IP which was being used to take backups.

Lessons Learned from Incident Debugging

What can we learn by going through that incident? How can we take away some higher-level lessons? The first one I alluded to earlier, which is that none of the stack is magic. Sure, we might spend our time at whatever level of the stack we tend to work at in our business, whether that’s a backend application, a frontend application, databases, infrastructure provisioning; we all have something that we do day-to-day. That doesn’t mean that we’re not capable of diving down a layer or three when things go wrong and we really have to. You’ve probably heard the popular refrain when talking about cloud computing, which is, it’s just someone else’s computer. I’d like to repurpose that and talk about the software stacks that we use day-to-day. It’s all just someone else’s abstraction. It can be understood, and it can be reasoned about. A strategy for this that is often thrown about is that you can get better at it by reading other people’s code. I think that’s a good start, but there’s more to it if you want to get really good at this. The bit that I think is missing is that you should practice modifying other people’s code to do something different, something that you want it to do. Sure, we all do that when we start a new job, but habitually diving into a third-party library that’s part of our application, and changing it, is what trains this skill set. It’s a really useful skill set to have in outages like this.

The second thing I’d like to pull out is that automation erodes knowledge. If you remember, I was talking about how long we spent trying to get the cluster back into action. Part of that is because we’d used and trusted that clustering software for so long that we weren’t all that confident in logging in and manually reconfiguring Postgres to do the kind of replication that we wanted. Would we have reached for that option much earlier if we hadn’t had automation taking care of it for so long? A popular way to mitigate this effect is to run game days: structured days where you introduce failures into a system, or create hypothetical scenarios, and have your team respond to them. One that would have been useful to us, with the benefit of hindsight, is practicing how to recover the database manually when Pacemaker is broken. We didn’t know that it was going to break, but we could have practiced for the event where it did. Lastly, when trying to reproduce an outage, you must never stop questioning your reproduction script. It’s so easy to laser focus in on your current hypothesis, and then miss something that seems simple or that seems like it might be completely unrelated. You have to go back and question your assumptions and question whether you’ve missed something from production. Let’s recap those. There is no magic in the stack. Automation does erode our knowledge. We should always question our reproduction script.

Debugging in the Future

As well as those lessons, I’d like to do a little bit of stargazing and make a prediction about the debugging skills I think we might need over the next few years. For the last decade and a half, many of us have got used to working with software that sends its requests using JSON over HTTP. Both of these are plaintext protocols, and they’ve become a de facto standard in themselves. I think this is going to continue for a long time; it’s not going away. But binary formats are coming to web development. They’ve remained common in things like databases and high-performance systems all the way through, but we’ve largely been able to ignore them in web development. That’s changing, because we’re increasingly seeing APIs that send their data using protobuf, and HTTP/2 is getting more commonly used. These are both binary formats. What’s really driving this combination is something called gRPC, which was released by Google. It’s an RPC framework that specifies one way of sending protobufs over HTTP/2. Maybe not so much for external, internet-facing APIs, but a lot of companies are using this for internal service-to-service communication now.

The good news is that, to tackle this challenge, there is tooling being built that will help us. I’ll give you a couple of examples, but there’s plenty more out there. The first is a tool called buf curl. It’s a lot like curl, the tool that lets you send HTTP requests at things; you can use buf curl to send requests at gRPC APIs, which is useful for exploring and debugging them. There’s also a tool called fq. If you’ve ever used jq to manipulate JSON data, think of fq as that but for binary formats. There are plenty of formats supported out of the box, and it has a plugin system if you want to add your own. Lastly, I’m going to take us back to the humble hex editor. At some point, you may need to just drop down to this level. It’s worth seeing what one looks like with some data that you know and are familiar with in it, just to have it in your back pocket. I’m not saying that you need to go out and spend the next month learning about these tools. Absolutely not. It’s worth spending a little bit of time with them, just so you know what’s out there. Have a play around with anything that catches your eye, and then it’s there for when you might really need it.

Key Takeaway

Most computing in the world, most requests or operations served by a computer somewhere, happen successfully, even if it doesn’t always feel like it. Days like the one I just described hinge on the smallest of edge cases. They are the 0.00001% of computing, but they’re responsible for the weirdest outages and the most head-scratching puzzles. Those days have an outsized negative impact. While they are a small percentage of all the computing that happens, they stick in our minds. They hurt our reputations with our customers, and cause a lot of stress in the process. For me, it’s a real shame not to learn from those outages and incidents, not just within our own companies, but as a community of developers, infrastructure engineers, and people in other similar fields. How many incident reviews have you read that go something like this: “We noticed a problem, we fixed the problem. We’ll make sure the problem doesn’t happen again.” When we publish these formulaic incident reviews, everyone misses out on an opportunity to learn. We miss out on an opportunity to gain insight into that tiny proportion of operations that lead to catastrophic failure.

Example Incidents

I’m going to leave you with three good examples to read after this talk. The first one is about a Slack outage in 2021, which was caused by overloaded cloud networking combined with cold client caches. The second is from a company called incident.io. It talks about some manual mitigation techniques that they were able to apply during an incident to keep the system running in a broad sense, while they focused another part of the team on coming up with a proper fix for the crashing backend software. Lastly, there’s a post from GitLab from quite a few years back now. It’s about a really nasty database outage that they had, which did result in a little bit of data loss as well. Really scary stuff, and really worth reading. I think these three are a good starting point if you want to get better at writing incident reviews. If you’re in charge of what gets written publicly, or you can influence that person at your company, then please share the difficult stories too.
