Presentation: Architecting a Production Development Environment for Reliability

MMS • Henrique Andrade

Article originally posted on InfoQ.

Transcript

Andrade: My name is Henrique Andrade. I am a Production Engineer at Meta. We will be talking about the development environment infrastructure that we have at the company. The main focus of this conversation is what you can expect from the environment that you’re going to be using to do your work as a new software engineer joining the company. It doesn’t really matter here how senior you are. The environment that we put in place is designed to tackle the needs of software engineers and production engineers, irrespective of their seniority. This is the experience that everybody is going to be exposed to. I am going to be talking about the main tenets that make this environment reliable, so that you, as a developer who’s joining the company, don’t have to worry about OS upgrades, your machine going away, or maintenance that might occur. All of those things can potentially be disruptive to you, and they shouldn’t be your main point of concern. You should be focused on doing your project work and moving the project that you are allocated to forward. There is a lot of stuff that I’m going to cover here. A very large team is behind providing infrastructure that is as reliable as it is. I’m here as a messenger, as someone who has used this infrastructure for a while, and also as a member of the team that keeps it going and improves on it. There is a large team that I have to thank here for their hard work, their thoughtfulness, and all of the infrastructure that they helped build over the years.

Outline

I am going to give you a little bit of an introduction. As someone who is just joining the company, what are your options in terms of environments that you can have at your disposal to do code development at Meta? One thing that I want to emphasize here is, given that choice, what is the role of production engineers in providing that service for the software engineering community at the company? Why are PEs part of the DevEnv team, the team that provides those environments? We are going to talk a little bit about the development environment architecture: servers, containers, tooling, and the user interface through which you interact with this environment. We’re also going to be talking about a few challenges behind supporting this development environment architecture. Then towards the second half of the talk, we’re going to be talking about designing for reliability.

As a software engineer, you really don’t want to be concerned about how reliable this environment is, you just want it to be reliable. What makes that environment reliable? As part of that discussion, we’re going to be talking about how intentional we are in terms of disaster preparedness. A lot of the reliability behind this environment comes from constantly thinking about what can go wrong, and how we smooth the usage scenarios that people have when they are facing disasters, whether it’s a disaster exercise or a natural disaster. Then I’m going to conclude this talk with a short discussion of our future challenges.

What Is It Like Developing Code at Meta?

What is it like developing code at Meta? As most of you know, Meta has a boot camp culture. When someone joins the company, they do not go directly to their team to start working. They go through this boot camp exercise, which introduces you to some of the core technologies that we have, how you do things at the company, and all that you need in order to be productive from day one when you actually join your team. At Meta, we have a large community of internal developers. This can be software engineers. It can be production engineers, like myself. These are the main two groups that the developer environment infrastructure is designed for. We also have data engineers and data scientists, and all of these individuals will be using the developer environment. We also have enterprise engineers and hardware engineers, and we provide some partial support for their functions as well. Enterprise engineers are the software engineers who design and implement the internal systems that support functions like HR, finance, and things of that sort. Hardware engineers are the ones designing chips, accelerators, and all of the gear that Meta has in terms of hardware infrastructure.

In terms of the development environment organization, where I am one of the members, the main focus is in terms of scalably supporting the user community, and the fleet of servers that are in place to provide that service. The main goal that we have is to minimize the fuss, so as a new developer, you should be able to basically get to the development environment whatever you choose, fire up VS Code or your favorite editor, and be productive right away. How do we make that happen? The first thing is that we try to be very quick in terms of onboarding the developers. You get all that you need, day one, and it should be all ready to go. You’re not going to spend a week trying to configure your environment or install a piece of software that you need, all of these things should be there automatically for you. That means up to date tooling, and it should be there for you automatically. When it comes to actually coding, you want to be able to put your code changes in as easily as possible. The tooling in terms of access to source control, repositories, the tooling that is necessary for you to create pull requests, which we call Diffs at Meta, all of these things should be there right away. If you do make configuration changes to the environment that you’re working on, that should also be automatically persisted. You shouldn’t have to do anything special. More importantly, you should be insulated from environmental changes that might affect you. The environment is meant to be as stable as possible. If there’s maintenance going on, if there are changes in the source control system, all of those things should be more or less isolated from you. As a developer who just joined, you basically think that you have an environment that is all for you. There is no disruption. There is nothing. It’s ready to go more or less immediately.

Main Offerings

What are the main offerings that we have in the development environment? Whether you’re a software engineer joining right out of college or coming from another company, you have the same choices. You have basically two main choices. One is what we call a devserver. The second is what we call on-demand containers. You’re going to see what the tradeoffs and differences are between these two environments. A devserver is more similar to having a pet environment. It’s basically a Linux server that is located in one of the data centers, and you have a choice of which data center is most suitable for your physical location. We have different flavors of that. It starts with VMs, which come in different sizes and with different amounts of memory, and it can go all the way to a physical server for those who need that. Some people are doing kernel development, or they are working on low-level device drivers and things like that. It might make sense for them to have a physical server. Then there are certain special flavors. If you’re doing development with GPUs, or if you need access to certain types of accelerators, there are flavors that you can pick that are more suitable for the work that you’re going to be doing.

In terms of the lifespan, when you get one of these devservers, they might be temporary, or they might be permanent. Suppose that you’re in a short-term project to improve a device driver, so you might need a physical server just for a couple weeks. You also have the choice to get something permanently because you’re going to be using that, basically, continuously throughout your career at Meta. The interesting thing about devservers is that they run the production environment, and they run in the production network. As you are testing, debugging, you’re basically inserted into the environment that will give you access to everything that powers the Facebook infrastructure. You can have certain utilities and tooling pre-installed. We will talk about provisioning, but there is a way for you to basically say, I need all of these tools pre-installed. If I get a new server, I should have all of these tools already pre-installed. You can do things like that. You have remote terminal access, so you can just SSH into the box, or you can also use VS Code and work on your laptop connected directly to that devserver behind the scenes. Every server is managed by a provisioning system, so that means that they are permanently provisioned. That means that if there are updates to external software, to internal software, all of the upkeep that is necessary is done for you automatically.

They have default access to internal resources, but you do not have direct access to the internet. There are tools and there is infrastructure to do that, but that is not necessarily available out of the box. We try to minimize that, because that introduces potential risk as well. There is the ability to install precooked features. By features we mean the infrastructure around certain software packages that might help in your development work. There is also what we call one-offs. Those are things or tools that you as a new developer, you might be using, a spell checker that you like, or an editor that you like. You can also set that up and have that installed in any devserver that you get from that point forward. Devservers can also be shared. Sometimes you’re working with a team, like you’re hired, and you’re going to be working closely with someone else in your team or in a different team, and you can definitely share the access to that devserver with that team. There is also the ability to do migration between devservers. Suppose that for one reason or another, you need to get a bigger devserver, or you need to get a devserver in a different region, you can migrate from one to the other quite easily. One thing that is important here is that devservers, many of them are virtual machines, so they are layered on top of the same virtualization infrastructure that powers Meta. There isn’t anything special about that.

The second offering that we have is called the on-demand containers. The interesting thing about containers is that they are pre-warmed and pre-configured with source control, the repositories that you might be working on, linters, but they are ephemeral. It’s an ephemeral environment for a specific platform. If you’re doing iOS development or Android development, you’re going to get all of the tooling that you need in order to do development with that particular software stack. They have multiple hardware profiles. This means memory amount, whether they have GPU access, whether they have access to certain accelerators. They’re also accessible via IDE, so via VS Code or via the CLI. It depends on how you like to work. They are focused on workflows. As I said, iOS, Android, Instagram, or mobile development, or Jupyter Notebooks, whatever you as a new developer joining the company will need.

They include a web server sandbox, which replicates the prod environment. Suppose that you’re making changes to Facebook, to the desktop Facebook services: you have a replica of that environment at your fingertips when you’re using an on-demand container. This is also true for devservers. The point here is that this is ephemeral. It’s up to date. It’s ready to go at a click of the mouse. You can also further configure this environment. Suppose that you have certain features (again, a feature here being a piece of software and a configuration associated with it) and you want them delivered to your container: you can have that as well. This container infrastructure is layered on top of Meta’s Twine, which is something similar to Kubernetes. We have containers and the orchestration that goes with them. You’re able to deploy these things very quickly, very much like any container technology that you have out there. If you’re interested in more about this, there is a good talk that was given at @Scale 2019 that goes deeper into what we have as part of Twine.

Production Engineering

Why do we have production engineering? I’m part of that team supporting development environment. I just wanted to do this plug here, because one would think that development environment is just something that you need a bunch of SWEs to come together and put all of the software stack necessary to support these two products that we have. The interesting thing about Meta is that in many groups, production engineers are an integral part of the organization, because production engineers and software engineers, they have different focus areas. Production engineers are software engineers, but they are basically interested in terms of how to operationalize a service, so integration and operations. They are usually the ones that are responsible for managing the service deployment, the interaction, troubleshooting, and think about all of those things at scale. PEs tend to have a bias towards reliability, capacity planning, and scalability. They are always focused on deployment, on running upgrades efficiently, on configuration management, and also on a day-to-day basis, performance tuning and efficiency. Many teams at Meta have PEs associated with them. Other companies have similar organizations like the SRE organization at Google, the SRE organization at Bloomberg where I used to work. It is an interesting mix in terms of running efficiently services at scale.

What do PEs do in the DevEnv team? One of the main missions that we have in DevEnv is basically the efficiency of our developer community. In companies like Meta, Google, other companies that are software intensive, the company’s productivity is predicated on how productive the people who are writing the code and maintaining the code are. DevEnv PEs focus on the same things that any other PE team at Meta usually does. In our case, we have a particular target on developer efficiency. We want to make the service awesome, meaning it should be as frictionless as possible. If you’re joining as a new software engineer, you want things to just work out of the box. You don’t want to spend one month trying to figure out how to get your laptop to build a particular piece of code. All of that is provided for you from the get-go.

The second thing is, we are obsessed about automation. We are a relatively small team, if we count the SWEs and PEs, but we have a very large community that we’re providing services for, in the order of thousands of software engineers. As you know, software engineers are very opinionated. You want to make sure that the service is always reliable, works as expected, is fast, all that good stuff. The engagement that we have between PEs and SWEs in the DevEnv environment is actually part of the reason why we can provide a reliable infrastructure for our community. The service is a white box, meaning both PEs and SWEs understand the code that is behind the scenes. We make contributions to the same code base, have code reviews shared between PEs and SWEs. There is a very good synergy between the teams. It’s a shared pool of resources to put together the services that we have in place. This even includes sharing their on-call workload between members of the different sub-teams.

Development Environment Architecture at a Glimpse

Let’s start talking about the development environment architecture. The central point of this talk is, how do we make this whole infrastructure reliable? The first thing that is important to realize is that all of the stack that we have here that is used by DevEnv, is not special cased for our specific service. We are organized from the standpoint of a software stack, the same way that any other project or product at Meta is: whether it’s an internal or external product.

At the bottom layer here, we have the server hardware and the infrastructure for provisioning servers. You can get on-demand containers and devservers in any region where we have data centers. We’re layered on top of the same infrastructure that any other service at the company has. That means that when it comes to server monitoring and server lifecycle, we are under the same regimen as any other product. When it comes to provisioning, for example, there are ways for us to specialize the provisioning system, but those are basically plugins to the software stack that provisions a server, very much like a server that you would have for a database system, or a logging system, or whatever else you’re talking about in the company.

When it comes to the services that we’re using, the basic monitoring infrastructure that we have in place applies to the services that supply the products that are part of DevEnv. The same thing is true for our service lifecycle. Most of the services are also run as Twine tasks. They are monitored, they are logged, and all of that is the very same infrastructure that everybody else has access to. When it comes to the ‘servers’ that people have access to, whether it’s a virtual machine or a container, we are sitting on top of the same virtualization infrastructure and the same containerization infrastructure that power the rest of the company. Then on top of that, we have the actual service, the devservers or on-demand containers. On top of the whole thing, we have the developer and the tools that they are going to be using. There is nothing special about the DevEnv environment when it comes to the tooling and the software stack within the company.

Designing for Reliability

Now let’s start talking about designing for reliability. If you are a software engineer working for Meta, the last thing that you want to worry about is, did I install the latest updates to my devserver? You don’t want to do any of that. You don’t want to worry about backups, or about that particular devserver potentially crashing and you losing a day of work because of that. There are many things that we do when it comes to designing for reliability. We want to make that new developer, as well as the longtime developers that we have in the company, as productive as possible. The whole software stack relies on the internal infrastructure that we have. We design the service to be scalable and reliable from the outset. Why are those two things important? Scalability is important because, for many years, the company was growing at a very high pace in terms of onboarding new developers, creating more services, and things like that, and we had to have the ability to ramp up all of these people that were being hired at a very fast clip. We also have to be reliable. The team is small. We don’t have the ability to handhold every single developer, so most of the things have to just work. Providing reliability in a company like Meta, or many of our peers, means that we have to design for an unreliable world. Switches die, servers die, so what can you do in order to insulate people from all of the dynamic things that happen when you have a very large fleet? We rely on a bunch of internal services that were designed to cope with an unreliable world. That includes DNS and all of the infrastructure that powers the internal systems, which is highly reliable and scalable. It includes service router, which is a way for users or clients of a service to find its servers, again to cope with any unreliability that might befall that particular service. We also rely on the Meta MySQL infrastructure. That means that the databases are running in master-slave mode, and you have distribution of the workload, all of that good stuff.

We also rely on the provisioning infrastructure. You can ramp up a new devserver very quickly, if you have to. The provisioning system has all of the recipes and all of the infrastructure to bring a fresh server to a running state quickly. We also rely on the virtualization infrastructure. It’s very easy to turn up a new virtual machine to supply the devserver environment to a new developer. We also rely on the containerization infrastructure. There is a plethora of other services that we rely on. One of them is the auto-remediation infrastructure, so many of the common failures that one might face in a particular service have automatic remediations. When something goes wrong, there is a predefined set of logic that we will run in order to rectify that particular failure. The other things are more on the logical side. One of the things that is important, and is an integral part of the culture at Meta, is what we call SEV reviews. Every time that we have an outage, whether it is an outage in our service or in one of the services that DevEnv uses, we have a formalized process to review what created that site event. SEV stands for Site Event. The main thing is that, very much like in the aviation industry, if something goes wrong, we want to be able to learn from it and improve from it, to ensure that it doesn’t recur.
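The idea of "a predefined set of logic" keyed to common failures can be sketched as a lookup table from failure type to fix. This is only a minimal illustration in Python: the failure names, the fix functions, and the escalation message are all hypothetical and do not describe Meta's actual auto-remediation system.

```python
from typing import Callable, Dict, List

# Hypothetical fixes; each returns a short description of what it did.
Remediation = Callable[[str], str]

def restart_agent(host: str) -> str:
    return f"restarted agent on {host}"

def clear_scratch(host: str) -> str:
    return f"cleared scratch space on {host}"

# Assumed mapping from a known failure type to its predefined remediation.
REMEDIATIONS: Dict[str, Remediation] = {
    "agent_unresponsive": restart_agent,
    "disk_full": clear_scratch,
}

def remediate(host: str, failure: str, log: List[str]) -> bool:
    """Run the predefined fix for a known failure; return False to escalate."""
    fix = REMEDIATIONS.get(failure)
    if fix is None:
        # Unknown failures are not guessed at; they go to a human.
        log.append(f"{host}: no remediation for {failure!r}, escalating to on-call")
        return False
    log.append(f"{host}: {fix(host)}")
    return True
```

Keeping the table explicit means an unknown failure falls through to the on-call rather than triggering an improvised fix.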

The other important aspect of designing for reliability is fitting in with the overall company infrastructure when it comes to disaster recovery. Meta as a company has a well-defined process for running disaster recovery exercises, and for potentially automating things that one might need to act on in the face of a natural disaster or a disaster exercise. This is another area that is very interesting. There is an external talk that discusses these in more detail. The key message here is that the development environment fits in very well with the disaster recovery strategy for the company. There are multiple strategies that we have in place in terms of the ops side of the business here. In terms of team facing strategies, we have on-calls, and we have runbooks. If something should happen to the service, we try to strive for having well-defined runbooks that the people who are on-call during that particular week can just follow on. There’s continuous coverage with good reporting, and basically workflows that dictate how you should transition from one call shift to the next. There are well documented workflows for every on-call, whether they are someone who has been doing this for a long time, or someone who had just joined a team. We try to basically spread that knowledge to the people who are holding the fort.

Then there are certain strategies that are user facing. How do we ensure that you have that perception that the environment is reliable? The main thing here is communicating. We try to have well-developed communication strategies to talk to the user community. For example, if we know that there’s going to be an outage of a particular service because we are running a company-wide disaster recovery exercise, we have a strategy in place to alert the user community that they should be on the lookout and should prepare for that exercise. More importantly, there are things that we do to minimize the pain when you do face a disaster or a disaster recovery exercise, such as transparent user backups. We also design for an ever-changing world, where OS and software upgrades are a constant. How can we do those things without disrupting the user, while giving them a reliable environment to work on?

User-Facing and Infra-Facing Automations

Let’s talk a little bit about user-facing and infra-facing automations, because those automations are the things that are basically going to ensure that the environment is reliable. We try to do those things in such a way that we don’t disrupt the typical development workflow that a software engineer or a PE has in their day-to-day. The first thing is that common issues should be self-service. From a user standpoint, if something goes wrong, many of our tools have a mode by which they can self-diagnose and self-correct the issues that you have in place. There will be scripts, or there will be tooling for you to basically run, and those tools will basically go through a set of validations and eventually correct the situation for you. In the worst case, many of these tools also have what we call a rage mode, which allows you to basically run through all of those self-correcting steps, but also collect evidence in case it doesn’t work. The team that owns that particular product can look at well-defined logs and data to help rectify that situation.
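The self-diagnose, self-correct, and rage-mode flow described above can be sketched as a small "doctor" loop. This is an assumption-laden illustration: `Check`, `Report`, and `run_doctor` are invented names, and the real tools are not described in enough detail in the talk to reproduce them.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Check:
    name: str
    is_healthy: Callable[[], bool]  # validation step
    fix: Callable[[], None]         # self-correcting step

@dataclass
class Report:
    fixed: List[str] = field(default_factory=list)
    unresolved: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)  # filled only in rage mode

def run_doctor(checks: List[Check], rage: bool = False) -> Report:
    """Validate each check, attempt its fix if needed, and record the outcome."""
    report = Report()
    for check in checks:
        if check.is_healthy():
            continue
        check.fix()
        healthy_now = check.is_healthy()
        (report.fixed if healthy_now else report.unresolved).append(check.name)
        if rage:
            # Rage mode keeps evidence for the owning team to inspect later.
            report.evidence.append(f"{check.name}: healthy after fix = {healthy_now}")
    return report
```

In this sketch, rage mode runs the same steps as the normal mode and only adds evidence collection, which matches the description of collecting data in case the self-correction does not work.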

On the team-facing side, we have tools like Drain Coordinator and Decominator. These are automated tools. Suppose that you have a server that is going to undergo maintenance: there is a set of choreographed steps that will take place in order to not disrupt the end user. One of the things that we might do, if you have a devserver that happens to be a virtual machine, is move it to a different physical server without disrupting the end user who owns that devserver. Decominator automates the process of sending a server to the shredder if it has hit end of life. It alerts the users and does all of the tasks that are necessary to drain that particular server and indicate to the users that they have to move on to a different environment.

Preparing for Disasters

The next thing that I wanted to talk a little bit about is that, again, if you’re a developer, you don’t want to be doing your own planning for disasters. More importantly, you don’t want to be concerned with when disaster preparedness exercises are going to be run. How do we prepare for disasters in such a way that we don’t disrupt developer efficiency for the company as a whole, or for a particular developer individually? The first thing is capacity planning. We design our service to have spare capacity in different regions, under the assumption that we may need the extra capacity to move people around. If there is maintenance going on in a particular data center, people can just be migrated automatically: if they’re using a devserver that happens to be a VM, it can be moved to a different physical server. We also design for ephemeral access. In fact, the majority of our developers tend to use the on-demand containers, which are ephemeral. Every time that you get a new one, you get the freshest setup possible because these are short-lived tasks. I think they live for a day and a half, and that’s it. When you get the new one, you have the latest version of the source control system, of the linters, whatever. It comes brand new to you. The other thing that is important is that we run these disaster recovery exercises. We have two kinds when it comes to the impact on the developer environment. First, we have storms. Storm is the term that we use internally. It comes from actual storms, which tend to hit the East Coast of the U.S. rather frequently. They can be as disruptive as taking down a data center fully. We call these exercises storms. We also have dark storms, where we can potentially wipe the contents of random devservers. This is to ensure that we have the infrastructure in place and that people are aware of this particular aspect of their devservers.

The other part that is important here when it comes to disasters is that you have good tooling. You have to be able to drain people from particular regions. For example, if you’re not going to have access to a particular data center, you have to make sure that nobody who is trying to get a new on-demand container is sent to that region. For those who are in that region, or will be in that region during one of these exercises, you want to drain them out so they don’t lose anything as the exercise takes place. We also invest a lot in automating runbooks. Runbooks are like a cooking recipe for what one needs to do while running these exercises. From notification to draining, all those things should be spec’d out in the runbook. We also have devserver live migration. That means that we have the ability, relying on the virtualization stack, to move a devserver from one physical server to a different one without disrupting the user. You don’t even have to power down your VM. We also invest in backup and migration workflows. If you do lose your dev VM, you have to be able to allocate a new one as painlessly as possible. Then we have some strategies in place to survive internal DNS failures. If that does occur, we have to allow you to get to the devserver bypassing DNS when that’s necessary. Then the last thing that I want to highlight is the ability to communicate. If we’re running one of these disaster exercises, we email folks, and we open a task in Tasks, which is our Jira-like environment. We also send a chatbot communication to indicate that if you have a devserver in the affected region, you will have to temporarily get a new one in a different location, and potentially restore your backups, so you’re up and running quickly.
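The rule that nobody requesting a new on-demand container should be sent to a region being drained can be sketched as a placement filter. This is a hypothetical illustration: the region names, the `free_slots` bookkeeping, and the `pick_region` function are assumptions, not a description of Twine or of Meta's allocator.

```python
from typing import Dict, Optional, Set

def pick_region(
    free_slots: Dict[str, int],
    draining: Set[str],
    preferred: Optional[str] = None,
) -> Optional[str]:
    """Choose a region for a new container, never one that is being drained."""
    candidates = {
        region: slots
        for region, slots in free_slots.items()
        if region not in draining and slots > 0
    }
    if not candidates:
        return None  # nothing available; the caller has to retry or escalate
    if preferred in candidates:
        return preferred
    # Fall back to the region with the most free slots (ties broken by name).
    return max(sorted(candidates), key=lambda region: candidates[region])
```

For example, with `free_slots = {"east": 5, "west": 9}` and `draining = {"west"}`, a user who prefers `west` would be placed in `east` for the duration of the exercise.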

Storms and Drains

Let’s talk a little bit about storms and drains. There are two types of exercises, storms and drains. Storms are those exercises where we completely disconnect the network from the data center. It simulates a complete loss of that particular region. We also have drains, where we selectively drain a site and fail over to a different site. The network remains up. Why do we do these things? First, we want to periodically be able to test all of the infrastructure together, to gather signal for things not being resilient to the loss of a single region. Why do we do this periodically? You might think that once you work the kinks out of the system, it should remain good. That’s not the reality. The reality is that our software stack is constantly evolving. It might be that you cleared all of the design points in one exercise, but someone introduces a feature that creates a single point of failure again. That’s the reason for doing this frequently. The point really is that we want to be prepared for large scale power outages or network incidents, or even self-inflicted issues that might occur. We do this on a periodic basis to continue to validate that our design decisions, our architecture, everything is in place in order to provide that high efficiency environment that we want to provide. What types of signals do we collect when we run these exercises? Capacity: do we have enough of it to enable people to migrate quickly if they have, for example, a devserver? Do we have enough on-demand containers in a particular region to accommodate the loss of a different region? Failovers: some of the services that we run in that data center will become unusable, so do we have the ability to fail over to a different region? Recovery: are we able to recover from those failures? Do we have all of the process orchestration in place to make sure that everything will remain operational? That’s the reason why we run storms and drains.
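The capacity signal can be made concrete with a small calculation: for each region, does the spare capacity in the surviving regions cover what the lost region was using? The sketch below is illustrative only; the region names and numbers are made up, and real capacity planning would also account for hardware flavors and locality.

```python
from typing import Dict, List

def can_absorb_region_loss(
    used: Dict[str, int], capacity: Dict[str, int], lost: str
) -> bool:
    """True if spare capacity in surviving regions covers the lost region's usage."""
    spare = sum(capacity[r] - used[r] for r in capacity if r != lost)
    return spare >= used[lost]

def regions_at_risk(used: Dict[str, int], capacity: Dict[str, int]) -> List[str]:
    """Regions whose loss could not be absorbed by the rest of the fleet."""
    return sorted(
        r for r in capacity if not can_absorb_region_loss(used, capacity, r)
    )

# Made-up numbers: units are devservers (or container slots) per region.
used = {"east": 80, "west": 50, "central": 30}
capacity = {"east": 100, "west": 100, "central": 50}
print(regions_at_risk(used, capacity))
```

With these numbers, losing `central` is survivable (70 spare slots elsewhere against 30 in use), while losing `east` or `west` is not, which is the kind of gap a storm exercise is meant to surface.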

Runbooks

Let’s talk a little bit about Runbooks, because this is more specific to what each group at Meta eventually has to do, including us. A runbook is basically a compilation of routine procedures and operations that an operator needs to carry out in the face of a particular situation. In this case, in the face of a disaster. The goal here is that we should be able to do these things in a repeatable fashion. Which means that we should be able to automate as much as possible. Meta actually has a runbook tool that enables you to basically list all the steps that need to be carried out. Runbooks can be nested. One runbook can invoke another runbook as a step. These steps can be an operation, a logical barrier. You’re basically waiting for a bunch of steps to be completed. It can also be another runbook. When it comes to runbook development, there is a whole environment behind it that allows you to validate, debug. You can rely on preexisting templates for things that are more or less general purpose. At runtime, when you invoke one of these runbooks, there’s tooling that will basically capture the dependencies that will allow for step orchestration, that will capture execution timeline and logs. This is all in place so you can actually do a post-mortem after you have invoked one of these runbooks.
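The structure described here (steps that are operations, logical barriers, or nested runbooks, with an execution log kept for the post-mortem) can be sketched in a few lines of Python. The class names and the `execute` function are hypothetical and are not the API of Meta's runbook tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union

@dataclass
class Operation:
    name: str
    action: Callable[[], None]

@dataclass
class Barrier:
    name: str
    ready: Callable[[], bool]  # True once the awaited steps have completed

@dataclass
class Runbook:
    name: str
    steps: List[Union[Operation, Barrier, "Runbook"]] = field(default_factory=list)

def execute(runbook: Runbook, log: List[str], depth: int = 0) -> bool:
    """Run steps in order, recursing into nested runbooks; log every step."""
    indent = "  " * depth
    log.append(f"{indent}start runbook {runbook.name}")
    for step in runbook.steps:
        if isinstance(step, Runbook):
            if not execute(step, log, depth + 1):
                return False
        elif isinstance(step, Barrier):
            if not step.ready():
                log.append(f"{indent}barrier {step.name} not satisfied, stopping")
                return False
            log.append(f"{indent}barrier {step.name} passed")
        else:
            step.action()
            log.append(f"{indent}ran {step.name}")
    log.append(f"{indent}finished runbook {runbook.name}")
    return True
```

A drain runbook could then embed a notification runbook as its first step, place a barrier after it, and only then run the drain operation. A real tool would wait or retry on a barrier rather than stop; stopping keeps the sketch short.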

Comms

Let’s talk a little bit about comms. When you’re running these exercises, they can be highly disruptive. One of the investments that we have made, again so that the individual user, that new software engineer who joined, doesn’t have to worry about it, is a well-defined strategy for how we communicate with those users. The aim here is to maximize the efficiency of a particular developer, so you don’t lose a day if we’re running one of these exercises. The first thing is that we try to communicate ahead of time, whenever it’s possible. We try to be preemptive. Obviously, when we are running a disaster recovery exercise, the whole point is that this should look like an actual disaster. We don’t give a warning weeks in advance. The warning goes out a couple of hours before the exercise takes place. The other thing that we want to do in terms of the culture is have the developers themselves be aware that they might potentially lose a devserver. They have to be aware of what they need to do in order to survive those few hours when they might not have access to that devserver. One of the things, for example, that we want to educate the users about is that you should never be running a service on your devserver, because it’s an environment that can disappear from under you.

The other thing that is important here is that we want to empower the user to self-correct any problems that might occur and continue to work. Why is all of this important? We don’t want any surprises. We want developer efficiency to remain high even in the face of those potential losses. In terms of the mechanics of how we do this, we have the ability to run banners. As you go to a particular internal website, there will be an indication that there is a disaster recovery exercise taking place. The Shaman banners are basically a way to broadcast information efficiently, so everybody is aware of it. The second aspect to this is the automation of alerting for developers. There are emails, tasks, and chatbot messages that go out very quickly, which will enable you to react to the loss of a devserver, for example.

Live Migration

One other thing that I wanted to talk about is the ability to live migrate users or virtual machines. For people who are using devservers that are virtual machines, we have the concept of a virtual data center. The reason for it being a virtual data center is that every server in that data center has a mobile IP address, which enables us to migrate a virtual machine from one physical server to a different physical server without interrupting the workflow. This is very useful when it comes to simplifying maintenance workflows. From time to time, there might be a hardware problem, like a fan died, and you need to carry out a maintenance workflow on that physical server. We can easily migrate all of the devserver VMs that are on that physical host to a different one, to enable that maintenance workflow to take place. This relies on something called ILA. There’s a very good talk about it at Network@Scale 2017.

Learn From Quasi-Disasters

What is the point of running disaster exercises if we don’t learn from them? The main thing when we are running these exercises is that we want to be able to learn from them. Every time we run one of these exercises, we open a preemptive SEV for it. Then, after the exercise has taken place, that SEV is closed, but we collect all the information related to that event in depth in the tool that supports the management of SEVs. After that, we have a SEV review. The goal, at least, is to have every SEV that we have at Meta reviewed. The owner of that SEV, which in the case of DevEnv will be the person who was on-call during that exercise, will put together an incident report. There is tooling for this to ensure that we do it in a consistent way. The report is reviewed by a group of senior engineers and whoever wants to join that process. As a result, we will produce a bunch of tasks to drive the process of improving whatever needs improving. There will be critical tasks, which are reviewed promptly, and they might address the root cause that made something not work. Then we might have medium priority tasks that will allow us to mitigate issues in the future. They will also allow us to remediate things or prevent things. There might even be exploratory tasks that will drive architectural changes, potentially redesigning services that have been shown not to be resilient to potential disaster scenarios. The key thing here, again, is that we learn from disasters, so we can provide that environment where people are always productive. They don’t have to worry about their developer environment.

The Future

The first thing that is in front of us, again to make sure that that new software engineer, or the old timers, remain as productive as possible, is to harden the infrastructure to tolerate disasters. We are currently in the process of better integrating with the reliability program maturity model. We are improving our on-calls. We are investing a lot in terms of observability, and incident management, and also in terms of crisis response. What are the things that we can do in order to better respond to potential failures that we might have in the environment? We have to remember that DevEnv resources are critical during disasters. Oftentimes, having access to your devserver is what a PE or a SWE who is working on an actual disaster needs. The environment has to be bulletproof to enable people to work through things that might be affecting other parts of the computational infrastructure at the company. There are interdependencies. One of the key things that we have been working on is being resilient to DNS outages. One thing that might not be clear here is that devservers are Linux servers. How do you get to a Linux server if DNS is down? We have worked a lot on infrastructure to enable that to happen. Then there is the whole thing of being able to work with degraded access to source control and continuous integration and delivery. Oftentimes, in order to fix an actual SEV, or an actual disaster, you have to ship more code, or undo things that have been done. How can we make those things work?

Then, there is the whole thing of improving the reliability practice. We are investing a lot in terms of architecture and code reviews. This is to make sure that from the outset, as we are adding new features or subsystems, we’re not creating potential failure points in the whole stack. Then there is the periodic reassessment of the state of our production services. How do we make sure that things don’t decay just because we’re not paying attention? Then, there is the focus on problem areas. What are the things that on-calls are seeing day in and day out? We are putting effort into improving these as well. All of the work and all of the architecture that you saw here is in place in order to enable software engineers, production engineers, and data scientists to work as efficiently as possible, without having to worry about the environment that they are working in.
