Panel: Real-World Production Readiness

Panelists: Kolton Andrus, Laura Nolan, Ines Sombra. Moderator: Wes Reisz

Article originally posted on InfoQ.

Transcript

Reisz: We’re going to continue the discussion about running production systems.

All of our panelists have released software, and they've all carried the pager. At least in my mind, they come from three different personas in software. Laura Nolan is a senior SRE at Slack. She's a frequent contributor to conferences like SREcon and QCon, and she's on the steering committee for SREcon. She's published quite a bit on the topic of SRE, and has contributed to a bunch of books, including a chapter in one that I have right here on my desk. We've got Ines Sombra, who's an engineering leader at Fastly. She leads many of the teams that are powering their CDN and edge compute software. Then we've got Kolton Andrus. Kolton is the CEO of Gremlin. Kolton was one of the engineers at the heart of building the chaos engineering capability at Netflix, before he went on to found Gremlin, a company that brings a similar platform to other companies.

Between them, we've got engineering, we've got SRE, and we've got a chaos engineer here. What we're going to do is have a discussion. They're not just going to stay in those lanes, but we have them from those areas, at least, to continue the discussion about SRE.

What Is Production Readiness?

Reisz: When you think about production readiness, I'd like to know what comes to mind so we can frame the conversation. What do you think about when we talk about production readiness?

Nolan: To me, there's a bunch of aspects to this. One of the most important things about production readiness for me is that it is very difficult to reduce it down to just a color-by-numbers checklist. I think the most important thing is that the owning team, the team that's going to have the skin in the game and is going to be carrying the pager, whether that's the dev team or an SRE team, is presumably the team that knows that system best, and that it really takes the time to engage and think deeply about how that system can fail. What is the best thing that that system can do in various situations like overload, or dealing with partial infrastructure failures, that sort of thing?

Certainly, there are particular areas that are going to come up again and again, that teams are going to need to be prompted to think about. One of the most obvious being how to deal with recovery and failure. Monitoring and understanding the system is a huge area, always. How to deal with scaling and constraints and bottlenecks and limits, because all systems have those. How to deal with load shedding and overload is always a huge area. Then data is another big issue. How do you safeguard your data? How do you know that you can restore from somewhere? How do you deal with ongoing work? At Google, we call this toil: the work of turning up new regions and turning things down, the work of onboarding new customers, that sort of thing. How do you ensure that that doesn't scale as the system grows and overwhelm the team? That tends to be what I think about in terms of production readiness.

Reisz: Prompts, things to think about, not necessarily a checklist that you’re going down.

Sombra: When it comes to production readiness, I translate it into operational excellence. For me, it signals how uncomfortable or nervous a team is around their own service. Are we scared to deploy it? Are we scared to even look at it? I tie it to health: how many pages do we get? Are we burning out on on-call? Is the person that is going on-call feeling that they hate their life? Is it stressful? For me, it's more about production readiness equals feelings. When we feel ok, then basically we're doing things right. That's where my mind goes immediately.

Andrus: There's a lot that I agree with here. I'm a big believer in skin in the game: if you feel the pain of your decisions, you'll choose to optimize them. You touched on another important point, Ines, on the confidence and the pain. We want people to feel a little bit of friction, and to know that their work is making their lives better. We don't want to punish them and have them feel overwhelmed, or have them feel a lot of stress about it. I've been to so many company parties with a laptop in my bag. That's not the fun part of the job. It might be one of the battle scars you get to talk about later. To me, a lot of production readiness comes down to how well you understand your system. We always have teams and people that are changing. Day one on a new team is: draw me the system. Tell me what we depend upon. Tell me what you think is important. A lot of it comes down to that understanding, so that we have a good understanding of the monitoring and what it means.

I've been part of teams that had 500 alerts, and we cleaned that down to 50. The start of that exercise was, does anyone know what these do? The answer was no, but don't touch them, because they might be important. If you understand that, you can make much better decisions and tradeoffs. Of course, I'll throw the chaos engineering angle in. One of the ways you understand a system is you see how it behaves in good times and in bad. You go through some of these scenarios, you have some opportunities to practice them, so you understand how that system might respond. To me, it's about mitigating risk. A lot of what we do in reliability, I've come to feel, has a lot to learn from security, in that a lot of what we're doing is mitigating risk and building confidence in our systems.
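
As a rough illustration of the alert cleanup Andrus describes, here is a minimal sketch that flags the alerts nobody can vouch for, given an export of alert definitions. The CSV layout, the column names, and the 90-day threshold are assumptions made for the example, not any particular monitoring product's format.

    import csv
    from datetime import date, timedelta

    STALE_AFTER = timedelta(days=90)

    def audit_alerts(path):
        # Expected columns: name, owner, last_fired (ISO date, or blank if it never fired).
        today = date.today()
        stale, unowned = [], []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if not row["owner"].strip():
                    unowned.append(row["name"])
                last = row["last_fired"].strip()
                if not last or today - date.fromisoformat(last) > STALE_AFTER:
                    stale.append(row["name"])
        return stale, unowned

    stale, unowned = audit_alerts("alerts_export.csv")
    print(f"{len(stale)} alerts have not fired in 90+ days; candidates to tune or delete")
    print(f"{len(unowned)} alerts have no owner; find one or delete them")

Anything that lands on either list is a good starting point for the "does anyone know what these do?" conversation.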

Reisz: Nora Jones talked about context, understanding the context of why we're doing different things here. I think that totally makes sense.

Having Org Support and Skin in the Game

Sombra: I do want to point one thing out whenever we say that skin in the game is a factor. Yes, but that's not the be-all and end-all. It also requires organizational support, because a team can be burning, and unless you leave them space to go and fix it, the skin in the game only helps you so far without the environmental support for it. If you're lucky to be in an operationally conscious company, then you will have the opportunity to go and make resilience a focus of your application. It requires us as leaders to make sure that we also understand it and that it is part of our OS. We're like, is it secure? Is it reliable? Then it just becomes a core design and operating principle for the teams and the management as well.

Reisz: What do you mean by that, Ines? You said space to go fix it. What do you mean, specifically, by space to go fix it?

Sombra: A manager needs to pair. If you have a team with 500 alerts, it takes time to winnow them down, understand them, and then just even test them. If you never give them the time, then you’re going to have a team with 500 alerts that continues to burn and continues to just churn through people. The people alone are not enough. You have to have support.

Nolan: Most definitely. I agree with that so much. One of the big precepts of the way SRE worked in the early years at Google, and I think this is not generally the case in the industry, was a very big emphasis that most SRE teams supported a number of services, and there was always an idea that you could hand a service back to the developers if they wouldn't cooperate in getting things prioritized as needed. Of course, sometimes you would have problems in the wider environment, but that idea of almost SRE by consent was this very big thing, so that teams wouldn't just have burning things piled on them until the team circled the drain.

Consulting vs. Ownership Model for SRE (Right or Wrong Approach)

Reisz: Laura, in your talk, you talked about the consulting versus ownership model for SRE. Is there a wrong or right approach?

Nolan: I don't think either is wrong or right. They both have a different set of tradeoffs, like everything else in engineering. Having a dedicated SRE team is extremely expensive, I think, as many organizations have noticed. You need to set up a substantially sized team of people for it to be sustainable. I know we're all thinking about burnout at this stage of the pandemic, and the whole idea of The Great Resignation, people leaving. In order for a team to be sustainable service owners, in my view, they need to be a minimum of eight people. Otherwise, you just end up with people on-call too frequently, having too much impact on their lives. That's an expensive thing. A team is also limited in the number of different things it can really engage with in a deep way, because human brains are only so large, and if you're going to be on-call for something, you need to understand it pretty well. There's no point in just putting people on-call to be pager monkeys with a runbook that they apply in response to a page. That's not good engineering, and it's not good SRE or operations practice. You need a set of services that really justifies that investment.

What’s true to say is, a lot of organizations don’t have a set of services where it makes sense to dedicate eight staff purely to reliability and availability type goals. Where do you go? That’s where your consulting or your embedded model makes sense, where you have people who bring specifically production expertise. It’s not going to be a team of people who are dedicated to production that are going to own and do nothing but reliability, but you’ll have one person embedded on a team whose job is to spread that awareness and help raise the standards, the operational excellence of the whole team. That’s where the consulting model or the embedded model makes sense, I think.

Distribution of Chaos Engineers among Teams

Reisz: Imagine we can’t have chaos engineers on every single team. How do you think about this question?

Andrus: I've lived on both sides of the coin. When I was at Amazon, I was on the team that was in charge of being that consulting arm. That gives you the latitude and the freedom to work on tooling, to do education and advocacy, to have that holistic view of the system, to see those recurring issues that come up across multiple teams. At Netflix, we were on the platform team, in the line of fire. It was a different set of criteria. I appreciated the great operations folks we had across the company that helped us and were working on some similar projects, but we saw a different set of problems that we were motivated to go fix ourselves. I think that's a unique skill set. If you have that skill set on your team, and somebody's passionate about it, having them invest the time to go make the operations better and teach the team itself works very well. If you have a bunch of engineers that maybe don't have as much operational experience, just telling them they're on-call may start them on the journey and give them some path to learn, but it isn't really setting them up for success.

It's funny, we talked about management a little bit and cost a little bit. As a CEO now, that's a little bit more my world. To me, I think we underappreciate the time and effort that goes into reliability as a whole. Ines's point that management needs to make space for it, needs to make time for it, and needs to invest in it, is something that hits home for me. I heard a story secondhand of somebody that was having a conversation about reliability, who asked, how much are you spending on security? The answer was, all the money in the world. He asked, how much are you spending on reliability? Like nothing in comparison, and not a great answer there. Then, do you have more system stability issues from security incidents, or from availability outages? The conversation was over at that point.

I think we need to frame it in a way that helps our leadership think about this in terms that they understand, and figure out how to prioritize and quantify it. The other tricky part is, how do you know how valuable the outage that didn't occur was? I've been picking on Facebook for the last couple of conversations I've had on this. Imagine you could go back in time, and imagine you could find and fix an issue that would have prevented that large outage that was estimated to be $66 million. I almost guarantee you could not convince anyone that it was that valuable of a fix, but it was in hindsight. How do we help folks understand the value of the work that we do, knowing that sometimes it is super paramount, and sometimes it might not ever matter?

Understanding Value of Work Done

Reisz: How do you help people understand the value of the work that’s being done?

Sombra: I think it's hard. We don't reason through risk very well. It's just hard; people are not used to that. I think there's a cultural element too, where if you glorify the firefighting, you're going to be creating a culture where the teams that never have an incident just get overlooked, no matter how much good work they do. If you always grease the squeaky wheel, then you're going to be creating this problem too. Granted, all of our houses are made out of glass, so I'm not going to pick on anyone. I can tell you that there's not going to be a single solution; these things carry nuance. And we're not very good with nuance, because it's just hard.

I do have a point about Laura's embedded versus consulting model. I'm a fan of having my own embedded SREs. I deal with the data pipelines, and I live in a world where there is an entire organization that I depend on, but depending on your failure domain and depending on your subject matter expertise, I like having both: a centralized team that I can go to for core components, which are going to have attention and are going to have care, and then also embedded SREs that I can leverage, so that my roadmap is something that I can control.

The Centralized Tooling Team

Nolan: That's almost a fourth model, where you have the centralized tooling team. That's not really what I think of as an SRE team. I think of an SRE team as a team that owns a particular service, deals with deployment, deals with all the day-to-day operational stuff, is on-call for it, and is really more deeply engaged with the engineering roadmap as well. I know that not many organizations actually have that.

Andrus: It’s a great clarification, and I’m glad you brought it up. Because a lot of folks think, yes, we have a centralized team and they own our chaos engineering tooling, and our monitoring tooling, and our logging, and we’re done.

Nolan: We’re not done.

Sombra: I'm actually missing the nuance, because what you're saying is that you would have a team that owns particular services, but you're talking about applications that are not necessarily core services, for example a CI/CD pipeline.

Nolan: For years, I was an SRE on a team that ran a bunch of big data pipelines, and that was what we did. We didn't build centralized tooling for Google. We built tooling and we did work on that particular set of pipelines. You've got to be quite a large organization to fall into that model.

Sombra: We could consider that within my scope as an embedded SRE: people who are only thinking about that, and then we go to the other, centralized team for the foundational core pieces. Also, to your point, Wes, it's difficult to reason about this because we're using these terms very differently, and contextually they mean different things in different organizations. First, calibrate on the meaning, and after that, we're like, yes, we're doing the same thing.

Teaching Incident Response

Reisz: How do you teach incident response? Laura, I actually held up this book earlier; I'm pointing to chapter 20 here. What might you suggest to help people learn incident response and bring that into the organization?

Nolan: I want to build a card game to teach people incident response, and I keep meaning to open source it and never quite getting the time. The pandemic came along, and you'd think I would have time, but then I started doing a master's degree like an idiot. As for teaching it, I think it's always important to have fun when you're teaching something. I think it's important to be practical and hands-on. I don't believe in long talks where people are just sitting there doing nothing for a long period of time. I always like to simulate things. Perhaps you can run an incident around some of your chaos testing. I'll talk a little bit about what to do during an incident and what to do after an incident. You're right: before an incident, everyone needs to know what the process is that you're going to follow during an incident and get a chance to practice it. Keep it consistent. Keep it simple.

Then, during an incident, the key thing is to follow that process. One of the most important things is to have a place where people go, that they know they can communicate about an ongoing incident. Have a way for people to find out what's going on and join an incident. Have someone who's in charge of that incident and the operational aspects of that incident: dealing with people that are coming along asking questions, dealing with updates and communication with stakeholders, dealing with pulling people into the incident that need to be in there. Use that operational person, the incident commander, to insulate the people who are doing the hands-on technical, hands-on-keyboard work of debugging and fixing that incident. People really complicate the whole incident management thing, and they go, firefighters. Really, it's about having a system that people know, and insulating the people who are doing the hands-on work from being distracted by other things. Have someone who keeps that 360-degree view of what's going on, making sure that you're not spending all your time down in the weeds and forgetting something important off to the side, like maybe that after six hours people need to eat. Hopefully, you don't have too many six-hour incidents, but they do occur.

Then afterwards, the most important thing is to look at what happened and learn: looking for patterns in what's going on with incidents, understanding in depth what happened in particular incidents, particularly the very impactful ones or the ones that might hint at bigger risks, the near misses, and looking for things that are happening again.

Andrus: It's interesting. I could agree with most of what Laura says and just say, move on. There's how you prepare people: I think it's worth investing in giving people time for training, and there are lots of different learning styles. Shadowing was effective for me, being able to watch what other people did. This is a perfect use case for chaos engineering. You should go out and run some mock exercises, blind and not blind. There's the kind where you tell everyone you're going to do it so they have time to prepare and they're aware, and you go do it. Make sure it goes smoothly in dev, or staging, or whatever. Then there's the blind one: ok, we've got an advance team, we're going to run an incident, and what we really want to test there is not the system, but the people and the processes. Does someone get paged? Do they know where the runbook is, the dashboards? Do they know where to go find the logs? Do they have access to the right system? Can they log in and debug it? Just going through that exercise is worthwhile for someone that's joined a team.
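
A game day along these lines can be scripted loosely; the point, as Andrus says, is to test the people and the process rather than the system. In the sketch below, every helper passed in (inject_fault, revert_fault, page_was_acknowledged, runbook_reachable) is a hypothetical stand-in for your own chaos tooling and paging or on-call APIs, not a real library call.

    import time

    ACK_DEADLINE_SECONDS = 10 * 60  # how long the on-call has to acknowledge the page

    def run_drill(inject_fault, revert_fault, page_was_acknowledged, runbook_reachable):
        results = {}
        inject_fault()  # e.g. add latency to a staging dependency
        try:
            deadline = time.time() + ACK_DEADLINE_SECONDS
            while time.time() < deadline and not page_was_acknowledged():
                time.sleep(30)  # poll the paging system
            results["page_acknowledged_in_time"] = page_was_acknowledged()
            results["runbook_reachable"] = runbook_reachable()
        finally:
            revert_fault()  # always clean up the injected fault
        return results

The interesting output is not the results dict itself but the conversation afterwards: did the page reach the right person, and could they act on it?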

Being the call leader, that's the other one that Laura talked about there. That's an interesting role, because there are lots of ways to do it. Having a clear process, a clear chain of command, is maybe not the most popular terminology, but it's a benefit to the folks that are on the call. You need someone that's willing to make a call. It's tricky to make a call, and you will never have complete information. You have to exercise judgment and do your best. That's scary. You need someone that's willing to do that as part of the exercise. I also agree pretty heavily that that person is really managing status updates, letting folks know where time and attention are being spent, and managing the changes that are happening, while the experts on the systems each independently go off debugging, diagnosing, and then reporting back what they're learning.

Reisz: What’s incident response look like at Fastly?

Sombra: How do you teach it? I agree with both Laura and Kolton: a designated coordinator is really helpful. It's also helpful to have it as a function; if you already have incident command as a thing that your company does, then that's great, because you can rely on those folks to teach it to others. Since, apparently, I woke up with my management hat on as opposed to my data hat, I will tell you that a critical thing is to make sure that your incident team is staffed appropriately. Because if they don't have time, if they're going from incident to incident, then they don't have time to teach it or develop curriculum, and it's just really hard to spread that knowledge across the organization.

Incident Response at Fastly

Sombra: What does it look like? We have a team that deals with incidents, and they're actually rolling out this training as we go along, and I like hearing great things about it. A technical writer, at some point, is helpful to have in your organizational evolution, because how you convey knowledge in a way that sticks, and how you educate folks, becomes much more of a concern when you want to be able to build these competencies.

Centralized, Consistent Tooling that Covers Security

Reisz: Are you aware of anything out there that offers a centralized mostly consistent tooling capability that covers security?

Sombra: Centralization is hard, the larger you are, and expensive.

A Chaos Engineering Approach to Evaluate AppSec Postures

Reisz: Is there a chaos engineering-like approach for companies to evaluate application security postures? From a security standpoint, can chaos engineering be used to actually test your defense in depth?

Andrus: Yes. Chaos engineering is a methodology by which you go out and cause unexpected events or failures to happen, to see what the response and the result is in the system. That can be leveraged in many ways. When I first started talking about chaos engineering, everyone just said, "Pentesting, that's what you do? We need that." I think there's a world where pentesting is super valuable; it's a different set of problems. I think there's a lot we can learn from security in how we articulate, quantify, and talk about the risk of our systems. Security has to deal with the defender's dilemma in a different way: people may be acting maliciously, you have things coming from the outside, and you can't trust anything. I think we actually have a slightly easier mandate in the reliability space, in that, in general, people are acting with best intentions, and we're looking for unexpected side effects and consequences that could cause issues. The way you approach that is just much different. The whole social engineering side isn't really as applicable, or maybe that's what incident management looks like.

Nolan: We might care about overload and DDoS type attacks, but yes, you’re right, by and large, we’re dealing with entropy and chaos, and not malice, which is nice.

Andrus: That's a good point. One of the most powerful tools in chaos engineering is changing the network, because we can simulate a dependency or an internal or third-party thing failing. You can go in and modify the network, and obviously there are a lot of potential security implications there.

Nolan: You're quite right. One of the great things about doing that testing is you can simulate slow, overloaded networks, which are always worse than something being down.
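
For the network experiments Andrus and Nolan are describing, one low-level way to make a dependency look slow or lossy on a Linux host is tc/netem. The sketch below is illustrative rather than any vendor's tooling: it needs root, the interface name is an assumption, and it should only be run on a host you are allowed to degrade.

    import contextlib
    import subprocess

    @contextlib.contextmanager
    def degraded_network(interface="eth0", delay="200ms", loss="1%"):
        # Add artificial latency and packet loss on the given interface.
        subprocess.run(
            ["tc", "qdisc", "add", "dev", interface, "root", "netem",
             "delay", delay, "loss", loss],
            check=True,
        )
        try:
            yield
        finally:
            # Always remove the netem qdisc, even if the experiment blows up.
            subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"],
                           check=True)

    # Run health checks or a load test inside the block and watch how timeouts,
    # retries, and alerts behave when the network is slow rather than down:
    # with degraded_network(delay="500ms", loss="5%"):
    #     run_health_checks()  # placeholder for your own checks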

Burnout in SRE

Reisz: I want to talk about burnout. Why is burnout a topic that comes up a lot when we're talking about SRE?

Nolan: Burnout in a team is a vicious cycle. It's a self-fulfilling prophecy. It's something that gets worse unless you intervene, and from a systems perspective, that's something that we hate in SRE. The mechanism here is that a team is overloaded and is not able to get itself into a better position, for whatever reason. Maybe it's dealing with a bunch of complicated systems that it doesn't fully understand, or whatever else, or it has 500 alerts. Teams that are in that position will tend to find that people get burnt out and go somewhere else, leaving fewer people with expertise in the system trying to manage the existing set of problems. It gets worse and it spirals. The most burnt out, most on-fire teams that I've ever seen have been ones that were trapped in that cycle where all the experienced people burnt out and left, leaving a bunch of mostly newbies who are struggling to manage the fires and also get up to speed with the system. A team that's burned out is very hard to recover from; you have to really intervene. That's why we have to care about it, just from the sheer human perspective that we don't want to feed the machines with human blood, to use a memorable phrase from my friend Todd Underwood. It's a bad thing to do to people and to our organizations and our services.

Burnout Monitoring and Prevention

Reisz: How do you watch for and get ahead of burnout?

Sombra: I think it's important because it's a thing that is often overlooked, and then it results in things that are not optimal from a decision-making perspective, from the ability to follow a roadmap, from operational reliability. How do I watch out for it? I tend to look for signals that lead me to believe whether the team is getting burned out or not, like on-call: how many pages have you gotten? At the beginning of any team that I manage, or even the teams in my organization, I want us reviewing all of the alerts that have fired, every single planning meeting, every week. I want to know: were they actionable, or were they not actionable? Is it a downstream dependency that is just causing us to get paged for something that we can't really do anything about? To your point about the network, the network is gnarly. The closer you are to it, the more unhappy you're going to be. If there are things that are just constantly paging you, you get tired; you don't have an infinite amount of energy. Conserving and applying that energy is something that becomes part of running a system. This is a factor of your applications too.
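
To make that weekly review concrete, here is a minimal sketch of tallying a week's pages by actionability, assuming whoever was on-call annotates each page; the record shape is a made-up example, not Fastly's tooling.

    def weekly_alert_review(pages):
        # pages: list of dicts like {"alert": "...", "actionable": True/False}
        total = len(pages)
        actionable = sum(1 for p in pages if p["actionable"])
        print(f"{total} pages this week, {actionable} actionable "
              f"({100 * actionable / max(total, 1):.0f}%)")
        for alert in sorted({p["alert"] for p in pages if not p["actionable"]}):
            print(f"  discuss tuning or deleting: {alert}")

    weekly_alert_review([
        {"alert": "downstream_dep_latency", "actionable": False},
        {"alert": "downstream_dep_latency", "actionable": False},
        {"alert": "disk_nearly_full", "actionable": True},
    ])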

Andrus: I don't want to sound like, kids, get off my lawn. When I was early in my career, folks didn't want to be in operations because it was painful, and it sucked. We've been through this last decade where SRE is cool, it's the new hotness. I don't say that because it's a problem; I say it because it's important to recognize this isn't a new problem. It's something that's been around, and it probably comes in cycles, depending on your team or your company. It's always a concern when it comes to operating our systems and doing a good job.

What you just described, Ines, was the first thing I thought of when Wes brought this up. I owned a tool for a short time at Amazon that was the pager pain tool. It tracked teams that were getting paged too much, and people that were getting paged too much, and it went up to the SVP level, because we knew that that impacted happiness and retention and people's ability to get quality work done. Earlier we were talking about ways to help quantify things for management. This is quantifying the negative, similar to quantifying outages, but quantifying that pager pain helps us to have a good understanding. If we see teams that are in trouble, teams that are in the red on that, we should absolutely intervene and make it better, as opposed to just hoping that it improves over time, because it won't.
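
A pager pain rollup of the kind Andrus mentions might look like the sketch below; the budgets and the record shape are assumptions for illustration, not the actual Amazon tool.

    from collections import Counter

    TEAM_WEEKLY_BUDGET = 15   # assumed page budget per team per week
    PERSON_WEEKLY_BUDGET = 5  # assumed page budget per person per week

    def pager_pain(pages):
        # pages: iterable of dicts like {"team": "...", "person": "..."} for one week
        by_team, by_person = Counter(), Counter()
        for p in pages:
            by_team[p["team"]] += 1
            by_person[p["person"]] += 1
        red_teams = {t: n for t, n in by_team.items() if n > TEAM_WEEKLY_BUDGET}
        red_people = {u: n for u, n in by_person.items() if n > PERSON_WEEKLY_BUDGET}
        # Publish the red lists up the management chain, not just to the team itself.
        return red_teams, red_people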

Nolan: I fully agree with what you both said about pager frequency and the number of pages. I also want to call out that we need to be careful about how frequently people are on-call as well, like literally how many minutes on-call they are. You can get to a point, if you're on-call too much, where it just feels like you're under house arrest. If you're on-call for something where you might need to jump on a big Zoom call if something goes down, or where you might need to be at your keyboard within 10 minutes, you're really limited in what you can do in your life. I think that matters as well.

Staffing a Team Appropriately

Sombra: On your point about staffing a team appropriately, it’s completely 100% the way to address that. Like your size of eight is great, if you have eight. Even four is great, because at this point you have a person being on-call every week.

Nolan: Four is survivable, but eight is sustainable I think.

Reisz: What’s the math behind eight, why eight?

Nolan: With eight people, you could be on-call for a week every two months. You've got a lot of time to do engineering work. You've got enough people around that you can get cover if someone is on vacation. You don't have to constantly be on-call for holidays and Christmases.
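
The arithmetic is simple. A quick sketch, assuming one primary on-call at a time on a one-week rotation, and ignoring secondary rotations, vacations, and follow-the-sun:

    def oncall_load(team_size, rotation_weeks=1):
        weeks_between_shifts = team_size * rotation_weeks
        shifts_per_year = 52 / weeks_between_shifts
        return weeks_between_shifts, shifts_per_year

    for size in (4, 6, 8):
        gap, per_year = oncall_load(size)
        print(f"team of {size}: one week on-call every {gap} weeks (~{per_year:.1f} weeks per year)")

With eight people that is one week in eight, roughly every two months; with four it is one week in four.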

Sombra: Use that flexibility to tier as well.

Reisz: Someone is screaming at their screen right now that we don’t have eight people on our teams, how in the world are we going to deal with a problem like that? How do you answer them?

Andrus: Find a buddy team.

Nolan: I know a lot of teams that don't have eight people on them. I'm describing something that I think should be the goal for most teams. For example, at one point we had two teams that were running somewhat related services and that were both too small to really be sustainable, we were looking at two rotations of three people, so we merged those teams. It made sense technically, because there was an awful lot of shared concerns, and it made sense in terms of a more sustainable rotation for both sides of the team. Things like that can happen.

Organizationally, I think a lot of organizations optimize for very small, very siloed, specialized teams. In terms of sustainability, it can be better to go a little bit bigger. Yes, everyone has to care about a slightly bigger, more complex set of services; that's the tradeoff. But if it means that six or seven weeks out of eight you can go for a swim in the evening, that's a really nice thing in terms of whether people can spend 20 or 30 years in this career. I can spend 30 years in a career if I'm on-call once every two months. Can I do it if I'm on-call once every three weeks? I probably can't.

Sombra: Another thing about that number eight is that it requires a tremendous amount of funding as well. If you don't have it, to Laura's point, you do things like what she described. For example, not everyone in my organization has a team that is staffed that large; that is a very large team. Every time I think about that, my mind goes to one particular person who is the subject matter expert for one particular data pipeline that is 100% critical. Right now, I live and breathe trying to get this person a partner and a pair. At that point, you pitch in as an organization with a greater group, so you try to make it a we problem as well.

Reisz: I like that: make it a we problem.

Key Takeaways

Reisz: I want to give each of you a chance to give your one big point, the one big takeaway that you'd like everybody to take from this panel. What do you want people to take away?

Andrus: There are a couple of phrases I hear a lot when I have an opportunity to educate. One of them is, how do you quantify the value of your work to management? How do you help them understand that a boring system doesn't come for free, and that's what you want? I talk to a lot of less mature organizations that are earlier in their journey, and I hear a lot of, we have enough chaos. That's always a frustrating one to me, because it's the perception that chaos engineering is about causing chaos. This is a problem with the name; it's a bit of a misnomer. We're here to remove chaos from the system. We're here to understand it and make it more boring. When people say we're not quite ready yet, to me, that's like saying, I'm going to lose 10 pounds before I start going to the gym. The only way to get ready is to go out and start doing it. That's where that practitioner's, skin-in-the-game opinion of mine comes in: the best way to learn, in many cases, is to find safe ways to practice. We don't want to cause an outage, but get out there and get your hands dirty in the system, so that you really understand how it works. I think there's a lot of value in the academic exercise of mapping out what the system looks like, but the devil is always in the details. It's always some thread pool you didn't know about, some timeout that wasn't tuned correctly, some property that got flipped globally that no one was expecting. It's the things that aren't going to show up on your whiteboard that are going to cause you outages.

Nolan: I will follow on directly from that point. One of the questions that we didn't quite get time to talk about was, how should SREs and engineers engage with each other? What's the engagement model? What I would say to that is, we need to remember that SREs are engineers. If your SREs are only doing ops work and reactive work, they're not doing engineering, and that's one of the things that will head you towards burnout. One of the best ways to describe what it is that SREs do engineering-wise, if we don't do feature engineering work, is that we do the engineering that makes the system boring. Whatever makes your system exciting right now, make it way more boring. The things that worry you about your system, the things that keep you up at night, the risks your team knows are there: engineer those out of your systems. That's what we do.

Sombra: My main takeaway is that we tend to think about this in terms where it's fancy: if you're a large company, you do all of this. It's not necessarily about fancy; it's about processes that you can do even if you're not at the scale of a Google or a Netflix, or all of the companies whose monikers we tend to adopt a few years later. These things are important. It's about being able to reason about, fund, and invest in the things that give you confidence. The last point is that you're not done. This is not a thing that is ever finished. Understand that, incorporate it, and then live through it. It's a process of continuous iteration, continuous improvement. There's not a box to check. If you approach it with a box to check, then you're always going to be surprised, you're going to be reactive, and you're going to burn your team. You're never finished. Go to the gym every day.
