Article originally posted on InfoQ.
Transcript
Rosenbaum: I am Sasha Rosenbaum. I'm a Director of Cloud Services Black Belt at Red Hat. Since you are on the effective SRE track, I presume you've heard of SRE before, and that you've heard some definition of toil. I'm going to start this presentation by giving one more definition, which is pretty common in the industry: toil is work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows. We also know that most SRE teams aim to keep toil under 50% of their toil-work balance, because toil is considered a bad thing that we want to minimize. Then, I want to ask you a question that comes from one of my awesome coworkers, Byron Miller. The question is, if an employee is told that 50% of their work has no enduring value, how does this affect their productivity and job satisfaction? Just let that sink in: if we're saying that an SRE team usually has to spend about 50% of its time on toil, but we're also saying that toil has no enduring value and is essentially not work that can get you promoted, that can get you raises, that can get you rewards, then how are we contributing to the burnout that we all know SRE teams are experiencing?
Now that I've posed this question for you, I'm going to jump into setting the background for the rest of the presentation. I've called this presentation The Eternal Sunshine of the Toil-less Prod. I've done a lot of things in this industry; I think I'm coming up on 18 years in the industry, something like that. I've done a degree in computer science. I started off as a developer. I gradually was exposed to more ops type of work. Of course, I got involved in DevOps and DevOpsDays the moment it came out. I've done a bunch of other things such as consulting for cloud migrations, DevRel, and technical sales. You can see that I tend to get excited about new things and jump into them.
About Red Hat
I want to also introduce Red Hat. Everybody knows Red Hat as the company that provides Red Hat Linux, but we've also been in distributed computing for a really long time. This slide actually doesn't start early enough. We've been involved in providing OpenShift since 2011, which gives us over 10 years of experience in managing highly available, production-grade distributed systems. We're currently providing OpenShift as a managed service in partnership with a number of major public clouds. We're running at a pretty high scale. It's still not cloud provider scale, but it's a pretty high scale.
We have been on a journey that some companies in the industry have definitely been on, which is moving from providing products to providing services. We used to essentially ship software, and once we shipped it, it was the client's responsibility to run it. We've been shifting towards running the software for our clients. Of course, like some other companies in the industry, we're providing both products and services at the same time at this moment, which makes it a very interesting environment for us in terms of how we ship and how we think about developing software.
The SRE Discipline
I would like to share some of our SRE experiences and some of the lessons learned along the way, as well as set the stage for what I think is the most important thing about SRE as opposed to other things we've tried before. What's the most important and innovative thing about the SRE discipline? Because we've been doing ops, and then DevOps, and then whatever we call it today, to provide service to our customers for a really long time. We're now talking about SRE being a game changer in some ways. What changed? I think, personally, that SRE is about providing explicit agreements that align incentives between different teams, and between the vendor and the customer. I'm going to dive into what makes SRE different from the stuff we've done before.
Service Level Agreement (SLA), Service Level Indicator (SLI), and Service Level Objective (SLO)
Probably everyone is familiar with some version of the SLA, SLI, and SLO definitions. I'm going to define not so much the what but the why of these indicators. SLA is financially backed availability. It's a service level agreement. We've been providing this for decades. Every vendor that provides a service to a customer always has some SLA. This is very familiar. This is an example from one of the Amazon services, and it is relatively standard in today's industry. It's a 99.95% SLA. You can see that if the service availability drops below 95%, the client gets a 100% refund. Basically, if the downtime is more than roughly one-and-a-half days a month, the client gets a 100% refund. What's important to think about here is that the SLA is about aligning incentives between vendor and customer. As a customer, if I'm buying something from you, I want to have some type of guarantee that you're providing a certain level of service, and SLAs are about financial agreements. At Red Hat, for instance, when we first started offering managed OpenShift, we had a 99% SLA, and that wasn't enough. Then we gradually moved, over a year, towards a four nines SLA, which is a higher standard than some of the services in the industry. Keep in mind that SLAs usually include a single metric. They usually measure one single thing, which is usually uptime. For financial and reputational reasons, we want to under-promise and over-deliver. When we're promising four nines, we actually want our actual availability to be higher, because we never want to be in a situation where we have to provide a refund to our customers.
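To make that arithmetic concrete, here is a minimal Python sketch that converts an availability target into allowed downtime per 30-day month and maps measured availability to a service-credit tier. The tier thresholds mirror the AWS-style example discussed above and are purely illustrative, not any vendor's actual contract terms.

```python
# Minimal sketch: convert an availability target into allowed downtime,
# and map measured availability to a service-credit tier.
# The tiers below are illustrative (modeled on the AWS-style example
# discussed above), not any vendor's actual contract terms.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def allowed_downtime_minutes(target: float) -> float:
    """Downtime budget per 30-day month for a given availability target."""
    return MINUTES_PER_MONTH * (1 - target)

def service_credit(measured_availability: float) -> int:
    """Percent refund owed for a month, per the illustrative tier table."""
    if measured_availability < 0.95:
        return 100   # below 95%: full refund
    if measured_availability < 0.99:
        return 30    # below 99%
    if measured_availability < 0.9995:
        return 10    # below the 99.95% commitment
    return 0         # SLA met, no credit owed

if __name__ == "__main__":
    for target in (0.99, 0.9995, 0.9999):
        print(f"{target:.2%} allows ~{allowed_downtime_minutes(target):.0f} min/month of downtime")
    # 95% availability is roughly a day and a half of downtime per month:
    print(f"95.00% allows ~{allowed_downtime_minutes(0.95):.0f} min/month "
          f"-> credit owed: {service_credit(0.949)}%")
```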
SLO is probably the most important indicator. SLO is targeted reliability. The interesting part about SLOs is that if you're running a good company providing a good service, you're usually measuring SLOs around a lot more things than you're measuring your SLA around. The SLA was a single number. At Red Hat, we measure all these SLOs and more. We look at a lot of different metrics in terms of our ability to assess our own reliability. Then SLI, the service level indicator, is the actual reliability; it measures actual reliability. Again, it usually covers many more metrics than just uptime. The important part about SLIs, which people often forget, is that they require monitoring. If you're not monitoring your services, then you have no idea what your actual availability or reliability is. You actually can't say whether you're breaking your SLAs or not. In addition to that, you also have to have good monitoring. If your monitoring is very basic, say you just have Pingdom pointed at your service, then all you know is whether your service is returning a 200 OK. That's not enough, because customers don't usually come to your website or your application just to load it and get a 200 OK response; they come to get a service from your company. Without good monitoring, you don't know if the service does what the users expect it to do. It's important to design your SLAs and SLOs around things that actually tell your company whether the service you're providing is meeting your customers' expectations.
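As a small illustration of the difference between a bare 200 OK check and a more informative SLI, here is a Python sketch that computes a request-success-ratio SLI where a request only counts as "good" if it both succeeded and was fast enough. The Request record and the 500 ms latency threshold are hypothetical stand-ins for whatever your real monitoring pipeline produces.

```python
from dataclasses import dataclass

# Hypothetical request record; in practice this would come from your
# monitoring/metrics pipeline, not a hand-built list.
@dataclass
class Request:
    status_code: int
    latency_ms: float

def is_good(r: Request, latency_slo_ms: float = 500) -> bool:
    """A request counts as 'good' only if it succeeded AND was fast enough.
    A bare 200 OK ping would miss the latency dimension entirely."""
    return r.status_code < 500 and r.latency_ms <= latency_slo_ms

def availability_sli(requests: list[Request]) -> float:
    """SLI = good requests / total requests over the measurement window."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if is_good(r))
    return good / len(requests)

# Usage: compare the measured SLI against the SLO target.
window = [Request(200, 120), Request(200, 900), Request(503, 80), Request(200, 200)]
sli = availability_sli(window)
slo = 0.9995
print(f"SLI={sli:.4f}, SLO={slo}, meeting objective: {sli >= slo}")
```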
Then, the other thing that's very important about SLIs is the signal-to-noise ratio. If you are looking at a lot of noise, the signal drowns in it, and so you are not going to be able to distinguish between things that are actually a problem and things that are not. In Red Hat's case, for instance, early on we had a major monitoring problem. We are providing availability for customers' clusters, and monitoring customers' clusters, but a customer can take a cluster offline intentionally. Early on, we would get a lot of alerts, and we would wake people up at night to deal with them. We had to learn to identify when the shutdown of a cluster was intentional versus unintentional, so that we could reduce the noise on these alerts. Without good monitoring, you're potentially overloading your SREs with unwarranted emergencies, and then you don't recognize real incidents. If you're dogfooding at your company, then, periodically, incidents may even be caught by internal users. Your monitoring system might not identify the incident, but someone in your company calls you and says, my cluster is down. This is fine, as long as you implement improvements to your monitoring system. Whenever you catch a problem this way, the ideal outcome is that you modify your monitoring systems so they track that problem in the future.
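To illustrate the kind of signal-to-noise filtering described here, below is a rough Python sketch of an alert-routing check that suppresses "cluster unreachable" pages when the shutdown was intentional. The alert and cluster-state fields (a "Hibernating" power state, a maintenance-window flag) are hypothetical, not Red Hat's actual schema.

```python
# Rough sketch of routing a "cluster unreachable" alert only when the
# outage is unexpected. The alert/cluster fields here are hypothetical,
# not Red Hat's actual schema.

def should_page(alert: dict, cluster_state: dict) -> bool:
    """Page a human only for unintentional downtime."""
    if alert["name"] != "ClusterUnreachable":
        return True  # other alerts follow their own routing rules
    # Customer intentionally powered the cluster down via the API/console.
    if cluster_state.get("desired_power_state") == "Hibernating":
        return False
    # Planned maintenance window registered for this cluster.
    if cluster_state.get("in_maintenance_window", False):
        return False
    return True

# Usage: an intentional shutdown gets logged for review, not paged at 3 a.m.
alert = {"name": "ClusterUnreachable", "cluster_id": "abc123"}
state = {"desired_power_state": "Hibernating"}
print("page on-call:", should_page(alert, state))  # -> False
```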
I'm going to come back to SLO, because SLO is probably the key metric that was introduced by the SRE discipline. What's important about it is that it's business-approved reliability. We used to say that we're striving for 100% reliability, all the time. Now, as an industry, we came to the realization that it's unattainable, unnecessary, and of course, extremely expensive. Even five nines. Five nines gives you 5.26 minutes of downtime a year. This used to be a holy grail that a lot of people strived for. Actually, will your users even notice that you're providing a five nines level of service? The resounding answer is that they probably will not, because if you're providing a web application, your users' internet service provider background error rate can be up to 1%. You can be striving to provide five nines, but in reality, your users never actually experience that level of service; to them it looks closer to four nines anyway, no matter how hard you're working on it. SLOs are actually about explicitly aligning incentives between business and engineering. We go into this with our eyes open, and we tell the business, the level of availability or reliability we want to provide is four nines, or 99.95%. Then we argue about whether that's enough, whether that's a good level of service, whether that's what's acceptable in the industry right now, and whether this is something we can reasonably expect to provide to our customers. Then, if we have some level of downtime, we have this agreement that gives us the ability to talk about it, not as "ops is bad because there is some downtime in the system," but as "we have stayed within our desired SLO."
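To put numbers behind that claim, here is a rough Python sketch showing both the five-nines downtime arithmetic and why, once background network errors are factored in, users can barely tell five nines from four nines. The error rates used are illustrative assumptions, not measurements.

```python
# Sketch: why extra nines may be invisible to users. If some requests are
# also lost in transit (ISP background errors), the availability a user
# observes is roughly the product of the two success rates. The error-rate
# values below are illustrative, not measurements.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability)

def user_observed_availability(service: float, isp_error_rate: float) -> float:
    return service * (1 - isp_error_rate)

print(f"Five nines = {downtime_minutes_per_year(0.99999):.2f} min/year of downtime")  # ~5.26

# With even a 0.01% background error rate, five nines and four nines look
# nearly identical from the user's side; at 1% the difference vanishes entirely.
for isp_error in (0.0001, 0.01):
    for service in (0.9999, 0.99999):
        observed = user_observed_availability(service, isp_error)
        print(f"service {service:.3%}, ISP errors {isp_error:.2%} -> user sees ~{observed:.3%}")
```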
Error Budgets
This brings in the next metric, which is error budgets. An error budget is an acceptable level of unreliability. The error budget is defined as one minus the SLO. To give you an example: on a quarterly basis, an error budget at four nines gives you 0.01% of downtime, which is about 13 minutes of downtime a quarter. In those 13 minutes, you can either have an incident that is very quickly addressed, or you can take on some downtime while you're delivering updates to your system. Everybody recognizes that those 13 minutes are not a problem.
This is the budget that you have for unreliability. Error budgets are about aligning incentives between developers and operations. If developers are measured on the same SLO that SRE people are measured on, then when the error budget is drained, ideally, your developers start pushing updates more slowly and testing them more thoroughly, because otherwise they know they will blow the error budget and not get their bonuses. As long as SRE is the only team incentivized to keep the SLO or SLA, you always have this problem of developers and product managers pushing to move as fast as possible, while SRE is trying to slow things down so changes can get tested, verified, and not produce downtime. If you are actually measuring your product managers and developers on the same SLO, you are eliminating that problem. In DevOps, we used to talk about a culture of collaboration and working together, but not about working towards the same incentives. Measuring people on the same SLO actually gives you a way to write that incentive down, instead of just talking about how culturally you want to align. Writing things down helps. I love this quote by William Gibson, "The future is already here, it's just not evenly distributed." We have companies who are doing an excellent job at SRE. We have companies that are struggling. Most companies are actually somewhere in between, with some pockets of excellence, and some teams that are struggling, or some services that are really difficult to provide good reliability for.
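As a sketch of how an error budget can become an explicit, shared gate rather than a cultural aspiration, here is a small Python example that computes the quarterly budget for a four-nines SLO (about 13 minutes) and decides whether a routine release should proceed. The "pause routine releases when less than 10% of the budget remains" policy is an illustrative assumption, not a prescribed process.

```python
MINUTES_PER_QUARTER = 91.25 * 24 * 60  # ~131,400 minutes

def error_budget_minutes(slo: float) -> float:
    """Error budget = 1 - SLO, expressed as downtime minutes per quarter."""
    return MINUTES_PER_QUARTER * (1 - slo)

def remaining_budget(slo: float, downtime_minutes_so_far: float) -> float:
    return error_budget_minutes(slo) - downtime_minutes_so_far

def can_ship_routine_release(slo: float, downtime_minutes_so_far: float,
                             min_fraction_remaining: float = 0.1) -> bool:
    """Illustrative policy: pause routine (non-emergency) releases once less
    than 10% of the quarterly error budget remains, so developers share the
    incentive to slow down and test more thoroughly."""
    budget = error_budget_minutes(slo)
    return remaining_budget(slo, downtime_minutes_so_far) >= budget * min_fraction_remaining

slo = 0.9999  # four nines
print(f"Quarterly budget: {error_budget_minutes(slo):.1f} minutes")             # ~13.1
print("Ship after 5 min of downtime? ", can_ship_routine_release(slo, 5.0))     # True
print("Ship after 12.5 min of downtime?", can_ship_routine_release(slo, 12.5))  # False
```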
What We All Got Wrong
I want to talk about what I think we all got wrong. One of the things that I think we all got wrong is the definition of what site reliability is. Unfortunately, the first book defines SRE as what happens when you ask a software engineer to design an operations team. That's a very elitist and unfortunate take. As a former developer, I wholeheartedly disagree with it. We have been talking at DevOpsDays, since at least 2009, or probably a lot longer if you ask the ops people, about automating yourselves out of a job, or actually automating yourselves into a better job. The logical question is, why couldn't we do it before Google came in and said, let's assign developers to operations? Effective automation requires consistent APIs. This is something we did not have. OS-level APIs were not widely available across the market. Only 27% of the server market used to be Linux based, and Linux is relatively automatable. Windows was, by design, not automatable; it was an executable-based OS. Windows makers actually believed that people need to be clicking buttons, and that this provides the best system administration experience. This brings in one of my favorite transformation stories, with Jeffrey Snover, who pushed for shipping PowerShell as part of Windows: a CLI scripting language which allows automation of parts of the Windows operating system. He had a very interesting journey arguing with Microsoft executives for many years about why admins actually want automation and CLIs. He did succeed in the end, and so did many other people pushing for automation. Every wave of automation enables the next wave of automation.
Then we started seeing infrastructure-level APIs. We used to have to click buttons and manually wire servers in data centers, and it was all really manual work that couldn't be automated. When Borg was first designed (this is also from the Google SRE book), central to its success and its conception was the notion of turning cluster management into an entity for which API calls could be issued. In all these companies, Amazon, Azure, Google, all the cloud providers, the infrastructure was designed with automation in mind. Everything in cluster operations and VM operations could be automated and driven by an API call, so that you didn't have to go and physically interact with those servers, whether you needed a failover, a restart, bringing a new server online, or anything like that. Because of this push, we also started seeing other companies bring this to on-prem data centers. We started seeing infrastructure-as-code automation, which enabled the same thing for more traditional data centers. The overall push was to enable consistent APIs in the industry. We did not suddenly get the idea that infrastructure and platform automation were a good idea. We gradually built the tools, across the industry, required to make that automation happen.
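To make "cluster management as an entity for which API calls can be issued" concrete, here is a short sketch using the official Kubernetes Python client to cordon a node through the API, the kind of operation that once meant touching a server by hand. It assumes the kubernetes client library is installed and a kubeconfig with sufficient permissions is available; the node name is a placeholder.

```python
# Sketch: cordoning a worker node through the Kubernetes API instead of
# logging into hardware. Assumes the official `kubernetes` Python client
# is installed and a kubeconfig with sufficient permissions is available.
from kubernetes import client, config

def cordon_node(node_name: str) -> None:
    """Mark a node unschedulable via an API call (equivalent of `kubectl cordon`)."""
    config.load_kube_config()          # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    body = {"spec": {"unschedulable": True}}
    v1.patch_node(node_name, body)
    print(f"Node {node_name} cordoned; new workloads will be scheduled elsewhere.")

if __name__ == "__main__":
    # Placeholder node name; in real automation this would come from
    # monitoring signals (e.g. a failing health check) rather than a human.
    cordon_node("worker-node-1")
```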
Why does this matter? In my opinion, if we get the origin story wrong, we end up working to solve the wrong problem. That's a huge problem. Corollary 1 of this: hiring developers to do operations work does not equal effective SRE. This is a mistake that I see many companies make. Even at Red Hat, we started out saying, all we need to do to run SRE is just hire people with developer experience. That did not go very well. We eventually arrived at the conclusion that we want to hire well-rounded folks. We want to hire developers who have done some ops work before, or operations people with a mind for automation and coding, or QE people who have been exposed to the developer experience. An overall well-rounded expertise and a desire to solve problems is a better profile for hiring SREs.
Corollary 2 is that the desire to automate infrastructure and platform operations is insufficient. You need consistent APIs and reliable monitoring to unblock the automation. Even if you hire the best SREs, if you're giving them a platform that cannot be automated, they're not going to be able to automate it. Basically, what you need to do is provide them with the tools to make this automation possible. One example of this: early on at Red Hat, we had to move the cloud services build system from on-prem to the cloud, because it wasn't automatable or reliable enough to meet the targets of the new cloud services. It worked fine for shipping on-prem at a much slower rate, but it stopped working once we wanted to move at the speed of the cloud. You need to base your systems on infrastructure and platforms that provide the ability to automate them.
The second thing, and this is what we started this presentation with, is that we got this idea that toil is unequivocally bad. We're again saying it is devoid of enduring value, that it doesn't actually provide us any benefits, and that we want to limit it. I've even heard people say that we want to completely eliminate toil, or that we want to limit toil to 20% instead of 50%. That has been voiced by certain teams in certain companies. My question is, are we striving for a human-less system? Should we be striving for a human-less system? I want to bring up the only thing I remember from physics, and that is the second law of thermodynamics. It's both very educational and also highly depressing. It says that with time, the net entropy, which is the degree of disorder, of any isolated system will increase. We know that every system left to its own devices will, over time, trend towards more disorder. We know this on a basic level: if we leave something alone, it will gradually be degraded by the forces of natural chaos. We know that entropy always wins.
I want to bring in a definition from Richard Cook, recently deceased, who did a lot of work on resilience, and a lot of work on explaining the effects of resilience in IT. There's this concept of being above or below the line of representation. Above the line of representation are basically the people who are working to operate the system. People working above the line of representation continuously build and refresh their models of what lies below the line. That activity is critical to the resilience of internet-facing systems and the principal source of adaptive capacity. If we look at the things that humans are doing above the line, we see observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, and reacting. What does this sound like? It sounds suspiciously like what we call toil. If we talk about resilience, and this is a quote from an absolutely excellent talk from 10 years ago by Richard Cook, these are the metrics we look at for understanding systems' adaptive capacity and resilience: learning, monitoring, responding, and adapting. What we call toil is a major part of the resilience and adaptive capacity of our systems. Perhaps we need a better way to look at toil, and perhaps we need to stop saying that it is so detrimental, that we need to minimize it, that it's completely evil. We know that SRE folks worry that if they spend significant parts of their day focusing on toil, it will negatively affect their bonuses and chances of promotion. Everybody wants to write code, because that's what gets you promoted. Again, I'm coming back to the quote from Byron: if an employee is told that 50% of their work has no enduring value, how does this affect their productivity and job satisfaction? I think, in general, toil gets us a lot of learning experiences and is critical to building our adaptive capacity and resilience. We need to restructure SRE teams so that people are encouraged to do some of the toil work and rewarded for doing it.
SRE Work Allocation (Red Hat)
This brings me to one more story from Red Hat, about SRE work allocation. We've been trying out different modes of work allocation over the years. We actually started with naming our teams SRE-P and SRE-O. One of the teams was essentially doing most of the development; the other was essentially doing most of the ops. What does this sound like? Of course, it sounds like traditional IT with Dev and Ops, the wall of confusion, and people throwing tickets at each other. Of course, it didn't work very well. We proceeded from that, and we said, we are dealing with too much ops work; we want to reduce ops work, because it's so detrimental. We want to put people on call at most once a month, and all the rest of the time they will be doing more developer-oriented work. Then, in a surprising turn, SRE teams went to management and asked to be on call more, because they were forgetting how to be on call, they were forgetting their operational experience with the system, and once a month wasn't enough. They actually wanted to be in the on-call rotation a little bit more. Another work allocation that we tried was rotating engineers through toil reduction tasks. We said, ok, we're rotating engineers through ops work, so we're also going to rotate engineers through automation-related implementation work. You could probably predict this, but sometimes smart people make less than good decisions, and the lack of continuity severely impacted the SRE teams' ability to deliver on those toil reduction and automation tasks, because the tasks kept being reassigned from one engineer to another without sufficient context. That significantly slowed us down. In terms of work allocation, Red Hat is still looking for the perfect system. We have slightly different practices on different teams, and we are constantly learning to see what works best. I think this is no different from most companies in the industry, who are continuously trying new things and trying to improve the SRE discipline.
Where Do We Go from Here?
The next question is, where do we go from here? I want to emphasize a couple of insights that we've arrived at. Effective automation requires consistent APIs. I'm a big proponent of cloud. I think cloud provides us with an industry standard for consistent infrastructure-level APIs. If you are not in the data center management business, the choice, for me, would be really clear today: you can go to the cloud and let your cloud provider manage infrastructure for you. Also, I think Kubernetes is a wave that the industry is riding. From last year's Red Hat Open Source Report, 85% of global IT leaders look at Kubernetes as a major part of their application strategies. Kubernetes could provide the industry standard for a consistent platform-level API. I don't think it quite provides that yet. If building a PaaS isn't your company's core business, allow your provider to toil for you. I'm biased and I'm paid to present this slide, but we are providing managed services, and so are some other people in the industry. You can outsource your infrastructure services and your platform services to your provider. Then your operational work, your toil work, is in the software services that are a key, core component of your business value. We said toil wasn't necessarily evil, but if company A automates most of the basic infrastructure and platform tasks, and its toil is reduced to only operating business-critical applications, it is going to do a lot better than the company that still toils just to deploy new software. I would advise you to get your skills above the API and toil in the space that provides your company with business value.
If building a PaaS is your company's core business, then your situation is a little bit different. Then, I would really remember that SRE is about explicit agreements that align incentives between different teams. You want to explicitly write down your SLOs and understand exactly what you're measuring. You want to explicitly write down your error budgets and make sure that you actually stop development when you blow your error budget, and so on. You want to leverage the tools that the SRE discipline gave us to make your SRE team's life better. Focus your toil where your business value is. I believe that ideas are open source, so Red Hat is starting a new initiative called Operate First. This is a concept of incorporating operational experience into software development. We have a place where we hope the community can come and share their experiences and talk about best practices for SRE teams.