Presentation: CI/CD for Machine Learning

MMS Founder
MMS Sasha Rosenbaum

Article originally posted on InfoQ. Visit InfoQ


Rosenbaum: This is the video of a machine-learning simulation learning to walk and facing obstacles, and it’s there only because I like it. Also, it’s a kind of metaphor for me trying to build the CI/CD pipeline. I’m going to be talking about CI/CD for machine learning, which is also being called MLOps. The words are hard, we don’t have to really define these things, but we do have to define some other things and we’re going to talk about definitions a lot actually.

I’m going to start by introducing myself. I’m on the left, this picture is from DevOpsDays Chicago, our mascot is a DevOps Yak. It’s not a Chicago bull, it’s a yak, and it’s awesome. You can come check out the conference. I work for Microsoft on the Azure DevOps team. I come from a developer background, and then, I did a lot of things with DevOps CI/CD and such. I’m not a data scientist, I did some classes on machine learning just so I can get context on this, but I’m coming to this primarily from a developer perspective.

I also run another conference, this is a shameless plug, it’s DeliveryConf, it’s the first year it’s happening, it’s going to be in Seattle, Washington, on January 21 and 22. You should register for it right now because it’s going to be awesome.

The first thing I want to do is I want to set an agenda. An hour is a long time to be here, so I want to set expectations for what we’re in for. I’m going to talk about machine learning and try to define what machine learning really is and what automation for machine learning could look like. Then, we’re going to talk about a potential way to implement that automation. Then, I’m going to demo the pipeline and hope that it works.

This slide is here for me so that I remember that I need to trigger my pipeline. I’m going to go in here and make a super meaningful change. We can go back to the slides, because this thing takes a while to run.

What Is MLOps

Let’s talk about what is MLOps and why should you care about it. Machine learning is the science of getting computers to act without being explicitly programmed. It’s different than traditional programming because in traditional programming we define the algorithm, we state the if/else all of the decision tree and all of that. In machine learning, we don’t do that. We let the machines teach themselves about what the algorithm should look like.

This first came up about 50 years ago, and we call the whole field artificial intelligence. People think about Skynet and AI, a general intelligence controlling us. Mostly, it’s a lot simpler than that: it’s narrow intelligence, a subset of that is machine learning, and a subset of that is deep learning. Deep learning is when we get into things like processing images and identifying them. When I went to college, a lot of people said that computers are never going to be good at identifying images, and now I’m being [inaudible 00:05:15] doing all my pictures.

Machine learning is clearly on the rise. I don’t have pretty Forrester reports, so this is anecdotal evidence: machine-learning searches have overtaken DevOps searches. Also, if we look at Stack Overflow, we can see that just 10 years ago Python used to be the least popular of the 5 languages searched for, and now it’s the most popular language, and that’s primarily because of machine learning. Also, I talk to a lot of customers, and every single one of them is doing machine learning.

Why Should You Care?

Why should you care? This is a room full of developers and CI/CD professionals, so why should you care about MLOps? Shouldn’t data scientists be the ones who care about MLOps?

This is a tweet from someone at a different conference, but it says, “The story of enterprise machine learning is: it took me 3 weeks to develop a model, and it’s been 11 months and it’s still not deployed.” This is really the case for most of the people that I work with. Their experience is that it takes a really long time to deploy these models. The best-kept secret is that data scientists mostly want to do data science.

The thing is, people go to school for a really long time and learn how to do these things. “This is deep learning, this is some deep neural network,” and “This is backpropagation, and I tried to understand it and it was really hard.” People really spend a lot of time trying to understand how to make these algorithms better. It’s a whole job. It’s not a job where it’s just putting it in production; it’s two separate jobs. Someone actually needs to go and put this in production after we’ve developed it. For most data scientists, the work environment looks like this. We talk about [inaudible 00:07:31] in our books, or we talk about just developing scripts in Python or Scala, whatever it is. Now I can see that a lot of people actually have source control, which is awesome, but from there to production is a very long way.

The problem that I see most is that data scientists don’t want to do ops, and ops don’t really understand the challenges of data science and how different it really is. If we talk about how it all works in programming, we develop the algorithm, we give the algorithm data, and then it gives us answers. When we switch to machine learning, we actually start with answers and data, and then we produce an algorithm. This is also called a model, and we can give that model new data, and then it will try to predict the future for us.

If we did that, how would we get into production with this whole thing? For that, I want to dive even deeper into this and say, “Ok, so what is actually an ML model?”

What Is an ML Model?

A lot of times we ask, “What is this abstract object that we’re talking about?” We’ve just said that a lot of times ML is very complicated and takes a long time to learn, but it’s not always the case. Most people that do ML out there are actually doing something fairly simple that could be done with an Excel spreadsheet. We’re doing a lot of linear regression. In this particular example, we’re looking at housing prices in a particular area over time. I build an equation that allows me to predict what the housing price is likely to look like in the future. This is machine learning.
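To make that concrete, here is a minimal sketch of that kind of linear regression, fitting price against year with the closed-form least-squares solution. The housing prices are made-up illustrative numbers, not real data.

```python
# Ordinary least-squares fit of price = a * year + b, standard library only.
# The data points below are invented for illustration.
years = [2010, 2011, 2012, 2013, 2014, 2015]
prices = [200_000, 210_000, 222_000, 231_000, 245_000, 254_000]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(prices) / n

# Slope and intercept from the closed-form least-squares formulas.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, prices)) / \
    sum((x - mean_x) ** 2 for x in years)
b = mean_y - a * mean_x

def predict(year):
    """The 'model': a mathematical formula with parameters learned from data."""
    return a * year + b

print(round(predict(2016)))  # extrapolate one year past the data
```

The trained "model" here is nothing more than the two learned parameters `a` and `b`, which is exactly the point of the section that follows.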

Also, this is machine learning. If I have something more complicated, I can have a vector as an input and a vector as an output. Then, if I look at image classification, for example – this is not the only way that it works, but one of the ways it’s broken down – it would just break the image down into RGB matrices of pixels. This is a huge array of input numbers that I take in, and then, at the end of it, I just try to say, “Does this picture have a cat or does it not have a cat?”
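As a tiny illustration of that breakdown, here is a sketch of turning an RGB pixel grid into the flat numeric vector a classifier would consume; the 2x2 "image" is obviously a toy stand-in for a real photo.

```python
# A toy 2x2 RGB "image": each pixel is an (R, G, B) triple of 0-255 ints.
# Real classifiers start from exactly this kind of numeric grid, just far larger.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (128, 128, 128)],
]

# Flatten the pixel grid into one long input vector, scaled to [0, 1] --
# the huge array of input numbers the cat/not-cat model takes in.
features = [channel / 255 for row in image for pixel in row for channel in pixel]

print(len(features))  # 2 * 2 * 3 = 12 input numbers
```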

Basically, a machine learning model is a definition of the mathematical formula with a number of parameters that are learned from the data. Basically, we define a mathematical formula. Now, it comes in some formula, we didn’t talk about the formula, but there is a mathematical formula that we got from the model training that we’re now trying to deploy. This is great news because usually that means that we know what to do with it. We’re, “Ok, we’re just going to create an API endpoint. We’re going to serve a service and this is the solution to all of our problems and this is going to be great.” The only thing is, these models come in different shapes and forms and training takes a long time and all these things. We define the problem, we want to get into this API that we’re serving, and that would probably be a good thing.

Do Models Really Change that Often?

When I was talking to someone about this conference, they asked me a question “Do the models really change that often? Do we really need a process for automation of models? Maybe it’s ok that it took me 11 months to put in production because it really doesn’t change that much.”

I just want to give one example. This is a Reddit thread with 13,000 responses, so it must’ve resonated. It says, “Facebook’s list of suggested friends is quite literally a list of people I’ve been avoiding my entire life.” This was absolutely true for a couple of months and it was highly annoying, but then it went away. This is machine learning predicting who my friends are. That model got better, and so are the models that are matching you with Uber drivers and things like that. If Uber, say, made a mistake and published a model that was highly inaccurate and matched you with a driver in a different city, you really wouldn’t want to wait a couple of weeks until it updated that model. You really want it to be fixed as soon as possible.

The Dataset Matters

The other challenge that we have that I want to talk about is that data set is highly important here. That’s just a very different aspect because code is usually light and it’s easy to check in and it’s easy to version, but then, when we talk about data sets, they can be very large and very unwieldy and very difficult to automate.

I did this as an exercise before, and this is pretty cool – what we know about the world depends on what we know from the data. This is a cat, and this is not a cat. This seems straightforward. Then I can get into, “This is also a cat, and also this is a cat.” I need to be able to identify this picture as well as I identify the other one. If I didn’t have any examples of a picture like that, then I might not know that about the world. Then this is a cat. This might or might not be a cat, depending on your definition. Also, if you come from a machine-learning perspective – again, I’m looking from pixels that are in the matrix, “Is this a cat?” I don’t know. If I’ve never seen a picture of a fox, maybe I think it is a cat. Also, we can get into weird stuff too.

The point I’m trying to make here is that model predictions will highly depend on what data the model has seen. I cannot talk about versioning the model independent of the data set. I have to have some type of way to check in my data set as part of the model version.
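One lightweight way to tie a model version to the exact data it saw, sketched here with only the standard library, is to record a content hash of the data set alongside the model's version metadata. The file contents and version label below are placeholders, not a real registry format.

```python
import hashlib
import json
import os
import tempfile

def dataset_fingerprint(path, chunk_size=1 << 20):
    """Content hash of a dataset file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo with a throwaway file standing in for a large dataset in blob storage.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"cat,1\nfox,0\n")
    tmp_path = f.name

# Hypothetical metadata record checked in next to the model artifact:
# anyone retraining can verify they are using byte-identical data.
record = {"model_version": "v12", "dataset_sha256": dataset_fingerprint(tmp_path)}
os.remove(tmp_path)
print(json.dumps(record))
```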

These are just some of the challenges, I’m actually not diving into all them and I know that a lot of speakers today actually talked about other things such as parameters and stuff like that. If we look at big companies, like Uber or Microsoft or Facebook, they actually use a lot of ML in production.

This translation is being run for me by a machine-learning translator. It’s being updated frequently and [inaudible 00:14:56] PowerPoint, this is being shipped all the time. What we actually did for this is we built this huge custom process and we implemented a lot of things. We automated the deployment and potentially we have different people in different departments doing the same thing over and over again. Microsoft has this big rule of eating our own dog food, and it works really well for us because we’re our own first testers. This is how stuff starts, a lot of people working on this to ensure that that’s being automated.

How Do We Iterate?

Most people I know don’t work for a big company with thousands of engineers that can invest a lot of time into automating their deployments. We can talk about how we can iterate in a homegrown fashion. I’m going to talk about a little bit more about what the process actually looks like. We want to train the model first. Then again, there’s different challenges in training the model, there’s different aspects that we need to look at. Then we want to package it, validate that it actually works, probably validate that it works better than the model we already have, then we want to deploy it and we want to monitor it. Models also drift over time, and so, even if my model is highly-accurate today, it might not be accurate tomorrow because the world changed or my data has changed, so I need to actually update it. This whole process, ideally, I want to automate it.
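The drift-monitoring step above can be sketched very crudely: compare a live feature distribution against the training distribution and flag retraining when the shift is large. Real systems use proper statistical tests; this stand-in just measures the shift of the mean in training standard deviations.

```python
import statistics

def drift_score(train_values, live_values):
    # How far the live mean has drifted from the training mean, measured in
    # training standard deviations. A crude stand-in for real drift detection
    # (e.g. Kolmogorov-Smirnov tests or population stability index).
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values) or 1.0
    return abs(statistics.mean(live_values) - mu) / sigma

train = [10, 11, 9, 10, 12, 10, 11]    # feature values the model trained on
today = [10, 11, 10, 9, 11]            # similar world: low drift
next_month = [16, 17, 15, 18, 16]      # the world changed: high drift

print(drift_score(train, today) < 1.0)       # True
print(drift_score(train, next_month) > 3.0)  # True: time to retrain
```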

Data scientists and DevOps professionals have some different concerns, but they also have some of the same concerns. Everybody actually cares about iteration. Everybody wants to get there as soon as possible. Everybody cares about versioning, and versioning, again, is quite a bit harder than with code. Everybody cares about reuse. Then, we have some different concerns, such as compliance, observability, and such things.

If we talk and compare the high-level process that happens, if we do the application development, then we would just check our code into source control. In this case, we want to actually save our code and our data set and also some metadata as well.

Then, when we talk about building the application, we want to automate the pipeline. We ideally want to automate hyperparameter tuning and have nice things in the training process itself. Then, we want to test or validate the model, so we want to somehow estimate that its accuracy is good, and maybe some other parameters about the model and how valid it is. We want to deploy it into production. Then again, we talked about monitoring and the fact that we want to analyze the performance and potentially retrain the model over time.

How could you potentially do it yourself? Let’s say you could use the tools that you already have. If you’re not Google or Microsoft, you probably have something that’s on this slide already in-house. You probably have GitHub, or Bitbucket, or something, you probably have Jenkins, or Azure DevOps, or something. You probably already have pieces of this, so, for this, for my specific demo, I’m going to be using GitHub for source control. I’m going to be using Azure DevOps, surprise, for automation. I’m going to be using both Kubeflow and Azure ML workspaces to do the model training process.

Kubeflow, basically, is an open-source project that is based on Kubernetes. Anywhere you could deploy Kubernetes, you could deploy Kubeflow. It focuses on reusable workflow templates that run on containers. You can develop complicated workflows for your ML training, and they will run on containers. Whenever you check in one of these steps, you actually create a new container, package it, push it into Kubeflow, and then run the pipeline using these containers. Essentially, you just need to version the container, and then you can run the steps.
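A container-per-step workflow like that might look roughly like the following, assuming the Kubeflow Pipelines v1 Python DSL (`kfp.dsl`); the registry and image names are hypothetical placeholders, not the talk's actual pipeline.

```python
# Declarative sketch of a three-step Kubeflow pipeline, assuming the
# kfp v1 DSL. Each step is a versioned container pushed by CI.
import kfp.dsl as dsl

@dsl.pipeline(
    name="taco-vs-burrito",
    description="Pre-process images, train a classifier, register the model.")
def taco_pipeline():
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="myregistry.azurecr.io/preprocess:latest")  # placeholder image
    train = dsl.ContainerOp(
        name="train",
        image="myregistry.azurecr.io/train:latest").after(preprocess)
    register = dsl.ContainerOp(
        name="register",
        image="myregistry.azurecr.io/register:latest").after(train)
```

The `.after()` calls encode the step ordering; Kubeflow runs each step as a pod and, as mentioned later in the talk, can skip steps whose cached results already exist.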

Then, Azure ML is an Azure-based service, so this one’s not open source but it’s this build versus buy thing. It actually does the same thing, so it aims to be all the things that you need for automating your ML pipelines. I could’ve done this particular demo with just Kubeflow or just Azure ML because they both can do the same things. I’m doing both just because I wanted to play with both and see what it looks like primarily. Azure ML allows you to prep the data, train, deploy, version the model, and actually check in datasets, and work with notebooks right there. Again, Kubeflow allows most of the same things as well.


I’m going to do the demo now, and we’re going to hope that everything is working. I did check in code, and that kicked off the pipeline. I’m right now in Azure DevOps. Azure DevOps is an automation server. It’s a SaaS product that allows you to manage all of the steps in your application lifecycle, from project management into CI/CD and all of the things. It actually has a free tier, like a forever-free tier, so you could use it without paying us any money. It also has a free tier for open-source projects, so it’s fairly easy to use and enjoy.

What I’m looking at here is a YAML pipeline that is the build pipeline for this. I’m going to show you the code that it all starts with. In my case, I’m actually not even doing notebooks right here, I’m just doing Python scripts, so I have Python scripts for three different steps: pre-process, train, and then deploy. Then, I have a file for the pipeline, a Kubeflow pipeline that just defines the Kubeflow steps.

Whenever I trigger my code – again, there are ways to optimize this; you probably don’t want to trigger your model retraining because of a change to a readme, but for the sake of this example, this works. What I’m doing is, I have the Dockerfiles for my containers, and they’re packaging the scripts. Then, I am checking these containers into an Azure container registry. Then, after that, my last step in here is just kicking off the Kubeflow pipeline. I am, surprise, running on Azure, because I do work for Microsoft and they give me lots of Azure that I can play with.
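That build stage could be sketched as an `azure-pipelines.yml` fragment along these lines; the service connection, repository names, and script path are assumptions for illustration, not the exact pipeline from the demo.

```yaml
# Hypothetical azure-pipelines.yml sketch: build a step container, push it
# to Azure Container Registry, then kick off the Kubeflow pipeline run.
trigger:
  - master

steps:
  - task: Docker@2
    displayName: Build and push the preprocess image
    inputs:
      command: buildAndPush
      containerRegistry: myAcrServiceConnection   # assumed service connection
      repository: preprocess
      dockerfile: preprocess/Dockerfile
      tags: $(Build.BuildId)

  - script: python scripts/run_pipeline.py --tag $(Build.BuildId)
    displayName: Trigger the Kubeflow pipeline
```

Tagging images with `$(Build.BuildId)` is one way to get the container versioning the talk mentions for free from the CI system.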

I have a couple of things in here, a couple of resources. Again, another challenge that happens with machine learning is that you need a lot of compute to run it on, so there’s only so much you can get away with for free. Some of this stuff can be run for free, so if you just wanted to deploy it for the sake of the example, you could do it for free with Azure trial account. I have a Kubernetes cluster that’s running Kubeflow, and then, I have a Kubernetes cluster that’s actually going to run my model in the end of it. Then, I have the ML workspace over here that I’m actually checking my models into and all of these things.

Then, the last thing I wanted to show you is that GitHub repository. It actually has all of the code, so it has the code for the models, it has the scripts, it has the Kubeflow pipeline, and it has the Azure DevOps pipelines, so you can actually come here and take all of this and build it all together. It has the steps together for deployment on Azure, but you could deploy this pretty much anywhere you had a Kubernetes cluster. In Azure ML, you can only deploy in Azure, but it’s a SaaS product again so you can just use it, you don’t have to do much about it. You can go to this link – I have it in resource slides – and, if you have a couple days, go and build this yourself.

I did run the pipeline, and it did the build-and-deploy steps and triggered my Kubeflow pipeline. I do have a Kubeflow pipeline and it’s very simple. Again, in this case, I’m not doing anything complicated. The problem with complicated is twofold. One is, I have to understand the workflow, and two is, it has to run sufficiently fast for me to be able to demo it, and that doesn’t happen. I wanted to show some cool things – GitHub, open source, has a bunch of data on code, so you can learn predictive coding and code analytics and stuff like that – but it literally takes days to run. I don’t think I want to attempt that on stage.

This is a very simple Kubeflow pipeline. What it tries to do is tell pictures of tacos from pictures of burritos. It’s a very important problem that we all need to solve. It does the pre-processing, the training, and the registering. The register step just goes into Azure ML and registers the pipeline, and then it registers the model. Then, I can deploy it as a service using Azure ML. Again, I could potentially do this with Kubeflow as well.

Then, the other interesting thing is that, if we already did one of the steps – pre-processing, for instance, takes a really long time, so we processed all of the data and all of that stuff, so if we already did this process and already learned from it, we don’t have to do it again. Kubeflow will just skip that step because it’s already run. That allows us to optimize our steps.

Then we come into here, which is the last step of this. It has a release pipeline, and all it does is run some pretty straightforward scripts. Actually, let me show you something else. I’m going to show you the test process instead; it’s more informative. I’m using Azure ML, which allows me to just run some Azure commands and say, “Go deploy this model.” In this case, I’m deploying this model into Azure Container Instances, which is my test environment. Then, I’m just throwing some images at it and I’m saying, “Is this a burrito image?” and it actually tells me that it is a burrito image, but it’s really not sure that it is. It’s very low in confidence. In this case, I’m not even actually doing anything with this information, so I’m not actually looking at the accuracy, but I could go and start looking at the parameters of the model and try to estimate if it’s good before I’m putting it in production.
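That "estimate if it's good before production" idea can be sketched as a simple promotion gate: score the candidate model on a held-out set and promote only if it beats the model currently in production. The models below are stand-in functions, not real classifiers, and the held-out samples are invented.

```python
def production_model(image_name):
    # Stand-in for the currently deployed model: always guesses burrito,
    # with low confidence, like the demo's uncertain prediction.
    return ("burrito", 0.55)

def candidate_model(image_name):
    # Stand-in for the newly trained model; "classifies" by file name
    # purely so this sketch is runnable.
    label = "taco" if "taco" in image_name else "burrito"
    return (label, 0.9)

# Hypothetical held-out test set: (image, ground-truth label) pairs.
held_out = [("burrito_1", "burrito"), ("taco_1", "taco"), ("burrito_2", "burrito")]

def accuracy(model, samples):
    return sum(model(x)[0] == y for x, y in samples) / len(samples)

# The gate: only promote if the candidate beats what is already deployed.
promote = accuracy(candidate_model, held_out) > accuracy(production_model, held_out)
print(promote)  # True: the candidate wins on the held-out set
```

In a real pipeline this check would run against the test deployment in ACI before the human approval step that follows.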

Then, the other thing that I’m doing after this is deploying it to the “production environment”. This one actually has an approval check. I’m going to approve it. This is actually true of pretty much every production deployment that I see out there; most people don’t deploy to production without having some human look at the process, so it’s not only for ML models. Definitely for ML models, I would say that, at least from what I’ve seen, there’s not a lot of automated testing that you could just rely on without having human validation that this stuff actually looks good before it goes to production.

Then, this pipeline could be further improved next time I have free time. We could actually look at accuracy, we could look at model drift, and we could go back into retraining and stuff like that. It took me, let’s say, a week to build this, and I did have examples from other people, so this takes time. It took me a week to build this, but now that I want to put a new model into production, I can do this in the span of an hour, instead of 11 months, and that’s a huge improvement. It probably won’t cost you too much money to do this, and it will definitely save you a lot of hours, compared to trying to do it all on your own.

The other thing I don’t know if I mentioned is, the tools such as Kubeflow and Azure ML allow you to take models that come in different formats, so you might be using TensorFlow or something else, and take these models and just deploy them as a service. You could also do it on your own, you have to write some code to be able to do it but it’s also possible. Again, I like tools that do stuff for me, so…

Even a Simple CI/CD Pipeline Is Better Than None

If you take only one thing away from this talk, it’s this one: even a simple CI/CD pipeline is better than no pipeline. I know that if you’re working with a team of data scientists that are basically just used to writing code on their keyboards and doing this stuff manually, it’s going to be a change of mindset to get them to do the pipelines, but it is definitely worth it in the long run. Also, don’t try to ask them to deploy Kubernetes, I don’t think that’s going to go very well.

Change is the only constant in life and that’s why I’ve been organizing DevOpsDays for six years, because I believe that automation actually helps us eliminate issues.

AI Ethics

I want to finish on a particular note, and that is AI ethics. One thing that I learned when I was learning more about this stuff is that bias is actually a property of information, it’s not a property of humans. If you’re building algorithms and you’ve given them a particular set of data, they’re going to learn about the state of the world from that data. There are already some examples in the industry of algorithms being biased against certain subsets of the population.

One is the racial example. These algorithms are actually being used by judges and by the police to identify if you’re likely to commit a crime or something like that. That definitely has some racial biases in there. The other example was, the ads that ran for CEO jobs were displayed only to males, because guess what, the data set suggested that only males can be CEOs. This is complicated stuff and this is something we must think about when we develop AI because these algorithms are actually deciding your next mortgage and they’re deciding where your kid goes to school and credit score and stuff like that. You just definitely want to think about this when you’re working on ML. So, build AI responsibly.

Questions and Answers

Participant 1: I just had a question about your pipeline, I saw that you ran the model on ACI, Azure Container Instances. Does that support GPUs?

Rosenbaum: I don’t remember if ACI does, but AKS definitely does. My production instance does use GPUs. I actually would have to go look if ACI does as well.

Participant 1: We have a similar pipeline that we’ve implemented, not using the Azure DevOps but using ACI. We have to keep a VM up at all times in order to run our model.

Rosenbaum: Yes. You can definitely go in through the AKS stuff, that definitely supports GPUs.

Participant 2: How do you monitor your deployment and when do you decide that you need to redeploy?

Rosenbaum: I think this is more of a data-science question. Monitoring the deployment in terms of whether it returns 200 OK is easy to do, but knowing that you have model drift is probably harder. I’m probably not the best person to answer that question.

Participant 2: But you probably have such a mechanism in place.

Rosenbaum: Yes. Internally, at Microsoft, yes, and I would have to find out what it is.

Participant 3: Do you have any advice for versioning the data once the data is too big to put in Git, which is usually pretty soon?

Rosenbaum: In terms of Azure ML, you can actually commit the data set to Azure ML and it will have a version attached to it. It also tags the models with versions. For Kubeflow, you have to find some type of storage. I wouldn’t put it in Git, it’s always too big to put in Git. I would put it probably somewhere in storage that is relatively cheap. Then, you do have to solve the problem of how you version it.

Participant 4: One of our data scientists was thinking about iterating on their CNN using different branches on Git and doing multiple deploys. Have you had experience with that? Do you have any recommendations around how do you experiment and iterate on that very quickly?

Rosenbaum: What we do at Microsoft is we control the flow through the pipeline, not through branches. I see a lot of people out there doing branches because it’s easier. You attach a branch to an environment, and it allows you to isolate stuff. I would say it’s still a valid way to do this, and it might make it easier for you to have multiple versions at the same time. The other way is through the pipeline, so the pipeline knows which environment you’re deploying to, and that means that you’re putting the same code through the whole process. Whatever was in your dev environment and got looked at is the same thing that ends up in your production.

Participant 5: What if your data set is too big to be checked in to some version control?

Rosenbaum: I think the gentleman asked that question. I definitely wouldn’t check my data set into source control. Again, either you put it in storage, so some Blob storage somewhere that is cheap, because these data sets can be very large, but then, if you do this in storage, you have to find a way to version it and know that this data set is what trained that model so you know if you’re changing something, that it propagates all the way.

Participant 6: Spinning off the versioning and how it relates to software CI/CD, have you had an incident where the checks for the model weren’t sufficient so you actually released bad data? What is the equivalent of a rollback and as far as purging the bad data, does that take a while?

Rosenbaum: The equivalent of a rollback, I would say, is to just go to the previous version. In the case of Azure ML, you can actually go and deploy any version, so that makes it easier for you. I actually don’t know, in terms of Kubeflow, how easy it would be to roll back. You might actually need to go and retrain the model, which does take a lot of time. Then again, the challenge is that you have to know that the data was the same data. I see the same thing with code: people roll forward rather than roll back a lot.

For the second question, if you have the way to have a tagged version model, then you can deploy it fairly quickly. Some of these models can be trained for hours or days, and so, this can take a while. Versioning is important.

Participant 7: We had questions about the data sets, my question is about the models. How do you keep them version-controlled when they become too big? We have three things, the code itself, the models, and data sets. Can you elaborate a little bit more on the version control and the fact that they get too big to be checked into the GitHub?

Rosenbaum: In Azure ML, the model versioning is just part of the thing. This is actually one of the cool things I think. I don’t actually have to put it in version control because I have this hub and it keeps my versions for me and I could actually click on this and deploy it into production environment. If you’re not doing that, then you have to figure out a way. You could put it in storage, but you have to automate the process of keeping them there and knowing which version it was.

Participant 8: Can you repeat the first step that you have there, which is profile the model, what does it do?

Rosenbaum: Profiling the model. This is specifically a command that I can run on Azure ML, and what it does is identify the best compute cluster to run the model on. Essentially, it’s going to tell me what CPU or GPU I need and what memory. I’m actually faking this because it takes like an hour to run. If you are running actual production models, it can be very useful; you actually know the best compute cluster to run them on.


Presentation: Incident Management in the Age of DevOps & SRE



Edwards: My name is Damon Edwards. This talk is about incident management in the age of DevOps and SRE and I’ll talk about why that’s important. You can follow me on Twitter @damonedwards. I’ll post the slides there but also they should already be up. If not, they’ll be up shortly, rundeck.com/qconsf-2019 so you don’t have to take pictures of things. The slides will be there. Plus, obviously, QCon does a great job of recording and distributing them.

My first assertion is, number one, the ability to respond and resolve incidents is the true indicator of an organization’s operational capability. How good are we at operating? How many folks here work for a packaged software company or somebody that doesn’t have operations? Anybody? The rest of you all are all in a business that makes its money by operating software. The running service is our business. Everything else we’re doing is to support that so how good we are at operations is fundamentally how good we are at our business.

Number two, my second assertion here is that I think everybody now works in operations. If you think about the fact that the running service is the point of our business, everything we do, everything that starts in development goes into production and then boomerangs right back at you when there’s a problem; we’re all part of that operational chain. Also, looking forward down the path, you see where the future is going, this heavy bias towards the “You build it, you run it” teams, which also brings us all right into operations. I think the days when we could say, “Oh, operations, that’s somebody over there,” are long gone. Now, operations is something that we all have a critical stake in.

What Is an Incident?

We’re talking about incident management, so we should probably have a definition. What’s an incident? I look at an incident as an unplanned disruption impacting customers or business operations. The first part is thinking all the way back to the ITSM roots, the idea of an incident that’s disrupting a customer; that’s things like outages and service degradation. If you think about business operations too, things like work interruptions, delay, waiting, short-notice requests – a euphemism for somebody forgot to tell you until it was urgent – all of those things are the death by a thousand cuts that pile up. It slows your delivery, it slows your response, it results in poor decisions. All those things can slow our internal business operations, which in the end affects the customer. Think about it holistically: delay affects our customers.

If you think about an incident, we can’t separate out one of these from the other because we’re the same people behind the scenes that need to deal with both. When I talk about an incident, I’m talking about something traditional like an outage or service degradation or these short-notice interruptions that happen all across our business.

The format for this presentation is, I’m going to walk through the lifecycle of an incident and what we see high-performing organizations doing in these different areas. It’s like a tour, or like a survey course of where things are at today and where they’re going. Along the way, I’m going to mention a whole bunch of people who I’ve looked to, who have provided guidance and a lot of great insight on these different areas, so they’ll be making an appearance at some point or another.

First of all, before we talk about that wheel, the wheel of an incident, I want to talk about the environment that we’re all cooking in right now. There is a lot of context that goes into why people are doing things around incidents the way they’re doing them and I think it’s the flashpoint for a lot of really interesting side conversations that have been creeping into our industry for more than a decade or so.

Digital Transformation

I’m going to start with digital transformation. Don’t groan, don’t leave. It’ll be quick, I promise. A lot of this stems from decisions at the boardroom level. I hear a lot of people have different definitions of digital transformation, probably more so than even have definitions of DevOps or Agile. If you distill it down and look at the communication coming from the board level down to technology organizations, what are they really after?

The first one is they want everything integrated. Gone are the days where the customer service agent would have multiple screens and multiple windows on a screen and you could look at one, look at the other, and talk to the customer on the phone and stitch together and see what’s going on. Or, business line A and business line B would live in their own silo and they would only cross at the balance sheet level. Now they want everything integrated with everything at all times which allows them to do new business ideas, combine things to extract more value out of the systems that we already have.

The next one is responsive. This is not responsive like web page responsive. This is more responsive like they want the business to respond or the organization to respond to the market, to respond to customers, to respond to their failure demand or value demand. They want everything to feel a lot more responsive than this long lag time that they’ve been accustomed to where things take months, if not quarters, if not years, to come out the other end. They want to see it quick. They want to see it fast.

Then they want all that everywhere. Whether it’s your desktop, or it’s a phone, or it’s an Alexa, they want all this capability to be available to the consumer, and then on top of that, always. It’s got to be there all the time. The idea of change windows and taking maintenance downtime in 2019 is no longer really acceptable. All this is cascading down to the technology organization. This is the uber driving force that’s pushing us into making these new decisions.

In fact, the first person here, Cornelia Davis from Pivotal wrote this great book called “Cloud Native Patterns.” In it she really describes what the technology feeling is, the signal receiving in the technology organization about the demands that are coming from these digital transformations. That’s good to keep in mind. It’s the force number one.

Cloud Native & Microservices

What that’s driving us towards is these ideas of cloud native and microservices. First of all, there’s been this explosion of technologies because our old infrastructure wasn’t good enough. We need things to be ephemeral, to move faster, to go with that new highly integrated, always-on, highly responsive world. A good friend, John Willis, and Kelsey Hightower are two people who have really helped me keep clear thinking on what’s going on in this world and are very much worth following.

If you go on top of that, what all of these technologies have created has really ushered in this new era of microservices. Being developers, I don’t have to tell you that. I’m sure you’ve all seen these Death Star diagrams. They’re pretty cool. They list all the microservices around the wheel and then visualize all the interconnectivity between them.

Our world has gone from complicated, where we could generally divide and conquer and keep things to ourselves. In the old world, we could have things segmented: the team for app A takes care of app A, the team for business line B takes care of business line B. Now everything is highly integrated, which dramatically impacts how we go after and try to respond to these incidents.

Somebody who spoke earlier today has, I think, been a great leader in our industry around explaining the power of these microservices and why people are driving so hard into this. Even going back to DockerCon in 2014, Adrian Cockcroft talked a lot about how architecture enables speed and speed enables business advantage. He really talks about this desire to decouple. In the past – and I know even that microservices Death Star starts to look like a monolith – the idea was, “We build things in a central way,” and that was the most effective. We’re realizing that we’re really slowing down the business. We can’t achieve that digital transformation dream if everything is so tightly coupled. Now we’re taking those cloud-native technologies and microservices and we’re splitting things up into these different value streams. We’re trying to decouple the organization so people can move faster. They don’t have to constantly be tied to each other.

Now we’re talking about decoupling and fragmentation. Everything has gotten a lot more complicated, and we’re purposely trying to fragment how we do the work so that we can move faster. Now, if you’re on the incident side of that problem, things become a lot more difficult.

DevOps & SRE

Moving over, now that we’re talking about these new technologies, driven by the digital transformation, we’re talking about decoupling these people so they can move faster. What’s next? Now come these DevOps and SRE ideas: how do we get people to work in a high-velocity way? We’ve got high-velocity systems, in theory, but if we can’t unlock them by changing how the people work together, then what good are they? We just have expensively hosted legacy systems 2.0.

If you want to talk about the DevOps side, I’ll use the trends that are entering in. Gene Kim is a good example. I think he’s like the raconteur of the DevOps movement, doing a great job documenting it. Do you know the book “The Phoenix Project”? He talked about the three ways: the first way focusing on flow, then looking for these feedback loops, and then the continuous improvement and learning along with that. It’s really about feedback. There’s a new book coming out in case you haven’t heard, “The Unicorn Project,” which tells the same story as “The Phoenix Project” from a different angle: not from the leaders, but from the folks in the trenches. A great book, and it’s got these five ideals: locality and simplicity; focus, flow, and joy; improvement of daily work; psychological safety; and customer focus.

The reality is, people took this advice and they really focused on the delivery side of things. It’s, “Ok, we’re in dev and it’s all about this go-go-go,” and then what? We deploy. We deploy 10 times a day, but people aren’t talking about ops. How do we operate this thing? Deployment is not the finish line. Deployment is the first step in the rest of your life. It’s like getting married: they say people should focus not on the wedding day so much but on what the rest of your life is going to be like. That’s where operations comes in, and historically, we’ve largely ignored that. It’s, “That’s another problem.” We planted the flag, we delivered this project. It’s like in the movie business: “We’ll fix it in post.”

I actually gave a talk about this last year at the DevOps Enterprise Summit called “The Last Mile.” It was all about showing how unless you can change how you operate, all the blowback and mess that comes back on the rest of the organization stops you, prevents you from really realizing those DevOps dreams. This means that we’re trying to focus on that flow and these feedback loops. We have to pull operations closer to us.

I think in this next wave, an interesting thing to arise is the notion of SRE. It really started from Google. They’re the ones that wrote the first books. Ben Treynor was one of the engineering managers that said, “How do we run operations not like a classic operations organization, but using the same principles if we run an engineering organization and do it in an integrated fashion?”

From that, we’ve got these principles. What’s interesting about the principles of SRE – the first one is “SREs need a service-level objective with consequences.” It’s the idea that it’s not just an SLA that you have to adhere to; we’ve got this idea of a service-level objective and it’s got consequences. This means that the business, development, and operations have gotten together and said, “This SLO is what matters to our business, and if we blow that SLO, if we blow through our error budget,” the term they use, “then we have to swarm to that and all other work comes after that.”

The idea that you have the power to tell a development organization or to tell a product organization, “We’re not shipping new features until we fix this service,” or, “We’re not going to do these extra things. We’re going to invest in getting this service-level objective back above this level,” and it has real teeth. I know a lot of big enterprises that say it’s just crazy talk to go to the business side and say, “No, we’re not shipping this because we blew this agreed-upon SLO.” The pushback is extreme to say the least.
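To make the “real teeth” concrete, the arithmetic behind an error budget is simple enough to sketch in a few lines. The function names and numbers here are illustrative, not from the talk or any specific SRE tooling:

```python
# Sketch of error-budget math for an availability SLO.
# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.

def error_budget_minutes(slo: float, window_days: int) -> float:
    """Total allowed downtime in the window, in minutes."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative means blown)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

budget = error_budget_minutes(0.999, 30)       # ~43.2 minutes
remaining = budget_remaining(0.999, 30, 50.0)  # negative: budget blown
freeze_features = remaining < 0                # the "consequences" part
```

The last line is the whole point of the principle: when the budget goes negative, the agreed-upon consequence (stopping feature work to invest in reliability) kicks in automatically rather than being renegotiated every time.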

That plus this: “SREs have time to make tomorrow better than today.” This is where the idea of toil comes in: work that is repeated, could have been automated, and is not adding enduring value to the business. That’s called toil, and we want to get as much of it out of the process as possible, because we’re not using our human capital to its fullest potential. We’ve got all these smart people and they’re buried in repetitive work that the machines should be doing. How do we get them out of that work so they can spend their time doing engineering work, moving the business forward? Again, it’s that idea of pushback.

Third one there is “SRE teams have the ability to regulate their workload.” They can say no to things. They can say no to releases. They can say no to being overloaded. In a way that’s like a bill of rights, and the idea is really about providing feedback. It’s providing a backpressure. Think of a physical system. It’s providing backpressure to that go-go-go delivery and it should make a self-regulating loop.

Why is this so interesting? If you think about, especially in large enterprises, how incident management traditionally happened in the past, we had this NOC, some level-one teams. They were always the aggrieved party. It was, “All the tickets go to them. Let’s have them run around and try to fix them. They’re probably going to escalate it back to you, but hopefully, they’re going to take the brunt of the load.” This new model says, “That’s not good enough. We’re not making the most out of our people. How do we set it up so there are these tight feedback loops, and when things get bad, the pressure comes back on the rest of the organization – not in a really negative way, but as a way to provide that feedback loop, again, like Gene Kim’s three ways, to help us regulate that workload?”

Very interesting stuff: this is one of the biggest changes in operations since ITIL – the first edition of the ITIL books came out in 1989, I think. Developers have all had Agile seeping into their brains for the last 20 years. There have been books and conference speeches, it’s in the tools, and whether or not you were doing Agile, the ideas were still out there. But folks in the operations world were living in a much more traditional, different world, which we’ll talk about in a second. Folks like Stephen Thorne, Tom Limoncelli, Niall Murphy, Liz Fong-Jones – excellent writers and practitioners – have done a great job of really breaking down what this is all about to its essence.

Now we’re changing how we’re organizing our people. What does it start to look like? If you think about it today, DevOps and SRE, you start to add these things up. This is just a selection of some of the things that we’re talking about: thinking about products, not projects; continuous delivery; shifting left – solving operational problems early in the lifecycle, when it’s in development land, not waiting until it’s trying to get into production; getting to production as quickly as possible; working in small batches; error budgets; toil limits; all this cloud-native infrastructure. If you think about it, what’s actually happening is we’re starting to build these self-regulating systems. We’re starting to align our organization on these horizontal value streams, from an idea all the way to where we’re making money from a service – really thinking about how we build this self-regulating horizontal system. This works across all the different models.

We see the top one, the Amazon model; the second one, the Netflix model; and the third one, the Google model. If you’re from those companies, don’t yell at me; it’s a big-picture view. Some people drive towards cross-functional teams, and those are ways of building self-regulating systems: you pack all that into the same team and that’s how you get that continuous feedback. The bottom one still has a Dev and Ops divide. Think about Google: they wrote a whole book on operations, they have a separate organization called SRE, but they’ve created all these tools to recreate the shared-responsibility model that you would get if you put everybody on the same team. You see the world going towards these value-aligned value streams – meaning, what’s all the activity we have to do to deliver this point of transaction to the customer – and it’s self-regulating. In fact, it was Jon Hall at BMC who was the first person to really point out to me, “You know what’s going on here: you put them together, it’s a self-regulating system.”

Now, let’s compare that to how the world used to be. In fact, when you talk to a lot of your counterparts, especially in large organizations, this was the view of the world that their whole organization was built on. Yes, there’s still this flow of work that everyone thinks about, but that flow of work has to go through all of these processes. It’s a very vertically aligned way to think about the organization: “We’re going to put the firewall team with the firewall team, the database folks with the database folks, the Windows admins, the Linux admins,” and so on and so forth.

That was the key way to organize an organization: each process would always have a process owner, and there were going to be inputs, outputs, triggers, metrics. In fact, this was the ITIL way of doing things. ITIL has their 26 – I think now it’s up to 34 – distinct processes. Now they call them practices, but they’re breaking things down like service catalog management, incident management, change management – I’m just jumping around here – service validation and testing, all these processes. When that goes into a big enterprise and you put a process owner on each one and give it all those criteria, well, what happens? You start to get people who do exactly what they’re set up to do. They’re going to say, “This is my task. I’m going to manage this process. I’m going to be the best firewall-rule-changing team west of the Mississippi, and that’s all we’re really going to care about.”

What ends up happening – I’ll get to the change part in a second – is you get this unintentional encouragement of silos, where people start to focus inwardly on doing their part of the task, and what starts getting broken up is that flow of work. That’s where we get ticket hell, all the ticket systems. How many folks here spend a large amount of time with a request waiting in a ticket queue somewhere? This is where that comes from. Now, anything that has to be thought about holistically, like a system, has to flow through these hoops. On top of that, there’s this idea of a change advisory or change control: some external, centralized source out there that is going to tell you if something is safe. I know if you talk to the ITIL folks, the high priests of ITIL, they’ll say, “It’s just there for advisory. It’s just there for coordination and communication,” but in most organizations, it ends up being the change authority: someone else is going to tell you whether or not you have the ability to change. Whether it’s unintentional or not, that really encourages very much a command-and-control mindset.

If you think about that exploding world we’re entering, with the explosion of the microservices and people decoupling and moving along, this idea that we’re going to externally inspect and determine quality and safety is really starting to be quite far-fetched. Those who study Deming, lean, the [inaudible 00:20:35] system – it’s one of his key 14 points, number three if I’m not mistaken: you can’t achieve quality by external inspection. You have to build quality into the system.

Now, you put this together and you see what’s really happening: this DevOps and SRE world of self-regulating, value-aligned systems is really starting to replace the old traditional ITSM way of thinking. Fundamentally, because of that different orientation – one is horizontally value-aligned, the other is vertically function-aligned, with this idea of a central or external change and quality advisory – they’re really quite at odds with each other. In fact, it’s an oil-and-water type thing that I think needs to be resolved at some point. Be aware that this tension exists and it’s going to exist for quite a while.

I don’t think they’ll agree with everything I say, but two folks in this space – Charlie Betz from Forrester, one of the few analysts I actually like, and Rob England, often known as the IT Skeptic – are excellent thinkers and writers documenting this contention, this transformation that’s happening. Keep this in mind; it might explain a lot of your relationships with your operations counterparts.

Acknowledge Complexity

With all that going on, I think it’s time to realize that with these super complicated microservices, plus everything happening on the people-management side, we’ve really drifted from complicated systems – meaning we can predict them, there’s some determinism in what we do – to truly complex systems.

Paul Reed is a guy who writes a lot about this and talks a lot about networks and Netflix: that our world is really complex, it’s not deterministic. On the development side, it’s easy to think, “NGINX compiles or it doesn’t,” that there is a determinism to this thing we’re doing on the software side. But if you think about the larger distributed system that we’re building and the things that we don’t control – the cloud infrastructure and its impact, the traffic from humans, the unusual behavior – it all piles together, meaning that we are living in a complex world. We’re operating complex systems.

What do we know about complex systems? If there are any physicists in the crowd, don’t get mad at me, but I distill it down as: we can never have perfect information about them. Just because we know how NGINX responds to a request, we’re never going to be able to determine, in combination with all the other pieces, exactly how the system works. We can’t break a complex system down into sub-parts and say, “I know how this works.”

We can look at an engine of a car: I can know how all the parts work and I can model out how that engine is going to behave. But go outside in San Francisco, and there’s no way you can model out and break down the traffic. You can’t look at Drumm Street over here and Market Street and use that to figure out how the rest of the traffic in the city flows. We’re never really working with perfect information. We can never really control or predict what’s going on.

One of the biggest eye-openers that I went through is Richard Cook, who wrote this paper back in, I think, the early ’90s. It’s like five or six pages, and it’s just, point by point, how complex systems fail. If you read that paper and don’t think that all those things are currently at work within your system, then you have another thing coming.

Another way to put it is – this is Charity Majors, called the queen of observability now – “Distributed systems have an infinite list of almost impossible failure scenarios.” This proves out when you talk to folks in organizations that have invested a lot in resilience engineering and trying to root out failure from their systems: I’ll just say that it gets weirder and weirder how improbable the next problem is. It’s the magic bullet theory times a hundred that seems to cause problems time and time again. No matter how good you get at trying to root out failure, we just get weirder and weirder edge cases, which is a strong indicator that we’re working in a complex system.

If we’re trying to work towards how are we going to be managing incidents, how are we going to respond to and resolve failure, we have to keep in mind we’re fundamentally working in a complex system, not just a complicated deterministic system.

Safety Science & Resilience Engineering

Now that we know that, how are we going to actually think about these things? I think we’re starting to learn a lot from the safety science and resilience engineering domains. There’s these experts like Sidney Dekker, Richard Cook, David Woods, who are famous in the real world. They work with airplane disasters, healthcare deaths, nuclear power plant controls, high-consequence domains, and really look at how do these systems work and what can we do to try to mitigate the failure and disaster that comes out of them. They’re starting to bring this into the operations world.

Why is that? This gentleman here, John Allspaw, used to be the CTO at Etsy; before that, he was head of operations at Flickr, and one of the people that kicked off the DevOps movement by giving a conference talk with his development counterpart about how they did 10 deploys a day. This was in 2009. People were throwing up in the aisles, it was so sacrilegious. Then he went a little off the deep end: with Dekker and Woods and Cook, he went off to Sweden and got a master’s degree in systems safety, studying how people respond to and avoid failure and disaster in highly complex, high-consequence environments.

He talked about why this is important. Why do I have to know this stuff? He said, “If you think about it, there’s this above-the-line, below-the-line metaphor.” The reality is, in our work, above the line is all the stuff we do: it’s the source code we see, it’s the processes we go through, it’s the buttons we click, it’s the things we say to people. Below the line is the real work: that’s our systems, and we can’t actually see it. We don’t see the zeroes and ones going by. We can’t see what’s actually going on under the hood. All we see is this thin abstraction layer and our mental representation of what’s going on under the covers.

The whole point of why the systems safety resilience engineering world is coming into play is because it’s all about the communication of the people. How do you keep those mental models in check? How do we make sure that we’re trying to connect the people better because we can’t really fix the underlying system? There’s things that you can do to it, but fundamentally, it’s the human interaction with it that causes the most issues, so an important idea of why this is coming into play.

Some other folks have picked up this banner and run forward with it. There’s a great event that happens every year here in San Francisco, called Redeploy. It brings together some of the world-class thinkers and academics in this space to talk about all of these things: how do you bring this to our world of operating online services? I know they hate slogans, so I made some bumper stickers. It’s ideas like “There is no root cause.” Root cause is just a political distinction: our desire to have a straight line so we can say, “It’s that person’s fault or it’s that system’s fault,” when in reality there are all these contributing factors, and most of them are the same things people do day in, day out to do their job, except that on this one day, the right confluence of factors causes a disaster. We can’t then go and blame that person and say, “It’s their fault,” when they were doing what probably had success 99 out of 100 times.

Likewise, there’s a big disdain for the five whys because it forces us into a very linear way of thinking: “It must be this cause.” That’s the problem. Instead, it’s trying to take a bigger, broader view of causation. There’s this whole idea of Safety-II. In the old world – Safety-I, as they call it – we only investigate failure. Failure happens, the NTSB shows up, all the investigators show up, the journalists show up, and we dig through it and try to find what the problem was. What we were not looking at is, “Why did it ever work in the first place?” It’s an interesting inversion of Murphy’s Law, which is not “What can go wrong will go wrong,” but “How does it ever work in the first place?”

Safety-II is about studying, “Why do things actually work?” If you think about humans and how they work, your colleagues around you, there are all kinds of little shortcuts – whether mental or physical – that they take in their day-to-day job: ways they do things, rituals, ways to talk about things. All of those things work day in and day out until some slight thing changes, and then the exact collection of actions that would normally cause success causes failure. If you don’t look at why things work, you’re going to have a hard time figuring out why things don’t work. At least, you’re going to be trying to do so with one hand behind your back.

There is this great idea that incidents are unplanned investments. It’s going to happen. How many of you get called into incidents when they happen like escalations? Your time is valuable. What is that? That’s an investment. Your company is now investing in this incident. What do you learn from it? The ROI is up to you.

Elevate the Human

Where this is all going is this notion that we have to elevate the human. As for this dream that AI is going to solve our problems: we are decades off from it being able to manage these complex systems. If we look at what’s going on in healthcare, nuclear power plants, aviation, all these domains, they’ve spent billions of dollars of highly rigorous academic pursuit trying to get the human being out of these processes, and they haven’t been able to do it. The fact that we think we’re suddenly going to achieve it on our end is, I think, a little bit foolish, versus looking at it as, “This automation, the things we’re building, is really there to elevate the humans” – more Iron Man than HAL from “2001.”

It’s about elevating the human, and I think you’ll also see a lot of folks coming around now to looking at the humanity of this. We love it because we go to conferences and get to have a good time, but if you think about the thousands and thousands, actually millions, of our colleagues around the world, life is not so rosy for them. It’s a little bit rough. You’re on the receiving end of the aggrieved party at all times. Burnout is very high.

I highly recommend, for your own well-being and just for the well-being of your colleagues, Dr. Christina Maslach. She’s from Berkeley, previously Stanford, one of the most famous people in the world on burnout, and she’s now turning her sights onto the IT industry because burnout is so high. Some of you will say, “What is this, are we running a country club here? We have to be nice to people?” But the reality is, what she’s really giving you is a formula for high performance. If you look at things like loss of control, the feeling of being overwhelmed, losing agency over your work – there’s a whole list of things that she’ll highlight – and flip them around to the inverse, we’re actually talking about how we get better performance out of our most expensive assets, which are each other.

Jayne Groll is also doing a lot of work with this, with trying to elevate the human being. I think on the operations side, we’re seeing a lot more of this humanizing of what’s going on, and the reason is there are 18 million IT operations professionals in the world – this comes from PagerDuty’s S-1 – and 22.3 million developers. That’s a lot of people out there that need our help.

This is the world that we’re marinating in. These are all the trends that are coming together as we now look at what’s going on with incident management. Finally, let’s look at the wheel. If you think about an incident, from time zero, there’s first the observe side: we’re trying to figure out what’s going on. Then there’s the react: we have to take action, whether it’s to diagnose something or to repair something. We’ll often loop between those two, and then eventually, we have to learn from what went on.


Those of you coming from the lean world will recognize this looks a lot like an OODA loop. John Boyd is one of the most famous military tacticians, at least probably one of the most famous American tacticians. He was a fighter pilot who came up with this methodology – this is actually his first drawing of it – which applies to any tactical application of strategy. There are a couple of phases: there’s observe – what’s going on; there’s orient – making sense of what’s going on; there’s decide – what am I going to do; and there’s act. He figured out that whoever can make those loops faster seems to always win at the objective. He started out talking about dogfighting in aircraft, but soon they found that this way of driving human performance really applies to all sorts of domains. I added it here because I want to think about this top part as an OODA loop. We’re going to observe, we’re going to orient – what’s going on here – we’re going to make a decision, and we have to go ahead and act.

The reason why I like this is it breaks down the different areas of what I see as incident management and then we can focus in on the different developments, what people are driving.


Let’s start with the observe side, with a couple of interesting things. Monitoring, obviously, you all know about. Monitoring is really about spotting the knowns. We’re always looking a bit in the rearview mirror with monitoring: we’re looking for conditions that happened in the past. That got us quite a bit down this road towards better resiliency, because if something happened in the past and it happens again, chances are we know there’s a problem here.

The new kid on the block has got a name that’s been around for decades, but people focus on this new idea of observability. If monitoring gave us that we’re able to spot the known, spot things that happened in the past, observability gives us the ability to interrogate the unknowns. How do we actually look at what’s going on now in an unknown situation and figure out whether this is good or bad and figure out what the problem is?

Then it really brings in a few different things. Number one is logging the event. We have to have a record that there was some event, something that happened, far better structured than unstructured. Then there are metrics, which are a data point over time. It’s like speed: are we going faster? Are we going slower? It doesn’t really tell us what’s going on in context. We lose all context of the event, but we can see whether that data point is going up or down or sideways over a certain period of time.

Then the third one is tracing. How do we take those events and put them together in the context of a single request? Look at what’s going on in the tooling world, Honeycomb, Zipkin, those sorts of things. It’s around building out this observability side of things. Again, Charity and Adrian Cole are great people to follow in this space that really have a keen eye for where this has to go.
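The three pillars described above can be sketched in a few lines. This is a minimal, hypothetical illustration (the names and record shapes are invented for this example, not from any particular tool): a structured log record per event, a context-free counter metric, and a trace id that ties the events of one request together.

```python
import json
import time
import uuid

RECORDS = []  # collected structured log events

def log_event(trace_id, span, message, **fields):
    """Logging pillar: one structured (JSON-friendly) record per event."""
    record = {"ts": time.time(), "trace_id": trace_id, "span": span,
              "message": message, **fields}
    RECORDS.append(record)
    print(json.dumps(record))

class Counter:
    """Metrics pillar: a bare data point over time, stripped of event context."""
    def __init__(self):
        self.value = 0
    def inc(self, n=1):
        self.value += n

requests_served = Counter()

def handle_request():
    # Tracing pillar: a single id ties every event in one request together.
    trace_id = str(uuid.uuid4())
    log_event(trace_id, "frontend", "request received")
    log_event(trace_id, "checkout", "charging card", amount_cents=1999)
    requests_served.inc()

handle_request()
```

The counter tells you that request volume moved; the log records tell you what happened; the shared trace id lets you reassemble one request's path after the fact.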

Also, there’s another one. Our buddy John Lewis is here if anyone wants to talk about this idea of automated governance. We’ve seen a collection of highly regulated industries getting together and saying, “Remember the speed we’re moving at.” Governance, this idea that a human being can attest, “Yes, this control is correct and this control is being followed, and here is the evidence of it,” just can’t keep up. It’s like manual testing in 2019. It’s just not going to keep up and get us to where we’re going.

Now you see the idea of, how do we drive governance? How do we drive compliance into the automation layer or into the systems themselves, so we can prove and attest that we’re compliant with these controls, and do it in an automated way? Definitely, John [Lewis] is a good guy to talk to about this. IT Revolution got a bunch of those banks and folks together and did a white paper on this that you can get off of their site for free.

Why this all matters is that in order to figure out what’s going on and respond as quickly as possible, we have to bring these three things together and make sure that everybody has full awareness of them. This is how we keep that above-the-line behavior in sync: by being able to distribute these three aspects of visibility, the monitoring, the observability, and the governance. I think a lot of organizations have the monitoring, the observability is getting there, but the governance side is always the missing piece.

Orient and Decide

Moving on, we want to climb up to incident command, which is the mobilization, the coordination, the communication. How do we get good at that? A lot of this, which we’ve seen come into operations in the last decade or so, is taken directly from incidents in the real world. FEMA runs this; they’re the keepers of the incident command system, which started much before them: a series of definitions and processes and practices all around how you manage the response to some type of major incident. It was written for forest fires and hurricanes, but it’s being applied to our industry.

Actually, it was Jesse Robbins, one of the founders of Chef, who was really the first person at Amazon to run game days; they called him the Master of Disaster. He was purposefully trying to break things in a systemic way so they could apply these practices and see how they actually work. Brent Chapman, I think, is another person who’s been doing this and has really come to the forefront. How do you apply these incident command principles to how we mobilize, coordinate, and communicate? It might seem heavyweight at first, but as organizations get better at it, it naturally becomes the way they talk, and they think that we have to respond to these things in a structured way.

Ernest Mueller is also a great person to follow who did a lot of this in the DevOps space, translating these incident command ideas. Give a shout-out to PagerDuty, who’s done a pretty good job. There’s an open-source project where they’ve taken all of their incident response docs, based on the incident command system, and turned them into an open-source project. They accept pull requests. Again, this is the low-tech side of the conversation, but it’s very important to understand how this applies, especially as we’re now all being asked to participate in this. It’s no longer some other operations organization that’s going to run this.

Also along this idea is, what’s going on with operations itself? We see this split that’s happening. Andrew Shafer made this T-shirt. John Willis, Andrew, and I did the first DevOpsDays in the U.S. here in Mountain View in 2010, and this was the T-shirt. It was “Ops who are devs, who like devs to be ops, who do ops like they’re devs, who do dev like they’re ops, always should be,” and if you were a teenager in the ’90s, you know the rest of the lyric. The idea is, first we’re going to be blurring these roles. The past nine years have been about how we blur the lines between devs and ops.

Now I think we’re going a step further and saying operations is starting to split itself, and you see different organizations, like the folks at Disney, or Shaun Norris, who was at Standard Chartered. It’s a big global bank, I think 60 different regulated markets, 90,000 employees. Disney is obviously Disney. You see them driving this the same way. We take what was the operations organization and divide it into two parts. There’s platform engineering, which really just looks like a product organization. It’s largely a development organization building operational tools and operational platforms. Then everybody else goes into this SRE bucket. That’s generally our expert operators, and they’re starting to be distributed. We’re starting to see this shift where the walls of operations, which were being blurred before by responsibility, are now outright disappearing, because we’ve got this growing centralized platform engineering organization, and these ephemeral, distributed, call them what you want, these folks call them SREs, expert operators who are being distributed into the organization.

That also leads to a new view on escalations. In the past, escalations were a good thing: “We’re getting it to the expert.” But nobody here enjoys being escalated to. It’s an interruption. It was, “This is a terrible idea.” Not only that, but it also just slows down the response. If we have to escalate off to people, our incidents are going to be longer, and now we’re inflicting death by a thousand cuts: we’re interrupting the rest of the organization, which just adds a bunch of delay everywhere else. We’re finally starting to realize this is a bad idea.

Jody Mulkey at Ticketmaster, and this is going back four or five years, had this epiphany. They had this old model of working with their NOC, or “the knock,” as they called it. They basically called the people in it the escalators, because all they did was look at the lights and call somebody, and their major web incidents took 40-something minutes. When the Yankees can’t print playoff tickets for 40 minutes, that’s CNN news. That’s not just TechCrunch news.

The idea was, where was all this time going? A lot of what they found was that it was stuck in the escalations, because you have to escalate up to different people. What they did was have this idea of support at the edge: let’s take all the capabilities we need to diagnose and resolve these problems, or a large chunk of them, and push them down closest to the problem. How do we empower those teams closest to the problem to go ahead and take action? Then on top of that, if they can’t take action, how do we empower them with the right diagnostic tools to figure out who to actually escalate to, so we cut that chain down?

Their story was remarkable. I think in 18 months, they went from 40-something minutes down to four minutes for major web incidents, all because it’s the same problems over and over again. How do you just empower those people closest to it to actually be operators and take action? The best part was it cut their escalations in half. Imagine being interrupted half as much as you are now. It was a huge hit, and they kept getting better and better at it.

John Hall of BMC comes back into this again. He likes to bring in this idea of swarming, which really comes from the customer service world: instead of linear escalations, how do we take more of a swarming approach to bring all the people we need to bear on these problems? It’s very interesting how the swarms work together, but it does show that you can solve a lot more problems without having those dangerous escalation chains.


Time to take action. The two actions we have to break it down into are diagnosing, so health checks and exploratory actions, and restoring: restarts, repair actions, roll-backs, clearing caches, all the known fixes for known problems that are out there. What’s really important to note here is the return of runbooks. If you lived in enterprise land at all, runbooks were mostly manual, a lot of Wikis. Not that long ago, they weren’t really talked about, because the world was going to be Chef and Puppet and Ansible and operations was going to go away. Now, thanks to the SRE movement, runbooks are back, except now it’s not about how we make better Wikis, it’s about how we automate those procedures so we can give them to somebody else. In Jody’s model at Ticketmaster, how do we give out that access so it’s given to the right place in the organization, where they can go and take action?

Runbook automation is the safe self-service access and the expert knowledge that you need to take action, something that my colleagues at Rundeck and I work a lot on. The idea is that moving the bits is the easy part; it’s the expert knowledge that is hard to spread around. It’s, “Is the restart automated?” “Sure. It’s automated. We’ll just let the developers do their own restarts in production.” Then it’s, “They’ve got to know how to quiet Nagios, and they’ve got to know how to talk to F5 and pull it out of a load balancer pool. Then they’ve got to know how to check these five things, and then run the restart script but check these other six things to make sure it worked. Wait a minute. Before they run the restart script, they’ve got to know the right command arguments, and they’ve got to edit these variables files.” It’s, “Ok. Now we get it. We can’t hand that knowledge off. It’s either going to be weeks of training, or it’s going to be months of somebody trying to script all of that out.” Moving the bits is the easy part; spreading the knowledge around is the hard part.

It’s got to be self-service, because you have to empower those closest to the action, like I just mentioned, and it’s got to be safe. By safe, I mean two things. One is, yes, from the security and compliance perspective, we have to make sure we’re only giving access to certain named procedures, not giving people random SSH access and sudo privileges in a script and wishing them luck. We’re making sure that from a security perspective we’re de-risking it, but also from the perspective of the person taking action. How do we put the guardrails around them, to know that there’s the right error handling in place or that commands are idempotent? You have to de-risk it so that even if the person has expert knowledge in some other domain, they’re being guided to the smart and safe options and they’re not going to potentially cause more problems.
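The idea of safe, named procedures with guardrails can be sketched roughly like this. Everything here is hypothetical (the runbook names, the precheck, the registry), a toy model of the pattern rather than any real runbook tool: only pre-registered procedures can run, and each one carries its own guardrail check before acting.

```python
# A hypothetical runbook runner: only named, pre-registered procedures can be
# invoked, each with its own precheck guardrail -- no raw SSH access.
RUNBOOKS = {}

def runbook(name, precheck=None):
    def register(fn):
        RUNBOOKS[name] = (fn, precheck)
        return fn
    return register

def execute(name, **params):
    if name not in RUNBOOKS:                   # only named procedures exist
        raise PermissionError(f"no such runbook: {name}")
    fn, precheck = RUNBOOKS[name]
    if precheck and not precheck(**params):    # guardrail before acting
        return "precheck failed, escalating"
    return fn(**params)

@runbook("restart-web", precheck=lambda host: host.startswith("web-"))
def restart_web(host):
    # A real runbook would quiet alerting, drain the load balancer, restart,
    # then verify health; here the restart is only simulated.
    return f"{host} restarted"

print(execute("restart-web", host="web-01"))
print(execute("restart-web", host="db-01"))
```

The compliance-friendly property is exactly what the talk describes: access is to a named, reviewable procedure, and the precheck steers a non-expert away from unsafe targets.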

What’s going on now is you’ve got these alerts, these tickets, we’ve got the incident command system, people are all riled up to go, and they’ve got three options. One is deciphering this Wiki: “Is this correct? I’m not really sure what this person was trying to say. Wait a minute. Look at this date. When did they write this?” That’s option one. Option two is dumpster diving into our shared directory to look for the right script we used last time: “But wait a minute. Did they tell me it was -i, not -e?” or, “There’s a new version, and that script is in a different directory.” There’s that problem.

What most likely happens is the escalations. We’re just pushing a disruption back into the organization. From a runbook automation perspective, we’re saying, “How do we define the workflow that allows us to call all the APIs and scripts and tools that we need, and therefore push those safe, smart options to the people closer to the problem? They can solve the problem and not have to wake you all up.” It stops these incident chains: “Yes, there’s a problem with this particular service,” and then the sysadmin or SRE shows up and says, “This is an application problem. I’ve got to call the developer,” and then they say, “I know, it’s a data problem. Let me call the DBA,” and the DBA finally shows up and goes, “Wait a minute, this is a network problem. There’s a firewall issue here.”

We’ve got these meandering chains of escalations, and all this is doing is interrupting our other work. We want to get to a point where, if it’s an unknown, we can quickly diagnose the problem and take the best ideas out of people’s heads. How would you check this service? How do you check that part of the service? Turn them into automation that our first-line folks, whoever that might be, can respond with. Either they’re going to see a known problem that they know a solution to, or they’ll know exactly who to escalate to. For known issues, it takes resolution down from potentially hours to minutes or seconds.

For all of you on the development side of the house, it stops those escalations, the “I need,” the “Can you,” the “Help me with this.” With each of those instances, a little bit of waiting is being injected into the universe, and big interruptions land on your head. Then, when it comes time for you to go and do something, you’re waiting in somebody else’s queue, and it just keeps getting worse and worse.

If you can take that expert knowledge, take these procedures, basically bottle them up and let other people help themselves, you’re getting rid of all those instances of waiting, all those instances of interruptions, and it solves these difficult security and compliance problems. Before, it was, “I could fix it, but I can’t get to it because we’ve got customer data.” I’ve seen this work in very highly regulated environments. If you’re handing out access to a named procedure, the compliance people actually like it. You can run it all through an SDLC, which means we can do code reviews and decide whether or not this is good to do.

Folks at Capital One talk a lot about this. They’ve created their own internal runbooks as a service. The whole idea is to make something like a router that says, “Hey, let’s run the diagnostics against this instance,” whenever it gets triggered, and to have two decision points. Either it’s a known problem and I’m going to fire off this fix and see what happens, or I’m going to know who to escalate to. They just spoke at the DevOps Enterprise Summit. There’s also a lightning talk from DevOpsDays Austin. It was fantastic.


The last piece here is the learning part. I’ll just leave this one piece from John Allspaw here. He talks about a mistake we all make: we want to get straight to the action items. We say, “We’ll talk about some things beforehand, but where are the action items? What did I really get out of this?” when the reality is, if you think about what you want out of it, it’s the journey where you do the learning. Understanding what happened, being able to tell those stories amongst each other, getting together and figuring out all the contributing factors, is what really drives learning in the organization. Again, incidents are unplanned investments. The ROI is up to us; it’s up to us what we’re going to get out of it. Failure is a fact of life. What are we going to get out of it?

To recap: don’t forget the environment that we’re all marinating in now, because it affects all of our lives, the decisions we make, and how we talk to each other. Follow along this pattern, break it down, and good luck out there.


Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.


NativeScript Replaces JavaScriptCore with V8 for iOS Apps

MMS Founder
MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

NativeScript’s new JavaScript runtime for iOS, based on Google’s V8 engine, is now in beta after several months of development. This change should bring reduced app startup time for iOS apps, as well as simplify the NativeScript development process.

The fundamental reason to replace JavaScriptCore, iOS’s native JavaScript engine, lies in simplifying maintenance work for the NativeScript team, writes NativeScript product manager Emil Tabakov. Indeed, the NativeScript team has heavily adapted JavaScriptCore, which Tabakov describes as a framework that is not embedding-friendly, to provide all the support they needed.

In spite of all the extra effort, using JavaScriptCore for iOS made it impossible to reach full feature parity with Android, mostly due to features that are only available when using V8. This is the case, for example, with V8 heap snapshots, which make it possible to initialize a JavaScript context from a pre-made image stored in the app package. According to the NativeScript team, V8 heap snapshots reduce the startup time of simple applications by over 30%.

What made using V8 on iOS possible was the introduction of a JIT-less mode for V8 in the first months of 2019. This switched off executable memory allocation at runtime, which is not permitted on iOS and other embedded platforms. Unfortunately, this comes with a performance penalty, since JIT-less V8 basically works as a JavaScript interpreter. According to the V8 team’s benchmarks, the performance hit is large with peak optimized code, i.e., code exercising specific language characteristics, but less significant for real-world applications.

As a last note, switching to V8, Tabakov says, paves the way to future support for Bitcode, which is a requirement for writing Apple Watch apps.

Being in beta, the NativeScript V8 runtime for iOS still requires some work to be production-ready, in particular armv7 support and fully functioning multithreading, so developers should use it at their own risk.

To use the new runtime, first install it by executing tns platform add ios@beta-v8, then launch your app as usual. Issues can be reported in the NativeScript V8 iOS runtime repository.


Mini book: The InfoQ eMag – Microservices: Testing, Observing, and Understanding

MMS Founder

Article originally posted on InfoQ. Visit InfoQ

Writing code is the easy part of software development. The real challenge comes when your system is running successfully in production.

How do you test new functionality to make sure it works as intended and doesn’t break something? When a problem arises, are you able to identify the root cause and fix it quickly? And as that system keeps growing and evolving, are you able to focus on one small area without having to understand everything?

These challenges exist whether you have a monolithic or microservices architecture. Organizations that build distributed systems have to adopt testing and observability practices that differ from those used when building a monolith.

This eMag takes a deep dive into the techniques and culture changes required to successfully test, observe, and understand microservices.

Free download


Change Data Capture Tool Debezium 1.0 Final Released

MMS Founder
MMS Jan Stenberg

Article originally posted on InfoQ. Visit InfoQ

Debezium is an open source change data capture (CDC) tool built on top of Kafka that captures and publishes changes made in a database as a stream of events. Debezium 1.0 Final was recently released with event format clean-up, increased test coverage of databases, including Postgres 12, SQL Server 2019 and MongoDB 4.2, as well as a number of bug fixes with 96 issues addressed in version 1.0 and the preview releases. In a blog post Gunnar Morling describes Debezium’s basic concepts and some common use cases, and details about both the current release and what to expect in future releases.
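To make the "stream of events" concrete, a Debezium-style change event carries the row state before and after the change, the operation type, and source metadata. The sketch below is a simplified model of that envelope (real events carry much richer source metadata and schema information):

```python
import time

def change_event(op, before, after, table):
    """Shape of a Debezium-style change event (simplified): prior row state,
    new row state, operation, and minimal source metadata."""
    return {
        "op": op,                      # "c"=create, "u"=update, "d"=delete
        "before": before,
        "after": after,
        "source": {"table": table},
        "ts_ms": int(time.time() * 1000),
    }

# An UPDATE on a customers row becomes one event on the stream:
event = change_event(
    op="u",
    before={"id": 42, "email": "old@example.com"},
    after={"id": 42, "email": "new@example.com"},
    table="customers",
)
```

Consumers downstream can replay these events to build read models, caches, or search indexes without ever querying the source database directly.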

For Morling, software engineer at Red Hat and project lead of Debezium, what’s most important in the new release is the event format clean-up. They have increased their commitment to keeping the emitted event structures and configuration options of the connectors correct and consistent, and to making sure these evolve in a backwards-compatible way, though he also points out that they have not broken things before.

In an interview with InfoQ, Morling notes some features added to the 0.10 release that he thinks are also relevant to mention for the current 1.0 release. For Postgres 10, they now support the pgoutput logical decoding plug-in and exported snapshots with no locks required. An incubating connector for Cassandra, extended and more unified metrics across the different connectors, and customizable message keys are other new features.

Morling emphasizes that the 1.0 Final release is the result of the work of the Debezium community at large. Led by Red Hat, about 150 people have contributed to the project, and he thinks this is great for an open source project. He also points out the importance of users sharing their experiences in conference talks and blog posts, because hearing about real usage of CDC and Debezium helps the team improve the product.

Upgrading from the earlier 0.10 release is mainly a drop-in replacement. For earlier releases, there are migration notes available describing upgrading procedures and deprecated options.

Work on Debezium 1.1 has started, with 1.1.0 Alpha1 just released. One new feature is the Quarkus outbox pattern extension, which will simplify the creation of outbox events within an application. Debezium supports the outbox pattern for integration of microservices, and the team wants to improve on this with the Quarkus extension, which complements the existing event router for events from an outbox table.
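The core of the outbox pattern is writing the business change and its outgoing event in the same database transaction, so the two can never diverge; a CDC tool then publishes rows from the outbox table. The sketch below illustrates just that transactional write using SQLite (table and column names are illustrative, not Debezium's schema):

```python
import json
import sqlite3

# The outbox pattern: the business row and its event row commit atomically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT)"
)

def place_order(order_id, total):
    with conn:  # one transaction: both inserts commit, or neither does
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"event": "OrderPlaced", "id": order_id})),
        )

place_order(1, 99.50)
```

Because the event lives in a regular table, the CDC pipeline picks it up from the transaction log with no dual-write problem, which is exactly the failure mode the pattern exists to avoid.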

Other features added or considered for the new release include incubating support for CloudEvents and IBM DB2, exposing topics with transaction events, and adding SPI for customizing schema and value representation of given columns. A standalone container for running Debezium without Apache Kafka and Connect is also on the roadmap.

Red Hat is working on a commercially supported offering around CDC. In preparation for a GA release, four connectors are currently available as a Technology Preview. These are part of the Red Hat Integration product, targeting deployments on OpenShift via the AMQ Streams Kafka operator.

In a podcast recently published on InfoQ, Wesley Reisz talks with Morling about the Debezium project, CDC, and some of the use cases. They also discuss the long-term strategic goals for the project.

At the microXchg 2019 conference in Berlin, Morling discussed how CDC can be used to create events from a database in a microservices architecture.


Keeping Credentials Safe, Google Introduces Cloud Secret Manager

MMS Founder
MMS Kent Weare

Article originally posted on InfoQ. Visit InfoQ

In a recent blog post, Google announced a new service, called Secret Manager, for managing credentials, API keys and certificates when using Google Cloud Platform. The service is currently in beta and the intent of this service is to reduce secret sprawl within an organization’s cloud deployment and ensure there is a single source of truth for managing credentials. Features of the service include global replication, versioning, audit logging, strong encryption and support for hybrid environments.

This service was created due to customer feedback that managing secrets regionally created a lot of friction. Seth Vargo, a Google Cloud developer advocate, explains:

Early customer feedback identified that regionalization is often a pain point in existing secrets management tools, even though credentials like API keys or certificates rarely differ across cloud regions. For this reason, secret names are global within their project.

Secret names have a global scope, but the data remains regional. This allows for transparency across the enterprise while keeping the data in a local region. However, administrators can choose which regions secrets can be replicated to. 

Secret Manager has been built with principles of least privileges and only project owners have access to secrets, unless explicit Cloud IAM permissions have been provided. All secret information is encrypted in transit with TLS and at rest, using AES-256 encryption.

Versioning is another capability included in Secret Manager. For example, an organization may need to support a gradual rollout of a new version of their app, or the ability to roll back. Versioning is automatically provided for operations including access, destroy, disable, and enable.

As organizations focus on being agile, there is a need for an application to always use the latest version of a secret. To support this requirement, there is an alias called ‘latest’ that will always return the latest version of the secret. Vargo explains why this is an important feature:

Production deployments should always be pinned to a specific secret version. Updating a secret should be treated in the same way as deploying a new version of the application. Rapid iteration environments like development and staging, on the other hand, can use Secret Manager’s latest alias, which always returns the most recent version of the secret.

Creating and accessing secrets can be accomplished through the Secret Manager API, Secret Manager Client Libraries and Cloud SDK. For example, the following command can be used to create a secret:

 $ gcloud beta secrets create "my-secret" --replication-policy "automatic" --data-file "/tmp/my-secret.txt"

Discovering where secrets are being used within your cloud deployment can often be difficult. Organizations can discover secrets in Cloud DLP using infoType detectors to identify where entities such as AUTH_TOKEN, BASIC_AUTH_HEADER, and ENCRYPTION_KEY are being used.

Some organizations have previously used Berglas, an open-source tool, for managing secrets in Google Cloud Platform. Customers can continue to manage secrets using this tool, or can use the ‘migrate’ command from the Berglas command line to perform a one-time migration of a secret to Secret Manager.

Secret Manager beta is available to all Google Cloud customers and Quickstart tutorials have been published.


Sonatype Disables Unencrypted Access to Maven

MMS Founder
MMS Erik Costlow

Article originally posted on InfoQ. Visit InfoQ

Sonatype has disabled the ability to download Java dependencies from Maven Central over unencrypted HTTP, helping secure the software supply chain against injection attacks. The change also incorporates verification of Maven’s certificate to safe-guard against man-in-the-middle (MITM) attacks.

Many software build systems in the Java ecosystem, including Maven, Gradle, and SBT, rely on Maven Central as a means for locating and downloading software dependencies. The switch to fully require HTTPS was announced by Sonatype product manager Terry Yanko and Gradle engineer Jonathan Leitschuh.

The work performed by Leitschuh builds upon work that was started in 2014, when Sonatype enabled SSL access to the Maven repository. At that time, Max Veytsman had created a write-up and tool explaining how to back-door JAR files coming from Maven as they went over a network. As a result of this attack, systems would unknowingly work with an artifact that had been altered after they requested it, since code artifacts could be modified in transit through a simple process that no longer functions as a result of Sonatype’s change.

  1. A custom proxy server would capture items coming from Maven Central.
  2. The proxy would analyze the incoming artifact as a ZipInputStream.
  3. As it detected the binary of a ZipEntry, the proxy would pass the input to a Java bytecode library, such as ASM.
  4. The custom ASM ClassVisitor would add or alter specific class’ methods, such as the static initializer method, <clinit>()V, to add custom bytecode that took action such as executing a system command.
  5. The proxy would re-zip all contents into a ZipOutputStream that was passed through to the client.
  6. The build system and/or any system that loaded or ran the back-doored classes would execute the payload that was put in place by the proxy server.
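The attack's re-zip step can be illustrated in miniature. The sketch below is a Python stand-in for the proxy's rewrite (steps 2 through 5): a JAR is just a zip, so the "proxy" unpacks it, alters one entry, and repacks the rest unchanged. Real attacks patched bytecode with ASM; here plain bytes stand in for a class file, and all names are invented for the example.

```python
import io
import zipfile

def build_jar(contents):
    """Build an in-memory zip ("JAR") from a name -> bytes mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in contents.items():
            zf.writestr(name, data)
    return buf.getvalue()

def tamper(jar_bytes, target, payload):
    """Simulate the proxy: rewrite one entry in transit, pass the rest as-is."""
    inp = zipfile.ZipFile(io.BytesIO(jar_bytes))
    out_buf = io.BytesIO()
    with zipfile.ZipFile(out_buf, "w") as out:
        for entry in inp.infolist():
            data = inp.read(entry.filename)
            if entry.filename == target:   # stand-in for the ClassVisitor step
                data = data + payload
            out.writestr(entry, data)      # re-zip everything into the output
    return out_buf.getvalue()

original = build_jar({"com/example/App.class": b"bytecode"})
modified = tamper(original, "com/example/App.class", b"+backdoor")
```

With downloads forced over verified TLS, a network-level intermediary can no longer interpose this rewrite, which is the point of Sonatype's change.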

The risk required a MITM attack and was not a vulnerability in Maven or Java itself. Exploitation examples were performed in lab environments, and there is no known attack on the Java/Maven ecosystem that leveraged this bytecode injection attack. Large-scale attacks on network routes require significant effort by organizations with significant skill, funding, and access. Within the same time period, Trend Micro’s annual 2014 vulnerability analysis cited positive Java security improvements: “Some good news – there were no Java zero-days in 2014!”

“Fortunately, application security transitioned from a single activity prior to release, to a holistic quality process applied throughout the development lifecycle. Improving supply chain security begins by each of us improving our own product security,” explains Milton Smith, principal at AppSec Alchemy. In 2014, Smith was a product manager in Oracle’s Java Platform Group.

For users that require unencrypted HTTP access, Sonatype has created a workaround that enables systems to still function. Older systems that cannot be updated to recent Maven versions can simply switch their download URLs from repo.maven.org or repo1.maven.org to insecure.repo1.maven.org. Sonatype has placed the word insecure into the hostname as a way of clearly communicating that making this change is an unambiguous security problem. Organizations with security controls can also choose to block or flag hosts that access this URL to balance between the build continuing to function and blocking potentially insecure downloads.


Gradle 6 Brings Significant Dependency Management Improvements

MMS Founder
MMS Uday Tatiraju

Article originally posted on InfoQ. Visit InfoQ

Gradle, the customizable open source build automation tool, has released version 6.0 with significant improvements to dependency management, out of the box support for javadoc and source jars, and faster incremental compilation of Java and Groovy code. In addition, the latest release 6.1.1 supports a relocatable dependency cache for speeding up ephemeral CI builds.

Gradle’s dependency management saw a number of improvements in version 6. The documentation has been restructured to help users find information on commonly used terminology and use cases related to dependency management.

Gradle Module Metadata, a format similar to Apache Maven’s POM file, is now published by default when using the Maven- or Ivy-based publish plugins. Based on this module metadata, Gradle can recommend and share versions between projects via platforms, sets of modules designed to be used together.

Gradle’s new component capabilities can be used to detect and resolve conflicts between mutually exclusive dependencies. A capability identifies a feature, such as logging, offered by one or more modules or libraries. Using capabilities, Gradle’s dependency management engine can detect incompatible capabilities in a dependency graph and let users choose between modules when several of them provide the same capability.

For example, say a module depends on the SLF4J API library and the Apache ZooKeeper library, and wants to use JDK logging as its SLF4J implementation. Since ZooKeeper itself depends on Log4J as the SLF4J implementation, the module might end up with two SLF4J implementations on its classpath. By declaring a component capability rule stating that the JDK logging and Log4J bindings provide the same capability, Gradle can preemptively detect the conflict.
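The SLF4J scenario above can be sketched in a Gradle build script using the component metadata rule API. The capability coordinates `org.slf4j:slf4j-impl` are an illustrative name chosen for this example, not an official capability published by SLF4J:

```groovy
// Sketch of a component metadata rule (Gradle 6 API): mark both SLF4J
// bindings as providers of the same (hypothetical) capability.
class Slf4jImplCapability implements ComponentMetadataRule {
    final static Set<String> IMPLS = ['slf4j-jdk14', 'slf4j-log4j12'] as Set

    void execute(ComponentMetadataContext context) {
        context.details.with {
            if (IMPLS.contains(id.name)) {
                allVariants {
                    withCapabilities {
                        // Two modules declaring the same capability on the
                        // classpath becomes a detectable conflict.
                        addCapability('org.slf4j', 'slf4j-impl', id.version)
                    }
                }
            }
        }
    }
}

dependencies {
    components.all(Slf4jImplCapability)
}

// Without a resolution rule, the build fails on the conflict; here we
// resolve it by picking the highest version among the candidates.
configurations.all {
    resolutionStrategy.capabilitiesResolution.withCapability('org.slf4j:slf4j-impl') {
        selectHighestVersion()
    }
}
```

The useful property is that the conflict surfaces at dependency-resolution time, rather than as confusing logging behavior at runtime.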

In addition, Gradle provides the concept of dependency constraints to pick the highest version of a transitive dependency that satisfies all declared constraints.
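A dependency constraint might look like the following sketch, where the module and version are placeholders: the main dependency declaration omits the version, and the constraint block supplies a minimum that Gradle combines with any transitive requirements:

```groovy
dependencies {
    // No version here; resolution is driven by the constraint below
    // plus whatever transitive dependencies require.
    implementation 'org.apache.httpcomponents:httpclient'

    constraints {
        implementation('org.apache.httpcomponents:httpclient:4.5.3') {
            because 'earlier versions have known issues we want to avoid'
        }
    }
}
```

If a transitive dependency asks for a higher version than the constraint, the higher version wins; the constraint only establishes a floor that all declared requirements must jointly satisfy.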

Gradle 6 supports automatic creation and publishing of javadoc jars and source code jars. It also publishes the information about these jars using Gradle Module Metadata. This feature can be activated for a Java or Java library project:

java {
    withJavadocJar()
    withSourcesJar()
}

Gradle 6 provides faster incremental compilation of Java and Groovy code by analyzing the impact of code changes and excluding classes that are an implementation detail of another class from recompilation. Gradle skips recompiling classes in different projects using the compilation avoidance feature. For large projects with multiple modules and deep dependency chains, this enhancement will reduce the number of recompilations and speed up incremental compilation.

Starting with version 6.1, Gradle’s dependency cache can be copied and made available to an ephemeral build agent in order for the agent to reuse the previously downloaded dependencies, and speed up the build process. An ephemeral build agent is an agent that is used just once and discarded at the end of a build. Since ephemeral agents have no state, each build will need to download dependencies from remote repositories. By copying an existing dependency cache to ephemeral build agents, builds will no longer pay the cost of downloading all dependencies.
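In a CI pipeline, seeding an ephemeral agent might look like the sketch below. The `caches/modules-2` path is where Gradle keeps its downloaded-dependency cache, but the archiving approach and file names here are illustrative, not an official Gradle mechanism:

```shell
# On a long-lived "seed" machine: run a build, then package the
# dependency cache (GRADLE_USER_HOME defaults to ~/.gradle).
./gradlew build
tar czf gradle-dep-cache.tgz -C "${GRADLE_USER_HOME:-$HOME/.gradle}" caches/modules-2

# On each ephemeral agent: restore the cache before building, so the
# agent reuses previously downloaded artifacts instead of re-fetching.
mkdir -p "${GRADLE_USER_HOME:-$HOME/.gradle}"
tar xzf gradle-dep-cache.tgz -C "${GRADLE_USER_HOME:-$HOME/.gradle}"
./gradlew build
```

Prior to 6.1 the cache contained absolute paths, so copying it between machines was unreliable; the relocatable cache is what makes this pattern safe.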

Some of the other noteworthy features in Gradle 6 are support for JDK 13, security improvements to protect build integrity, the ability to define compilation order between languages in polyglot JVM builds, and improvements for Gradle plugin authors and tooling providers.


Article: The Road to Artificial Intelligence: An Ethical Minefield

MMS Founder
MMS Lloyd Danzig

Article originally posted on InfoQ. Visit InfoQ

Increasingly rapid developments in the field of AI have offered society profound benefits but also produced complex ethical dilemmas. Many of the thorniest issues are often overlooked, even in the engineering community. There is also the meta-ethical question of who ought to make the decisions about encoding values into autonomous systems.

By Lloyd Danzig


Contrasting Sapiens International (NASDAQ:SPNS) and Mongodb (NASDAQ:MDB)

MMS Founder
