Presentation: No Next Next: Fighting Entropy in Your Microservices Architecture


Transcript

Shipman: This is the Financial Times website, ft.com. This screenshot is from sometime in 2015. The site at that time was powered by a monolith. In 2016, a new application was launched to power the ft.com website, and this is the new homepage. It is powered by a microservices architecture. Developers everywhere rejoiced. It’s faster. It’s responsive on mobile. It has a better user experience. It shipped hundreds of times a week. It didn’t take long for entropy to set in. Within a year and a half, things were not going so well. Over 80% of our 300-plus repos had no clear technical owner. Different teams were heading in different technical directions, and there were only five people left on the out-of-hours rota. Entropy is inevitable. It’s possible to fight entropy, and I’m going to tell you how.

Background

I’m Anna Shipman. I am the Technical Director for customer products at the Financial Times. My team is in charge of the ft.com website and our iOS and Android apps. I’ve been at the FT for about four years. Before I worked at the FT, I worked at the government digital service on the gov.uk website, which is also a microservices architecture. I’ve been a software engineer for nearly 20 years.

Outline

Fighting entropy takes place in three phases. Firstly, you have to start working towards order. Secondly, you have to actively remove haunted forests. Thirdly, you need to accept entropy and handle it. I'll tell you how we did those things at the Financial Times and how you can too. Firstly, I'll give you a little bit of background about the Financial Times. It's a newspaper. It's the pink one. We're a news organization, but we actually do a lot of other things as well. What I'm going to talk about is the ft.com website and our apps. You might think it's just business news, but we do a lot of other things. It's a subscription website, but some things we make free in the public interest. Something you might have seen is our Coronavirus tracker, which showed a lot of graphs of data about Coronavirus, and is free to read.

The Falcon Monolith, and the Next Microservice

That is the screenshot of the old website powered by a monolith called Falcon. Falcon had to be deployed out of hours monthly. It’s very old school. There was just no way to move to continuous deployment. The other thing about it, which I think you can see when you look at the screenshot is that different parts of the site were owned by different parts of the business. There wasn’t really a coherent whole for a user. For two years, a small team worked on a prototype of what a new ft.com website could look like, called Next. Next has a microservices architecture. There’s a focus on speed, shipping, and measurement. In October 2016, Next was rolled out to everyone. Now it ships hundreds of times a week. It is much faster. It’s responsive on small devices, which the old site wasn’t. It’s got an A/B testing framework built-in so we can test our features before they go out to see if they’re going to be useful. It’s owned by one team, customer products. That’s my team. What that means is that product and design work together to form a coherent whole for the user. As a user, you get a coherent experience. That means that the tech is governed in one place.

The Financial Times Dev Team

I joined the Financial Times in April 2018, so about a year and a half after the launch. The FT is a great place to work, it has a great culture. This is a photo of our annual Rounders Tournament, which we managed to do last year. We are a diverse team with a lot of autonomy of what work we do. Everyone is smart, and really motivated by our purpose. Our purpose is speaking truth to power. Our motto is without fear and without favor. I think at this time, we can all understand the importance of a free press. All was not well with the tech. When I joined, I met all the engineers on my team, one to one, it’s about 60 engineers, and some common themes emerged. Firstly, the technical direction wasn’t clear. Teams weren’t aware of what other teams were working on. The tech was really diverging. For example, we have an API called Next API. One team told me that Next API was going to be deprecated, so they weren’t working on it anymore. Another team told me they were actively working on developing and improving Next API.

There were haunted forests. Haunted forests are areas of the code base that people are scared of touching. The phrase comes from a blog post by John Millikin. He says some of the ways you can identify a haunted forest are things like: nobody at the company understands how the code should behave, and it's obvious to everyone in the team that the current implementation is not acceptable. Because people are afraid to touch haunted forests, they can't improve them, and they negatively affect things around them, so they decrease productivity across the whole team, not just in that area. Another thing that came out was people said feature changes felt bitty. The changes we made didn't feel like they were tied into a larger strategic goal. The clarity of direction that had come with the launch had dissipated. With a launch, it's really easy to see where you're going, to rally behind the mission. That clarity had really dissipated. It didn't feel like we were actually increasing value for our customers. One of my colleagues said something that really stuck in my mind: it doesn't feel like we're owning or guiding a system, just jamming bits in.

Over 80% of our 300-plus repos didn't have an assigned technical owner. We had 332 repositories, of which 272 were not assigned to a team. That didn't mean they didn't have a technical owner; it just meant we didn't know who it was. It meant that if something went wrong with one of them, we didn't know who to approach, or if we wanted to improve it, again, we didn't necessarily know who to ask. What I learned is that unowned tech is an operational risk that you just don't know about yet. There were five people on the out-of-hours rota. The five people on the out-of-hours rota were all people who'd worked on the original Next. As they left, for example moving to other projects within the FT or leaving the FT, they left the rota, and no new people joined. I'll talk later on a bit about how we run our out-of-hours rota. The rota relies on there being a lot of people on it, so you get a good amount of time between incidents. Five people on the rota is not sustainable; people get burned out. It's also not sustainable long-term, because if nobody new is joining, eventually those five people will leave the FT, and then we'll have no one on the out-of-hours rota. Finally, the overall view of everyone was that the system felt overly complex. When I joined, there was a small group of people called the simplification squad, who, alongside their work, were meeting to try to work out ways we could simplify our code base.

Entropy Is Inevitable

I recognized some of these themes from gov.uk. Gov.uk was also a microservices architecture; this is an architecture diagram. One way to solve these problems is to throw it in the bin and start again. Next cost £10 million, and it took 2 years to build. We don't want to drive the tech so far into the ground that it is not retrievable, and we just have to throw it away and start again. Our vision is no next Next. What that means is a focus on sustainability, on making sure that we can continuously improve it, swapping things out in flight as they become no longer the right tool for the job. In any case, there's no point in throwing it away, because we'd be here again in X years, because entropy is inevitable. Entropy means disorder. The second law of thermodynamics states that everything tends towards entropy. Basically, it's a fundamental law of the universe that if you don't stop it, things will gradually drift; the natural drift is from order to entropy. On the left side of this diagram, you've got a relatively neat little hand-drawn diagram. You've got order, but over time it'll drift towards what you see on the right-hand side, disorder. For software, that means over time your system will become messy, it'll become more complex, and eventually it will become unmanageable.

You can fight entropy. The first thing you need to do is start working towards order, start working away from entropy. Stop just jamming bits in. You need to do this more consciously with microservices than with a monolith. Because microservices are smaller units which lend themselves more to smaller pieces of work. It’s easy to just jam bits in because you’re only working on the small bit, you’re not looking at the coherent whole. Conversely, it can make it harder to do larger, more impactful pieces of work. Because when you do, you have to make the change in multiple places. Sometimes that might mean across teams. Sometimes that might mean fundamentally changing what the microservices are. It might mean merging or getting rid of some microservices. It’s harder to do that when you come from a position of already having the investment in those microservices.

Clarify Your Tech Strategy

How do you do this? Firstly, you need to clarify and communicate what order looks like. What I'm talking about here is clarifying your tech strategy. Make your intentions clear. I said that our vision is no next Next. What a strategy is, is a diagnosis of the current situation, the vision of where you want to get to, and the concrete steps to get from here, where you are, to the vision, where you want to get to. That's your strategy. You need to communicate your strategy so that people know where to go when they're trying to stop the drift towards entropy. Everybody needs to know where you're headed and what the strategy for getting there is. You need to communicate your strategy until you are sick of the sound of your own voice talking about it, and even then you need to carry on communicating. Because every time you talk about it, there'll be someone who hasn't heard it before. There'll be somebody who wasn't listening last time you talked about it, or they were there but they've forgotten, or they're new to the team. You just need to keep communicating your strategy so that everybody understands what they need to do in order to move away from entropy and towards order.

Start Working Towards Order

Once you've clarified your intentions and communicated them, you need to make it easy for people to move towards order. I'll talk about some of the things we did on ft.com. One thing we did was we moved to durable teams. When I joined the FT, we had initiative-based teams. For example, when we wanted to add podcasts to the site, we put a team together to add podcasts, and they worked together for about nine months. Then once podcasts were on the site and working well, they disbanded and went to work on other projects. There are a few problems with that model. One is around how teams work well together. You might have heard the model of forming, storming, norming, performing, which basically means you go through several phases before you become a high-performing team. If you're constantly building new teams, then you're constantly doing that ramp-up. You're constantly going through those phases, and you don't get as long in the high-performing stage, which is frustrating for the team and it's not great for the product. Another problem is around technical ownership, which we've since resolved. In that situation, if we'd wanted to make a change to podcasts, who would we have gone to? There wasn't a team there. Or if something went wrong with podcasts, again, who would we ask?

What we did was we split up our estate into product domains. It's not exactly like this now; this is the first draft, but it's very similar to what we have now. Each of these is a team, and they each have strategic oversight of that area, that product domain. They're an empowered team: each team is led by a product manager, delivery manager, and tech lead, and they set their own direction and priorities, because they work in this area and they've got the experience to identify the most valuable work they could be doing. The team is long-lived. That's really important, because that means you can make big bets. You can do the impactful piece of work that may take a really long time to pay off, but it doesn't matter: it's a long-lived team, so you'll be there to see the payoff. They also own all the tech associated with that area. As part of this work, we've moved to full technical ownership. This means every system, repo, package, and database has a team assigned as a technical owner, and that team is responsible for supporting it if it goes wrong, and for working on its future strategic direction. That did mean that each of the durable teams ended up with some tech that they didn't know about. Part of ownership is working out how much effort you need to put into really knowing something: which things you're comfortable not knowing that well, and which things are so important that you need to make sure you understand them fully.
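
To make that concrete, here is a minimal sketch of what recording technical ownership per system might look like; the field names, team names, and repo names are hypothetical, not the FT's actual tooling (the FT's Biz Ops system, described later, plays this role for real).

```typescript
// ownership.ts - a hypothetical sketch of recording technical ownership.
// Every system, repo, package, and database gets exactly one owning team.

interface OwnershipRecord {
  system: string;      // a repo or service name (examples below are made up)
  owningTeam: string;  // the durable team responsible for support and direction
  serviceTier: string; // e.g. "platinum" or "gold", the levels first-line Ops supports
  runbookUrl?: string; // where first-line support should look first
}

const ownership: OwnershipRecord[] = [
  { system: "podcasts-page", owningTeam: "content-discovery", serviceTier: "gold", runbookUrl: "https://example.com/runbooks/podcasts-page" },
  { system: "ab-testing-framework", owningTeam: "platforms", serviceTier: "platinum" },
];

// An unowned system is an operational risk you just don't know about yet,
// so fail loudly when nothing claims ownership.
export function findOwner(system: string): OwnershipRecord {
  const record = ownership.find((r) => r.system === system);
  if (!record) {
    throw new Error(`No technical owner assigned for ${system}`);
  }
  return record;
}
```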

This is a work in progress. I showed you the platforms team on this diagram. The platforms team have 72 repos. What technical ownership means in the context of 72 repos is still TBC; we're still working on that. Also to note, durable teams improved a lot of things, and feature changes now feel like they're really working towards a bigger goal. There are still problems, though. Some of the problems I mentioned with microservices are here too. You can still have siloed thinking. It can still be harder to do work that crosses domains. But it's definitely better than it was, when there wasn't that pattern for where things belonged.

Another way you can make it easy for people to move towards order is to have guardrails. What this is about is reducing the decisions that people need to make. Clarify how much complexity is acceptable for you on your project, on your work. On ft.com, we stick to Node.js and TypeScript. This is one example of guardrails. TypeScript is very similar to JavaScript: it's JavaScript, but strongly typed. Everything we write is in JavaScript or TypeScript, running on Node.js. That is one decision that you don't have to make. It naturally reduces the complexity of things, because everyone can understand all the code that's written. On gov.uk, we didn't do that. At the start, when we were writing the site, we talked a lot about how microservices allow you to use the right tool for the job, and we included the programming language in that. People chose the tool that was the best tool for that particular microservice. What that meant was, after a couple of years we ended up in a situation where most of our repos were written in Ruby, but we also had some written in Scala. Ruby and Scala are very different languages, with very different mental models. You don't get many engineers who are really good at Ruby and also really good at Scala. That made working across those services very difficult, and it made hiring difficult. That's something to watch out for. It's the kind of thing that increases entropy.
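
As an illustration only, here is a minimal sketch of how a guardrail like "we stick to JavaScript and TypeScript" could be turned into an automated check in CI. The repo names and the idea of warning rather than failing are my own assumptions, not the FT's actual tooling.

```typescript
// check-languages.ts - a hypothetical CI guardrail sketch.
// Flags repositories whose primary language falls outside the endorsed set.

interface RepoInfo {
  name: string;
  primaryLanguage: string; // e.g. as reported by a code-hosting API
}

const ENDORSED_LANGUAGES = new Set(["JavaScript", "TypeScript"]);

export function findGuardrailBreaches(repos: RepoInfo[]): RepoInfo[] {
  return repos.filter((repo) => !ENDORSED_LANGUAGES.has(repo.primaryLanguage));
}

// Example usage with made-up repo names:
const breaches = findGuardrailBreaches([
  { name: "front-page", primaryLanguage: "TypeScript" },
  { name: "legacy-search", primaryLanguage: "Scala" },
]);

if (breaches.length > 0) {
  // A guardrail, not a mandate: warn and point at the endorsed list
  // rather than hard-failing the build.
  console.warn("Repos outside the endorsed languages:", breaches.map((r) => r.name));
}
```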

Now at GDS, they have the GDS way. GDS is the government digital service. What this does is it outlines the endorsed technologies. These are things that you could use that other people are using. It means that you get benefits, like it’s quicker to get started. You can use shared tech. These are guardrails rather than mandates. You can still do something different if your project needs it. It’s not you must use this. It’s like, it will be easier for you if you do because this is what others are using. I just want to pull out the thing I said about Scala. In the GDS way, they talk about not using Scala, for the reasons I mentioned. Scala is not one of the endorsed languages. If you’re interested in what they are, they are Node.js, Ruby, Python, Java, and Go.

Another way of thinking about this, very similar, is the golden path. This is a blog post by Charity Majors where she talks about the golden path, and she outlines how to create a golden path. What that’s about is defining what your endorsed technology is, and then making it really easy to follow that. I’m going to read a bit out of the blog post, which makes this point. You define what your default components are, and then tell all your engineers that going forward, the golden path will be fully supported by the org: upgrades, patches, security fixes, backups, monitoring, build pipeline, deploy tooling, artifact versioning, development environment, even tier-1 on-call support. Pave the path with gold, hence the name. Nobody has to use these components. If they don’t, they’re on their own, they’ll have to support it themselves. You’re not saying you have to use this technology, but we will make it easy for you if you do. That is one way to keep order within your system. Essentially, the first thing you have to do is start working towards order. The way you do that is clarify what order looks like, communicate it, and then make it easy to move towards order.
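
To make the golden path idea concrete, here is a minimal sketch, under my own assumptions, of what a paved-path Node.js/TypeScript service starter might hand a team: a health endpoint and graceful shutdown wired in from the start, so the supported route is also the easy route. The endpoint path, port handling, and service name are hypothetical, not an actual FT or GDS template.

```typescript
// golden-path-service.ts - a hypothetical sketch of a paved-path starter.
import { createServer } from "node:http";

const PORT = Number(process.env.PORT ?? 3000);

const server = createServer((req, res) => {
  if (req.url === "/__health") {
    // A standard health endpoint so shared monitoring can watch every
    // service the same way. (The path is illustrative only.)
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ ok: true, service: "example-service" }));
    return;
  }
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("Hello from the paved path\n");
});

server.listen(PORT, () => console.log(`listening on ${PORT}`));

// Graceful shutdown baked in, so shared deploy tooling can roll services safely.
process.on("SIGTERM", () => server.close(() => process.exit(0)));
```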

Actively Remove Haunted Forests

The second thing you need to do is actively remove haunted forests, because entropy is inevitable. Things will get gnarled up anyway. Even if you do everything I've said, even if you've got guardrails, even if you've got a tech strategy that everybody understands, the fact is everything drifts towards entropy. It's a natural law of the universe. Things will start to get messy anyway. This could look like: the tech you used has changed, it's no longer fit for purpose, or it's no longer available. Or you've added a lot of features, and the features are interacting with each other in ways that don't quite work properly. Or just things will happen; things get messy. Sometimes people talk about this as technical debt. It's not actually the correct use of technical debt. Technical debt is where you make a decision, an active tradeoff: you do something in a hacky way to get it done quickly, rather than the right way. What you're doing is borrowing against future supportability, improvability, and reliability. The reason it's called technical debt is that if you don't pay the debt off, it starts accruing interest and everything starts getting harder. What you get through entropy isn't, strictly speaking, technical debt, but the outcome is the same. You get the same problems. Things gradually get disordered, they get less logical, they get more complex and harder to reason about, and they get harder to support.

You can go quite a long way by following the Boy Scout motto of leave things better than you found them. Whenever you work on a part of your system, improve it, leave it better. Sometimes you will have to plan a larger piece of work. You'll have to plan it and schedule it and make time for it. One of the things we did on ft.com was we replaced a package we had called n-ui. N-ui handled assets for us. It was basically on every page. It did things like building and loading client-side code, and configuring and loading templates. It did tracking. It did ads configuration. It did loads. It was a haunted forest. The current team didn't feel at all confident making changes to it. It was really tightly coupled to decisions that had been made right at the beginning of the project. We put a small team together, and they spent some time splitting out all that functionality into a set of loosely coupled packages. It's much simpler, it's much easier to understand, and all of our engineers can now confidently make contributions to different parts of it. As a byproduct, and this wasn't the stated aim of the project, it has also increased the performance of the site.
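
As a rough illustration of what that kind of decomposition buys you (the interface names here are invented for the example, not the actual packages that replaced n-ui): instead of one package that builds assets, loads templates, tracks, and configures ads, each concern gets its own small, typed surface, and a page composes only what it needs.

```typescript
// page-setup.ts - an invented sketch of loosely coupled page concerns.
// Each interface could live in its own small package with a single job,
// instead of one haunted-forest package that does everything.

interface AssetLoader {
  scriptsFor(page: string): string[];
}

interface Tracking {
  pageView(page: string): void;
}

interface AdsConfig {
  slotsFor(page: string): string[];
}

// A page pulls in only the concerns it needs, so each piece can be
// understood, tested, and replaced on its own.
export function renderHead(page: string, assets: AssetLoader, ads: AdsConfig): string {
  const scripts = assets.scriptsFor(page).map((src) => `<script src="${src}"></script>`);
  const slots = ads.slotsFor(page).map((slot) => `<!-- ad slot: ${slot} -->`);
  return [...scripts, ...slots].join("\n");
}

export function recordVisit(page: string, tracking: Tracking): void {
  tracking.pageView(page);
}
```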

You do have to schedule this work properly. This took a team of four people about nine months. That is a significant investment. It's not something you can just do alongside feature delivery; it's something you actually need to make time for. We're about to kick off another piece of work to do a similar thing. This is a diagram of our APIs for displaying content. Over time it's organically grown to look a bit like spaghetti. I mentioned Next API earlier. We're in a situation now where we've got a new homepage, and the new homepage doesn't use Next API. We've got a new app, and the new app uses both Next API and the App API. We're kicking off a project now to rationalize those content APIs.

My point here is, it's not one and done. Tech changes, things change, entropy is inevitable. You will have to keep doing these types of pieces of work. Because of that, you need to learn how to sell the work. As it happens, I recently read a really good article on InfoQ, which talks about how to sell this technical work to the business. The key points from the article are that stories are really important. You structure your message, the piece of work you want to get done, as a story with your business partner as the hero and you as the guide. Then you back it up with business-orientated metrics and data: things like productivity, turnaround time, performance, quality. Learning how to do these kinds of things is what helps you make sure you can set aside the time to fight entropy. Entropy is inevitable, so you have to make sure you schedule in larger simplification projects.

Accept Entropy and Handle It

Then the last thing you have to do, once you've done those things, is accept entropy. You have to accept that there will be disorder, and handle it. The fact is, you've accepted some level of complexity by using microservices. Whether you meant to or not, whether you know it or not, the fact of the matter is you have taken that on. Because microservices trade complicated monoliths for complex interacting systems of simple services. The services themselves are simpler; the interactions are much more complex. To return to this gov.uk diagram: as an architecture diagram, this is not the most complicated diagram. Everything's in its right place, the layers are separate, and it's clear where things live. What is complex about this diagram is the interaction. You can see that there are different colors for different kinds of interaction. With microservices, there's a possibility of cascading failures: a failure in one system causing failures in the systems around it. The other thing about microservices is the difficulty of spotting where an error is. The bug might be in one microservice, but you only see the impact several microservices away, and it can be quite hard to track that down.
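
To illustrate the cascading-failure point with a common mitigation (my own addition, not something the talk prescribes): a sketch of wrapping a downstream call in a timeout and fallback, so one slow or failing service does not stall every service upstream of it. The URL and fallback value are placeholders.

```typescript
// call-downstream.ts - a sketch of containing failures between services.
// Without a timeout, a slow downstream service ties up every caller above it,
// which is how a single failure cascades through a microservices system.

async function fetchWithTimeout(url: string, ms: number): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

export async function getHeadlines(): Promise<string[]> {
  try {
    const res = await fetchWithTimeout("https://example.com/content-api/headlines", 500);
    if (!res.ok) throw new Error(`upstream returned ${res.status}`);
    return (await res.json()) as string[];
  } catch {
    // Degrade gracefully instead of propagating the failure upstream.
    return [];
  }
}
```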

A microservices architecture is already inherently more complex. If your system is complex, the drift towards entropy will make things very messy. The main way you address this is by empowering people to make the right decisions. The more people can understand what order looks like, and can make decisions about the work, the better. Devolve decision-making as far as you can. Get to a position where people can make decisions about their work. For this, people need context, they need authority, and they need to know what to do. One really good way to get to this position is to involve people in building the tech strategy. I don't go off into a room on my own and think about the tech strategy. I work with a wide group of people to build our tech strategy: senior engineers, tech leads, principal engineers, product, and delivery. The reason I think it's really worth involving this wide group of people in tech strategy is, firstly, it's really useful to share context with each other. We all have context the others don't have. As leadership, I have context around the business and what other teams are working on, things like that. The teams actually know about the actual work. Input from people doing the work leads to better decisions. Also, if people have helped create a tech strategy, then they'll feel more empowered to actually enact it.

The first way that I involved people with building the tech strategy is, when I first joined the FT, we had an away day with the people I mentioned: senior engineers, principal engineers, product, and delivery. We did a card-laying exercise. What you do is you write down all the technical pieces of work you need to do, and you lay them out on the floor. You start at one end of the room and put down what you think you need to do in January, and then February, and then March, and you lay them out across the room. Then, as a group, you walk through them together. You start at January and you talk about the things that are on the floor. As you do that, you start to talk about things and how they depend on each other. You might say, we've got this thing in January, but actually it's less important than this thing we've put in April; let's move that April thing earlier in the year. Or you might say, actually, this thing we've got in April really depends on this thing that we don't have until May; we can't do that, because there are dependencies, and we need to bring that May thing earlier. Doing this exercise is great. It was incredibly useful for me joining the FT, because at that stage I didn't know about the tech, so I got a lot of context. Everyone gets context that everyone else has in their heads. You also leave that room with a shared understanding of what the next steps are, what order to do them in, and what the later steps are as well.

We did that right when I joined the FT. That exercise gave us the priorities and the next steps, and it saw us through for about two years. Then we got to a situation where we needed to do the exercise again, to see where we were. We booked in another away day, for the 24th of March 2020. On the 23rd of March 2020, the UK went into lockdown, so we cancelled it. We did not have that full away day, partly because I couldn't work out a good way to do a card-laying exercise remotely, but mainly because asking people to take part in a full-day meeting remotely is a horrible thing to ask people to do. I didn't want to do it. Over the next couple of years we worked on tech strategy together in various different ways. One thing we tried was a Trello board: we did voting, and we discussed it. We did that a couple of times. There was one situation where we had to make some technical decisions, and I just made those decisions with my technical leadership, my principal engineers. We made good decisions, and decisions had to be made, but I do not think that is a sustainable approach, because it is really important to involve the wider group in the tech strategy.

What we've done this time round is we wrote a document with a proposal for what we thought the next priorities for the tech strategy were. We shared that document with a group of people, so senior engineers, principal engineers, product, and delivery, and gave them about a week to read it. Then we had six smaller meetings. Each principal engineer invited the tech lead and product manager of the teams that they oversee, so each was a small group of about five to six people, and we talked through the document. It was a shorter meeting, an hour. There are a few advantages to doing it that way. We shared the document for comment, and the point of both the document and the meetings was for people to tell us what they thought of the suggested priorities. Had we missed something? Was there something in the document that was actually more important than what we thought the priorities were? Where had we got it wrong? Did they think it was right? We wanted their input and any further information they had, and we got that both via the document and then in these smaller meeting conversations. Because it's a smaller meeting, it's easier to contribute; it can be quite hard to contribute in a large meeting with 30 people. Another big advantage of doing it this way is that most of us don't have our best ideas in the moment, in the meeting. Quite often you get a better idea when you've had a chance to reflect on something. This process allowed some reflection.

You had time to reflect from reading the document. Then there was a meeting, and you could raise some thoughts there. Then there was also time after that where you could come back and comment on the document, or make some comments about the meeting. There are disadvantages to doing it this way; it's not perfect. One is that it takes longer. With an away day, it's one day and you're done: you come out with your strategy. With this process, you have a week for the document and then the meetings, so it actually took several weeks. The other, bigger disadvantage is that you don't hear what other groups think. I was the only one who got the full context, because I was in every meeting. Each smaller group only heard what the people in that group thought, and they didn't get the context from other teams. What we did about that was we made notes on every meeting in one shared document, so although they didn't hear it, they could go and see what other teams had said. This has had quite positive feedback. I think we will carry on doing this process, possibly with some refinement.

Documentation Is Key

The next thing I want to talk about in empowering people to move away from entropy and towards order is that it's really important to have good documentation. I don't need to tell you about the importance of documentation. What I will tell you is three kinds of documentation that we use. Firstly, I mentioned that there was a lack of awareness between engineering teams of work that was going on in other teams. What we found was that people with relevant experience in a different team could add useful input to a plan, but they just didn't know the work was happening. What we've done now is introduce technical design documents. When tech leads are making architectural decisions, this is how they communicate them. They share them with the other tech leads and give two weeks for people to contribute, so that everyone in the group knows what change is happening, and they can contribute if they have useful information. It also has the benefit of documenting the reasons why architectural decisions were made, as an architectural design record. This has addressed the problem that teams didn't know what work was going on in other teams, especially architectural work that could impact them.

The second thing I want to talk about is this amazing application called Biz Ops. This is built by another group in the FT, our engineering enablement group. It's a brilliant application that has information about 1500 different services across the FT. It stores loads of information. I can go in here and see what teams are in customer products. I can drill down into a team and see what repos each of them owns. I can drill down to the code; it links through to GitHub. Or, vice versa, Ops can see if a system is alerting, look in here, and see who owns that system and who the tech lead for that team is. This is really incredibly useful.
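
As a sketch of the kind of lookup a service catalogue like this enables (an invented example, not the Biz Ops data model or API): given a system name from an alert, walk from system to owning team to the person or channel to contact.

```typescript
// service-catalogue.ts - an invented sketch, not the real Biz Ops data model.
// The useful property is being able to walk alert -> system -> team -> contact.

interface CatalogueSystem {
  code: string;         // e.g. the name that appears in an alert
  repoUrl: string;      // link through to the code on GitHub
  owningTeam: string;
}

interface CatalogueTeam {
  name: string;
  techLead: string;
  slackChannel: string; // hypothetical contact route
}

const systems: CatalogueSystem[] = [
  { code: "front-page", repoUrl: "https://github.com/example/front-page", owningTeam: "core-experience" },
];

const teams: CatalogueTeam[] = [
  { name: "core-experience", techLead: "A. N. Example", slackChannel: "#core-experience" },
];

// Who should first-line Ops contact when "front-page" starts alerting?
export function whoToCall(systemCode: string): CatalogueTeam | undefined {
  const system = systems.find((s) => s.code === systemCode);
  return system && teams.find((t) => t.name === system.owningTeam);
}
```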

The last form of documentation that I want to talk about is public blogging. I am a big fan of blogging as documentation. Because once it’s out there on the internet, it is really easy to share with anyone. It doesn’t get lost. It’s not somewhere where only some people have access. It’s out there. It’s good for sharing with people who are new to the team, sharing externally. It’s got a couple of big advantages. One is a blog post has to make sense. It really clarifies your thinking. There have been situations where I’ve been writing a blog post about a piece of work. As I’m writing the post, I’ve realized that there is something that we didn’t do that would have made sense to do, and so I’ve gone away, done that work so I can come back and finish writing the blog post. The other big advantage of blogging is it needs to be a good story. A good story means that you can communicate your ideas better. Writing the blog post as a story really helps with communicating it even through different mediums. One of the main audiences for our blog is the internal audience.

Increasing Out-of-Hours Rota Participation

The last thing I want to talk about here is how those things on their own aren't enough. It's not enough just to have involvement with tech strategy, and it's not enough just to have good documentation; you need to use a variety of different methods to solve problems. I'm going to talk to you about how we solved the problem that there were only five people on the out-of-hours rota. First, I'll tell you a bit about how we do out-of-hours. Outside of customer products, we've got a first-line Ops team, and they're for the whole of the FT. They look after over 280 services. We have levels of service, and they look after services that are platinum or gold. They're based in Manila, and they look after all of them. They can do various things: they can fail over from US to EU or vice versa, they can scale up instances, they can turn things off and on again. There are various troubleshooting steps in the runbooks, and they can take those steps. If they can't solve the problem, then they call out each group's out-of-hours support.

The out-of-hours support on customer products, and across the FT, is not in our contracts; it's voluntary. It's not something we have to do. It's not volunteering in the sense of being unpaid: you do get paid for doing it, but you don't have to do it. You don't have to be on the out-of-hours rota, and you don't have to stay by your laptop making sure you don't have a drink, or anything like that. The way we do it is you're on the rota all the time, but it's best effort, meaning if you can't take the call or you miss the call, that's fine, Ops will just call the next person on the list. They call people in order of who was called out least recently; they don't call the same people all the time, they look to call the person it's been the longest since they were called out. If you do get called out, you can claim overtime, or you can take time off in lieu. We don't get called out very often; middle-of-the-night call-outs are, on average, every two months. That's how we do out-of-hours Ops.
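
A minimal sketch of the call-out ordering described above, under my own assumptions about the data involved: pick whoever on the rota was called out longest ago, treating someone who has never been called as first in line.

```typescript
// next-to-call.ts - a sketch of "call whoever was called out longest ago".
// The data shape is assumed for illustration; it is not the FT's actual tooling.

interface RotaMember {
  name: string;
  lastCalledOut?: Date; // undefined means never called out yet
}

export function nextToCall(rota: RotaMember[]): RotaMember | undefined {
  return [...rota].sort((a, b) => {
    const aTime = a.lastCalledOut?.getTime() ?? -Infinity; // never called sorts first
    const bTime = b.lastCalledOut?.getTime() ?? -Infinity;
    return aTime - bTime;
  })[0];
}

// Example: Bea has never been called out, so she is first in line.
const rota: RotaMember[] = [
  { name: "Alex", lastCalledOut: new Date("2021-05-01") },
  { name: "Bea" },
  { name: "Chris", lastCalledOut: new Date("2021-03-12") },
];
console.log(nextToCall(rota)?.name); // "Bea"
```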

When I joined, we were down to five people on the rota. Sometimes you need a variety of techniques, and I'm going to tell you what techniques we used to improve that. The first thing I'll talk about is our Documentation Day. We have runbooks. I'll define what we mean by that, because people use the term in different ways: runbooks are documents on how to operate a service. At the FT, the runbooks have a lot of detail. They've got the last release to production. They've got what failover it has: is it active-active? Is it a manual failover or an automatic failover? They've got architecture diagrams. They've got details about what monitoring is available. They've got troubleshooting steps that can be taken by first-line support. They've got who to contact if there's an issue. They weren't up to date, and these things definitely need to be up to date. The brilliant Jennifer Johnson and some of her colleagues organized this Documentation Day, where everybody on customer products downed tools, and we spent the whole day making sure that the runbooks were up to date. This had the advantage of not just making sure that the runbooks were up to date, but also spreading some understanding of different areas of the system.
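
Here is a sketch of the kind of structure such a runbook captures, with field names and example values of my own choosing rather than the FT's actual runbook schema.

```typescript
// runbook.ts - field names are my own, not the FT's runbook schema.
// Captures the operational facts first-line support needs at 3am.

interface Runbook {
  service: string;
  lastReleaseToProduction: string;   // e.g. a date or release tag
  failoverArchitecture: "active-active" | "active-passive" | "none";
  failoverType: "manual" | "automatic" | "none";
  architectureDiagramUrl: string;
  monitoring: string[];              // dashboards and alerts to check
  troubleshootingSteps: string[];    // steps first-line support can take
  contacts: { owningTeam: string; escalation: string };
}

const exampleRunbook: Runbook = {
  service: "front-page",
  lastReleaseToProduction: "2021-06-01",
  failoverArchitecture: "active-active",
  failoverType: "automatic",
  architectureDiagramUrl: "https://example.com/diagrams/front-page",
  monitoring: ["https://example.com/dashboards/front-page"],
  troubleshootingSteps: ["Fail over US to EU", "Scale up instances", "Restart the service"],
  contacts: { owningTeam: "core-experience", escalation: "customer-products out-of-hours rota" },
};
```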

Another thing we did, organized by another brilliant colleague, Sam Parkinson, was incident workshops. These involved running through old incidents in a mock way. Actually, the picture I showed at the beginning is of people running through an incident workshop, which is why they look so cheerful. Basically, we took an old incident that had happened, ran through it, and role-played it. The main advantage of the incident workshops, the really great thing that came out of them, was that more junior people saw more senior people not knowing what to do. Because when you're more junior, you think that you need to know everything to be on the out-of-hours rota. You think, I'm going to get called in the middle of the night and I won't know everything about it, so I can't join the out-of-hours rota. What they saw in the incident workshops was more senior people saying, "I have no idea what's going on. I don't know what this is. Here's where I'd look." These were really good. They really encouraged people to feel confident about joining the rota.

The last thing I want to talk about is that we introduced a shadow rota. What that is, is you sign up for the rota as a shadow person. When there's an incident, Ops will call the person who's actually on the rota, and they will also call a shadow rota person, who just gets to watch and see what's happening. Actually, we found that people on the shadow rota often make really valuable contributions to incidents, but they're not on the hook for it. All these things together meant that we more than quadrupled the number of people on the rota: we went from 5 to 22, and the team is about 60. That was really good. We've made our rota sustainable.

Summary

With your microservices architecture, you have accepted a level of complexity. The drift towards entropy will make it more messy, so you need to empower people to handle that. We fought entropy and won. We’ve now got a clear technical direction. We’ve got full technical ownership. We’ve quadrupled the out-of-hours rota. We lived happily ever after. Not quite. It’s an ongoing project, because entropy is inevitable. We have to carry on fighting these things. You can fight it. You start working towards order, actively remove haunted forests, and accept entropy and handle it.

Conclusion & Resources

There will always be complexity in microservices. There are things you can do to reduce and handle it. You have to address these things consciously, they won’t just happen. I’ve given you some tools you can use. I’ve got some useful links. The first one is my blog post about our tech strategy. The second is the article I mentioned on InfoQ about how to sell technical work to the business. The third one is our careers website.

Questions and Answers

Wells: The first question was about the built-in A/B testing capability. Is it considered to be owned by one of your platform teams or by a different durable team?

Shipman: We do have something that we built, an A/B testing framework, and that is owned by our platform team. On customer products, we have a platform team. We are looking to replace that because technology has moved on quite a lot since we built that, but it’s currently owned by the platforms team.

Wells: The second question is about the size of your platform team that are currently supporting 72 repositories.

Shipman: It varies. At the moment there are, I think four or five engineers. It’s quite a small team.

Wells: Is that big enough?

Shipman: We do have a couple of vacancies on that team that I would like to fill.

Wells: It’s always a challenge to have people working in a platform team, when you’ve also got feature development to be done, I think.

Shipman: They’ve come up with some really interesting ways to clarify which tech they’re supporting, and which tech they’ll help people with pull requests on, and which tech they’re not supporting at the moment.

It went really well. It was a good day. Definitely read Jen’s blog post about it.

Wells: The thing that I really liked about it is that everyone got quite excited, and someone made trophies for everybody. There were these brilliant trophies. I was working in operations at that point, and the quality of all of your runbooks just got so much better in one day. I've never seen developers be that excited about doing documentation for runbooks.

Shipman: There were branded pencils as well.

Wells: You said that the ability to tell a story is important, how can people get better at doing that?

Shipman: People suggested things like improvisation workshops are very good for getting into that kind of stuff. There’s a few people who talk a lot about how useful improvisation workshops are for helping you think about things.

Wells: Who is in charge of preparing the incident workshops? Do you have an internal training team or some special squad?

Shipman: That was completely on the initiative of Sam Parkinson, who's the one who wrote the blog post. He's a principal engineer on customer products. We do incident reports. After an incident, we have a blame-free postmortem where we write up what happened and our recommendations for next steps. It includes a timeline of what happened and when, and what steps were taken. He took some of those previous incidents and ran through them. We didn't have a training team around that.

Wells: Although actually, the operations team at the FT also did some things like that for other teams. People thought the idea was so great that they copied it.

Aside from explaining the context to encourage people to decide to do the right thing, were there any incentives that needed to be changed in order to encourage it?

Shipman: My sense of it was that people were really motivated by solving these problems. When I joined, there was this group of people called the simplification squad. They were attacking this problem on quite a small scale, because they were doing it alongside their work. I think there was real interest in getting those problems addressed. The vibe I got was that people were really just enthusiastic about getting these problems solved.

Wells: I think if you’ve got something that’s a bit painful for people, anything that allows them to make themselves have less pain onwards, is great.

