MMS • Jason Maude
Maude: My name is Jason Maude. I am the Chief Technology Advocate at Starling Bank. What that means is I go around and talk about Starling Bank’s technology, its technology processes, practices, its architecture, and present that to the outside world. Technology advocacy is very important, and I think growing more important as technology grows more present in everyday life. It’s especially important for a bank, because as a bank, you need to have trust, the trust of your customers, the trust of the industry, the trust of the regulators. As a highly technological bank, we need to have trust in our technology. It’s my job to go round and talk about that.
What Is Starling Bank?
What is Starling Bank? Starling Bank is a fairly new bank. It was founded in 2014. It’s a bank based in the UK. At the moment it’s UK only, but we hope to be expanding beyond that. It launched to the public in 2017. Its launch was slightly unusual in the sense that there were no branches, no physical presence on the high street, nowhere you could walk into. It was an entirely app-based bank. The way customers got a bank account was to go onto their mobile phone, their iPhone, their Android phone, download the Starling Bank app, and apply for a bank account through there. Having applied for the bank account, they do all of their banking via the app. We’re almost entirely based in the cloud, with a few caveats, and have been since day one. Completely cloud native. Our technology stack is fairly simple: most of our backend services are written in Java, running over Postgres relational databases. There’s nothing fancy, not very many frameworks in place. They’re running in Docker containers on EC2 instances. We are just starting to migrate all of these over to a Kubernetes-based infrastructure.
Starling Bank’s Tech Stack
I often get asked at these presentations, do you use such-and-such framework, or such-and-such way of doing things? My answer is always, “No, not really. We don’t use that framework. We don’t have many complicated frameworks, or methodologies, or tooling in place.” We have some, obviously, but we don’t have a lot of them. People are often surprised. They’re surprised that a very modern, tech-savvy, cloud-native company is running on such vanilla infrastructure.
I’ve seen someone already asking, you started in the 2010s, why are you using Java? It’s an old language. Absolutely, why are we using Java? Why are we running on such a simple, vanilla tech stack? That is going to be the main focus of my presentation: trying to answer the question of why we are focused on such a simple, vanilla, old-school way of doing things rather than a more modern one. It comes down to this philosophy of rampant pragmatism. I want to talk through what rampant pragmatism is, how we apply it at Starling Bank, and use the example of how our deployment procedures have evolved over time, and why we’re only just now getting to Kubernetes, to try and draw out the reasons for this.
Essential Complexity and Accidental Complexity
I’m going to start with the academic paper that inspired this philosophy of rampant pragmatism. It’s Fred Brooks’s paper, “No Silver Bullet—Essence and Accidents of Software Engineering.” No Silver Bullet. It’s a paper about the difference between essential complexity and accidental complexity. It tries to draw a distinction between these two in software engineering. What is the difference? The essential complexity of a problem is the part you’re never getting away from. It’s the part about looking at the problem, working out what the solution to it is, and how that solution should be formed. That is the essential complexity. The accidental complexity is everything that surrounds that: everything needed to implement that solution, to actually take it and put it into practice.
Let’s use an example to illustrate the difference between the two. We can imagine a scenario in a far-flung mythical future in which a group of people are sitting down to design a new software product. They are, as many of us do, starting with the time-honored tradition of standing round a whiteboard with marker pens, drawing boxes and circles and arrows between them. They are trying to define the problem space they’re working in, how they would like to solve it, and the architecture of the system they are trying to design. They’ve drawn all these boxes, and arrows, and circles, and so on. Except, rather than then picking up their phones and taking a picture of it, they instead go to a magic button. This magic button lies by the side of the whiteboard, and when you press it, it takes what is drawn on the whiteboard and turns it into code. It makes sure that code is secure, that it will work, and that it is optimized for the volumes you’re currently running at. It tests it to make sure there are no regressions, integrates it into the rest of the code base, and deploys it into production. All through this magic button.
The point of this example is to illustrate that the drawing on the whiteboard, the drawing of the circles and the squares, and the lines, and arrows, and so on, is the essential complexity. That is the bit you truly can never get away from; it is what the problem space actually needs. All of the rest of the job that we do, the creating of the code, the deploying of it, the testing of it, all of that is accidental. It’s an implementation detail, one we can try, and have tried, to get away from. No one is sitting there writing machine code. No one is tapping out 1s and 0s. We’re all writing in higher-level languages, which are gradually compiled down. We’ve tried to move away from the low-level implementation toward something that is closer to the essential complexity, but we still have a lot of accidental complexity in our day-to-day jobs. Accidental complexity isn’t inherently bad. Some of it, such as the writing of the code, is necessary. It’s required for our job to work. Some accidental complexity isn’t necessary. It slows us down. It complicates matters and makes them more difficult and tiresome to deal with. Tools, processes, and pieces of code that slow you down and stop you implementing your problem, your whiteboard circles, and diagrams, and squares, that is unnecessary accidental complexity.
Unnecessary Accidental Complexity and Necessary Accidental Complexity
You’ve got unnecessary accidental complexity and necessary accidental complexity, but how do you tell the difference between the two? How do you make sure that you are working with the necessary stuff only and not introducing any unnecessary stuff? What has this got to do with the cloud? The cloud, and our move to it as an industry, has made this question more important, more pertinent. The reason is that the cloud opens up a world of possibilities to us. It opens up the ability for us to go in and say, we’d like to try something out. We’d like to try a new thing. Let’s write something up, spin it up on an instance, and test it out, see if it works. If it doesn’t, we can take it down again. Gone are the days when you would have to write up a business case for buying a new server, and installing the server and all of the paraphernalia and fire safety equipment in a server room somewhere, in order to be able to try out your new thing. That squashed a lot of innovation. Nowadays, you can just spin up a new instance and try it out. Brilliant.
As a wise comic book character once said, with great power comes great responsibility. The power that you now have to create, design, and spin things up very quickly means that you also have the responsibility to question whether the thing you’re generating is necessary accidental complexity or unnecessary accidental complexity. Is it actually helping you deploy what you need, what you drew on the whiteboard, the problem that your organization is trying to solve, or is it going to slow things down? The sad reality is that the answer, for the vast majority of tooling that you have, is probably both. It’s probably introducing both necessary accidental complexity and unnecessary stuff. It’s probably both helping you deploy the thing and putting blockers in your way. The question is less how you can stop introducing unnecessary complexity, and more how you can limit it as much as possible.
This has been Starling Bank’s key thinking since day one. Our philosophy of rampant pragmatism asks: how do we make sure that we introduce as little unnecessary accidental complexity as possible, and reduce it down to the minimum it can possibly be? That is the philosophy. We carefully weigh up our options every time we add something. We ask, what do we need right now? What is necessary now in order to deploy this thing? Is what we’re deploying here really necessary, or could we get away with doing something simpler? As a result, our tech stack ends up being fairly simple, because every new tool, every new framework, every new way of doing things is subjected to a rigorous question of, is this going to increase the amount of unnecessary accidental complexity too much? If it is, then it is rejected, which is oftentimes why we use old languages such as Java, and so on.
Starling Bank’s Code Deploy Journey
Let’s talk through the various different ways we have deployed code into production, as an illustration of this philosophy. We start off with, what does Starling look like in the backend? What Starling looks like is reasonably simple. We have about 40 to 50 of what we call self-contained systems. Whether these are microservices or not is a question up for debate. I personally think they’re too big, but then, how big is a microservice is a question that we could spend all day discussing. I call them microliths. Each one of them has a job to do that covers an area of functionality, such as processing card payments, for example, or storing a customer’s personal details, or holding details about customers’ accounts and their balances, and so on. Each one of these self-contained systems holds those details.
We had all of those individual systems and they are theoretically deployable on their own. They’re theoretically individually deployable. We could go and increase the version of one of them, and none of the others. When we started, we rarely did that. By convention, we deployed them all together. They were all on version 4, then we moved them all up to version 5, and then 6, and so on. They were all in lockstep with one another. We were deploying as a monolith. The question becomes, if you’re deploying as a monolith, why not just write a monolith? Which is an interesting question. The reason that we kept deploying them as a monolith is because it made things much simpler. It reduced the amount of accidental complexity we had to deal with. We weren’t being hamstrung by the fact that we deployed them all together as a monolith. It wasn’t slowing us down in the early days. Trying to deploy them individually would have meant that we had to test the boundaries between them. They each had their own APIs, which they used to contact each other. If we had deployed them individually, we would have had to put tests around these APIs to make sure that we weren’t deploying a breaking change, for example, and those tests would have hardened the API shell and made it more difficult to change.
In the early days, we didn’t want that, because we were changing these services all the time, not just internally; we were also changing what concepts they were dealing with. We were splitting them apart. We were moving pieces of functionality between one service and another, and so on. We were moving all this functionality around, and as such, we decided it was easier for us to test everything as a whole, rather than trying to test individual bits and pieces, and the boundaries and the contracts between them.
This went on for some time, and we were nicely and safely deploying things. Then things started to get a bit difficult with that model. The thing that caused the difficulty was scale. As we increased the number of engineers, the number of services, and the volume of customers, we quickly developed a problem where the amount of code we were trying to deploy all at once was too big for us to really get a handle on. We had been deploying every day up to that point, and we noticed that cadence slowing down. We were deploying every day, then every two days. Some days it became too difficult to deploy our platform services. We were starting to slow down. We sat down and asked, what can we do about this? The answer was that it was time to stop deploying as a monolith. That was what was slowing us down, because we had to ensure the safety of every single change that took place. As the number of changes increased, the process of trying to verify every single change, every time we deployed the bank as a whole, meant we became too slow. If there was a problem in one section of the bank, we stopped deploying everything, even if other services were perfectly fine.
We decided to split things down. We decided not to deploy everything individually; instead, we deployed things as large groups of services. For example, anything that dealt with processing payments was one group. Anything that dealt with customer details and account management was another group. Anything that dealt with customer service, chat functionality, and helping customers was another group, and so on. Once again, the reason was that each of these groups was owned by a team. There was a team surrounding each group of services, and they could ensure the efficacy, the workingness, of those services, and deploy them as groups.
It also helped us not fall afoul of Conway’s law, the law that states that your organizational structure and your software architecture will align. I forget which way Conway says the causation arrow goes in that alignment, but the two are intimately linked. If you have three teams but five different services, you will quickly find that you have a problem, because parts of your software will either get neglected or be fought over by two different teams. That can be difficult.
Organizing your deployment procedures and methods around the teams, again, helps you stop introducing accidental complexity, because the two are aligned, and we decided it was easier to manage 6 deployable groups than 50, which would have been excessive. We moved into this situation where each group deployed all of its services at once, as a collection of services in a small monolith. A minilith, however we’re going to define these terms here.
That was the situation for a while. Then we started bringing in Kubernetes. Why only Kubernetes just now? What we realized is that as these teams had split up and become responsible for their individual sections, they wanted to create more of these services to better control their functionality. Creating new services was a bit difficult: spinning up a new service and creating all of the infrastructure around it was long and laborious. A tool like Kubernetes gives you a standard template that allows you to spin these things up without the developers having to do endless amounts of infrastructure work, defining all of these services in great detail.
Another key aspect of doing this is, back to our old friend, the regulator. Starling Bank is a fully licensed bank, so it’s in a regulated industry. Regulators are starting to worry, at least in the United Kingdom, about what they call concentration risk, that is, the risk that everyone is using AWS, and if AWS goes down, everyone goes down. To prevent that, they have tried to get people to move to a more cloud-neutral solution. This is a good thing. We need to get to a situation where one cloud provider disappearing in all zones and all regions simultaneously, unlikely as that might be, doesn’t take us down with it. The way out of that is to make sure that you are running on multiple cloud providers at once.
We started on AWS and our infrastructure was very tied to it, not by design, simply by the fact that that’s how we had started and that was what was practical at the time. The philosophy of rampant pragmatism often demands that you make sure you’re doing what’s right at the time, not just pushing yourself toward an imagined future where you think things could be better. We decided for those reasons, and for the increased speed of delivery we hope to get under Kubernetes, to start migrating onto it. Once again, we’re trying to keep it fairly simple, fairly vanilla. I’ve been to talks on Kubernetes where people have rattled off huge lists of tooling, and so on. I couldn’t tell you whether they were reciting lists of Kubernetes tools or Pokémon, I have no idea. There were so many of them, it’s difficult to keep track. Once again, there is that temptation, let’s grab all the tools, let’s bring all the tools in, without stopping and asking, is this tool going to increase our unnecessary accidental complexity or decrease it, in a net sense? Which way is the dial going to move? That’s been Starling Bank’s journey.
There are various lessons I draw from this on where the difference lies between necessary and unnecessary accidental complexity, and how you can tell the difference. One simple way is just doing a cost-benefit analysis. If you introduce a new tool, or a new framework, or what have you, are you going to spend more time maintaining that framework than you’re going to save by using it? It sounds like a simple idea. This is really an exhortation for everyone to stop and think, rather than becoming very excited about new tools and the possibilities that they bring. New tools, new languages, new frameworks can all be very exciting, but are you really going to get more benefit out of them than the cost to maintain them?
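That cost-benefit question can be made concrete with some back-of-the-envelope arithmetic. The sketch below is hypothetical, not anything Starling uses; the method name and all the figures are invented for illustration.

```java
// Hypothetical cost-benefit sketch for adopting a new tool.
// All figures are invented for illustration; plug in your own estimates.
class ToolCostBenefit {

    // Net hours saved per week: time the tool saves the team,
    // minus the time spent maintaining and operating the tool.
    static double netHoursPerWeek(double hoursSavedPerEngineer,
                                  int engineers,
                                  double maintenanceHours) {
        return hoursSavedPerEngineer * engineers - maintenanceHours;
    }

    public static void main(String[] args) {
        // A tool that saves each of 10 engineers 1 hour a week but costs
        // 15 hours a week to maintain is a net loss.
        System.out.println(netHoursPerWeek(1.0, 10, 15.0)); // -5.0
        // The same tool across 40 engineers is a net win.
        System.out.println(netHoursPerWeek(1.0, 40, 15.0)); // 25.0
    }
}
```

The point of the toy numbers is that the same tool can be necessary accidental complexity for one organization and unnecessary for another; the answer depends on your scale, not just on the tool.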
Another point is about onboarding time. The more tools and complicated frameworks you introduce, the longer it takes to get someone up to speed who has just joined your organization. I’d love for there to be some standard tools or standard frameworks, but my experience tells me that there really aren’t. There’s a plethora of different things that people use. Whatever you think is standard, you’ll always be able to find somewhere that doesn’t use it and uses something else. How long does it take you to get someone up to speed? At Starling Bank, we really want new engineers who join the company to be able to start committing code and have that code running in production within the week they join. We really want to make sure that they can have an impact on the customer, really change what the customer is seeing, very quickly. If we have lots of accidental complexity, it becomes harder to do that. How long does it take someone to be productive? How long does it take someone to learn all the tooling? How long does it take to train a new engineer at our company, even one who is already experienced? That is a good measure of whether you have introduced too much accidental complexity.
Of course, there is the big one, which is, how long does it take you to actually fix a problem? If something goes wrong, and the customer is facing a problem, you find yourself scrolling through the endless tools and going, is it this tool? Is it something in our CI/CD pipeline? Is it a code bug? Is it something to do with our security? Walking through all of these different things to try and work out where the bug is could take ages if you’ve got loads of tools, and loads of different frameworks, and so on. The more accidental complexity you introduce through this tooling, the longer it’s going to take you to actually fix something when it goes wrong, and deploy the fix and get it out and working. Those are the lessons I have learned. I encourage you all to think about essential and accidental complexity, and then your necessary and unnecessary accidental complexity, and see if you can pragmatically reduce your unnecessary accidental complexity to as low as possible.
Questions and Answers
Watson: Why would you pick that language? Is it still pertinent today?
Maude: There are a number of reasons. A lot of people go, why Java? Java is almost 30 years old at this point. One of the answers is precisely because it’s almost 30 years old. It’s been around that long, and it’s still around. It’s still working. That gives us the reliability aspect that we need. Also, it’s really easy to hire people who can code in Java, and having a good, easy recruitment pipeline is a positive thing. Again, something that increases our reliability, and so on. You should be able to find the talks I’ve done at QCon in the past about why reliability is so important to us, and how we ensure it. That’s one aspect of it.
Watson: I know that at a former company I was at, one of the benefits of Java was people could use Clojure, Scala, like a lot of the base RPC language libraries could be common. I’m in a company now that has five languages and it’s unpleasant, just to be clear.
Was there a decision made early on you would deploy services as a monolith? Are these microservices tightly coupled or loosely coupled on your end?
Maude: They are reasonably loosely coupled by now, mainly because they have very distinct jobs. That distinctness of job means that the coupling is as loose, and as proper, as we can make it. We know whose responsibility it is to store a customer’s address, for example. If we got a new piece of personal information about the customer, we would know exactly which service to store it in, and exactly what the API should look like to transfer it out. They’re relatively loosely coupled. Under the covers, a service can constantly churn as much as it likes, because as long as it keeps fulfilling its contract, we’re happy with it. I’m happy with the level of coupling we have.
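That ownership boundary can be sketched in a few lines of Java. This is a hypothetical illustration, not Starling’s actual code: the names are invented, and a real self-contained system would sit behind a REST API and a Postgres database rather than an in-memory map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: one self-contained system owns customer addresses.
// Other services only ever see this narrow contract.
interface CustomerDetailsApi {
    void storeAddress(String customerId, String address);
    Optional<String> fetchAddress(String customerId);
}

// The implementation behind the contract is free to churn (swap the map
// for Postgres, restructure internals) as long as the contract holds.
class CustomerDetailsService implements CustomerDetailsApi {
    private final Map<String, String> store = new HashMap<>();

    public void storeAddress(String customerId, String address) {
        store.put(customerId, address);
    }

    public Optional<String> fetchAddress(String customerId) {
        return Optional.ofNullable(store.get(customerId));
    }
}
```

The design point is the one from the talk: the contract is the only thing other services depend on, so internal churn never forces a change elsewhere.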
Watson: I saw Alejandro picked up on YAGNI, you aren’t going to need it, like the pragmatism. I think people connected with that. I don’t know if you have anything you want to say on that.
Maude: Absolutely. I think this is YAGNI taken to its logical conclusion. YAGNI is a bit too definitive for me. “You aren’t going to need it” sounds like we know you’re not going to need it. I think that’s rarely true. What is true is that you don’t know yet whether you’re going to need it. You have to be able to say, we don’t know, so let’s not do it, because we’re not really sure. The caveat is that if you don’t yet know whether you’re going to need it, you have to be ready, when you do know that you need it, to change quickly. That’s the essential thing you have to have in place in order to employ this rampant pragmatism.
Watson: How are you managing networking among services, are you using service mesh? How do you look at the networking perspective of your move?
Maude: Again, no, there isn’t anything. We are very much of the philosophy that the smarts should be in the services, not in the pipes. The connection between our different systems is simply a REST API, that’s it. We send JSON payloads over the wire between the two services, and that’s it. There’s nothing smart or clever happening in the pipework. All of the smarts take place in the service. As soon as a service receives an instruction, it saves the instruction. That basically becomes a thing that must be executed, a thing that must be processed. That’s put onto queues, and the queues have a lot of functionality to make sure that they do things at least once and at most once. They make sure that they retry if things go wrong, and that things are picked up again if an instance disappears and then reappears. There is retry functionality in there. There is idempotence in there to make sure that multiple processings don’t result in, say, sending the same payment five times. All of that smarts is contained in the service, not in the pipework. The pipework is boring, just a payload over the wire.
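The idempotence idea can be sketched like this, in Java since that is Starling’s backend language. Everything here is hypothetical: the names are invented, and a real service would record processed instruction IDs in its database rather than in memory.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of idempotent instruction processing. Under
// at-least-once delivery, the queue may hand us the same instruction
// more than once; recording its ID makes the redelivery a no-op.
class PaymentProcessor {
    private final Set<String> processedInstructionIds = new HashSet<>();
    private int paymentsSent = 0;

    // Called by the queue; may be invoked repeatedly for one instruction.
    void process(String instructionId) {
        // Set.add returns false if the ID was already present, so a
        // redelivered instruction is skipped instead of reprocessed.
        if (!processedInstructionIds.add(instructionId)) {
            return;
        }
        paymentsSent++; // stand-in for the real side effect
    }

    int paymentsSent() {
        return paymentsSent;
    }
}
```

This is the "at least once and at most once" combination from the talk: the queue retries until the instruction is processed at least once, and the idempotence check ensures the side effect happens at most once.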
Watson: How do you handle deploy and rollback with your monolith approach, if you don’t have API versioning?
Maude: This was interesting in the really early days, when we were deploying the whole thing as a monolith. We didn’t need to deploy the whole thing as a monolith. We could have deployed it as individual services, but deploying it as a monolith was easier to test. When we needed to roll back, oftentimes, if an individual service was the problem, we would check that rolling back that service would be fine. We would essentially do a manual check to see if the APIs had broken, and then roll back that individual service. We then would have a situation where one service was on version 5, and everything else was on version 6. That was possible. We could do that in an emergency scenario. The same is true nowadays. We will sometimes do that. Obviously, the preference nowadays is to roll back the entire service group. If a service group, payments processing, say, encounters a problem in one service, we’ll roll back the entire service group to an older version, if we can.
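One pattern that makes that mixed-version state workable is the tolerant reader: each service reads only the fields it knows about and ignores everything else, so a version 5 service can safely consume a payload produced by a version 6 peer. The talk doesn’t name this pattern, so the following is a hypothetical sketch, modeling the JSON payload as a flat string map.

```java
import java.util.Map;

// Hypothetical tolerant-reader sketch: a version 5 service reading a
// payload produced by a version 6 peer. Unknown fields are ignored and
// missing optional fields get defaults, so mixed versions can coexist.
class CustomerPayloadReader {

    record Customer(String name, String address) {}

    static Customer read(Map<String, String> payload) {
        // Read only the fields this version knows about; anything else
        // in the payload (new v6 fields, say) is simply ignored.
        String name = payload.getOrDefault("name", "unknown");
        String address = payload.getOrDefault("address", "");
        return new Customer(name, address);
    }
}
```

Reading this way is what keeps a one-service rollback from becoming a breaking change, as long as no required field was removed or renamed.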
Watson: Besides the No Silver Bullet paper, do you recommend any literature, papers, or blogs on the topic?
Maude: I don’t have anything off the top of my head that I can think of.
Watson: Given you went through the process of moving from monolith to microservices, have you ever backed out of a tech decision because it unexpectedly produced too much unnecessary complexity?
Maude: We’ve certainly backed out of using suppliers because they introduced unnecessary complexity. Yes. A lot of the time, what we do is get a supplier to do something, and then we realize that maintaining the link with the supplier is too complicated. It’s generating more complexity than it’s solving, so we write it ourselves. We have done that on a number of occasions. If you are doing Software as a Service, please make sure that you are there to reduce the unnecessary complexity of your customers, not increase it. Because otherwise, hopefully, they will get rid of you.
Watson: You mentioned the cost benefit ratio of adopting something as a factor, and someone said, what type of timeline would you assume when you’re assessing that cost benefit ratio? Does it depend, and do you have criteria?
Maude: We’re assessing it all the time. We’re fairly open to people grumbling about tooling and saying, I don’t like this tool, and so on. Because if the grumbling gets too much, then people will start going, this tool really does need to be replaced. We don’t have a formalized, “We will review it every six months to see.” Instead, we’re constantly looking at what the problems are now. Where are the accidental complexities now? How can we decrease them, and so on? That sort of assessment.
Watson: How do you test contracts between these big blocks, those that are launched independently?
Maude: How we test them is we have what we call an API gateway, which is a fairly transparent layer that traffic into and out of a group goes through. We write contract tests, one on either side of the line: a consumer contract test and a provider contract test. The consumer in Group A and the provider in Group B will both be tested, one to test that it generates the correct payload, the other to test that if it is given that payload, it can consume it. It’s very much not trying to test the logic of the consumption. It’s not trying to test that the payload is processed correctly, simply that it can be consumed, that you haven’t broken the API to the extent that you can’t send that payload. Again, one of the reasons we can do that is because the pipework is very simple. There is no complexity in the pipework that we need to concentrate on maintaining.
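The shape of those two tests can be sketched roughly as follows. This is a hypothetical illustration, not Starling’s actual test code: in reality the consumer and provider checks live on either side of the API gateway, in different groups’ code bases, and the field names here are invented.

```java
import java.util.Map;

// Hypothetical contract-test sketch. The "contract" is just the set of
// fields the consumer promises to send and the provider promises to accept.
class PaymentContract {

    // Consumer side (Group A): builds the payload it will send. The
    // consumer contract test asserts this payload matches the contract.
    static Map<String, String> buildPaymentPayload(String accountId, long pence) {
        return Map.of("accountId", accountId, "amountPence", Long.toString(pence));
    }

    // Provider side (Group B): checks only that it *can* consume the
    // payload -- the fields it needs are present and parseable. It does
    // not test the business logic of processing the payment.
    static boolean canConsume(Map<String, String> payload) {
        if (!payload.containsKey("accountId")) {
            return false;
        }
        try {
            Long.parseLong(payload.getOrDefault("amountPence", ""));
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

Pairing the two checks catches a breaking API change at build time, on whichever side introduced it, without hardening anything beyond the payload shape itself.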