Article originally posted on InfoQ. Visit InfoQ
Transcript
Patel: I want to talk about some of the concrete things we’ve learned whilst building out our architecture here at Monzo. Things we’ve got them right, and our tales of woe when things have gone wrong. I’m Suhail. I’m one of the staff engineers within the platform group at Monzo. I work on the underlying platform powering the bank. We think about all of the complexities of scaling our infrastructure and building the right tools, so engineers in other teams can focus on building all of the features that customers desire.
Monzo’s Goal
At Monzo, our goal is to make money work for everyone. We deal with the complexity to make money management easy for everyone. As I’m sure many will attest, banking is a really complex industry, and we undertake the complexity in our systems to give customers a better experience. We have 7 million customers relying on us for their core banking needs. To lean into that complexity a little, on the slide is some of the many payment integrations we build for and support. Some of these schemes are still based on moving FTP files around. Each of them have different standards and rules and criteria, which we’re constantly iterating on to make sure that we can launch new features for our customers without surfacing the internal complexities and limiting our product offering.
Monzo’s Tech Stack
In September 2022, we became direct participants of the Bacs scheme. Bacs powers direct debits and direct credits here in the UK. It’s how your salary gets paid, or how your bills get automatically deducted from your bank account on a fortnightly or monthly basis. It was originally built around a 3-day processing cycle, which originally gave time for operators at banks to walk magnetic tapes with payment instructions between banks, almost like a sneak in there. Monzo has been integrated with the Bacs scheme since 2017, but that was through a partner that handled all of the integration on our behalf. We did a whole bunch of work in 2022 to build that integration directly ourselves over the SWIFT network. We rolled it out to our customers, and most crucially, no one noticed a thing. This is going to be a particular theme for this talk.
A key decision is to deploy all of the infrastructure and services on top of AWS. It’s almost a given now in the banking industry with FinTech coming along, but at the time, it was completely unheard of in financial services. At this time, the Financial Conduct Authority, which is the regulator for financial institutions here in the UK, was still putting out initial guidance on cloud computing and outsourcing. If you read some of the guidance, it mentions things like regulators being able to visit data centers during audits. I’m not sure whether that would fly with AWS, if you rock up to an availability zone, and ask them to go see the blinking lights for your instances. Monzo was one of the first companies to embrace the cloud. We do have a few data centers to integrate with some of the payment schemes I mentioned previously, but we run the absolute minimum compute possible to interface messages with our core platform that does all of the decisioning on top of services that we build on top of AWS.
We’ve got all the infrastructure we need to run a bank, thanks to things like AWS, and now we need the software. There’s a bunch of vendors out there that will sell you pre-built solutions of banking ledgers and banking systems. However, most of these rely on processing everything on-premise in a data center. Monzo’s goal was to be a modern bank, so the product wasn’t burdened by legacy technology. This meant being built to run in a cloud environment. One of the big benefits of Monzo is that we retain all of our Slack posts from the very beginning, and here’s one of the first mentions of our tech strategy. When you’re building something new, you really want to jump into the new hotness. That Tom there, is Tom Blomfield, who was Monzo’s CEO until 2020. Whilst this was probably intended to be an offhand post, Tom was actually much closer to reality than anyone realized. I was wondering why banks often have downtime during time changes from summer to winter time. A lot of that comes down to concern around really old systems seeing a clock hour repeat twice. We didn’t want to be burdened with that technology.
Adoption of Microservices
One really early decision we made at Monzo is to adopt microservices. The original engineers had plenty of experience working in prior organizations where every change would be part of a massive bundle of software that was released infrequently. There isn’t a magic formula to kickstarting the technology for a bank. You need a reliable way to store money within your systems. A few of the first things that we built were services to handle our banking ledger, a service to handle signups, a service to handle accounts, and a service to handle authentication and authorization. Each of these services are context bound and crucially manage their own data. We define protobuf interfaces and statically generate code to marshal data between services. This means that services are not doing database level joins between entities like accounts and transactions and ledgers. This is super important to establish a solid API and semantic contract between entities and how they behave. A side benefit for this is making it easier to separate entities between different database instances as we’ve continued to scale, which I’ll talk about. For example, the transaction model has a unique Account entity flake ID with the data it stores, but all the other information lives within the account service. To get full information on the account, we call the account service using an RPC, which gives us the most up to date information on a particular account.
In the early days, Monzo did RPC over RabbitMQ, you can actually see remnants of it in our open source RPC library, Typhon, on GitHub. A request would be pushed onto a queue using AMQP, and then a reply would go into the reply queue. RabbitMQ dealt with the load balancing and deliverability aspects of these messages. Remember that this was in 2015, so the world of service meshes and dynamic routing proxies was still in their infancy. This was way before gRPC. It was about the time that Finagle was getting adoption. Nowadays, we just use plain old HTTP for our RPC framework. Take for example, a customer using their card to pay for something. Quite a few distinct services get involved in real time whenever you make a transaction to contribute to the decision on whether a payment should be accepted or declined. This decision making isn’t in isolation organizationally. We need to interact with services built by our payments teams that manage the integration with MasterCard, services within our financial crime domain to assert whether the payment is legitimate, services within our ledger teams to record the transaction in our books, and much more. This is just for one payment scheme, and we don’t want to build separate account and ledger and transaction abstractions for every scheme or make decisions in isolation. Many of these services and abstractions need to be agnostic and need to scale and grow independently to handle all of the various payment integrations.
Cassandra as a Core Database
An early decision we made was also to use Cassandra as our core database for our services. Each service would operate under its own key space, and there was a strict isolation between keyspaces so a service could not read the data from another service directly. We actually encoded this as a core security control so that cross keyspaces reads were completely forbidden. Cassandra is an open source, NoSQL database, which spreads data across multiple nodes based on partitioning and replication. This means that you can grow and shrink the cluster dynamically and add capacity. This is a super powerful capability, because I’m sure many of us have stories of database infrastructures being massive projects and sometimes going wrong at the worst possible time. Cassandra is an eventually consistent system. It uses timestamps and quorum-based reads to provide stronger consistency guarantees and last write wins semantics. In the example here, the account keyspace specifies a replication factor of 3. We’ve defined in our query that we want local quorum, so it will reach out to the three nodes owning that piece of data indicated in orange and return when the majority, which is two out of three here, agree on what the data should be.
As your data is partitioned over nodes, it’s very important to define a partitioning key which will allow a good spread of data across the nodes to evenly distribute the load across nodes and avoid hot partitions. Similarly, to choosing a good hash algorithm, choosing a partitioning key is not always easy. You may need to strike a balance between fast access and duplication of data across different tables. This is something that Cassandra excels at. Writing data is extremely cheap. Often, you have a data model that works well for a period of time. You may accumulate enough under one partition, which then requires changes to reshard it. This is not too dissimilar from tuning indexes in a relational store.
In the example above, if a particular hotel has millions of points of interest, the last table may be a very expensive one to read for all of the points of interest for that particular hotel. Additionally, the partition nature of Cassandra means that iterating over the entire dataset is expensive. Many distinct machines need to be involved to satisfy a select star and count queries. It’s often advised to avoid such queries. We completely forbid them. There are constraints that we’ve had to continuously train our engineers for. Many engineers come from an SQL past with relational stores, and often need a bit of time to adjust to modeling data in different ways. Another big disadvantage is the lack of transactions. A pattern we’ve adopted quite broadly is the concept around canonical and index tables. We write data in reverse order, first to the index tables, and then lastly to the canonical table. This means if a row is present in the canonical table, we can guarantee that it’s been written to all prior index tables too, and thus the write is fully complete. Say we wanted to add a point of interest to this scheme. We’ve chosen the top table, which is the hotels table as our canonical table. First of all, we’d write to the pois_by_hotel table here, then we’d write to the hotels_by_poi table afterwards. This means that our two index tables have the poi inserted. Lastly, we’d add the poi to the hotel’s table set. This means that the hotels table there is our canonical table.
Migration to Kubernetes
Scalability is great, but comes with some amount of complexity and a learning curve, like learning how to write data in a reliable fashion. We provide abstractions and autogenerated code around all of this, so that engineers are not dealing with this by hand. It’s important that we don’t wave away distributed systems failures that do occur from time to time. We have services, and we have data storage, and a way for services to communicate with each other over RPC. We need a way to run these services in a highly available manner. We moved to Kubernetes in 2016, away from a previous Mesos and Marathon cluster, which we were running. At the time, Kubernetes was still quite rough, it was still the 1.0 days. You can sense that the Kubernetes project was going to be big, and the allure of an open source orchestrator for application development and operations is really tempting. At the time, there weren’t all of these managed offerings, and the documentation and production hardness, which we take for granted today, wasn’t present. This meant that we had to get really good at running Kubernetes and understanding all of its different foibles. That knowledge has paid off in abundance over the last couple years.
In mid-2016, is when we decided to switch to HTTP as part of our migration to Kubernetes and use an HTTP service proxy called Linkerd, which was built on top of Finagle for our service discovery and routing. It plugged in really nicely with a Kubernetes based service discovery. We gained much fancier load balancing and resiliency properties, especially in the case of a service instance being slow or unreliable. It wasn’t without problem though. There was a particular outage that we had in the early days where an interaction between Kubernetes and etcd meant that our service discovery completely broke and we ended up with no healthy endpoints for any of our services. In the end, though, these are teething problems, like with any technology that is still emerging and maturing. There’s a fantastic website called k8s.af, which has stories from many other companies about some of the footguns of running Kubernetes and its associated cloud native components. You might read all of these and come to some conclusion that Kubernetes is really complicated or should be avoided at all cost because all of the problems that many companies have surfaced. Instead, think of these as extremely valuable learnings and experiences from companies and teams that are running these systems at scale.
These technological choices were made when we had tens of engineers and services, but have allowed us to scale to over 300 engineers, and 2500 microservices, and hundreds of deployments daily, even on Fridays. We have a completely different organizational structure, even compared to two years ago. Services can be deployed and scaled independently. More crucially, there is a clear separation boundary for services and data, which allows us to change ownership of services, from an organizational point of view. A key principle we’ve had is uniformity of how we build our services. For the vast majority of microservices that we build, they follow the same template and use the same core abstractions. Our platform teams deal with the infrastructure, but then also embed the best practices directly into our library layers. Within a microservice itself, engineers are focusing on filling in the business logic for their services. Engineers are not rewriting core abstractions, like marshaling of data, or HTTP servers, or metrics for every new service they add, or reinventing the wheel on how they write their data to Cassandra. They can rely on a well-defined and tested set of libraries, and tooling, and abstractions. All of these shared core layers provide batteries-included metrics, logging, and tracing by default, and everyone contributes to these core abstractions.
Monzo’s Observability Stack
We use open source tools like Prometheus, Grafana, Jaeger, OpenTelemetry, and Elasticsearch to run our observability stack. We invest heavily in scraping and collecting all of these telemetry data from our services and infrastructure. As of writing this talk, we’re scraping over 25 million metric samples, and hundreds of thousands of spans at any one point from our services. Let me assure you, this is not cheap, but we are thankful each time we have an incident and can quickly identify contributing factors using easily accessible telemetry. For every new service that comes online, we immediately get thousands of metrics, which we can then plot on templated, ready to go dashboards. Engineers can go to a common fully templated dashboard from the minute their new service is deployed, and see information on how long requests are taking, how many database queries are being done, and much more. This also feeds into generic alerts. We have automated generic alerts for all of our services based on these common metrics, so there’s no excuse for a service not being covered by an alert. Alerts are automatically routed to the right team, which owns that service, and we define service ownership within our service catalog.
This telemetry data has come full circle. We have this customer feature called Get Paid Early, where you can get your salary a day earlier than it is meant to be paid. This is a completely free feature that we provide to our customers. As you can imagine, this feature is really popular because who doesn’t want their money as soon as possible. In the past, we’ve had issues running this feature, because it causes a multitude spike in load as our customers come rushing in to get their money. As we are constantly iterating on our backend and our microservices, new service dependencies would become part of the hot path, and may not be adequately provisioned to handle the spikes in load. This is not something that we can statically encode because this is continuously shifting. Especially when you have expected load, needing to rely on autoscaling can be problematic, as there is a couple minutes of lag time for an autoscaler to kick in. We could massively overprovision all of the services on our platform, but that has large cost implications. It’s compute that is ultimately being wasted when it is underutilized. What we did is we used our Prometheus and tracing data to dynamically analyze what services are involved in the hot paths of this feature, and then we scale them appropriately each time this feature is provided. This means we’re not relying on a static list of services, which often went stale. This drastically reduced the human error rate and has made this feature just another cool thing that runs itself on our platform without human intervention. This is thanks to telemetry data being used in full circle.
Abstracting Away Platform Infra
One of our core goals is to abstract away platform infrastructure from engineers. This comes from two key desires. Firstly, engineers shouldn’t need a PhD in writing Kubernetes YAML to interact with systems like Kubernetes. Secondly, we want to provide an opinionated set of features that we deeply understand and actively support. Kubernetes has an extremely wide surface area, and there are multiple ways of doing things. Our goal is to raise the level of abstraction to ease the burden on application engineering teams that use our platform, and to reduce our personnel cost in operating the platform. For example, when an engineer wants to deploy a change, we provide tooling that runs through the full gamut of validating their changes, building all of the relevant Docker images in a clean environment, generating all of the Kubernetes manifests, and getting all of this deployed. Our engineers don’t interact with Kubernetes YAML, unless they are going off the well paved road for their services. We are currently doing a large project to move our Kubernetes infrastructure that we run ourselves on to EKS, and that transition has also been abstracted away by this pipeline. For many engineers, it’s quite liberating to not have to worry about the underlying infrastructure. They can just write their software, and it just runs on top of the platform. If you want to learn a bit more about our approach to our deployments, code generation, and things like our service catalog, I have a talk from last year at QCon London, which is live on InfoQ, where we go into more details on the tools that we have built and use, and our approach to our developer experience.
Embracing Failure Modes of Distributed Systems
As hard as we might try to minimize it, you have to embrace the failure modes of distributed systems. Say for example, a write was happening from a service and that write experienced some turbulence, so you didn’t get a positive indication whether that data had been written, or the data couldn’t be written at quorum. When you go to read the data again at quorum, depending on which nodes you read the data from, you may get two different set of results. This is a level of inconsistency, which a service might not be able to tolerate. A pattern we’ve been using quite regularly is to have a separate coherence service that is responsible for identifying and resolving when data may be in an inconsistent state. You can then choose to flag it for further investigation, or even rectify known issues in an automated fashion. This pattern is super powerful because it’s continuously running in the background. There’s another option of running these coherence checks when there is a user facing request. What we found is that this can cause delays in serving what might be a really important RPC request, especially if the correction is complex, and even more so when it requires human intervention.
The coherence pattern is especially useful for the interaction between services and infrastructure. For example, we run some Kafka clusters and provide libraries based on Sarama, which is a popular Go library for interacting with Kafka. To give us confidence in changes to our libraries and the Sarama library that we vendor in, we can experiment with these coherence services. We run these coherence services in both our staging and production environments continuously. They use the libraries just like any other microservice would at Monzo. They are wired up to provide high quality signal if things are looking awry. We wire in assumptions that other teams may have made so that an accidental library change or a configuration change in Kafka itself can be identified before the impact spreads to more production systems.
The Organizational Aspect
A lot of this presentation has been about the technology and software and systems, but a core part is the organizational aspect too. You can have the smartest engineers, but not be able to leverage them effectively, because the engineers and managers are not incentivized to push for investment in the upkeep of systems and tooling. When used correctly, this investment can be a force multiplier in allowing engineers to write systems faster, and to run them in a more reliable and efficient manner. Earlier on, I mentioned the concept of uniformity and the paved road. It’s a core reason why we haven’t ended up with 2500 unmaintainable services that look wildly different to each other. An engineer needs to be able to jump into a service and feel like they are familiar with the code structure and design patterns, even if they are opening that particular service for the first time.
This starts from day one for a new engineer at Monzo. We want to get them on the paved road as soon as possible. We take engineers through a series of documented steps on how we write and deploy code, and provide a support structure to ask all sorts of questions. Engineers can rely on a wealth of shared expertise. There’s also value in implicit documentation in all of the existing services. You can rely on numerous examples of existing services that are running in production at scale, which have demonstrated tried and tested scale and design patterns in production. We found onboarding to be the perfect place to see long lasting behaviors, ideas, and concepts. It’s much harder to change bad behavior once someone has become used to it. We spend a lot of time curating our onboarding experience for engineers. It’s a continuous investment as tools and processes change. We even have a section called Legacy patterns, highlighting patterns that may have been used in some services, but we avoid in newer services. Whilst we try hard to use automated code modification tools for smaller changes to bring all of our services up to scratch, some larger changes may require large human refactoring to conform to a new pattern, and this takes time to proliferate.
When we have a pattern or behavior that we want to completely disallow, or there’s a common edge case, which has tripped up many engineers, we put it in a static analysis check. This allows us to identify issues before they get shipped. Before we make these checks mandatory and required, we make sure to clean up the existing codebase so that engineers are not tripped up by failing checks that may not be because of the code they explicitly modified. This means we get a high quality signal, rather than engineers just ignoring these checks. The friction to bypass them is very high intentionally to make sure that the correct behavior is the path of least resistance.
We have a channel called Graph Trending Downwards, where engineers post optimizations they may have made. An example above is an optimization we found in our Kafka infrastructure. We had a few thousand metrics being scraped by the Kafka JMX Prometheus Exporter. This exporter is embedded as a Java agent within the Kafka process. After a bit of profiling, we found that it was taking a ton of CPU time. We made a few optimizations to the exporter and half the amount of CPU that Kafka was using overall, which massively improved our Kafka performance too. We’ll actually be contributing this patch upstream to the open source JMX Exporter shortly. If you use the JMX Exporter, watch out for that. These posts don’t need to be systems related either, and we also have graph trending upwards. It’s a great place to highlight optimizations we may have made in reducing the number of human steps to achieve something, for example. It’s a great learning opportunity to explain how these optimizations were achieved too.
Example Incidents
In April 2018, TSB, which is a high street bank here in the UK, flipped the lever on a migration project they’d been working on for three years to move customers to a completely new banking platform. The migration unfortunately did not go smoothly and left a lot of customers without access to their money for a long period of time. TSB ended up with a £30 million fine, on top of the nearly £33 million compensation they had to pay to customers, and the cost of reputational damage too. The report itself is a fascinating read, and you can find it on fca.org.uk. The report talks about the technological aspects, as you would see in any post-mortem, but focuses heavily on the organizational aspects too. For example, overly ambitious planning schedules compared to reality, and things like testing and QA not keeping up with the pace of development. Many of us here can probably attest to having some of these things in our organizations. It’s easy for us to hand wave and point blame on the technological systems and software not meeting expectations or having bugs, but much harder to reflect on the organizational aspects that lead to these outcomes.
Learning from incidents and retrospecting on past projects is a force multiplier. For example, in July 2019, we had an incident where we misunderstood a configuration in Cassandra during a scale-up operation, and had to stop all writes and reads to that cluster to remediator. This was a dark day that set off a chain reaction spanning multiple years to get operationally better at running our database systems. We’ve made a ton of investments since then on observability, deepening our understanding of Cassandra itself, and become much more confident in all things operational, through runbooks and practice in production. This isn’t just limited to Cassandra, this is the same for all of our other production systems. It went from a system that folks poked from a distance to one that engineers have become really confident with. Lots of engineers internally remark about this incident, even engineers that have joined Monzo recently, not for the incident itself, but the continuous investment that we’ve put afterwards. It isn’t just a one and done reactionary program of work.
Making the Right Technological Choices
Towards the beginning, I talked about some of the bold, early technological choices we made. We knew we were early adopters, and that things wouldn’t be magic and rainbows from day one. We had to do a lot of experimenting, building, and head scratching over the past seven years, and we continue to do so. If your organization is not in a position to give this level of investment and sympathy for complex systems, that needs to factor into your architectural and technological choices. Choosing microservices or Kubernetes, or Rust, or Wasm, or whatever the hot new technology without the level of investment required to be an adopter usually ends up in disaster. In these cases, it’s better to choose simpler technology. Some may call it boring, but technology that is a known quantity to maximize your chance of success.
Summary
By standardizing on a small set of technological choices, and continuously improving these tools and abstractions, we enable engineers to focus on the business problem at hand rather than the underlying infrastructure. Our teams are continuously improving our tools and raising the level of abstraction, which increases our leverage. Be very conscious about when systems deviate from the paved road. It’s fine if they do, as long as it’s intentional and for the right reasons. There’s been a lot of discussion around the role of platform teams and their roles in organizations. A lot of that focuses on infrastructure. There’s a ton of focus on things like infrastructure as code, and observability, and automation, and Terraform, and themes like that. One theme, often not discussed, is the bridge or interface between infrastructure and the engineers building the software that runs on top. Engineers shouldn’t need to be experts in everything that you provide. You can abstract away core patterns behind a well-defined and tested and documented and bespoke interface, which will help save time and help in the uniformity mission, and embrace best practices for your organization.
This talk has contained quite a few examples of incidents we’ve had and others have experienced. A key theme is introspection. Incidents provide a unique chance to go out and find the contributing factors. Many incidents have a technical underpinning, like a bug being shipped to production or systems running out of capacity. When you actually scratch the surface a little further, see if there are organizational Gremlins lurking. This is often the best opportunity to surface team members feeling overwhelmed or under-investment in understanding and operating a critical system. Most post-mortems, especially those that you read online, are heavy on the technical details, but don’t really surface the organizational component. Because it’s hard to put that into context for outsiders, but internally, it can provide really valuable insight. Lastly, think about organizational behaviors and incentives. Systems are not built and operated in a vacuum. Do you monitor, value, and reward the operational stability, speed, security, and reliability of the software that you build and operate? The behaviors you incentivize play a huge role in determining the success or failure of your technical architecture.
See more presentations with transcripts