Transcript
Thomas: Welcome to my talk about orchestration versus choreography, patterns to help design interactions between our services in a microservices architecture. To start off, I’m going to take us on a little tangent. I want to talk to you about footballs, specifically this ball that’s on screen right now, which is the official match ball for the World Cup 2022. Footballs consist of many different panels that are normally stitched together to make a sphere. In modern footballs, they’re not stitched together using traditional methods anymore. They’re thermally bonded. The panels can be of all sorts of different shapes and sizes. This allows manufacturers to make balls as round as possible, which contributes to different properties, makes them fly faster and straighter, and gives players more options for bending the ball and stuff like that. Just park that idea for a second because I’m going to take you on another tangent, which is about the bit that’s inside this football. This football is incredible. It’s got a sensor in the middle that measures different aspects of the ball 500 times a second, so that FIFA can make better refereeing decisions. They can determine how fast the ball is moving, how often it’s been kicked, where it is on the pitch. It can help inform decisions around offsides, and other things like that. Also, they’re going to use it to produce AI graphics so that you can see at home exactly what’s going on at any time, with absolute millimeter precision.
This isn’t what I wanted to talk to you about footballs for. Remember, I was talking to you about those panels. I think the panels on a football, especially old footballs, bear a striking resemblance to how our software systems look. Take this, for example, this ball has been well and truly used to within an inch of its life. You can see that it’s worn, and the seams are starting to show. In certain places bits are falling off it. This is a bit similar to how our systems age, if we don’t look after them properly. This one here you can see has been pushed even further, and it’s no longer even round. It’s debatable whether this will roll properly, whether it will fly straight through the air. It’s really in a bad way. If we let our systems get in this state, there’s normally only one course of action left for us, and that’s to throw it away and start again with something new. If we don’t take action soon enough, perhaps we’ll see something like this, where bits of our football will explode and the innards will pop out and it becomes completely unusable. It might still have a chance of being fixed, but it’s a serious incident. Or if it’s just too far gone, and everything’s exploded, and there’s no chance of fixing it, we literally are left without any choice but to start again. A very costly mistake.
Background
I’m Ian. I’m a software architect from the UK. I want to talk to you about quite a few things that, unlike the footballs, help us to maintain our systems to keep them in good working order over time. Specifically, I want to look at this idea of change and how the decisions that we make when we’re designing our architectures, and the way that we implement our architectures affects our ability to effect it. Within that context, I want to think about coupling and complexity and various other aspects of how we put our systems together, including the patterns that we might use, such as orchestration and choreography that help us to effect long term evolvable solutions. As we go through this talk, I’d like you to keep in mind these five C’s that I’ve called out on screen right now. I want you to think about them not just from a technology perspective, but also from a human and organizational perspective. Because ostensibly, this is a talk about decision making in design, and that involves people as well as systems.
Lehman’s Laws of Software Evolution
If we start at the point of change, a good reference, I think, to help us understand why change is necessary is a paper by a man called Manny Lehman, who was a professor at Imperial College London and, before that, a researcher at IBM’s division in Yorktown Heights, New York. His paper, “Programs, Life Cycles, and Laws of Software Evolution,” lays out some laws that I think are quite relevant for this discussion. He’s got eight laws in total, and I’m not going to go through all of them in individual detail. I just want to call out four in particular that all relate to various aspects of how we build and have to maintain our systems: change, complexity, growth, and quality. They all describe the need for change. Without change, our systems become less satisfactory. Without growth, they become less satisfactory. By adding change and adding things as new features, without intentional maintenance or intentional work to manage and reduce complexity, the effort to change in the future is made greater, so we slow the rate of change that’s possible. Then this idea at bottom left of declining quality is a really interesting one, because it’s saying that, essentially, stuff’s changing all around our system, and if we don’t keep up with it, then we’re also going to have a perception of reduced quality.
That leads on nicely to this other framing, which is thinking about where change comes from and what stimulus triggers it. The orange boxes are fairly obviously things that happen outside of our system: user needs, or feature requirements, or wanting to adapt to modern UI standards, whatever. These are things that would cause us to have to update our systems. The purple ones reflect more the problem that if we’re not careful and deliberate in the way that we work with our systems, and put the effort in to maintain them properly, we will self-limit our ability to impose change on the systems. The two that are in the middle, the conservation of organizational stability and conservation of familiarity laws, are really interesting from a sociotechnical perspective, because they are effectively talking about the need for teams to remain stable, so they can retain mastery at a decent enough level to make change efficiently. They also need to be able to keep the size of the system at a relatively stable level, otherwise they can’t hold all the information in their heads. It’s an early source of thinking about cognitive load. Useful things to think about.
In his paper, Manny Lehman calls out how the growth of complexity within a system leads to a slowdown in change. This graph really clearly shows how, over a number of years, the ability to effect change is slowed quite rapidly. The amount of change that happens at each release is reduced. Typically, when we think about microservices, it’s in the context of our organizations wanting to scale, and that’s not just scaling the systems themselves, it’s also scaling the number of people that can work on them. We’d like to think that adding more people into our business would lead to an acceleration in the rate of features that we deliver. Actually, I’ve seen more often than not that, without proper design and without proper thought into how we carve up our systems and the patterns that we use, adding more people slows us down. Just take a minute to think which of these charts most closely matches your situation, and perhaps have a reflection on why it is that you are in the situation you’re in.
Workflows
We’re talking about interaction patterns between our services. That probably means we’re thinking about workflows. It’s useful to introduce the concept of what a workflow actually is. I’ve got this rather crude and slightly naive version of an e-commerce checkout flow to hopefully highlight to you what I would consider a workflow. You can see there are various states and transitions in this diagram. If you go along the top, you’ve got the happy path. If you also look, there are quite a few states where we’ve had a problem, and we need to recognize that and handle it. Going through this, you can see there are lots of different things to consider not least the fact that some of these stages don’t happen straightaway. Clearly, once we’ve processed the payment, the product isn’t going to immediately jump off a shelf and into a box and into a wagon to deliver it to your house. Equally, if there are problems, some of them are completely outside of our control, and people end up getting stuck in quite horrible ways.
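To make that concrete, here is a minimal sketch of a checkout workflow modeled as an explicit state machine. The state and event names are illustrative rather than taken directly from the diagram, but the shape is the same: a table of allowed transitions, where anything outside that table is one of those problem states we have to recognize and handle.

```python
# A minimal sketch of the checkout workflow as an explicit state machine.
# State and event names are illustrative, not the talk's actual diagram.

TRANSITIONS = {
    ("basket_checked_out", "payment_succeeded"): "payment_complete",
    ("basket_checked_out", "payment_failed"): "payment_declined",
    ("payment_complete", "stock_allocated"): "awaiting_fulfilment",
    ("payment_complete", "stock_unavailable"): "needs_refund",
    ("awaiting_fulfilment", "order_dispatched"): "in_delivery",
    ("in_delivery", "order_delivered"): "complete",
}

def next_state(current: str, event: str) -> str:
    """Return the next state, or raise if the transition isn't modelled."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(f"No transition from {current!r} on {event!r}")

# Walking the happy path along the top of the diagram.
state = "basket_checked_out"
for event in ["payment_succeeded", "stock_allocated", "order_dispatched", "order_delivered"]:
    state = next_state(state, event)
print(state)  # complete
```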
One of the useful things about having diagrams like this, when we’re thinking about designing systems, is that we can start to look at meaning behind them. Here I’ve drawn some boxes around related areas and related states. This is where we start to see areas that we could perhaps give to teams as individual ownership. A lot of the terminology that I’m going to use to talk about these workflows will relate to domain-driven design. Just as a quick recap, to help make sure everyone’s on the same page with what the words mean: the blobs in the background represent bounded contexts. A bounded context is a way of thinking about a complete and consistent model that exists strictly within the bounds of that bounded context. People from different aspects of the business will have a shared understanding of what all the terms and models mean, and that is our ubiquitous language. It’s important to remember that within a bounded context, we’re not just saying it’s one microservice. There could be many services and many models in there. The key thing is that they’re consistent. We could further group things within a bounded context by creating subdomains. The last thing to call out is that the bit between all of these bounded contexts is really important because, ultimately, these things have to work together to provide the value of the whole system. That bit in between is where we define our contracts. It’s the data layer. It’s the interchange context that allows us to transfer information between our bounded contexts in an understandable and reliable way.
With that being said, we’ll just zoom back out to the overall workflow, and look at this in a different way too, because, again, it’s helpful to think about the meaning when you see the lines between the states. Here, we can identify when something might be considered an external communication. We’re going to have to think about our interaction patterns and our interchange context, or whether it’s an internal communication, and we’ve got a lot more control. We don’t have to be quite as rigorous and robust in our practices, because we have the ability within the team to control everything.
Interaction Patterns – Orchestration
Let’s move on to interaction patterns. I’m going to start by talking about orchestration, because perhaps it’s the slightly simpler of the two to understand. Orchestration typically is where you have a single service that acts as an orchestrator or a mediator, and that manages the workflow state. You would probably have a defined workflow that it knows about and has modeled properly. It involves organizing API calls to downstream services that are part of a bounded context. There’s generally no domain behavior outside of the workflow that it’s mediating; it’s generally just there to effect the workflow, and all the domain knowledge is in the bounded contexts themselves. One of the more naive ways of thinking about orchestration is that it’s a synchronous approach. That’s true to an extent, because a lot of the time the orchestrator will be calling into these services using synchronous API calls. It doesn’t necessarily mean that the overall workflow is synchronous. Because we’re using these synchronous calls, it does mean that there might be a tendency for there to be some latency sensitivity, especially if things are happening in sequence. If the workflow is particularly busy, and is a source of high traffic, that scale is obviously going to cascade too, which might also lead to failure cascading down.
Clearly, this is a very naive view of what an orchestration workflow might look like. One of the things that’s quite interesting to consider is that orchestration has informal implementations too. You might have quite a few of these without really thinking about it within your code bases. Where we had bounded contexts before that just looked like they were the end of the request from the orchestrator, they could actually act as informal orchestration systems themselves. We had a system like this in a backend-for-frontend application that we designed when I was working with PokerStars Sports. Here you can see, it’s a fairly straightforward pattern to implement, because we’re only going to go to three services to build up data. The end result that a customer will see relies on calling all three of these systems. Now this UML sequence diagram is a little bit dry. Some of the problems with this are probably better called out if we think about it in a slightly more architectural-diagram manner. The key thing to show here is that we kind of see this representation of a state machine in this flow. One of the things that’s missing is all the error handling. This is the problem with this approach, because in a general-purpose programming language, you might find that these ad hoc state machines exist in a lot of places. Essentially, these workflows are really just state machines, or statecharts, or however you want to name them. If we’re not careful, and we don’t think about them, and design them properly, and do the due diligence that you might do if you were defining a workflow using an actual orchestrator system, such as Step Functions or Google Workflows, then it’s easy to miss some of the important nuances in error handling scenarios.
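As an illustration, here is a minimal sketch of that kind of informal, ad hoc orchestration in a backend for frontend: three downstream calls aggregated into one response, with the error handling made explicit rather than left implicit. The service names, URLs, and fallback behavior are hypothetical, not the actual PokerStars Sports design.

```python
import requests

# Hypothetical downstream services a backend-for-frontend might aggregate.
SERVICES = {
    "events": "http://sports-events.internal/today",
    "prices": "http://pricing.internal/prices",
    "account": "http://account.internal/preferences",
}

def build_homepage(user_id: str) -> dict:
    """Informal orchestration: call each service in turn and assemble one response."""
    page, errors = {}, {}
    for name, url in SERVICES.items():
        try:
            response = requests.get(url, params={"user": user_id}, timeout=1)
            response.raise_for_status()
            page[name] = response.json()
        except requests.RequestException as exc:
            # Every arrow in the sequence diagram needs a decision like this:
            # fail the whole page, degrade gracefully, or retry.
            errors[name] = str(exc)
            page[name] = None
    page["degraded"] = bool(errors)
    return page
```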
Human-in-the-Loop and Long Running Workflows
Another thing to think about, and I mentioned it before, is that there’s this simplification that orchestration is synchronous and choreography is asynchronous. Actually, if you think about it, a lot of the workflows that you will be modeling in your architectures often have a human in the loop. What I mean by that is that there’s some need for some physical thing to happen in the real world before your workflow can progress. Think about fulfillment in our e-commerce example. Perhaps, closer to home, a better way to emphasize what I mean here is to think about something that most developers will do. Whether you work in e-commerce or not, this will hopefully bear some resemblance to your working patterns, and that is the development workflow. Because here we’ve got a good example of something that needs to have both human and automation aspects. There are systems that need to talk to each other to say, yes, this has happened or, no, it hasn’t, in the case of maybe a PR being approved and that triggering some behavior in our CI pipeline to go and actually build and deploy to a staging environment, for example. Then you can also see how there’s a need for parts of this workflow to pause, because we’re waiting for a manual approval, or because a PR review is not going to happen instantaneously; there are things happening in the real world that need to be waited on.
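Here is a minimal sketch of what that pausing can look like in practice, assuming the workflow state is persisted rather than held in a blocked thread: the workflow parks itself in a waiting state, and a later callback (a webhook from the code review tool, say) moves it on. The table layout and function names are made up for illustration.

```python
import sqlite3

# A minimal sketch of a long-running workflow that pauses for a human step.
db = sqlite3.connect("workflows.db")
db.execute("CREATE TABLE IF NOT EXISTS workflow (id TEXT PRIMARY KEY, state TEXT)")

def start_release(workflow_id: str) -> None:
    # Automated steps run, then the workflow parks itself instead of blocking a thread.
    db.execute("INSERT INTO workflow VALUES (?, ?)", (workflow_id, "awaiting_pr_approval"))
    db.commit()

def on_pr_approved(workflow_id: str) -> None:
    # Called later (hours or days) by a webhook when the human step completes.
    row = db.execute("SELECT state FROM workflow WHERE id = ?", (workflow_id,)).fetchone()
    if row and row[0] == "awaiting_pr_approval":
        db.execute("UPDATE workflow SET state = ? WHERE id = ?",
                   ("deploying_to_staging", workflow_id))
        db.commit()
        # ...trigger the CI pipeline from here...
```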
Handling Failure
The other thing to think about, because we’re using APIs a lot with these orchestration patterns, is that there are significant things like failures that we have to consider in our application design. Thankfully, I think we’ve got quite a mature outlook on this now as an industry. Many of these patterns are very well known and often implemented in our technology by default. With the rise of platforms and platform teams, and the way that they think about abstracting the requirements for things like error handling and failure handling, you often find that many of these requirements are now pushed to the platform itself, for example, if you’re using something like a service mesh. Then there are the cases where the failures or errors are expected and are part of the workflow itself, so they’re not unexpected things like network timeouts or similar problems. The nice thing about having a central orchestration system is that when something goes wrong, you can model the undos, the compensatory actions that you need to take to get the previous steps in your workflow back into a known good state. For example, here, if we had a problem where we couldn’t fulfill an order and we needed to cancel the payment and reallocate stock or update our inventory in some way, it’s easy for the orchestrator to manage that because it has knowledge of the whole workflow. The last thing to say is that because of the way you can build some of these workflows using cloud systems, like Step Functions, and Google Workflows, or Azure Logic Apps, it’s trivial to have a single instance of an orchestrator per workflow. Rather than one single thing for all workflows, you can build them similar to how you would build your microservices, so you have one per specific use case.
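This is essentially the saga idea: each completed step registers a compensating action, and if a later step fails, the orchestrator runs the undos in reverse order. Here is a minimal sketch with the service calls stubbed out; in a real orchestrator each of these would be an API call into the owning bounded context.

```python
# A minimal sketch of orchestrator-driven compensation. The service calls are
# stand-in stubs; the fulfilment step fails on purpose to show the undo path.

def allocate_stock(order): print("stock allocated")
def release_stock(order): print("stock released")
def take_payment(order): print("payment taken")
def refund_payment(order): print("payment refunded")
def request_fulfilment(order): raise RuntimeError("warehouse unavailable")

def run_checkout(order: dict) -> bool:
    compensations = []
    try:
        allocate_stock(order)
        compensations.append(lambda: release_stock(order))

        take_payment(order)
        compensations.append(lambda: refund_payment(order))

        request_fulfilment(order)  # fails in this sketch...
        return True
    except Exception:
        for undo in reversed(compensations):  # ...so refund payment, then release stock
            undo()
        return False

print(run_checkout({"order_id": "ord-123"}))  # False, after both compensations run
```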
Orchestration, Pros and Cons
Orchestration, pros for it. There’s a lot you can do to help with understanding and managing the state of your workflows. That means that you can do quite a lot of stuff regarding error handling, including complex error handling, in a more straightforward and easily understood way. Lots of platform tooling exists to help you remove some of the complexity around unexpected failures and how you’d handle them. Ultimately, there’s probably a slightly lower cognitive load when working with highly complex workflows. On the negative side, though, because you have the single orchestrator at the heart of the system, you do have a single point of failure, which can cause some issues. That might partly lead to an issue with scalability or additional latency, because the orchestrator itself is acting as the coordination point, and all requests have to go through it. You might also find there are some problems with responsiveness. Generally, because of the way you’re coupling things together with the APIs and the knowledge of the workflow being encoded into the orchestration system, there is a higher degree of coupling between the orchestrator and services.
Interaction Patterns – Choreography
Let’s move on to choreography, and think about how that might differ a bit from orchestration. The easiest thing to say, and the most obvious thing to say outright, is that there is no single coordinator that sits at the heart of this. The workflows are a bit more ad hoc in their specification. They’re normally implemented using some intermediate infrastructure like a message broker, which is great because it gives us some nice properties around independence and coupling, and allows us to develop and grow and scale our systems in a slightly more independent way. Because of that, as well, it might mean that we’re slightly less sensitive to latency, because things might be able to happen in parallel. There are additional failure modes to consider, and there are other aspects to managing the overall workflow that will become apparent over the next few slides.
Again, this is a pretty naive view. If we look at something that’s a bit more realistic, where there might be several events being emitted by different bounded contexts, or many more bounded contexts involved in the overall workflow, you can see that some problems come in, in the form of understanding the state of things, and also the potential to introduce problems like infinite loops. It’s not beyond the realms of possibility to think that the event coming out of our bottom blue bounded context could trigger the start of a workflow all over again, and you could end up going round and round. Decisions that you have to make when you’re working in this way include whether to use an event or a command. A command would imply that the producer knows about what’s going to happen in the consumer. That’s perhaps not the most ideal scenario, because one of the things we’re trying to get out of this approach, one of the benefits that we get, is having weaker coupling, and a command maybe builds a bit too much knowledge about the downstream systems into the producer. If we do use events, they’re not just plain sailing either. There are quite a few things to consider there, which we’ll talk about. I think on balance, I would generally prefer to use an event over a command, because it allows us to operate independently as a producing system and allows many consumers to hook into our events without having to push that knowledge onto the producer. Some words of warning, though: issues can arise if events are rebroadcast, or if you have multiple different systems producing the same event. Something to keep in mind: an event should have a single source of truth.
When we have got events, Martin Fowler’s article on what it means to be event-driven provides four nice categorizations of the types of event that you might want to consider using. Starting from the top, and the most simple, is event notification, where you’re basically just announcing the fact that something has changed. There’s not much extra information passed along with that notification. If there is a downstream system that’s interested in that change, it probably means it’s going to have to call back to the source of the event to find out what the current state of the data is. If you’re not happy with that, and that’s a degree of coupling and chattiness that you’re not wanting to live with, then you could use the event-carried state transfer approach, which would reduce that chattiness. It does mean that there’s going to be more data being passed within our events. It does allow you to build up your own localized view of an event or an entity. That might mean that the downstream systems are more responsive, because they’re able to reply without having to coordinate with an upstream system if they get a request, which is a nice property to have.
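To show the difference, here are two illustrative payload shapes for the same fact; the field names are made up. The first is a bare notification that forces interested consumers to call back; the second carries enough state for a consumer to build its own local view.

```python
# Two illustrative payload shapes for the same fact.

# Event notification: just the fact and a reference; consumers call back for detail.
order_placed_notification = {
    "type": "OrderPlaced",
    "order_id": "ord-123",
    "occurred_at": "2022-11-20T10:15:00Z",
}

# Event-carried state transfer: enough state for consumers to build a local view
# without calling back to the ordering system.
order_placed_with_state = {
    "type": "OrderPlaced",
    "order_id": "ord-123",
    "occurred_at": "2022-11-20T10:15:00Z",
    "customer_id": "cust-42",
    "lines": [{"sku": "ball-2022", "quantity": 1, "unit_price": "129.99"}],
    "shipping_address": {"country": "GB", "postcode": "LS1 4AP"},
}
```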
Event sourcing takes that a step further, where all the events are recorded in a persistent log. That means that you can probably go back in time and replay them to hopefully get you back to a known good state. That’s assuming that when you processed these events in the first place there were no side effects, like calls off to other systems, that you would have to think about and account for when doing a replay. Another thing that needs to be considered in this approach is how you would handle schema changes and the need for consumers to be able to process old events produced with one format and current events that might have a new, latest schema. The last pattern to talk about here is CQRS. The only thing I’m really going to say about this is it’s something that you should perhaps consider with care and have a note of caution about, because while on the face of it, it sounds like a really nice idea to be able to separate out your patterns for reading and writing, actually, there’s quite a lot of nuance and stuff that can go wrong there that might get you into trouble further down the line. If you are interested in that pattern, do go away and have a read, and find out more for yourself and think about the consequences of some of the design choices.
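A minimal sketch of the replay idea, with made-up event types: state is never stored directly, it is rebuilt by folding the event log through an apply function, which is also why side effects during processing make replays awkward.

```python
# A minimal sketch of rebuilding state by replaying a persisted event log.

events = [
    {"type": "StockReceived", "sku": "ball-2022", "quantity": 100},
    {"type": "StockAllocated", "sku": "ball-2022", "quantity": 3},
    {"type": "StockAllocated", "sku": "ball-2022", "quantity": 5},
]

def apply(state: dict, event: dict) -> dict:
    qty = state.get(event["sku"], 0)
    if event["type"] == "StockReceived":
        state[event["sku"]] = qty + event["quantity"]
    elif event["type"] == "StockAllocated":
        state[event["sku"]] = qty - event["quantity"]
    return state

# Replaying from the start gets us back to a known state, provided applying an
# event has no external side effects that a replay would repeat.
state = {}
for event in events:
    state = apply(state, event)
print(state)  # {'ball-2022': 92}
```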
Once we’ve decided on what type of event we’re going to use, we have to then think about what message pattern we’re going to use, because again, we’ve got quite a lot of stuff to think about when we’re designing. These choices have implications for the performance of our system and the coupling between the components. If we’re going to look at a standard AMQP-style approach, where you’ve got different types of exchanges and queues, we could look at something like this and say, we’ve got a direct exchange and we’re going to have multiple queues bound to different routing keys. Here, I’ve called out a couple of different decisions that you’re going to have to make, because again, it’s not as straightforward as it might first appear. You’ve got to think whether you’re going to have durable or transient exchanges or queues, whether you’re going to have persistent or transient messages. Ultimately, you’re going to have to think a lot about the behaviors you want to have if the broker goes down or if a consumer stops working. One of the things that you might find here is that messages will be delivered to all queues that are bound to the same routing key, so every queue will get all the messages that it’s interested in; but if you bind multiple consumers to a single queue, you end up with what’s known as a worker pool, and the messages on that queue are shared out across those consumers. This is great if you have no real need to worry about ordering. If precise ordering is important, then this can have implications on the outcomes of your system. Again, it’s another consideration you have to take into account when you’re thinking about implementing things.
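For illustration, here is roughly what that direct-exchange setup looks like using the pika client for RabbitMQ; the exchange, queue, and routing key names are made up. The durable and persistent flags are exactly the decisions called out above about what should survive a broker restart.

```python
import pika

# A minimal sketch of a direct exchange with a durable queue, using pika.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Durable exchanges and queues survive a broker restart; transient ones do not.
channel.exchange_declare(exchange="orders", exchange_type="direct", durable=True)
channel.queue_declare(queue="payment-service", durable=True)
channel.queue_bind(queue="payment-service", exchange="orders", routing_key="order.placed")

# Persistent messages (delivery_mode=2) are written to disk on durable queues.
channel.basic_publish(
    exchange="orders",
    routing_key="order.placed",
    body=b'{"order_id": "ord-123"}',
    properties=pika.BasicProperties(delivery_mode=2),
)
connection.close()
```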
Other types of exchanges that you can have include fanout. This is great because it means that you can broadcast all messages to all consumers in a really efficient way, but if you choose to use event-carried state transfer, then it might mean that you’re sending an awful lot of data, because each queue receives a copy of each event. That might have implications on your bandwidth and the cost of bandwidth. Depending on your world, it might make things more expensive, or you might saturate the network and limit the throughput you can achieve. The last type of exchange you can have in this setup is a topic exchange, where your consumers can define queues that bind to patterns. They can receive multiple different types of messages through the same queue, which is great. There’s a lot of great things you can do with this. Again, though, if you have high cardinality in the set of types of messages you’re going to receive, lots of different schemas, in other words, this can have serious implications for the ability of your consumers to keep up with change.
Other approaches that you might have for these systems include something like this, which is a Kafka implementation, and it works slightly differently to the previous ones in that it’s more akin to the event sourcing model, where each record is produced to a topic and stored durably in a partition. There are no per-consumer copies of events going on here; it’s a slightly more efficient way to allow multiple consumers to consume the same data. Because of the way that partitions work, a nice property is that as you scale your consumers, assuming you don’t scale beyond the number of partitions in your topic, you’ll be able to process things really efficiently, because you can assign more resources, so a consumer per partition, but you also retain your ordering within the partition. We don’t have the same issues as we would with the round-robin approach from before. Another nice property of using something like Kafka is that your consumer groups keep a record of where they are independently of each other. You’re not worried about queue backup or anything like that, because ultimately, it’s just the same data being read. It’s an offset that’s being updated by the consumer group itself. A thing to consider here is how you want to handle offset lag, because if your offset lag grows too much, and the retention policy of your topic is set such that it will start to delete messages before they’ve been consumed, then you’ve got big problems.
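Here is a minimal sketch of a consumer in a consumer group, using the kafka-python client with illustrative topic and group names: the group is assigned partitions (up to one consumer per partition), ordering is preserved within each partition, and committing the offset is how the group records where it has got to.

```python
from kafka import KafkaConsumer

def handle(payload: bytes) -> None:
    # Stand-in for real processing of the record.
    print("processing", payload)

# One member of the "inventory-projector" consumer group.
consumer = KafkaConsumer(
    "order-events",
    bootstrap_servers="localhost:9092",
    group_id="inventory-projector",
    enable_auto_commit=False,       # commit only after successful processing
    auto_offset_reset="earliest",
)

for record in consumer:             # ordering is per partition
    handle(record.value)
    consumer.commit()               # advance this group's offset; watch the lag
```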
Choreography Considerations
Some of the considerations that I’ve called out there are summarized in one place here, with a couple of extras to think about as well. Two in particular, on the left-hand side, that I think people need to be aware of are the delivery semantics: are you going to go for at-least-once or at-most-once delivery? I would imagine, probably, at-least-once is the preferable option there, to make sure that you’re actually going to receive a message. Then that means you’ve got to answer the question of what you’re going to do if you receive an event twice. That’s where idempotency comes in. On the right-hand side, the key differences between orchestration and choreography relate more to the error handling, which I’ll touch on, and also how you’re going to handle a workflow-level timeout, or something where you want to see the overall state of a workflow. They become much harder in this world, because there is no central organizing element that’s going to be able to tell you the answers or coordinate a timeout in that way.
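A minimal sketch of an idempotent consumer for at-least-once delivery, assuming each event carries a unique ID: remember which IDs have been applied and skip duplicates. In production that record of seen IDs would live in durable storage, ideally updated in the same transaction as the state change.

```python
# A minimal sketch of idempotent handling for at-least-once delivery.

seen_event_ids: set[str] = set()

def apply_state_change(event: dict) -> None:
    # Stand-in for the real domain update.
    print("applied", event["event_id"])

def handle_event(event: dict) -> None:
    event_id = event["event_id"]
    if event_id in seen_event_ids:
        return                      # duplicate redelivery, safe to ignore
    apply_state_change(event)
    seen_event_ids.add(event_id)

# The same event delivered twice only changes state once.
handle_event({"event_id": "evt-1", "type": "PaymentTaken"})
handle_event({"event_id": "evt-1", "type": "PaymentTaken"})
```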
I mentioned error handling. If we go back to this original view of how things might go wrong, you can see how, if we’ve got a system that’s choreographed, and we’re using events to signal all the different transitions that should happen, if we get to a point where we have a payment that fails, we might already have allocated stock. At that point, we’ve got to work out how we’re going to make a compensatory action. You can see for that slightly simplified view, it’s probably ok; you could probably work out a way to emit an event and make your stock system listen to the event and then know what to do when it hears it. If you’ve got a more complicated or involved workflow, having to do those compensatory actions in an event driven system becomes increasingly difficult.
Choreography, Pros and Cons
In summary, then, choreography: you have lots of good things to think about, because you get a much weaker coupling between your services if you do it right, which leads to a greater ability to scale things and more responsiveness in your services. Because you haven’t got a single point of failure, you have a decent chance of making more robust fault tolerance happen. You could also get quite high throughput. There’s a lot of good things to think about here. On the con side, if we have a particularly complicated workflow, especially if you’ve got complicated error handling requirements, then the complexity of that implementation grows quite rapidly. Another thing that you need to bear in mind is that it might not be possible to version control the end-to-end workflow; there’s no single representation of it, which, in a complicated workflow, is quite a nice property to have.
Choosing Your Patterns
How would we go about choosing between these different approaches? If you’ve read about this or seen any content on the internet, you might have come across this blog post from Yan Cui, where he basically says, as a rule of thumb, orchestration should be used within the bounded context of a microservice, but choreography between bounded contexts. To put that into our diagram from before, that would mean that within inventory and payment, we’re all good to use our orchestration systems, but between the bounded contexts, we’re just going to send events. I think actually, that’s quite problematic, especially when you consider what I was talking about with complicated error handling.
How Formally Do You Need to Specify your Workflow?
If we think about how we might want to handle coordination and understand the overall workflow, that’s where we can start to think about this continuum of how formally or informally specified our workflows might be. At the far left-hand end we’ve got something that’s a very specific orchestration service, like Step Functions, where we can use a DSL or a markup language like YAML or JSON to define the various stages of our workflows. We can be very declarative in explaining it, and we can version control it. In some cases, you can even use visual editors to build these workflows too. Stepping inward, and thinking about the example I gave with the backend-for-frontend approach, you could use a custom orchestrator that’s defined using a general-purpose programming language, but then you’re taking on the responsibility of ensuring you’ve handled all the errors properly yourself. It’s perhaps a little bit less robust overall for more complicated approaches.
Going to the other end, as I mentioned, choreography doesn’t necessarily have the same ability to specify your workflows in one place. There is an approach you can take where you would use something called a front controller, where the first service in a workflow carries the burden of knowledge of the rest of the workflow. It can be used to query all the systems involved, to give you an accurate update of the state. So there is a way that you can semi-formally specify the workflow in some regards. Right at the very right-hand end you would have a more stateless approach. There is no central state management. In this case, the best you can hope for is documentation. Possibly, you’re only going to have the end-to-end understanding of the workflow in a few people’s heads. If you’re unlucky, you might not have anybody in your business who knows how everything works in one place. There’s a bit of a gap in the middle there. Maybe that’s a SaaS opportunity, and we’re in the right place in San Francisco for the brains to think about an opportunity to build something there. If you do do something and you make money off it, please keep me in mind.
CAP Theorem
Distributed systems talks wouldn’t be complete without a reference to the CAP theorem. This is no different. There’s a slightly different lens to this that I want to broach, which is thinking about the purpose of microservices from an organizational perspective. Traditionally, we think about this in terms of the technology and say that, in the face of a network partition, you can either choose to have consistency or availability, but not both. If we flip this around and say, how does this apply to our teams, bearing in mind that now we have very distributed global workforces? If we’ve distributed our services across the world, we need to think about how the CAP theorem might apply to our teams. Instead of partition tolerance, if we replace that with the fact of a globally distributed workforce, we can either optimize for local consistency, so teams are co-located and in a similar time zone, so they can make rapid decisions together. They can be very consistent within themselves. Obviously, they can’t be awake 24 hours, so they’re not going to be highly available. On the other hand, you might say, actually, it’s important the teams are available to answer questions and support other teams around the world, so we’ll stripe the people within them across different time zones. In that case, yes, you’ve improved your availability, but again, people can’t always be online all the time. The ability for the team to make consistent decisions and have local autonomy is reduced. Perhaps it’s a rough way of thinking about it, but it makes sense to me.
Complexity and Coupling
Complexity and coupling are clearly going to be key elements in whether we can build these systems and have successful outcomes, in terms of both how effectively they work and the long-term ability to change them. One thing to say is that a degree of coupling is inevitable: a domain will have necessary coupling and it will have necessary complexity. That’s not to say that you can’t push back and say, actually, this thing you’re asking me to do is going to be almost impossible to build, and is there a simpler way we can do it? Generally, there’s a degree of coupling and complexity that we have to accept, because it’s mandated by the domain requirements of the solution. However, what we can find as architects and tech leads is that how we choose to implement our workflows can have a significant impact on how much worse we make things. We need to consider coupling at quite a deep level of detail.
Michael Nygard offers five different types of coupling, which I think are a really nice way of thinking about it. In particular, I’m really interested in the bottom one, incidental coupling, because especially when you think about event driven systems, this is where things can go bang or go wrong, and you say, how on earth are these things linked? Another way to look at this is to consider whether coupling is essential, as I mentioned, or accidental. In this frame, I think you could say the essential stuff is the necessary coupling for the domain that you’re working with, whereas accidental is probably down to not doing enough design, or the right design, up front. This can manifest itself in slightly more subtle ways. It might not necessarily be that you’re being called out every night for incidents; it might be that you see something like this appearing. This is from when I was working at Sky Bet back in about 2018, and we had a situation where we’d decomposed our systems into microservices, or smaller services, and teams owned them nominally with a you build it, you run it mindset. We were increasingly finding that the amount of work that required another team to do something on our behalf was growing all the time. Our delivery team put together this board, so we could map out the dependencies and have a standup every week to talk about them. I can guarantee you, after a short amount of time, there wasn’t an awful lot of whiteboard left, it was mainly just all Post-it notes.
To really emphasize just how bad this can be if you don’t get your coupling right, let’s have a think about this example from “Making Work Visible,” where Dominica DeGrandis introduces the idea of going for a meal at a restaurant where you’re only going to get seated if everyone arrives on time. If it’s two people going for a meal and you’re both there on time, great, you’re going to get seated. If the other person’s late, or you’re late, or you’re both late, you’re not going to get seated on time, so you’ve got a 1 in 4 chance. If a third person is added to the list of attendees, then your chances decrease quite rapidly. Instead of being 1 in 4, it’s now 1 in 8, or a 12.5% chance of being seated. Adding more people over time quickly drops your chances of being seated on time to effectively zero, because the probability is 1 over 2 to the n. To put this in a slightly more visual format: if we think about our coupling and the number of dependencies in the system, that model shows that if we have 6 dependencies, we’ve only got a 1.6% chance of success. If we have 10, it’s less than 0.1%. Clearly, dependencies are very problematic. Coupling can be very problematic. We have to be very deliberate in how we design things.
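The arithmetic behind that, if you want to check it: with n dependencies that are each independently on time half of the time, the chance of everything lining up is 1 over 2 to the n.

```python
# Probability that all n independent dependencies are "on time" at once,
# when each is on time with probability 1/2.
for n in [2, 3, 6, 10]:
    print(n, 1 / 2**n)
# 2 -> 0.25 (1 in 4), 3 -> 0.125 (1 in 8), 6 -> ~0.016 (1.6%), 10 -> ~0.001 (under 0.1%)
```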
Another example, from a company at a slightly different scale to where we were at Sky Bet, is this from “Working Backwards” by Colin Bryar and Bill Carr. They talk about the NPI approach, where you would have to submit your prioritization request to say, I need another team to work on this thing, because it’s important for the thing that I’m trying to deliver. In the worst case, what happened was you would submit your requests and nothing would be prioritized for you. Instead, you’d be told to stop working on your projects that you thought were important, because you were being prioritized to work on somebody else’s stuff. You were literally getting nothing for something. With all that being said, I’m going to add a sixth thing onto our list of types of coupling that we have to bear in mind when we’re deciding on these patterns. That is, what’s the level of organizational coupling that we’re going to be putting into our system? That’s where progress can only be achieved together. The last thing to say about coupling is that often when we think about microservices, we talk about decomposition and taking our monolith and breaking it up. Again, if we’re not deliberate in our design, and we don’t put enough thinking into it, we can get into a position where we’re actually adding coupling rather than removing it, so decomposition does not necessarily equal decoupling.
Complexity in our systems is something I’ve touched on, and I mentioned that Manny Lehman had laws that touched on thinking about cognitive load. I think this model helps to show a way of thinking about this, and how we might want to be concerned with addressing cognitive load when we’re thinking about our patterns. This is an example from, I think, Civilization 4. It’s typical of most strategy games, where you have a map that you basically can’t see anything in apart from a bubble that’s your immediate area, and you go off exploring to find what’s there. The dark bits are the bits you haven’t yet seen. The slightly faded bits are areas that you’ve visited, but you no longer have vision of. You have an idea of what’s there, but realistically anything could be happening there. You don’t actually have visibility. You just have a false hope that you do know what’s going on. I think this maps nicely to thinking about an architecture. If we consider the whole map to be our system map, or however you would describe your system diagram, the size of a team’s vision bubble is restricted by the complexity of the system. Rather than trying to make the vision bubble bigger, what we ultimately need to aim to do is reduce the size of the map to match the size of their vision bubble. Whether you choose orchestration or choreography, event driven versus a synchronous, more standardized API-driven approach with a coordinator, that will have an impact on just how much stuff they need to think about at all times.
As mentioned, I think a better way of framing the choice, and to boil it down to a nice snappy view, is to think: if I have a particularly complicated workflow, something with notably complex error scenarios, that’s probably a good case to say, I’m going to put some orchestration pattern in place. Whereas if I need something highly scalable and responsive, and there’s generally not too much in the way of error handling needed, and no particularly high-complexity error handling cases in there, choreography is probably a good fit.
Making It Work
We’ve chosen our patterns, so what are some of the things that we need to consider when we’re actually going to go and implement them? If I explode out my example from before, with the backend for frontend, here you can see a nice example of how you can probably use both effectively within your overall architecture, and you’re going to need to use both. Where we had the orchestration in the backend for frontend, you’ve also got the choreography between the trading system and the data processing pipeline that’s going to build up a view of the world, which is ultimately going to provide our catalog of events for our systems. A key thing to call out here, though, is that the intermediate infrastructure I mentioned, which isn’t shown here, was a set of different Kafka topics. There are things here that aren’t obviously owned, that don’t really belong to one bounded context or another; you now have some infrastructure that sits in the interchange context. There’s a lot to consider in this area. How a schema is going to be evolved. How we’re going to manage ACLs. What do we need to do about certificates if we want to implement mutual TLS, that kind of thing?
If we take this diagram and think about it in a bit more detail, we want to consider first, what’s inside the boxes? How can we make them more effective when we’re building these patterns? I’m going to reframe something that was published as part of a series of blog posts by Confluent, where they were talking about four pillars of event streaming. These four pillars map nicely to thinking about these patterns and how we can make them successful. The systems themselves, obviously need to have some business value. We also need to ensure they’ve got appropriate instrumentation, so we can tell they’re working. The other two things I think are often overlooked, especially when we’re designing event driven systems. That’s probably because they’re a little bit harder to implement. These are the things that will help us to stop processing if we need to, to make it so that we’re not going to compound any problems once they’re detected. We can inspect the state of systems and understand what’s going on before maybe implementing a fix or providing some compensatory actions, and getting us back into a known good state. That operational plane on the far-right hand end, this is the thing where I’ve spoken to ops managers in the past and told them that we were moving to using Kafka for our state management. They started to sweat a little bit because they were like, how on earth am I going to go about rectifying any data inconsistencies or broken data that I would have ordinarily just been able to manage through logging into a database server and correcting it with SQL commands. Lots to think about within the boxes.
Perhaps more importantly, and something that I think doesn’t get enough attention because of the implied simplicity of diagrams, is what’s between the boxes. If we think about what’s in the line that we often see between these services, there are quite a lot of things that we have to consider. These are just the technical aspects. If we add in all of the social and organizational aspects, you can see that that small thin line is holding an awful lot of weight, and it’s enough to cause us some severe overload. When we’re doing our designs, when we’re implementing our patterns, we have to think how we’re going to handle all this stuff, and everything else that won’t fit on this slide. In particular, something called out here that I think is really important, especially, again, with our event driven approaches, is making sure we’ve taken the time to be explicit in how we’re handling our schemas. Because the worst thing that can happen is that you get started quite quickly with event driven architectures, and before you know it, you’ve got yourself into a situation where you’ve got many different types of events flowing around and lots of ad hoc schema definitions. When you try and change something, you notice there’s a ripple effect of failure across your systems.
To counteract that, you can be deliberate in how you’re going to manage your compatibility. This guideline, again from Confluent, from the Avro schema compatibility guide for their Schema Registry, is a good way of framing the thinking you need to have. Rather than just saying I’m going to be backwards or forwards compatible, we also need to consider whether we’re going to have transitive compatibility. That means when you publish an update, you’re not just checking against the previous version of a schema, you’re checking against all of the previous versions. That will mean that if you do ever have to replay data, your consumers are going to be able to process it, because you’ve guaranteed that all versions of your schemas are going to be compatible.
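As an example of what a compatible change can look like, here is a sketch using the fastavro library (my choice for illustration, any Avro library will do): version 2 adds a field with a default, so a consumer holding the new schema can still read events written with the old one. Transitive compatibility then just insists that this property holds against every previous version, not only the latest.

```python
import io
import fastavro

# A backward-compatible Avro change: v2 adds a field with a default,
# so data written with v1 can still be read by a v2 consumer.
schema_v1 = fastavro.parse_schema({
    "type": "record", "name": "OrderPlaced",
    "fields": [{"name": "order_id", "type": "string"}],
})
schema_v2 = fastavro.parse_schema({
    "type": "record", "name": "OrderPlaced",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "channel", "type": "string", "default": "web"},  # new, with default
    ],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema_v1, {"order_id": "ord-123"})  # old producer
buf.seek(0)
decoded = fastavro.schemaless_reader(buf, schema_v1, schema_v2)      # new consumer
print(decoded)  # {'order_id': 'ord-123', 'channel': 'web'}
```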
Equally, we have to think about what format we’re going to send our data in, because again, depending on our choices for things like events, whether that’s event-carried state transfer, or notifications, or whatever, these can have significant impacts on the operational capabilities of our system. If we have something that is great for human readability but doesn’t compress particularly well, we might find that we have issues if we’re sending lots of messages with that data in, or if there are lots of updates that try and read something back from the system. We found this out to our horror on a high traffic day, Boxing Day, I think it was, in 2014. A document was used in our system as part of a notification pattern: a notification or an event would come through, this document would get pulled out of Mongo, and it would be used to determine what updates to make. It grew so large that we actually saturated our network, and we were unable to handle any requests from customers. Other things to think about: if we do want to reduce the size, then serialization is clearly important, and thinking about the balance and tradeoff between compression and efficiency of serialization is going to be important. There’s lots of documentation. You can do experiments, such as the one that Uber have done here, to identify what’s most appropriate for your scenario. The other thing to think about is whether you want to have an informal schema, or whether you want to have something that’s a bit more firm, something that’s going to be documented and version controlled, like protobuf or Avro.
Instrumentation
One of the pillars of the event streaming systems and overall systems is instrumentation. I would firmly encourage people to ensure they put distributed tracing at the heart of their systems, because this is going to make working with these patterns and these workflows significantly more straightforward. Especially if you want to try and find things in logs, when the logs are coming from lots of different systems and telling you about all sorts of different things that are happening at once. Having a trace ID and correlation in there is such an effective way of helping you to diagnose problems or understand the state of a system. It’s probably the only realistic way you’ve got of understanding how things are working in a truly event driven distributed system.
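Here is a minimal sketch of what that correlation can look like at the code level, with made-up header and field names: reuse the inbound ID if one is present, stamp it on every log line, and copy it onto anything you publish downstream. In practice you would lean on a standard like W3C Trace Context and an OpenTelemetry-style library rather than rolling your own.

```python
import logging
import uuid

# A minimal sketch of propagating a correlation ID so that logs from different
# services in the same workflow can be tied together.
logging.basicConfig(format="%(levelname)s %(correlation_id)s %(message)s", level=logging.INFO)
log = logging.getLogger("payments")

def handle_message(headers: dict, payload: dict) -> None:
    # Reuse the inbound ID if present, otherwise start a new trace.
    correlation_id = headers.get("correlation-id") or str(uuid.uuid4())
    extra = {"correlation_id": correlation_id}
    log.info("payment requested for %s", payload["order_id"], extra=extra)
    # ...and copy the same ID onto any messages this service publishes downstream.

handle_message({"correlation-id": "c0ffee-123"}, {"order_id": "ord-123"})
```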
There’s lots of other tools that you can use. One thing I would call out is that in the more synchronous, RESTful HTTP API space, I think Swagger, or OpenAPI as it’s now known, is a fairly well-understood piece of technology that helps people with managing their schemas and testing and all that stuff. There is a similar approach coming about from AsyncAPI to help with event driven systems. I would encourage people to check that out. Equally, while we’re thinking about all these patterns, first and foremost is thinking about how our teams are going to work. I definitely encourage people to go away and think about how Team Topologies might apply to help them with the design of their systems and who’s going to own what.
All That Was Decomposed, Must Be Recomposed Again
The last thing to really touch on here before finishing up is this idea that when we’ve decomposed our systems into microservices, ultimately, at the end of the day, we’re going to have to recompose them again. That, again, is where we might find a use for our orchestration pattern. From this reference architecture from AWS, you can see we’ve got this example here of AppSync, which is a GraphQL layer that’s orchestrating requests to the rest of the system. I’ve sneakily included this here, because it’s not a coincidence that I’ve used a reference architecture from AWS. One of the reasons that I think the bits between boxes, the lines, are difficult is that they’re often striped across multiple different technology teams, because teams are aligned at the technology level, not at the domain level. Cloud gives us an opportunity to redress that and to think about it in a different way, and to give whole ownership to our teams. That’s because infrastructure as code is the magic glue that helps us bind the infrastructure together for our teams and manage that complexity between the boxes.
Summary
I think that the title was a bit cheeky. Obviously, I don’t want you to have a monolith, but we need to think about things working holistically together. A system is the sum of its component parts, so we need to make sure that they all work together well and give us the ability to change over time. Whether we have one system or many that work together, the same rules apply: we need to be deliberate in our design. With that in mind, I want to encourage you, please do design for evolution, and consider change, and the reasons for it, upfront. Be thorough and deliberate when you’re choosing the patterns for implementing your workflows, because how you implement things can be the way that you make things harder for yourselves in the future. The one thing that I hope everybody will go away and remember is that you have to remember the bits between the boxes, and that they all need people too.