Article originally posted on InfoQ.
At the recent Event-Driven Microservices Conference in Amsterdam, Russ Miles claimed that the biggest challenge for an architect is being ignored. You have great ideas like event-driven microservices, but too often the reaction is that it sounds good but is overly complicated for the needs at hand. Miles commonly gets this reaction when he suggests that companies consider asynchronous event-driven systems as a way of introducing scaling, redundancy and fault tolerance. The words often make sense to the company, but just as often they get ignored.
The main goal for Miles in his work is having reliable systems. Reliability for him is a measure of what the customer wants: a system that is feature-rich and always running. This means we have two opposing forces that don’t coexist easily, especially in complicated systems: continuous innovation and change versus a system that is always working.
According to Miles, the hardest thing for an architect is to get everyone to understand that you are building resilient systems. He emphasizes that he is not just talking about technology; he is referring to the whole system, which includes the people, the practices and the processes that surround it. Considering all this, he regards it as a minor miracle that systems in production ever work.
Miles refers to John Allspaw for defining resilience. If you build systems with a lot of redundancy, replication, distribution and so on, you may be building robust systems, but for Allspaw a system becomes resilient only when you also involve people. In the same way, chaos engineering goes beyond the tools; it’s about how people think about and approach a system.
For Miles, chaos engineering is a technique for finding failures before they happen, but also a mindset:
- Never let an outage go to waste. Learn from failures when they happen.
- Have a pre-mortem attitude. You learn from outages, but it’s better to explore weaknesses before an outage occurs.
- It’s collaborative; you don’t run experiments against others’ systems. Everyone should know beforehand and agree on what you want to learn.
- Start tiny with one small experiment. If the system survives, then you can choose to increase the scope.
- Start working manually using your brain. After that you can start automating using the tools available.
The single most important thing about chaos engineering for Miles is that you must be part of the team working on the system. You cannot be someone who hurts a system and then waits for others to fix the problem. You must share in the effects of what you have done and work with everyone else to fix it. Miles has seen companies that have a group of people who hurt systems for a living, but in his experience this doesn’t work.
Miles points out that in his mind, chaos engineering is simple. There are only two key practices to learn, and he emphasizes that there is no need for any certification program:
- Game days, where you gather all the teams and change some condition in production that you all agreed can happen, and see how you deal with it. He notes that game days can be expensive since they take a lot of the team’s time.
- Automated chaos experiments, where you automate the experiments so that you can continuously explore and look for weaknesses.
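As a rough illustration of the second practice, an automated experiment can be reduced to four steps: check that the system is healthy, introduce the agreed condition, check again, and always restore the original condition. The Python sketch below is a minimal version under stated assumptions. The `Experiment` class, its field names and the toy two-replica system are all hypothetical and are not taken from Miles’ talk or from any particular chaos tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """One small chaos experiment (illustrative names, not a real tool's API)."""
    title: str
    steady_state: Callable[[], bool]  # True while the system looks healthy
    inject: Callable[[], None]        # introduces the agreed condition
    rollback: Callable[[], None]      # restores the original condition
    journal: List[str] = field(default_factory=list)

    def run(self) -> str:
        # If the system is already unhealthy, do not experiment on it.
        if not self.steady_state():
            self.journal.append("aborted: system unhealthy before injection")
            return "aborted"
        try:
            self.inject()
            self.journal.append("condition injected")
            held = self.steady_state()
        finally:
            # Always undo the injected condition, even if a check raised.
            self.rollback()
            self.journal.append("rolled back")
        outcome = "held" if held else "weakness found"
        self.journal.append(outcome)
        return outcome


# Toy system: a service with two replicas (hypothetical).
replicas = {"a": True, "b": True}

experiment = Experiment(
    title="service survives the loss of one replica",
    steady_state=lambda: any(replicas.values()),
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
)
outcome = experiment.run()
print(outcome, experiment.journal)
```

The early abort reflects the rule that known weaknesses should be fixed before looking for new ones, and the journal gives the team a shared record to review afterwards.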
If you are ready to start working with chaos engineering at your company, Miles’ first piece of advice is not to use the term at all. Don’t talk about breaking things; instead, talk about incidents that have happened and what you can learn from them to improve. He notes that you are in a learning loop, working toward a system that gradually becomes more resilient.
In a summary of his points, Miles noted some rules from the “Chaos Club” that you must follow:
- Don’t talk about chaos. The concepts are becoming more mainstream, but the term may still set people off. Start using it when people are more comfortable.
- Learn without breaking things. You are trying to improve across the whole socio-technical system by finding and dealing with the weaknesses before the users do.
- Chaos should not be a surprise.
- If you know the system will break, don’t do the experiment. Try to fix the weaknesses you already know about before you try to find new weaknesses.
When working with event-driven microservices-based systems, one of the hardest things is to get developers to understand how to be good citizens in production. This includes exposing the right endpoints to declare your health and the right touchpoints to say whether you are OK or not. Good logging is an important aspect, and one way to improve it is to have developers read their own logs, for example during a game day, when they must work out what the system did from those logs alone.
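One way a service could declare its own health is a small HTTP endpoint that aggregates a few dependency checks. The sketch below uses only Python’s standard library. The `/health` path, the check names in `CHECKS`, the port and the response shape are assumptions made for illustration, not something prescribed in the talk.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical dependency checks; a real service would probe its own
# broker, database and so on instead of returning True.
CHECKS: Dict[str, Callable[[], bool]] = {
    "event_broker": lambda: True,
    "database": lambda: True,
}


def health_report(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check and return an HTTP status code plus a JSON-ready body."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed check.
            results[name] = False
    healthy = all(results.values())
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        code, body = health_report(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the endpoint; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint locally. Returning 503 when any check fails lets a load balancer or an experiment’s health probe tell "not OK" apart from "OK" without parsing the body.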
When doing chaos engineering, one advantage of event-sourced systems is the observability they bring. For Miles, observability means the ability to debug the system in production without changing it. If you are doing some form of chaos experiment, the first thing you want to do is debug the system to figure out what went wrong, and with an event-sourced system you have a system of record: you know exactly what happened and when.
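To make the "system of record" idea concrete, the sketch below models a tiny append-only event log and shows two questions it can answer after an experiment: which events occurred during the experiment window, and what state the system was in at a given moment. The event kinds, the `order_id` field and the timestamps are invented for the example, and a real event store would be durable rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    at: datetime   # when the event was recorded
    kind: str      # what happened, e.g. "payment_failed"
    data: dict     # event payload (here only an order id)


class EventLog:
    """Append-only log; events are never updated or deleted."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def between(self, start: datetime, end: datetime) -> List[Event]:
        """Events recorded inside a time window, in recorded order."""
        return [e for e in self._events if start <= e.at <= end]

    def state_at(self, moment: datetime) -> Dict[str, str]:
        """Rebuild each order's latest status by replaying events up to `moment`."""
        state: Dict[str, str] = {}
        for e in self._events:
            if e.at <= moment:
                state[e.data["order_id"]] = e.kind
        return state


# Hypothetical timeline around a chaos experiment.
t0 = datetime(2024, 1, 1, 12, 0, 0)
log = EventLog()
log.append(Event(t0, "order_placed", {"order_id": "o-1"}))
log.append(Event(t0 + timedelta(seconds=5), "payment_failed", {"order_id": "o-1"}))
log.append(Event(t0 + timedelta(seconds=9), "order_cancelled", {"order_id": "o-1"}))

# What happened while the experiment ran (seconds 4 to 10)?
window = log.between(t0 + timedelta(seconds=4), t0 + timedelta(seconds=10))
print([e.kind for e in window])

# What did the system believe six seconds in?
print(log.state_at(t0 + timedelta(seconds=6)))
```

Because the log is only read, never modified, this kind of investigation fits Miles’ definition of observability: the system is debugged in production without being changed.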
Miles concluded by stating that for the first time in his career, there is a best practice. For the complex and maybe chaotic systems we build today, chaos engineering is a technique for which he wants to say “just do it”. Do a small amount of it, manually, as a game day or whatever works for you. If you care about the reliability or resilience of your systems, he believes it’s a tool for you.