Article originally posted on InfoQ.
Emotions often come to the fore when there is an incident; psychological safety in blameless post-mortems is essential for the learning process to happen. The post-mortem session must be fairly moderated, preferably by an outsider, giving everyone a turn to speak without criticism. Don’t start the analysis of the incident before there is a clear and common understanding of what actually happened.
Matt Saunders, head of DevOps at Adaptavist, spoke about psychological safety in blameless post-mortems at the Atlassian Summit Europe 2018. InfoQ is covering this event with Q&As, summaries, and articles.
InfoQ spoke with Saunders about when to do blameless post-mortems, how they differ from agile retrospectives, dealing with emotions, what can be done to make everyone feel safe in the post-mortem, and how to conduct effective blameless post-mortems.
InfoQ: When do you suggest doing blameless post-mortems?
Matt Saunders: The top answer to this is that any time there is an incident that causes disruption for customers, there should be a post-mortem. In addition, you should go to lengths to ensure that they are blameless 100% of the time. It’s easy for an outage analysis to become a witch-hunt, but this rarely gets to the root of a problem. If someone made a mistake, then it is short-sighted not to also analyse why that person was put into a position where a mistake was possible. So there should be a blameless post-mortem absolutely every time there’s an incident, or even when something unexpected happens operationally.
InfoQ: What are the differences between and similarities of agile retrospectives and blameless post-mortems?
Saunders: Some of the techniques here are very similar. A key tenet of agile retrospectives is to analyse what happened from the perspective of the team, and the same is true with a post-mortem. However, a post-mortem is generally conducted in difficult circumstances – perhaps your company has lost customers due to an outage, people are mad and looking for answers. There can of course sometimes be similar pressures in agile retrospectives, but the likelihood of a post-mortem being conducted in a stressful and potentially aggressive manner is much higher.
InfoQ: In your talk you will dive into the emotional impact of dealing with an incident and how it affects engineers. Can you elaborate?
Saunders: Engineers always want to do the right thing. It’s not just a matter of professional pride; emotions often come to the fore especially when there is an incident, as people can struggle to stay calm. Everyone wants to fix the outage as soon as possible, but this can manifest itself in heightened emotions and raised voices. Decisions made long ago may be revisited in an emotional fashion, and this often isn’t helpful. Dr Richard Cook explains how computer systems can be highly complicated in his frequently cited paper “How Complex Systems Fail.” Hindsight often biases post-incident analysis, and this can lead people to feel stupid, defensive, or even that their job is under threat. It is essential to enter the post-mortem with issues such as this in mind.
InfoQ: What are some of the main things that can go wrong in blameless post-mortems?
Saunders: Prejudging the outcome is a frequent problem. The aforementioned hindsight can lead post-mortems to conclude that the problems were obvious, when the reality of how these problems came to occur can be highly complicated. Emotions running over and people getting personal is another frequent problem, and the influence of senior people must also be carefully judged. Perhaps an employee’s manager is in the room, and the employee acted on the manager’s advice, which turned out to be badly judged. This puts the employee in a dilemma where they may not feel they can speak freely.
In addition, the organisational constraints put on the team may lead to mistakes. Perhaps a deployment went wrong because it was performed by someone working in a central team who didn’t understand some key differences to other systems. This probably isn’t something under the control of the team but is still a contributory factor to the incident that needs to be accounted for.
InfoQ: What can be done to make everyone involved feel safe throughout the process?
Saunders: The key point is that we’re doing a post-mortem on the incident, not on the person who made a mistake (if indeed there was a single mistake that caused the incident). The session should thus be run with this front and centre. It’s key to clarify right at the start that this is a learning process for the team or organisation, and not a blame game.
It’s well acknowledged today that blaming individuals is not a good outcome, as this is likely to lead to more fear in the future, people being scared to operate on systems, and a general slowdown in operational fluidity. Instead, the post-mortem should be based around learning how to make the team’s processes better, so that the system helps its operators avoid mistakes; that should be the key takeaway.
If you can set the scene in this way, and also convince senior stakeholders that this is what the outcome will look like, then people will feel safe, willing to contribute, and help design better systems for the future.
InfoQ: What suggestions do you have for conducting effective blameless post-mortems?
Saunders: Ensure that the session is fairly moderated, preferably by an outsider; that everyone is given a turn to speak without criticism; and that the analysis of the incident is only started once there is a clear and common understanding of what actually happened. A good formula for conducting blameless post-mortems is to separate the session into three sections: agreeing the timeline, agreeing what went wrong, and, most crucially, agreeing what work needs to take place to prevent the problem occurring again.