Ben Linders
To understand how systems are being used, we can collect metrics and identify trends over time. The data and insights gained can be used to improve system quality by improving software design or testing patterns.
Craig Risi spoke about using data to improve system quality at Agile Testing Days 2022.
It’s impossible to focus on every single defect and root cause and try to find suitable mitigations for them, Risi said. The time involved often makes that infeasible. What can help, Risi mentioned, is categorizing root causes and issues, and then tracking the categories over time to identify trends and patterns that a team or organization can focus on.
This is especially useful for companies with many development teams, as Risi explained:
Often there are similarities in issues and root causes across teams that can possibly be solved by a change in the way a company works. Most companies aren’t even aware of these issues, as teams tend to focus on these in isolation, but in tracking this data at a corporate level, bigger patterns can often be picked up that will lead to bigger culture or process changes that can reduce defects or issues across the board.
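As a rough illustration of what tracking categorized root causes across teams could look like, here is a minimal Python sketch. The defect records, team names, category labels, and function names are all hypothetical and do not come from Risi's talk, and "rising in every consecutive month" is just one simple, assumed way to define a trend.

```python
from collections import Counter, defaultdict

# Hypothetical defect log: (month, team, root-cause category).
# Teams and categories are made up for illustration.
DEFECTS = [
    ("2022-09", "payments", "missing requirement"),
    ("2022-09", "search", "coding error"),
    ("2022-09", "checkout", "coding error"),
    ("2022-09", "search", "environment config"),
    ("2022-10", "payments", "missing requirement"),
    ("2022-10", "search", "missing requirement"),
    ("2022-10", "checkout", "coding error"),
    ("2022-10", "payments", "environment config"),
    ("2022-11", "payments", "missing requirement"),
    ("2022-11", "search", "missing requirement"),
    ("2022-11", "checkout", "missing requirement"),
    ("2022-11", "search", "coding error"),
    ("2022-11", "payments", "coding error"),
    ("2022-11", "checkout", "environment config"),
]
MONTHS = ["2022-09", "2022-10", "2022-11"]


def counts_by_month(records):
    """Return {category: Counter({month: count})}, aggregated across all teams."""
    table = defaultdict(Counter)
    for month, _team, category in records:
        table[category][month] += 1
    return table


def rising_categories(records, months):
    """Return categories whose count strictly increased in every consecutive month."""
    rising = []
    for category, per_month in sorted(counts_by_month(records).items()):
        series = [per_month[m] for m in months]  # a missing month counts as 0
        if all(earlier < later for earlier, later in zip(series, series[1:])):
            rising.append(category)
    return rising


def teams_affected(records, category):
    """Return the sorted list of teams that logged at least one defect in a category."""
    return sorted({team for _month, team, cat in records if cat == category})
```

With this sample data, `rising_categories(DEFECTS, MONTHS)` singles out the "missing requirement" category, and `teams_affected(DEFECTS, "missing requirement")` shows that it touches all three teams. That is the kind of cross-team pattern that stays hidden when each team looks only at its own defects.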
To understand how systems are being used, you need to get all the data. This has proven to be one of the biggest challenges, Risi mentioned:
You not only need to have the right tooling in place, but also need to set aside the time to implement and build the right monitoring systems. Beyond simply collecting the data, you also need to find an appropriate way to visualize it so that it makes sense to the different stakeholders. This might mean having different dashboards for different stakeholders to show them the information that is important to them. All of this takes deliberate effort to get right.
Risi said that it’s important to get product owners, developers, and testers to agree on a root cause of an issue, so that all paths of development, testing, requirements, and processes can be explored to help identify these root causes.
InfoQ interviewed Craig Risi about using data to improve system quality.
InfoQ: What techniques can be used to find the root causes of problems?
Craig Risi: This can be a challenge, because how an issue looks when you first discover it and what is really causing it are often two different things.
Typically when we address issues in a team, we ask them a few different questions:
- How does the issue appear?
- What is causing the issue to occur?
- Was there anything we could’ve done to catch this issue earlier?
- What can be fixed to prevent this issue from occurring again?
- What can be changed to prevent issues like this from occurring again?
They might seem like trivial questions, but they do help teams realize there is a separation of how a defect appears versus what is actually wrong. And, by focusing on what needs to be fixed rather than where the fault lies, it helps to prevent any blame shifting and get teams looking into what is really wrong with the software and fixing it properly.
The last question is especially important because it doesn’t just ask teams to address the immediate issue, but also think about how they work and design software that can prevent a similar issue from arising in the future. This last question will often lead to a team finding the real root cause of a problem and not just tagging it as a simple coding error.
InfoQ: What tools can be used to collect and analyze data?
Risi: I have found tools like Qlik, ThoughtSpot, Sisense, Tableau, Grafana, and New Relic to be useful in that they can help with identifying usage trends and system performance, and with visualizing them correctly. All the big tech players like Amazon, Microsoft, Google, and Oracle also have tools that can assist with this in their respective cloud environments.
Once you have all the monitoring and tooling in place, there is a significant culture change that needs to happen to actually make use of the data and build in the right alerting. This can often only be done by helping teams see the value in the reporting and showing how it can lead to solutions.