In a distributed application it is difficult to use debugging techniques common in developing non-distributed applications. Bringing production observability to your testing environment helps to find bugs, argued Francisco Gortázar at the European Testing Conference 2019. He presented ElasTest, a tool for developers to test and validate complex distributed systems using observability.
Observability helps to understand why the application is behaving the way it behaves, said Gortázar. The only way to understand the behaviour of a distributed application is to look at the data that it generates (its external outputs) to figure out the internal states. The purpose of observability is to bring useful information from the observed application to testers and developers, argued Gortázar.
Gortázar suggested building high-level abstractions that are close to the business in order to distinguish interesting events from those that probably won’t have a direct impact on users. In production we usually start with raw metrics, but soon enough we develop our own metrics and set alarms on them, he said. Collecting everything all the time requires managing a huge amount of information, so Gortázar proposed using dynamic sampling to avoid collecting uninteresting events and focus on what matters. When things go wrong, start collecting more information to better diagnose the problem and hopefully find a solution.
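The dynamic sampling idea can be illustrated with a minimal Python sketch. It is not taken from the talk or from ElasTest; the class name `DynamicSampler`, its parameters and the error-ratio rule are assumptions chosen for illustration. Under normal conditions only a fraction of events is kept, and once the recent error ratio crosses a threshold everything is collected.

```python
import random


class DynamicSampler:
    """Hypothetical sampler: keeps a fraction of events while the system looks
    healthy and keeps everything once the recent error ratio is too high."""

    def __init__(self, base_rate=0.1, error_threshold=0.05, window=100, rng=None):
        self.base_rate = base_rate              # fraction kept when healthy
        self.error_threshold = error_threshold  # error ratio that triggers full collection
        self.window = window                    # number of recent events to remember
        self.recent = []                        # True marks an error event
        self.rng = rng or random.Random()

    def current_rate(self):
        """Return the sampling rate implied by the recent error ratio."""
        if not self.recent:
            return self.base_rate
        error_ratio = sum(self.recent) / len(self.recent)
        return 1.0 if error_ratio > self.error_threshold else self.base_rate

    def should_collect(self, is_error):
        """Record the event and decide whether to store it."""
        self.recent.append(is_error)
        if len(self.recent) > self.window:
            self.recent.pop(0)
        # Errors are always kept; other events are kept at the current rate.
        return is_error or self.rng.random() < self.current_rate()
```

A collector would call `should_collect` for each incoming event and forward only those for which it returns `True`. A real policy would more likely be driven by the business-level alarms mentioned above than by a plain error ratio.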
ElasTest is an open source tool for providing observability in testing environments, which aims to help developers test and validate complex distributed systems. It provides similar capabilities to the ELK stack, the main difference being the awareness of the testing process. ElasTest understands what a test is, and is able to collect relevant information when necessary and present it in a way that makes it pretty straightforward to look at separate bugs and tests.
The ElasTest project is publicly funded by the European Union. The platform is developed by a consortium of European academic institutions, research centers, large industrial companies and SMEs, using the common tools and services of the open source software community.
InfoQ: Why is it so difficult to collect information about what’s happening in distributed systems and cloud-native applications?
Francisco Gortázar: Basically, two main issues arise: the first one is about collecting information from several services that run on different machines. Specific software needs to be used to send all the information to a central database for online or offline inspection. The second one is the issue about the amount of information that is generated. We’re running into big data problems here, so good policies for removing old data are needed.
In addition to this, testing environments have some specific characteristics that make them different from production environments. The same tools that are used in production may require so much effort to deploy and configure that companies sometimes do not want to invest that effort in the testing environment. Operations teams are usually so busy with the production environment that they cannot devote time to other environments.
InfoQ: What’s your advice for setting up observability?
Gortázar: There are many different tools that could help to set an appropriate level of observability for our production systems, but it requires a big effort to put all these pieces together in order to have the necessary visualization abstractions in place.
Ideally, we should bring the same tools that are used in production to our testing environment. However, usually the teams that are in charge of those observability systems for production don’t have time to deploy and maintain a similar toolset for other environments. When this is the case, at least developers and testers can still use some straightforward tools that can bring some level of observability to their testing environments.
InfoQ: What tools exist for observability in end-to-end testing, and how can we use them?
Gortázar: The so-called ELK stack (where ELK stands for Elasticsearch, Logstash and Kibana, a common toolset from Elastic for managing raw logs and metrics), along with the Beats agents developed by the same company, makes it really easy to collect logs and metrics from your systems. It is not difficult to deploy those agents alongside the system we want to test and have them send information to an Elasticsearch database. For those tests that failed we can use Kibana to further investigate the issue. Kibana is able to graph the collected metrics and can also be used for querying our logs. These two features can help to better localize our bugs.
However, these tools have some limitations when used in a testing environment. When we take a look at Continuous Integration systems, things are very different. Usually, one collects information only for the duration of the tests that are run as part of this integration process. Moreover, this information is only useful when there are test failures, i.e., parts of the application not working as expected. It is only then that this information is usually inspected to find the root cause of the failure. However, the information is presented in ways that make it really difficult to understand the behaviour of the system. For instance, dashboards don’t usually understand the boundaries of tests (when a test starts and ends), and cannot filter out information from tests that are not relevant, unless we know at which specific point in time they started and ended. So isolating the interesting information is one of the problems with standard observability tools.
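To make the point about test boundaries concrete, here is a minimal Python sketch of isolating the events that belong to a single test. It does not show how ElasTest or Kibana work internally; the `RunWindow` type, the `events_for_test` helper and the event fields (`timestamp`, `service`, `message`) are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass
class RunWindow:
    """Hypothetical record of when a single test started and ended."""
    name: str
    start: float  # seconds since the epoch
    end: float


def events_for_test(events, window):
    """Keep only the events whose timestamp falls inside the test's window."""
    return [e for e in events if window.start <= e["timestamp"] <= window.end]


# Events from several services, as they might arrive in a central store.
events = [
    {"timestamp": 100.0, "service": "gateway", "message": "startup complete"},
    {"timestamp": 205.5, "service": "orders", "message": "order created"},
    {"timestamp": 207.1, "service": "billing", "message": "payment rejected"},
    {"timestamp": 320.0, "service": "orders", "message": "order created"},
]

checkout_test = RunWindow(name="test_checkout", start=200.0, end=210.0)
for event in events_for_test(events, checkout_test):
    print(event["service"], event["message"])
```

The filter itself is simple. The harder part is that a generic dashboard does not know the values of `start` and `end` unless the test runner reports them, which is the gap a test-aware tool is meant to close.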
InfoQ: What can be done to find the root causes of bugs?
Gortázar: In a distributed system, finding the root cause of a bug is definitely not easy. When I face this problem, I’d like to be able to compare a successful execution with a failed one, side by side. This is difficult due to the nature of logs, which can vary between executions. Such a tool should be able to identify the common patterns and discard the irrelevant pieces of information that might vary between two different executions.
This comparison feature should be available as well for any other kind of metric. If I can compare the memory consumption or latency of requests of two consecutive executions I might be able to understand why the second failed.
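A rough Python sketch of such a comparison for logs is shown below. It is an illustration under stated assumptions rather than the approach ElasTest uses: the helper names `normalize` and `diff_runs` are hypothetical, and the patterns only cover timestamps, UUIDs and plain integers. Variable parts are replaced with placeholders so that only structural differences between the two runs remain, and the diff is computed with the standard library's `difflib`.

```python
import difflib
import re

# Parts of a log line that are expected to differ between executions.
# The patterns are illustrative and would need tuning for real log formats.
_VARIABLE_PARTS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?Z?"), "<TS>"),
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d+\b"), "<N>"),
]


def normalize(line):
    """Replace timestamps, UUIDs and integers with fixed placeholders."""
    for pattern, token in _VARIABLE_PARTS:
        line = pattern.sub(token, line)
    return line.strip()


def diff_runs(passing_lines, failing_lines):
    """Return the normalized lines that appear in only one of the two runs."""
    a = [normalize(line) for line in passing_lines]
    b = [normalize(line) for line in failing_lines]
    return [d for d in difflib.ndiff(a, b) if d.startswith(("+ ", "- "))]


passing = [
    "2019-02-14 10:00:01 INFO request 17 served in 35 ms",
    "2019-02-14 10:00:02 INFO session closed",
]
failing = [
    "2019-02-15 09:30:11 INFO request 42 served in 38 ms",
    "2019-02-15 09:30:12 ERROR connection refused by db",
    "2019-02-15 09:30:13 INFO session closed",
]
print(diff_runs(passing, failing))
# prints ['+ <TS> ERROR connection refused by db']
```

Note that this normalization also hides numeric differences such as latencies, which is one reason metrics need their own comparison rather than being read from the logs.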
In general, we need more specific tools for this task. In a testing environment, we have more control over what information to store, but we need to make the tools aware of the testing process. This way, we can gather the necessary information during test runs, and provide the appropriate abstractions to understand why a specific test failed. We are researching ways to visualize the information we collect with ElasTest so that bug localization can be done faster and more accurately. We think there is still a lot of room for improvement in our Continuous Integration environments.