Presentation: Observability Is Also Programmed

MMS Yury Nino Roa

Article originally posted on InfoQ.

Transcript

Roa: The title of my talk is “Observability Is Also Programmed,” and it is a collection of thoughts, ideas, and processes for automating observability as code. My name is Yury Niño. I am a cloud infrastructure engineer and a chaos engineering advocate in the Spanish community. I work for Google, designing and provisioning infrastructure solutions as code.

Outline

Specifically, these are the topics that I am going to cover. I am going to present a landscape of observability, which includes concepts, practitioners, technologies, and my personal perception of its evolution. With this context, I am going to present why implementing observability as code is important, not only for developers or operators but also for organizations. In this part, we will review the benefits of implementing observability as code. Finally, I am going to present a framework based on the famous maturity model.

Observability Landscape

Observability is about good practices. To support these practices, industry and academia have joined efforts, and their work should be reviewed first. Let’s explore the landscape of observability before turning to the benefits for modern software systems. Distributed systems have been growing rapidly to meet the demands of emerging applications such as business analytics, biomedical informatics, media streaming applications, and so on. This rapid growth comes with complexity. To alleviate this, observability is emerging as a key capability of modern distributed systems, which use telemetry data collected at runtime for debugging and maintaining these complex applications. These books, which were the references for this talk, explore the properties and patterns defined for observability and enable readers to harness new insights from the monitored telemetry data as applications grow in complexity.

What Is Observability?

The term observability was coined by Rudolf Kálmán in 1960 as a concept in control theory, where observability is defined as a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Observability is being able to fully understand our system. Observability for software systems is a measure of how well you can understand and explain any state of your system. When you have adopted observability, you must be able to comparatively debug, across all dimensions of the system, any state of the system and the inner workings of its components, all by interrogating the system with external tools and without shipping any new custom code. For me, observability is about asking questions, providing answers, and building knowledge about our systems.

What Is Not Observability?

Observability is different from monitoring, and it is super important to understand why. Some vendors insist that observability has no special meaning whatsoever, and that it is simply another synonym for telemetry, indistinguishable from monitoring. You have probably heard that observability is about three pillars: metrics, logs, and traces. Proponents of these definitions relegate observability to being another generic term for understanding how software operates. Please take this away: monitoring is about collecting, processing, aggregating, and displaying real-time quantitative data about our systems, while observability is defined as a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. I think that for modern software systems, observability is not about the data types or inputs, nor is it about mathematical equations. It is about how people interact with and try to understand their complex systems. Therefore, observability requires recognizing the interaction between people and technology in order to understand how those complex systems work together.

Observability Evolution

This is a brief history of observability and its meaning in 2022. As I mentioned, the term observability has its roots in control theory. In 1960, Rudolf Kálmán introduced the term in his classic paper, “On the General Theory of Control Systems.” In September 2015, engineers at Twitter wrote a blog post called “Observability at Twitter.” It is one of the first times that observability was used in the context of IT systems. A few years later, Anthony Asta from the observability engineering team at Twitter wrote a blog post called “Observability at Twitter: Technical Overview, Part I,” where he presented four pillars for observability: monitoring, alerting, tracing, and log analytics. In February 2017, Peter Bourgon published a blog post called “Metrics, Tracing, and Logging.” If you notice, there are three pillars there. He described the three pillars, metrics, tracing, and logging, as a Venn diagram.

In July 2018, Cindy Sridharan published a definitive O’Reilly book called “Distributed Systems Observability,” which is a classic on this topic. The book outlines the three pillars of observability again, and details which tools to use and when. In the same year, Charity Majors, CEO of Honeycomb, warned that describing observability as three pillars limits the discussion. She shared a series of tweets where she boldly states that there are no three pillars of observability, with the specific intention of explaining why. Since 2020, there has been a proliferation of tools, APIs, and SDKs to instrument, generate, [inaudible 00:06:57] for data to help analyze software performance and behavior, finally providing observability. With all of these initiatives, the Cloud Native Computing Foundation decided to create a standard for the instrumentation and collection of observability data. In 2022, one of the best books on observability, “Observability Engineering,” was released by Charity Majors, George Miranda, and Liz Fong-Jones. This book digs into what observability actually means. It talks about the fundamental concepts, how it all works, and how these systems are not just technical, they are sociotechnical systems.

Regarding this proliferation, InfoQ included data observability in its latest DevOps and Cloud trends graph. Data observability is an emerging technology that helps to better understand and troubleshoot data-intensive systems. The concept of data observability was first coined in 2019 by Barr Moses. According to her, data observability is an organization’s ability to fully understand the health of the data in its systems. It uses automated monitoring, alerting, and triaging to identify and evaluate data quality and discoverability issues. If you notice, that is closely related to observability and, of course, to observability as code.

Regarding observability as code, the topic that we have here, the Technology Radar published in 2018 by Thoughtworks included this practice in the Trial category. They established that observability as code addresses non-repeatable dashboard configurations and the inability to continuously test and adjust alerts to avoid alert fatigue or missing out on important alerts, as well as drift from organizational best practices. They highly recommend treating observability configurations as code and adopting infrastructure as code for monitoring and alerting, choosing products that support configuration through version-controlled code and execution of APIs or commands via infrastructure continuous delivery and continuous deployment pipelines. Observability as code is an often forgotten aspect of infrastructure as code. They believe that it’s crucial enough to be called out.
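To make the idea of version-controlled, testable alert configuration more concrete, here is a minimal sketch in Python. It is not taken from the talk or from any monitoring product: the `AlertRule` fields, the metric names, the notification channel, and the five-minute rule of thumb are all hypothetical. It only illustrates that alert definitions can live in a repository, be checked by a test, and be rendered deterministically so that code reviews show clean diffs.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AlertRule:
    """A hypothetical alert definition kept in version control."""
    name: str
    metric: str          # hypothetical metric name
    threshold: float
    window_minutes: int
    notify: str          # hypothetical notification channel


def validate(rule: AlertRule) -> list:
    """Return a list of problems; an empty list means the rule passes."""
    problems = []
    if not rule.name:
        problems.append("name must not be empty")
    if rule.window_minutes < 5:
        # Assumed team convention: very short windows tend to be noisy.
        problems.append("window shorter than 5 minutes risks alert fatigue")
    if not rule.notify:
        problems.append("alert has no notification channel")
    return problems


def render(rules: list) -> str:
    """Serialize rules in a stable order so diffs stay small and readable."""
    ordered = sorted(rules, key=lambda r: r.name)
    return json.dumps([asdict(r) for r in ordered], indent=2, sort_keys=True)


RULES = [
    AlertRule("checkout-latency-high", "http.latency.p95_ms", 800.0, 10, "team-payments"),
    AlertRule("checkout-error-rate", "http.error_rate", 0.02, 15, "team-payments"),
]

if __name__ == "__main__":
    for rule in RULES:
        assert validate(rule) == [], (rule.name, validate(rule))
    print(render(RULES))
```

In a real pipeline, the rendered document would be handed to the monitoring provider's API or provisioning tool; that step is deliberately left out because it depends on the product in use.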

Observability as code is part of something bigger, observability-driven development, or ODD, a new term that has started being used recently and that encompasses the activities required for observability. Specifically, it talks about application instrumentation, a stack of technologies, and visualization in order to achieve ODD, the acronym that I am going to use. Regarding ODD, the purpose of DevOps automation isn’t just speed; it is about leveraging the intrinsic motivation and creativity of developers again. For me, ODD uses data and tooling to observe the state and behavior of a system before, during, and after development in order to learn more about it.

How Does Observability Code Look?

This is how observability as code looks. It has stages separated by environments, such as local, continuous integration, continuous deployment, and continuous delivery. In this illustration, developers commit observability code. In a second step, the Git server invokes a webhook, and Cloud Build triggers custom workers. After that, the custom workers access resources in the monitoring provider. They build, test, and implement artifacts. Finally, in a continuous delivery task, the dashboards, alerts, and data are delivered frequently. Is this observability as code? Monitoring is monitoring; observing is event-first testing in production. How does observability as code look under this definition? Since monitoring is mostly about metrics, while observability is about events, observability as code must include many actionable active checks and alerts that proactively notify engineers of failures and warnings. It means maintaining a runbook for stability and predictability in production systems, and expecting clusters and clumps of tightly coupled systems to all break at once. Observability in production must generate as many artifacts as possible to determine the internal state of a system. It must show you any performance degradation in tests or errors during the build and release process. If code is properly instrumented, you should be able to break down by old and new build ID, and analyze them side by side. The pipeline must provide tools to validate consistent and smooth production performance across deployments, to see whether your new code is having its intended impact or whether anything looks suspicious, and to drill down into specific events.
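The point about breaking events down by old and new build ID can be sketched as follows. This is only an illustration under assumptions: the event fields (`build_id`, `duration_ms`, `status`), the build names, and the sample values are invented, and a real system would query its observability backend rather than an in-memory list.

```python
import math
from collections import defaultdict


def summarize_by_build(events):
    """Group events by build_id and compute count, error rate, and p95 latency."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["build_id"]].append(event)

    summary = {}
    for build_id, items in grouped.items():
        durations = sorted(e["duration_ms"] for e in items)
        # Nearest-rank p95: the smallest value with at least 95% of samples at or below it.
        rank = max(1, math.ceil(0.95 * len(durations)))
        errors = sum(1 for e in items if e["status"] >= 500)
        summary[build_id] = {
            "count": len(items),
            "error_rate": errors / len(items),
            "p95_ms": durations[rank - 1],
        }
    return summary


# Hypothetical events emitted by an instrumented service.
EVENTS = [
    {"build_id": "build-41", "duration_ms": 120, "status": 200},
    {"build_id": "build-41", "duration_ms": 140, "status": 200},
    {"build_id": "build-41", "duration_ms": 135, "status": 200},
    {"build_id": "build-41", "duration_ms": 150, "status": 500},
    {"build_id": "build-42", "duration_ms": 110, "status": 200},
    {"build_id": "build-42", "duration_ms": 480, "status": 500},
    {"build_id": "build-42", "duration_ms": 510, "status": 500},
    {"build_id": "build-42", "duration_ms": 125, "status": 200},
]

if __name__ == "__main__":
    # Print the old and new builds side by side.
    for build_id, stats in sorted(summarize_by_build(EVENTS).items()):
        print(build_id, stats)
```

With this sample data, the newer build shows a higher error rate and a higher p95 latency than the older one, which is the kind of signal a pipeline could use to flag a suspicious deployment.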

Some reasons for thinking that observability as code is a good idea: it allows you to identify and diagnose faults, failures, and crashes. It enables the analysis of the operational performance of the system. It measures and analyzes the business performance and success of the system or its components, which is really important for product owners or other product stakeholders. Specifically, the benefits of adopting observability as code practices include repeatable, replicable, and reusable activities. Reducing toil is a great advantage of this practice. Documentation and context are important here. It is auditable, which gives you a strategy for audit history. Security is another great advantage, since observability as code allows for stricter controls over the resources. Others are efficient delta changes that react to external stimulus, ownership and packaging, disaster recovery, since it is a great source of information for your disaster recovery strategies, and speed of deployments. These are some of the benefits of practicing observability as code.
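One of these benefits, efficient delta changes, can be illustrated with a small sketch. It assumes that both the desired definitions (from the repository) and the deployed definitions (from the monitoring provider) are available as dictionaries keyed by resource name; the resource names and settings below are hypothetical. Provisioning tools perform a far more complete version of this comparison.

```python
def plan_changes(desired, deployed):
    """Compare desired and deployed definitions, keyed by resource name."""
    create = sorted(name for name in desired if name not in deployed)
    delete = sorted(name for name in deployed if name not in desired)
    update = sorted(
        name for name in desired
        if name in deployed and desired[name] != deployed[name]
    )
    return {"create": create, "update": update, "delete": delete}


# Hypothetical alert definitions stored in the repository.
DESIRED = {
    "api-latency": {"threshold_ms": 500},
    "api-errors": {"threshold": 0.01},
    "queue-depth": {"threshold": 1000},
}

# Hypothetical definitions currently present in the monitoring provider.
DEPLOYED = {
    "api-latency": {"threshold_ms": 800},
    "api-errors": {"threshold": 0.01},
    "legacy-cpu": {"threshold": 0.9},
}

if __name__ == "__main__":
    print(plan_changes(DESIRED, DEPLOYED))
```

Only the resources that differ appear in the plan, so an unchanged alert such as `api-errors` is left alone, which keeps each deployment small and auditable.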

Observability Maturity Model

Finally, how do you start with observability as code? Spreading a culture of observability is best achieved by having a plan that measures progress and prioritizes target areas of investment. In this section, I go beyond the benefits of observability and its tangible technical steps by introducing the observability maturity model, which I designed according to my experience and according to the experience of others in the literature. We will review the key capabilities an organization can measure and prioritize as a way of driving observability adoption.

Sophistication

In this sense, the first axis that we are going to review is sophistication. I have determined some criteria to define which category of the classical maturity model your team and your organization are in. Specifically, I am going to present some characteristics for locating an organization in these stages or levels: elementary, simple, sophisticated, and advanced. At the elementary level, the engineering teams are distracted by picking the wrong way to fix something. People may even be afraid to make changes to the code, but they have decided to implement a strategy for observability as code in order to improve this situation. They are collecting metrics, but the metrics are not monitored or visualized, and nobody is notified about them. As a consequence, the incident responders cannot easily diagnose issues.

The next stage is called simple. Here, the teams have installed agents in the code for monitoring the behavior of the system using monitoring platforms such as Datadog, New Relic, Honeycomb, or Dynatrace. They have shown interest in continuous integration, continuous deployment, and continuous delivery platforms, and in infrastructure as code technologies. They are determining a strategy for defining key performance indicators, KPIs, and want to monitor based on a list of services that have been mapped to SLOs. However, this process is administered manually. The releases are infrequent and require lots of human intervention. Regarding the general release process, it is possible that lots of changes are shipped at once. The releases have to happen in a particular order. The teams avoid deploying on certain days or at certain times of year.
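As an illustration of what mapping services to SLOs might look like once it is written down as code instead of administered manually, here is a minimal sketch. The service names, targets, and request counts are invented, and availability is used as the only indicator for simplicity.

```python
def slo_report(targets, measurements):
    """Compare measured availability with the SLO target for each service.

    targets: service name -> availability target (for example 0.999).
    measurements: service name -> (total_requests, failed_requests), total > 0.
    """
    report = {}
    for service, target in targets.items():
        total, failed = measurements[service]
        availability = (total - failed) / total
        report[service] = {
            "target": target,
            "availability": availability,
            "met": availability >= target,
        }
    return report


# Hypothetical services mapped to availability SLOs.
SLOS = {"checkout": 0.999, "search": 0.995}

# Hypothetical request counts for the evaluation period: (total, failed).
MEASUREMENTS = {"checkout": (1_000_000, 400), "search": (500_000, 4_000)}

if __name__ == "__main__":
    for service, row in slo_report(SLOS, MEASUREMENTS).items():
        status = "met" if row["met"] else "missed"
        print(f"{service}: {row['availability']:.4%} vs {row['target']:.3%} ({status})")
```

Keeping the mapping in a file like this makes it reviewable and lets a pipeline evaluate it on every run, which is a first step away from the manual process described for this stage.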

If you are in this stage, sophisticated, you are close to reaching observability as a complete strategy, although it is critical to mention that it’s an evolutionary strategy. Here, the teams are familiar with provisioning tools such as Terraform and Pulumi, and with their API features. They have access to CI/CD platforms that support automation workflows. They are using tagging and naming conventions that allow them to classify events properly. Regarding this stage, I would like to say that automation of observability workflows is essential for scaling observability as code to teams and organizations. There are many tools and services available that can support automated observability workflows. However, these workflows need to be discussed and configured according to each engineering team; they are not out-of-the-box solutions.
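Tagging and naming conventions are easy to enforce once the definitions are code. The sketch below assumes a made-up convention (`<env>-<service words>-<kind>`, plus the tags `team`, `service`, and `env`); each team would substitute its own rules, and the check could run as a step in the CI/CD workflow.

```python
import re

# Hypothetical naming convention: <env>-<one or more lowercase words>-<kind>.
NAME_PATTERN = re.compile(r"(prod|staging|dev)-[a-z0-9]+(-[a-z0-9]+)*-(dashboard|alert)")

# Hypothetical set of tags that every resource must carry.
REQUIRED_TAGS = {"team", "service", "env"}


def check_resource(resource):
    """Return a list of convention violations for one dashboard or alert."""
    problems = []
    if not NAME_PATTERN.fullmatch(resource["name"]):
        problems.append("name does not follow <env>-<service>-<kind>")
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        problems.append("missing tags: " + ", ".join(sorted(missing)))
    return problems


RESOURCES = [
    {
        "name": "prod-checkout-latency-alert",
        "tags": {"team": "payments", "service": "checkout", "env": "prod"},
    },
    {"name": "Checkout Latency", "tags": {"team": "payments"}},
]

if __name__ == "__main__":
    for resource in RESOURCES:
        print(resource["name"], "->", check_resource(resource) or "ok")
```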

As I have mentioned, implementing an observability as code strategy is a constant drill. However, I decided to include some criteria to identify the top of this axis, the advanced stage. In this stage, an automation workflow for observability as code is implemented, and it is running in production, even using standards such as OpenTelemetry as something common across the company, combining tracing, metrics, and logging into a single set of system components and language-specific telemetry libraries. In this stage, code gets into production shortly after being written. Engineers can trigger deployment of their own code after it has been peer reviewed, has satisfied controls, and has been checked in.

Adoption

Now it’s time to move to the other axis: adoption. According to the Capability Maturity Model, the capability of an organization for doing anything can be classified into four stages or levels: in the shadows, investment, adoption, and finally cultural expectation. In my experience, an organization is in the shadows when there is low or no organizational awareness, and projects are unsanctioned. In general, the organization is spending a lot of additional time and money staffing the on-call rotation. For example, on-call response to alerts is inefficient, and alerts are ignored. As a consequence, the product teams don’t receive feedback on their features, since early adopters only infrequently perform monitoring or observability strategies. Incident responders cannot easily diagnose issues. Some team members are disproportionately pulled into emergencies. The good news is that, since the organizations are aware of these pain points, the teams are starting by identifying when to observe and designing in such a way as to make instrumentation easy.

With the aim of overcoming these problems, the organizations have decided to adopt observability as code. In this stage, investment, observability as code is officially sanctioned and practitioners are dedicating resources to the practice. For example, product managers want to have enough data to make good decisions about what to build next, or they want to receive customer feedback on the product features that are growing in scope. Some criteria that allow you to identify that you are here include having a few critical services with monitoring, alerting, and visualization. Multiple teams are interested and engaged in the strategy for observing several critical services. Code is stable. That is a fact. Fewer bugs are discovered in production, but you are dedicating resources in a year.

In this stage, adoption, observability as code is officially sanctioned, and there is a team dedicated to implementing it, since in the previous stage the organization decided to implement observability as code. Resources are dedicated to the practice. Developers have easy access to KPIs for customer outcomes and for system utilization and cost, and can visualize them side by side. For example, after code is deployed to production, your team focuses on customer solutions rather than on support, and isolated issues can typically be fixed without triggering cascading failures. The team is following practices to enforce observability as part of continuous deployment. Finally, the team is adding metric collection, tracing, and context to get better insights.

Finally, observability is about generating the necessary data, encouraging teams to ask open-ended questions, and enabling them to iterate, because effective product management requires access to relevant data, with the level of visibility offered by event-driven data analysis and the predictable cadence of releases, both enabled by observability. That is the reason for having a standardization of instrumentation, with best practices like proactive monitoring and alerting in place, and with a clear definition and implementation of KPIs to measure observability as code maturity. There is a feedback loop from the observations to the stakeholder teams that take advantage of observability as code. In general, the team is using insights to discuss the learnings that are shared and implemented through these initiatives.

Conclusion

Every industry is seeing the need to transform, and they are held back by legacy systems and infrastructure. For me, there is a final quote about these things, “Waiting is not an option.” Observability as code is required in your organization. Depending on the size of your organization, the path could be different, but it is required to automate this strategy in order to have insights into your systems, in order to know the state of your system and, most important, in order to avoid toil and keep the development team happy.
