MMS • Ben Linders
Article originally posted on InfoQ.
We can make good decisions quickly when we limit the cognitive load on any one person or team. Observability can increase delivery speed by giving developers the information they need to make those decisions.
Jessica Kerr spoke about applying observability for speed and flow of delivery at QCon London 2022 and QCon Plus May 2022.
There’s only so much each person can hold in their head, as Kerr explained:
When the code is changing, there’s only so much a person can keep up with. When new people are joining, only so quickly can they load all this custom knowledge in.
When developers are making decisions, it helps to have information readily available, Kerr mentioned. This information should answer questions like “Who calls this service?” “How long does this function run?” “What values does this field hold?” “How many times do we hit the database, and for what?”
Kerr explained how developers can use distributed traces:
Distributed traces tell a story about each request. See who called whom. See what happened concurrently and what waited. See where the performance bottleneck is. Traces provide thousands of stories of individual requests moving through the software.
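To make the idea of a trace as a story concrete, here is a minimal sketch in Python. It uses a toy in-memory `Span` record rather than any real tracing library, and the service names, span names, and timings are invented for illustration. It shows how parent links answer "who called whom", how overlapping time ranges show what ran concurrently, and how self time (a span's duration minus the time covered by its children) can point at a bottleneck.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    """Toy span record (hypothetical, not a real tracing library type)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    name: str
    start_ms: int
    end_ms: int


def who_called_whom(spans: List[Span]) -> List[Tuple[str, str]]:
    """Return (caller service, callee service) for every child span."""
    by_id = {s.span_id: s for s in spans}
    return [(by_id[s.parent_id].service, s.service)
            for s in spans if s.parent_id in by_id]


def ran_concurrently(a: Span, b: Span) -> bool:
    """Two spans overlap in time if each starts before the other ends."""
    return a.start_ms < b.end_ms and b.start_ms < a.end_ms


def self_time_ms(span: Span, spans: List[Span]) -> int:
    """Duration of a span minus the time covered by its direct children."""
    children = sorted((c.start_ms, c.end_ms)
                      for c in spans if c.parent_id == span.span_id)
    covered, cursor = 0, span.start_ms
    for start, end in children:
        start = max(start, cursor)      # skip time already counted
        end = min(end, span.end_ms)     # ignore time past the parent's end
        if end > start:
            covered += end - start
            cursor = end
    return (span.end_ms - span.start_ms) - covered


def bottleneck(spans: List[Span]) -> Span:
    """The span that spent the most time doing its own work."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# One invented checkout request moving through four services.
TRACE = [
    Span("a", None, "web", "GET /checkout", 0, 900),
    Span("b", "a", "cart", "load cart", 10, 110),
    Span("c", "a", "pricing", "price items", 10, 210),
    Span("d", "a", "payments", "charge card", 220, 880),
    Span("e", "d", "db", "INSERT payment", 230, 280),
]

if __name__ == "__main__":
    print(who_called_whom(TRACE))
    print(ran_concurrently(TRACE[1], TRACE[2]))  # cart and pricing overlap
    print(bottleneck(TRACE).name)
```

In this invented trace, the cart and pricing calls overlap in time, while the payments span accounts for most of the request's duration. A real tracing tool draws the same conclusions visually as a waterfall.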
With thousands of stories, how do you find the one you want to look at? Kerr mentioned that querying over the traces helps with that: search for a request that was slow, one from a particular grumpy customer, or one that failed with a certain error message:
Then when I make a change to the code, I add new spans to the trace or attributes to the spans. I can see the new results locally, in test, and in production. I can be confident it’s working. And I can get satisfaction from knowing it’s useful to customers!
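The querying and attribute-adding workflow described here could look roughly like the following Python sketch. It is a simplified in-memory model, not the query language of any particular observability product. The `RootSpan` type, the `query` helper, and the attribute names (`customer.id`, `error.message`, `cart.item_count`) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class RootSpan:
    """Toy summary of one request (hypothetical, for illustration only)."""
    trace_id: str
    duration_ms: int
    attributes: Dict[str, Any] = field(default_factory=dict)


def query(traces: List[RootSpan],
          min_duration_ms: Optional[int] = None,
          where: Optional[Dict[str, Any]] = None) -> List[RootSpan]:
    """Return the traces that satisfy every condition that was given."""
    where = where or {}
    results = []
    for t in traces:
        if min_duration_ms is not None and t.duration_ms < min_duration_ms:
            continue
        # A missing attribute never equals the requested value.
        if any(k not in t.attributes or t.attributes[k] != v
               for k, v in where.items()):
            continue
        results.append(t)
    return results


# Invented sample data.
TRACES = [
    RootSpan("t1", 180, {"customer.id": "acme"}),
    RootSpan("t2", 2400, {"customer.id": "globex"}),
    RootSpan("t3", 310, {"customer.id": "acme",
                         "error.message": "card declined"}),
]

if __name__ == "__main__":
    slow = query(TRACES, min_duration_ms=1000)
    grumpy = query(TRACES, where={"customer.id": "acme"})
    failed = query(TRACES, where={"error.message": "card declined"})
    print([t.trace_id for t in slow])
    print([t.trace_id for t in grumpy])
    print([t.trace_id for t in failed])

    # After a code change, a new attribute appears on new requests
    # and becomes queryable in exactly the same way.
    TRACES.append(RootSpan("t4", 220, {"customer.id": "acme",
                                       "cart.item_count": 3}))
    print([t.trace_id for t in query(TRACES, where={"cart.item_count": 3})])
```

The point of the sketch is that a newly added attribute needs no extra plumbing to be searchable, which is what makes it practical to confirm a change is working in test and in production.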
InfoQ interviewed Jessica Kerr about dealing with cognitive load and how observability can be an asset.
InfoQ: How does cognitive load limit the ability of teams?
Jessica Kerr: Our job as software developers isn’t typing. It is making decisions: what to change, where to change it, what to change it to, where else to look, what to name it all. When we make informed decisions, then rework is rare. When we don’t have enough knowledge of the system to know everywhere to look, then our decisions become bugs. We circle back again and again.
As we add capabilities, every feature request comes with the unstated requirement “… and everything else still works.” All that adds up to a lot to know.
To work smoothly, to have fast flow, we need most of that knowledge at our fingertips. Not “Let me google that again” and “Maybe if we search all the codebases we can find a reference to this.” We need instead “I understand how this works” or “I know who to ask about this.”
InfoQ: How can observability in software become an asset to organizing teams?
Kerr: As leaders of teams, we can use observability. An important input to team management is error budgets and service level objectives. Observability in the software lets us count the percentage of incoming requests that are succeeding fast enough, and check that against our agreed-upon service level.
For instance, maybe our checkout service is expected to return within 900 milliseconds, 99% of the time over any 30-day period. That leaves 1% as an error budget. When our service is meeting that objective with no sweat, the team can keep working on features, and try stuff like reducing capacity. When that error budget is almost gone, it’s time to direct the team’s effort toward reliability instead.
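The arithmetic behind that example can be sketched in a few lines of Python. The 900 millisecond threshold and the 99% target come from the example above. Everything else is an assumption for illustration: the `Request` record, the function names, treating a failed request as bad regardless of its speed, and the 80% cutoff used to decide that the budget is "almost gone".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """One request observed in the window (hypothetical record)."""
    duration_ms: float
    ok: bool


def error_budget(requests: List[Request],
                 threshold_ms: float = 900,
                 target_percent: float = 99) -> Dict[str, float]:
    """Summarize how much of the error budget a window of requests used."""
    total = len(requests)
    # A request is good only if it succeeded AND returned fast enough.
    bad = sum(1 for r in requests
              if not (r.ok and r.duration_ms <= threshold_ms))
    allowed_bad = total * (100 - target_percent) / 100
    if allowed_bad > 0:
        used = bad / allowed_bad
    else:
        # No traffic (or a 100% target): any bad request exhausts the budget.
        used = float("inf") if bad else 0.0
    return {"total": total, "bad": bad,
            "allowed_bad": allowed_bad, "budget_used": used}


def next_focus(report: Dict[str, float], almost_gone: float = 0.8) -> str:
    """Assumed policy: switch to reliability once 80% of the budget is used."""
    return "reliability" if report["budget_used"] >= almost_gone else "features"


if __name__ == "__main__":
    # Invented 30-day window: 10,000 requests, 30 slow and 20 failed.
    window = ([Request(120, True)] * 9950
              + [Request(1200, True)] * 30
              + [Request(80, False)] * 20)
    report = error_budget(window)
    print(report)
    print(next_focus(report))
```

With 10,000 requests in the window, a 99% objective allows 100 bad requests. The invented window above has 50, so half the budget is spent and, under the assumed policy, the team keeps working on features.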