

Presentation: Reliable Architectures Through Observability

Kent Quirk

Article originally posted on InfoQ.


Quirk: I’m here to talk about architecture reliability, observability as a whole. I’d like to just start you off by thinking about thinking. If you’re designing something, if you’re working on something, if you’re building something, would you rather think about the telephone, with its fairly obvious connection between a speaker and a microphone, or about this? This is a picture from Ericsson in Sweden, back in the 1890s, of a telephone switch. All that gray fuzz you see there, that’s not fog, that’s wires. This is network connectivity with physical wiring connecting everything together. When you imagine it, you can hold that entire problem on the left in your head. The thing on the right, that’s engineering. Engineering really starts when the problem is too big to fit into your brain. When you see engineers who are super smart, super skilled, they can build systems that nobody else can understand because it fits in their brain, it doesn’t fit in your brain. That part is hacking. The act of engineering is when it no longer fits in your brain, and you have to start adding process and capability, and working and communicating with other people.

What Is a Reliable Architecture?

I’ve been doing this a while. I’m over 40 years now working professionally in the software industry. I got my start writing device drivers for hardware, in assembly. Now I’m working on large scale backend systems at Honeycomb. Honeycomb makes an observability tool. We’re not really here to talk about Honeycomb. I want to talk about architecture. What does reliable architecture mean? It’s not about failure. It’s not about avoiding failure. Any real system is going to fail in various ways. It may not all fail at once, but bits and pieces of it will. It’s about handling that change. It’s about being resilient to changes in the environment, to your customers. You deploy a new library. Your cloud provider happens to stop running its services, whatever it might be, things are going to fail. I was having an email conversation recently with somebody who said, just assume everything is going to fail. The systems you build to accommodate that are what end up giving you reliability. It’s rarely clean, though. You can’t just point and say, here’s where I put my reliability. We need architecture for the big problems. The way we do architecture is we break the big problems down into smaller problems, and the smaller problems into smaller problems like that. Everything is a hierarchy. It’s a fractal situation. I have been here for the evolution of thinking about programming: breaking your programs down into functions, then it became objects, then it became splitting things up into processes and microservices and services. We solve bigger problems by going to the next scale of the thing we already understand.

Like I said, it’s not always clean. You get this hierarchy of things that are interdependent on each other, or maybe not. Now what? The systems get really big and messy, especially when they encounter the real world. This is where we get works on my machine. Works on my machine means I know all the things. When you put it into the real world, all of a sudden you don’t have control over all the things anymore. Now you have to add in the connections between the parts. That’s why a debugger doesn’t really help you. Once your system escapes into the real world, you can’t stop it. With asynchronous communication and running services, you can’t just say, “Wait a minute, I want to look at this.”


We really need telemetry. We need the ability to read what’s going on in these systems that are living out in the cloud. The way we do that is with what we call observability. It’s what makes that telemetry meaningful. It’s the ability to ask questions about it, and then get meaningful answers back. Observability is a word that a lot of vendors like to throw around, everybody has observability. What does it mean? We had this little battle for a couple of years, where everybody was talking about the three pillars of observability: logs, traces, and metrics. I personally think this is a terrible framing, because it’s not like three pillars all holding up the roof of observability. They’re different sizes. They have different meanings. They have different values. I reject the framing of three pillars. I also reject the framing that they’re all equally important.


Number one, metrics. Everybody loves metrics, everybody has metrics. We’ve all seen, and maybe we’ve been, the engineer who can just look at that dashboard over there, and there’s a little pixel drop in some metric. The engineer goes, that means the user service is falling over. How did you know that? They knew it from experience. They knew it from having stared at these metrics. That’s not sustainable. That’s not scalable. Metrics are pre-aggregated answers to your problems. They’re fighting the last war. You’re saying this was a thing that went wrong before, therefore, I’m going to track it, I’m going to add it up. When that number changes, then hopefully it tells me something meaningful. I did work at one company where we had the ability to build our own dashboards for our production services. We had 1100 metrics we could pick from to put on those dashboards. You could spend weeks just figuring out which metrics you wanted up there. Some of them are useful. It’s a real problem. Metrics are fine. They can be useful. They’re valuable to have. Sometimes it’s nice to just see what’s going on. They don’t really help you debug. You get a lot of them precisely because they’re not adequate to the problem.


Logs, they’re good. They’re also useful. This is where I actually got into the whole observability world. Imagine all these things sending logs, they’re all sending them to CloudWatch. They’re all going up there in buckets, maybe correlated by time, maybe not, but you have to query them. I had a little CloudWatch collection of logs, and I wrote a query language in Slack so that I could query my logs and post the results of the query back into Slack automatically. That was really useful for sharing that information with my team. You still had to write a clever query. It was all correlated by time. You could only get 10 lines. Maybe you found what you wanted, and maybe you didn’t. Then we eventually realized that we could take that same structured data that we were sending up to CloudWatch and send it to a service provider, who could help us process that data and do something more with it. It’s still up to me to do the correlation between all the systems, all the events, and to know what’s in the logs. That is where tracing comes in.


Tracing is basically fancy logs at its base. With tracing, you have spans. Those spans show you what’s going on in the world. What we see here, there’s a trace. You can see at the top level, there’s the root span, and it takes 5 seconds. Then you have all these various elements. It shows you basically when everything happened, the correlation between them, how long they all took, and the relationship between them. You have a hierarchy here. That gives you a lot more information about what’s going on. Specifically, when we talk about traces, the things that matter in tracing are these key fields. You have a trace ID, which is just basically a large random number, ideally. Every child span gets a reference to the parent. This green triangle here, its parent span is 111, which is the span ID of the top one. Now we’ve defined a hierarchy. All the spans are independent, but they can refer to each other in a way that allows us to reconstruct the ordering and sequencing and timing of everything. One of the questions is, what happens if my trace crosses services? When a request leaves one service for another, you have to propagate that trace information.
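As a concrete sketch of those key fields: each span carries a trace ID shared by the whole trace, its own span ID, and its parent’s span ID, which is enough to rebuild the hierarchy after the fact. This is an illustrative Python model, not the OTel API; the field names and IDs are made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str             # shared by every span in the same trace
    span_id: str              # unique per span; ideally random
    parent_id: Optional[str]  # None for the root span
    name: str

def build_tree(spans):
    """Group spans by parent_id so the hierarchy can be reconstructed."""
    root, children = None, {}
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            children.setdefault(s.parent_id, []).append(s)
    return root, children

spans = [
    Span("abc123", "111", None, "handle_request"),   # the 5-second root
    Span("abc123", "222", "111", "check_cache"),
    Span("abc123", "333", "111", "query_database"),
]
root, children = build_tree(spans)
```

Because every span is independent but carries its parent’s ID, the backend can receive them in any order and still reassemble the ordering, timing, and hierarchy.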

Let’s talk about tracing in a little more detail. I want to go through a slightly idealized, actual trace situation that I came across about a year ago. I was working on a project that had been shipping for several years. I was noticing as I was starting and stopping it, it was taking a long time to start up in certain scenarios. There wasn’t a lot of tracing around. I threw some tracing in, and I was just like, we’ll just track some of these functions that I don’t know what’s happening during startup, I just decorated the startup code with some tracing logic. This is simulated code for clarity. It’s not the real thing, but it’s basically what was going on. Down here on the left, you see this process keys function. What’s happening is there’s a bucket of keys that gets handed to the service. The service’s job is to encode or decode them. It uses a lot of caching. The keys are coming from user data, and so there may be lots of duplicate keys. What happened was that this code would start up, and since those keys may not have been in the cache, we started up a goroutine for each key, basically a parallel routine, to go off and query the database if the thing wasn’t in the cache, or otherwise just return it from the cache. That’s the get cache value up here. We look, is it in the cache? No. Go get it from the database.

This is what the trace looked like. Again, I’m sketching this out in general here. You can see we spun up in parallel all those get cache values, they all started at about the same time. The first one went out, looked at the cache, and then it goes and tries to get it in the database, it didn’t find it. Then the next one waited until the first one was finished before it went out to the database. Then it waited for the next one to finish before it went out to the database. What looked like perfectly reasonable code, and is in a world where your cache is mostly full, was, in this case where the cache was mostly empty, serializing access to all these database values. I ended up with process keys taking way longer than I expected it to. The problem is, we didn’t all need to go to the database after we checked the cache. Instead, what you have to do here is add some code to check the cache again, after you get the lock. Because while you were waiting for the lock, somebody else might have put it in the cache. This is concurrency 101, but it’s also a real easy mistake to make. The point I’m trying to make is, I wouldn’t have seen this, I wouldn’t have addressed it, I wouldn’t have fixed it if I hadn’t known from looking at the telemetry. We turned this trace into that trace. Because everybody locks, one of them goes out and gets the database query. Then everybody else is like, now it’s in cache, you can go. I’ve glossed over a lot of details, but this actually was a 10x speedup for a pretty complex little problem.
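The fix described here is the classic check-lock-check pattern. A minimal sketch in Python (using threads rather than goroutines; `KeyCache` and `fetch` are illustrative names, not the real code):

```python
import threading

class KeyCache:
    """Check the cache, take the lock, then check the cache *again*."""
    def __init__(self, fetch_from_db):
        self.fetch_from_db = fetch_from_db
        self.data = {}
        self.lock = threading.Lock()

    def get(self, key):
        # Fast path: no lock needed when the value is already cached.
        if key in self.data:
            return self.data[key]
        with self.lock:
            # Re-check after acquiring the lock: while we were waiting,
            # another worker may have fetched and cached this key.
            if key in self.data:
                return self.data[key]
            value = self.fetch_from_db(key)
            self.data[key] = value
            return value

db_calls = []
def fetch(key):
    db_calls.append(key)   # instrumented stand-in for the database query
    return key.upper()

cache = KeyCache(fetch)
workers = [threading.Thread(target=cache.get, args=("user",)) for _ in range(10)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Only the first worker to acquire the lock pays for the database query; everyone else finds the value on the second check, which is exactly the before-and-after the traces showed.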

How to Do It (OpenTelemetry)

Hopefully, that has maybe motivated the idea of why some of this stuff is useful to you. Now I want to talk about, in practical terms, how you should do it today. Basically, I’m here to tell you the good news about OTel. OTel is an open source, vendor independent standard for data formats and protocols for telemetry. There are standards for traces. That’s where it got its first start, based on the merger of OpenTracing and OpenCensus. Metrics are close. In some languages, they’re in pretty good shape. Logs are coming along. The idea is that there are libraries in basically every language under the sun. The Cloud Native Computing Foundation is the open source organization that runs Kubernetes. OpenTelemetry is the second largest project after Kubernetes. There are hundreds of developers contributing to it. There are dozens of special interest group meetings every week. It’s out there building the tooling to do this telemetry. Also, as part of that tooling is the OpenTelemetry Collector, the OTel Collector, as we call it. It’s a processing proxy. It has receivers that can pull data or accept data from all sorts of sources. It has processors that can process that data. Then there are exporters that can export it. It also can work with Grafana, Prometheus, that kind of thing. You can put it into the world and use the same tools you’re already using. You can also send this data to third-party vendors like my employer.


Where do we do this? We have our stack: the operating system, the container runs on the operating system, and our app runs in the container. As you go lower in the stack, you can get a lot of detail, memory usage, and that kind of thing, but you lose your context. You can ask your operating system, which services am I communicating with? How much memory am I using? It doesn’t tell you which functions are doing that. Whereas if you go higher in the stack, you can get things from the app. You can do auto-instrumentation and you can do manual instrumentation. Auto-instrumentation works great low in the stack. You can go and attach auto-instrumentation at the operating system level or at the container level, and that really can get a lot of information quickly and easily, and sometimes too much information. It’s a good place to start. BPF is the Berkeley Packet Filter. It’s been around for 30 years. eBPF is more recent, and is basically a detailed protocol for reaching into an operating system, getting a lot of useful information, or injecting information in some cases. It’s really interesting as an auto-instrumentation tool that’s growing rapidly in the context of OpenTelemetry. It’s great stuff. We’re seeing a lot of evolution there. It will get much better as you go. Part of my point here is, get started now and that stuff can come along and help you. When we look at auto-instrumenting apps, it depends a lot on what your technical environment is, basically which programming languages or environments you’re running in. If you’re using a VM based system like Java or .NET, that VM is very easily instrumented and controllable. It’s pretty easy to take your Java deploy, add an extra JAR to the list of things you deploy, and that JAR can reach in and watch everything that’s going on in your VM. It can keep track of your HTTP requests, it can keep track of your file system access, all that kind of thing. Again, it also can know about the libraries you’re using. 
Your auto-instrumentation library can say, you’re using Postgres. That means that I can auto-instrument that and keep track of things like, what was your database request?

Auto-instrumentation in those kinds of languages is great. When you look at something like Python or Ruby, what you’re talking about is decorating the functions, the structure of the app, within the Ruby code itself. That’s a monkey patching approach, where your function gets wrapped in another function that auto-instruments it, and then lets it run. These are pretty easy to get spun up as part of your configuration, depending exactly on the language capabilities and what you’re trying to do, but it’s not that hard to do auto-instrumentation in those languages. Then you have languages like Go and C++, where it’s a compiled language, so there are basically two big methods for auto-instrumentation. One is injection of source code, essentially reverse compiling your source and adding lines of code before compilation. The other is something like eBPF, where the language runtime provides capabilities that the auto-instrumentation system can read. Again, we’re seeing some really good stuff happening here. It’s still early in that space. Everybody wants auto-instrumentation; they want the easy button. Sometimes the easy button is like, hit it again, not that easy. You can get too much. It can be the wrong kind of information for you to care about. Manual instrumentation is still a good thing to do at the app level, and it’s still worth investing in. The nice thing is it intermingles. This is more hands-on. You have to modify your source and redeploy it. It’s the only way to really see what’s happening deep in your code. When I was talking before, I added lines that were around the function calls I care about, but I didn’t add instrumentation at the tightest inner loop that was iterating through 1000 items. You can do this in a way that it interoperates with the auto-instrumentation. The libraries from OTel are pretty easy to use. Again, it depends on your language, depends on the language support and community, and things like that. 
They’re pretty easy to use and are getting easier. One of the hardest parts is actually just learning the vocabulary. You have to learn to think like OTel. Once you learn to think like OTel, then it becomes a lot easier to implement the OTel part.
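For a sense of what the monkey-patching style of auto-instrumentation is doing under the hood, here is a stripped-down sketch in Python. The real OTel libraries do far more (context propagation, exporters, semantic conventions); `traced` and the `spans` list are stand-ins invented for this example.

```python
import functools
import time

spans = []  # stand-in for an exporter; a real SDK would ship these somewhere

def traced(fn):
    """Wrap a function so every call records a span-like dict."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            spans.append({"name": fn.__name__,
                          "duration_s": time.monotonic() - start})
    return wrapper

def load_user(user_id):
    return {"id": user_id}

# "Monkey patching": swap the original for the wrapped version at runtime.
load_user = traced(load_user)
result = load_user(42)
```

An auto-instrumentation library does this replacement for you, for every function or library call it knows about, which is why it can produce too much as easily as too little.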

Just as a way of illustration, this is just a quick snippet of JS Node code. We just have to require the OpenTelemetry API, that was pretty easy. Then we can start a span here on line eight. We get a tracer. A tracer is a thing that generates traces. You get a tracer, you give it a name, and then you start a span. Then after you do the work you want to do, you end the span. When you have a nested situation, instead of getting a new tracer when you make a call, you’re going to start a nested span. It will reach out into the system and say, what’s the current span? Then it makes a child of that current span. That’s basically all you need to do. There are a couple lines of code you wrap around the things you care about. If your language has the concept of defer or destruction, then it’s pretty easy to do it in one or two lines right next to each other. You start a span, and it automatically ends at the end of the function, that kind of thing. Basically, this is a straightforward way to do this. The fun part is, this can work client side, too. You can do this in your clients, for mobile apps, for web apps. You can give those applications the ability to start a span and send that span, and watch what’s going on in your user data. This isn’t RUM or analytics. You’re not yet at the level, but I think we’re going to get there, of being able to replace all your RUM tools for monitoring user behavior. Theoretically, you could do that within the context of OpenTelemetry, you’re just probably going to, today, have to roll most of it yourself. We’ll get there, though.
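The Node snippet itself isn’t reproduced here, but the start-a-span, find-the-current-span, make-a-child mechanics it describes can be sketched with Python’s `contextvars`. This is an illustration of the pattern, not the OTel API; `start_span` and the ID scheme are invented for the example.

```python
import contextvars
import itertools
from contextlib import contextmanager

_current = contextvars.ContextVar("current_span", default=None)
_ids = itertools.count(1)
finished = []   # stand-in for an exporter

@contextmanager
def start_span(name):
    parent = _current.get()   # "what's the current span?"
    span = {"name": name,
            "span_id": str(next(_ids)),
            "parent_id": parent["span_id"] if parent else None}
    token = _current.set(span)   # this span is now the current span
    try:
        yield span
    finally:
        _current.reset(token)    # restore the parent as current
        finished.append(span)

with start_span("handle_request"):      # root span
    with start_span("query_database"):  # finds the root, becomes its child
        pass
```

A `with` block here plays the role of defer/destruction in the talk: the span ends automatically when the block exits.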

What About Async?

Pretty much everything I’ve talked about so far has been mostly synchronous. In an async world, you’re doing either something like Kafka, where you’re putting things into a queue and then pulling them out sometime later, or you’re just doing async function calls. Your root span is up here and it’s done. Meanwhile some other task is over here, filling in the rest of the screen. Async architecture is actually a really good model for a lot of things. It can be more resilient. You’re not making users wait for things. You’re not making processes wait for things. It adds complexity, though, in terms of your mental model of the system, depending on so many things having to do with the languages you’re working in, and what exactly you’re trying to do. If you’re used to thinking about debuggers, “I just spun up this process, and I’m going to put a breakpoint in the thing that started the process.” That does no good at all for the process that’s happily running in the background while your debugger has stopped. How do you see what’s going on? The way to think about it is not so much about hierarchy, but about cause and effect. What caused this, and what was the effect of it? This is where you apply context: adding context to that trace. Where did it come from? What was the source of this information? You add metadata to this information.

Tracing still isn’t ideal, because it’s not necessarily about cause and effect. Tracing is largely built around this idea of a hierarchy, but it’s still better than logs. If you’re talking about stuff that’s happening quickly, it’s async, but it’s almost the same thing, then you’re probably fine with just saying, start a trace on that receive. Then this stuff that happens over here, it isn’t contained within the parent, but it’s still related to the parent. Your biggest question is deciding what it is you’re going to have as your parent-child relationship. Is dequeue a child of enqueue, or is it a child of your original request? That’s just for you to decide, and whatever you want to think about. It’s fine either way. If you have something like Kafka, where that thing might be sitting in Kafka for seconds or days, depending on your queuing and stuff like that, you don’t really want to try to connect those traces. That’s going to be really hard to think about. A trace that lasts a million seconds is not really easily analyzed. You’re better off using what’s known in OpenTelemetry as span links. You have two traces here, receive is doing its thing, and then it’s handing it off to Kafka. Sometime later, the dequeue comes along and it processes that result. As a part of the request, we included the metadata about the trace that caused it. Now in the dequeue request, we have the information we need to put a link back to the source trace. Then our tooling can connect those for us. We can get from here, “This one failed, why did it fail? Let’s go back and look at what caused it.”
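A minimal sketch of the span-link idea, assuming a simple in-memory queue standing in for Kafka: the producer attaches its trace identity to the message, and the consumer starts a fresh trace that links back to it. Field names like `link` and the trace IDs here are made up for illustration.

```python
queue = []  # stand-in for Kafka

def enqueue(payload, trace_id, span_id):
    """Producer: carry the producing trace's identity as message metadata."""
    queue.append({"payload": payload,
                  "link": {"trace_id": trace_id, "span_id": span_id}})

def dequeue():
    """Consumer: start a fresh trace, with a link back to the cause."""
    msg = queue.pop(0)
    span = {"name": "process_message",
            "trace_id": "trace-b",       # a new trace, not the producer's,
            "links": [msg["link"]]}      # but we can still walk back to it
    return span, msg["payload"]

enqueue({"order": 7}, "trace-a", "111")
span, payload = dequeue()
```

Neither trace stretches across however long the message sat in the queue, but the tooling can still answer “this one failed, what caused it?” by following the link.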

OTel Baggage

The way information flows between processes in OTel is called baggage. There are standard headers, and this service over here can make a request, and this service over here can receive that request. The headers will contain the information from the source request, namely, the trace ID, the parent ID, that kind of thing. Various OTel libraries will pick that stuff up automatically, which is great until it surprises you. Because they’ll also bundle it automatically. Now you make a call to a third-party API, or worse still, your customer makes a call into your API. They included their baggage, and you picked it up automatically, and now your trace is based on somebody else’s root span. You need to think about baggage from the point of view of: it’s easy to start up, it’s easy to use within your network, but once it leaves your network, you probably want to filter it out on the way out, or make sure that it doesn’t leak into your systems. You also need to be aware of this with things like mesh management tools such as Istio. We had one customer who was saying, how come my traces have 11 million spans? It was because their mesh system was injecting the same span ID into everything. Yes, that can be fun. Tests are systems too. You’re going to want to filter that stuff out, but it works across services. Again, it’s a standard. This is part of what I’m trying to sell you on here. Using the standard gives you all of this stuff. It comes along, and you can understand it as you need to. It’s there. It’s available. It’s in the tooling. It’s being thought about and being improved over time.
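One way to keep trace context and baggage from leaking out of your network is to scrub the standard W3C propagation headers (`traceparent`, `tracestate`, `baggage`) from outbound third-party calls. A minimal sketch; the header values shown are made up:

```python
# The standard W3C Trace Context / Baggage header names.
PROPAGATION_HEADERS = {"traceparent", "tracestate", "baggage"}

def scrub_outbound(headers):
    """Drop trace-context headers before calling a third-party API."""
    return {k: v for k, v in headers.items()
            if k.lower() not in PROPAGATION_HEADERS}

outbound = {"Authorization": "Bearer ...",
            "traceparent": "00-4bf92f3577b34da6-00f067aa0ba902b7-01",
            "baggage": "tenant=internal"}
clean = scrub_outbound(outbound)
```

The mirror image, stripping or validating these headers on requests arriving from outside your network, protects you from inheriting somebody else’s root span.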

Architectural Strategies

Let’s bring this back to architecture, because one of the things about architecture is we’re talking about planning. We’re talking about construction. We’re talking about maybe new ideas. It’s not always about things we already have. That may be the architecture we deal with, but sometimes you want to think about it from the beginning. Part of the question is, when do we start thinking about it? First of all, any time you’re thinking about it, I do think you want to be planning for observability. You want to be thinking, how do I know? How do I know this is true? Within my company’s GitHub template, there’s a little line in there that says, how will we know this is working in production? That’s meant to prompt you to say, what are you building into this pull request for observability to make sure the thing you say it does, it actually does. Plan for it. One of the things that’s interesting about any tracing architecture is the problem of, how do you propagate the information within your process, and within your application? This is one of those places where I will take off my software engineering hat and say, yes, a global variable, that’s actually a pretty good model. In Go, we stick it on the context, typically. You have a context that gets passed around anyway, and so now the active trace is on the context. The call can say, give me the current context, and then make a child span. You see that pattern in a lot of the different languages and a lot of the different frameworks. There’s some global floating somewhere that’s tracking this stuff for you. Don’t freak out, it’s ok.

The mistake I talked about earlier, the one with the serialized cache lookups, was detectable just on my desktop. I was sending traces from the app that I was running on my own machine, and I’m just looking at one trace. I’m not looking at traces in volume, and I’m not looking at overall performance at that point. I was just looking at, what is this particular instance of this particular application doing right now? That’s valuable. It’s a debugging tool. It’s a more powerful debugging tool than starting a debugger and trying to walk through the logic, because a debugger is not going to show you any kind of parallelism or interaction. This is the exact problem that goes away when you try to study it in detail, because you’re not sending massive quantities of duplicate data at it. Working with observability from the very beginning of a project can be valuable. I worked on another project where one of the first things I did was stand up a dummy app and add some dummy observability. You can do it in OpenTelemetry. You can do it with the collector. You can do it with a trace viewer that runs on the command line.

Observability Strategies

Telemetry is cheap. It’s basically cheaper than engineers, anyway. Err on the side of recording too much and then trim it down. You can trim it down either at the source, or after the fact with a sampling strategy. Send it all to a local collector: start up a collector, run it, have everything go there, and then you can forward it. Your plan for deployment is that you have a collector running in your network, but you can also do that locally very easily. I’m also here to say, deemphasize dashboards. Dashboards don’t tell you enough. Get the ability to ask the question you really want to know. Don’t ask these binary questions of just, is it up or is it down? Expect degradation, and have enough information to detect degradation when it happens. My coworker Fred said the other day that p95 isn’t going to show you that. We were looking at a thing where you had a small number of users with a spike in duration. Your average query time is going to be fine, but you’ve got a few customers who are having a bad day, and a metric won’t show you that.
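Fred’s point about p95 is easy to demonstrate with numbers: a handful of badly-served users can sit entirely above the 95th percentile, so the aggregate looks healthy while those users have a terrible day. A toy example (the naive nearest-rank percentile here is for illustration only):

```python
def percentile(values, p):
    """Naive nearest-rank percentile, good enough for the demo."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[idx]

# 97 fast requests from happy users, 3 slow ones from a single unlucky user.
durations = [0.1] * 97 + [9.0, 9.5, 10.0]
users = ["ok"] * 97 + ["unlucky"] * 3

p95 = percentile(durations, 95)   # still 0.1: the spike hides above p95
slowest_by_user = {u: max(d for d, uu in zip(durations, users) if uu == u)
                   for u in set(users)}
```

A pre-aggregated p95 metric throws away the per-user dimension; keeping the raw events (or a sample of them) lets you ask the question sliced by user after the fact.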

The Collector (Receivers, Processors, Exporters)

Let’s talk about the collector. Collectors have receivers. Receivers collect data. They can either be passive, which means they are an endpoint, and you have systems that point at them and send to them, that’s the way OpenTelemetry normally works. You have an OTel receiver, and you send it OTel data. They can also be active receivers, which reach out and pull data out of the cloud or out of an S3 bucket or whatever it might be. The collector can do that, too, it can be active at receiving data. The point is that it’s getting data into a pipeline. Basically, the job of the collector is to build a pipeline, or multiple pipelines, for data. There are 75 or 100 plugins supported for the collector, and there are probably more that people have written for themselves. Processors are the next step in the pipeline. After you’ve collected the data, you can run it through processors, and there are all sorts of processors that can do aggregation of data. They can filter it and sample it and redact it, and take out PII, and that kind of thing. Or remap it: this thing over here sends telemetry and the field name is wrong, so I’m going to change the field name so that my upstream systems all get the same value. Then, finally, the exporters take all the post-processed data and they can send it places. You can send it to your local Prometheus instance. You can send it to your third-party vendors. Some vendors have custom exporters that are designed to transform things into whatever format they want. Some vendors just accept OTel. In either case, you’re just going to set it up. You can also save it to a file. You can send it to a database. There’s a bunch of different exporters available.
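A collector pipeline wires those three stages together in its configuration. This is an illustrative fragment, not a complete config; `otlp`, `batch`, `attributes`, and `debug` are standard component names, but the PII field being redacted is invented for the example:

```yaml
receivers:
  otlp:                  # passive: an endpoint that systems send to
    protocols:
      grpc:
      http:

processors:
  batch:                 # aggregate telemetry into batches before export
  attributes:
    actions:
      - key: user.email  # hypothetical PII field: redact before it leaves
        action: delete

exporters:
  debug:                 # print to stdout; swap for Prometheus, a vendor, etc.

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [debug]
```

The same file can define multiple pipelines, so metrics and logs can flow through different processors and out to different destinations.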


Sampling is the idea that, ok, now I have a firehose, and I probably don’t need the entire firehose. The idea is you send the full firehose to the collector, and then you can filter it out. The simplest model of sampling is pure mathematical deterministic sampling. You just use the trace ID as a random number. Actually, it’s defined to contain at least 7 bytes of randomness. Now you can choose, based on that trace ID, some fraction that you’re going to keep, and you will get consistent sampling across your system. Dynamic sampling is something that is less common and harder to do, but it’s what my employer does. That’s what I work on personally: this idea that you can keep the stuff that matters to you, and drop the stuff that doesn’t matter enough. There’s a Tolstoy line: every happy family is alike, but every unhappy family is unhappy in its own way. You keep all the unhappy ones, and you keep only a small fraction of the happy ones. That’s what that is useful for. That’s the collector.
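Deterministic sampling can be sketched in a few lines: treat the trailing 7 bytes of the trace ID (the part specified to carry randomness) as a number, and keep the trace if it falls below a threshold. Every service that applies the same rule keeps the same traces, so sampled traces stay complete across the system. A sketch, assuming hex-encoded 128-bit trace IDs:

```python
def keep(trace_id_hex, sample_rate):
    """Deterministic head sampling: interpret the trace ID's trailing
    7 bytes (56 bits) as a number and keep 1-in-sample_rate of them."""
    threshold = 2 ** 56 // sample_rate
    value = int(trace_id_hex[-14:], 16)   # last 14 hex chars = 7 bytes
    return value < threshold

# Every service computes the same answer for the same trace ID, so a
# trace is kept (or dropped) whole across the entire system.
low = keep("0" * 32, 4)    # smallest possible value: kept
high = keep("f" * 32, 4)   # largest possible value: dropped
```

Dynamic sampling replaces the fixed threshold with rules about what the trace contains (errors, latency, a rare customer), which is why it has to happen after the spans arrive rather than at the source.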

Service Level Objectives (SLOs)

The other thing that’s really interesting about telemetry is the idea of using it in real time to keep track of what’s going on in your production systems. This is where you establish service level objectives. Here, what you’re doing is you’re measuring the performance of your system from the point of view of an observer who cares about it, like a customer. You’re ideally measuring experience. Response time, for example, is a good service level indicator. Then you can filter that using a binary filter: the response time was under 2 seconds, therefore it’s good. If it’s over 2 seconds, then it’s not good. It becomes a service level objective when you take that number and say, ok, I want that number to be 99.9% over a month. Now you can look and you could say, what’s my error rate? Things are performing well, things are going great. That thing is running along at 99%, 100% of my budget for the month, but now we have a bad time and my database slows down. Now I start having long response times, that SLI starts triggering, and I’m burning through an error budget. Now you can warn yourself. You can say, I’m going to burn through this budget in the next 24 hours, I probably better work on this tomorrow. Or if it says you’re burning through your budget and you’re going to be gone in four hours: ok, I have a real problem, wake somebody up. That’s what SLOs can do for you. It lets you have a budget for your failure. Again, this is about resilience. This is not about not having failures, it’s about reacting to them appropriately, in a way that’s sustainable for us as humans.
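The error-budget arithmetic behind those warnings is simple enough to sketch. With a 99.9% target over a month of 10 million requests, the budget is 10,000 allowed failures; divide what’s left by the current burn rate and you know whether this can wait for tomorrow or somebody gets woken up. The function names and numbers here are illustrative:

```python
def error_budget_remaining(target, total_events, bad_events):
    """Fraction of the window's error budget still unspent."""
    allowed_bad = (1 - target) * total_events
    return 1 - bad_events / allowed_bad

def hours_to_exhaustion(remaining_fraction, burn_per_hour):
    """burn_per_hour: fraction of the whole budget being spent per hour."""
    return remaining_fraction / burn_per_hour

# 99.9% over a month of 10M requests allows 10,000 failures.
remaining = error_budget_remaining(0.999, 10_000_000, 4_000)  # ~0.6 left
hours = hours_to_exhaustion(remaining, 0.05)  # at 5%/hour, ~12 hours left
```

Alerting on the burn rate rather than on the raw SLI is what turns “something is degraded” into “this degradation matters soon enough to page on.”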

Working In a Legacy System

When you’re working in legacy code, when you’re working in a legacy system, where you start is with a collector. Give yourself a place to send the stuff you’re about to send, then you add observability incrementally. If you can apply an easy button, say you’re working on a legacy Java code base, you can go and turn on that auto-instrumentation for Java with like one line, and you deploy. Now you’re going to start getting stuff. Maybe you’ll get too much stuff, but you can adjust that. You can generate metrics and traces. You can use that information to pull out a service map. Now it’s like, ok, that took a long time but I don’t know why, because I can’t see inside the code. Now you can go in and add explicit spans for the things that are mysterious, the things you have questions about. We have a little Slack channel called the Swear JAR in our telemetry team, and basically, people post there whenever they wish they had something they don’t have. Somebody who’s got a couple of hours in between their on-call shifts, or whatever, can go in and say, I can add telemetry here. Keep track of things you’re curious about that you don’t have the answers to. Again, the question you want to ask is, how do I know this is working in production?

Architect for Observability

This is what I mean about architecture: you’re architecting for observability. It doesn’t change the way you architect the rest of your system. It means you need to think about it throughout your process, from the beginning to the end. Nearly everything is important enough to decorate with telemetry. The only reason you might not want to is because a path is so performance-critical or so core. Basically, if you develop the muscle of adding telemetry as you’re in there working, you’re going to appreciate it as you go. Again, pass the metadata along when you make calls across processes, across functions, and things like that. Then use the standards. OTel is there. OTel is going to get better. Keep going. Record enough information. Seriously, just use OTel. It’s still evolving. It’s not perfect yet, but it’s getting there. If you want to come along, my team, my company goes to a couple of dozen OpenTelemetry meetings a week, and we’re participating in all of these different things. There’s a group for your language of choice, your platform of choice. There are a lot of places where you can help. If you have some bandwidth to do some open source work, OpenTelemetry is a great place to think about.
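The "pass the metadata along" advice is what the tracing world calls context propagation. As a rough sketch of the idea, here is trace context carried across a process boundary in request headers, in the spirit of the W3C `traceparent` header that OTel propagates for you. The helper names and the context dictionary are illustrative; real propagators handle versions, flags, and vendor baggage.

```python
import secrets

def inject(ctx, headers):
    """Write the current trace context into outgoing request headers."""
    headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-01"
    return headers

def extract(headers):
    """Read the trace context back out on the receiving service."""
    _version, trace_id, span_id, _flags = headers["traceparent"].split("-")
    return {"trace_id": trace_id, "span_id": span_id}

# Service A starts a trace and calls service B with the context attached;
# B extracts it so its spans join the same trace instead of starting fresh.
ctx = {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}
outgoing = inject(ctx, {})
incoming = extract(outgoing)
```

If you use OTel's instrumentation libraries, this happens automatically for common HTTP and messaging clients; the sketch only shows what is traveling on the wire.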

Infrastructure is Architecture

Your infrastructure is your architecture. Your business logic is your architecture, but so is your infrastructure. Put it in a place where you can control it, and distribute it from there. You can start today. You can literally make something work today.

Questions and Answers

Participant 1: We started looking into observability and introducing traces to Kafka processing. One of the things we saw is that it’s very challenging when we process batches. If you process messages one by one, it makes sense. If you start processing a batch of 1000 messages, and then the next consumer also processes 1000 messages, it starts to be really challenging, and we didn’t see literature that talks about it.

Quirk: One of the things that we recommend is, you get those batches, and if you try to make the batch a root span, you might end up with a 4-hour root span, which is really hard to track. One of the techniques we tend to recommend is that you put a limit on your trace size and duration. You set things up so that, essentially, I’m going to process until I get to 100 messages, or until I get to 5 minutes, or whatever that might be. Then you end that trace and start a new trace, so that you have smaller chunks of traces, but you decorate all of those traces with your batch ID. Your batch ID is not your trace ID; your batch ID is found in all of them. You can do a query and say, show me everything from this batch. You’ll find that there are 4, 5, 10, 1000 sequential traces within that batch, each of which handles a bucket of related things. It’s a way to make sure the telemetry keeps flowing, that you’re getting the information along the way. It’s a very custom decision you have to make about doing that. That’s the general technique we’ve found.
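The chunking technique above can be sketched in a few lines: flush the current trace whenever a message count or a time limit is hit, and stamp every chunk with the shared batch ID so a single query can reassemble the batch. The function name, the `batch.id` attribute key, and the limits are illustrative choices, not a fixed convention.

```python
import time

def process_batch(messages, batch_id, max_messages=100, max_seconds=300):
    """Split one long batch into bounded traces that all share a batch ID."""
    traces = []
    current, started = [], time.monotonic()
    for msg in messages:
        current.append(msg)  # real code would also do the per-message work here
        if len(current) >= max_messages or time.monotonic() - started >= max_seconds:
            traces.append({"batch.id": batch_id, "messages": current})
            current, started = [], time.monotonic()  # end trace, start a new one
    if current:  # flush the final partial chunk
        traces.append({"batch.id": batch_id, "messages": current})
    return traces

# A 250-message batch becomes three bounded traces (100, 100, 50),
# all carrying the same batch ID for later querying.
chunks = process_batch(list(range(250)), "batch-42")
```

Querying on `batch.id` then returns every chunked trace for the batch, which is the property the root-span approach loses.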

Participant 2: [inaudible 00:47:45].

Quirk: It depends a lot on your programming environment. It’s changing rapidly. For traces, things are pretty mature. In most languages, it works pretty well. For metrics, things are still stabilizing, and a couple of languages are less good than others. I was tearing what little hair I have left out the other day trying to do some OTel metrics in Go. That still needs work. Logs are still further behind. In particular, when we start talking about things like sampling logs, that’s not really there yet. It’s still an evolving process. I still think it’s worth investing in it now and doing what you can with the tools you have. Start with tracing, get that working. Maybe you want to wait a little bit on the rest.

