MMS • Frederic Branczyk
Branczyk: Meet Jane. Jane finds herself thinking about a new business idea. Jane does come up with a new business idea. Without further ado, Jane starts coding, starts writing a bunch of software, starts putting it on servers, people start using it. It feels like overnight, there’s a huge amount of success. Thousands of people start using the software, and everything just goes through the roof. All of a sudden, we see unexpected things happening in our infrastructure. Maybe we put a little bit of Prometheus alerting into place for some high-level alerting. We put all of this on Kubernetes as we do these days, and we get a high CPU alert, for example. In all of our success, we were not thinking about observability enough. Now we’re in the situation where we are maintaining a production service, but have no insight into what’s happening live. This really happens to all of us. Whether it be starting a new company, starting a new project, creating a new service within a company, companies get acquired, code bases get inherited, probably all of you can relate to this problem. A lot of you probably have even experienced one or the other version of this, where now you’re in charge of maintaining a piece of software that you don’t have the amount of visibility into that you would like, and maybe even that you need.
I’m here to tell you why you should be sprinkling some eBPF into your infrastructure, and how and why you can be confident in this. In this case, I’m deploying a profiling project called Parca, it’s open source, onto an existing Kubernetes cluster. I waited for a little bit of time for data to be generated. All of a sudden, without ever having touched any of these applications that are running on my Kubernetes cluster, I can now, down to the line number, explore code, and where CPU time in this entire infrastructure is being spent. I can look at Kubernetes components. I can look at other infrastructure. I can look at application code, everything. I didn’t do anything to my code, but I was still able to explore all of this at a very high resolution, and in a lot of detail.
My name is Frederic. I am the founder of a company called Polar Signals. We’re a relatively early-stage startup. I’ve been working on observability since 2015. I’m a Prometheus maintainer. I’m a Thanos maintainer. Prometheus is an open source monitoring system modeled after internal Google systems. Thanos is essentially a distributed version of Prometheus. Both of these are Cloud Native Computing Foundation projects, CNCF. Also, for several years, I was the tech lead for all things instrumentation within the Kubernetes project. I am actually the co-creator of the project that I just showed you. As I said, everything that I just showed you is open source. As a matter of fact, everything I’ll be showing you is open source.
Let’s talk about eBPF. I want to make sure that we all have the same foundations to talk about the more advanced things that I’m going to dive into. I’m going to start from zero. Basically, since forever, the Linux kernel itself has been a fantastic place to implement observability. It always executes code. It knows where memory is being allocated. It knows where network packets go, all these things. Historically, it was really difficult to actually evolve the kernel. The kernel was on a release schedule. The kernel is very picky, and rightfully so, about what is being included in the kernel. Millions of machines run the Linux kernel. They have every right to be very picky about what goes in and what does not. eBPF, on a very high level, allows us to load new code into the kernel, so that we can extend and evolve the kernel in a safe way, but iteratively and quickly, without having to adhere to kernel standards, or do things in a certain way.
What is an eBPF program? At the very bare minimum, an eBPF program consists of a hook: something in the Linux kernel, maybe a syscall like we’re seeing here, that we can attach our eBPF program to. Essentially, what we’re saying is, every time this syscall is being executed, first, run our eBPF program, and then continue as usual. Within this eBPF program, we can do a whole lot of things. We can record the parameters that were passed to the syscall. We can do a bunch of introspection. What we can do is also very limited. The first thing that I was talking about were hooks. There are a lot of predefined hooks in the Linux kernel, like I said: syscalls, kernel functions, predefined trace points where you can say, I want to execute this eBPF program when this particular trace point in the kernel is passed. Network events. There are a whole lot more. Probably there is already a hook for what you’re looking for. The reason why this ecosystem is already so rich, even though eBPF is a relatively young technology, is that these things have always been there. The kernel has always had these hooks. eBPF is just making them accessible to more people than just those working directly on the kernel. Then the kernel allows you to create a bunch of custom hooks. Kprobes essentially let you attach your eBPF program to any kernel function that’s being called. Uprobes let you attach your eBPF program to a function in a userspace program, the applications we typically write. Or perf_events, which essentially work on a counter overflow: saying, every 100 CPU cycles, run this eBPF program. We’ll see a lot more of perf_events in particular, because this is the one that I happen to be most familiar with.
Communicating With Userspace, and Compiling eBPF Programs
How does an eBPF program even communicate to us normals? The way this works is, like I said, there’s a hook, and the hook can be either something from userspace or from kernel space, which is why it’s here in the middle of everything. It runs our eBPF program, which definitely runs in kernel space. It can populate something called eBPF maps. These are essentially predefined areas of memory that we also allow userspace programs to read. This is essentially the communication means. Userspace programs can also write to these eBPF maps, and eBPF programs can also read from them, so the communication is bidirectional. This is essentially our entire communication mechanism with eBPF. In order to compile a piece of code to be an eBPF program, we use an off-the-shelf compiler. LLVM and clang are the standard tool chain here. GCC does have some support for this as well. For some of the more advanced features, which we’ll be looking at as well, LLVM is really the thing that you want to be looking at. Typically, if we have an AMD64 machine on Linux, this may be the target that we’re using to compile our programs to. With eBPF, we have a new target, because eBPF essentially defined its own bytecode. Maybe you’re familiar with bytecode in the Java Virtual Machine, where we also pre-compile something to Java bytecode. Then this bytecode is what the Java Virtual Machine loads, and ultimately executes. eBPF works incredibly similarly to that. We take our compiler, clang, here, and say, I want this piece of code to be BPF bytecode once you’ve compiled it. Then we can take that binary and load it and attach it to a hook.
Ensuring It’s Safe to Load
That’s essentially all that a userspace program has to do. It needs to have some eBPF bytecode that it wants to load. We’ve already compiled this with our previous command, and now we’re handing it off to the kernel. The first thing the kernel does is, it verifies that this bytecode that you’ve just handed it will definitely terminate. If you’ve done computer science in university or something, you may be familiar with something called the halting problem, where, essentially, we cannot prove in general whether a Turing-complete program will halt.
There’s theory behind this. Basically, what the kernel does is start to interpret the bytecode, and make sure that it doesn’t do anything that potentially could cause the program to be unpredictable. Essentially, it unrolls all loops. You cannot have potentially endless for loops. There is a whole set of things that this verifier restricts you from doing. Essentially, what we end up with is something where we can write arbitrary code, and load more or less arbitrary code into the kernel. The kernel ensures that whatever we’re loading is not actually going to impact the kernel enough that it would be totally devastating, or at least, it attempts to do that. Once the kernel has verified this, it uses a just-in-time compiler to turn the generic bytecode into machine-executable code that our hardware understands how to execute, just like any other program, essentially, at that point. We have ensured that what we’re executing is definitely safe to be executed.
In the early days of eBPF, you would often find something like this. Here, we can see an eBPF program at the top that actually is attached to a perf_event. Essentially, what we’re saying is up to 100 times per second, call this eBPF program. We’ll go into later why this might make sense. I want to focus our attention on the later thing in this slide. What we used to do, using a tool chain called BCC, the BPF Compiler Collection, we essentially wrote C code, but in essence, it was just a templateable string. We would literally string replace something that we would want to configure with the actual value that we would want, throw it through clang, and then actually do the loading at runtime, like I just showed you. This was fine for a little bit while we were exploring eBPF. Over time, as we realized this is pretty solid technology, we want to continue evolving this. We realized that this is not going to be all that good in the long run. For one, if we’re doing string templating, we’re losing essentially all ability to know whether this code that we’re producing is valid at all. Maybe worse, we could have injection attacks and execute arbitrary code in the Linux kernel. A lot of really terrible things could happen, apart from this being pretty difficult to get right. Then also, and this is maybe more of an adoption/convenience hurdle, but you actually have to have a compiler locally on every machine that you want to run eBPF code on. That also means you have to have all the libraries, all the headers, all of these things available locally. This can be a huge burden on people running your software. This resulted in eBPF programs that were not very portable, and overall, pretty brittle and complex to use.
Compile Once, Run Everywhere (CO:RE)
The eBPF community came up with a very catchy acronym, compile once, run everywhere, CO:RE. Essentially, the goals were pretty simple. All of the things that were previously causing a lot of difficulties, we wanted to get rid of, so that we could truly just compile our eBPF bytecode once, and then run it everywhere. That is very easy to say, and very hard to get right. Ultimately, reiterating here, the goals were: we did not want to need a compiler on the host, and we did not want kernel headers, or any headers for that matter, to have to be on that host. We want to be able to configure our eBPF programs, at least to some degree, ideally to the degree that we could before with string templating. How do we create portable CO:RE eBPF programs? One change, once this was introduced, was that we needed to start using these somewhat specific function calls if we wanted to read data from kernel structs. Part of the whole problem of needing kernel headers was that we needed the kernel headers in order to say, what is the layout of these structs when we get them in memory? How do we read the right information from memory, essentially?
Compile once, run everywhere introduced a set of functions and macros to make this easier. We’ll see how this works. At the end of the day, essentially, the way it works is that we parse some version of kernel headers. In this case, it’s actually in the format of what we call BTF, BPF Type Format. We parse that at compile time, and it essentially gets included into our BPF bytecode: the layout of the kernel that we compiled it with, essentially. Then, when the BPF program is loaded, or wants to be loaded by our userspace program, we actually look at the locally available BPF Type Format and modify the bytecode so that it now matches the Linux kernel that we actually have locally. The first time I heard this, I thought this sounds completely insane, and that it could never work. I don’t know what to say, kernel developers are pretty genius. They did make all of this work. It has truly worked like a charm ever since it was implemented.
Let’s go through this one more time now with compile once, run everywhere. Again, we pre-compile our program using clang, with the target being BPF. We have our bytecode already. Our userspace program first modifies our bytecode using the locally available BPF Type Format information, to relocate the struct accesses so that they match the way things are actually laid out on our local machine. The userspace program then takes that now modified bytecode and passes it to the kernel. The kernel then verifies that bytecode, and then, just as before, just-in-time compiles it to be executed locally. Now we’ve actually created portable BPF programs. In order to be able to do this, I highly recommend using at least clang version 11. With earlier versions, some of this may actually work. For some of the more advanced features, you definitely want version 11 or later. We actually use a much newer version than 11, but require an absolute minimum of 11.
Then the whole magic essentially happens in a library called libbpf. Theoretically, all these relocations could be reimplemented in just about any language, and could be completely custom. It turns out that a standard library has evolved for this, called libbpf. There are wrappers in various languages; there’s one for Go, there’s one for Rust. I think these are the only ones that I’m aware of. I wouldn’t be surprised if there were more packages out there that use libbpf from other languages. The really cool thing is that now we can finally do what eBPF was always promising: we can compile our eBPF programs, ship them side by side with our userspace program, or maybe even embed them into the binary of our userspace program, so that it can load them into the kernel locally, wherever it’s being executed.
What does our previous example look like with compile once, run everywhere, then? As we can see, we are no longer doing string replacements. We define a constant with the process ID. Previously, we were doing string replacements to template the process ID into the program and then compile it. Now we’ve already compiled this code, and we already have it as BPF bytecode. Our userspace program is only replacing the constant within the BPF bytecode with the new PID that we want to target. Then it loads it and attaches our BPF program to some hook that we specify. Some stats: we no longer need clang on a host. This means that we’re saving up to 700 megabytes in artifact size. We do not need kernel headers locally, because all of these things are either located in BPF type information, or we don’t need them at all. In total, we’ve potentially saved up to 800 megabytes of space that we no longer need. Interestingly enough, it doesn’t really stop here: for various reasons, compile once, run everywhere programs are also more efficient. There are lots of reasons for this. Essentially, it’s better on every dimension. It’s more portable. Artifacts are smaller. It uses less memory at runtime. Feels like magic.
Language Support with eBPF
eBPF is portable. This was one of the core things that I wanted to talk about, because I think it’s important to have a good foundation of the technology that you may either be developing with, or more likely that you may be operating. I want to give you a framework of what tools are using the technologies that we should be using, and give you a framework to understand how all of these things fit into the eBPF ecosystem. Thanks to compile once, run everywhere, we’re actually able to distribute eBPF based observability tools with single commands, like the one that we’re seeing here to install it into a Kubernetes cluster, for example. Just like what I showed you at the very beginning, in that demo video. All of a sudden, we go back to the beginning where Jane started a new company. Now there are colleagues, and you’re using all of these technologies, Kubernetes, Spark, Java, C++, Python, and you need visibility into all of those.
I started with the premise that eBPF is here to save the day. I’m going to contradict myself here for a second. The thing that often comes up with eBPF, and I think this is a myth, is that with eBPF, it’s very easy to support all the languages out there. You’ll hear this a lot in the marketing speak for all the eBPF tools. There’s some truth to this. I want to make sure that I give all of you the information you need to understand to what extent this is actually true, or how you can understand whether this thing is just marketing talk, or if this is real. The brutal truth is, language support with eBPF is incredibly hard. In this next section, I’m going to walk through a real-life example. The one that I’m most familiar with is profiling, because that’s what I happen to do on an everyday basis. This is where we’re going to look at how we can actually support lots of languages for profiling with eBPF. Here, I have a snippet of code that actually shows us the very concrete hook that we happen to use for profiling purposes. I want to take one more step back to talk about profiling. Profiling essentially allows us software engineers to understand where resources are being spent, down to the line number. The Linux kernel actually has a subsystem called the perf_event subsystem that allows us to create hooks to be able to measure some of this. We’re going to walk through how this works.
The first thing that we do is we create this Custom Event. Remember earlier when I was saying that you can create custom hooks to attach your eBPF programs to? This is the very example of that. In this case, what we’re saying is, I want you to call my eBPF program at up to a frequency of 100 Hz per CPU. We only want this to happen on-CPU, as we call it. That means we’re not actually measuring anything if the CPU isn’t doing anything, but if it’s doing something at 100%, we’ll be sampling 100 samples per second. Here I have a version of our actual eBPF program with only a few lines highlighted, just for the sake of being able to get it onto this slide. Let me walk you through what we’re seeing here. At the very top, what we’re seeing is just a struct definition, a plain old struct. In this case, what we’re defining is the key that we’re using in order to say, have we seen this stack trace before? The combination that we’re using here is the process ID, the userspace stack ID (the stack of the code that we tend to write ourselves), and the kernel stack ID, the function call stack within the kernel. This is essentially the unique key that identifies a counter.
We’re setting a couple of limits, because all the memory within an eBPF program essentially needs to be predetermined. We need to ensure that it’s exactly the size, and it will only execute so many instructions, and so on. These actually tend to be, relatively speaking, fairly simple programs. The only piece of code that I highlighted here is obtaining the process ID, but it’s also only four or five lines of code. We’re building our key. Our key starts with our process ID. Then we retrieve our stack ID from the userspace stack, and add that to our key. Then we obtain our kernel space stack. Finally, we check if we’ve seen this key before. If we’ve seen it before, we increase it. If we haven’t seen it before, we initialize it with a zero, and then increase it. Ultimately, all we’re doing here is finding out, have we seen this stack before? If yes, we count one up. It’s really that simple. Because we’re sampling up to 100 times per second, that means that if we’re seeing the same stack trace 100 times per second, then 100% of our CPU time is being spent in this function. That’s ultimately how CPU profilers work.
How Do We Get a Stack?
The problem is, how do we as humans actually make sense of it? Because if you’ve ever tried something like this, you will have noticed that the only data we get back from a program like the one I’ve just shown you is that the stack is just a bunch of memory addresses. I don’t know about you, but I don’t happen to know what my memory addresses correspond to in my programs. We’ll need to make sure that these memory addresses actually make sense, so that we can then translate them into something that we humans can understand. The problem is, how do we even get this stack? For this, we need to look back at how code actually gets executed on our machine. You may be familiar with the function call stack. The way it works is that we have these two registers, the rsp register and the rbp register. Rsp is essentially always pointing at the very top of our stack. Essentially, it’s telling us, this is where we’re at right now. Rbp is what we call the frame pointer. The saved rbp values essentially keep pointing down the stack, so that we understand where frames start and end. We essentially just need to keep going through the saved rbp addresses to figure out where all the function addresses are. If we ignore that this is some random memory in our computers, what I’ve just described to you is walking a linked list. We’re just finding where a value is to find out where the next value is. We just keep going until we’re at the end.
Frame pointers are really amazing when they are there. Unfortunately, there is a rather evil compiler optimization, in some compilers even on by default, that omits these frame pointers. What I’ve just described essentially uses two registers to keep track of where we need to go. The thing is, the machine doesn’t actually need this to execute code. We will always have to maintain where we’re at, because the machine needs to be able to place new things onto the stack, but we don’t truly need the rbp register. Essentially, what these compilers are saying is that they would rather use this register, because with more registers they can do more things simultaneously. They would rather utilize this register for performance reasons than allow us to debug these programs very easily. That’s essentially the tradeoff that’s being made here. It turns out, most hyperscalers, Meta/Facebook, and I believe Netflix, actually have frame pointers on for everything. This works out really well when you can dictate this throughout the entire company. At least from what I’ve been told, there were discussions about this internally as well, of course. Essentially, what it always boils down to is, it’s always better to be able to debug your programs than to squeeze out a little bit of performance. The amount of performance that you can get out of an optimization like this is really minuscule.
However, the problem that we’re facing is that hyperscalers may be able to do this, but the rest of the world cannot. We’re at the mercy of whatever binaries we can obtain. Unfortunately, most Linux distributions, for example, omit frame pointers. We have to find a way to walk these stacks without frame pointers. This is where something called unwind tables comes into play. These are standardized, so they are pretty much always going to be there. The way these work is that, given some location on the stack, we can calculate where the frame would end. Then we can keep doing this to figure out where we need to go next. This is an alternative way. What I want you to take away from this section is that this is theoretically possible, but it’s an insane amount of work to get working. Just to give you a data point, we’ve had up to three full-time senior engineers, people who already understand how these things work, working on this for almost a year now. We’re just about there in getting this to work. This all gets extra complicated when we have to do these pretty complex calculations in eBPF space, where we’re restricted in what these programs can do. They have to terminate, and all of these things. Lots of challenges. These unwind tables don’t just exist, of course. There’s a whole lot of effort that you need to go through to generate the tables, make sure that they’re correct, and then ultimately use them to perform the stack walking. If you’re interested in the very nitty-gritty details of how this works, my colleagues Vaishali and Javier just gave a really amazing talk at the Linux Plumbers Conference in Dublin about exactly how this works.
Ultimately, what that means is, we are able to walk and unwind stacks for anything that is essentially considered a native binary, where compilers output a binary for us to execute: C, C++, Rust, Go, anything that compiles to a binary, essentially. Now we can support those languages. How about Java? How about Python? How about Node.js? The good news for anything that is a just-in-time compiled language (Java, Node.js, Erlang, all of these) is that, just like the name says, they are eventually just-in-time compiled to native code. For those kinds of languages, “all we need to do” is figure out where that natively executable code ends up, and wire it up with all the debug information, so that we can perform everything that I just walked you through on that code as well. More complicated, but ultimately doable. Where it gets really complicated is with true interpreters, truly interpreted languages like Python, and I think maybe still currently Ruby, though I think Ruby is getting a just-in-time compiler. For Python in particular, there’s no just-in-time compiler. The only way that we can reconstruct frames and function names and all of these things in Python is to actually read the memory of the Python virtual machine, to reconstruct what pointers into some space in memory correspond to in terms of our function names, or something like that. All of this is doable, but it’s an incredible amount of work to get working.
eBPF Tooling – Pixie
It’s easy to say that eBPF allows you to have wide language support. It turns out, it’s actually incredibly hard. The really amazing and good news is that there is already tooling out there that does a lot of the very heavy lifting for you. One of the ones that I wanted to talk to you about is called Pixie. Pixie is a pretty cool project. It is eBPF based. It has a bunch of features, but I picked out three or four of the ones that I think are the most amazing, where Pixie really shines. It can automatically produce RED metrics: rate of requests, errors, and durations. This is super cool, because it can automatically generate these for gRPC requests, for HTTP requests, all of these kinds of things. It can understand whether you’re talking to a database, and automatically analyze the queries that you’re doing against these databases. It can do request tracing, which ultimately corresponds to the RED metrics. For the other ones, if you’re doing it well, you may be able to get something of similar quality from auto-instrumentation, by dropping a library into your project. All of this works without ever modifying your applications. That’s the whole premise. The thing that I think is really cool is the dynamic logging support. Essentially, you write these little scripts, called Pixie scripts, where you can insert new log lines without having to recompile code. Essentially, it works the same way as what I was just talking about with walking frames. It uses a couple of tracing techniques that are available in the Linux kernel to stop and inspect the stack at a certain point. If you’re reading a variable, for example, it can write out that new log line and continue function execution at that point. You don’t have to deploy a new version just to see if something may be interesting to understand in this function. Of course, with great power comes great responsibility. I thought this was mind blowing when I saw it for the first time.
eBPF Tooling – Parca
I happen to work on the Parca project, where we concern ourselves with always profiling everything in your entire infrastructure, all the time. eBPF is a really crucial piece of this puzzle, because eBPF allows us to do all of the stack walking that I just showed you from within the kernel. The best prior thing that we had was to snapshot the entire stack, copy it into userspace, and unwind it there asynchronously. Now we’re able to do all of this online, so we don’t make a bunch of unnecessary copies of data, copies that may include potentially sensitive data leaving the kernel. We’re not doing any of that, because we’re walking the stack within the kernel, and the only thing we’re returning is a couple of memory addresses that ultimately just correspond to function names. eBPF allows us to capture all of this data, exactly the data we need, at super low overhead. We’ve done a couple of experiments with this, and we tend to see anywhere between 0.5% and 2% overhead doing this. We find that most companies that start doing continuous profiling can shave off anywhere from 5% to 20% in CPU time, just because, previously, we didn’t know where this time was being spent. Because we’re now always measuring it, all the time, it becomes painfully obvious.
eBPF Tooling – OpenTelemetry Auto-Instrumentation for Go
The last thing that I wanted to highlight is something that I think is more experimental. Pixie and Parca are both pretty production-ready. They’re pretty battle-tested. They’ve been around for a couple of years at this point. This one is one of the more recent ones, and I’m still pretty excited about it, although I haven’t made up my mind yet whether this is crazy or crazy cool. I wanted to highlight it anyway, because I think what they’ve done is pretty ingenious. Essentially, what this project is doing is utilizing uprobes, probes that attach to some function name within your Go process, in this case, to figure out where to stop a program and patch distributed tracing context into your application. This solves one of those problems that is notoriously difficult about distributed tracing, which is that you have to instrument your entire application infrastructure to be able to make use of distributed tracing in a meaningful way. I think this is super exciting, but at the same time, this essentially monkey patches your stack at runtime. That’s also super scary, because it can actually break a bunch of other things. It’s also definitely one of the more experimental things out there.
I hope I have shown you that eBPF is here to stay. It’s incredibly powerful as an instrumentation technology. While it is a lot of complicated work to make language support possible, it is possible, but most projects out there aren’t actually doing what I described to you. I think it’s healthy to approach projects that just claim eBPF means lots of language support with some skepticism. Hopefully, I’ve convinced you to sprinkle some eBPF into your infrastructure. I think eBPF is incredibly transformational. It is not magic, even though it often seems like it; it is actually a lot of work to make it useful. There is already eBPF tooling out there that is ready for you to use today.