Presentation: Resiliency Superpowers with eBPF

Liz Rice

Article originally posted on InfoQ.

Transcript

Rice: My name is Liz Rice. I’m the Chief Open Source Officer at Isovalent. I’m also an ambassador and board member for OpenUK. Until recently, I was chair of the Technical Oversight Committee at the Cloud Native Computing Foundation. I joined Isovalent just over a year ago, because that team has so much expertise in eBPF, which is the technology that I’m talking about. I’ve been excited about eBPF for a few years now. From my CNCF work, I’ve seen some of the really incredible range of things that eBPF can enable. I want to share some of the reasons why I’m so excited about it, and specifically talk about the ways that eBPF can help us build more resilient deployments.

What is eBPF?

Before we get to that, let’s talk about what eBPF is. The acronym stands for Extended Berkeley Packet Filter. I don’t think that’s terribly helpful. What you really need to know is that eBPF allows you to run custom code in the kernel. It makes the kernel programmable. Let’s just pause for a moment and make sure we’re all on the same page about what the kernel is. The kernel is a core part of your operating system, which is divided into user space and the kernel. We typically write applications that run in user space. Whenever those applications want to interface with hardware in any way, whether they want to read or write to a file, send or receive network packets, or access memory, all these things require privileged access that only the kernel has. User space applications have to make requests of the kernel whenever they want to do any of those things. The kernel is also looking after things like scheduling those different applications, making sure that multiple processes can run at once.

Normally, we’re writing applications that run in user space. eBPF is allowing us to write programs that run within the kernel. We load the eBPF program into the kernel, and we attach it to an event. Whenever that event happens, it’s going to trigger the eBPF program to run. Events can be all sorts of different things. It could be the arrival of a network packet. It could be a function call being made in the kernel or in user space. It could be a tracepoint. It could be a perf event. There are lots of different places where we can attach eBPF programs.

eBPF Hello World

To make this a bit more concrete, I’m going to show an example here. This is going to be the Hello World of eBPF. Here is my eBPF program. The actual eBPF program is these few lines here. They’re written in C, and the rest of my program is in Python. My Python code here is going to actually compile my C program into BPF format. All my eBPF program is going to do is write out some tracing here, it’s going to say hello QCon. I’m going to attach that to the event of the execve system call being made. Execve is used to run a new executable. Whenever a new executable runs, execve is what causes it to run. Every time a new executable starts on my virtual machine, that’s going to cause my tracing to be printed out.

If I run this program, first of all, we should see that we’re not allowed to load BPF unless we have a capability called CAP_BPF, which is typically reserved for root. We need super-user privileges to run. Let’s try that with sudo. We start seeing a lot of these trace events being written out. I’m using a cloud VM. I’m using VS Code remote to access it. That, it turns out, is running quite a lot of executables. In a different shell, let’s run something, let’s run ps. We can see the process ID, 1063059. Here is the trace line that was triggered by me running that ps executable. We can see in the trace output, we don’t just get the text, we’re also getting some contextual information about the event that triggered that program to run. I think that’s an important part of what eBPF is giving us. We get this contextual information that could be used to generate observability data about the events that we’re attached to.
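
The Hello World described above can be sketched roughly as follows. This is not the exact code shown in the talk: it assumes the BCC Python bindings (the bcc package) are installed, the function names hello and run are illustrative, and the message text is approximate. The C portion is the eBPF program; the Python portion compiles it, attaches it to the execve system call, and prints the trace output.

```python
# Hello World sketch, assuming the BCC Python bindings are available.
# PROGRAM is the eBPF program (C); the rest is the user-space loader (Python).
PROGRAM = r"""
int hello(void *ctx) {
    bpf_trace_printk("Hello QCon!\n");
    return 0;
}
"""


def run():
    """Compile PROGRAM, attach it to execve, and stream the trace output.

    Must be run with root privileges (or equivalent capabilities).
    """
    from bcc import BPF  # imported lazily so this file loads without BCC

    b = BPF(text=PROGRAM)                     # compile the C text to BPF
    syscall = b.get_syscall_fnname("execve")  # kernel symbol name for execve
    b.attach_kprobe(event=syscall, fn_name="hello")
    b.trace_print()                           # blocks; one line per event
```

Calling run() under sudo and then starting any executable in another shell (for example ps) should produce one trace line per execve, including contextual fields such as the process ID.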

eBPF Code Has to be Safe

When we load an eBPF program into the kernel, it is crucial that it’s safe to run. If it crashes, that would bring down the whole machine. In order to make sure that it is safe, there’s a process called verification. As we load the program into the kernel, the eBPF verifier checks that the program will run to completion. That it never dereferences a null pointer. That all the memory accessing that it will do is safe and correct. That ensures that the eBPF programs we’re running won’t bring down our machine and that they’re accessing memory correctly. Because of this verification process, sometimes eBPF is described as being a sandbox. I do want to be clear that this is a different kind of sandboxing from containerization, for example.

Dynamically Change Kernel Behavior

What eBPF is allowing us to do is run custom programs inside the kernel. By doing so, we’re changing the way that the kernel behaves. This is a real game changer. In the past, if you wanted to change the Linux kernel, it took a long time. It required expertise in kernel programming. If you make a change to the kernel, it then typically takes several years to get from the kernel into the different Linux distributions that we all use in production. It can quite often be five years between a new feature landing in the kernel and arriving in your production deployments. This is why eBPF has suddenly become such a prevalent technology. As of the last year or so, almost all production environments are running Linux kernels that are new enough to have eBPF capabilities in them. That means pretty much everyone can now take advantage of eBPF and that’s why you’ve suddenly seen so many more tools using it. Of course, with eBPF, we don’t have to wait for a new Linux kernel to be rolled out. If we can create a new kernel capability in an eBPF program, we can just load it into the machine. We don’t have to reboot the machine. We can just dynamically change the way that that machine behaves. We don’t even have to stop and restart the applications that are running, the changes affect the kernel immediately.

Resilience to Exploits – Dynamic Vulnerability Patching

We can use this for a number of different purposes, one of which is for dynamically patching vulnerabilities. We can use eBPF to make ourselves more resilient to exploits. One example that I like of this dynamic vulnerability patching is being resilient to packets of death. A packet of death is a packet that takes advantage of a kernel vulnerability. There have been a few of these over time where the kernel doesn’t handle a packet correctly. For example, if you put a length field into that network packet that’s incorrect, maybe the kernel doesn’t handle it correctly and perhaps it crashes, or bad things happen. This is pretty easy to mitigate with eBPF, because we can attach an eBPF program to the event that is the arrival of a network packet. We can look at the packet and see if it is formed in the way that would exploit this vulnerability. Is it the packet of death? If it is, we can just discard that packet.
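
As a rough illustration of the kind of check involved, the following is a user-space Python model, not an eBPF program. It assumes a hypothetical vulnerability triggered by an IPv4 length field that disagrees with the number of bytes actually received; the function names is_packet_of_death and make_ipv4_frame are invented for this sketch. A real mitigation would perform the same comparison in a C program attached to packet arrival.

```python
import struct

ETH_HLEN = 14        # bytes in an Ethernet header
ETH_P_IP = 0x0800    # EtherType for IPv4
IP_MIN_HLEN = 20     # smallest legal IPv4 header


def is_packet_of_death(frame: bytes) -> bool:
    """Return True when an IPv4 frame's length fields disagree with its size."""
    if len(frame) < ETH_HLEN:
        return True                       # cannot hold an Ethernet header
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return False                      # this sketch only inspects IPv4
    if len(frame) < ETH_HLEN + IP_MIN_HLEN:
        return True                       # truncated IPv4 header
    header_len = (frame[ETH_HLEN] & 0x0F) * 4           # IHL field, in bytes
    (total_len,) = struct.unpack_from("!H", frame, ETH_HLEN + 2)
    if header_len < IP_MIN_HLEN or total_len < header_len:
        return True                       # impossible length combination
    return ETH_HLEN + total_len > len(frame)  # claims more data than arrived


def make_ipv4_frame(total_len: int, payload: bytes) -> bytes:
    """Build a test frame whose IPv4 total-length field is total_len."""
    eth = b"\x00" * 12 + struct.pack("!H", ETH_P_IP)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, total_len, 0, 0, 64, 17, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    return eth + ip + payload
```

A frame built with make_ipv4_frame(28, bytes(8)) is consistent and would be allowed through, while make_ipv4_frame(1500, bytes(8)) claims far more data than it carries and would be discarded.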

Example – eBPF Packet Drop

As an example of how easy this is, I’m just going to show another example of a program that will drop network packets of a particular form. In this example, I’m going to look for ping packets. That’s the protocol ICMP. I can drop them. Here’s my program. Don’t worry too much about the details here, I’m essentially just looking at the structure of the network packets. I’m identifying that I’ve found a ping packet. For now, I’m just going to allow them to carry on. XDP_PASS means just carry on doing whatever you would have done with this packet. That should emit whatever tracing you get. This is actually a container called pingbox. I’m going to start sending pings to that address and they’re being responded to. We can see the sequence number here ticking up nicely. At the moment, my eBPF program is not loaded. I’m going to run a makefile that will compile my program, clean up any previous programs attached to this network interface, and then load my program. There’s make running the compile, and then attaching to the network interface eth0 here. You see immediately it’s started tracing out, Got ICMP packet. That hasn’t affected the behavior, and my sequence numbers are still just ticking up as before.

Let’s change this to say, drop. We’ll just make that. What we should see is the tracing here is still being generated. It’s continuing to receive those ping packets. Those packets are being dropped, so they never get responded to. On this side here, the sequence numbers have stopped going up, because we’re not getting the response back. Let’s just change it back to PASS, and make it again. We should see, there’s my sequence numbers, there were 40 or so packets that were missed out, but now it’s working again. What I hope that illustrates is, first of all, how we can attach to a network interface and do things with network packets. Also, that we can dynamically change that behavior. We didn’t have to stop and start ping. We didn’t have to stop and start anything. All we were doing was changing the behavior of the kernel live. I showed that as an illustration of how handling packet of death scenarios would work.
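
The decision made by the demo program can be modeled in a few lines of Python. This is a user-space sketch of the logic only: the real program in the talk is written in C and attached to the XDP hook of eth0. The numeric values of XDP_DROP and XDP_PASS match the kernel's xdp_action enum; the names xdp_verdict and make_frame are assumptions made for this example.

```python
import struct

XDP_DROP = 1         # values match the kernel's enum xdp_action
XDP_PASS = 2
ETH_HLEN = 14        # bytes in an Ethernet header
ETH_P_IP = 0x0800    # EtherType for IPv4
IPPROTO_ICMP = 1     # IPv4 protocol number used by ping


def xdp_verdict(frame: bytes, icmp_action: int = XDP_PASS) -> int:
    """Return icmp_action for ICMP-over-IPv4 frames, XDP_PASS otherwise."""
    if len(frame) < ETH_HLEN + 20:
        return XDP_PASS                   # too short to inspect, let it through
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XDP_PASS                   # not IPv4
    if frame[ETH_HLEN + 9] == IPPROTO_ICMP:   # protocol byte of the IPv4 header
        return icmp_action                # "Got ICMP packet"
    return XDP_PASS


def make_frame(protocol: int) -> bytes:
    """Build a minimal Ethernet + IPv4 test frame carrying the given protocol."""
    eth = b"\x00" * 12 + struct.pack("!H", ETH_P_IP)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, protocol, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    return eth + ip + bytes(8)
```

Switching icmp_action between XDP_PASS and XDP_DROP corresponds to the edit made in the demo: ping packets are either allowed to carry on or silently discarded, while all other traffic is untouched.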

Resilience to Exploits – BPF Linux Security Module

We can be resilient to a number of other different exploits using BPF Linux security modules. You may have come across Linux security modules, such as AppArmor, or SELinux. There’s a Linux security module API in the kernel which gives us a number of different events that something like AppArmor can look at and decide whether or not that event is in or out of policy and either allow or disallow that particular behavior to go ahead. For example, allowing or disallowing file access. We can write BPF programs that attach to that same LSM API. That gives us a lot more flexibility, a lot more dynamic security policies. As an example of that, there’s an application called Tracee, that’s written by my former colleagues at Aqua, which will attach to LSM events and decide whether they are in or out of policy.

Resilience to Failure – Superfast Load Balancing

We can use eBPF to help us be resilient to exploits. What other kinds of resiliency can we enable with eBPF? One other example is load balancing. Load balancing can be used to scale requests across a number of different backend instances. We often do it not just for scaling, but also to allow for resilience to failure, high availability. We might have multiple instances so that if one of those instances fails in some way, we still have enough other instances to carry on handling that traffic. In that previous example, I showed you an eBPF program attached to a network interface, or rather, it’s attached to something called the eXpress Data Path of a network interface. eXpress Data Path is very cool, in my opinion. You may or may not have a network card that allows you to actually run the XDP program, so run the eBPF program on the hardware of your network interface card. XDP is run as close as possible to that physical arrival of a network packet. If your network interface card supports it, it can run directly on the network interface card. In that case, the kernel’s network stack would never even see that packet. It’s blazingly fast handling.

If the network card doesn’t have support for it, the kernel can run your eBPF program, again, as early as possible on receipt of that network packet. It’s still super-fast, because there’s no need for the packet to traverse the network stack, and it certainly never gets anywhere near being copied into user space memory. We can process our packets very quickly using XDP. We can make decisions like, should we redirect that packet. We can do layer-3, layer-4 load balancing in the kernel incredibly quickly, possibly not even in the kernel but on a network card, deciding whether we should pass this packet on up to the network stack and through to user space on this machine, or whether we should be load balancing off to a different physical machine altogether. We can redirect packets. We can do that very fast. We can use that for load balancing.
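
A layer-3/layer-4 load balancing decision can be sketched as a hash over the connection's addresses and ports. The following Python model is an illustration under stated assumptions: the function name pick_backend and the backend addresses are hypothetical, and the simple modulo scheme stands in for the more careful approaches (such as consistent hashing) that production load balancers typically use.

```python
import zlib


def pick_backend(src_ip, src_port, dst_ip, dst_port, proto, backends):
    """Choose a backend for a flow by hashing its 5-tuple.

    Packets of the same flow always hash to the same backend, so a
    connection is not split across machines.
    """
    if not backends:
        return None                       # nothing to redirect to
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{proto}".encode()
    return backends[zlib.crc32(key) % len(backends)]


# Hypothetical backend pool for illustration.
BACKENDS = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
```

Calling pick_backend("192.0.2.7", 40000, "198.51.100.1", 80, "tcp", BACKENDS) repeatedly returns the same backend for that flow; a different source port may map to a different one.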

The Kube-proxy

Let’s just briefly turn our thoughts to Kubernetes. In Kubernetes, we have a load balancer called the kube-proxy. The kube-proxy enables load balancing by telling pod traffic how to reach other pods. How can a message from one pod get to another pod? It acts as a proxy service. What is a proxy if not essentially a load balancer? With eBPF we have the option not just to attach to the XDP interface, as close to the physical interface as possible. We also have the opportunity to attach to the socket interface, so as close to the application as possible. Applications talk to networks through the socket interface. We can attach to a message arriving from a pod and perhaps bypass the network stack because we know we want to send it to a pod on a different machine, or we can bypass the network stack and loop straight back to an application running on the same physical machine or the same virtual machine. By intercepting packets as early as possible, we can make these load balancing decisions. We can avoid having to go through the whole kernel’s network stack, and it gives us some incredible performance improvements. Kube-proxy replacement performance compared to an iptables based Kube-proxy can be dramatically quicker.

eBPF Enables Efficient Kubernetes-Aware Networking

I want to now dive a little bit more into why eBPF can enable this really efficient networking, particularly in Kubernetes. In that previous diagram, I just showed the kernel network stack as one box. The networking stack is pretty complicated. Typically, a packet going through the kernel’s network stack goes through a whole bunch of different steps and stages, as the kernel decides what to do with it. In Kubernetes, we have not just the networking stack on the host, but we typically run a network namespace for every pod. Each pod, by having its own network namespace, has to run its own networking stack. Imagine a packet that arrives on the physical eth0 interface. It traverses the whole kernel’s networking stack to reach the virtual Ethernet connection to the pod where it’s destined to go. Then it goes through the pod’s networking stack to reach the application via a socket. If we use eBPF, and particularly if we know about Kubernetes identities and addresses, we can bypass that stack on the host. When we receive a packet on that eth0 interface, if we already know which pod that IP address is associated with, we can essentially do a lookup and just pass that packet straight to the pod, where it then goes through the pod’s networking stack, but doesn’t have to go through all the complication of everything happening on the host’s networking stack.

Using an eBPF enabled networking interface for Kubernetes like Cilium, we can enable this network stack shortcutting because we’re aware of Kubernetes identities. We know what IP addresses are associated with which pods but also which pods are associated with which services, with namespaces. With that knowledge, we can build up these service maps showing how traffic is flowing between different components within our cluster. eBPF is giving us visibility into the packet. We can see, not just the destination IP address and port, we can route through a proxy to find out what HTTP type of request it is. We can associate that flow data with Kubernetes identities.

In a Kubernetes network, IP addresses change all the time, pods come and go. An IP address one minute may mean one thing, and two minutes later, it means something completely different. IP addresses are not terribly helpful for understanding the flows within a Kubernetes cluster. Cilium can map those IP addresses to the correct pod, the correct service at any given point in time and give you much more readable information. It is measurably faster. Whether you’re using Cilium or other implementations of eBPF networking, that ability to bypass the networking stack on the host gives us measurable performance improvements. We can see here that the blue line on the left is the request-response rate, the number of requests per second that we can achieve without any containers at all, just directly sending and receiving traffic between nodes. We can get performance that’s nearly as fast using eBPF. Those yellow and green lower bars in the middle show us what happens if we don’t use eBPF, and we use the legacy host routing approach through the host network stack. It’s measurably slower.
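
The mapping from short-lived IP addresses to Kubernetes identities mentioned above can be pictured as a lookup table that is kept up to date as pods come and go. This Python sketch is a simplified model, not how Cilium stores identities; the class name IdentityCache, the function describe_flow, and the pod names are assumptions made for the example.

```python
class IdentityCache:
    """Tracks which pod currently owns each IP address."""

    def __init__(self):
        self._by_ip = {}                  # ip -> "namespace/pod"

    def upsert(self, ip, namespace, pod):
        """Record that ip now belongs to namespace/pod (replaces any old owner)."""
        self._by_ip[ip] = f"{namespace}/{pod}"

    def delete(self, ip):
        """Forget ip when its pod goes away."""
        self._by_ip.pop(ip, None)

    def label(self, ip):
        """Return the current owner of ip, or the raw IP if it is unknown."""
        return self._by_ip.get(ip, ip)


def describe_flow(cache, src_ip, dst_ip, dst_port):
    """Render a flow using identities instead of bare IP addresses."""
    return f"{cache.label(src_ip)} -> {cache.label(dst_ip)}:{dst_port}"
```

The same address can be described differently a minute later: once a pod is deleted and its IP is reused, describe_flow reports the new owner rather than a stale name.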

eBPF Network Policy Decisions

We can also take advantage of having that knowledge of Kubernetes identities and the ability to drop packets to build very efficient network policy implementations. You saw how easy it was to drop packets. Rather than just inspecting the packet and deciding that it was a ping packet, we can compare the packet to policy rules and decide whether or not it should be forwarded. This is quite a nice tool that we have for visualizing Kubernetes network policies. You can find it at networkpolicy.io. We talked about load balancing, and how we can use load balancing within a Kubernetes cluster in the form of kube-proxy. After all, Kubernetes gives us a huge amount of resiliency. If an application pod crashes, it can be recreated dynamically without any operator intervention. We can scale automatically without operator intervention.
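
Comparing a packet against policy rules, as described above, amounts to a match followed by a forward-or-drop verdict. The sketch below is a deliberately small Python model: the Rule fields, the label strings, and the default-deny behavior are assumptions for illustration and do not reflect the full Kubernetes NetworkPolicy semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Allow traffic from pods labeled from_label to to_label on port."""
    from_label: str
    to_label: str
    port: int


def policy_verdict(rules, src_label, dst_label, dst_port):
    """Return "FORWARD" if any rule allows the packet, otherwise "DROP"."""
    for rule in rules:
        if (rule.from_label == src_label
                and rule.to_label == dst_label
                and rule.port == dst_port):
            return "FORWARD"
    return "DROP"                         # default deny in this sketch


# Hypothetical policy: only X-wings may reach the Rebel base on port 80.
RULES = [Rule(from_label="xwing", to_label="rebel-base", port=80)]
```

With these rules, policy_verdict(RULES, "xwing", "rebel-base", 80) forwards the packet, while the same request from a pod with any other label is dropped.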

Resilience to Failure – ClusterMesh

What about the resiliency of the cluster as a whole, if your cluster is running in a particular data center and you lose connectivity to that data center? Typically, we can use multiple clusters. I want to show how eBPF can make the connectivity between multiple clusters really very straightforward. In Cilium, we do this using a feature called ClusterMesh. With ClusterMesh, we have two Kubernetes clusters. The Cilium agent running in each cluster will read a certain amount of information about the state of other clusters in that ClusterMesh. Each cluster has its own database of configuration and state stored in etcd. We run some etcd proxy components that allow us to just find out about the multi-cluster specific information that we need, so that the Cilium agents on all the clusters can share that multi-cluster state.

What do I mean by multi-cluster state? Typically, this is going to be about creating highly available services. We might run multiple instances of a service on multiple clusters to make them highly available. With ClusterMesh, we simply mark a service as global, and that connects them together such that a pod accessing that global service can access it on its own cluster, or on a different cluster, should that be necessary. I think this is a really nice feature of Cilium, and remarkably easy to set up. If the backend pod on one cluster is destroyed for some reason, or indeed if the whole cluster goes down, we still have the ability to route requests from other pods on that cluster to backend pods on a different cluster. They can be treated as a global service.

I think I have an example of this. I have two clusters. My first cluster is up, we can see cm-1, standing for ClusterMesh 1, and a second cluster, cm-2. They are both running some pods. We quite often in Cilium do some demos with a Star Wars theme. In this case, we have some X-wing fighters that want to be able to communicate with the Rebel base. We also have some similar X-wings and Rebel bases on the second cluster. Let’s just take a look at the services. In fact, let’s describe that Rebel base, service rebel-base. You can see it’s annotated by Cilium as a global service. It’s been annotated by me as part of the configuration to say I want this to be a global service. The same is true if I look on the second cluster there. They’re both described as global. What that means is, I can issue requests from an X-wing on either cluster, and it will receive responses load balanced across backends on those two different clusters. Let’s try that. Let’s run it in a loop. Let’s exec into an X-wing. It doesn’t really matter which X-wing. We want to send a message to the Rebel base. Hopefully, what we should see is that we’re getting responses sometimes from cluster 1 and sometimes from cluster 2, at random.

What if something bad were to happen to the Rebel base pods on one of those clusters? Let’s see which pods are running. Let’s delete the pods on cluster 2. In fact, I’ll delete the whole deployment of Rebel base on the second cluster. What we should see is that all the requests are now handled by cluster 1. Indeed, you can see, it’s been cluster 1 now for quite some time. That resiliency, where we literally just have to mark our services as global, is an incredibly powerful way of enabling that multi-cluster high availability.
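
The behavior seen in the demo can be modeled as a backend pool that spans clusters. This Python sketch is only a model of the observable behavior, not Cilium's implementation; the class name GlobalService, its methods, and the pod names are assumptions for illustration.

```python
import random


class GlobalService:
    """A service whose backends may live on several clusters."""

    def __init__(self, name, rng=None):
        self.name = name
        self._backends = []               # list of (cluster, pod) tuples
        self._rng = rng or random.Random()

    def add_backend(self, cluster, pod):
        self._backends.append((cluster, pod))

    def remove_cluster(self, cluster):
        """Drop every backend on a cluster, e.g. after its deployment is deleted."""
        self._backends = [b for b in self._backends if b[0] != cluster]

    def pick(self):
        """Pick a backend at random from whatever is still available."""
        if not self._backends:
            raise LookupError(f"no backends left for {self.name}")
        return self._rng.choice(self._backends)


def build_demo(rng=None):
    """Two clusters, each with two hypothetical rebel-base pods."""
    svc = GlobalService("rebel-base", rng)
    for cluster in ("cm-1", "cm-2"):
        for i in range(2):
            svc.add_backend(cluster, f"rebel-base-{i}")
    return svc
```

Before any failure, repeated calls to pick() return backends from both cm-1 and cm-2. After remove_cluster("cm-2"), every request is answered from cm-1, with no change needed on the client side.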

Visibility into Failures – eBPF Observability Instrumentation

Lest I give you the impression that eBPF is just about networking, and advantages in networking, let me also talk a bit about how we can use eBPF for observability, which is, after all, incredibly important if something does go wrong. We need observability so that we can understand what happened. In a Kubernetes cluster, we have a number of hosts, and each host has only one kernel. However many user space applications we’re running, however many containers we’re running, they’re all sharing that one kernel per host. If they’re in pods, there’s still only one kernel however many pods there are. Whenever those applications in pods want to do anything interesting, like read or write to a file, or send or receive network traffic, or whenever Kubernetes wants to create a container, anything complicated involves the kernel. The kernel has visibility and awareness of everything interesting that’s happening across the entire host. That means if we use eBPF programs to instrument the kernel, we can be aware of everything happening on that whole host. Because we can instrument pretty much anything that’s happening in the kernel, we can use it for a wide variety of different metrics and observability tools. Different kinds of tracing can all be built using eBPF.

As an example, this is a tool called Pixie, which is a CNCF sandbox project. With this flame graph, it’s giving us information about what’s running across the entire cluster. It’s aggregating information from eBPF programs running on every node in the cluster to produce this overview of how CPU time is being used across the whole cluster, with detail into specific functions that those applications are calling. The really fun thing about this is that you didn’t have to make any changes to your application, you don’t have to change the configuration even to get this instrumentation. Because as we saw, when you make a change in the kernel, it immediately affects whatever happens to be running on that kernel. We don’t have to restart those processes or anything.

This also has an interesting implication for what we call the sidecar model. In a lot of ways, eBPF gives us a lot more simplicity compared to the sidecar model. In the sidecar model, we have to inject a container into every pod that we want to instrument. It has to be inside the pod because that’s how one user space application can get visibility over other things that are happening in that pod. It has to share namespaces with that pod. We have to inject that sidecar into every pod. To do that, some YAML has to be introduced into the definition of that pod. You probably don’t write that YAML by hand to inject the sidecar. It’s probably done in admission control or as part of a CI/CD process, so something will likely automate the process of injecting that sidecar. Nevertheless, it has to be injected. If something goes wrong with that process, or perhaps you didn’t mark a particular pod as being something you want to instrument, then the injection doesn’t happen and your instrumentation has no visibility into that pod.

On the other hand, if we use eBPF, we’re running our instrumentation within the kernel, so we don’t need to change the pod definition. We’re automatically getting that visibility from the kernel’s perspective, because the kernel can see everything that’s happening on that host. As long as we add eBPF programs onto every host, we will get that comprehensive visibility. That also means that we can be resilient to attacks. If somehow our host gets compromised, if someone manages to escape a container and get on to the host, or even if they run a separate pod somehow, your attacker is probably not going to bother instrumenting their processes and their pods with your observability tools. If your observability tools are running in the kernel, the attacker’s processes will be seen regardless. You can’t hide from tooling that’s running in the kernel. This ability to run instrumentation without sidecars is creating some really powerful observability tools.

Resilient, Observable, Secure Deployments – Sidecarless Service Mesh

It also takes us to the idea of a sidecarless service mesh. Service mesh is there to be resilient and observable and secure. Now with eBPF, we can implement service mesh without the use of sidecars. I showed before the diagram showing how we can bypass the networking stack on the host using eBPF. We can take that another step further for service mesh. In the traditional sidecar model, we run a proxy, perhaps it’s Envoy, inside every pod that we want to be part of the service mesh. Every instance of that proxy has routing information, and every packet has to pass through that proxy. You can see on the left-hand side of this diagram, the path for network packets is pretty tortuous. It’s going through essentially five instances of the networking stack. We can dramatically shortcut that with eBPF. We can’t always avoid a proxy. If we are doing something at layer-7, we need that proxy, but we can avoid having a proxy instance inside every pod. We can be much more scalable by having far fewer copies of routing information and configuration information. We can bypass so many of those networking steps through eBPF connections at the XDP layer within the networking stack, or at the socket layer. eBPF will give us service mesh that’s far less resource hungry, that’s much more efficient. I hope that’s given a flavor of some of the things that I think eBPF is enabling around networking, observability, and security, that’s going to give us far more resilient and scalable deployments.

Summary

I’ve pretty much been talking about Linux so far. eBPF is also coming to Windows. Microsoft have been working on eBPF on Windows. Alongside Isovalent and a number of other companies that are interested in massively scalable networks, they have come together to form the eBPF Foundation, which is a foundation under the Linux Foundation, really to take care of eBPF technology across different operating systems. I hope that gives a sense of why eBPF is so important, and why it’s so revolutionary for resilient deployments of software, particularly in the cloud native space, but not necessarily limited to it. Regardless of whether you’re running Linux or Windows, there are eBPF tools to help you optimize those deployments and make them more resilient.

Resources

You can find more information about eBPF at the ebpf.io site, and Cilium is at cilium.io. There’s also a Slack channel that you’ll find from both of those sites, where you’ll find experts in Cilium and in eBPF. Of course, if you want to find out more about what we do at Isovalent, please visit isovalent.com.

Questions and Answers

Watt: Which companies are using Cilium in production at the moment that you’re seeing and know about?

Rice: We’ve actually got a list in the Cilium GitHub repo of users who have added themselves to the list of adopters. There are certainly dozens of them. There are companies using it at significant scale. Bell Canada, for example, is using it in telco. Adobe and Datadog are others. These are just a few examples of companies that I know I can speak about publicly. It’s pretty widely adopted.

Watt: It’s certainly one of the technologies on the up and coming road. I think the fact that there are already some big players in the market that are already using this, is testament I think to where it’s going.

Rice: The other two integrations to really mention, the Dataplane V2 in GKE is actually based on Cilium. Amazon chose Cilium as the networking CNI for their EKS Anywhere distribution. I feel like that’s a very strong vote of confidence in Cilium as a project and eBPF as a technology.

Watt: One of the areas we’re looking at on the track is around chaos engineering, and that side of things. How do you see eBPF potentially helping out or providing ways to do different things from a chaos engineering perspective?

Rice: I think this is something that we just touched on, about having eBPF programs running in the kernel and potentially changing events, that could be a really great way of triggering chaos tests. For example, if you wanted to drop some percentage of packets and see how your system behaved, or insert errors, all manner of disruptive things that you might want to do in chaos testing, I think eBPF could be a really interesting technology for building that on.
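
The chaos-testing idea of dropping some percentage of packets can be modeled as follows. This is a Python sketch of the decision only: the function name make_dropper and its parameters are invented for the example. In an actual eBPF program the random draw would come from a kernel helper such as bpf_get_prandom_u32(), and the verdict would be returned from the attached hook.

```python
import random


def make_dropper(drop_percent, rng=None):
    """Return a function that decides, per packet, whether to drop it.

    drop_percent is the approximate share of packets to discard (0 to 100).
    """
    if not 0 <= drop_percent <= 100:
        raise ValueError("drop_percent must be between 0 and 100")
    rng = rng or random.Random()

    def should_drop():
        # rng.random() is in [0, 1), so 0 never drops and 100 always drops.
        return rng.random() * 100 < drop_percent

    return should_drop
```

A dropper built with make_dropper(30) discards roughly three packets in ten, which is enough to observe how retries, timeouts, and failover behave under a lossy network.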
