Introduction
Wes Reisz: In October of 2020, we did a podcast with Colin McCabe and Jason Gustafson about KIP-500, the removal of the ZooKeeper dependency in Kafka. Today, we catch up with Colin on how the project is going and what’s planned for the current release of Kafka.
Hi, my name is Wes Reisz and I’m a technical principal with Thoughtworks and co-host of the InfoQ Podcast. I also have the privilege of chairing the San Francisco edition of the QCon software conference, held in October and November of each year. If you enjoy this podcast and are looking to meet and discuss topics like today’s directly with engineers working on these types of problems, just like Colin, the next QCon will be held in London, March 27 to 28. There you can meet and hear from software leaders like Leslie Miley, Radia Perlman, Crystal Hirschorn, Justin Cormack, and Gunnar Morling on topics ranging from computer networks, WebAssembly, and team topologies to, of course, the meat and potatoes of every single QCon, modern software architectures. If you’re able to attend, please stop me and say hi.
As I mentioned, on today’s podcast I’m speaking with Colin McCabe. Colin is a principal engineer at Confluent, where his focus is on the performance and scalability of Apache Kafka. Most recently, he’s been center stage with the implementation of KRaft. KRaft is Kafka plus Raft, and it’s where Kafka removes its dependency on ZooKeeper by implementing a new quorum controller service based on the Raft protocol. Today, we catch up on the progress of the project. We talk a lot about community reception, and we talk about the upgrade path for shops that are currently on ZooKeeper and will be moving to a KRaft version, what that looks like and what you need to do. We talk about the overall plan for deprecation, and we talk about lessons learned from the implementation of KRaft.
As always, thank you for joining us for another edition of the InfoQ Podcast. Colin, welcome to the InfoQ Podcast.
Colin McCabe: Thanks. It’s great to be here.
Wes Reisz: I think last time we spoke was maybe a year ago, maybe a little bit over a year ago, but it was pretty much when KIP-500 was first accepted, which is KRaft mode. It’s basically replacing ZooKeeper by putting Kafka plus Raft, KRaft mode, into Kafka, so a self-managed metadata quorum. I think that was available with 3.3.1. How’s the reception been with the community?
Colin McCabe: Yeah. People have been tremendously excited about seeing this as production ready. We’ve been doing some testing ourselves. We’ve heard of some people doing testing. It takes a while to get things into production, so we don’t have a huge number of production installations now. One of the most exciting things is starting to bring this to production and starting to get our first customers, so that’s really exciting.
What do you see as the distribution around versions of Kafka?
Wes Reisz: What do you see as the distribution of people across versions of Kafka? Do you see people clustered around certain versions, or do they tend to stay up-to-date?
Colin McCabe: Yeah, that’s a good question. I don’t have ultra-precise numbers, and you’re right, I’m not sure if I’d be allowed to share them if I did. I will say that in general, we have sort of two sets of customers right now. We have our cloud customers and we have our on-premise customers. Our on-premise customers tend to be a few versions behind. Not all of them, but many of them. In the cloud, we’re very up-to-date. So, there’s a little bit of a divide there, and I don’t know the exact distribution. One of the interesting things about the last few years, of course, has been that so many more things are moving to the cloud, and of course Kafka is huge in the cloud now, so we’re starting to see people get a lot more up-to-date just because of that.
What is KRaft mode?
Wes Reisz: Let’s back up for a second. What is KRaft mode? It improves partition scalability, it improves resiliency, it simplifies deployments of Kafka in general. What is KRaft, and what was the reason behind going to it?
Colin McCabe: Basically, KRaft is the new architecture for Kafka metadata. There’s a bunch of reasons, which you touched on just now, for doing this. The most surface-level one is, okay, now I have one service. I don’t have two services to manage. I have one code base, I have one configuration language, all that stuff. But I think the more interesting reason, as a designer, is this idea that we can scale the metadata layer by changing how we do it. We’re changing from a paradigm where ZooKeeper was a multi-writer system and basically you never really knew if your metadata was up-to-date. So, every operation you did had to kind of be a compare-and-swap, sort of, well, if this, then that. We ended up doing a lot of round trips between the controller and ZooKeeper, and all of that degrades performance and makes it very difficult to do certain optimizations.
With KRaft, we have a much better design in the sense that we believe we can push a lot more bandwidth through the control plane, and we also have a lot more confidence in what we do push, because we have this absolute ordering of all the events that happen. So, the metadata log is a log, it has an order, so you know where you are. You didn’t know that in ZooKeeper mode. You got some metadata from the controller and you knew a few things about it, like you knew when you got it, you knew the controller epoch, some stuff like that. But you didn’t really know if what you had was up-to-date. You couldn’t really diff where you were versus where you are now.
In production, we see a lot of issues where people’s metadata gets out of sync, and there’s just no really good way to trace that, because in ZooKeeper mode we’re in the land of, I got an RPC and I changed my cache. Okay, but we can’t say for sure that it arrived in the same order on every broker. Basically, I would condense this into a few bullet points: the manageability thing, hey, one service, not two; scalability; and what I would call safety. The safety of the design is greater.
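To make the “one service, not two” point concrete, here is a minimal sketch of what a combined broker-plus-controller node’s configuration might look like on a recent KRaft-capable release; the node ID, host names, ports, and paths are placeholders.

    # A combined broker-plus-controller node; one service to run, no ZooKeeper ensemble.
    process.roles=broker,controller
    node.id=1

    # The KRaft controller quorum, declared as id@host:port.
    controller.quorum.voters=1@localhost:9093

    # One listener for clients, one for controller traffic.
    listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
    controller.listener.names=CONTROLLER
    inter.broker.listener.name=PLAINTEXT

    # Both topic data and the metadata log live under log.dirs.
    log.dirs=/tmp/kraft-combined-logs

Separate broker and controller processes are configured the same way, just with process.roles set to a single role on each node.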
What is the architectural change that happens to Kafka with KRaft mode?
Wes Reisz: If you’re using ZooKeeper today, what is the architectural change that will be present with KRaft?
Colin McCabe: Yeah. If you’re using ZooKeeper today, your metadata is stored in ZooKeeper. For example, if you create a topic, there will be znodes in ZooKeeper that describe that topic: which replicas are in sync, what the configuration of the topic is. When you move to KRaft mode, instead of being stored in ZooKeeper, that metadata will be stored in a record, or series of records, in the metadata log. Of course, that metadata log is managed by Kafka itself.
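As an illustration, topic creation looks the same to a client either way; what changes is where the resulting metadata ends up. A minimal sketch using the Java Admin client follows; the bootstrap address, topic name, and partition/replica counts are placeholders.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // The client call is identical in ZooKeeper mode and KRaft mode; in KRaft,
                // the controller records the new topic as entries in the metadata log
                // rather than as znodes in ZooKeeper.
                admin.createTopics(Collections.singleton(new NewTopic("orders", 3, (short) 3)))
                     .all()
                     .get();
            }
        }
    }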
Wes Reisz: Understood. Nice. You mentioned a little bit about scalability. Are there any performance metrics, numbers you can talk to?
Colin McCabe: Yeah. We’ve done a bunch of benchmarks, and we’ve gotten some great improvements in controller shutdown, and we’ve gotten some great improvements in the number of partitions, and we’re optimistic that we can hopefully at least 10X the number of partitions we can support.
Wes Reisz: 10X. Wow.
Colin McCabe: Yeah. I think our approach to this is going to be incremental and based on numbers and benchmarking. Right now, we’re in a mode where we’re getting all the features there, we’re getting everything stable, but once we shift into performance mode, then I think we’ll really see huge gains there.
What does the upgrade path look like for moving to KRaft?
Wes Reisz: You mentioned just a little while ago about moving to KRaft mode from ZooKeeper. What does that upgrade path look like?
Colin McCabe: Yeah, great question. The first thing you have to do is get your cluster onto a so-called bridge release, and we call them that because these bridge releases can act as a bridge between the ZooKeeper versions and the KRaft versions. This is a little bit of a new concept in Kafka. Previously, Kafka supported upgrading from any release to any other release, so it was full upgrade support. Now with KRaft, we’ve actually limited the set of releases you can upgrade to KRaft from. There are some technical reasons why that limitation needs to be in place.
But anyway, once you’re on this release and once you’ve configured these brokers so that they’re ready to upgrade, you can then add a new controller quorum to the cluster, and that will take over as the controller. At that point, you’ll be running in a hybrid mode where your controllers are upgraded but not all of your brokers are upgraded. Your goal in this mode is probably to roll all of the brokers so that you’re now in full KRaft mode. And then, when you’re in full KRaft mode, everybody’s running just as they would in KRaft mode, but with one difference, which is that you’re still writing data to ZooKeeper. We call this dual-write mode, because we’re writing the metadata to the metadata log, but also writing it to ZooKeeper.
Dual-write mode exists for two reasons. One of them is that there’s a technical reason why, while we’re doing the rolling upgrade, you need to keep updating ZooKeeper so that the old brokers can continue to function while the cluster’s being rolled. The second reason is that it gives you the ability to go back if something goes wrong: you still have your metadata in ZooKeeper, so you can roll back from the upgrade and your metadata is still there. Of course, we hope not to have to do that, but if we do, then we can.
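For a rough picture of what this looks like in configuration terms, the migration design (KIP-866) has brokers on the bridge release keep their ZooKeeper settings while also pointing at the new controller quorum. A sketch along these lines, with placeholder hosts and node IDs, and with the caveat that exact property names and steps can vary by release:

    # Still set while the cluster is in dual-write mode, so metadata keeps
    # flowing to ZooKeeper as well as to the metadata log.
    zookeeper.connect=zk1:2181,zk2:2181,zk3:2181

    # Opts the cluster into the ZooKeeper-to-KRaft migration.
    zookeeper.metadata.migration.enable=true

    # The new KRaft controller quorum that takes over from the ZooKeeper-based controller.
    controller.quorum.voters=3000@controller1:9093,3001@controller2:9093,3002@controller3:9093
    controller.listener.names=CONTROLLER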
Wes Reisz: What I heard, though, is that it’s an in-place upgrade, so you don’t have to bring anything offline to be able to do your upgrade?
Colin McCabe: Yeah, absolutely. It’s a rolling upgrade and it doesn’t involve taking any downtime. This is very important to us, of course.
Wes Reisz: Yeah, absolutely. That’s awesome. Back to the bridge release for a second. The bridge release is just specifically to support ZooKeeper to KRaft. It isn’t a new philosophy that you’re going to have bridge releases for upgrades going forward, you’re still going to maintain being able to upgrade to any point, is that still the plan?
Colin McCabe: Yeah. I mean, I think that going forward, we’ll try to maintain as much upgrade compatibility as we can. This particular case was a special case, and I don’t think we want to do that in the future if we can avoid it.
Wes Reisz: If I’m right, I believe we’re recording this right around the time of the code freeze for 3.4. What’s the focus of 3.4? Is that going to be that upgrade path you were talking about?
Colin McCabe: The really exciting thing to me obviously is this upgrade path, and this is something users have been asking for a long time. Developers, we’ve wanted it, and so it’s huge for us I think. Our goal for this release is to get something out there and get it into this pre-release form. There may be a few rough edges, but we will have this as a thing we can test and iterate on, and that’s very important, I think.
Any surprises on the implementation of KRaft?
Wes Reisz: Kind of going back to the implementation of Raft with Kafka. Anything surprising during that implementation that kind of comes to mind or any lessons that you might want to share?
Colin McCabe: Yeah, that’s a great question. There are a few things that I thought were interesting during this development, and one of them is that we actually got a chance to do some formal verification of the protocol, which is something I hadn’t really worked on before, so I thought that was really cool. The ability to verify the protocol with TLA+ was really interesting.
Wes Reisz: Just to make sure we level set, what does formal verification mean when it comes to something like Raft?
Colin McCabe: TLA+ is a form of model checking. You create a model of what the protocol is doing, and then you have a tool to actually check that it’s meeting all of your invariants, all your assumptions. It sort of has its own proof language to do this. I think TLA+ has been used to verify a bunch of stuff. Some people use it to verify their cloud protocols; some of the big cloud providers have done that. I thought it was really cool to see us doing that for our own replication protocol. To me, that was one of the cool things.
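As a loose analogy only, not TLA+ and not the actual KRaft model, the idea of checking an invariant over every reachable state can be sketched in a few lines of Java: enumerate every state a toy three-voter election could be in and assert that no two candidates ever both hold a majority.

    import java.util.ArrayList;
    import java.util.List;

    // A toy, hand-rolled illustration of invariant checking: enumerate every state
    // of a three-voter, two-candidate election and verify that no state has two
    // candidates both holding a majority.
    public class ToyQuorumInvariantCheck {
        public static void main(String[] args) {
            List<int[]> states = new ArrayList<>();
            // Each of the three voters votes for candidate 0 or candidate 1,
            // giving 2^3 = 8 reachable states in this tiny model.
            for (int v0 = 0; v0 <= 1; v0++)
                for (int v1 = 0; v1 <= 1; v1++)
                    for (int v2 = 0; v2 <= 1; v2++)
                        states.add(new int[] { v0, v1, v2 });

            for (int[] votes : states) {
                boolean zeroHasMajority = votesFor(0, votes) >= 2;
                boolean oneHasMajority = votesFor(1, votes) >= 2;
                // The invariant: at most one candidate can hold a majority.
                if (zeroHasMajority && oneHasMajority) {
                    throw new AssertionError("Invariant violated: two leaders at once");
                }
            }
            System.out.println("Invariant held in all " + states.size() + " states");
        }

        private static int votesFor(int candidate, int[] votes) {
            int count = 0;
            for (int v : votes) {
                if (v == candidate) {
                    count++;
                }
            }
            return count;
        }
    }

A real model checker explores the states of the protocol itself, not a toy election, but the shape of the check is the same: define the reachable states, state the invariants, and verify them exhaustively.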
Another thing that I thought was really interesting during this development process was I learned the importance of deploying things often. So, we have a bunch of test clusters internally, soak clusters, and one thing that I learned is that development just proceeded so much more smoothly if I could get these clusters upgraded every week and just keep triaging what came through every week.
When you have a big project, I guess you want to attack it on multiple fronts. The testing problem, I mean. You want to have as many tests as you can that are small and self-contained, as many unit tests as possible. You want to have as much soak testing going on as you can. You want to have as many people trying it as you can. So, I really think I learned a few things about how to keep a big project moving that way, and it’s been really interesting to see that play out. It’s also been interesting that the places where we spent our time have not been exactly where we thought. We thought, for example, we might spend a lot of time on Raft itself, but actually that part was done relatively quickly, relative to the whole process. The initial integration was faster than I thought, but then it took a bit longer than I thought to work out all the corner cases. I guess maybe that wasn’t so unexpected after all.
What languages are used to implement Kafka and how did you actually implement KRaft?
Wes Reisz: Kafka itself is built in Scala now, right? Is that correct?
Colin McCabe: Well, Kafka has two main languages that we use. One of them is Java and the other is Scala. The core of Kafka is Scala, but a lot of the new code we’re writing is Java.
Wes Reisz: Did you leverage libraries for implementing Raft? How did you go about the implementation with Raft?
Colin McCabe: Well, the implementation of Raft is our own implementation, and there are a few reasons why we decided to go this path. One of them is that Kafka is first and foremost a system for managing logs. If Kafka outsources its logs to another system, it’s like, well, maybe we should just use that other system. Secondly, we wanted the level of control and the ability to inspect what’s going on that we get from our own implementation. And thirdly, we had a lot of the tools lying around. For example, we have a log layer, we have fetch requests, we have produce requests, so we were actually able to reuse a very large amount of that code.
What are some of the implications of Raft for multi-cluster architectures?
Wes Reisz: KRaft mode changes how leaders are elected in the cluster and where the metadata is stored. What does it mean, or does it mean anything, for multi-cluster architectures?
Colin McCabe: Yeah, that’s a good question. I think when a lot of people ask about multi-cluster architectures, they’re really asking, “I have N data centers. Now, where should I put things?” This is an absolutely great question to ask. One really typical setup is, hey, I have everything in one data center. Well, that one’s easy to answer. Now, if you have three data centers, then I would tell you, well, you put one controller in each data center and then you split your brokers fairly evenly.
This is really no different than what we did with ZooKeeper. With ZooKeeper, we would have one ZooKeeper node in each data center, so if a data center goes down, you still have your majority. Where I think it generally gets interesting, if interesting is the right word, is when you have two data centers. A lot of people find themselves in that situation because maybe their company has invested in two really big data centers and now they’re trying to roll out Kafka as a new technology and they want to know what to do. It’s a really good question. The challenge, of course, is that the Raft protocol is all majority-based. When you have only two data centers, if you lose one of those data centers, you have a chance of losing the majority, because if you had two nodes in one and one node in the other, well then the data center with two nodes is a single point of failure.
People generally find workarounds for this kind of problem, and this is what we saw with ZooKeeper. People would, for example, have multiple data centers but have some way of manually reestablishing the quorum when they lost the data center that had their majority. There are a bunch of different ways to do that. The one that I’ve always liked is putting different weights on different nodes. The Raft protocol doesn’t actually require that every node have the same weight. You could have nodes with different weights, and that gives you the flexibility to change the weight of a node dynamically, so the node can say, okay, well now I’m the majority. Even though I’m a single node, I’m the majority. That allows you to get back up and running after a failure.
There are also other customers we’ve seen who have two big data centers but want a third place for things to be. So, they hire a cloud provider and they’re like, “Well, we only have this one ZooKeeper node in the cloud. Everything else is on premise, but this is what we’ve got.” That can be a valid approach. It kind of depends on the reasons why they’re on premise. If the reason why you’re on premise is regulatory, this data can’t leave this building, then obviously that’s not going to work. But if the reason you’re on premise is more like, “Well, we looked at the cost structure, this is what we like,” then I think that sort of approach could work.
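To make the three-data-center layout Colin describes concrete, the controller quorum on each node might be declared along these lines; the IDs, host names, and ports are placeholders.

    # One KRaft controller per data center; losing any single data center
    # still leaves two of the three voters, which is a majority.
    controller.quorum.voters=1@controller.dc1.example:9093,2@controller.dc2.example:9093,3@controller.dc3.example:9093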
Wes Reisz: In that three data center architecture that you talked about where you put a controller maybe in each data center, what are you seeing with constraints around latency, particularly with KRaft mode? Are there any considerations that you need to be thinking about?
Colin McCabe: Right now, most of our testing has been in the cloud, so we haven’t seen huge latencies between the regions, because typically we run a multi-region setup when we’re doing this testing. It’s kind of hard for me to say what the effect of huge latency would be. I do think if you had it, you would then see metadata operations take longer, of course. I don’t know that it would be a showstopper. It kind of depends on what the latency is, and it also depends on what your personal pain threshold is for latency.
We’ve seen very widely varying numbers on that. We had a few people who were like, “Man, I can’t let the latency go over five milliseconds round trip,” which is pretty low for the kind of systems we’re operating. Other people are like, “Well, a few seconds is okay.” And we’ve seen everything in between. So, I think if you really want low latency and you own the hardware, then you know what to do. If you don’t own the hardware, then I think the question becomes a little more interesting. Then, we can talk about what the workarounds are there.
Are there any other architectural considerations to think about with KRaft?
Wes Reisz: What about on a more general level? Are there any specific architectural considerations with KRaft mode that we haven’t already spoken about?
Colin McCabe: I think from the point of view of a user, we really want it to be transparent. We don’t want it to be a huge change. We’ve bent over backwards to make that happen. All the admin tools should continue to work. All the stuff that we supported, we’re going to continue to support. The exception, I think, is a few third-party tools. Some of them, open source ones, I think, would just do stuff directly with ZooKeeper, and obviously that won’t continue to work when ZooKeeper’s gone, but I think those tools hopefully will migrate over. A lot of those tools already had problems across versions; they would have sort of an unhealthy level of familiarity with the internals of Kafka. But if you’re using any supported tool, then we really want to support you. We really want to keep everything going.
Wes Reisz: That was a question I wanted to ask before I forgot: we’re talking about KRaft and we’re talking about removing ZooKeeper, but as someone who’s using Kafka, there’s no change to the surface. It still feels and operates like Kafka always has. And as you just said, all the tools, for the most part, will continue to work unless you’re trying to deal directly with ZooKeeper.
Colin McCabe: Yeah, absolutely. If you look at a service like Confluent Cloud, there really isn’t any difference from the user’s point of view. We already blocked direct access to ZooKeeper. We had to, because if you have direct access to ZooKeeper, you can do a lot of things that break the security paradigm of the system. Over time, Kafka has been moving towards hiding ZooKeeper from users, and that’s true even in an on-premise context, because if you have the ability to modify znodes, Kafka’s brain is right there and you could do something bad. So, we generally put these things behind APIs, which is where I think they should be. APIs that have good stability guarantees and good supportability guarantees are really what our users want.
Wes Reisz: That’s part of the reason for the bridge release, to move some of those things to a point where you don’t have direct access to ZooKeeper.
Colin McCabe: The bridge release is more about the internal API, if you want to call it that, of Kafka. The brokers can now communicate with a new controller just as well as they could communicate with the old controller. In previous releases, we were dealing with external APIs. There used to be znodes that people would poke in order to start a partition reassignment. Obviously, that kind of thing is not going to be supported when we’re in KRaft mode. Over time, we phased out those old znode-based APIs, and that happened even before the bridge release.
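As an example of that API-first direction, a reassignment that once meant writing a znode can be expressed through the Admin client’s alterPartitionReassignments call (added by KIP-455). A minimal sketch follows; the bootstrap address, topic, partition, and broker IDs are placeholders.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewPartitionReassignment;
    import org.apache.kafka.common.TopicPartition;

    public class ReassignPartitionSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // Ask the controller to move partition 0 of "orders" onto brokers 1, 2, and 3,
                // instead of poking a reassignment znode in ZooKeeper.
                Map<TopicPartition, Optional<NewPartitionReassignment>> move =
                        Collections.singletonMap(
                                new TopicPartition("orders", 0),
                                Optional.of(new NewPartitionReassignment(List.of(1, 2, 3))));
                admin.alterPartitionReassignments(move).all().get();
            }
        }
    }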
What does the future roadmap look like for Kafka?
Wes Reisz: Okay. The last question I have is about future roadmap. What are you looking at beyond 3.4?
Colin McCabe: We’ve talked a little bit about this in some of our upstream KIPs. KIP-833 talks about this a bit. The big milestone to me is probably Kafka 4.0, where we hope to remove ZooKeeper entirely, and this will allow us to focus all of our efforts on just KRaft development and not have to implement features twice or support older stuff. So, I think that’ll be a real red-letter day when we can do that.
In the shorter term, obviously, upgrade from ZooKeeper is the really cool new thing that we’re doing. It’s not production-ready yet, but it will be soon. Hopefully 3.5; I don’t think we’ve agreed on that officially, but I would really like to make it production-ready then. There are a bunch of features that we need to implement, like JBOD. We want to implement delegation tokens, we want to implement a few other things, and once we’ve gotten to full feature parity with ZooKeeper mode, we can then deprecate ZooKeeper mode.
Wes Reisz: I’m curious about that. What does deprecation mean for Kafka? Does it mean it’s marked eligible for removal at some point in the future, or is there a specific timeline? What does deprecation mean?
Colin McCabe: Deprecation generally means that when you deprecate a feature, you’re giving notice that it’s going to go away. In Kafka, we do not remove features until they’ve been deprecated in the previous major release. So in order for us to remove ZooKeeper in 4.0, it’s necessary for us to officially deprecate ZooKeeper in a 3.x release. We originally wanted to do it in 3.4. That’s not going to happen because of the aforementioned gaps. But as soon as those gaps are closed, then I think we really do want to deprecate it and basically give notice that, hey, the future is KRaft.
Wes Reisz: Well, Colin, I appreciate you taking the time to catch us up on all the things happening around KRaft mode, the removal of ZooKeeper from Kafka.
Colin McCabe: It’s been such an interesting project because we have this vision of where we want to be with Kafka, and I think we’re so close to getting there, so that’s what’s exciting to me.
Wes Reisz: That’s awesome. We appreciate you giving us kind of a front row seat and taking us along with you on your journey. Once again, Colin, thanks for joining us on the InfoQ Podcast.
Colin McCabe: Thank you very much, Wesley. It’s always a pleasure to speak with you.