Article originally posted on InfoQ.
Transcript
Printezis: I’m Tony Printezis. I’ve been working on JVM garbage collection for way too long. I’m currently on the Twitter VM team.
Montgomery: I’m Todd Montgomery. The closest thing you could call me is a network hacker. I’ve been designing protocols and things like that for a very long time. That’s why I have the white beard. I’ve been around high performance networking and high performance systems for about that long as well.
Beckwith: I’m Monica. I’ve been working with OpenJDK for a long time now, even before it was OpenJDK, so during the Sun JDK days. I am currently the JVM architect at Microsoft.
Tene: I’m Gil Tene. I’ve been working on garbage collection almost as long as Tony. Tony’s paper on the CMS collector is one of the first garbage collection papers I read. That makes it 20-plus years now. I’ve worked on all kinds of different software engineering and system engineering parts, built operating systems and kernel things, and JVMs obviously at Azul. Accidentally built an application server in the ’90s. I’ve played with a bunch of different things, which generally means I’ve made a lot of mistakes and learned from a few. Some of them are in the performance area. At Azul, I play with Java virtual machines, obviously, and some really cool performance stuff.
Java’s Just-in-Time (JIT) Compilation
Printezis: We picked the topic for this panel, which is just-in-time compilation versus ahead-of-time compilation. Let’s maybe spend a couple minutes giving a little background, so everybody can understand what the differences between the approaches are. Do you want to give a quick explanation of why Java has JIT compilation, why it needs it, and how it works?
Beckwith: For the JVM to reach optimal compiled code, with lots of compilation tricks such as inlining or loop unrolling, there has to be some information provided; many advanced optimizers call this profile-guided optimization. For the JVM, we’re taking bytecode and trying to get it to work on our hardware, be it x86-64 or ARM64. We want the execution to be in native code, because that’s what the hardware understands. That’s where the JVM comes in: when we have this series of opcodes coming out of the JVM, the JIT helps us optimize and give better performance, performance that the underlying hardware understands and has the appropriate unit for, such as the cache, or an offloading unit, or anything like that.
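To make that concrete, here is a minimal, illustrative sketch (the class name and numbers are invented) of the kind of code a JIT rewards: a small, hot method that starts out interpreted, gets profiled, and is then compiled to native code. On a HotSpot JVM, running with -XX:+PrintCompilation shows the tiered compilations as they happen.

    public class JitWarmup {
        // A small, frequently called method: a prime candidate for the
        // JIT to compile to native code and inline into its caller.
        static int square(int x) {
            return x * x;
        }

        public static void main(String[] args) {
            long sum = 0;
            // The loop body runs millions of times. The interpreter gathers
            // a profile (invocation counts, branch frequencies), C1 compiles
            // with light optimization, and C2 later recompiles the hot code
            // using the full profile.
            for (int i = 0; i < 10_000_000; i++) {
                sum += square(i);
            }
            System.out.println(sum);
        }
    }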
GraalVM and AOT in OpenJDK
Printezis: Lots of people complain that JIT compilation always has to do work at the beginning, so startup doesn’t work very well. There have been a couple of AOT solutions in Java, one of which was built into OpenJDK and has since been removed. Then there is also GraalVM. Do you want to give an overview of GraalVM and AOT in OpenJDK?
Tene: Actually, I don’t like the terms AOT and JIT, because I think they’re weirdly named. In fact, both of them are named for what they don’t do. If you wanted to categorize them: a just-in-time compiler will take the bytecode, the code that you want to optimize, and optimize it for the machine at the time it is needed and used. It has the ability to optimize then. It also has the ability to optimize later, to replace code with other code, which empowers a lot of optimizations that are pretty interesting. What a just-in-time compiler can’t do is compile ahead of time. What an ahead-of-time compiler does is take all the code and compile it to your binary before you ever run the program. It can do all that and avoid all the later work. What an ahead-of-time compiler can’t do is compile just in time. The annoying thing is the choice. If you have to choose between them, I am definitely on the just-in-time side, and I’ve got some strong arguments for why: you just get faster code, period. It’s provable. The real question is, why do we have to choose?
Go and Ahead-of-Time Compilation
Printezis: Todd, you did mention in your presentation that you have been playing around with and using Go and Rust; as far as I understand, they both generate binaries. I know Rust is actually at a different level, a much lower level than Go and Java, of course. Any thoughts on why Go does pretty well with basically an ahead-of-time compiler and doesn’t do any dynamic optimization?
Montgomery: I think one thing that is glossed over, though I don’t think you or Monica would gloss over it, is the fact that Java, OpenJDK specifically, not other JDKs or JVMs, is a little bit behind in terms of the great amount of work that’s been done in things like LLVM for the last 15, 20 years. There are a lot of optimizations that are not available as easily, and most of those are ahead-of-time compilation. In essence, I think there is a lot you can get from ahead-of-time compilation and optimization. There are some things that really work well for certain types of systems. Go happens to be one, but C++ is a huge one, because you can do a lot of different metaprogramming that makes a lot of the optimizations extremely effective. That’s where I think a lot of that sits: there’s a lot of good stuff in those cases.
I think to get the most out of it, you actually need both. Ahead of time, you can do a lot of different global optimizations that just make sense, because we as humans can’t see everything and think of everything, but the compiler can see some things and make things more efficient overall. There’s still profile-guided stuff that, based on workload, based on what has happened, is really great. To get the most out of it, you need both. I don’t think you can get away with just one. I think you can use both, and use them very effectively.
Printezis: I think Java maybe gets more benefit from just-in-time compilation, because essentially every method in it is virtual. Doing some runtime profiling can actually eliminate a lot of virtual method calls and inline more.
Tene: I think we shouldn’t confuse the implementation choices with the qualities of just-in-time or ahead-of-time. It’s absolutely true that with ahead-of-time compilation, people feel like they can afford to throw a lot more analysis power at the optimizations, so a lot of the time people will say, this analysis we can do ahead of time. In reality, anything an ahead-of-time compiler can do, a just-in-time compiler can do. It’s just a question of, can you afford to do it? Do you want to spend the time while you’re running the program to also do that? That’s one direction.
The reverse is also true. If we just dropped this line between ahead-of-time and just-in-time: the fundamental benefit of a just-in-time compiler is that it can replace code. The fact that you can replace code allows you to speculate and optimize for what you hope is true, rather than only for things you can prove, because you know that if you’re wrong, you can throw the code away and replace it with other code. That ability to do late optimization enables faster code. This is true for all languages; Java is certainly one, but it’s true everywhere. If you can speculate that today is Tuesday, you can generate faster code for Tuesday. When Tuesday turns into Wednesday, you can throw away that code and generate fast code for Wednesday. That’s better than ahead-of-time.
Ahead-of-time compilers could speculate too, if they knew that somebody could later replace the code. There’s no need to do all the analysis just in time if we could do it ahead of time and retain the ability to do additional just-in-time optimization later. Putting these two together could actually give you the best of both worlds: I can afford this analysis because somebody else did it, or I did it well ahead of time, and I can also allow myself to do optimizations that I can only afford if I’m able to replace the code later when I’m wrong.
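As a minimal Java sketch of that “Tuesday” idea (the Pricer and FlatPricer names are invented for illustration): while only one implementation of an interface has been loaded, a JIT can speculate that a call site is monomorphic, inline through it, and optimize; if a second implementation is loaded later, the compiled code is deoptimized and replaced.

    interface Pricer {
        double price(double amount);
    }

    final class FlatPricer implements Pricer {
        public double price(double amount) {
            return amount; // trivial body, easily inlined
        }
    }

    public class SpeculationDemo {
        // While FlatPricer is the only loaded Pricer, the JIT can speculate
        // that p.price() always dispatches to FlatPricer.price(), inline it,
        // and optimize the loop ("it's Tuesday"). If another implementation
        // is loaded later ("Wednesday"), the code is thrown away and
        // recompiled -- something an AOT-only compiler cannot do.
        static double total(Pricer p, double[] amounts) {
            double sum = 0;
            for (double a : amounts) {
                sum += p.price(a); // speculatively devirtualized call
            }
            return sum;
        }

        public static void main(String[] args) {
            double[] amounts = new double[1_000_000];
            java.util.Arrays.fill(amounts, 1.0);
            System.out.println(total(new FlatPricer(), amounts));
        }
    }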
JAOTC and the JVMCI Interface
Beckwith: We had this in HotSpot, where the first execution would go into the AOT-compiled code, and then of course it would go to C1 with full profiling and everything. I wanted to go back to Todd, just because I wanted to understand: in, I think, Java 9 and 10, and 11 through 13 or 14, we had the privilege of using JAOTC with the JVMCI interface. Did you ever use it? Is there any feedback you would have, because I know you mentioned Java has these nuances.
Montgomery: Even from Java 8 to Java 9, there was a difference. It’s been my experience that when people are doing optimizations specifically for Java, the first thing is to get something to inline. That’s not always as easy as it might seem, but it enables all the other optimizations. Going from Java 8 to Java 9 was a fairly big change in that respect: stuff that used to inline well all of a sudden didn’t, which then hindered other optimizations. That is jarring, and I can think of one specific thing I saw with that jump that was a little jarring. Then there have been several other things along the way, going between Java versions. It’s really tough. Sometimes it just works: you upgrade a JVM, things are great, you get some performance improvements that you didn’t expect, and everything’s fine. Seven to 8 wasn’t too much of a jump in that direction. Eight to 9 was. Nine to 14, there have been changes there that people have seen. I think you get to do that once. After that, people are like, should we look at other languages besides Java? When it’s latency sensitive, and I think about this specifically, it’s really difficult for people to look at an upgrade and decide it’s worth spending the time, when they see regressions from the platform that they’re using.
I’ve seen some instances of that going between versions. This has an impact, I think, that people tend not to look at much. That’s one of the reasons I know of several teams that upgrade to a new version of the JDK extremely slowly, some of which will not move off Java 8 until they know every single thing about what a current version will do. They’ll even look at something like 17 and go, it’d be great if we had some of the things that are in 17, but it’s also going to translate into lost money. That’s a real hard argument: you’ll also probably make some money, so what does this look like? It’s hard to do that. It’s definitely visible in the number of clients I look at, specifically in the trading space.
Tene: I think you raise an interesting point about the changes across versions. I spend a lot of time looking at these, and the reality is that you meet people who say Java 11 or Java 17 is now much faster, then you meet people who say, no, it’s much slower, and then you meet people who say, I can’t tell the difference. They’re all right. Every one of them is right, because there are some things that got faster, some things that got slower, and some things that didn’t change. Some of these are inherent to the JDK libraries themselves. A specific example is stack walking: there are new APIs for stack walking that are much better abstracted, but much slower, and the old APIs for stack walking are gone, so what are you going to do? There are counterexamples, like stream APIs that got much faster, and other under-the-hood implementations, collections, things like HashMap, got better. It varies as the platform goes along. Those aren’t actually JIT versus AOT; it’s just the code.
The fragility of JIT compilation is another point that you raised. This is where I’ll raise my pet peeve: the version of Java and which implementation of a JVM you’re using to run it are not the same thing. It’s true that OpenJDK, the mainstream, took some steps back, and inlining is a specific sensitivity. If you look at the JIT compiler world out there beyond OpenJDK’s base C1 and C2, you have multiple strong JITs, including Azul’s Falcon JIT for our Prime platform. GraalVM has a JIT; OpenJ9 has a JIT. All of those vary in how they approach things. Both the GraalVM JIT and the LLVM-based JIT that we use for Falcon take a much more aggressive approach to optimization and inlining, which a JIT allows you to do because you can inline just down the paths you’ve profiled, and even speculatively. If you apply that, you get some pretty strong benefits. A lot of the time you can reduce that sensitivity of, was it above 35 bytecodes, did it get inlined or not? When you’re more aggressive in your inlining, because you’ve decided you can afford to throw more CPU and work at optimization, you blow through those kinds of limitations too. You inline what needs to be inlined. You inline what escape analysis helps with. You inline where you’re hot, even if it’s fat. Yes, all those things come at a cost, but if you just decide to spend the cost, you can get some really good speed out of it, even in a just-in-time compiler.
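As a hedged illustration of that 35-bytecode sensitivity: in HotSpot, -XX:MaxInlineSize (35 bytecodes by default) caps the inlining of ordinary methods, hot methods get the larger -XX:FreqInlineSize budget, and decisions can be inspected with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining. The methods below are invented to sit on either side of such a budget.

    public class InlineBudget {
        // A few bytecodes: comfortably under the default limit, so it is
        // almost always inlined.
        static int tiny(int x) {
            return x + 1;
        }

        // A method whose bytecode size sits near the inlining threshold.
        // A seemingly harmless source change can push it over the limit
        // and silently disable inlining -- and with it, the downstream
        // optimizations that inlining enables.
        static int nearThreshold(int x) {
            int a = x * 31 + 17;
            int b = a ^ (a >>> 7);
            int c = b * 13 + (b >>> 3);
            return c ^ (c << 5);
        }

        public static void main(String[] args) {
            long sum = 0;
            for (int i = 0; i < 5_000_000; i++) {
                sum += tiny(i) + nearThreshold(i);
            }
            System.out.println(sum);
        }
    }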
AOT Shortcomings
Beckwith: I agree with that. Gil, you mentioned speculative optimizations, and the risk that comes with them. We can take the risk, which is to be on the aggressive side, or we can help the speculation by doing data dependency analysis or whatever. At Microsoft, we’re looking at escape analysis, because Gil mentioned LLVM and Graal, and I think one of the advantages there is the whole escape analysis umbrella and how we design for it, how we spread out with respect to that. That will help your inlining as well. My question was mostly about when we have AOT feeding our profile-guided optimization and such, so we don’t start in the interpreter, we just go into the AOT code. Were there any issues with getting at least the libraries and everything AOT’ed? That was my question: did we have any shortcomings?
Tene: I actually knocked it a little bit, but I think the approach that was there with the Java AOT was probably the healthier direction; as I said, you can AOT but later JIT. The reason it didn’t show a lot of value is that the AOT was fairly weak. The AOT only did C1-level optimization. C1 is very cheap, and you never keep that; you want the costly C2 optimization, or the stronger Falcon or GraalVM stuff, later anyway. The AOT wasn’t offsetting any of the JIT work. All it was doing was helping you come up a little quicker, and C1 is pretty quick anyway. If you want C1 to kick in sooner, lower your C1 compilation threshold and it’ll kick in.
The thing it was offsetting wasn’t much, and it was doing it without delivering a lot of performance for that code. It was a tight little tweak at the beginning, but it wasn’t replacing most of the JIT’ing. The cool thing would be if you could actually optimize ahead of time at the same level the JIT would, with the same speculation the JIT would use, so that the JIT wouldn’t have to do it unless you were wrong. Then you effectively get ahead-of-time JIT’ing, if you like. Think of it as: one JVM already ran through this, already has the experience of all this. It tried, it guessed all kinds of stuff, it was wrong, it learned what was wrong, and it settled on a speculatively optimized, successful, fast piece of code. What if the next JVM that ran started with that? This JVM ahead-of-time compiles for that JVM: a JIT could AOT for future runs. And a JIT could recover from a prior AOT’s speculation, which would allow the AOT to dramatically speculate, just like a JIT does.
Beckwith: You mean PGO plus AOT: get the profile information, feed it to the AOT, and then get another AOT which has this profile info. I agree.
Tene: Like I said, I hate AOT and JIT as terms, because all AOT means is not JIT, and all JIT means is not AOT. PGO, profile-guided optimization: all JITs tend to do it, and AOTs could PGO too, no problem with that. Speculative optimization? JITs speculatively optimize. You can do speculative optimizations in AOTs if you also add things to the object code that let you capture what the speculation was. If you think about it, if I compile code that is only correct on Tuesday, in most current object code formats I have no way to say this code is only correct on Tuesday. It’s fast, but when it turns into Wednesday, throw it away. There’s no way for me to put that in the object file. If you do add that, then an AOT could encode it. It could say, this is code for Tuesday, that’s code for Wednesday, that’s code for Thursday, they’re all faster, don’t run them on a Monday. Code replacement, deoptimization and on-the-fly replacement of code, rather than JIT’ing as such, is the enabler for speculation. AOTs could speculate, and AOTs could PGO, if we just coordinate on the other side. Then a JIT turns into an AOT, and an AOT turns into a JIT. There’s no difference between them, and we’re in this Nirvana place and don’t have to argue anymore.
Escape Analysis
Montgomery: Monica, you mentioned escape analysis. I won’t even say it’s a love-hate relationship; it’s a hate-hate relationship, because I can’t rely on it at all. Statically, I can look at a piece of code that has been inlined and tell visually that there’s no way it escapes, but somehow the escape analysis thinks that it does, which then blows other things up. I don’t necessarily think this is an AOT versus JIT thing. Some of the reason we don’t have things like stack allocation in Java is that it should be something that gets optimized automatically, and I agree with that. However, in practice, systems that want to rely on it can’t. It doesn’t, for me, seem to have much to do with AOT or JIT when I can look at a piece of code and know that it is not going to escape, and yet it has the effect of escaping. It feels to me that that’s where a lot of things fall down in a JIT: even in a PGO situation where you can look at it and there’s no way something can escape, a more conservative approach is taken, and it therefore does escape, although realistically it can’t. Something else makes it so that it can’t be optimized.
That’s what a lot of the AOT work over the last decades has looked at: can we make this so that it is always optimized? It seems to me that a lot of the time we look at the JIT, specifically in Java, and say it couldn’t optimize this because the chain of things that had to happen first was broken by something else. Yet an AOT analysis, whether it’s more thorough or just different, is looking at things from a different perspective. On the AOT side there are also lots of things I can think of that can defeat optimizations. My point is that escape analysis is one of those things that is always pointed at as being great, but in my experience, I just wish it would let me have stack allocation and go off and spend those cycles on something else, instead of trying to analyze it.
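A minimal sketch of the frustration being described (names invented): the allocation below never escapes the method, so escape analysis should scalar-replace it and skip the heap allocation entirely, yet whether that actually happens depends on the JIT implementation and on inlining having succeeded first.

    public class EscapeDemo {
        static final class Point {
            final int x;
            final int y;
            Point(int x, int y) { this.x = x; this.y = y; }
        }

        // Visually obvious: p never leaves this method, so escape analysis
        // can replace it with two scalars and elide the allocation. In
        // practice that only happens if the constructor and field reads
        // inline first -- the fragile dependency chain described above.
        static int manhattan(int x, int y) {
            Point p = new Point(x, y);
            return Math.abs(p.x) + Math.abs(p.y);
        }

        public static void main(String[] args) {
            long sum = 0;
            for (int i = 0; i < 5_000_000; i++) {
                sum += manhattan(i, -i);
            }
            System.out.println(sum);
        }
    }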
Printezis: Won’t you get that with value types, basically, so we don’t have to worry about escape analysis that much?
Tene: Value types will only bite off a tiny amount of that. I think this is colored by which implementations you use. Unfortunately, the C2 escape analysis has been pretty weak; it hasn’t moved much forward in the last several versions. Both GraalVM and Falcon have done a huge amount of work on escape analysis and have shown it to be very effective. I think there are two parts to this. One is, does escape analysis work or not? You could look at it and say, I can tell, but the compiler can’t tell, stupid compiler. Then just get a smarter compiler. Separately, I think what you’re also pointing to is that, regardless of whether it’s able to or not, there’s this feeling of fragility, where it worked yesterday, but something changed and escape analysis doesn’t work anymore, for whatever reason. Something changed in the lower-level code, and it seems fragile, brittle, in that sense.
There’s this sense of predictability you get with an AOT, because it did what it did, it’s done, and it’s not going to change. Whatever speed it has, it has. That’s something you could put as a check mark on the AOT side: if you run it multiple times on the same machine, with no NUMA effects and all that, you’ll probably get similar speeds. I think JITs can strive for that as well. It’s true that there’s a lot more sensitivity in the system: everything works, except that you loaded this class before that class, and that became too complicated, so it gives up, or something like that. Sometimes it’ll work, sometimes it won’t, and you get that feeling.
I do want to highlight that escape analysis is very powerful, and we’re not alone in showing that. Escape analysis combined with inlining is very powerful; usually, escape-analysis-driven inlining is very powerful. There’s one other part. There are the escape analysis cases where you look and say, there’s no way this could escape, so why isn’t it doing it? It really should be catching it. Then there are all these cool partial or speculative escape analysis things you can do, where you say, this could escape, but in the hot path it doesn’t, so let’s version the code. The JIT will actually split the code and have a version that has the escape analysis benefits, and if you’re lucky, 99% of the time you go there and you get the speed. Elsewhere, it could escape; that’s a different version of the generated code.
Again, one of the powers of a JIT compiler is that you can do that, because you can survive the combinatorial mistakes. If you do deep inlining and cover all the paths, the problem explodes and becomes totally impractical, even with a year’s worth of optimization time. If you only optimize the paths you believe happen, and then survive taking other paths through the deoptimization mechanisms, you can afford to do very aggressive escape analysis and inlining together. Both Falcon and GraalVM show that. You see amazing, like 30%, 40% improvements in linear speed as a result of these things now. They’re certainly paying off.
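A sketch of the code versioning just described, with invented names: the object escapes only on a rare path, so a partial-escape-capable JIT can keep the hot path allocation-free and sink the allocation into the rare branch, while a conservative whole-method analysis treats the object as escaping everywhere.

    import java.util.ArrayList;
    import java.util.List;

    public class PartialEscape {
        static final class Event {
            final int id;
            final long payload;
            Event(int id, long payload) { this.id = id; this.payload = payload; }
        }

        static final List<Event> auditLog = new ArrayList<>();

        // The Event escapes only when audit is true. Partial escape analysis
        // versions the code: the common path gets a scalar-replaced,
        // allocation-free version, and the allocation is materialized only
        // inside the rare branch.
        static long process(int id, long payload, boolean audit) {
            Event e = new Event(id, payload);
            if (audit) {
                auditLog.add(e); // the only place e escapes
            }
            return e.id + e.payload; // hot path needs no allocation
        }

        public static void main(String[] args) {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += process(i, i, i % 100_000 == 0); // escape is rare
            }
            System.out.println(sum + " audited=" + auditLog.size());
        }
    }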
Beckwith: There are so many. During our investigation, and we’ve shared this on OpenJDK as well, we’ve seen certain optimization opportunities that we could bring to OpenJDK that are currently missing. It’s exactly what you said, Todd and Gil: it’s the conservative approach versus a little more aggressive one. Partial escape analysis is another great, more aggressive approach as well. In OpenJDK, we’ve just scratched the surface of escape analysis. I think escape analysis was put into OpenJDK to show that it can be done, and now we just have to get it right. Maybe it has taken many years, yes, but we’re getting there.
Tene: My take is that what we need in OpenJDK is a modern JIT compiler, so we can build all this into it. Really, we have a 23-year-old JIT compiler in HotSpot, which got us this far, but it’s really hard to keep moving it forward, which is why it tends to fall behind on some of the more modern and more aggressive optimization techniques. It’s not that it’s unable to do them, it’s really good, but enhancing it is slow. This is where you can look at multiple newer JITs out there. Obviously, our approach has been to take LLVM and use it as a JIT; we contributed a lot to LLVM to make it usable as a JIT, and we use it that way. GraalVM has the Graal JIT compiler, which is a more modern JIT compiler. OpenJ9 has its own. There are also projects within OpenJDK for future JDK work, and we’ll see which optimizers go in; actually, we’re going to see more than one. Really, in my opinion, and this is based on some experience of trying to do otherwise, it’s hard for us to enhance C2 with velocity to do these optimizations, which is why we invested in a different one. I think OpenJDK will eventually have a different JIT that will allow us to get a lot more of these optimizations into mainstream OpenJDK as well.
Printezis: At Twitter, a lot of our services use Graal, and we have a lot of Scala code. We see a lot of benefit for several of our services using Graal versus C2. We did some looking into it, and we believe that a lot of the benefit is because of the better escape analysis that Graal has, at least for the versions we have tried.
Tene: We do a lot of testing with Finagle code, which you guys created and is Scala-based, and we regularly see 15% to 25% performance improvements, driven strongly by escape analysis and vectorization. Auto-vectorization is amazing, but you need a JIT compiler that does it. Modern hardware has amazing vectorization capabilities, built for more power and higher speed.
Printezis: The version of Graal we’re using, though, was actually doing a pretty poor job with vectorization; it was not doing any vectorization. I don’t know whether they’ve published their vectorization code as open source. I think it was proprietary.
Tene: We use the LLVM auto-vectorizer, which Intel and AMD and Arm all contribute the backends to, the backends that match the hardware, so we get to leverage other people’s work, which is pretty cool. Most of what we’ve done is massage the Java semantics into it so they get picked up. When you look at what modern vectorizers can do, which wasn’t around until about six, seven years ago: you can vectorize loops with ifs in them, and things like that, which would have seemed unnatural before, because vectors now have predicates on predicates on predicates. In fact, I recently tried to create code that can’t be vectorized, and it was hard. I had to step back, because even what I tried got vectorized. I had to think hard: what can I do that it can’t possibly vectorize? I had to work at it, because everything I threw at it, it just picked up and used the vector instructions for.
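An illustrative example of the kind of loop being discussed: the branch in the body used to defeat auto-vectorizers, but a modern one can turn the comparison into a per-lane mask (AVX-512 mask registers, Arm SVE predicates) and the store into a masked vector store. Whether a particular JIT or AOT compiler vectorizes it is, of course, implementation-dependent.

    public class MaskedLoop {
        // A loop with an if in the body. A modern auto-vectorizer compiles
        // the comparison to a vector mask and the assignment to a masked or
        // blended vector store, handling many elements per instruction
        // instead of branching per element.
        static void clampNegatives(int[] a) {
            for (int i = 0; i < a.length; i++) {
                if (a[i] < 0) {
                    a[i] = 0;
                }
            }
        }

        public static void main(String[] args) {
            int[] data = new int[1_000_000];
            for (int i = 0; i < data.length; i++) {
                data[i] = (i % 2 == 0) ? i : -i;
            }
            clampNegatives(data);
            System.out.println(data[1]); // 0: the negative lane was clamped
        }
    }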
Montgomery: We’ve been here before, where you and I work at it and can’t break it, but somebody else tries something we didn’t think of, and all of a sudden it’s slow again. That’s a never-ending cycle.
Tene: You’re right, the fragility is there, but I actually don’t think the fragility is as much about the JIT as about the optimizers themselves. If you change a line of code, whatever optimizer you have, you might have just gone outside the scope [inaudible 00:35:59], and it gives up on stuff it can do. JITs are probably a little more sensitive, because most stuff moves around, but an AOT compiler is just as sensitive to code change as a JIT is.
Montgomery: I’ve spent a lot of my career chasing down and going, I compiled this yesterday, and nothing changed. All of a sudden, now it won’t optimize this, what is going on? That is across both approaches. You try to minimize it, but it does happen. I totally agree on that.
Tene: One thing that is different in this regard is cache line alignment. A speculative optimization can rely on cache line alignment: you run, your arrays happen to be aligned, and everything’s fine. Then in the next run, malloc was off by 8 bytes, they’re just not aligned, and the code doesn’t behave the same way. That’s just two runs, one after the other, with the same code, AOT or not, and different results.
Beneficial Examples of JIT and AOT Utilization
Printezis: Can you give some good examples of where a JIT and where an AOT can be used and be beneficial?
I would guess that in most cases, for any application that runs for a non-trivial amount of time, not just for 5 seconds or so, a JIT will work pretty well; the application will get more benefit out of it. Maybe you can use some AOT to have a better starting point and save on startup, but for the long term, a JIT will do a much better job. I think there are some cases where it makes sense to just use AOT. If you want to implement something like ls in Java, you don’t necessarily want to bring up an entire JVM to look at some directories and be done. I’m not picking on ls; just, if you have a small utility that’s going to run for a very short period of time, generating a binary and AOT’ing everything is going to be the right approach. That’s how I see it.
Montgomery: Actually, it’s not only time related. It’s the same thing if you’ve got something that’s totally compute-bound, simply straight compute: then AOT is going to be the same as your JIT. The downside to the JIT in that case is that it has to wait and learn, so there’s a startup delay. Again, that can be addressed with other things. It is a good point, though, that certain things don’t need the JIT and would react much better to AOT. That’s the minority of applications, though. Most applications, especially enterprise applications for business and stuff like that, almost all of them have way too much conditional load; the way they work varies between times of day and stuff like that, so JIT is a much better approach, honestly, if you have to pick between the two.
Tene: If you’re going to do AOT, take a hard look at PGO for AOT, because having the AOT optimize given actual profiles makes for much better optimizations. Even then, speculation is extremely [inaudible 00:39:43], data-driven speculation. The belief that no number will be larger than 100, and the ability to optimize because that’s what you believe, is something you can’t do in an AOT unless you can later survive being wrong. Sometimes you’ve got pure compute on the entire spectrum of data, and you’re going to hit all the combinations. But a lot of the time, all you’re doing is processing U.S. English strings, and U.S. English strings do fit in 8 bits. You can speculate that everything you’re ever going to see is like that, and you’ll get better everything. An AOT just can’t deliver that, because the binary might run in China and it won’t work, whereas a JIT can survive that. There are data-driven optimizations, speculative value-range-driven optimizations, and they’re just more powerful if you can survive getting them wrong and replacing the code. That’s fundamentally where the JIT wins on speed. Where it fundamentally loses is that it takes a lot of effort at runtime to get that speed, so there are tradeoffs; people tend to turn down the optimization capability because they don’t want to wait 20 minutes for speed. But I do think it hands-down wins if you can speculate. The real trick is, how do we get AOTs to speculate? I think we can.
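A hand-written analogue of that data-driven speculation (names invented; a real JIT does this invisibly, replacing the explicit fallback with deoptimization and recompilation): speculate that every character fits in 8 bits, run a cheap guarded fast path, and fall back to the always-correct general path the moment the guess is wrong.

    public class RangeSpeculation {
        // Fast path: speculates that all chars are Latin-1 (fit in 8 bits).
        // The guard plays the role of a JIT's uncommon trap: if the
        // speculation ever fails, bail out to the general path.
        static int checksum(char[] in) {
            int sum = 0;
            for (char c : in) {
                if (c > 0xFF) {                 // guard: speculation violated
                    return checksumGeneric(in); // "deoptimize" and redo
                }
                sum += c;                       // compact 8-bit fast path
            }
            return sum;
        }

        // General path: correct for the entire spectrum of data.
        static int checksumGeneric(char[] in) {
            int sum = 0;
            for (char c : in) {
                sum += c;
            }
            return sum;
        }

        public static void main(String[] args) {
            System.out.println(checksum("hello".toCharArray())); // fast path
            System.out.println(checksum("你好".toCharArray()));   // falls back
        }
    }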