Article originally posted on InfoQ.
Mehmandarov: Have you ever created data structures, and put your data into those and actually thought about how much memory do those take inside your application? Have you thought what will happen if you multiply that tiny little set, or a list, or whatever you created by hundreds, thousands, or millions, and how much memory that would take and how much memory savings you can have? This talk is about all that. This talk will be talking about performance and scale in Java and how you can handle both domain-oriented objects and tabular data structures in JVM, in Java in general.
Raab: My name is Donald Raab. I’m a Managing Director and Distinguished Engineer at Bank of New York Mellon. I am the creator, project lead, and committer for the Eclipse Collections project, which is managed at the Eclipse Foundation. I’m a Java Champion. I was a member of the JSR 335 Expert Group that got lambdas and streams into Java 8. I am a contributing author to the, “97 Things Every Java Programmer Should Know.”
Mehmandarov: My name is Rustam. I am a Chief Engineer at a company in Norway, Oslo called Computas. I’m a Java Champion as well, and also a Google Developer Expert for Cloud. I am a JavaOne Rockstar. I also have been involved throughout times in different communities, and I’m leading one of those at the moment as well.
The Problem with In-Memory Java Architectures (Circa & After 2004)
Let’s talk about a little bit of history. Rewind a tiny little bit back and talk about what was happening in 2004 and why it’s important.
Raab: Back in 2004, to give some context, I was working on a financial services Java application, and I had a problem: it didn’t fit into a 32-bit memory space. My mission was to try and fit 6 gigabytes of stuff into 4 gigabytes of space. Once I dug into the problem, it turned out there were millions of small list and map instances roaming around that were created with default-sized arrays. At the time, the solution we came up with was to roll our own small-size Java collections. For those who know the movie The Martian, I felt a bit like Mark Watney, where he said, “In the face of overwhelming odds, I’m left with only one option, I’m going to have to science some stuff out of this.”
Fast forward to after 2004, when we got 64-bit JVMs with the JDK 1.4 release. We could now access more memory on our hardware, so the software problem of accessible memory became a hardware problem: over a period of time, how much memory do you actually have available to access? We wound up pushing hardware limits in some cases with some of our heap sizes. Compressed OOPs gave us a great benefit when they became available in JDK 1.6 in 2006. We then had the ability to have 32-bit references instead of 64-bit references in our heap. That saved a significant amount, but then we got back to a software problem again. Now that we have the sweet spot for 32-bit references, we have to keep the heap below 32 Gigs. There’s also the ability to play around with different shifting options, where you could maybe get 32-bit references up to 64 Gigs, but you’ll get a different alignment, so it might cost you 16-byte alignment versus 8-byte. You’ve got to be careful with that. 32 Gigs winds up being the sweet spot. Because of this, we wound up rolling our own solutions for our own mutable sets, maps, and lists, and eventually also built our own primitive collections. The whole goal was to reduce the total memory footprint of our data structures and make sure our data was taking the most space, not the data structures.
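The 32 Gig and 64 Gig figures follow from simple arithmetic: a compressed reference stores an object address divided by the object alignment, so 2^32 distinct values reach 2^32 times the alignment in bytes. A minimal sketch of that calculation follows; the class and method names are made up for illustration, and the 16-byte case assumes HotSpot is started with a larger object alignment (for example via `-XX:ObjectAlignmentInBytes=16`).

```java
// Sketch: how far a 32-bit compressed reference can reach for a given object alignment.
// Class and method names are illustrative only.
class CompressedOopsReach {

    // A compressed reference holds (address / alignment), so 2^32 values
    // can address 2^32 * alignment bytes of heap.
    static long maxHeapBytes(int objectAlignmentBytes) {
        return (1L << 32) * objectAlignmentBytes;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        System.out.println("8-byte alignment:  " + maxHeapBytes(8) / gib + " GiB");
        System.out.println("16-byte alignment: " + maxHeapBytes(16) / gib + " GiB");
    }
}
```

The trade-off mentioned in the talk shows up here: doubling the alignment doubles the reachable heap, but every object is then padded to a 16-byte boundary.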
The (Simulated) Problem Today
Mehmandarov: The problem today, I call it a simulated problem because we have actually created a little bit of that problem ourselves. Now you have 64-bit systems and a bunch of memory. Everything is nice and shiny. The problem is still there, because you still need to process more data and you would like to do it in the most efficient way. What we’ve created to illustrate this problem is a set of data, put into a CSV file, that we would like to read in. Since both of us do quite a bit of conferences, we decided to go with conference data. I’ll show you the data and what it consists of on the next slide. For now, we just want to say that we need to read in data, we need to process it, and we want to do it in the most efficient way. How can we do that in Java? What decisions can we make to make it more efficient? How should we think about our data? Should we think of it as rows of data or columns of data? Which one is better? Also, which libraries will we be looking at? We’ll be looking at three different ways of doing things. We’ll be looking at the batteries-included version: the Java Collections that Java comes with. We’ll also be looking into Eclipse Collections, a library that Donald started and that a bunch of people have been contributing to and evolving. We’ll also be talking quite a bit about DataFrame-EC, a library built on top of Eclipse Collections by another developer, Vlad. EC stands for Eclipse Collections. This talk will focus on memory and memory efficiency, and the strategies and techniques connected to that. We’ll not be talking about other kinds of performance tuning.
Sample CSV Data to Load
Let’s talk about data. The data, the way it looks, it’s a bunch of columns, like typically what you would expect from your data. It has an event name. It has a country where that event exists or will take place. It has a city where it will take place. It has two sets of date objects or dates, where we have a start date and end date. It also has some indication of what session types does it have, so a list of elements, so it might be lightning talks, regular talks, or workshops. We have some integer values, or typically values that would be represented by integers: number of tracks, or number of sessions, or maybe even number of speakers, or cost, so how much that would cost. To create all that, we have a script or a Java class that randomly generates data. Then we can create and then generate that in a deterministic way, where we can just randomly generate a number of random strings going as names between a certain length of a string. It can be a little bit smaller or a little bit bigger, but roughly the same size and randomly generated. Same goes with the countries. We have a set of countries that we pull from, and we can use those countries. Same goes for the cities. We also have dates. Dates are, in our dataset, just for our testing purposes. We limited that to be all possible dates within 2023. Also, session types. You can choose between those three that you can see there. It can be different variations of those and different number of those. Also, the other numbers are going to be random within a certain range. We’ve generated it in different shapes and sizes. We generated it for 1 million rows. That takes roughly 90 Megs on disk. We have a 10 million one which is 10 times bigger, so roughly 900 Megs. We also have 25 million, which takes roughly 2.2 Gigs on disk.
Measuring Memory Cost in Java
Measuring memory cost, so how do we do that?
Raab: This is a practical thing to walk away from this talk with is, there’s a tool at the OpenJDK project called Java Object Layout, referred to as JOL. It’s a great tool for analyzing the object layout schemes of various objects. It’s very precise in terms of you can look at individual objects and look at their cost and layout. You’ve got the Maven dependency here. You can go check that out. Also, there’s a nice article here from Baeldung talking about memory layout of objects, and actually talks about JOL in more depth. There’s also a Stack Overflow question, which was in regards to whether or not JOL works with Java records. It does. Pretty much it requires using a VM Flag when you’re running with Java 14 or above. If you want to use records, you got to set this magicFieldOffset to true.
Mehmandarov: Memory considerations, there are actually quite a few things. We talk about boxed versus primitive. I’m just going to list them up, and then leave the explanation of that as a cliffhanger. Boxed versus primitive is a thing that we’ll be thinking about, and we have been thinking about, and we’ll be talking about. Also, we’ll be talking quite a bit about data structures that are mutable versus immutable. We’ll be talking about data pooling, and what that actually means and what results it will give you. We’ll also talk about way of thinking of data. Do you want to think about it in a row-based way? You typically think about your data in relational databases and stuff, like row and row, another row and another row of data, versus column based. That’s another thing. We’ll also be talking a bit about how memory can be improved or will be improved in the future by the things that are planned to become a part of Java as well.
Memory Cost – Boxed vs. Primitive
Raab: We’re looking at two classes here, and there’s some giveaways. On the left, you got the MinMaxPrimitivesBoxed, and you’ve got a set of fields which represent the min and max values of different types. On the right, you’ve got something that looks very similar, different class name, all the field names are the same, and the primary difference is the type. On the left, we’ve got the type with the uppercase letter, on the right, the type with the lowercase letter. A question for you to think about is, what is the difference going to be between these two classes, specifically from a memory footprint perspective?
What we’re going to do is use JOL to print out the memory footprint of these two objects. You can see we’re using GraphLayout.parseInstance. We give it an instance of MinMaxPrimitivesBoxed, and we print out its footprint to System.out. Then we do the same thing with MinMaxPrimitivesPlain, and print out its footprint. For the type on the left, which had the uppercase type names like Boolean, Byte, and Character, what you’ll see is that we actually created 17 different objects in memory. There’s the object that we wanted, the MinMaxPrimitivesBoxed, and then we wound up creating two objects for each of the min and max values of these boxed types. In total, that’s 17 objects and 368 bytes of RAM for that single instance. In the middle, in the average column, you can see the cost of each of these boxed types. It’s a useful metric for understanding what these boxed types actually cost when you use them in memory. The MinMaxPrimitivesBoxed itself takes 80 bytes. That 80 bytes is counting up the 4-byte references plus the object header cost. So in addition to the cost of each of these objects, you’ve got the reference to each object contained in the main object, which brings the total to 368. If we then look at MinMaxPrimitivesPlain, you’ll see we get one object in total. We don’t have to create 16 extra objects, we just have the primitive values, and the cost of that object is less as well. It’s 8 bytes less than the cost of the MinMaxPrimitivesBoxed object alone, and we also avoid the 16 extra boxed wrappers. Our recommendation here is pretty simple: don’t box primitive values. Understand what the cost of doing that is. Autoboxing in Java can be a useful feature, but it’s also somewhat evil in that it silently creates objects on the heap for you, taking up memory.
Using autoboxing, you may actually be hiding memory bloat.
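The 368-byte and 72-byte figures can be reproduced with back-of-the-envelope arithmetic. The sketch below is an estimate, not a measurement: it assumes a 64-bit HotSpot JVM with a 12-byte object header, 4-byte compressed references, and 8-byte object alignment, and the class and method names are invented for illustration. JOL remains the tool to use for real numbers.

```java
// Estimate of the boxed vs. primitive footprint discussed above.
// Assumptions (not measured): 12-byte header, 4-byte references, 8-byte alignment.
class BoxedVsPrimitiveEstimate {
    static final int HEADER = 12;
    static final int REFERENCE = 4;
    static final int ALIGNMENT = 8;

    // Payload bytes for boolean, byte, char, short, int, long, float, double.
    static final int[] PAYLOADS = {1, 1, 2, 2, 4, 8, 4, 8};

    // Round an object size up to the next multiple of the alignment.
    static int align(int bytes) {
        return (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    // Holder object with 16 references, plus one min and one max wrapper per type.
    static int boxedTotal() {
        int holder = align(HEADER + 2 * PAYLOADS.length * REFERENCE);
        int wrappers = 0;
        for (int payload : PAYLOADS) {
            wrappers += 2 * align(HEADER + payload);
        }
        return holder + wrappers;
    }

    // Single object holding the 16 primitive fields directly.
    static int plainTotal() {
        int fields = 0;
        for (int payload : PAYLOADS) {
            fields += 2 * payload;
        }
        return align(HEADER + fields);
    }

    public static void main(String[] args) {
        System.out.println("boxed estimate: " + boxedTotal() + " bytes");
        System.out.println("plain estimate: " + plainTotal() + " bytes");
    }
}
```

Under these assumptions the estimate comes to 368 bytes for the boxed version (80 for the holder plus 288 for the 16 wrappers) and 72 bytes for the primitive version, matching the numbers quoted in the talk.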
Memory Footprint – Boxed vs. Alternate vs. Primitive Sets
Now we’re going to take a look at boxed versus primitive data structures, with an alternate data structure in there as well. Here, we’re creating three different sets. One is a java.util.HashSet. The second is a UnifiedSet from Eclipse Collections. Each is going to be a set of numbers from 1 to 10. The first two are boxed. You can see they are a HashSet of Integer and a MutableSet of Integer. The third one is not boxed, it’s primitive. Here, we’re going to compare two boxed set implementations and a primitive set implementation using JOL, doing the same thing as before: parsing each one and printing its footprint. What do you think the difference is going to be between these three classes?
Looking at HashSet, what you’ll see is like, we wind up actually creating 24 objects here. Ten of the objects are the boxed integers and they’re taking 160 bytes. We have the HashSet itself. Then contained within a HashSet is the HashMap. You’ll see like within a HashMap, you’ve got an array of this HashMap$Nodes, and you’ve got 10 of these HashMap$Nodes. These are basically the map entry objects contained inside of the map. Now the interesting thing is like, a set containing a map, winds up carrying along with it all this extra infrastructure to support its setness. If you then compare it to UnifiedSet from Eclipse Collections, so the set for Eclipse Collections doesn’t contain a HashMap. You immediately cut the number of objects that you’re creating in half. We wind up with an array of objects. We still have the 10 integer objects. We can’t get rid of those because this is boxing happening. Then we’ve got the UnifiedSet cost itself. A big difference there, from 640 bytes down to 272, so more than cutting in half.
The third thing we can look at is the primitive set, so IntHashSet. IntHashSet, what you can see is we get rid of the integer objects, so that’s 160 bytes gone. We have an int array, which is that [I. Then we have the cost of IntHashSet itself, which is 40 bytes. In total, it’s 120 bytes. From UnifiedSet to the IntHashSet, you can see it’s a tremendous savings. Once again, more than cutting in half in terms of the memory. Recommendation here, avoid using java.util.HashSet, unless you’re using in a place where you’re going to create it and then get rid of it. Don’t hold on to it long term, because it’s a lot of cost for the value it’s providing of being a set. It’s a memory hog. Also, remember, don’t box primitive values if you can. Autoboxing is evil. You can see we get that 160 extra bytes for integer objects when they’re just int values, they should be 4 bytes each. It’s hiding bloat in your heap, potentially.
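The three totals (640, 272, and 120 bytes) can be reconstructed from per-object sizes. The sketch below is an estimate under stated assumptions: 12-byte headers, 4-byte compressed references, 8-byte alignment, and a backing table of length 16 in each set. The sizes for HashSet (16), HashMap (48), HashMap$Node (32), the shared value marker object (16), and UnifiedSet (32) are assumptions chosen because they are consistent with the reported totals; only the 160 bytes for the Integers and the 40 bytes for IntHashSet come directly from the talk. All names are illustrative.

```java
// Estimated breakdown of the three set footprints for 10 elements.
// Per-object sizes are assumptions that reproduce the totals quoted in the talk.
class SetFootprintEstimate {
    static final int INTEGER = 16;       // boxed Integer: 12-byte header + 4-byte int
    static final int ARRAY_HEADER = 16;  // 12-byte header + 4-byte length

    // Size of an array with 4-byte elements (int values or compressed references).
    static int arrayOf(int length) {
        int raw = ARRAY_HEADER + length * 4;
        return (raw + 7) / 8 * 8;
    }

    // HashSet wraps a HashMap: one node per element plus a shared value marker object.
    static int hashSet(int elements, int tableLength) {
        int hashSet = 16, hashMap = 48, node = 32, valueMarker = 16;
        return hashSet + hashMap + valueMarker + arrayOf(tableLength)
                + elements * (node + INTEGER);
    }

    // UnifiedSet stores elements directly in its table: no map, no nodes.
    static int unifiedSet(int elements, int tableLength) {
        int unifiedSet = 32;
        return unifiedSet + arrayOf(tableLength) + elements * INTEGER;
    }

    // IntHashSet stores int values in an int[]: no boxed Integers at all.
    static int intHashSet(int tableLength) {
        int intHashSet = 40;
        return intHashSet + arrayOf(tableLength);
    }

    public static void main(String[] args) {
        System.out.println("HashSet<Integer>:    " + hashSet(10, 16));
        System.out.println("UnifiedSet<Integer>: " + unifiedSet(10, 16));
        System.out.println("IntHashSet:          " + intHashSet(16));
    }
}
```

Reading the breakdown this way makes the recommendation concrete: of the 640 bytes in the HashSet, 320 go to node objects that exist only because the set is backed by a map.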
Memory Footprint – Mutable vs. Immutable Sets
Next thing we’re going to look at is mutable versus immutable. Here, we’re going to compare two things that are both in the JDK. We’re going to compare a HashSet holding on to two integer values, 1 and 2, to the immutable set in the JDK. In this case we’re going to create a copy of the HashSet to save some code. We do Set.copyOf, and what this is going to do is create an immutable set for us. Then we’re going to compare their two footprints. If you look, here’s the thing you just saw on the previous slide, the footprint. This is only holding on to two integers now. Those cost 32 bytes. Then you’ve got the cost of the data structure itself. For a HashSet with two elements, you can see it’s 272 bytes. Compare that to this special type called Set12, which is an inner class in ImmutableCollections. All you have is the two integers, 32 bytes, and then 24 bytes for the Set12, for a total of 56 bytes. That’s a really tremendous savings in the immutable range. Once again, avoid java.util.HashSet if you can. The recommendation here is, when you’re loading data up, if you can trim it at the end using immutable data structures, you can save a lot of memory that way. I would say load mutable, because it’s going to be faster performance-wise to grow something that’s mutable. At the end, if you can trim and wind up with just the memory required, going to an immutable version of the collection is very helpful.
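The "load mutable, then trim to immutable" recommendation can be sketched with JDK classes only (Java 10 or later for `Set.copyOf`). The class and method names here are illustrative; the printed implementation class is a JDK internal detail and may differ between versions, so the sketch prints it without relying on it.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: grow a mutable set while loading, then trim to an immutable copy.
class TrimToImmutable {

    static Set<Integer> loadThenTrim() {
        Set<Integer> mutable = new HashSet<>();   // cheap to grow while loading
        mutable.add(1);
        mutable.add(2);
        return Set.copyOf(mutable);               // compact, unmodifiable copy
    }

    // Returns true if the set refuses modification.
    static boolean rejectsAdd(Set<Integer> set) {
        try {
            set.add(3);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Set<Integer> trimmed = loadThenTrim();
        // Implementation class is an internal detail of the JDK in use.
        System.out.println(trimmed.getClass().getName());
        System.out.println("rejects add: " + rejectsAdd(trimmed));
    }
}
```

Once the mutable HashSet is no longer referenced, only the compact copy stays on the heap.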
Memory Comparison – Sweating the Small Stuff
Then we can talk about sweating the small stuff. We looked at a set in the two-element range. What other optimizations can happen in that small element range? We’re going to look at the immutable list space. In the JDK, there wind up being two optimized immutable list types. There’s List12, which covers one element and two elements. Then there’s ListN, which is basically going to be an array trimmed to the number of elements that you have. Comparing JDK to Eclipse Collections, this is what the differences look like from size 0 up to size 11. You’re looking at memory in bytes, so the smaller the better. You can see there’s a reasonable gap at each level between JDK and Eclipse Collections. The reason is that Eclipse Collections actually has named types for each size, starting from 0. At size 0, even though it looks like the JDK costs a lot more than Eclipse Collections, it’s completely meaningless, because there’s only one empty instance in the JVM. You actually have a singleton instance of that. You only ever create one. The multiple for it is one, and it’ll never be more than that. With the other sizes you’re going to wind up with a multiplier effect, depending on how many you have. If you have millions of them, that’s where savings can add up. Eclipse Collections has named types from singleton all the way to decapleton, and everything in between. Then at 11, it switches to be array based, which is where JDK and Eclipse Collections get very close in terms of cost.
There’s a thing to be aware of with these memory savings: there’s potentially a tradeoff in performance. The core JDK team made sure that they limited the number of classes they created, to try and reduce the possibility of megamorphic call sites sneaking into your code. When you have a call site that is bimorphic or monomorphic, it’s very fast. Once you enter into the range of megamorphic, where you’ve got three or more implementations of an interface, you wind up with a virtual lookup and method call. That gives you a potentially significant performance hit. You’ve got to be aware of what’s more important to you. Is it memory or is it performance? Make sure you understand your performance profile and where you have potential bottlenecks, and just be aware of the tradeoffs. In 2004, I had the problem that I needed to shave as much memory as I could to fit into a 32-bit memory space. That’s what was driving the design decision to have all of these smaller types available.
Exploring Three Libraries
Let’s go on and talk about exploring the three libraries we looked at.
Mehmandarov: We talked quite a bit about how should we present that to you. We’ve mentioned already all of them. We’ve talked about Java Streams, Eclipse Collections, and DataFrame-EC. We thought of, what is the best way of actually introducing them and actually explaining to them. If you’ve used or played or have seen Legos before, you have seen them in lots of different shapes and sizes and different things you can build into things. In general, there, you will see three different types of sets. You’ll have the basic ones that are consisting of more generic pieces. That’s what Java Streams is. As fun, we actually looked at the age limits for those sets that we see in the picture here. It’s a fun comparison also for us as a programmer, so think about that as well. Call it maturity, or experience, or whatever. It doesn’t really translate to that, obviously, but it’s a little fun thing that made us giggle a little bit. Java Streams is basic building blocks. It has quite a bit of assembly required to make it look like a car, but it’s also a standard set of things that you can build into a house or an airplane or a boat or something. Most of the things. With Java Streams, it’s the same. Some assembly is required. More of a low-level control, so you can actually build your own stuff in different ways. Also, Java Streams has this row-based approach to domain objects. You have domain objects and you have row-based approach to those. You put a bunch of attributes into an object, and they represent a row in a database.
Eclipse Collections is a little bit different. It’s closer to what you see in Lego Technic, where you have all the pieces from the standard and basic sets, but also extra pieces that are very specialized to build one particular thing: a particular set of cogs to do a transmission box for a car, or a tiny little piece to do the grille on a car. It’s not a standard piece that you can use on many other things. It’s the same thing with Eclipse Collections. It has more specialized building blocks, but it’s still compatible with the basic ones, and it’s also optimized for performance. It’s optimized for building that particular car, obviously. It still has the row-based approach to domain objects. You still put things into objects, and you still handle them as a bunch of attributes inside an object.
DataFrame-EC is a little bit different. It has a bit more of an approach similar to Lego Mindstorms, where it has a smart thing in the middle there that can be programmed to do things. The way I like to think about DataFrame-EC, which is probably not exactly the correct way, but it still helps me to build a mental model is think of it as a spreadsheet. Where you have both data, but also smartness and filtering that you can build into a spreadsheet that you typically would have on your computer, whatever program you’re using for that. Here, it’s like more specialized building blocks, even more specialized. They’re still compatible with the two previous libraries or versions of doing things. It is a much more higher-level approach. It also can be programmed in a more specific way to do specific tasks. It simplifies some of the tasks that can be done by the other ones, but are a bit more tedious. This one actually has a different approach. It has a column-based approach to the tabular data structure. Now we’re looking at data in columns instead of looking at it in rows.
The Conference Explorer class that we had to implement to read all that data that we’ve shown you earlier, it’s in at least two of the three cases that we’ve been playing around with for Java Streams and Eclipse Collections. Remember, row-based approach? That is a record. That’s implemented as a record where we have a bunch of attributes that we put into that. Then, for Java Streams, we create a set of conferences and countries. For Eclipse Collections, we do a specific type of a set, which is ImmutableSet, which is an Eclipse Collections specific object, of again, set conferences and countries. For DataFrame-EC, things are a little bit different. Here, we actually have DataFrame objects for all of those things. Now we have a DataFrame for conferences, DataFrame for country codes, and all those kinds of things. You see, it’s not countries, it’s actually country codes, because all the countries and everything, it’s inside the conferences DataFrame. Country codes we need for other things to generate some other fun things. If you want to know about APIs, and all these, how it’s been implemented, we’ve done a few talks at Devnexus and Devoxx Greece, where you can actually see the same data, see the same code, but where we talk more about using the APIs and how you can use it in different settings.
Conference Explorer – Memory Cost Comparison (Mutable vs. Immutable)
The question here is, what would it cost to load 1 million conferences into data structures like these? Let’s find out. The first thing we would like to see is how it behaves by type of library and type of data structure. For each library, Java, Eclipse Collections, or DataFrame-EC, we’ll look at mutable and immutable data structures. You can see the memory footprint of the mutable structures is much bigger, at least for Java sets. It gets smaller for the other ones.
Raab: A Java set is exactly the set we warned you about in the previous slides, that’s java.util.HashSet, and you’re seeing the extra cost there or how it translates as you scale up.
Mehmandarov: Still, even for the other ones, it’s higher. This is for 1 million conferences. It’s not for 25, or 50, or whatever. The funny thing is that you can also see the DataFrame-EC one, which obviously doesn’t have a mutable version, but the immutable version of it is even half the size. If you wonder why, we’ll see that in a bit. We’re going to leave that as a question: why are the collection alternatives comparing so badly to DataFrame-EC? One of the main answers is something called pooling. Now we’ve done the same thing per library, with only immutable data structures, but with and without pooling, based on our dataset. This is the graph. You will not see exactly this for your own data if your data is different from ours. For us, it looked like that. By introducing pooling, we cut the size of the whole thing in half. Like I said, it’s not the data structure. It’s exactly the same data. It’s exactly the same data structures. The only difference is whether we do pooling or not. We implemented custom pooling using Eclipse Collections for the Eclipse Collections-based solutions. The important recommendation here, if you’re going to take away anything from this slide, is that you should understand your data and analyze it using tools like JOL or something else, to understand how it behaves and where you can optimize it.
What Is Pooling?
What is pooling? Can you explain a little bit about that?
Raab: The first thing we’re going to talk about is, what is a pool? The way I would describe a pool, it’s a set of unique values that you can put things into and get them out of. If you think about a HashSet in Java, you realize that with the Set interface, you can add to it, but you can’t actually get anything out of it. You can check for containment. If only a set had a get method. It turns out that in Eclipse Collections, our UnifiedSet is actually both a set and a pool, so we do have a get method on our set. What is a pool useful for? Why would you want to have a get method on a set? It helps you reduce the number of duplicate items that you have in memory for a specific set of data. Think of it basically as a HashMap where the key and value are the same thing. I want to look up this key, and if I have that key, I want to get back this value, and I only want to keep this one value in memory. Since key and value are the same, if you have a get method, you’re looking up the key and you get back the key. That’s really what it comes down to.
In the JDK, there are different kinds of pools. They’re not implemented as sets, but there’s a type of pooling, if you haven’t heard of it before, that gets used for literal strings and String.intern. There is an internal pool that the JVM manages, that literal strings use, and you can use it as well. It’s a method on String for pooling. There are articles that have existed for quite a long time, written over the years, explaining when and when not to use String.intern and the different issues with it. It is available there. It is a pool. There are also pools available on the boxed wrappers that get used through autoboxing. There’s a valueOf method on each of the boxed wrappers like Boolean, Short, Integer, and Long. There is a range of values that are cached or pooled for each of these types. For Integer there are 256 cached values, from negative 128 to 127. They keep these small integers, both negative and positive, in a cache.
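The "HashMap where the key and value are the same thing" description can be turned into a few lines of JDK-only code. This is a minimal sketch of the idea, not the Eclipse Collections implementation; `SimplePool` and its methods are hypothetical names, and it assumes non-null values with consistent `equals` and `hashCode`.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a pool: a map whose key and value are the same object,
// so equal values can be replaced by one canonical instance.
class SimplePool<T> {
    private final Map<T, T> canonical = new HashMap<>();

    // Returns the instance already in the pool if an equal one exists,
    // otherwise stores and returns the given instance.
    T pool(T value) {
        T existing = canonical.putIfAbsent(value, value);
        return existing == null ? value : existing;
    }

    int size() {
        return canonical.size();
    }

    public static void main(String[] args) {
        SimplePool<String> cities = new SimplePool<>();
        String first = new String("Oslo");    // two equal but distinct objects
        String second = new String("Oslo");
        System.out.println(cities.pool(first) == cities.pool(second));
        System.out.println("distinct values kept: " + cities.size());
    }
}
```

While loading rows, each field value would be passed through `pool` before being stored, so a million rows with the same city end up sharing one String instance.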
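Both JDK pools can be observed with reference comparisons. In the sketch below (illustrative names only), boxing a value in the range -128 to 127 is guaranteed to return a cached Integer, while values outside that range usually produce distinct objects on a default HotSpot configuration, so the second line of output is printed without assuming its value.

```java
// Sketch: observing the Integer cache and the String intern pool.
class JdkPools {

    // Autoboxing goes through Integer.valueOf, which caches -128..127.
    static boolean sameBoxed(int value) {
        Integer a = value;
        Integer b = value;
        return a == b;   // reference comparison on purpose
    }

    // A fresh String is a distinct object, but intern() maps equal
    // strings to one canonical instance in the JVM's string pool.
    static boolean sameAfterIntern(String text) {
        String copy = new String(text);
        return copy != text && copy.intern() == text.intern();
    }

    public static void main(String[] args) {
        System.out.println("127 cached: " + sameBoxed(127));
        System.out.println("128 cached: " + sameBoxed(128)); // depends on JVM settings
        System.out.println("interned:   " + sameAfterIntern("Oslo"));
    }
}
```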
As it turns out, this was actually a mystery for us. At first, we didn’t know why DataFrame-EC was doing so well. We didn’t do anything to it. We just loaded data into it. We thought we had a bug or something. Why is it half the memory? It turns out that DataFrame-EC uses the Eclipse Collections UnifiedSet underneath to pool the data for each column. It’s very smart. It makes sense, because it’s a column-based datastore, so for each column it can say, if I have a string, let me have a pool of strings, and I can unique the values while I’m loading. Or if I have a date, let me have a pool of dates, and even if I have a tremendous number of years, the number of distinct dates is probably going to be much less than hundreds of thousands or millions.
Mehmandarov: Think about it, like 365 possible dates a year, and if you have even dates for 30 years, 50 years, it’s still much less than 1 million elements.
Raab: After we started understanding our data and looking at it and seeing where the costs were, we said, what can we do here with pooling? We saw like, we have a lot of duplicate cities. We actually have a set of 6 of them that we load. Immediately, we can get rid of 1 million strings, and just through pooling, wind up with 6. There’s a tremendous savings. We have start date and end date. We have a million of each of those. There’s only 364 that we wind up loading for 2023, so we go from a million to 364. Then, as it turns out, session types, which Rustam talked about before with the CSV data, we’ve got talks, workshops, lightning talks. It’s a combination of either anywhere from one, two, or three, and then the combination of those, and you can see in total, you wind up with seven instances. What’s interesting is when you use the Eclipse Collections ImmutableSet, you wind up with named classes for each of the sizes. It actually tells you more about the distribution of your data. I refer to this as a tracer bullet, it’s like I can see what’s out there as I’m shooting my sets into the heap, what they actually look like and where they land, and see, I’ve got 375,000-plus singleton sets, 541,000 doubleton, and 82,000 tripleton, and those then reduce down to those.
Row-Based vs. Column-Based Structures
Now we can talk a bit about rows versus columns, and something to think about in this space. Row-based structures, really, the benefit you get is like you can get custom tuning ability. You can actually really try and compress down the data in your row. You’ve got limits with that. You do have also the ability to do custom pooling, which is what we did after the fact. We’re going to show a little bit of what we can do in terms of achieving more with rows to squeeze even more memory. Some of the challenges that you get is like you get this object header cost. For every row, I got like an object header. We’re going to talk a little bit about what Java is going to do, eventually, to help us in this space. You also have object alignment costs to think about. The way objects get this 8-byte alignment, so whatever you can fit into 8 bytes. If you don’t fit 8 bytes there, you fit 4, it’s still going to cost you 8. You got to consider that. There’s a great article from Aleksey Shipilev on object alignment and how that works. With column-based structures, you only get an object header cost per column, so a lot less columns, let’s say 10 columns versus a million rows. You get great compression and performance, especially if you’ve got primitive types, things can maybe just be loaded directly in from a second level cache into the processor and get good cache locality. Then you are limited, though, in terms of tuning to the available types. DataFrame-EC only has a long value type for integral values, it doesn’t have Short, Int, or Byte. That’s a place where it could actually give you more savings.
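The header-cost argument can be made concrete with a two-column toy version of the data. This is a sketch under assumptions: a 12-byte header per row object and a 16-byte header per column array (64-bit HotSpot with compressed references), and `ConferenceRow` and `ConferenceColumns` are hypothetical stand-ins, not the classes used in the talk or in DataFrame-EC. Records require Java 16 or later.

```java
import java.util.List;

// Sketch: the same data held as rows (one object per row) or as columns
// (one array per attribute), with the header overhead of each layout.
class RowVsColumnSketch {

    // Row-based: every conference is its own object with its own header.
    record ConferenceRow(String city, int tracks) {}

    // Column-based: one array per attribute, one header per array.
    static final class ConferenceColumns {
        final String[] city;
        final int[] tracks;

        ConferenceColumns(List<ConferenceRow> rows) {
            city = new String[rows.size()];
            tracks = new int[rows.size()];
            for (int i = 0; i < rows.size(); i++) {
                city[i] = rows.get(i).city();
                tracks[i] = rows.get(i).tracks();
            }
        }
    }

    static long rowHeaderBytes(long rows) {
        return rows * 12;          // assumed 12-byte object header per row
    }

    static long columnHeaderBytes(int columns) {
        return columns * 16L;      // assumed 16-byte array header per column
    }

    public static void main(String[] args) {
        System.out.println("headers, 1M rows:    " + rowHeaderBytes(1_000_000) + " bytes");
        System.out.println("headers, 10 columns: " + columnHeaderBytes(10) + " bytes");
    }
}
```

Under these assumptions a million rows carry about 12 MB of headers, while ten columns carry 160 bytes; the flip side is that a column store can only tune within the element types it offers.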
Fine-Tuning Memory in Conference Explorer
Let’s go look at fine-tuning memory in Conference Explorer and what we did. Through some, just manual, what can I do to squeeze memory here? We only did it for the one column. We could have done it for all of the first four, because this was really making changes in the Conference record, changing the types. What we did was we took what were 4 int fields initially, and we made one of the int fields, byte, because the number of tracks is always typically going to be less than 10. In a byte, I can fit 128. Then the int value is like, I don’t really need 2 billion for number of speakers, number of sessions, and cost. A nice value is Short. It’s much smaller. Hopefully, you don’t hit the max size for the cost, but you’re definitely not going to hit the max size for speakers and sessions. This gave us the ability to shrink 16 bytes down to 7. Then we did this really gnarly trick of like, let’s get rid of an object reference and that will save us 4 million extra bytes, potentially, if we can get the object alignment working out right. We combined the two dates into this pair object, it’s called the twin, because twin is the same type, and basically, we were able to reduce the reference cost. This is that funny date math, now we have the combination of from and to. Since to is always greater than from, you wind up with, at max, like 66,000. That’s still a lot less than a million. Even if I have to create 66,000 of these at 24 bytes, the million times 4 is going to be more than that. We got to see things from there. What you can see is, we effectively got the Eclipse Collections ImmutableList row-based approach to be a little more than 10 Megs less memory than the DataFrame-EC, where we didn’t have to really do anything. That’s explaining what happened here.
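The numbers behind this tuning can be checked with a few lines of arithmetic. The sketch below uses illustrative names and counts (from, to) pairs where the end date may equal the start date, which gives a slightly higher upper bound than the strict "to after from" case (66,066 pairs); both are close to the 66,000 quoted. The 24 bytes per pair object and 4 bytes per reference are the figures stated in the talk.

```java
// Sketch: arithmetic behind narrowing the int fields and pooling date pairs.
class TwinDateEstimate {

    // Upper bound on distinct (from, to) pairs with to >= from.
    static long distinctPairs(int distinctDates) {
        return (long) distinctDates * (distinctDates + 1) / 2;
    }

    // Worst-case memory for pooled pair objects.
    static long pooledPairBytes(int distinctDates, int bytesPerPair) {
        return distinctPairs(distinctDates) * bytesPerPair;
    }

    // Bytes saved by dropping one reference field from every row.
    static long referenceBytesSaved(long rows, int bytesPerReference) {
        return rows * bytesPerReference;
    }

    public static void main(String[] args) {
        int fourInts = 4 * Integer.BYTES;                   // tracks, sessions, speakers, cost as int
        int narrowed = Byte.BYTES + 3 * Short.BYTES;        // tracks as byte, the rest as short
        System.out.println("fields: " + fourInts + " -> " + narrowed + " bytes");
        System.out.println("max pairs: " + distinctPairs(364));
        System.out.println("pair objects: " + pooledPairBytes(364, 24) + " bytes");
        System.out.println("references saved: " + referenceBytesSaved(1_000_000, 4) + " bytes");
    }
}
```

Even in the worst case, about 1.6 MB of pooled pair objects is well under the 4 MB of references removed from a million rows, which is why the trick can pay off if the object alignment cooperates.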
Let’s turn the volume up to 25. We wanted to see what happens if we go from 1 million to 25 million rows. We still have that manual tuning for just the Eclipse Collections column. What we added here is the file size, so you can compare: the file we generated for the 25 million objects took 2.19 GB, and you can see in comparison how much it requires in memory. As it turns out, when we actually turned it up to 25 million and I tried to run our code, it exploded. We ran out of memory. It turns out that, because we hadn’t tuned this at all, we were using Jackson CSV with the MappingIterator readAll method. What readAll does is create basically a list of maps. We wound up creating 25 million maps. They were probably just JDK HashMaps. It blew out the memory space. What you need to do, in order to scale using Jackson CSV, is use the iterator directly. We’re creating one row at a time, and then turning that into the conference row, not creating maps and then turning maps into conferences. This was much better. The manual savings add up when you talk about 25 million rows: you can see we’re now saving over 300 MB compared to DataFrame-EC. That manual tuning is starting to pay off.
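The same principle, converting each row into the target type as it is read instead of first materializing a list of maps, can be sketched with the JDK alone. This deliberately swaps in a plain `BufferedReader` for Jackson CSV so the example stays self-contained. `StreamingCsvSketch` and `ConferenceRow` are hypothetical names, the two-column layout is made up, and the naive `split` does not handle quoted fields the way a real CSV parser would.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.List;

// Illustrative sketch only; not the Jackson-based code from the talk.
final class StreamingCsvSketch {
    /** Hypothetical compact row type with a made-up two-column layout. */
    record ConferenceRow(String name, short speakerCount) {}

    /** Converts one CSV line straight into a row, with no intermediate Map. */
    static ConferenceRow parse(String line) {
        String[] cells = line.split(",", -1); // naive: no quoted-field support
        return new ConferenceRow(cells[0].trim(), Short.parseShort(cells[1].trim()));
    }

    /**
     * Reads lines lazily, so only the current line and the rows built so far
     * are held in memory. The caller owns and closes the Reader.
     */
    static List<ConferenceRow> readRows(Reader source) {
        return new BufferedReader(source).lines()
                .skip(1)                              // skip the header line
                .filter(line -> !line.isBlank())
                .map(StreamingCsvSketch::parse)
                .toList();                            // unmodifiable result list
    }

    public static void main(String[] args) {
        String csv = "name,speakers\nJavaOne,120\nDevoxx,200\n";
        System.out.println(readRows(new StringReader(csv)));
    }
}
```

With Jackson CSV, the equivalent move is to loop over the iterator and build each conference row inside the loop, rather than calling readAll and converting afterward.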
What Will the Future Bring?
We could talk a little bit about what’s going to happen with Java.
Mehmandarov: We talked quite a bit about how we can fine-tune, and we can do it by knowing our data or using the right data structure for it. There are also things happening in the Java world that will bring the memory footprint down a bit. There are a few projects working in different directions and doing different things. In the end, at least two of them for sure, and the third one to some degree, will influence the memory footprint. Project Lilliput works on techniques to downsize Java object headers in the HotSpot JVM from 128 bits to 64 bits or less. Project Valhalla will have this thing called value objects. If you want to read more about that, there are links to descriptions of both projects and also, in this case, to a blog post from Brian Goetz where he explains how this works. The interesting quote to take from there is: “Like primitives, the value set of an inline class is the set of instances of that class, not object references.” There is also Project Amber, which is not exactly memory related, but will still influence it, and will also help you think about data-oriented programming and the approach to it in Java. You can read about that too, including another article by Brian, which gives a really interesting insight into data-oriented programming in Java.
What can we say? Data-oriented programming in Java is actually possible. You don’t need to go for another framework, language, whatever. You can do it. It’s feasible. It’s flexible. It can be fun with all this fine-tuning and the small things that you can do, and then seeing your memory footprint go down. You can do all these fun things in your favorite language. Understanding and measuring your data is the most important key. No matter what framework, language, or whatever you go for, it will be a very important part of it. You should use tools to measure that. You should consider using pooling to get a lot of memory benefits, especially if you have repeating values and values that are not absolutely 100% unique. Object compression is also possible, using smaller primitive types with fixed number ranges. We also need to think about a column-based approach versus a domain-oriented or object-oriented approach to the structure. When you go column-based, you should try to stick to primitives. Primitives will generate a lot of memory savings. You should think about providing support for smaller integral types, which can add to the memory savings. When it comes to the domain-oriented approach, we should think about how it can be tuned manually, because here, you need to know your data. You need to know how it’s put together. You need to know where you can cut things. One of the most important things, and probably one of the lowest-hanging fruits, is to convert things into immutable data structures. After you’re done playing around with your data, put it into immutable data structures and just leave it there, because that will give you a better memory footprint.