 
MMS • Bibek Bhattarai

Transcript
Bhattarai: My name is Bibek. I work at Intel as an AI Technical Lead. Our group mostly deals with customers directly, so we have firsthand knowledge of what customers are deploying, where they are deploying it, what platforms they use, and what metrics they really care about when it comes to deploying a model. In today’s talk, I’m going to focus on an addition to the CPU architecture called AMX, Advanced Matrix Extensions. It was introduced with 4th Gen Intel Xeon processors, which you can find in the data center.
Basically, I’m going to go into a little bit of the dirty detail of what exactly is in AMX, what makes it really good for matrix multiplication, why you should take care to learn how to enable it, and how to take advantage of it. I’ve broken the whole presentation into four parts. Why should I care? Then, what exactly does it solve? Then, how does it solve it? I will go into a little bit of detail there. Then, last but not least, how do I leverage it? How do I make sure I am using it, and that I am taking full advantage of it?
What Deep Learning Workloads Usually Run in CPUs?
From our interaction and collaboration with customers, there are still a lot of deep learning workloads that run on CPUs. This may be because of availability, which is obviously a big issue: GPUs are hard to come by. You might want to use your GPU for something more important, like training large language models. Or some of the models are inherently bound by memory bandwidth. For whatever reason, there are a lot of workloads that still make sense to deploy on CPU rather than on GPU. Not to be a downer, because everybody’s talking about LLMs, but you can really squeeze a lot of performance out of a CPU. This is a number published by Intel on the day Llama 3.1 was released.
Then, this is what you can get. You can get pretty much one token every 20 milliseconds if you compress your model and then deploy it on a 5th-generation Intel CPU server. What I’m going to say here is, for whatever reason, you may be deploying your workloads on CPU, or you may be doing feasibility analysis to see if you can deploy your workload on CPU. If you’re one of those folks, I want to help you understand what is the maximum I can squeeze out of my CPU so that you can make an educated, informed decision on the selection.
GEMMs (General Matrix Multiplication) Dominate the Runtime
What’s the most time-consuming part in all these workloads? There’s a huge compute-bound part, and there’s also a huge memory-bound part in LLM. A lot of the traditional machine learning tasks, traditional deep learning models, they are heavily bound in computation. Whatever the model architecture is, in the heart of the model is what we call GEMM operation. It stands for General Matrix Multiplication. It’s a matrix multiplication. If you are using encoder-only or decoder-only transformer models, around 70% of the runtime is being spent on the GEMM. Same goes for CNN. You can convert your 1D, 2D, 3D CNN into a matrix multiplication operation, and that takes significant amount of time.
For RNN models, you can convert your LSTM gates and GRU gates into linear operations, which translate to GEMM operations. You get the idea. Basically, whatever model you are running on the surface, eventually what you are doing on the hardware is spending a lot of cycles doing matrix multiplication. Obviously, it makes sense to go after this huge component if we want to accelerate our workload on CPU.
Let’s do a little exercise on matrix multiplication. I know it seems like a high school math class. I’m sure pretty much every one of you have written matrix multiplication code at some point in your life. If nowhere, probably during preparing for the interview. I did. This is what typical matrix multiplication looks like. You have matrix A, that size, let’s say M by K, M rows, K column. You have another matrix B, that is K by N, K rows, N column, and then you multiply them together. You get a matrix C, M rows, and then N column.
Typically, the first source code you probably wrote for matrix multiplication probably looked like this. You iterate through M, you iterate through N, that basically iterates over every cell on matrix C. For every cell in matrix C, you need to get one row from matrix A, one column from matrix B, and then do the element-wise multiplication, accumulate all of them together into a single value, and then you move to the next cell on matrix C. This is not an efficient algorithm, obviously, because it’s the most naive matmul.
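The slide code is not part of the transcript, so here is a minimal sketch of the naive loop order described above, written in plain Python with matrices as lists of rows. The function name `matmul_naive` is an illustrative choice, not something from the talk.

```python
def matmul_naive(A, B):
    """Multiply A (M x K) by B (K x N) using the naive i-j-k loop order."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):              # every row of C
        for j in range(N):          # every column of C
            acc = 0
            for k in range(K):
                # Row i of A is read left to right, but B is read
                # down column j, one element per row of B.
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C
```

The inner loop walks down a column of B, which is the access pattern the talk identifies as the problem.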
One thing I really want to point out here is that this is really terrible when it comes to cache utilization. I want to reiterate that cache locality is very important. We saw the impact of caching here. I’m sure you are familiar with hardware caching. This matrix is in memory, and when you fetch this particular element, the hardware will fetch a bunch of other elements, assuming that you’re going to use them in the future. The same goes for matrix B. When you fetch this one, it will fetch a bunch of these elements and put them in the L1 cache. Eventually, when doing the matrix multiplication, you are only going to use this one element from matrix B, and then you have to move to the next row. All the elements you just fetched went to waste. You brought them into the L1 cache, they didn’t get used, so they went to waste.
One really easy way to fix this is called loop reordering. It’s surprisingly really easy, and then it’s very efficient. All we did from the last code is we swapped K and N order in the loop. When we do this, we see this consecutive memory access on all matrix A, B, and C. To compute one row in matrix C, you will use this one, and then you do next row, next row, next row, all the way down, and then you are done with one row.
Then, you will switch on A, and then repeat the process on B. There are lots of other ways to make matrix multiplication cache-friendly. For example, tiled matrix multiplication is one of the popular approaches nowadays, especially because of the GPUs and parallelization potential of these tiled matrix multiplication. Basically, when you are doing the matrix multiplication on CPU, whatever tiles you are operating on, you want to bring them as close to your processor as possible, so preferably on L1 cache. You want to bring them in L1 cache. As you can see, this simple decomposition, I decomposed my 1K-by-1K matrix into tile of 128-by-128, and then I used this two-tiered loop, and then it actually reduced the runtime from 130 milliseconds to 89 milliseconds.
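The two cache-friendly variants described above can be sketched as follows. This is an illustration under assumptions: the function names and the default tile size of 128 (taken from the 128-by-128 tiles mentioned in the talk) are illustrative, and pure Python lists will not reproduce the timings quoted, since those came from the speaker's own code.

```python
def matmul_reordered(A, B):
    """i-k-j loop order: the innermost loop walks rows of B and C contiguously."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        row_c = C[i]
        for k in range(K):
            a = A[i][k]             # one element of A, reused for a whole row of B
            row_b = B[k]
            for j in range(N):
                row_c[j] += a * row_b[j]
    return C


def matmul_tiled(A, B, tile=128):
    """Two-tiered loop: outer loops pick a block, inner loops work inside it."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tile):
        for k0 in range(0, K, tile):
            for j0 in range(0, N, tile):
                # Work on one (tile x tile) block of A, B, and C at a time,
                # so the block can stay resident in a nearby cache.
                for i in range(i0, min(i0 + tile, M)):
                    for k in range(k0, min(k0 + tile, K)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + tile, N)):
                            C[i][j] += a * B[k][j]
    return C
```

Both functions compute the same result as the naive version. Only the order in which memory is touched changes.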
Obviously, we’re not going to spend the whole time optimizing the matrix multiplication, because there are literally teams of hundreds of engineers in almost every big company who are working full-time to optimize GEMM kernels. What I want you to take from this little exercise is the fact that data locality is really important. Obviously, with modern CPU architecture, the optimization is much more complicated. The tiling is not cache agnostic. There are multiple tiers of caches. You want to maintain locality in all of them. There are architectures like AVX, which can process huge chunks of data in a single go. You want to take that into consideration when optimizing your algorithm. If you just run this matrix multiplication on NumPy, it runs in around 13 milliseconds. We are still seven-fold away from that optimal solution. NumPy, when you run in CPU, it is just oneMKL from Intel, so you get the idea.
One point I really want to emphasize here is, for better performance, you want to have a large amount of useful data as close to the CPU as possible. You want to have as large amount of data close to CPU, but those large amounts of data have to be useful. It cannot be useless. If you really want to go into detail of what matrix multiplication optimization really looks like, there are a couple of blogs I’ve linked here. I really love them. They have code, illustration, everything, so feel free to go through them.
Low-Precision Computation (Intel AMX)
We got one motivation. We want our computer architecture to be able to handle large chunks of data, and then we want them to keep it really close to the processing unit. The second point I want to emphasize, especially with the rise of deep learning and then large language models, is you must be able to compute efficiently on low-precision data. Meryem from TitanML also emphasized that point as one of the five key takeaways for deploying LLMs. She said quantized models are your friend. You want to use the low-precision numbers. Due to the sheer amount of memory you need, and then the amount of compute you need, you want to be able to store your model parameters, weights, and biases using low-precision data format.
Then, you want to have a compute unit that is able to process this data really efficiently. We got our two objectives, and that’s exactly what Intel AMX addresses. For the data part, we want to have a large chunk of data close to the CPU, so we used to have tiles in L1 cache. What’s better than having data in L1 cache? You can now have a 2D chunk of data in a register. You can load 1 kilobyte of data into a single register at a time, so you have really large chunks of data close to your processing unit. Then, the next part: you should be able to do those low-precision computations efficiently. That part is handled by what they call the tile matrix multiply unit, or TMUL. It is basically a systolic array of Fused Multiply-Add (FMA) ALUs. Each one multiplies two elements, adds the product to the existing element, and gives the result. A systolic array of those FMA units is what’s in the TMUL.
In order to operate these tiles, and TMUL, they have added the instruction sets in x86, so there are basically three groups of instructions you want to take a look at. One is managing the configuration. You need to be able to load your tile configs, and then store your tile configs efficiently. The second group on the bottom left is for data management. You should be able to clear your tile registers, and then you need one instruction to load the data from memory to tile, and then another instruction that can store the data from tile into main memory. As far as the computation goes on the TMUL, right now it supports int8 and bf16. You can see there are four combinations of int8, basically handling signed and unsigned combination. You can do unsigned to unsigned, unsigned to signed, signed and unsigned, and then signed, signed combination.
Then, last one is for bf16. Some more information on resources specification. You get in total eight of those tiles. You can use all of them to make sure you are processing as much as possible at a time. Each of these tiles have to be configured in a way that they cannot exceed 16 rows, and then they cannot have more than 64 bytes in one row. Basically, maximum size you get is 16-by-64 bytes, so that would amount to 1 kilobyte in total.
Then, depending on the precision of the data you are storing in them, if you are doing int8 matrix multiplication, you get to fit a 16-by-64 int8 chunk in one register. If you are doing bf16, you get a 16-by-32 chunk in your register. Basically, what AMX does now is take two 16-by-64 int8 matrices, multiply them, and store the result in a 16-by-16 int32 matrix. Accumulation happens in int32 so that it does not overflow. The same goes for bf16. It takes a 16-by-32 matrix A and a 16-by-32 matrix B, and the result is 16-by-16, this time in float32, because you don’t want to overflow your bf16.
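The tile sizes quoted above follow directly from the two limits of 16 rows and 64 bytes per row. A small sketch of that arithmetic, with constant and function names chosen here for illustration:

```python
MAX_TILE_ROWS = 16        # a tile cannot exceed 16 rows
MAX_TILE_ROW_BYTES = 64   # and cannot exceed 64 bytes per row
NUM_TILES = 8             # eight tile registers in total


def tile_shape(element_bytes):
    """Largest (rows, elements per row) tile for a given element size in bytes."""
    return MAX_TILE_ROWS, MAX_TILE_ROW_BYTES // element_bytes


tile_bytes = MAX_TILE_ROWS * MAX_TILE_ROW_BYTES   # bytes in one full tile
int8_shape = tile_shape(1)    # int8 input tile
bf16_shape = tile_shape(2)    # bf16 input tile
acc_shape = tile_shape(4)     # int32 or float32 accumulator tile
total_tile_bytes = NUM_TILES * tile_bytes         # all eight tiles together
```

The accumulator tile is 16-by-16 because each int32 or float32 element takes 4 of the 64 bytes in a row.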
AMX Utilization
We understood why, and what exactly is in AMX. Now I want to move on to how we can use it. We already saw a little bit of the instruction set that is available for you to manipulate AMX, but obviously you’re not going to program using those instructions directly, so you’re going to have to write in high-level languages. This next section covers that. First of all, you want to check whether your CPU has AMX or not. For AMX, you will need at least a 4th-generation Xeon processor from Intel. That would be the one codenamed Sapphire Rapids, or if you are on AWS, that would be M7i, R7i, and C7i instances. Then, you will need Linux kernel 5.16 or newer, because that’s when support for AMX was introduced in Linux.
Once you have that, if you run lscpu on your Linux server, you’ll see a bunch of information about the CPU, like the cores, the cache sizes available, and so on. Among all of this, this is the one I want you to check. In the flags field, there are a bunch of other flags. To check whether you have AMX or not, you need to confirm you have these three flags. AMX_TILE is for tile manipulation, and AMX_BF16 and AMX_INT8 are for computing on bf16 and int8. Then, for power efficiency, AMX is disabled by default for running software. If your software or your code section needs AMX, you’ll have to make a system call to ask Linux to let you use the AMX resources. This call is usually present in every bit of software that uses AMX.
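The flag check can also be scripted. The sketch below assumes the flags appear in lowercase (`amx_tile`, `amx_bf16`, `amx_int8`) on the `flags` line of `/proc/cpuinfo`, which carries the same flag list that lscpu prints. The helper names are illustrative.

```python
REQUIRED_AMX_FLAGS = {"amx_tile", "amx_bf16", "amx_int8"}


def missing_amx_flags(cpuinfo_text):
    """Return the set of required AMX flags that are absent from cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # The line looks like "flags : fpu vme ... amx_tile ..."
            _, _, value = line.partition(":")
            return REQUIRED_AMX_FLAGS - set(value.split())
    # No flags line at all (for example, not an x86 Linux machine).
    return set(REQUIRED_AMX_FLAGS)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read cpuinfo, returning an empty string if the file is unavailable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""
```

Calling `missing_amx_flags(read_cpuinfo())` returns an empty set on a machine that exposes all three flags.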
Basically, you have to make this arch_prctl system call and ask for this feature, XTILEDATA. Once you get access to that feature, you will be able to use AMX in your code. There is a link to the Linux documentation that you can check for more detail.
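As a rough sketch of the permission request, the snippet below issues the `arch_prctl` system call through `ctypes`. The numeric constants are the values used on x86-64 Linux as far as I know (syscall number 158, `ARCH_REQ_XCOMP_PERM` = 0x1023, `XFEATURE_XTILEDATA` = 18), so treat them as assumptions to verify against the kernel documentation. Real AMX code would normally make this call from C or C++, since the permission applies to the calling process.

```python
import ctypes
import platform
import sys

SYS_ARCH_PRCTL = 158            # assumed x86-64 Linux syscall number for arch_prctl
ARCH_REQ_XCOMP_PERM = 0x1023    # assumed request code: ask for an XSTATE component
XFEATURE_XTILEDATA = 18         # assumed feature number for AMX tile data


def request_amx_permission():
    """Ask the kernel for permission to use AMX tile data. True on success."""
    if not (sys.platform.startswith("linux") and platform.machine() == "x86_64"):
        return False            # the syscall number above is only valid on x86-64 Linux
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    rc = libc.syscall(
        ctypes.c_long(SYS_ARCH_PRCTL),
        ctypes.c_long(ARCH_REQ_XCOMP_PERM),
        ctypes.c_long(XFEATURE_XTILEDATA),
    )
    # A non-zero return means the kernel or CPU does not offer AMX.
    return rc == 0
```

On hardware or kernels without AMX the call fails and the function returns `False`.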
Once you have asked the Linux kernel for permission to use AMX, the next steps are really straightforward. The first thing you want to do, based on your application, is configure the tiles. There are a bunch of different things you need to configure. You can see there is a structure on the left. The first field is palette_id. It has to be 1 when you configure the tiles. Then there is start_row. start_row is basically for fault tolerance. If your matrix multiplication or any of these AMX instructions fails, it has to know where to go back to fetch the data. That’s basically what start_row does.
Then, there are some reserved bits. Those are reserved for future architecture extensions, so they are not relevant right now. The two fields that you really need to configure are these: column bytes and rows. These are basically your tile sizes. You want to decide how big a tile you want for each of the registers. If you undersubscribe, let’s say you only use 8 rows and 32 bytes, then the remaining 8 rows and 32 bytes will be filled with 0s when doing the computation so that you can still proceed with the operation. Here’s an example initialization I did for a toy code. Like I said, I set palette_id to 1 and start_row to 0.
Then, we want to multiply two int8 matrices and accumulate the result into int32, so we have matrix A, matrix B, and matrix C. Matrix C will be in tmm0. Matrices A and B will be in tmm1 and tmm2. We want 16 rows and 64 bytes per row on each of those tiles. What we have is a 16-by-64 matrix A and a 16-by-64 matrix B, both int8. When we multiply those together, we get a 16-by-16 int32 matrix. B is basically B transpose. You can think of 64 as the common dimension when multiplying the matrices. The multiplication actually happens as dot products, which is why this part seems a little confusing.
Once you configure your tiles, the next step is to load the data from memory into the tiles. You can use the tile load data intrinsics for that. We load C into register 0, A into 1, and B into 2, and we get these three nice tiles. The next step is the computation itself. You can just use this instruction here, dot-product byte, with signed int and signed int as the inputs, and the accumulation happens in DWORD, that is, int32. I will go into a little more detail on this in a while, but basically you are multiplying two int8 chunks and getting your result in an int32 tile. Under the hood, it works like this. Like I said, there is a two-dimensional systolic array of FMAs, and it propagates the data from tile A, tile B, and tile C through it, basically adding to what’s already in tile C while multiplying A and B.
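To make the configuration concrete, here is a sketch that packs the example above (three tiles of 16 rows by 64 bytes) into a 64-byte buffer. The byte layout used here, one byte each for palette_id and start_row, 14 reserved bytes, sixteen 2-byte column-byte entries, then sixteen 1-byte row entries, is my understanding of the layout in Intel's documentation and should be treated as an assumption. `make_tile_config` is a hypothetical helper, not an Intel API.

```python
import struct


def make_tile_config(tiles, palette_id=1, start_row=0):
    """Pack (rows, column_bytes) pairs for tmm0, tmm1, ... into a 64-byte config."""
    if len(tiles) > 8:
        raise ValueError("only 8 tile registers are available")
    colsb = [0] * 16    # 16 slots in the layout, of which 8 are used
    rows = [0] * 16
    for index, (n_rows, n_colsb) in enumerate(tiles):
        if not (0 < n_rows <= 16 and 0 < n_colsb <= 64):
            raise ValueError("a tile is at most 16 rows by 64 bytes")
        rows[index] = n_rows
        colsb[index] = n_colsb
    # "<" = little-endian without padding, "14x" = the 14 reserved bytes
    return struct.pack("<BB14x16H16B", palette_id, start_row, *colsb, *rows)


# tmm0 = C (16 x 16 int32 = 64 bytes per row), tmm1 = A, tmm2 = B (16 x 64 int8)
config = make_tile_config([(16, 64), (16, 64), (16, 64)])
```

In real code a buffer like this is what the load-tile-configuration instruction would read. Tiles left at zero rows and zero column bytes are simply unused.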
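The "B is basically B transpose" remark is easier to see in an emulation. The sketch below mimics, in plain Python, what I understand the signed-by-signed int8 dot-product instruction to compute: each int32 element of C accumulates groups of four int8 products, with B stored so that four consecutive K values sit next to each other in a row. The function names are illustrative, and Python integers do not wrap around the way hardware int32 does.

```python
def pack_b(B_plain):
    """Repack a plain K x N matrix into K/4 rows of 4*N values (K divisible by 4)."""
    K, N = len(B_plain), len(B_plain[0])
    return [
        [B_plain[4 * k + i][n] for n in range(N) for i in range(4)]
        for k in range(K // 4)
    ]


def tdpbssd_emulated(C, A, B_packed):
    """C (M x N) += A (M x K) times B, where B_packed holds B in groups of four."""
    n_rows, n_cols, k_groups = len(C), len(C[0]), len(B_packed)
    for m in range(n_rows):
        for k in range(k_groups):
            for n in range(n_cols):
                # One 4-element dot product per (m, k, n), added to the accumulator.
                for i in range(4):
                    C[m][n] += A[m][4 * k + i] * B_packed[k][4 * n + i]
    return C
```

With full tiles, A is 16-by-64, the packed B is 16 rows of 64 bytes (a logical 64-by-16 matrix), and C is 16-by-16, which matches the shapes in the talk.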
Once you are done with your computation, you will store the result from the tile 0 to C, basically back to memory. We can do one tile to one tile and then get one tile. How does this translate to doing really large matrix multiplication? We basically divide them into a bunch of tiles. If we have matrix A and B, that means that if only one-fourth of A, B, and C fits in one tile, we will do this. We will divide A into four tiles, B into four tiles, and then we will follow a schematic where we load some tiles from matrix A, some tiles from matrix B, and then do the computation and then update the result in C. In reality, we are only using three tiles here. We still have five more tiles, so we can load two more from B and then two more from C and then do a huge computation at the same time.
This is basically the continuation of this one. In our example earlier, we loaded C into tmm0, A into tmm1, and B into tmm2, and this is basically showing all the data flow that’s happening in your processing elements. Before I move forward, I want to emphasize that the hardware I just described is present on every single core of the CPU. Every single core has this added piece of silicon, with the tile registers that do the data management and the TMUL unit that does the computation. These are the supported combinations for now. As you can see, there is bf16 at the very top. You can multiply two bf16 values and accumulate them into float32, and you can use all combinations of signed and unsigned int8, accumulating in int32 to avoid overflow. How does this map onto our low-precision compute requirements?
Obviously, the default training and inference precision, most of the time, is float32. For float32, we don’t use AMX, we stick with AVX-512. For floating-point 16, there is AMX-FP16 and AMX-COMPLEX that’s coming on new 6th-generation of Xeon. It’s not supported right now. For all the other bf16 and then signed and unsigned integers, you use AMX-BF16 or AMX-INT8. The good thing about this AMX or the TMUL architecture is that it’s extensible. Whenever there is a demand for new precision, for example, FP8 is picking up pace really well, int4 has been really popular in the literature. If you have this low-precision format that becomes really popular, then Intel will add native hardware computation support for those in the next iteration of CPU, and then you get to enjoy them.
How Do I Take Advantage of AMX?
You can do matrix multiplication in this really tedious way. Why would I do this? That’s the question most people ask. This part is about how exactly you leverage AMX in your deep learning workload. For this, you don’t really need to do the matrix multiplication the way I just did. If you are someone who is developing a custom operator or custom primitives, like we talked about earlier with the TPPs, and your new primitives use a different form of GEMM, then you might have to implement this on your own, or you can just put in a request to Intel’s oneDNN team, and they’ll do the implementation for you. Another case is if you are someone who is working on a framework that is not one of the popular ones. The popular ones, like TensorFlow, PyTorch, and ONNX, are widely supported by Intel engineers, who push their optimizations to those frameworks regularly.
If you are working on one of those frameworks that does not have Intel support, then all you need to do to enable these features on Intel machine is port either oneDNN, oneMKL, and then whatever other libraries that are relevant to your framework. Even for this, if your framework is picking up popularity, then Intel will be more than happy to jump in and then do the integration for you. I don’t see anybody exactly doing this outside of Intel, or maybe AMD, or some other folks who want to enable and maximize the ability of their hardware. For the folks who are either data scientists and ML engineers who rely on framework, all you have to do is install the new enough framework. That’s all you need to do. You have TensorFlow 2.10. After TensorFlow 2.10, it uses oneDNN by default on CPU. If you have a workload running on CPU, make sure you are using newer than 2.10 TensorFlow. Same goes for PyTorch. If you have a workload that is running on PyTorch, you want to make sure it is new enough.
Once you have new enough PyTorch, this is all you need to do. If you want to use AMX_BF16, you just run your workload using Auto Mixed Precision. You run your model using Auto Mixed Precision, and then AMX_BF16 will automatically kick in. There are a bunch of operators that are supported on the oneDNN for bf16 precision. Same goes for int8. If you quantize your model using PyTorch or some other framework, and then you run it using PyTorch, then you are automatically utilizing the AMX_INT8 that’s implemented in oneDNN. Because PyTorch uses oneDNN by default on CPU. Whenever you use framework, AMX gets used whenever possible. If you have your model in bf16, int8, more than likely you are already utilizing AMX_BF16 or AMX_INT8.
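A minimal sketch of the Auto Mixed Precision path in PyTorch, assuming PyTorch is installed. The toy model and tensor shapes are made up for illustration. Whether AMX kernels are actually selected depends on the CPU and on the oneDNN build underneath, which this snippet does not check.

```python
import torch

# A toy model standing in for a real workload (illustrative only).
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 16),
).eval()

x = torch.randn(32, 64)   # a batch of 32 rows keeps the GEMM reasonably large

# Auto Mixed Precision on CPU: eligible ops such as Linear run in bfloat16.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)
```

The model weights stay in float32. Autocast casts the inputs of eligible operators to bf16 on the fly, so no change to the model definition is needed.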
The good thing is this works really well with both torch.compile and TorchScript because folks have exported their model into different formats. Intel makes sure that all the torch.compile and then TorchScript exported model supports the AMX kernels. The same goes for TensorFlow. You need to make sure that MAX ISA is AMX_BF16 so that it can use the AMX_BF16. Once you do that, all you need to do is set the AMX flag true. Then you get to enjoy the acceleration without doing anything at all. Obviously, you can quantize your model and run it. The same thing goes here.
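One way to confirm which kernels are picked is through oneDNN's environment variables. The sketch below assumes that the "MAX ISA" setting mentioned above corresponds to oneDNN's `ONEDNN_MAX_CPU_ISA` variable with the value `AVX512_CORE_AMX`, and that `ONEDNN_VERBOSE=1` makes oneDNN print the implementation chosen for each primitive. The helper name and the script name in the comment are hypothetical.

```python
import os


def onednn_env(max_isa="AVX512_CORE_AMX", verbose=True, base=None):
    """Build an environment for a child process that runs a oneDNN-backed framework."""
    env = dict(os.environ if base is None else base)
    env["ONEDNN_MAX_CPU_ISA"] = max_isa   # upper bound on the ISA oneDNN may use
    if verbose:
        env["ONEDNN_VERBOSE"] = "1"       # log the kernel chosen for each primitive
    return env


# Hypothetical usage:
#   import subprocess, sys
#   subprocess.run([sys.executable, "run_inference.py"], env=onednn_env())
# and then look for "amx" in the implementation names of the verbose log.
```

The variables are set for a child process because oneDNN reads them when the library initializes, so setting them after the framework has been imported may have no effect.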
For Optimization Freaks
This is for people who are willing to go the extra mile to get the best performance, which is where I would say I reside. I want to make sure I can get the best performance I can on the model without losing accuracy. Obviously, I talked about TensorFlow and PyTorch. Intel contributes periodically to these frameworks. A lot of the optimizations that are not yet upstreamed into PyTorch and TensorFlow, the more aggressive ones and some that do not meet Meta’s or Google’s requirements, reside in these repositories: Intel Extension for PyTorch and Intel Extension for TensorFlow. As you heard earlier, IPEX was mentioned in AMD’s evaluation as well. It optimizes a lot of PyTorch workloads. Depending on the model, you can get a really huge boost. When it comes to compressing your model, sometimes just converting it with PyTorch’s default dynamic quantization does not meet your accuracy requirement, because there can be a huge accuracy loss when you quantize to a low-precision format.
Intel has tools that give you a way to quantize your model, compress your model using sparsity pruning, do distillation, basically make it as small as possible so that you can run efficiently on CPU. Then, these frameworks have a safeguard that allows you to tune for accuracy. You can put in, “I am willing to lose no more than this much accuracy”. Then, they will iterate on your model and then find the config that meets that accuracy requirement and then give you the compressed version of your model.
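The accuracy safeguard described above amounts to a search over candidate configurations. The sketch below shows the idea in generic Python. It is not the API of Intel Neural Compressor or any other tool, and `pick_config`, the candidate names, and the `evaluate` callback are all hypothetical.

```python
def pick_config(candidates, evaluate, baseline_accuracy, max_loss):
    """Return the first candidate whose accuracy drop stays within max_loss.

    candidates: configurations ordered from most to least aggressive.
    evaluate:   callable that compresses the model with a configuration
                and returns the resulting accuracy.
    """
    for config in candidates:
        accuracy = evaluate(config)
        if baseline_accuracy - accuracy <= max_loss:
            return config
    return None   # nothing met the requirement, so keep the original model


# Hypothetical usage with made-up accuracy numbers:
fake_accuracy = {"int4": 0.85, "int8": 0.895, "bf16": 0.90}
chosen = pick_config(["int4", "int8", "bf16"], fake_accuracy.get, 0.90, 0.01)
```

Ordering the candidates from most to least aggressive means the first one that passes is also the smallest model that meets the stated tolerance.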
Then, OpenVINO Toolkit is a little bit different. It has the compression capability, just like Intel Neural Compressor. It’s also an end-to-end system, so you can actually train your model and also deploy your model because it has Model Server as well. If you want really best performance on CPU, I would pay some attention to this toolkit. I know the transformer has been really the focus of a lot of the conversations on this conference, as well as the whole deep learning industry.
Intel Extension for Transformers is a tool that does more or less what Intel Neural Compressor does, but more aggressively for transformer model. You can get lots of improvement on your large language models, embedding models by using the compression techniques that are in this tool. They have a really large amount of examples on how to convert model to bf16, int8, int4, int2, whatever precision you need. I would check this out for some examples.
How Does This Translate to Different Models?
What does this mean? We saw the matrix multiplication. What does this do to actual deep learning model runtimes? These are some of the performance metrics I took over the year from our customer works. These are the models that have been deployed in CPU. Then they were deployed in 3rd-generation Xeon, then they moved to 4th-generation Xeon. We do this analysis all the time to give them the best config they can get to make sure they optimize the performance per dollar.
As you can see, the red vertical line, is the float32 baseline. Just by using AMX_BF16 or AMX_INT8 for accelerating GEMM operations, you can get a really good improvement. For DistilBERT or the BERT base, you get more than 5x improvement just going bf16 with no loss in accuracy. For all these numbers, there is less than 1% loss in accuracy for int8. For bf16, no loss in accuracy. For some of the CNN model or CRNN model, you get more than 10x improvement just by using the new enough framework and then doing the model quantization.
Summary
What is it? It’s an instruction set architecture extension that was introduced on 4th Gen Xeon. It will continue to evolve when we produce more iterations of Xeon. It’s basically a dedicated piece of silicon on each core that has data management unit and then compute unit. What data types are supported? For now, AMX-TILE is for your data management. Then, AMX_BF16 and AMX_INT8 are for doing computation on bf16 and int8 data. On the future iteration of Xeon, AMX-COMPLEX and AMX-FP16 are coming. Then, obviously, how do I use it? Most of the time, just use the framework that already uses oneDNN, OpenBLAS, oneMKL. If you want to go the extra mile and optimize your workload, look into OpenVINO, Neural Compressor, and other tools I just mentioned earlier.
Questions and Answers
Yue: I’m always curious about how the sausage is made. The packaging makes it really easy to use these features, if you’re on one of the libraries or frameworks. Most people probably won’t write a CUDA kernel, but it’s fascinating to learn how you can maximize the hardware performance by doing the right thing in software. What kind of effort went into enabling these hardware features in those libraries that people can easily pick up? What is the process like?
Bhattarai: When it comes to enabling these features in the frameworks, it’s mostly down to oneDNN and oneMKL to write the efficient kernels. They want to make sure all these operations are really optimal. There are a couple of different levels of effort happening. One is obviously, like I said earlier, you want to optimize the use of the AMX resources you have. You want to fill all the tiles you have available and keep your TMUL unit busy. By doing this, you can get up to 8x more operations per cycle compared to the last generation of processors. That’s one part. The second part the oneDNN library really focuses on is breaking the operation into these tiles really efficiently. When you have a linear operation versus a 1D, 2D, or 3D convolution, the GEMM looks really different.
Some of the time, a GEMM operation will not have all 16 rows because it is really small in one of the dimensions. Accounting for those things and putting in safeguards for some of those low-utilization use cases is one of the things oneDNN does. That also translates into how you optimally use AMX. If you have a small enough model, it’s suggested to use a really large batch, because that way you get to fill the 16 rows in the AMX tile. For some models, if you just use batch size one, your GEMM operation becomes a GEMV operation, a matrix-vector multiplication. You will not fill the tile, and that way you will waste a lot of the resources.
Participant: When it comes to all those techniques to improve the efficiency of the computation, are there any different considerations for bare metal versus virtual machines?
Bhattarai: On bare metal, we do have control over all the physical threads as well as the Hyper-Threads. Basically, enabling Hyper-Threading might not always help, but it will not do harm. The same goes for cloud instances. When you are running on CPU in the cloud, you don’t know what the rest of the node is being used for. If there is a way for you to utilize Hyper-Threading, then you can enable it to get the performance improvement in the cloud.
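The batch-size point can be put in numbers. The sketch below assumes the batch dimension maps directly onto tile rows, which is a simplification of what a real kernel does, and the function name is illustrative.

```python
def tile_row_utilization(batch_rows, tile_rows=16):
    """Fraction of tile rows holding real data when batch_rows map onto tile rows."""
    if batch_rows <= 0:
        raise ValueError("batch_rows must be positive")
    full_tiles, remainder = divmod(batch_rows, tile_rows)
    tiles_needed = full_tiles + (1 if remainder else 0)
    return batch_rows / (tiles_needed * tile_rows)


utilization_batch_1 = tile_row_utilization(1)     # matrix-vector case
utilization_batch_16 = tile_row_utilization(16)   # exactly one full tile
utilization_batch_24 = tile_row_utilization(24)   # one full tile plus half of another
```

Under this simplification, batch size one uses one row out of sixteen, which is why larger batches are suggested for small models.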