
Article originally posted on InfoQ.

Transcript
Ye (Charlotte) Qi: I'm Charlotte, working on LLM inference at Meta. You're going to hear a rapid fire of all the challenges you'll run into when you turn an LLM into real LLM serving infrastructure. Since 2023, everyone has been in this AI gold rush, trying to find compute resources to fit the unprecedented growth from big models and longer context. This year, the demand keeps growing because test-time compute and compound LLM systems are the new hotness. Scaling LLM serving is almost like building a distributed operating system. It's so foundational to all of the innovations. We are seeing innovations happen everywhere, especially at the intersections.
At Meta's AI infra, we are building this strong foundation to empower our researchers and ML engineers. I've been solving the problems of serving models for the past six years. My current focus is cost saving and DevEx. LLM serving is probably the most interesting model serving problem that I work on, because the model is a system. The best solution really requires you to think comprehensively and get some joint optimization between model, product, and system. I was part of the zero-to-one launch that brought Meta AI online. Our team optimizes the inference backend for Meta AI and smart glasses. Apart from the public-facing traffic that everybody sees, we also have a lot of internal traffic, like RLHF, data curation, and distillation, that goes into the making of the Llama family of models. These are absolutely massive. During a busy week of RLHF, our team has to process hundreds of millions of examples.
Challenge 1 – Fitting
I shared our story at the AI Infra @Scale conference. The question that I get the most is: should I run my own LLM service? How do I optimize my own inference? I think the best way to answer this question is to build everything step by step. Let's imagine we have an awesome product idea. We are going to build a web agent that browses InfoQ's website, and every time Charlotte posts something, it will summarize it, push it to your customers, and let them ask some follow-ups. What an awesome idea. We're going to go through a journey of four stages of challenges. Let's start with the simplest one. You get the model. You get some hardware. What do you do next? Step one, you need a Model Runner. LLM inference is notoriously expensive to run. The model is trained with next-token prediction, which means your inference logic is also token by token. Look at this colorful sentence.
If an LLM has to generate it, every color switch means another model forward pass. We call the generation of the first token prefill, and all the tokens after the first, decode. This iterative process is very challenging to accelerate. Normally, the end-to-end latency is going to be a few seconds. That's why almost all LLM applications use a streaming interface. To tackle this special execution pattern, you want to find a runtime that at least supports continuous batching and KV caching. Why continuous batching? The responses from an LLM have variable length. With static or dynamic batching, the shorter ones exit earlier and the resources sit idle. What is continuous batching, though? Imagine a bus.
At every bus stop, which is the end of each decoding step, it will try to pick up new passengers if there's an empty spot. The new passengers carry a lot of luggage and are very slow to get on the bus; that's just like prefill. The bus will keep going if nobody is waiting at the bus stops. Fortunately, most of the bus stops are indeed empty, just like in the South Bay. This is how we keep the GPU well utilized. What about KV cache then? Every decoding step is conditioned on all previously generated tokens. The K and V tensors for the same token at the same position basically stay the same within a single generation request, so we cache them. If you don't do this, your attention computation is going to be cubic instead of quadratic, which is beyond sustainable. Luckily, almost all of the mainstream LLM frameworks support it, and you wouldn't go too wrong here.
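To make the bus analogy concrete, here is a toy continuous-batching loop in Python. It is only a sketch under simplified assumptions: the `Request` class and the step loop are hypothetical stand-ins for what engines like vLLM or TensorRT-LLM do internally, and the cost of prefill is ignored.

```python
import random
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    tokens_left: int                      # decode steps this request still needs
    generated: list = field(default_factory=list)

def continuous_batching(waiting: deque, max_batch: int = 4) -> int:
    """Toy scheduler: after every decode step (a 'bus stop'), admit new
    requests into empty slots and retire the ones that finished early."""
    running = []
    step = 0
    while waiting or running:
        # Pick up new passengers if there is an empty seat (prefill happens here).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step for every request currently on the bus.
        for req in running:
            req.generated.append(f"tok{step}")
            req.tokens_left -= 1
        # Shorter responses get off early; their seats free up immediately.
        running = [r for r in running if r.tokens_left > 0]
        step += 1
    return step

if __name__ == "__main__":
    reqs = deque(Request(i, random.randint(3, 12)) for i in range(10))
    print("total decode steps:", continuous_batching(reqs))
```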
Step two is to find some hardware. High-end data center GPUs today typically come in an 8-GPU setup. This is an example from Meta. You will see some network connectivity with different speeds, but our job is fitting. Let's only pay attention to HBM size. You will see 40, 80, 96, 192 gigabytes. These are the numbers from the most popular data center GPUs like A100, H100, MI300. Let's take some H100s and try to fit some models in. The 8B models fit in one GPU just fine. You get a runner and just run it. A 70B model cannot fit in one GPU. We use tensor parallelism to partition the weights across multiple GPUs on the same host. You need at least two GPUs to not OOM.
Typically, we do use four to eight GPUs to allow a bigger batch size and improve the throughput, because the KV cache does take quite an amount of memory. The 405B model can't fit in 8 H100s. The weights alone are going to take over 800 gigs in bf16, so you will need two nodes. We recommend using pipeline parallelism to further partition the weights across the nodes, because the communication from multi-node tensor parallelism is just going to introduce too much overhead. You can also choose MI300, which offers you 192 gigs of HBM, and serve it on a single host without going through all the trouble. The takeaway is, don't just grab your training or eval code. Find a runtime specialized in LLM inference serving. Try to have some understanding of your AI hardware and match it to your model. You can use tensor or pipeline parallelism to fit your models.
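A back-of-the-envelope version of this fitting math, as a sketch under simple assumptions (bf16 weights at 2 bytes per parameter, 80 GB of HBM per H100, ignoring KV cache and activation memory):

```python
import math

def weight_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Weight footprint in GB, assuming bf16 (2 bytes per parameter)."""
    return params_b * 1e9 * bytes_per_param / 1e9

def min_gpus(params_b: float, hbm_gb: float = 80.0) -> int:
    """Minimum GPU count just to hold the weights on 80 GB H100s.
    Real deployments need extra headroom for KV cache and activations,
    which is why 4-8 GPUs are often used even when 2 would 'fit'."""
    return math.ceil(weight_gb(params_b) / hbm_gb)

for size in (8, 70, 405):
    print(f"{size:>3}B model: ~{weight_gb(size):4.0f} GB of weights, "
          f">= {min_gpus(size)} x H100 just for weights")
# 8B fits on one GPU, 70B needs at least two, 405B needs more than one 8-GPU node.
```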
Challenge 2 – It’s Too Slow
We solved the fitting problem, and the product is hooked up. The next challenge is, it's too slow. Most infra problems do get better by throwing capacity at them. You can also throw more GPUs, like the 3 million GPUs on the screen, or you wait for faster next-generation GPUs, if you can actually wait. Throwing capacity blindly only takes you this far, not that far. You're staring at the runtime that you found, and thinking, out of these 100 arguments, is there anything that I can tune to make it faster? Certainly there is, but let's understand where the limit is. Remember we mentioned that LLM inference consists of prefill and decode. Prefill generates the first token. It reads the huge model weights, and it does tons of computation across all of these tokens within your prompt to find all the relations between each pair of tokens, and it then outputs one token. Decode generates the subsequent tokens.
Again, it reads the huge model weights, and it reads all the KV cache, but it only attends one next token to all previously generated tokens. The compute density is a lot lower. This makes prefill GPU-compute heavy and decode memory-bandwidth heavy, and the KV cache requires quite an amount of memory capacity. To sum up, making an LLM faster is really about fitting this LLM math within these three types of system resources. They scale with the model sizes, sequence lengths, and batch sizes, but, a little annoyingly, they scale with different multipliers. However, the ratio of the system resources on your hardware is fixed once it's manufactured. This is why we have to try harder to bridge the gap.
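A quick way to see why this split happens is to compare arithmetic intensity, i.e. FLOPs per byte of weights moved. The sketch below uses the common ~2·params FLOPs-per-token approximation and ignores attention math and KV cache traffic, so the numbers are order-of-magnitude only:

```python
def flops_per_byte(params_b: float, tokens_per_pass: int,
                   bytes_per_param: int = 2) -> float:
    """Arithmetic intensity of one forward pass, counting only weight traffic:
    ~2 FLOPs per parameter per token, weights read once per pass."""
    flops = 2.0 * params_b * 1e9 * tokens_per_pass
    bytes_moved = params_b * 1e9 * bytes_per_param
    return flops / bytes_moved

# Prefill: thousands of prompt tokens share a single read of the weights.
print("prefill, 4096-token prompt:", flops_per_byte(70, 4096), "FLOPs/byte")
# Decode: every step re-reads the weights to emit one new token
# (batching many requests together is what raises this number in practice).
print("decode, 1 token per step:  ", flops_per_byte(70, 1), "FLOPs/byte")
```

Modern accelerators sustain on the order of a few hundred FLOPs per byte of HBM bandwidth, so prefill sits comfortably above that knee (compute-bound) while single-token decode sits far below it (bandwidth-bound).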
Where do we start then? When you say things are slow, let's be a little bit more precise. If you're using a streaming interface, you might care about the first-token latency, or TTFT, to reduce the awkward silence when your customer is waiting for something to be generated. Or you care about the generation speed, or output speed, TTIT or TTOT, they're all the same thing. This is to let your user see that your app is churning out the response super fast, or maybe it's not user-facing at all and it's a bunch of bots talking to each other. In non-streaming use cases, you might actually care about the end-to-end latency, because your client has some timeouts. These three latencies can be optimized in slightly different ways. You need to set up reasonable expectations around what's achievable for your model sizes, input-output lengths, and your own cost budget.
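If you are not sure which of the three you are actually measuring, a small timing wrapper makes it explicit. The `fake_token_stream` generator below is a hypothetical stand-in for your streaming client; only the timing logic matters:

```python
import time

def fake_token_stream(n_tokens=20, prefill_s=0.5, per_token_s=0.05):
    """Stand-in for a streaming LLM client: slow first token, steady afterwards."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(per_token_s)

def measure(stream):
    start = time.monotonic()
    ttft, count = None, 0
    for _ in stream:
        if ttft is None:
            ttft = time.monotonic() - start      # time to first token
        count += 1
    e2e = time.monotonic() - start               # end-to-end latency
    per_token = (e2e - ttft) / max(count - 1, 1) # average time per output token
    return ttft, per_token, e2e

ttft, per_token, e2e = measure(fake_token_stream())
print(f"TTFT={ttft:.2f}s  per-token={per_token:.3f}s  E2E={e2e:.2f}s")
```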
For example, you cannot expect a 405B model to run faster than a 70B model, unless you make significant modifications to your hardware. With this context, we can get back to our step one, throwing capacity a little more wisely. I actually ignored the fact that even with continuous batching, you will see some stalls right there. This is because when we're running prefill in the continuous batch, all of the decoding steps in the same batch get slowed down. This might be ok, because practically, the per-instance requests per second for an LLM is quite low due to the resource demand, so most of the steps are actually not interrupted. But imagine you suddenly get a request with 32K input: all of the decoding will get stuck for a few seconds, and your customers will actually see it. They will feel it. We can fix this problem with disaggregation. We replicate the already partitioned weights, and we run prefill and decode in different services.
This way, we can scale their system resources separately and get rid of this annoying 10x P99 latency for decode, because that P99 is basically the average of your prefill latency. It helps us maintain the same latency SLOs with far fewer machines. If you only rely on 8-way tensor parallelism, even with disagg in place, processing an input of 128K will take at least a minute. If your app does need that type of responsiveness, even for such long inputs, you do have the expensive option of context parallelism, which further partitions the workload along the context dimension. It lets you throw more targeted compute at your prefill. Replication and partitioning are our best friends. They help us reshape the system performance profile into something that fits the hardware, which comes in different shapes. Whenever something is too big to functionally fit, or creates a substantial system bottleneck, we just cut it or copy it, and we get over it.
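Structurally, disaggregation means a request takes two hops: a prefill pool produces the first token plus the KV cache, and a decode pool consumes that cache for the rest. The sketch below is an assumption about the general shape of such a system, not a description of Meta's implementation:

```python
from dataclasses import dataclass

@dataclass
class PrefillResult:
    first_token: str
    kv_cache_ref: str    # in reality, hundreds of MB of tensors to transfer or reference

def prefill_service(prompt: str) -> PrefillResult:
    """Compute-heavy hop: attends every prompt token to every other prompt token."""
    return PrefillResult("Sure,", kv_cache_ref=f"kv://{abs(hash(prompt)) % 0xffff:04x}")

def decode_service(pre: PrefillResult, max_new_tokens: int) -> list:
    """Bandwidth-heavy hop: one token per step, reusing the transferred KV cache."""
    tokens = [pre.first_token]
    for i in range(max_new_tokens - 1):
        tokens.append(f"tok{i}")
    return tokens

def serve(prompt: str) -> str:
    pre = prefill_service(prompt)             # this pool is scaled for GPU compute
    return " ".join(decode_service(pre, 8))   # this pool is scaled for memory bandwidth

print(serve("Summarize the latest InfoQ post."))
```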
Enough about distributed inference. Let's move on to step two: you should consider making your problem smaller. As long as your product quality can tolerate it, try to make some modifications to lower the burden on your hardware. Number one, just use a more concise prompt. Number two, get a smaller model with specialized fine-tuning, distillation, or pruning. Number three, apply quantization to unlock twice or even more of the compute available on your hardware. Pay attention here, quantization is not a single technique.
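One concrete instance among many: a minimal symmetric per-channel int8 weight quantization in plain PyTorch. It only shows the basic recipe (pick a scale, round, dequantize at use time); production fp8/int8 paths add calibration and fused kernels that are omitted here.

```python
import torch

def quantize_per_channel_int8(w: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a weight matrix."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per output row
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)                 # a pretend weight matrix
q, scale = quantize_per_channel_int8(w)
err = (dequantize(q, scale) - w).abs().max()
print(f"int8 weights: {q.numel()} bytes vs {w.numel() * 4} bytes in fp32, "
      f"max abs error {err:.4f}")
```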
At post-training time, you can mix and match different components, different dtypes, or different policies to use for quantization. The tremendous openness in the LLM community will actually allow you to find implementations for all of these very easily. You can play around and see which option gives you the best ROI. Step three, we can also get better at memory management using caching or virtual memory. Remember we discussed that the KV cache takes quite an amount of memory. This problem can basically be tackled in the same way as in traditional system performance work.
First, we try to identify what can actually get cached. For requests using roleplay or integrity chat, you typically have to throw in a very long system prompt to keep the model in context. For a multi-turn chatbot application, each request includes all of the chat history from previous turns. That's a lot of re-computation. On your GPU server, there's a decent in-memory storage system right there. You get half a terabyte of HBM, a few terabytes of DRAM, and a few dozen terabytes of flash. We can build a hierarchical caching system that actually matches the KV cache usage pattern. For example, the common system prompt might get cached in HBM. The active user's chat history, which you need to load every minute or so, can sit in DRAM. Chat history from less engaged users can be offloaded to flash.
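A minimal sketch of that tiered lookup, with plain dictionaries standing in for HBM, DRAM, and flash; the real system also needs eviction, prefix matching, and the privacy controls mentioned next, which are all elided here:

```python
class TieredKVCache:
    """Toy hierarchy: check HBM first, then DRAM, then flash.
    On a hit in a lower tier, promote the entry back up."""
    def __init__(self):
        self.tiers = {"hbm": {}, "dram": {}, "flash": {}}   # fastest to slowest

    def get(self, prefix_key: str):
        for name in ("hbm", "dram", "flash"):
            if prefix_key in self.tiers[name]:
                kv = self.tiers[name][prefix_key]
                self.tiers["hbm"][prefix_key] = kv          # promote on hit
                return name, kv
        return None, None

    def put(self, prefix_key: str, kv, tier: str = "hbm"):
        self.tiers[tier][prefix_key] = kv

cache = TieredKVCache()
cache.put("system_prompt_v3", kv="<kv tensors>", tier="hbm")
cache.put("user_42_history", kv="<kv tensors>", tier="dram")
print(cache.get("user_42_history")[0])   # dram hit, promoted to hbm
```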
If you can manage the system and privacy complexity for this type of problem, this is basically a no-brainer. It's lossless. It's very common for us to see over 50% reduction in both latency and capacity. Step four, if you're looking for even more levers on performance and cost saving, you can try speculative decoding, chunked prefill, dive deeper into the attention kernels, or try some sparsity at the token level. There are tons of LLM domain-specific optimizations. This table is a quick glimpse into this space. Each technique makes different, sometimes conflicting, tradeoffs around TTFT, TTIT, quality, and cost. You need to decide what's important for your product. You also want to take my words with a grain of salt, because the exact behavior really depends on your own unique workload.
Let's recap what we've learned so far. Getting the basics right will give you a 10x foundation. Distributed inference, smaller inputs and models, and caching will give you another 2x to 4x boost. You are free to explore more advanced decoding algorithms and changes to your engine to find your next 10x or 100x.
Challenge 3 – Production
People might come and ask, this sounds so complicated, did you do all of this at Meta? You will find out what we might do as you hear the next two challenges. Suppose our awesome InfoQ agent is fast enough for a product launch, what do we solve next? Our toy system now has to turn into a production service that is available to users at all times. What comes to your mind when you hear the word production? To me, it's that everything could function in a slightly unexpected way. We get a lot more moving pieces. The request distribution is changing. As you can see in this graph, you get users with different intents and levels of engagement, while some just spam you a lot more often. The input-to-output ratio has bigger variance. The effective batch size is much smaller and also changes a lot. All of these invalidate our assumptions around creating that perfect ratio of system resources to maximize our hardware ROI.
What's more, the temporal patterns are also changing a lot. We get daily peaks and daily off-peaks, and spikes, just like all online services. The spikes might just be some random people running evals. Very strict real-time online services like chatbots are not the only type of upstream for LLMs; human-in-the-loop annotation is also real-time, but it has to align with the evaluators' schedules. Batch processing for summarization and feature generation can be scheduled with some time shifting, up to minutes or hours. What's more, LLM deployments have a very complex latency and throughput tradeoff. Throughput is only meaningful when it can meet your latency SLOs. You have to do this for multiple deployments. A mature product typically involves serving a handful of LLMs. You get the main big model for your chat, and you get a bunch of ancillary models for safety, planning, rewarding, and function calling to complete the end-to-end loop. Putting everything together, you'll often find that the peak FLOPS a vendor advertises is kind of useless.
It's very common to lose 50% of effective FLOPS even at the earliest kernel benchmarking stage, especially for shorter inputs. Combining the latency bound and some operating buffer, you're going to lose 10x. This is very common. How do we get back all of these lost FLOPS? Step one, let's think harder about what can be traded off. Out of quality, latency, throughput, reliability, and cost, you probably only get to choose two or three of them. I want you to ask this question to yourself: do you really need three nines or four nines if that's going to cost you 3x? Do your customers only feel a meaningful difference when you are serving at an output speed of 100 tokens per second versus 10 tokens per second? As a reference, I only speak at a speed of five tokens per second.
After you've thought it through, you come to me and say, "I just want to push harder on the latency. Tell me more about inference optimizations". Let's look at the data and decide where to optimize. You get the 70B model running at roughly 200 milliseconds on 8 GPUs, which is pretty decent performance, but your service is only hosted in California. Your New York customers get another 75 milliseconds of network roundtrip. You don't just hardcode your host name and port, so another 75 milliseconds goes into naive host selection, health checks, load balancing, and everything else. Your LLM may be multimodal, and you have to download the images in the request, another 150 milliseconds.
Your app has to coordinate a bunch of business logic, adding safety checks before and after the main model call. It might also do a bunch of other things, like fetching information, reacting, doing search. This can easily add another 400 milliseconds. Look at this distribution. You can tell me where to spend time, where we should optimize. Hold on. You actually care about the stuckness from this long prefill. Then, let's do disagg. Earlier, I made disagg sound so simple. You just create two services, and everything is done. Your prefill operations and your decode operations are going to get nicely batched together. What I didn't mention is that the link between prefill and decode is quite thin. Between prefill and decode, you have to transfer hundreds of megabytes of KV cache. You typically have an upper network bandwidth limit when you are talking TCP/IP. It's very common to add another 50 to 100 milliseconds to your TTFT if you are doing disagg.
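To see where "hundreds of megabytes" comes from, here is the standard KV cache size formula. The layer and head counts are roughly those published for Llama 3 70B (80 layers, 8 KV heads, head dimension 128, bf16), used here as illustrative assumptions:

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x dtype bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for tokens in (512, 2048, 8192):
    print(f"{tokens:>5} prompt tokens -> {kv_cache_bytes(tokens) / 1e6:,.0f} MB of KV cache")
# At a few thousand tokens this is already hundreds of MB that prefill has to
# hand over to decode, which is why the transfer path matters so much.
```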
This means, by doing disagg, you are basically signing up for these hard optimizations, like request scheduling and overlapping data transfer and compute at every single layer. You might also want to do a little bit of additional tuning based on your own unique network environment. What's more, how would you push a binary to it? How would you update model weights? That alone probably needs a full 40-minute session to explain. No disagg, maybe caching is easier? True, but LLMs have this annoying positional encoding. The same tokens showing up at the beginning of the sentence versus the end of the sentence will have different embeddings. Imagine you are trying to hit the cache, but the requests you are sending include the last 10 messages from your customer, so the prefix shifts every turn. Then, nothing will be cached. To improve the cache hit rate, you not only need a big cache, as we discussed when building the hierarchy, you also need prompts that are very cacheable.
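Besides cacheable prompts, requests from the same session also need to keep landing on the host that actually holds their cache. A minimal consistent-hash ring for that kind of sticky routing, as a sketch rather than the production router: the session ID (not the prompt) is hashed so a session and its retries keep hitting the same host, virtual nodes smooth out the load, and failover simply walks to the next live host on the ring.

```python
import bisect
import hashlib

class ConsistentHashRouter:
    def __init__(self, hosts, vnodes=100):
        self.ring = []                                  # sorted list of (hash, host)
        for host in hosts:
            for v in range(vnodes):
                self.ring.append((self._h(f"{host}#{v}"), host))
        self.ring.sort()
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def pick(self, session_id: str, down: set = frozenset()) -> str:
        """Route a session to a host; on failure, walk the ring so the
        fallback host is deterministic too (retries stay sticky)."""
        i = bisect.bisect(self.hashes, self._h(session_id)) % len(self.ring)
        for step in range(len(self.ring)):
            host = self.ring[(i + step) % len(self.ring)][1]
            if host not in down:
                return host
        raise RuntimeError("no healthy hosts")

router = ConsistentHashRouter([f"decode-host-{i}" for i in range(8)])
print(router.pick("session-abc123"))                           # always the same host
print(router.pick("session-abc123", down={"decode-host-3"}))   # sticky failover
```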
To solve this problem, we co-designed chat history management with our product team, and we even further customized the request data flow to not only maximize cache hit rate, but also leave a longer inference latency budget. To route these requests to the host with the cache, we use consistent hashing (as sketched above) to sticky-route requests from the same session to the same host, and we also have to ensure that retries on host failure stay sticky. Maybe we can try quantization then. If you are in a resource-constrained environment and you really don't have a choice, quantization is basically a functional requirement. When you do have the choice, you might want to be a little bit more paranoid. What we commonly see is that you quantize the weights to fp8 and you run some benchmark, let's say MMLU.
The scores are the same, sometimes even higher. Then when you release the quantization to production, some of your customers just show up and say, something is not working. Most of the benchmarks are saturated and may not best represent your own product objective. You should build your own product eval and use a slow-roll process to land these wins. Speaking of slow rolls and testing, how do we maintain these nice scores over time? Inference bugs can manifest as subtle performance degradation, because LLMs are probabilistic models. It's possible you have something horribly wrong, but the result still comes out decently correct. To prevent your models from getting dumber over time, our approach is no different than traditional CI/CD. Just run small benchmarks on every single one of your diffs, and run more comprehensive ones on every release of the inference engine.
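In practice, the "small benchmark on every diff" can be as simple as an eval gate that compares against a stored baseline. Everything below (the `run_model` stub, the golden prompts, the threshold, the baseline file name) is a hypothetical placeholder; the shape of the check is the point.

```python
import json
import statistics

def run_model(prompt: str) -> str:
    """Placeholder for a call into the inference engine under test."""
    if "6 x 7" in prompt:
        return "The answer is 42."
    if "France" in prompt:
        return "Paris"
    return "unknown"

GOLDEN = [  # tiny, fast eval set that runs on every diff
    {"prompt": "What is 6 x 7? Answer with a number.", "expected": "42"},
    {"prompt": "Capital of France? One word.", "expected": "Paris"},
]

def score() -> float:
    hits = [case["expected"].lower() in run_model(case["prompt"]).lower()
            for case in GOLDEN]
    return statistics.mean(hits)

def check_against_baseline(baseline_path="baseline.json", tolerance=0.02):
    current = score()
    try:
        baseline = json.load(open(baseline_path))["score"]
    except FileNotFoundError:
        baseline = current                                  # first run sets the baseline
        json.dump({"score": baseline}, open(baseline_path, "w"))
    assert current >= baseline - tolerance, (
        f"quality regression: {current:.3f} < baseline {baseline:.3f}")
    print(f"eval ok: {current:.3f} (baseline {baseline:.3f})")

if __name__ == "__main__":
    check_against_baseline()
```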
Another time for takeaways. We lose a lot of theoretical FLOPS in a production environment. Don't get surprised about it. When you look at the end-to-end latency, inference is often just a small portion of it. If you want to make good use of the KV cache, and especially a persistent KV cache, you often have to do product and infra co-optimization. Lastly, don't forget to continuously evaluate and test all of these acceleration techniques directly using the signals from your product.
Challenge 4 – Scaling
Our InfoQ agent is a big hit. We decided to make it a little bit more fun. We want to let our users customize which speakers they want to subscribe to. What else is going to change? This is when we encounter our last stage of challenges, which is scaling, a never-ending topic for all infra engineers. You get so many rocket parts, and they're scattered around. How are we going to put them together to make them fly? To assess that, let's ask ourselves, what numbers will get bigger as you scale your apps? We'll have more deployments. We'll use more GPUs. We'll involve more developers and serve more models. Let's look at this one by one.
Step one, we get more deployments. Let's use disagg as an example, because I like it. We mentioned that the prefill service needs more GPU compute, so you find a hardware, call it hardware P, that uniquely fits this job. The decode service needs more memory bandwidth, so you find a hardware D for that job. When you look at the hosts in the data center, region 1 is full of hardware P, and region 2 is full of hardware D. Interesting. Remember, prefill and decode have to transfer hundreds of megabytes of data to each other for every single request, so you probably don't want to do that cross-region. This means if you opt in for disaggregated deployments with some hardware preferences, you actually need to maintain and optimize inference deployments not for one setup, but for three, because you also need to handle the remainders.
Thinking about this, you might have different answers about what your net win is going to look like here. Remember the idea of using context parallelism for very long inputs. It helps you reduce the TTFT. 90% of the time, you're on this happy path: you get high-bandwidth networking from RDMA, and everything is fast and nice. However, a failure of any of the 40 GPUs in one partition will take down your entire process group. Assuming at any time we have 3% random failures for your GPU cards, the blast radius grows exponentially as the smallest inference deployment unit gets bigger. With faster interconnects, these hosts are required to be placed physically closer together, and whenever there are maintenance events going on in the infra, like a network device upgrade or daily maintenance from the data center, all of these hosts will be gone at the same time.
At Meta, we actually have to create a dedicated job scheduler to allocate hosts for distributed inference. We also talked about heterogeneous deployments and hosts with shared fate. What about autoscaling? In the CPU world, if you have a service with changing workloads throughout the day, you find a throughput metric like QPS or your request queue size, then you turn on autoscaling and forget about it. For LLMs, number one, the right throughput metric is extremely tricky to find, because the bottleneck depends on your workload. QPS obviously does not work. Tokens per second works, with a lot of caveats.
Number two, you cannot freely upsize, as you're likely to run into the GPU supply limit first. Putting the allocation challenges together, you need a deployment allocator that places your jobs with awareness of network topologies and maintenance events, to minimize the impact of infrastructure activities. To tackle heterogeneous hardware and autoscaling, you need a deployment solver that treats autoscaling as a shard placement problem. It understands the supply and demand of your inference services, and it can make decent decisions even when demand exceeds supply. At scale, all of these traditional operational problems are solved with quite sophisticated software and algorithms.
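As a caricature of that solver idea, here is a greedy priority-based allocator: each deployment asks for GPUs in multiples of its shard size, and low-priority batch work absorbs the shortfall when demand exceeds supply. A real solver also has to reason about hardware types, network topology, and maintenance events, none of which appear here.

```python
def allocate(supply_gpus: int, demands: list) -> dict:
    """Greedy priority-based allocation: each deployment gets as many whole
    shards as the remaining supply allows, highest priority first."""
    plan, remaining = {}, supply_gpus
    for d in sorted(demands, key=lambda d: d["priority"]):
        shards_wanted = d["want"] // d["shard_gpus"]
        shards_possible = remaining // d["shard_gpus"]
        shards = min(shards_wanted, shards_possible)
        plan[d["name"]] = shards * d["shard_gpus"]
        remaining -= plan[d["name"]]
    plan["unallocated"] = remaining
    return plan

demands = [
    {"name": "chat-70b",      "want": 512, "shard_gpus": 8, "priority": 0},
    {"name": "safety-8b",     "want": 64,  "shard_gpus": 1, "priority": 1},
    {"name": "batch-distill", "want": 512, "shard_gpus": 8, "priority": 2},
]
print(allocate(768, demands))   # the batch work absorbs the shortfall
```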
We talked about allocation at scale. Step two is cost saving. We've covered dozens of options to make things faster and cheaper. It's going to be quite tedious to manually apply every single one of them and tune every job individually. When you draw a graph of latency and throughput for the different serving options, like different numbers of GPUs or the different inference acceleration techniques you try, you will get a graph that looks like the one on the left. You will see the lines eventually crossing each other at some point. This means your most cost-effective option actually depends on your latency requirement. We have to categorize these workloads to create something like an inference manual, to really allow our customers to choose what's best for them. This type of inference manual requires very extensive performance benchmarking automation and a lot of data science.
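The crossing lines boil down to a simple rule: for each latency SLO, pick the cheapest configuration that still meets it. The benchmark rows below are invented numbers purely to illustrate the crossover; an actual inference manual would be built from real measurements.

```python
# Hypothetical benchmark rows: (configuration, p99 per-token latency in ms, relative cost)
BENCH = [
    ("70B bf16, TP8",           30, 0.90),   # fastest, most GPUs
    ("70B fp8,  TP4",           45, 0.55),
    ("70B fp8,  TP2 + specdec", 70, 0.40),
    ("70B fp8,  TP2",           90, 0.35),   # slowest, cheapest
]

def cheapest_meeting_slo(slo_ms: float):
    eligible = [row for row in BENCH if row[1] <= slo_ms]
    return min(eligible, key=lambda row: row[2]) if eligible else None

for slo in (35, 50, 75, 120):
    print(f"SLO {slo:>3} ms -> {cheapest_meeting_slo(slo)}")
# The winning option flips as the SLO relaxes, which is exactly why the
# latency/throughput curves cross.
```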
Suppose we solve the volume problem; cost saving also has to be looked at end-to-end, just like latency. Your toolchain will get so much more effective if you bring in some of the knowledge about inference acceleration from this talk. I commonly hear people raise this question: look at this workload, look at the gap between the off-peaks and the peaks, can we run something else to leverage this free capacity? Well, this green line is your provisioned throughput, and you can actually raise that line 3x by applying some of the ideas I mentioned in this talk. This is good to hear. I have many deployments. Where do I start? A common perception I see is that 90% of people focus on the one, two, three head models, because we think those deployments consume all of the GPUs.
When you actually measure, you will find many more tail deployments than you can imagine, and collectively, they might consume even more GPUs than your main model. This means observability and automation are no longer nice-to-haves; they directly help you claim all of this low-hanging fruit. To put it all together, remember the 10x loss. Your ideal throughput is determined by your workload and your hardware. Inference optimization helps you improve your achievable throughput.
The deployment option you choose determines your provisioned throughput, and you might not choose the best one available to you. You want your deployments to be sized automatically, because you are charged by GPUs, not by tokens, when you're hosting your own dedicated deployments. With a good data foundation and by removing some footguns for our customers, we are able to really minimize the gap across the four. Next step, running a product is beyond serving one version of each model. You want to split your traffic and evaluate a few launch candidates. This has to happen for multiple models on the platform. What does this mean? Everything we've discussed so far gets yet another dimension multiplied onto the whole problem set.
Summary
You've already gotten a glance into all of the building blocks of a scalable LLM serving infrastructure. We start from the model and hardware, which are really the spotlight for everything. To fit the model onto the hardware efficiently, you will need a good Model Runner, a performant execution engine with a handful of inference acceleration techniques. To get the engine ready to become a production service, we need to think about end-to-end latency, efficiency, and correctness by investing in monitoring, routing, product integration, scheduling optimization, and continuous evaluation. A serving infrastructure needs to run multiple LLM services.
This is when allocation, deployment management, model management, and experimentation all come into play. We haven't even talked about quotas and interfaces. Look at this graph: model and hardware are really just the tip of the whole LLM serving iceberg. To get the best of all worlds, we often have to do vertical optimizations across the entire stack. I think we spent a good amount of time scratching the surface of LLM serving infrastructure. I hope you find something useful in this talk, and hopefully you can get some insights for your own inference strategies. Many of the pains are still very real. I hope to see more innovations to help us scale development and deployment for GenAI applications all together.
Questions and Answers
Participant 1: What are you focusing on right now? Obviously, there’s a lot of areas, but just currently, what is one or two areas that you’re focusing on at the moment?
Ye (Charlotte) Qi: Getting ready for the next generation of Llama, obviously.
Participant 1: What’s the toughest challenge in this space that you worked on in the last six months or a year?
Ye (Charlotte) Qi: I think for model serving, especially if you want to get things end-to-end, it's not like you only care about modeling, or you hire a bunch of GPU experts, or you hire a bunch of distributed systems experts. You actually have to have all three. You actually have to make these people talk to each other and understand the tradeoffs in the same language. Because a lot of the time, when you get things online, there's only limited stuff that you can do by only applying infra techniques. Sometimes you do have to make some modifications to your models directly. You have to work closely with the modeling people, ensuring that the modeling direction can manifest into the correct net behavior for all of those techniques together.
As I show here, yes, a lot of the time, people will say, LLM serving, you have a lot of interesting problems to solve. Still, around 70% or even 80% of the time has to be spent on typical systems generalist type of work. We have to build this foundation to be very solid, and make sure it's scalable and, especially, debuggable.
Participant 2: At your peak times, what was the maximum GPU utilization that you have gone through? Also, in terms of hardware performance specifically, do you work with vendor partners like NVIDIA or AMD to make it even better in terms of acceleration techniques? They may provide libraries that give better performance by using the majority of the hardware resources. How does that fit into your work here?
Ye (Charlotte) Qi: First, about GPU utilization. It's actually quite complicated to use one single metric to measure your GPU utilization, because a GPU is a very massively parallel machine. For NVIDIA hardware, there is this device-active metric that you can get, but that metric is very coarse. It only tells you that something is happening on this GPU. It doesn't measure at a finer grain how you are utilizing everything there, because a single GPU has hundreds of processing units within it. We also have a metric to measure utilization at the processing-unit level.
Typically, in our work, we do cover all of those as well. The exact GPU device utilization and streaming multiprocessor utilization really depend on your workload. Typically, we do see them go up to more than 50% very easily for our workloads. Sometimes, if you're only looking at utilization, it can get misleading from time to time. Like we mentioned, when you apply caching, your tokens per second will go up, but your utilization will actually go down. You need to make sure you are tracking the stuff that you care about.
About working with hardware vendors to improve performance: Meta is a quite big company. We have people working on more of an internal cloud LLM serving platform. We are also partnering very closely with PyTorch. We are basically collaborating together. Say we first introduce AMD to our fleet; we work very closely with the PyTorch team. They gather signals directly from the fleet to know, are the features all covered? Are all of those operators running? Even today, I think there's still a tiny bit of a gap. They get the signals. They get both kernel-level performance as well as end-to-end performance. They are indeed working closely. The exact partnership is outside of my own expertise.