Mobile Monitoring Solutions

Close this search box.

Presentation: Modern Compute Stack for Scaling Large AI/ML/LLM Workloads

MMS Founder
MMS Jules Damji

Article originally posted on InfoQ. Visit InfoQ


Damji: Let’s go on right about how the modern ML, AI, and LLM stack has actually emerged over the period of years. Nischal from Scoutbee talked about the data stack, and also an emerging and evolving machine learning stack from the traditional ways of doing machine learning to today, where we actually have an LLM. You can’t really attend any ML track without the mention of LLM and ChatGPT. I realize how ubiquitous that term has evolved and has become over a period of time. About a month and a half ago, I was celebrating my 21st birthday, and I went to Whole Foods and bought loads of meat. I must have bought about 50 pounds of meat, lamb chops, filet mignon, rib eye, you name it, and I had about a dozen bags. I went to the cashier and I plopped them on the conveyor belt. She looked at me and she’s smiling, say, “Are you hosting a barbecue?” I said, “Yes, I’m hosting a barbecue this weekend.” She turned around and with a wry smile, she said, you might want to ask ChatGPT for a good recipe for a barbecue. It dawned on me that has become such a ubiquitous term.


I’m a lead developer at Anyscale, part of the Ray team, which is the open source product. Prior to that I was at Databricks for close to six years. I contributed to Apache Spark, MLflow, and also the principal author of “Learning Spark.” Prior to that, I was at Hortonworks, leading the efforts of Hortonworks. The three elephants, anybody used Hadoop before? Before I decided to move into advocacy and work with developers and stand and deliver, I was a very introverted software engineer at Sun Microsystems for about 10 years, and then went to Netscape. Then was at @Home, which was the first company that actually introduced cable modem, some of you were using cable modem, it actually came from @Home. Then went to Loudcloud and Opsware, which was a company, prior to AWS was invented, we were actually doing cloud computing. Then, Verisign. I’ve had the good fortune, I had the good privilege, and I was just at the right time at the right moment to see some of the companies how they actually made shift. Then today, we are actually in the middle of another evolution, which is LLM.

I work for a company called Anyscale. We were the original creators of Ray. Ray is a unified general-purpose framework that allows you to do distributed computing. Ray evolved from the emergence of other research labs at UC Berkeley called AMPLab. That AMP stands for algorithms, machine, and people. That’s where Spark, Mesos, Caffe came into being as well. The next thing was RISELab from where Ray actually came. Anyscale were the original people who actually created Ray. The whole idea behind why we actually want to do and what Anyscale actually does is that we provide you the scalable compute. We are the scalable compute for your managed service with Ray being at the core of it, and then provide you the best platform to write now all these LLM based applications. Why do we do it? I think, if you have been following the trend of how machine learning has evolved, and how data has grown big, and how machine models are getting bigger, scaling is a necessity, it’s not an exception. It’s a norm these days. We want to make sure that scaling is not hard, it’s easy, so anybody who is a Python programmer can write a distributed application as if they were writing an ordinary Python program as well.


I’m going to first get into, what are some of the challenges of the existing machine learning stack? Some of the things that we have worked with our Lighthouse users who keep on telling us what are some of the challenges they’re actually running into, and are they having issues with managing the infrastructure, which gets very out of date, given that the state of art of ML workload is changing every day. I’m just going to very briefly get into what is Ray and why Ray, and how AI library helps you to make that as part of the stack. I’m going to concentrate on two particular libraries, which are a very important part of that particular ML stack: Ray Data, and Ray Train. Then move into how the emerging modern stack for fine-tuning LLMs plays a huge role, and how Ray libraries actually fit into this. Then we’ll use that, whatever I’ve talked about the stack, the important components of the stack, to show you how we can actually fine-tune a 6-Gigabyte GPT model. Hopefully, what I did was I ran the model ahead, and I’ve few parts which I’m going to run live.

Challenges with Existing ML/AI Stack

Let me talk about the challenges of existing model. We’ve been working with ML users for quite [inaudible 00:06:17], and some of the things that we keep on hearing from them are, “Training is taking too long. Data is too big to fit on one machine. Models are getting larger, we can’t really fit on one machine, we have to shard those particular models.” The existing infrastructure gets out of date very easily, because the pace at which everything is changing is at lightning speed, and we get into this rathole of constantly updating that. How do these challenges and problems manifest itself when we actually talk to the user? Some of these challenges and some of these concerns and complaints were echoed quite a bit by many of the users who went through the journey of Ray, and decided why to move into Ray at the Ray Summit. Challenge number one with machine learning today is how to go from development to production scale. Here’s a very small example, a very common example that most of you who are actually doing machine learning will know. On the left-hand side, you actually have an Apache Spark, which you’re actually using to do your preprocessing. You’re getting data from either an ETL, already clean data, and you’re doing some few processing, or you’re doing major preprocessing at a massive scale. Once you’ve done that, you’ve fed that particular clean data into your training model, and this training function happens to be TensorFlow, using Horovod, so you can actually distribute all the training functions across multiple clusters. Once you’re done with that, you get your evaluation data from your pretraining data, put your model into your model store, and then you’re going to do evaluation, again, distributed across that. How are you going to orchestrate that? You’re going to use like an Airflow. Right here and there, you actually have three bespoke systems: you have to worry about managing, you have to worry about monitoring, you have to worry about maintaining it. That over a period of time becomes a bit difficult, because imagine if this happens to be your development environment, how do you actually have all this sitting on your laptop? This was a major problem that we actually kept on hearing from users, was that it was hard for us to go from development to production scale, because this infrastructure did not match, they had to SSH into a machine. There was like a staging environment. They duplicated a production environment, and that it would actually run in test and it would come back, and so on, so forth. That was challenge number one, it was that the developers’ velocity was very slow. It took a lot of time for them to do that. They had to worry about managing that infrastructure.

The second challenge was that, as you know today, the ML space or the AI space, happens to get very fast, it’s revolving very rapidly, and your infrastructure gets out of date. Infrastructure gets out of date because supposing you’re actually building, you want to use state of the art machine learning frameworks, such as JAX, or you want to use another machine learning called Alpa. You don’t have that in the infrastructure. Why? Because you haven’t provisioned for that. What happens at this point is that your infrastructure gets out of date. Any machine learning instead of the frameworks you actually want to use have to be incorporated into the existing infrastructure. You get into the rut. What I call, if I can use the analogy, you know how a hamster is going on a wheel, you’re constantly updating infrastructure just to accommodate those new frameworks. That becomes a huge burden on especially small companies who really want to concentrate on using the greatest and latest frameworks. What happens with this is that if you actually had to somehow analyze these pretty good problems, you get this chain of reasoning of problems. The first problem is that scaling is hard for the data scientists because they don’t have to worry about machine learning code. They don’t have to worry about infrastructure. All they want to care about is write the ML models and get on with the business. The rationale is that, why don’t we just go with the bespoke or a Windows solution that gives us the scale, and we don’t have to worry about that. The problem there is that when the solutions sometimes are either limited to one particular framework, or maybe one or two frameworks that they support, and if you have a new framework coming up, now you have to wait for them to update that. That cycle for a long-established platform doesn’t give you the developer velocity that you need. It limits your flexibility. Then you might argue, why don’t we just go and build our own infrastructure? We don’t want to depend on some vendor, why don’t we just do that? It turns out that most companies who actually want to build this custom infrastructure, it can be a bit too hard because they don’t have the T-shaped organizations where they have the breadth and they have the depth. You get into this long cycle of building the infrastructure to be able to do that. This gives you this root cause analysis of a reason and justification to have one framework that supports all the latest and the greatest machine learning libraries that you actually want. You want to have the infrastructure to be automatically managed by this particular framework.

How do we actually solve these problems? As I mentioned earlier, we have to deal with the blessings of scale. Today, you can see the models are getting larger. You want to be able to deal with those blessings of scale. You want to make sure that your developer velocity increases, if you are actually using whatever that particular framework is. You don’t reason about managing infrastructure, that should actually be taken care by the underlying framework or the underlying library, the underlying infrastructure that you have. You want to be able to do end-to-end machine learning pipelines that could include data ingestion, that could include training and tuning, serving, batch inference, and so on. What we really covered is what I would call the simplicity and the blessings of scale. If I could actually somehow frame that in an analogy, in the good old days about 2010, things were simple, things were small. You could actually write your entire machine learning linear regression, or whatever algorithm you actually use on scikit-learn. Download the machine learning files that you actually need on your laptop. Build your model on your laptop, test it, and you’re done. Everything actually fitted in one model. Where we are today, that doesn’t work. It just does not work. Why? Because the data is too big. Why? Because the models are too large. Why? Because the machines are smaller, you need to scale out.

What we really want is we want to covet the simplicity of yesterday with the blessings of today’s scale. What we desire is the simplicity and scale that you can actually write your entire machine learning applications in one file or one Python, that would include other files, but it’s all self-contained. You want to be able to do all the preprocessing within that particular file, you don’t need to have a different framework to do that. Preprocessing could be by having data sitting on the Delta Lake, you might have data sitting on a Snowflake, you might have data sitting on a disk, or parquet files. You’ll be able to do that right there within that particular framework. You also want to be able to take the state-of-the-art training frameworks that this particular framework supports that you can actually use, whether you want to use Hugging Face, or whether you want to use transformers, or whether you want to use TensorFlow or PyTorch, or PyTorch Lightning. You want to be able to actually train all that within. You don’t have to wait for a machine learning framework to be available. It should be part and parcel of this particular existing framework, and we’re able to do score and evaluation right there within this particular framework. My assertion is an opinionated one, obviously, that Ray and Ray libraries provides you that particular simple application that allows you to do all those things within the framework. It gives you the future provability to say, ok, I’ve got this pretty good framework. I’m supporting all this state of the art, but supposing tomorrow I want to use JAX, or tomorrow I want to use Alpa, how long is it going to take for me to do that? We have made steadfast promise to the community that any new framework that’s demanding that’s actually coming state of the art, we’re going to integrate that so you can actually use it, so you don’t have to wait for six months or a year to do that. It will be available within the next subsequent release. My contention, obviously is an opinionated one, that I aver that Ray can actually do that for you, and the Ray libraries together gives you that single simplicity and blessings of scale.

What Is Ray?

I talked about the challenges, one, that productionizing machine learning is difficult for the simple reason is because your infrastructure gets too complicated. You as a data scientist could care less about managing the infrastructure, all you care about is writing code, all you care about is using the state of the art. Second, that machine learning infrastructure gets outdated. To be abreast, you actually need something that is very easy and future proof. I’m going to talk about what is Ray and then move into the next layer of the stack. One thing I want to take away is Ray is a simple general-purpose library that allows you to scale any particular workload. That workload could be a Python application, that could be a machine learning workload, that could be an LLM workload, that could be batch inference. Whatever you desire that you actually want to distribute at particular scale, this library, Ray, gives you the primitives to do that. It comes straight out of the box with libraries that allows you to do data ingestion, the libraries allow you to do scaling, serving, that allows you to do hyperparameter tuning and training as well. It runs on your laptops. If you just do pip install Ray, you should be able to run it on your laptop with 10 cores that you actually have, and run with it, with reasonable amount of data that actually fits in.

It is a layered cake of functionality and capability that I want to somehow bring to your attention. If you have to look at that particular layer, if you have to look at that particular stack, you have your cloud provider. If you go one layer above, you hit the Ray Core abstraction, which is tasks, actors, and future objects. Tasks are units of execution that you can actually take your Python and distribute it across your entire cluster. Actors are stateful tasks or stateful classes that you can take your Python class and make that as a stateful service and deploy it across the cluster. The objects are distributed objects that can materialize as futures. Those who are familiar with object-oriented concepts of futures, those are some things that are given to you as a reference, and you can query about them sometime in the future when they materialize where we actually created that. If you go one level above, you have these libraries that I talked about. Each of these libraries map to a particular function or workload that most of you in the ML space or in the data space are familiar with, or probably doing it. One is ingesting data or preprocessing data before you actually give it to your trainer. You’re probably doing training, whether it’s a statistical model, it’s a traditional model, or whether it’s actually doing a fine-tuner model, or if you’re actually building a model from scratch. You might be doing hyperparameter tuning. It is fair where you actually want to have the best tuning parameters, so you actually get the best results. You might want to serve. Serve could be, you might want to do online serving, or you might want to do batch inference with a large dataset. Then last part is reinforcement learning. If you actually are in the space of reinforcement learning, that’s the library they provide. If you don’t care about any of these things, you just happen to be a physicist or atmospheric scientist who cares about doing simulations or modeling of your Python applications at scale, you can use these particular primitives to write your own distributed application using task, actors, and remote objects.

Some are fibbing about it. There are people who are actually using Ray at a massive scale. These are some of the companies who have been using Ray for a few years. There are new companies who actually have come on to the fore, one of them happens to be DoorDash, and Pinterest, and J.P. Morgan, and all these companies who spoke at the Ray Summit, are using Ray at scale. ChatGPT and GPT-3 and GPT-4 were built using Ray. It’s not something that I’m just sending over here and trying to proselytize. It is being used in production by some of these notable companies. We have a large community, vanity stars of 25,000. We have loads of contributors.

Ray AI Libraries: Ray Data + Ray Trainer

I’m going to shift the focus, I talked about the challenges, I talked about what and why Ray is. Let me now get into the second layer of what are some of the Ray libraries and how they actually make up the training stack. This is the emerging training stack that we contend, and it’s an opinionated one, that we are beginning to actually see emerge with people actually using that on a regular basis. At the top, if you’re actually using any of the LLM models, you might be using the Hugging Face model, checkpoint, you’re just probably going to bring that checkpoint and load it as a model. If the model is very huge, you’re probably going to use DeepSpeed to shard the particular model. Model parallelism allows you to break the models into smaller parts and cross it across the cluster. You might be using Accelerate library, which is now part of the Hugging Face library transformer. You might be changing or moving data from CPU to GPU and GPU to CPU back. The way you do that using the Zero-3 stage that automatically actually does that when you’re moving your Tensors from CPU to GPU. All that’s a very common thing that you actually do when Ray tuning. All that is enabled because Ray Train has incorporated as a library for all the training that actually happens. Ray Data is the one that actually allows you to shard your data across. We’ll look at some of those in more detail. For the orchestration, you can use Ray that allows you to scale things across, downscale. If you want to use the latest hardware accelerators, you can use GPUs, TPUs, you can use Inferentia 1 and 2, and Trainium. This creates the modern emerging stack for any of your workloads, whether it’s LLMs, whether it’s deep learning frameworks, or whether it’s just traditional model. You can use this particular layer stack. This weird thing is the current stack for you to be able to scale your fine-tuning models, or scale your deep learning models at a massive scale.

Why and when would I use these Ray AI libraries? You are giving me all these data points about what they are, and the benefits. There are certain scenarios where you actually want to use it. Let’s say you just want to scale a type of workload, let’s say you just want to do data inference. You want to do batch inference at a particular scale, you can use Ray Data to actually do that. Or you just want to serve a particular model, you can actually use Ray Serve to do that. It’s not one all, you can actually use individual parts of it. Or you want to build an entire pipeline from data ingestion, to training, to tuning, to serving, you can actually use those because they’re very composable libraries, they actually work cohesively together. They’re very declarative, in that you’ll see those in the APIs momentarily. Or you actually just want to use an existing library that has been incorporated, let’s say, like Hugging Face, and you want to do something with it, which is what we’re going to concentrate on. Or you actually want to build an entire platform, your machine learning platform for the company, and you want to abstract away all the Ray libraries and you just want to have your high-level ability to tell your data scientists machine learning to do that. Companies like Shopify, and Spotify, and Pinterest, and all these other companies are actually doing it already. They’ve abstracted Ray on top of their own ML libraries that actually call into that. These are all different scenarios. These are all different ways you’re actually going to actually use this particular stack.

Let’s talk about Ray Data. Ray Data is a high-level library that provides you a common format. You can actually read data from the disk in any particular format or any way from the cloud, whether you’re using images, or whether you’re using CVS, whether you have a parquet file, whether you have Hugging Face data, we provide you the ability to use very simple APIs to actually read stuff in that. It automatically distributes your shards because if you have data, and maybe one of the challenges I talked about is data is too big, and it can’t fit in one machine. Ray Data allows you to shard your data to be able to feed that into your trainers which are running in a distributed manner. It automatically does that for you. It works very well with Ray Train. Any data that you actually have that doesn’t fit on one machine, you can actually shard that data across your workers, which are training on that particular data. Ray Data is an important part in that particular stack I talked about. Here’s an example how we actually use it. You might be able to read it from the storage, if you have a parquet file or if you have a CVS file. You could use it to transform data. Transformation is an important part in preprocessing, or transformation is an important part when you’re doing batch inference. Ray Data gives you these very easy map_batches functions that you can provide your own function or your own UDF to say, go ahead and process this data, or transform this data in this particular way. You can do that in batches, or you can do that in function. Finally, you can consume the data. When data arrives on your worker, you can actually use each of the batches to give you the batch, and now your training function can work on that particular batch to do that. These are being used today at Amazon at a petabyte scale. This is not something that I’m just saying without social proof. People are actually using. There are blogs written about it. I’ll give you some resources at the end. That’s how data overview actually works.

Here’s another example. Like I said, data preprocessor comes straight out of the box. There are two ways to do that. You can actually use your default preprocessor, which are very common. You might be using scaling, you might be using transforming, so you actually use that straight out of the box. Or you can provide your own UDFs to actually do that. You want to transform the data, but you want to transform in a certain particular way. Whether it’s Tensors, that you actually want to change the rotation, shaving something, you want to increase or decrease the size, all these are transformations that are very common, as the last mile ingestion for your trainers. I would contend and I would say that Ray Data is not a replacement for a data processing or ETL library. Its primary goal and purpose is to provide you last mile data ingestion for your trainers. That’s an important part of it. Here is an example where you might want to use scalers, or you might want to use concatenators. Its API is very similar to something you’re familiar with. If you come from scikit-learn, you know how to use fit and transform that gives you that. That’s one way to do that. Straight out of the box, you can actually use scalers, common ML transformation.

The second bit is providing user defined functions. Here’s a very simple workflow, here’s a very simple thing you might want to do. For example, you want to read enormous amounts of data. You want to preprocess in a distributed manner. Once you’re preprocessing, you want to inference it and then take that inference or prediction that you have, and then save it. The code to do that is very simple. It’s very declarative. It’s very Pythonic in nature, and it’s composable. Some of you actually come from Spark background, you actually see that this is a way to create a chain of operations, and then these can actually happen in a streaming fashion. I can define a particular classable model. My process function could be any Python function I define. My model can actually take a particular batch, which is in the NumPy Ray. Then I can call my model to actually do the batch and then return those as results. When I’m stringing these operations together, I can actually do that using the chaining operator, while I’m reading the data. The output of that particular data goes to the map_batches, it uses pre_function to do that. As soon as that’s done, it goes to the map_batches to do the prediction. Then I can write to and save it to the parquet file. That’s one simple way of doing your batch inference.

What about if you have a multistage heterogeneous pipeline, which is very common? If you’re doing deep learning inference pipeline, you might have part of the data that might be used on the CPU, and part of the data that’s going to be used in GPU. How do you actually combine heterogeneous? How do you actually combine both? I want this part of the data to be used on the CPU cluster, and this part to be used on the GPU. This is a typical pipeline that you can use with Ray Data to use UDFs. Here, I’m creating a particle class, where I have a callable function, a callable class that will take a batch that’s actually sent to me. I’m going to invoke my model on that particular batch and get the results out. Those batch could be on a GPU or on a CPU. How do you actually determine, how do you declare, how do you tell the Ray Data, this needs to be GPU? Here, again, is a very simple example, which is built on the previous one, but except over here, I’m using a heterogeneous environment. I’m reading the data, again. I want to do preprocessing of the data. I might want to change certain things of the data. That’s the way CPUs need it. I’m going to send that particular batch of data to the cluster that actually has CPUs. When I get my results back, I’m going to do the batch inference on that. Here, I’m going to provide my class. I’m going to say this particular part of the operation for batch inference will need one GPU. I’m going to use an actual pool of X number of actors. I might create five or six of them, and then data will be sharded across those, and you’re going to do a batch inference. Once you get the data back, you’re actually going to write to a parquet, or save it to wherever you actually want. That was the Ray Data, which is an important part when you’re building machine learning applications at scale, when you’re making that part of the stack.

The other bit I want to share with you is about the Ray Train. Ray Train is the important component of that, because it has the flexibility and it aspires and wishes to make your framework foolproof. What I mean by future foolproof is that any new library that actually comes up, can be plopped into this particular last top bit. Let’s say we have a JAX trainer that we actually want to use, or if you want to use Alpa, that can actually be integrated very easily with Ray Train. Because Ray Train is built on Ray, it actually automatically gets all the scaling, and all the fault tolerance with it. It can run on any of the cloud providers that you have. Let’s go a little deeper into that particular Ray Train. At the core of the trainer, you actually have two concepts. One is how you want to scale across your cluster. What is the training function that you’re going to provide to this high-level API, that’s going to run on each and every worker. Over here, we have a training function that can be written for any of these libraries which is integrated with Ray Train, with this Lightning PyTorch, Hugging Face, Horovod, JAX, Alpa. You write that particular training function, and you’re going to tell it how you’re going to scale it. This expresses into a very simple API, where we have a generic TorchTrainer, that you actually create a particular class into the Coach trainer. You’re going to provide the training function that you’re going to write. This could be your non-changing code that you already have already written, all you do is just plop it into here, and that becomes your training function. You provide that scaling configuration to the TorchTrainer to say, here’s my training function, go ahead and run it on all these particular clusters. Here’s how I want you to scale it. Here are the resources I want you to use for that. Then you just use training fit, and you get the results back. Simple, declarative, and ability to incorporate any new libraries that you want. You’re just writing very simple Python code, and it’s very composable that way.

Here’s an example of how would you actually take an existing PyTorch that you’re actually writing for your deep learning framework. Let’s say you’re not using Ray, let’s say you’re using just PyTorch, you’re going to write all this boiler code for you. You’re going to set up the distributed environment. You’re going to create a DDP, which is a distributed data parallel protocol. You’re going to set up your distributed sampler to move data between CPU and GPUs. Then you’re going to move that data over to the GPU where you’re actually doing the training. You can crystallize that particular code and simplify with those three lines of green code that you actually have. You inherit, or you include a couple of functions from Ray Train Torch, prepare data loader and prepare a model that prepares the data to move your data from CPU to GPU, to device or from device depending on if you actually have that. You have the data loader that you actually load. Then you start training. That’s your training function. You provide the same training function to your TorchTrainer, and you run with it. Simple, declarative, composable API that allows you to do that. That’s all encased in a high-level abstraction called TorchTrainer. Let’s say you have a Hugging Face transformer that you actually want to do the same thing. That you actually want to fine-tune a model or you just want to tune a particular model, you’re doing exactly the same thing. You’re taking your existing code that you have that you have written to train your transformer on one single machine, because that’s a single model, single GPU, and now you’re actually transferring that to Ray Train that allow you to distribute across, use multiple GPUs, multiple machines. Same training code that you wrote for transformer, you cut and paste in your training function, you provide the training function to the TorchTrainer, and you go off to the training: Pythonic, expressive, declarative, simple.

Another important part is the integration with doing checkpointing. Whenever you’re building or training large language models or deep learning models, at every epoch, or every X number of epochs, you do want to checkpoint that. If something goes wrong, you can start from the checkpoint. Before we actually improve how Ray Train would actually put the checkpoint, we used to put the checkpoints all back to the head node, and we were running into a lot of OOM memories, because some of the models actually have large Tensors for their weights. When they go through the backward pass, they actually have all the gradients [inaudible 00:34:26] from all the different workers at the head node, which then writes to a particular disk. That was creating a lot of OOM memory. Now we have the ability to specify, each and every worker would actually write to a local disk, and after that, asynchronously put it up to the cloud. That way, it reduces the traffic of moving all the data or copying all the data to the head node. It writes directly to disk. Then each and every individual worker would upload that to a particular central location in the cloud. Then when you actually want to read from the checkpoint, the libraries allow you to say, ok, here’s a checkpoint library sitting in a Loudcloud, go in and create my model from that particular checkpoint. It knows exactly what the directory structure looked like, reads each and every data dictionary weights, and then creates a particular model for you. That has become a very integral part of the Ray Train. How does Ray Train work? We haven’t invented any new protocol. We’re using the same DDP library underneath. Trainer is just an orchestration. It’s saying, ok, you’ve got four workers, I’m going to go in and create my DDP init Python processes, and the internal Python process is going to do all the gradient communication. It’s not like we’re recreating, we’re just orchestrating that, so that underneath, you don’t have to worry about the DDP protocol. The Python process groups actually do all the synchronization. That’s a huge benefit for Ray Train, to be part of that emerging stack, for doing distributed computing at scale.

Ray Train: Training, Scaling, and Fine-Tuning LLMs

We’re going to go one more level now to how we can actually use Ray Train for scaling and fine-tuning your LLM models. Some of the trends that you actually have seen, if you’ve been attending some of the conferences where they talk about training large models and large datasets, and invariably they go together. We are actually seeing that the NLP’s Moore’s Law now, is that every year the model size increases by 10x, because the old Moore’s Law is dead. We got to have some replacement for Moore’s law. Now they come up with this new idea about NLP Moore’s Law, which says your model size increases by 10x, and along with that, you also have the data that actually increases. Now you have a challenge of large models, and you have a challenge of large datasets. What do you do with that? You have to somehow deal with model parallelism. There are two kinds of model parallelism that most of the frameworks support. One is you shard the particular model, or you take layers of a particular model and you shard it across your worker. When you’re training that particular worker, all the gradients are eventually going to be synchronized. Or you take part of the matrix that you actually have, and you Tensor parallelize across all the GPUs. That’s another way of doing Tensor parallelism, that’s model parallelism. Most libraries today actually support that. Ray Train is incorporated, so you don’t have to worry about that. You just say, I want model parallelism, and it’ll give data. Data parallelism is another way, where you break out the model into significant shards, and then cross it across that. Together, if you actually have a framework that gives you both the data parallelism and model parallelism, you can actually do distributed training at scale.

What’s the LLM Stack?

I talked about Ray Data. I talked about how it integrates with transformers, how it integrates with the latest and state of the art frameworks. How Ray gives you the orchestration, and how Ray Train allows you to do that. Gives you this particular model emerging stack that we have seen, that has emerged and most people are actually using it today. We can substitute any other stuff that you actually have at the top level, and you could be downloading the model from Hugging Face, or you could be downloading model from some other serving agencies that actually have that. The important thing is that if you’re using a predominantly Hugging Face Transformers libraries, then you get all the benefits of using DeepSpeed and Accelerate and Zero-3, which are actually part and parcel of that library. You use Ray Train to create a high-level abstraction or TorchTrainer, and now you can actually use that in a distributed manner. That’s an important part. That’s an important element that we’ve actually seen that evolve. We begin to see more, even though I claim that it’s an opinionated stack, but we’re seeing more people actually are using that. Whether you’re using it from scratch, building an entire LLM model from scratch, whether you’re just doing full parameter fine-tuning, or whether you’re actually using partial fine-tuning, you’re using LoRA or you’re using PEFT, which accommodates this ability to fine-tune your model.

You get all that for free, because the distributed data parallel and all that is actually supported by the PyTorch libraries, and we just have a wrapper around that. The DeepSpeed training is now part of the Hugging Face transformer. That just comes automatically with the library, and so we use a particular library to do that. On the Ray cluster, all we’re doing is we’re creating all this PyTorch process init groups that will do your gradient synchronization, they will do the training automatically for you. You don’t have to worry about that. If you’re using a Hugging Face transformer, then you don’t have to change that much of code as I showed you. You’re taking pretty much the same code that you actually use to train your transformer and now you’re actually doing it in a distributed manner. Ray datasets provide you the ability to actually take your Hugging Face data that you have and then convert that into the Ray Data, and then distribute it across. This compatibility and this ability to actually move from one framework to the other one gives you two things. Remember the challenge one? The challenge one was, how do I actually go from development to production? You can actually do that all on your single machine. I’ll show you how we actually do that on your single machine, and distribute it across. It gives you the future foolproof because if your ML framework changes, and tomorrow you actually have a new ML framework, we can actually provide you, depending on how popular that ML framework is, the high-level TorchTrainer or transformer trainer or whatever it is, to be able for you to actually do that in a very timely fashion, so you still maintain your developer velocity.

Finetuning GPT-J/6B Ray AIR + HF + DeepSpeed

Here’s an example of what I’m going to do. I’m going to show you an example of how you can actually take an Eleuther GPT model that I have that was trained with 6 billion parameters on a Pile dataset, over 865 gigabytes. I’m going to fine-tune that using the Shakespearean festival because I want to query and prompt that particular model with some plain English language, and I want it to talk to me in the medieval language. Then we’re going to use the stack that I talked about. We’re going to use Ray for our orchestration. We’re going to use Ray Train. We’re going to use DeepSpeed, and we’re going to use the Hugging Face model. The outcome will be a pretrained model. Here’s a very typical training flow that you actually might have for your Hugging Face single model. You load the data. You create a particular model from a checkpoint. We provide training arguments on the checkpoint. You create a trainer and you train. This is how we actually change that into a trained flow for a distributed library. You just take that and create that training function, and you use pretty much the same code. Then you actually create the TorchTrainer and you provide the scaling config, and you do a, and you’re done with it.


I’m going to be going on the one that I’ve circled, which is green, but you have the ability to actually fine-tune a 2-series Llama model. You can also use any of the three frameworks that I have. What I did was I took this notebook that I have, I ran it about an hour. You’ve seen this particular model. Here, I’m going through this particular phasing. We’re going to do a quick batch inference. We’re going to do training. We’re going to do the data processing. We’re going to set up the model first. By setting up the model, I’m just going to import my stuff that I actually have. I’m going to provide what my model name over here is. I’m just going to go ahead and initialize that with some of the environments that I need. When I run my ray.init, all I’m telling over here is go ahead and create my cluster. It’s telling me I’m using Python 3.1. This is the nightly builds that I’m using. This is where I can actually use that. Don’t worry about this particular code. The important bit about here is I’m reading the load datasets from the import. This is my Hugging Face data. Notice that there’s only one row of data of that 400,000, which is really a bad thing. You have to preprocess data by splitting them into lines. Preprocessing dataset, see how I’m using ray. from Hugging Face, and then using the Ray datasets. I’m using the transformers to actually split that, so I actually have 40,000 lines of code, and then I’m going to use a tokenize to tokenize it. Here are my batch processers, use a split text, convert them into pandas, use the tokenizer. I’m going to fine-tune it. When I do a fine-tuning, all the Ray Data that I have, the 40,000 lines of code will be broken into blocks and given to each and every worker. This is the gist of the code.

Remember I talked about all you do is write the training code. Here’s my training code that I’m writing that’s going to be run on each and every worker. My batch size. Here are my hyperparameter tuning parameters. Here are my DeepSpeed that I actually want. Here’s my zero-set optimization, so use this particular configuration that I have to move data between CPU and GPUs. All that is actually done and initialized over here. I’m using my training arguments. This is the Hugging Face training arguments, these args. I’m providing all the necessary arguments that I need. Then I’m going to get my compute metrics, and I’m going to return the trainer. Then when I return the trainer, I’m going to go ahead and use my transformer trainer. You saw how we used the transformer trainer. You provide the training function to it, provide the scaling config to it. I want you to use the GPUs. Here are my resources. Here’s my training data. Here’s my validation data, and go ahead and train it. You start training. It took about seven-and-a-half minutes to run. I used about 385 CPUs out of 560, and all 32 GPUs to actually train that in a very distributed manner. I ran through five iterations: one epoch has five steps. Normally you read more than that, but for the illustration, I just wanted to do that quickly. I ran that. Finally, I got my model trained after about seven-and-a-half minutes. Each and every Ray Trainer creates a checkpoint. A checkpoint is a way for you to actually now take the state of the last checkpoint that you actually did after the model, and then use that model for prediction. I’m going to get the best checkpoint because I gave all these different hyperparameter tuning, I’ll get the best checkpoint that I have. You can see the checkpoint now is actually sitting on my disk. I’m going to read that particular checkpoint and materialize. Here is the DataFrame I’m going to use as my batch inference. Romeo and Juliet, “Romeo, my love, how can I show devotion to you,” and so on and so forth. That’s going to convert that into Shakespearean. I’m taking English and I’m converting that into Shakespearean, because I fine-tune, take my prompt and rewrite in Shakespearean. I do the predict. Let’s see if this works. Brilliantly did.

What I did essentially over here was use the emerging stack. I used Ray Train and Ray Hugging Face transformers to be able to actually do that. I used DeepSpeed, and all the libraries that actually come that allow me to distribute over that. I created training functions that say, here’s my training function, and run it across my 32 workers. Here are all the parameters that I want you to actually use. Then go ahead and create my checkpoint. It ran for about 7 minutes across 32 GPUs, across all the CPUs that I have, because I have 32 workers on EC2, took about 7 minutes. It’s a small dataset, but you can actually use the same strategy for fine-tuning any of the other models. The thing that I actually gave you, you can actually do that. Then I did the prediction, and I got my prediction in the Shakespearean.


I gave you an outline and explored some of the challenges. I offered an opinionated way how we can actually do that. I provided you insight and intuition how we can actually use Ray stack. Demonstrated the model stack that you can actually use for tuning your models.

Using Model Parallelism for Model Sharding (Inference)

Damji: We talked about, you can use model parallelism to shard the particular model.

When you actually do the inference, what happens is that when you load the model, if the model is too big, it’s going to shard it across inference. Each data will be used to actually do the sharding. It will actually collect the answers for each and then return you that. That’s how DeepSpeed will actually use to shard the particular model.

Why Ray Data is not an ETL Replacement

Damji: Why is Ray Data not a replacement for ETL?

If you look at Spark, Spark is actually based on the whole idea of a DataFrame. The central concept of Spark is you actually have a DataFrame, you have it distributed, you have a DSL, you have a domain specific language that you actually use. It goes through creating a pretty good graph for you, and converting that into Spark tasks. There is a DSL associated with that. It’s very much catered towards doing large data processing for ETL. If I was doing large data processing ETL, I wouldn’t even consider Ray, because it’s not meant for it. This is not to say that you can’t use, that you can’t build or you can’t construct your own DCL on top of that using Ray tasks and actor. Remember what I told you was that, it is a general-purpose distributed framework. It’s not tied to a schema. It’s not tied to a DataFrame. It gives the primitives to actually write that. Think of it as a Unix file system, or Unix-C interface that has an interface to all the different things, and you can actually use those primitives to build high-level applications on top of that. Similar analogy. Spark is good for ETL. It’s good for doing streaming. It’s good for SQL. Ray is different. It is giving you few map join, group by, last mile injection, in order for you to actually send that to the training. You’re not using Ray for massive ETL. If you haven’t done that, don’t use Ray, use whether it’s Dask or whether it’s Spark. I would go for Spark because I am a fan of Spark. Yes, I would use Spark at scale. Then once your data is actually ready, and you’re just doing a last-minute tweak, you now give it to Ray to be able to give it to the trainers. There is a big difference, both extremely complementary, they’re not replacing one or the other, they work very well together.

See more presentations with transcripts

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.