Presentation: Streaming-first Infrastructure for Real-time ML

Chip Huyen

Article originally posted on InfoQ.

Transcript

Huyen: My name is Chip. I’m here to talk about a topic that I’m extremely excited about, at least excited enough to start a company on it, which is streaming-first infrastructure for real-time machine learning.

I come from a writing background. I have actually published a few very non-technical books. Then I went to college and fell in love with engineering, especially machine learning and AI, and have been very fortunate to have had a chance to work with some really great organizations including Netflix, NVIDIA, and Snorkel AI. Currently, I’m the co-founder of a stealth startup.

Two Levels of Real-Time ML

Because my talk is on real-time machine learning, if you talk to software engineers, they will probably tell you that there's no such thing as real time: no matter how fast the processing is, there's always some delay. It can be very small, like milliseconds, but it is still a delay. Real time here encompasses near real time. There are two levels of real-time machine learning that I want to cover. The first is online prediction, which is when the model can receive a request and make predictions as soon as the request arrives. The other level is online learning, but if you enter online learning into Google, you will get a bunch of information on online courses like Coursera and Udemy, so people are using the term continual learning instead. Continual learning is when machine learning models are capable of continually adapting to changing data distributions in production.

First, online prediction. Online prediction is usually pretty straightforward to deploy. If you have developed a wonderful machine learning model, like object detection, the easiest way to deploy it is probably to export it into some format like ONNX, then upload it to a platform like AWS and get back an online prediction endpoint. If you send data to that prediction endpoint, you get back predictions for that data. Online prediction is when the model waits to receive a request and then generates predictions for that request. The problem with online prediction is latency. We know that latency is extremely important: there has been a lot of research showing that no matter how good the service or the models are, if it takes even milliseconds too long to return results, people will click on something else. The problem with machine learning is that in the last decade, models have been getting bigger. Bigger models usually give better accuracy, but they also generally take longer to produce predictions, and users don't want to wait.
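To make the flow concrete, here is a minimal sketch of the serving side of online prediction, assuming a model has already been exported to a file called model.onnx; the file name, input name, and dtype are illustrative assumptions, not from the talk:

```python
# A minimal sketch of online prediction with an exported ONNX model.
# Assumes "model.onnx" already exists; names and shapes are illustrative.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name

def predict(features: np.ndarray) -> np.ndarray:
    # The request arrives, we run inference immediately, and return the result.
    return session.run(None, {input_name: features.astype(np.float32)})[0]
```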

Online Prediction: Solution

How do you make online prediction work? You actually need two components. The first is a model that is capable of fast inference. There has been a lot of work done around this. One solution is model compression, using quantization and distillation. You can also do inference optimization, like what TensorRT is doing. Or you can build more powerful hardware, because more powerful hardware allows models to do computation faster. This is not the focus of my talk.
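As a quick illustration of the compression direction (not covered further in the talk), here is a hedged sketch of dynamic quantization in PyTorch; the tiny model is a made-up placeholder:

```python
# A small sketch of one compression technique (dynamic quantization) in PyTorch.
# The architecture is a placeholder; real models would be much larger.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

# Quantize the Linear layers' weights to int8 to shrink the model and speed up inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```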

Real-Time Pipeline: Ride-sharing Example

My talk is going to focus on the real-time pipeline. What does that mean? A pipeline that can process data, input the data into the model, generate predictions, and return predictions to users in real time. To illustrate the real-time pipeline, imagine you're building a fraud detection model for a ride-sharing service like Uber or Lyft. To detect whether a transaction is fraudulent, you want information about that transaction specifically. You also want to know about the user's recent transactions, to see [inaudible 00:04:18] recently. You also want to look into that credit card's recent transactions, because when a credit card is stolen, the thief wants to make the most out of it by using it for multiple transactions at the same time to maximize profit. You also want to look into recent in-app fraud, because there might be a trend regarding locations, and maybe this specific transaction is related to those other fraudulent transactions in those locations. A lot of this is recent information, and the question is, how do you quickly access these recent features? You don't want to put them into your permanent storage and have to go into permanent storage to get them out, because that might take too long and users are impatient.

Real-Time Transport: Event-driven

The idea is, what if we leverage some in-memory storage? When we have an incoming event, like a user booking a trip, picking a location, canceling a trip, or contacting the driver, we put all that information into in-memory storage, and we keep it there for as long as those events are useful for real-time purposes. Maybe after seven days, you can either discard those events or move them to permanent storage, like S3. This in-memory storage is generally what is called real-time transport. Real-time transport doesn't have to be confined to in-memory storage; it can also count as real-time transport if it can leverage more permanent storage efficiently. Real-time transport tools are ones you probably know, like Kafka, Kinesis, or Pulsar. Because this is very much event-based, this kind of processing is called event-driven processing.
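As a rough sketch of what pushing events onto a real-time transport can look like, here is an example using the kafka-python client; the broker address, topic name, and event shape are all illustrative assumptions:

```python
# A hedged sketch of producing ride-sharing events to a Kafka topic.
# Broker address, topic name, and event fields are made up for illustration.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"user_id": 42, "type": "book_trip", "ts": "2021-11-01T10:00:00Z"}
producer.send("user-events", value=event)  # downstream consumers see this within milliseconds
producer.flush()
```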

Stream Processing: Event-Driven

First, I want to differentiate between static data and streaming data. Static data is data that has already been generated; you can access it through a file format like CSV or Parquet. Streaming data is data that you access through a real-time transport like Kafka or Kinesis. Because static data has already been generated, you know exactly how many examples there are, so static data is bounded. When you input a CSV file into the processing code and the job goes through every sample, you know that the job is done. Streaming data is continually being generated, so it's unbounded; you never know when the job is finished. Static data lets you access static features, features that don't change very often, like age, gender, or when the account was created. Streaming data lets you access information that is very recent and can change very quickly, for example a user's locations in the last 10 minutes, or what they have been reading in the last few minutes. Traditionally, static data is processed with a batch processing tool like MapReduce or Spark, whereas to process streaming data, people think you need a stream processing tool like Flink, Samza, or Spark Streaming.
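To make the contrast concrete, here is a hedged sketch computing a similar count feature from both kinds of data; the file name, topic, broker, and field names are invented for illustration:

```python
# Batch: a bounded CSV file. Streaming: an unbounded Kafka topic.
# All names (file, topic, fields) are illustrative assumptions.
import json
import pandas as pd
from kafka import KafkaConsumer

# Static data: bounded, so the job finishes once the file is exhausted.
df = pd.read_csv("transactions.csv")
trips_per_user = df.groupby("user_id").size()

# Streaming data: unbounded, so this loop never "finishes".
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
live_counts = {}
for message in consumer:
    user_id = message.value["user_id"]
    live_counts[user_id] = live_counts.get(user_id, 0) + 1  # feature updates as events arrive
```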

One Model, Two Pipelines

There's a problem with this separation of batch processing and stream processing: now we have two different pipelines for the model. During training, you have a lot of static data, and you use batch processing to generate the features. During inference, if you do online prediction, you work with streaming data, so you have to use stream processing to extract the features. Now you have a mismatch. It's actually one of the most common sources of errors I see in production: a change in one pipeline fails to replicate in the other. I personally have encountered that a few times. One time we had models that performed really well during development.

Then when we deployed the models, the performance was really poor, and we had to look into it. We took the same piece of data, ran it through the prediction function in the training pipeline and then the prediction function in the inference pipeline, and we actually got different results. We realized there was a mismatch between our two pipelines.
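A cheap guard against this kind of mismatch is a parity test: run the same raw record through both featurizers and assert they agree. The sketch below uses hypothetical stand-ins for the two pipelines' feature functions:

```python
# A minimal train/serve parity check; both featurizers are hypothetical stand-ins.
def featurize_for_training(record: dict) -> list:
    return [record["amount"], record["num_recent_trips"]]

def featurize_for_serving(record: dict) -> list:
    return [record["amount"], record["num_recent_trips"]]

sample = {"amount": 23.5, "num_recent_trips": 4}
assert featurize_for_training(sample) == featurize_for_serving(sample), \
    "training/serving feature mismatch"
```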

Stream and Batch Processing

There's been a lot of work done on unifying batch and stream processing. One very interesting approach is to leverage streaming-first infrastructure to unify both of them. The reasoning, which folks at Flink especially have been pushing, is that batch is a special case of streaming, because a bounded dataset is actually a special case of an unbounded data stream. If your system can deal with an unbounded data stream, you can make it work with a bounded dataset, but if your system can only deal with bounded datasets, it's very hard to make it work with an unbounded data stream. So people have been using things like Flink, a streaming-first approach, to unify both stream and batch processing.
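A toy way to see the "batch is a special case of streaming" argument in code: the same incremental processing function can consume a bounded list or an unbounded generator. This is purely illustrative, not Flink's API:

```python
# The same incremental logic handles bounded and unbounded inputs.
import itertools
import random
from typing import Iterable, Iterator

def running_average(values: Iterable[float]) -> Iterator[float]:
    total, n = 0.0, 0
    for v in values:
        total, n = total + v, n + 1
        yield total / n

bounded = [1.0, 2.0, 3.0]                                 # "batch": finite, the loop ends
unbounded = (random.random() for _ in itertools.count())  # "stream": never ends

print(list(running_average(bounded)))     # works on the bounded case
print(next(running_average(unbounded)))   # and on the unbounded one, element by element
```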

Request-Driven to Event-Driven Architecture

Since we've talked about event-driven processing, I want to talk about a related concept, event-driven architecture, as opposed to request-driven architecture, and I want to talk about it in terms of microservices. In the last decade, the rise of microservices has been very tightly coupled with the rise of the REST API. REST APIs are request driven, which means you usually have the concept of a client and a server. The client sends a request, like a POST or GET request, to the server and gets back a response. This is synchronous: the server has to be listening for the request to register it. If the server is down, the client will keep resending new requests until it gets a response, or until it times out.

Things look pretty great, but then problems arise when you have a lot of microservices, namely inter-service communication, because different services have to send requests to each other and get information from each other. In the example here we have only three microservices, and we already see many arrows with information going back and forth. If we have hundreds or thousands of microservices, it can be extremely complex and slow. So that's the first problem: complex inter-service communication.

Another problem is how to map data transformation through the entire system. We have talked about how difficult it is to understand machine learning models in production. If you want to know how one service changes data, you have to ping that service's team to get that information. You don't have a full view of how data flows through the system, so it can be very hard for monitoring and observability.

Instead of having request-driven communication, what if we use an event-driven architecture? That means that instead of services communicating directly with each other, you have a central location, a stream. Whenever a service wants to publish something, it pushes that information onto the stream, and the other services keep listening in. If they find that a message is relevant to them, they can take it, produce some result, and send that back to the stream, and the other services keep listening. It's possible for all services to publish to the same stream, and all services can also subscribe to the stream to get the information they need. You can divide the stream into different topics, so it's easier to find the information relevant to each service. This is event-driven architecture. First, it reduces the need for inter-service communication. Second, because all the data transformation is now in the stream, you can just query the stream and understand how a piece of data is transformed by different services through the entire system. That's a really nice property for monitoring.
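As a hedged sketch of that pattern, here is what one such service might look like with the kafka-python client: it subscribes to a shared topic, reacts only to the events it cares about, and publishes its result back to another topic. The topic names, event fields, and the scoring logic are all invented for illustration:

```python
# A toy event-driven service: consume from one topic, publish results to another.
# Topic names, fields, and the scoring logic are placeholders.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "ride-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    if event.get("type") != "trip_booked":
        continue  # this service only cares about bookings
    result = {"trip_id": event["trip_id"], "fraud_score": 0.02}  # placeholder scoring
    producer.send("fraud-scores", value=result)  # other services subscribe to this topic
```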

Model’s Performance Degrades in Production

We've talked about online prediction. The next thing I want to talk about is continual learning. It's no secret that model performance degrades in production. There are many different reasons, but one key reason is data distribution shift: things change in the real world. The changes can be sudden, for example, COVID. I saw recently that Zillow actually closed their house-flipping business because they failed to forecast house prices; the changes with COVID made their models extremely confused, they lost a bunch of money, and they closed the service. The changes can be cyclic or seasonal, for example, ride-sharing demand is probably different on the weekend versus a weekday, or during rainy seasons versus dry seasons. They can also be gradual, just because things change over time, like the way people talk and the emojis they use slowly change over time.

From Monitoring to Continual Learning

Monitoring is a huge segment of the market. Monitoring helps you detect changing data distributions, but it's a very shallow solution, because once you detect the changes, then what? What you really want is continual learning: you want to continually adapt models to changing data distributions. When people hear continual learning, they think about the case where you have to update the model with every incoming sample. Actually, very few companies require that, for several reasons. One is catastrophic forgetting. Another is that it can get unnecessarily expensive: a lot of hardware backends today are built to process a lot of data at the same time, so if you use them to process one sample at a time, it can be very wasteful. What people usually do is update models with micro-batches: they wait to collect maybe 500 or 1,000 samples and then make an update with that micro-batch.
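Here is a minimal sketch of that micro-batch update loop, using scikit-learn's SGDClassifier.partial_fit as the incremental learner; the buffer size and the way labeled samples arrive are assumptions for illustration:

```python
# Buffer incoming labeled samples and update the model once a micro-batch accumulates.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
model.partial_fit(np.zeros((2, 3)), [0, 1], classes=[0, 1])  # initialize with known classes

MICRO_BATCH = 512
buffer_X, buffer_y = [], []

def on_labeled_sample(features, label):
    buffer_X.append(features)
    buffer_y.append(label)
    if len(buffer_X) >= MICRO_BATCH:
        model.partial_fit(np.array(buffer_X), np.array(buffer_y))  # incremental update
        buffer_X.clear()
        buffer_y.clear()
```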

Learning Schedule != Evaluating Schedule

There's also a difference between the learning schedule and the evaluating schedule. You make an update to the model, but you don't deploy that update right away, because you worry that the update might be really bad and mess up your service. You don't want to deploy the update until you have evaluated it. You don't actually update the model in place: you create a replica of that model and update the replica, which becomes a candidate model, and you only deploy that candidate model after it has been evaluated. I could talk at length about how to do evaluation. You have to do offline evaluation first: you need some static test set to ensure that the model isn't doing something crazy. You also need to do online evaluation, because the whole point of continual learning is to adapt a model to changing distributions, so it doesn't make sense to test it only on a stationary test set. The only way to be sure that the model is going to work is to do online evaluation. There are a lot of ways for you to do it safely, through A/B testing, canary analysis, and bandits. I'm especially excited about bandits because they allow you to test multiple models: you frame it the same way you would frame the multi-armed bandit problem. You treat each model as an arm, so if you have multiple candidate models, each of them is an arm, and you don't know the real reward for each model until you pull that arm.
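Here is a rough sketch of that framing, using a simple epsilon-greedy strategy (one of many bandit algorithms) to pick which candidate model serves each request; the models and the reward signal, for example click or no click, are placeholders:

```python
# Candidate models as bandit arms, epsilon-greedy selection; purely illustrative.
import random

class ModelBandit:
    def __init__(self, models, epsilon=0.1):
        self.models = models
        self.epsilon = epsilon
        self.counts = [0] * len(models)
        self.rewards = [0.0] * len(models)

    def pick_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.models))  # explore a random model
        averages = [r / c if c else 0.0 for r, c in zip(self.rewards, self.counts)]
        return max(range(len(self.models)), key=lambda i: averages[i])  # exploit the best so far

    def record(self, arm, reward):
        # reward could be 1.0 for a click and 0.0 for no click
        self.counts[arm] += 1
        self.rewards[arm] += reward
```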

Iteration Cycle

For continual learning, the iteration cycle can be done on the order of minutes. Here's an example from Weibo: their iteration cycle is around 10 minutes. You can see similar examples with Alibaba's Singles' Day, TikTok, and Shein. A lot of Chinese companies do this, and I'm always blown away by the speed. Then when I talk to American companies, they don't even talk in terms of minutes, they talk in terms of days. There's a recent study by Algorithmia, and they found that around 64% of companies take a month or longer.

I think about it like this: there's no reason there should be a hard split between the batch learning paradigm and the continual learning paradigm. Those are lessons I learned from the evolution of computing infrastructure. Before, people would tell us, that's crazy, you can spread the workload onto a few machines, but you can't spread it to 1,000 machines. That's what people thought. Then a lot of infrastructure work was done in the cloud to make it extremely easy to spread the same logic to one machine or to 1,000 machines. I think the same thing is happening in machine learning. If you can update the model every month, there's no reason why you can't update the model every 5 minutes; the retraining frequency is just a knob to turn. I do think the infrastructure being built right now is going to make it extremely easy for a company to update the model whenever they want.

Continual Learning: Use Cases

There are a lot of great use cases for continual learning. Of course, it allows a model to adapt to rare events very quickly. Take Black Friday: Black Friday happens only once a year in the U.S., so there's no way you can have enough historical information to predict accurately what users are going to do on this Black Friday. For the best performance, you would want to continually train the model on data from that same day to boost performance. That is actually one of the use cases Alibaba is using continual learning for. They acquired Ververica, the company that maintains Flink, for $100 million, to adapt Flink to machine learning use cases. They use this for Singles' Day, which is a shopping holiday similar to Black Friday in the U.S.

It also helps you overcome the continuous cold start problem. Continuous cold start is when you have new users, or users get a new device, or you have a lot of users who aren't logged in, so you don't have enough historical information to make predictions for them. If you can update the model within sessions, then you can actually overcome the continuous cold start problem, because in-session you can learn what users want even though you don't have historical data, and you can make relevant predictions for them in-session. It can be extremely powerful. Imagine a new user coming into your service: if that user doesn't find anything relevant to them, they're going to leave, but if they find things relevant to them, because you can recommend those things, then they might stay. As an example, TikTok is so addictive because they are able to use continual learning to adapt to users' preferences in-session.

What Is Continual Learning Good For?

Continual learning is especially good for tasks with natural labels, for example, recommendation systems. You show users recommendations, and if they click, it was a good prediction; if after a certain period of time there's no click, then it was a bad prediction. That's a short feedback loop, on the order of minutes. For online content like a short video, a TikTok video, a Reddit post, or a tweet, the time from when the recommendation is shown to when the user clicks on it is pretty fast. But not all recommendation systems have short feedback loops. For example, if you work for Stitch Fix and you want to recommend items that users might want, you have to wait for the items to be shipped and for users to try them on before you know, so it can take weeks.

Quantify the Value of Data Freshness

Continual learning sounds great, but is it right for you? First, you have to quantify the value of data freshness. People keep saying that fresh data is better, but how much better? One thing you can do is try to measure how much the model's performance changes if you switch from retraining monthly to weekly, to daily, or even hourly. Back in 2014, Facebook did a study like this, and they found that going from training weekly to daily increased their click-through rate by 1%, which was significant enough for them to change the pipeline to daily. You also want to know how the retention rate would change if you could do in-session adaptation: for the case of new users coming in, if you can show relevant information to them, how much longer are they going to stay? You also want to understand the value of model iteration versus data iteration. Model iteration is when you make a significant change to the model architecture; data iteration is when you train the same model on newer data. In theory, you can do both. In practice, the more resources you spend on one, the fewer you have to spend on the other. I've seen a lot of companies find that data iteration actually gives them a much higher return than model iteration.
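One hedged way to run that kind of measurement is a replay backtest: retrain on the same historical data at different cadences and compare the evaluation metric. The train, evaluate, and data-loading pieces below are stand-ins for your own pipeline:

```python
# Replay history and compare model quality under different retraining cadences.
# `data_by_day` yields (train_df, eval_df) pairs; train/evaluate are your own functions.
def backtest(data_by_day, train, evaluate, retrain_every_days):
    scores, model = [], None
    for day, (train_df, eval_df) in enumerate(data_by_day):
        if model is None or day % retrain_every_days == 0:
            model = train(train_df)              # retrain on the chosen cadence
        scores.append(evaluate(model, eval_df))  # always evaluate on the next slice
    return sum(scores) / len(scores)

# e.g. compare monthly vs. weekly vs. daily retraining on the same replay:
# for cadence in (30, 7, 1):
#     print(cadence, backtest(history, train, evaluate, cadence))
```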

Quantify the Value of Fast Iteration

You also want to quantify the value of fast iteration: if you can run experiments very quickly and get feedback from them quickly, how many more experiments can you run? The more experiments you can run, the more likely you are to find ones that work better for you and give you a better return.

Quantify Cloud Bill Savings

One problem a lot of people worry about is the cloud cost. Model training costs money, and you might think that the more often you train the model, the more expensive it's going to be. That's actually not the case, and this is something very interesting about continual learning. In batch learning, when you wait longer between retrainings, you have to retrain the model from scratch on a lot of data. In continual learning, you train the model more frequently, so you don't have to retrain from scratch; you can continue training the model on only the fresh data. That actually means it requires less data and less compute. There's a really great study from Grubhub: when they switched from monthly training to daily training, it gave them a 45 times savings on training compute cost, and at the same time a 25% increase in the metric they use, purchase-through rate.
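The cost difference comes from warm-starting: instead of retraining from scratch on months of history, you load the previous checkpoint and continue training on only the fresh slice. Below is a hedged PyTorch sketch; the architecture, file paths, and the stand-in fresh data are all invented for illustration:

```python
# Continue training yesterday's checkpoint on only the newest data.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 2))               # placeholder architecture
model.load_state_dict(torch.load("model_prev.pt"))    # start from the previous model's weights
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

fresh_X = torch.randn(1024, 16)                       # stand-in for the last day's data
fresh_y = torch.randint(0, 2, (1024,))

for i in range(0, len(fresh_X), 64):                  # far less data than a from-scratch run
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(fresh_X[i:i+64]), fresh_y[i:i+64])
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), "model_candidate.pt")  # evaluate this candidate before deploying
```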

Barriers to Streaming-first Infrastructure

Streaming-first infrastructure sounds really great: you can use it for online prediction, and you can use it for continual learning. But it's not really that easy, and a lot of companies haven't switched to streaming-first infrastructure yet. One reason is that they don't see the benefits of streaming, maybe because their systems are not at a scale where inter-service communication has become a problem. Another reason is that they don't see the benefit because they have never tried it before, so it's a chicken-and-egg problem: they need to see the benefit to deploy it, but they need to deploy it first to see the benefit. Also, there's a high initial infrastructure investment. When we talk about streaming-first infrastructure, a lot of companies think you need specialized knowledge for it. That may have been true in the past, but the good news is that there are now so many tools being built to make it extremely easy for companies to switch to streaming-first infrastructure. For example, Snowflake now has streaming. Confluent is a $16 billion company. You have services like Materialize, which has raised $60 million on top of a previous $40 million, $100 million in total, to help companies adapt their infrastructure to streaming-first easily.

Bet on the Future

I think it's important to make a bet on the future. For some companies, you can make a lot of incremental updates, but it may be cheaper to make one big jump to streaming instead of doing 1,000 incremental updates. Other companies are moving to streaming-first because their metric gains have plateaued, and they know that for big metric wins they will need to try out new technology; that's why they invest in streaming. We are building a platform to make it easier for companies to do machine learning leveraging streaming-first infrastructure.

Questions and Answers

Lazzeri: How does reproducibility change with continual learning?

Huyen: Maybe we should look at where the difficulty in reproducibility comes from. I think one of the big reasons it's so hard to reproduce a model in production is the separation of training and serving. ML production has different steps: you develop a model, and then you deploy it, and that causes a lot of problems. I see, for example, somebody using the wrong binaries, or using an older version of a model instead of the current version, or changes in the featurization code so that training and serving actually use different features. One thing about continual learning is that you can design the infrastructure so that training and serving use very similar infrastructure, for example, in terms of features. One trick you can do is that when you make predictions on a request, you extract features from that request to make the prediction, and then use those same extracted features to train the next generation of the model. Now you can guarantee that the features you use in training and serving are the same, so it helps with reproducibility. Also, you need to be able to track the lineage of models, because you have a model and then you train the next iteration of it. You have to have some model store where you can check how different models evolve and know which model was inherited from which model. I think that's definitely going to help with reproducibility.
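A small sketch of that trick: log the exact features used at prediction time and reuse them as training data for the next model version, so training and serving cannot diverge. The model interface, the featurizer, and the file-based log below are assumptions for illustration:

```python
# Log predict-time features so the next training run reuses exactly the same features.
import json
import time

def predict_and_log(model, raw_request, featurize, log_path="feature_log.jsonl"):
    features = featurize(raw_request)        # the one featurizer shared by training and serving
    prediction = model.predict([features])[0]
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "features": features,
            "prediction": float(prediction),  # join with the eventual label to form training data
        }) + "\n")
    return prediction
```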

Lazzeri: There is another question about model training and performance. If the model is constantly training, isn't there a risk that the model starts performing worse? Does this require more monitoring?

Huyen: This is one thing that I actually work a lot on. When you continually train the model, the model can learn very quickly, and it can fail very quickly. I'm not sure if you remember the example of the Microsoft chatbot, Tay. Tay was supposed to learn from users' interactions, and within 24 hours it became extremely racist and sexist, just because it got a lot of trolls. When you put a model online for continual learning, it's more susceptible to attacks, more susceptible to people injecting bad data or maliciously making the model learn really bad things. Evaluation is extremely important in continual learning. When you think about it, updating the model is the easy part: anyone can write a function to update the model with new data. The hard part is how to ensure that the performance of this model is going to be okay. The static test set is not going to be current, so you're going to have to have some system to test the model in production. Testing in production is not a new concept; there are a lot of really good engineering practices for testing model performance online. First of all, canary analysis: you can start by deploying to 1% of users, then 2% of users, and only when it's doing well do you roll it out to more. There's A/B testing, and there are bandits. So yes, definitely, model evaluation is going to be huge. I think companies with good evaluation techniques will definitely have a competitive edge.
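A toy sketch of that canary routing: send a small, configurable fraction of requests to the candidate model and the rest to the current one, then widen the fraction only if the candidate holds up. The models themselves are placeholders here:

```python
# Route a small fraction of traffic to the candidate (canary), the rest to the current model.
import random

def route(request_features, current_model, candidate_model, canary_fraction=0.01):
    model = candidate_model if random.random() < canary_fraction else current_model
    return model.predict([request_features])[0]
```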

Lazzeri: Can you tell us more how the short iteration cycles in minutes are achieved? Is this just a factor of the huge amount of traffic that the listed companies have?

Huyen: There are two different things. One is whether you need faster iteration cycles: if you don't have a lot of traffic and the data doesn't change that much, then there's no point in updating the model every five minutes. The second is, if you do have a lot of traffic, how do you achieve short iteration cycles? At that point it's an infrastructure problem. You need to set up the infrastructure in a way that allows you to access fresh data. If machine learning engineers can only access new data after a day or a week, then there's no way they can update the model every five minutes. You need some sort of infrastructure that allows you to access fresh data very quickly. Then you need a very efficient processing scheme, so you can process all the fresh data quickly and feed it into the model. A lot of streaming infrastructure allows that to happen, so that is purely an infrastructure problem.

Lazzeri: I do have another question on the cloud cost aspect. I really appreciated you talking about this, because I think there should be more transparency and visibility for customers in the industry into what the cloud cost actually is. You're saying that, for that company, going from monthly to daily training actually saved them a lot of cost. From an architectural point of view, do you think this is the same independent of the provider you're using? Have you also seen other companies go through a similar experience?

Huyen: I think it can be independent of the provider. The question is, how much compute do you need to retrain? Previously, if you trained the model from scratch on data from the last three months, you needed maybe a lot more epochs, or just a lot more data, so you needed more compute. Whereas if you just use data from the last day, you can do it very quickly. The thing about software is that it doesn't scale linearly: sometimes 100 times the scale is not 100 times more expensive, but many more times more expensive. When you can reduce the scale of the training compute requirement, you can actually reduce your training cost. It's not perfect; it's not true for every case. I would really love to see more research on this, on how the retraining frequency affects the training data requirement, because I haven't seen any research in that area.

Lazzeri: I haven’t either, so that’s why when I saw your slide I was like, this is such an important topic that we should investigate more. I agree with you, I would also love to see more research in the industry around that.

Huyen: I think this mostly has been anecdotal from the industry.

