Panelists: Chip Huyen, Shijing Fang, Vernon Germano
Article originally posted on InfoQ.
Transcript
Huyen: I’m Chip. I’m the founder of a startup that focuses on infrastructure for real-time machine learning. I teach a machine learning systems design course at Stanford, which is a course to help students prepare for how to run ML projects in the real world. I also run a Discord server on MLOps. I think I’ll be learning a lot from people there.
Fang: This is Shijing. I work at Microsoft as a data scientist. My day-to-day work depends on the project. Oftentimes I meet with my partners, stakeholders, and colleagues to discuss the projects. Then we take the problem and look into what data we have and how to build machine learning models or insights that feed back into the business questions and challenges we have. Of course, there is still a lot of coding, data quality work, and all kinds of data and machine learning challenges on a daily basis.
Germano: I’m Vernon. I am a Senior Manager for Machine Learning and Artificial Intelligence. I work for Zillow. I run a couple of different teams there that are tasked specifically with estimating property values, things like calculating estimated taxes on properties, essentially valuations on real estate nationwide. Prior to that, I worked for Amazon Prime Video and did a completely different kind of ML work. I’ve been in the industry for many years and have been working in ML for quite a few of those.
Why ML Systems Fail in Production
Greco: As three experts, you’ve seen the future, so to speak, for the people in our audience who are now diving into MLOps. We talked about this with Francesca: a lot of ML projects have failed. What would you say are the top one or two reasons why ML systems fail in production?
Germano: I’ve seen these projects go off the rails. A lot of times you have applied science where your scientists spend a lot of time in research, trying to develop models and trying to meet certain performance standards, so everybody establishes a metric. They look at that. It can be precision, it can be recall on models, and they focus on the science of it. I think that’s really important. Where I see things go off the rails is sometimes you get to a place where you’ve got a model or several models performing to the standard that you’d like, but you have no idea how to actually scale and implement them. You’re confronted now with the engineering side of it. Engineering and science, while some of the tools are similar, need to split at some point, where you now have to consider everything necessary to make this stuff work in production. You’re looking at things like: we’ve got a great model, but how do we do online inference? How do we scale to the size of our audience? How do we make sure our customers are satisfied? These are all engineering-related questions that scientists don’t spend a lot of time thinking about. They think about the science behind what they’re trying to accomplish and the metric. They spend very little time on, and aren’t really expected to be the experts on, how you scale this stuff out. That’s where I’ve seen it fail: I’ve seen models on the shelf that are amazing, but when it came to getting them into production, there wasn’t the stomach for trying to implement them. It could be very expensive.
Greco: In other words, not as much focus on the engineering side.
Germano: Yes.
Fang: I would just add, on top of what Vernon said about the lack of an engineering view, that sometimes there is also a lack of a holistic view when you develop the model itself. We have seen a lot of great products, which you discussed with Francesca as well, where the product, or the machine learning model itself, looks beautiful in the POC stage. However, there is a lack of consideration of the business objective or the context it will be applied in, what the end goal is, what data we are eventually going to get, how to put it into production, into the engineering pipeline, and what environment we are dealing with. The model looks beautiful in the experimentation or POC stage, but when it gets to real-world practice, it fails in many places or stages.
Also, even when you get to the production stage, there is a lack of monitoring, of looking into changes in the environment or changes in the data. Even just a schema change, many times we didn’t realize it, and then it failed when no one was monitoring it. Even once it’s in production, it can still be drifting a couple of months later and people still don’t realize it’s all wrong. I think it’s this lack of communication and of a holistic view with different departments and different stakeholders, at any of these stages, that gets you to the failure stage.
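A minimal sketch of the kind of schema check being described here, assuming incoming batches arrive as pandas DataFrames; the column names, dtypes, and alerting hook are all hypothetical:

```python
import pandas as pd

# Columns the model was trained on, with their expected dtypes (hypothetical).
EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "region": "object",
    "monthly_usage": "float64",
}

def check_schema(batch: pd.DataFrame) -> list:
    """Return a list of human-readable schema problems for an incoming batch."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in batch.columns:
            problems.append(f"missing column: {col}")
        elif str(batch[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {batch[col].dtype}")
    extra = set(batch.columns) - set(EXPECTED_SCHEMA)
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")
    return problems

# In a real pipeline this would alert someone or block the run:
# problems = check_schema(todays_batch)
# if problems:
#     raise ValueError(problems)
```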
Greco: You’re thinking it’s more of a process problem.
Fang: It definitely has a process problem. It could also be a culture problem, or a lack of continuous business context and continuous engineering context as well.
Huyen: I’m sure you have seen a lot of news about the data science team at Zillow recently. We have great interest in learning from the post-mortem of what happened with using a machine learning model to estimate housing prices at Zillow. One thing I would really want to learn more about, not just about Zillow but in general, is that when my students ask about machine learning failures in production, I don’t think there’s any real anatomy or published study of them. A lot of people say that a lot of it is because of engineering problems, like problems with distributed data pipelines or feature engineering, but what percentage? How prevalent is that? How often does it happen?
I think Google is the only company I’ve seen that actually published an internal study, on their machine learning system failures over the last 10 years. They found that 60 out of 96 failures were actually not ML specific. A lot of them had to do with dependency failures, and a lot with data joining, like when you’re joining data from different sources and it doesn’t work. A lot of it has a distributed component: the bigger your system and the more distributed its components, the more likely it is to fail. I think having some understanding there could be very useful. I also think part of the problem is probably that we don’t have good enough tooling. If you have good tooling, then you can automate a lot of the process and reuse a lot of code, so there’s less surface area for bugs and we have fewer bugs. I’m very excited to see more good tooling around this space to reduce failures.
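As a small illustration of the data-joining failure mode mentioned above, a join between two sources can silently drop or duplicate rows; pandas’ `validate` and `indicator` options make that failure loud instead of silent. The tables and keys here are hypothetical:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "plan": ["free", "pro", "pro"]})
events = pd.DataFrame({"user_id": [1, 1, 4], "clicks": [3, 5, 2]})

joined = users.merge(
    events,
    on="user_id",
    how="left",
    validate="one_to_many",  # raises MergeError if user_id is not unique in `users`
    indicator=True,          # adds a _merge column: left_only / right_only / both
)

# Rows with no matching event data would otherwise slip through as NaNs.
unmatched = (joined["_merge"] != "both").sum()
print(f"{unmatched} of {len(joined)} joined rows had no matching event data")
```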
Greco: It’s interesting, the three different views on why things go wrong. I hear engineering, or not enough focus on engineering; there’s tooling; and there’s a process problem. For companies that are getting into this, it seems like there needs to be additional focus, not all the focus, of course, but additional focus on engineering, process, tooling, and maybe even the people themselves, in terms of education and training; that’s equally important.
How to Continuously Deliver Models
Right now when we create models and put them in production, we have almost a waterfall mentality: build a model, put it in production, and then do it all over again. It’s like a batch model. Do you foresee us moving into something more dynamic, like continual delivery of models? If so, how do we do that?
Huyen: I’m going to piggyback on what you just said, Frank, about the different views on process and tooling. I don’t think they are separate, like engineering versus process versus tooling. I think they are really the same thing. The key here is engineering, because you need a good engineering process and good tooling to have a better engineering experience. Focus more on engineering and less on tuning models.
Germano: I think at the root of it is that there is good engineering practice around continuous delivery, generally, in successful businesses. An infrastructure is set up for integration testing and automation around all of that, and placing things into production is predictable. In a machine learning environment, because you’re reliant on retraining and republishing models, you have to take a slightly different approach to get to a continuous integration environment. It’s not like these things are just off the shelf. I think a lot of companies are working towards that infrastructure, and I think it’s important that they do. When you take on the care and feeding of a model, and I appreciate the other panelists both bringing it up, it’s not a once-and-done thing; these are things that are continuing. You’re continually training your models. You’re continually revising the data. You’re continually looking at how you’re feeding that to those models.
One thing to consider is that if you’ve got human-in-the-loop operations, for instance, how do you reintegrate their data to improve the performance of your models? What automation are you going to create around that? How are you going to look at things like reinforcement learning? If you’ve got new sources of data coming in all the time, having that is a huge benefit, but doing it in an automated fashion requires an understanding of it and an investment in it. I don’t remember which of the panelists mentioned taking a holistic approach, but I couldn’t agree more. You literally have to look at not just delivering some model that’s performing beautifully right now, today, but also, how are you as a business going to own this model? How are you going to own these predictions? How are you going to continuously improve them? Honestly, I have not seen an off-the-shelf solution that does all of that, because it’s very complicated. What you’re taking on as a business, the care and feeding of that model, is either going to require that you put in a lot of manual effort, or you’re literally going to have to take some engineering time and set up an infrastructure that allows you to do this in a way that’s not going to cost you a lot.
At least in my organizations, those have always been the big challenges. Not the development of the models; a lot of times scientists can get to a really good solution rapidly. If you give it enough data, enough time, and enough energy, you’re going to come to a good solution. How do you make sure that solution isn’t just a point-in-time reference, and how do you build an entire infrastructure to make sure it’s continuous? I think you have to at least first acknowledge that you’re taking on that responsibility if you’re going to put this thing into production for the long term.
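A minimal sketch of that kind of continuous "care and feeding" loop, assuming a simple regression task; the data-loading, registry, and promotion helpers passed in are hypothetical placeholders for whatever your platform provides:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

def retrain_and_maybe_promote(load_training_window, load_eval_window,
                              load_production_model, promote):
    """Retrain on fresh data and promote only if the candidate beats production."""
    X_train, y_train = load_training_window()   # e.g. the last 90 days of data
    X_eval, y_eval = load_eval_window()          # the most recent, unseen week

    candidate = GradientBoostingRegressor().fit(X_train, y_train)
    production = load_production_model()

    cand_mae = mean_absolute_error(y_eval, candidate.predict(X_eval))
    prod_mae = mean_absolute_error(y_eval, production.predict(X_eval))

    # Promote only on a meaningful improvement; otherwise keep what works.
    if cand_mae < prod_mae * 0.98:
        promote(candidate)
    return {"candidate_mae": cand_mae, "production_mae": prod_mae}
```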
Fang: Just to add on top of that, using reinforcement learning as an example, I think there is more and more discussion in the industry around increasing applications of reinforcement learning. However, in my opinion, that also requires some disruptive engineering effort to have this real-time data distribution, collection, and feedback, which is a heavy investment in engineering and infrastructure changes. You have these really disruptive machine learning concepts and systems, but at the same time, if the engineering systems aren’t catching up, and the business hasn’t really realized the value of it, it’s really hard to follow that trend, even though it’s increasingly becoming a topic. So it’s not only about investing on the engineering side, but also the human side: do we prioritize this project versus the others? A lot of it comes back to that.
Heuristics for Machine Learning in Production
Greco: Do we foresee the learning aspect of a machine learning project happening during production? Obviously, we create these models, we have test data and everything works, and then we put them in production and things don’t work out. Are there any heuristics, besides just looking at accuracy, is it 95%, 94%? What heuristics can a company use to say we need to create another model, other than the model being FUBAR and starting all over again? Are there any things a company can do to tell it how often to update the model?
Germano: There are things that companies should be doing with any model that they’re placing in production and relying upon. One thing is monitoring the performance of your models and making sure that they’re maintaining their accuracy, because models drift. It’s hard. Again, if you don’t have an infrastructure for making sure that your models maintain their performant nature, then you’re going to have bad predictions, and you’re going to find yourself off the rails at some point. You don’t want your customers telling you it’s wrong; you want to know that ahead of time, or at least that you’ve got movement in a particular direction. Setting up appropriate metrics is super important. Setting up the monitoring to make sure that you are continuously meeting those performance standards is something you want to get out ahead of. It’s probably one of the most critical things you can do. Forget reinforcement learning or esoteric things like that; if you’re not monitoring your performance, then you’re just setting yourself up for failure, for sure.
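A hedged sketch of that kind of performance monitoring, assuming predictions are logged and the true outcomes arrive later; the baseline error, alert threshold, and alerting hook are all hypothetical:

```python
import numpy as np

BASELINE_MAE = 0.12   # error measured on the holdout set at deployment time
ALERT_RATIO = 1.25    # alert if live error exceeds the baseline by 25%

def check_model_health(y_true: np.ndarray, y_pred: np.ndarray) -> bool:
    """Return True if the model still looks healthy on this window of traffic."""
    live_mae = np.mean(np.abs(y_true - y_pred))
    healthy = live_mae <= BASELINE_MAE * ALERT_RATIO
    if not healthy:
        # In practice: page the on-call, open a ticket, or trigger retraining.
        print(f"ALERT: live MAE {live_mae:.3f} vs baseline {BASELINE_MAE:.3f}")
    return healthy
```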
Existing ML Toolsets and Their Efficacy
Greco: Speaking of toolsets, what tools should a company put in place before you put an ML system into production? Do you have any recommendations for toolsets? Other than monitoring of accuracy, are there other things you would recommend?
Huyen: I think a lot of the problems you’re talking about are very similar, and it seems like a lot of investors realize that as well, the problems with monitoring and continuous delivery. At the same time, there have been so many companies, like startups, trying to capture this. There was so much monitoring software in the last year; probably 20 of those companies raised a ton of money. If the problems are well known, and there are so many tools out there trying to solve them, why is it still a problem? If you look at something like AWS SageMaker, they’re trying to deliver that; Google Vertex is trying to deliver that. Why are those tools not good enough for your use case?
Fang: From my experience, it’s certainly not a lack of tools. We have a lot of open source toolsets. Big companies, Microsoft, AWS, Google, and others, keep releasing different features and tools as well. There are a few things I see from my team’s perspective. We leverage a lot of the Microsoft Azure stack. We do have the tools in place; however, the stack also keeps changing, whether because of security concerns, the next generation of a platform, or the data size. We changed from SQL in the past for data acquisition, then to the data lake, then to Spark, and so on. Our engineers and our data scientists also need to catch up with all of those skill sets. We also have an internal tool, we call it Kusto, which is currently publicly available as well. Individuals need to catch up and understand the roadmap, then plan the current projects we are working on to leverage the existing infrastructure, and figure out how to dogfood the next platform or system. Then, how do we leverage the existing Microsoft solutions, MLOps and all of this, as well as open source in Python and R, so that we can be part of a best-in-class system as well?
I think a lot of it is about all these systems and components talking together, as well as dealing with complex business scenarios. For example, two years ago we had a huge spike in usage because of COVID, so how do we rapidly respond to those changes? Then, reflecting on the model itself, do we discard that data or do we incorporate this additional data? Those are discussions we needed to take into consideration, so there are a lot of pieces together. I don’t think it’s a failure of the tooling; it’s all of these things end to end in terms of what is best for our system, and then how to adopt the best-in-class while also looking into the long-term solution.
Embedding Machine Learning in Tooling
Greco: It’s almost like an inevitability. We had software tools for engineers. The advent of the IDE for an engineer just accelerated software development. Are we going to see ML embedded into our tools to help the ML systems?
Germano: There’s already some of that. If you look at modern IDEs, you look at various tools that are available out there. I think Microsoft even has the ability to use some machine learning to evaluate code and tell you there are more performant ways to do what you’re trying to accomplish. I see machine learning being integrated completely into developer toolkits. The purpose of that is to make sure that we’re all working from the same baseline. I think that’s great. Why not have some of that help as an engineer, to help us anticipate things like performance problems, or not meeting our operational excellence standards, or something within our organization? Yes, I see that now, and I see that continuing. I think that working side by side with ML is something that all people in engineering are going to wind up doing.
Frankly, probably all people in our diverse workforce, working on all kinds of problems, are finding themselves working with machine learning help. I don’t think you can go into certain email systems anymore without them trying to anticipate what you’re going to type. I find that a little crazy, but it’s usually right. I give credit to the engineers and applied scientists working in that field. More and more, we’re going to see that across development tools and development infrastructure. I imagine when it comes to performance and sizing and things like that, AWS will continue to implement all kinds of machine learning to try to anticipate scale. You look at the ability of these systems in the cloud to scale up and scale down automatically based on anticipated use; it’s all being done with machine learning. Those things are there, and they’re going to continue.
I think just, on the previous point, it’s like buying a car. Some people need a car that goes fast. Some people need a car that can carry a big family. Some people need a truck. I don’t think there’s any problem with having lots of monitoring tools and lots of tools to pick from, but I think every organization needs to have a centralized approach to it so that not everybody just goes out and picks their favorite and you wind up with a garage full of cars that don’t actually satisfy your need.
Tips on Prioritizing the Likely Impact of Various Features Before They Go To Production
Greco: ML feature engineering can be expensive and time consuming. Do any of the panelists have tips on prioritizing the likely impact of various features before we put them into production?
Fang: This is definitely a huge investment. One thing our organization has been doing is creating a so-called feature bank, so that you can enable others. When we have, for example, a machine learning project, we identify the set of features that are relevant. Instead of just serving that particular project, we also put them into a centralized data store in that environment, so that they are documented and the pipeline is maintained. These feature banks can then be leveraged by other projects where they may be relevant. That is one of the ways we have started to improve scalability. There are some other things we do as well, such as putting the data pipeline into a centralized place. For example, sometimes we look at the customer lifecycle to determine what the inflection points are for the customer. Inflection points that apply to one scenario may apply to broader scenarios. We also convert those inflection points into metrics, so that they can be looked at in a standardized way and leveraged by other machine learning projects or business cases. That is one of the ways we address the scalability of feature engineering.
Germano: I love that approach. I see that approach as being highly successful in businesses where you’ve got multiple teams working on problems. If you think about it, an embedding that is useful for one particular model, one particular problem, may be useful for others, and you’ve already gone to the expense of generating that embedding. You’ve already gone out and established this feature. Having a centralized feature repository is one way to really speed up engineering work overall. You do not want to be doing duplicative work in this area, because it can cost a lot of extra money. If one team has solved the problem, it’s really awesome to have one place to go to build upon that.
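A toy, in-memory sketch of the feature bank idea described above; real feature stores (Feast, SageMaker Feature Store, and the like) add storage, versioning, and point-in-time correctness, and every name here is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureDefinition:
    name: str
    description: str
    owner: str
    compute: callable          # function that materializes the feature

@dataclass
class FeatureBank:
    _features: dict = field(default_factory=dict)

    def register(self, feature: FeatureDefinition) -> None:
        if feature.name in self._features:
            raise ValueError(f"{feature.name} already registered")
        self._features[feature.name] = feature

    def get(self, name: str) -> FeatureDefinition:
        return self._features[name]

bank = FeatureBank()
bank.register(FeatureDefinition(
    name="days_since_last_purchase",
    description="Customer lifecycle inflection signal",
    owner="growth-analytics",
    compute=lambda customer_id: 42,   # placeholder computation
))
# Another team reuses the same definition instead of rebuilding the pipeline:
feature = bank.get("days_since_last_purchase")
```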
Deploying Models from Notebooks
Greco: The role of Jupyter Notebooks. Jupyter Notebooks and the like are great during the research phase, but some organizations productionize notebooks right away, which makes it possible to deploy models from a notebook. Is that a good practice or is that not a good practice?
Germano: I think the question is, is it a good practice for your organization based on the scale of what you’re trying to accomplish? That could fail in a very large-scale world, potentially, because you’ve got infrastructure questions: what are you using to host that? Is it SageMaker? What are those costs? Is that the best place for it to live for your organization? Who are your customers? How are they spread out? How diverse are they? What is your tolerance for cloud services? I think it’s less of a machine learning question and more of an engineering question, about what you’re going to do to be performant. I’ve seen it work at some scale. You can use this especially in batch systems and things like that, where it’s just going to run something overnight; maybe you’re going to use that infrastructure to do some analysis or inference at night. The tradeoff is, if I’m going to have 100 million people hitting my website during the day, and they’re all going to be calling into this thing, what is that going to look like?
There’s nothing wrong with evaluating Jupyter Notebooks for a production environment in an organization where it makes sense. You’ve got to decide whether it makes sense for you or not, and whether you have a tolerance for hosting your stuff in that way. Then you’ve got to ask all the same questions: how do you make sure you’re doing versioning correctly? How are you testing that? What is your infrastructure for doing integration testing across your entire pipeline? Is making one change going to break 100 places? These are questions you have to ask yourself to see what your tolerance is.
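One common middle ground for the notebook question is to keep the notebook as the research artifact but run it as a parameterized, scheduled batch job rather than serving live traffic from it. This sketch uses papermill, which executes a notebook with injected parameters; the notebook names and parameters here are hypothetical:

```python
import papermill as pm

pm.execute_notebook(
    "nightly_inference.ipynb",                   # notebook with a cell tagged "parameters"
    "runs/nightly_inference_2024_01_15.ipynb",   # executed copy kept for auditing
    parameters={
        "scoring_date": "2024-01-15",
        "model_version": "v12",
    },
)
# For online, high-traffic inference you would instead export the model and
# serve it behind a proper service; the notebook stays in the research loop.
```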
Explainability and Interpretability
Greco: I did want to bring up the point about explainability and interpretability. We know that for various reasons, especially legal reasons, this is an important thing. For a company that’s starting out in deploying an ML production system, how do you ensure that? How do you ensure interpretability and explainability? What do you do?
Germano: It’s dependent upon what you’re trying to accomplish. If you’re using deep learning models, your explainability is going to be really tough; that’s just how those models are trained. If you get into a really complicated deep learning infrastructure, you literally have to let go, as a manager, as somebody evaluating output, you have to let go of a little bit of explainability. Trust that if it tells you that the cat is a cat, you can go and evaluate the performance of it and say, here’s the metric that we’ve established to say that it’s performant in identifying cats. If I had to explain to you how it determined that that’s a cat, I’d have to show you a dataset of 100,000 cats.
Explainability is important if you’re looking at things like linear regression models. Those are a little simpler. As you start to get into very complicated models, or models that build upon each other where you’ve got some really complicated learning process, it becomes a little more difficult. It becomes a matter of trust that the metrics you’ve established are the appropriate thresholds for evaluating the performance of the model. That’s my opinion. I know there are 100 million other opinions, because every time you talk to a scientist, or someone else, they’re going to give you a slightly different opinion on that.
Huyen: I think there are many different dimensions of explainability and interpretability. One is for developers to understand the model, which is what Vernon was talking about: if the model decided this is a cat, how did it arrive at that decision? Another is to ensure that you don’t have biases in the model. For example, if you have a resume screening model and one of the features it has picked up on is whether the person is of a certain race, then that’s definitely something you need to keep an eye out for. So there are different dimensions of interpretability, and it’s another insight to help you observe model performance. Earlier I was talking about how when you monitor a model, you monitor for performance decay, but there are so many different things that can cause that decay. Without some understanding of how a model arrived at certain predictions, it’s impossible to detect the causes of the performance decay.
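A hedged sketch of one simple, model-agnostic interpretability check along those lines: permutation importance shows which features a trained model actually leans on, which can surface a proxy for a protected attribute in something like a resume screener. The dataset and feature names here are synthetic and hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a resume screening dataset.
X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
feature_names = ["years_experience", "num_publications", "gap_in_employment",
                 "school_rank", "zip_code", "referral"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much held-out accuracy drops.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(zip(feature_names, result.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name:>20}: {score:.3f}")   # a large drop means the model relies on it
```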
Greco: It’s certainly an interesting thing from a legal point of view too, going forward. You don’t want to be dragged in front of the Senate saying, “The model is the model. That’s how it was trained.” Unless we have to educate our senators on how machine learning works.
Machine Learning at the Edge
There’s this new trend of computing at the edge, and for us machine learning people, machine learning at the edge. Are there any suggested architectures for doing ML at the edge? Is it just computing at the edge, except we’re applying the monitoring and more on top of that?
Huyen: For me, personally, I feel that machine learning on the edge is the Holy Grail. The more computation you can push to consumer devices, the less you have to pay in cloud bills. I’m not sure what the cloud bill complaints at your companies are, but every time I talk to a company, they say, “We need to reduce the cloud bill this year, and I don’t know how.” Some of that computation can be pushed to edge devices. But there are so many problems with deploying machine learning on the edge. There are hardware problems, whether the hardware is powerful enough to run the models. There’s the question of how to manage different models, because now instead of having one model on one server, you can have 200,000 different models on 200,000 different devices, so how do you monitor their performance? If you have an update, how do you push it out to all of them while maintaining the localization of the different devices? There are a lot of problems with edge devices.
Germano: I agree. I think the last point is, how do you maintain models at the edge? Especially considering some of these models need to be retrained quite often. If you’re going to suffer drift in one model running internally, imagine if you’ve got multiple versions of it out there on different devices. One compelling technology, I think, is the ability to run models inside these virtual spaces, like the browser itself, where the potential is that you’re still hosting one model; it’s just being made available to many users outside your organization. One, that potentially saves you on cloud services, but it can also really aid in performance. I think the performance aspect is just as important as the expense of cloud service operations.
If, for instance, I can push all those cycles out to the laptops that are accessing my website, then I have not only saved money, but I’ve also potentially given my user a much more interactive experience. We’re seeing the development, within browser infrastructures, of being able to push these things all the way to the laptop and let them take advantage of, for instance, whatever GPU hardware is actually out there. I feel like over the next few years that’s going to be a really interesting approach. I would imagine it’s going to be adopted by a lot of organizations that see that they can not just save money, but give that really nice, crisp experience.
Huyen: When you say performance in the browser, do you mean the latency aspect or accuracy performance?
Germano: I’m referring specifically to the ability to push your models out and have them run inside, basically, a virtual machine that’s running inside the browser. It’s not that you’re hosting that; the browser is actually hosting your model itself. As we see the development of those technologies, it’s just like in the past: before we had JavaScript running within a browser, we would have lost all that functionality, and now we push all of that out. We’re not actually doing that work internally; it’s happening on our users’ machines.
Same thing for models. Eventually, we’ll be able to see model infrastructure where analysis and inference are done with the model itself hosted remotely from us. We just happen to serve it up, and then it’s running somewhere else. That to me would be extremely helpful in areas, for instance at Zillow, where you do a home walk-through. It’s a video picture of a house, and now you’ve got to try to figure out the layout of that house, or you want to present a panoramic 360 view by stitching images together. If I can stitch those images together on a user’s machine instead of doing it myself, I’ve saved myself a tremendous amount of effort, and I’ve given them a much better experience.
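A minimal sketch of the export step behind that kind of in-browser serving: train in Python, export once to a portable format such as ONNX, and let a browser runtime (for example onnxruntime-web, or TensorFlow.js for TF models) fetch the file and run inference on the user’s machine. The model architecture, shapes, and file names here are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical scoring model trained elsewhere.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()

dummy_input = torch.randn(1, 128)       # example input that fixes the tensor shape
torch.onnx.export(
    model,
    dummy_input,
    "listing_score.onnx",               # artifact served as a static file
    input_names=["features"],
    output_names=["score"],
    dynamic_axes={"features": {0: "batch"}},  # allow variable batch size
)
# The browser then fetches the .onnx file and runs inference locally, so the
# inference cycles are spent on the user's hardware rather than your servers.
```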