Lola Priego and Jose del Pozo
Article originally posted on InfoQ.
Transcript
Priego: We are going to be talking about how to solve data quality issues when you are diagnosing health symptoms with AI, and with technology overall. I am Lola. I'm the founder and CEO of Base. We are a direct-to-consumer lab testing startup with an app for tracking and diagnosis. Before founding Base, I was a senior software engineer at big tech companies like Amazon, Facebook, and Instagram. I've done all kinds of work in the past, from recommender engines at Amazon, to performance engineering and mobile engineering at Facebook, as well as some data science work.
Del Pozo: I'm Jose. I'm a senior software engineer at Base. My background is in mobile engineering, but these days I'm working more as a backend engineer, really as a full-stack engineer. I'm the first engineer at Base.
Why?
Priego: What problem are we going to be solving or talking about? We wanted to preface with a question that a lot of consumers and people have, that is, why am I tired? Why can't I sleep? Why do I go on a diet and get digestive issues? We're going to be talking specifically about symptom diagnosis. When we talk about symptoms, we are referring to daily symptoms that are related to deficiencies coming from hormones, vitamins, and nutrients. The problem that we set out to solve is that, increasingly, people don't know what's going on with their health. They don't have the support that they need. It's really hard to solve recurrent problems like chronic fatigue, digestive issues, and so forth. There is a gap in the healthcare system today for people with chronic conditions to get better care. As we mentioned, there's no clear answer. Each person has their own goals, and health profiles are different, which makes things even harder, because there's no one-size-fits-all. There's not a diet, a supplement, or an over-the-counter medicine that you can take to fix it. It truly depends on your lifestyle, so it needs to be personalized. This is where Base can help. We believe that these symptoms and these health problems need a personalized approach. We are here to talk about how we're going to do that and how we're going to solve certain challenges from a data perspective.
Goal
Del Pozo: The goal of this talk is to find root causes based on user-reported data in order to improve their symptoms. We would like to find the root causes based on how the users feel. Then, once we've found those, we would like to improve their symptoms.
Outline
The agenda is to tackle these problems in order to build the system that you see there. We will have some inputs like the user symptoms, the lab data, and the science, because we are working with health. Then the system must be able to find user deficiencies in order to address the root causes. The problems that we found building this system were: how to capture user input. How to deal with unreliable self-reported user input. As you may know, it's really difficult to deal with this data. How to normalize data from different labs. Each lab is different. They have different lab ranges. They have different units of measurement. We need to handle every specific lab. Then, how to deal with factors like pricing in your model. Each test costs money, so it's difficult to find the best solution when money is involved. Then, how to prevent biases and performance issues in healthcare. We're going to go into more detail for each problem. Then we'll show you our solution.
Problem: Unreliable Self-Reported User Input
The first problem is unreliable self-reported user input. Imagine when you go to the doctor and they ask you some questions to check what's going on. Sometimes we struggle to answer those questions. It doesn't mean that you are not being honest with the doctor; it means that maybe you forget, or you are just not able to answer that question. The accuracy of self-reported data is an issue in this system. Another issue is that user symptoms are not standardized. We cannot capture user symptoms in a standard way in order to plug them in as an input to our model. These two things are really challenging for machine learning models. The last thing in this problem is that you don't feel the same every single day. Every day, our symptoms and our feelings are changing and evolving, so we need to keep updating, to keep tracking what's going on with our users.
Problem: Training Data from Different Labs
The next problem is training data from different labs. At Base, we work with different labs for the same test, so we receive results for a single biomarker from different labs. That means each lab has its own optimal ranges, plus different units of measurement. The challenge here is that we need to compare results in order to build our system. We need to create something that allows us to compare results between different labs.
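The talk does not show the normalization code, but a minimal sketch of the idea, assuming each lab reports its own unit and its own optimal range, is to convert values into a canonical unit and then express every result as a unit-free position within that lab's range. The conversion table, biomarker names, and values below are purely illustrative:

```python
from dataclasses import dataclass

# Illustrative conversion factors; real values come from each lab's reference sheet.
UNIT_CONVERSIONS = {
    ("ng/dL", "nmol/L"): 0.0347,  # e.g. testosterone ng/dL -> nmol/L (approximate)
}

@dataclass
class LabResult:
    biomarker: str
    value: float
    unit: str
    range_low: float    # lab-specific optimal range, in the lab's own unit
    range_high: float

def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a raw value into the canonical unit chosen for the biomarker."""
    if from_unit == to_unit:
        return value
    return value * UNIT_CONVERSIONS[(from_unit, to_unit)]

def normalize(result: LabResult) -> float:
    """Map a result to a lab-independent score:
    0.0 = bottom of that lab's optimal range, 1.0 = top.
    Values outside [0, 1] are off-range, and are now comparable across labs."""
    span = result.range_high - result.range_low
    return (result.value - result.range_low) / span

# Two labs reporting the same biomarker with different reference ranges
lab_a = LabResult("vitamin_d", 28.0, "ng/mL", 30.0, 100.0)
lab_b = LabResult("vitamin_d", 25.0, "ng/mL", 20.0, 80.0)
print(normalize(lab_a), normalize(lab_b))  # both results now live on one scale
```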
Problem: Pricing as a Factor for Your Model
Another problem is pricing. Pricing is a factor in our machine learning model, because the decision of the model involves money, and our users are price-sensitive, meaning we cannot simply suggest a full-panel test for a user because it's too expensive. We need to balance cost against the probability of finding a deficiency.
Problem: No Training Data Yet
Another problem is that we didn't have training data at the beginning. At the start of Base, we didn't have any user input, we didn't have any results to analyze, and we didn't have a dataset to start building a system from. To get around that, our science team prepared a map between user symptoms and the tests where we are most likely to find a deficiency.
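As a hedged illustration of that cold-start mapping (the symptom names and test choices below are invented for the example; the real map was authored by the science team), it can be as simple as a curated lookup that stands in until real data arrives:

```python
# Illustrative only: the real symptom-to-test map was authored by the science team.
SYMPTOM_TO_TESTS = {
    "fatigue":          ["thyroid_panel", "vitamin_d", "iron"],
    "poor_sleep":       ["cortisol", "magnesium"],
    "digestive_issues": ["cortisol", "vitamin_b12"],
}

def cold_start_suggestion(symptoms: list[str]) -> str:
    """Before any training data exists, suggest the test that the
    science-authored map associates with the most reported symptoms."""
    votes: dict[str, int] = {}
    for symptom in symptoms:
        for test in SYMPTOM_TO_TESTS.get(symptom, []):
            votes[test] = votes.get(test, 0) + 1
    return max(votes, key=votes.get) if votes else "full_panel"

print(cold_start_suggestion(["fatigue", "poor_sleep"]))
```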
Preventing Poor ML Performance in Healthcare
Preventing poor machine learning performance in healthcare: this problem, which is specific to healthcare, cannot be solved by technology alone. Sometimes the technology is not ready yet to solve the problem. We need to balance technology and medical QA to check that everything is ok. On top of this, we are at high risk of model biases. Imagine that our model is price-sensitive: if we don't balance cost properly and just let users select the tests they want, our model could end up biased toward price.
Priego: I think when we are talking about systems that involve users' safety or users' health, we are going to see that QA is key: not necessarily a supervised model in the technical sense, but how humans act to actually manage this technology. On the high risk of model biases as well, we want to make sure that the system we are building is not biasing towards the cheapest test or towards the tests that we recommend the most, because then we don't get a chance to actually see the other tests that we could potentially prescribe. That's something that we're going to be talking about as well. Another parallel that we can draw with this problem in healthcare is self-driving cars, where we find a similar issue. Today, it's really hard to hop into a self-driving car and for regulators to fully trust the engines and systems that we're building to drive autonomous cars. There always needs to be that supervision. That's something that we're going to be talking about.
How to Capture User Input
Del Pozo: Now that we have listed all the problems, we're going to walk through the architecture and solution that we implemented. The first question we had on the agenda is how to capture user input. A user's journey with Base starts at our webpage. There we direct them to a quiz, which is a questionnaire with questions related to symptoms. We ask them their gender, age, and some other things related to their health profile. We implemented this using Typeform. Typeform is a form builder, and we use it through webhooks. Through webhooks we post users' answers to our API, to our backend. We selected Typeform because the integration was super-fast and their documentation was great. One really important thing for us was security: health data must be private, and the security in their webhooks allows us to keep that data private. We talked about how user symptoms and feelings change every day. That's why we need to ask users how they are feeling on a periodic basis. For that, we use our mobile app, in which we ask them every week how they are feeling and how their symptoms have evolved. We have a check-in every week there. We also have another check-in after they get their results. When they get their results, we ask them how they feel, whether the results are what they were expecting, and for their feedback in order to improve our machine learning model.
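The production backend behind this is Spring Java, but as a hedged, minimal Python sketch of the webhook idea it might look like the following. The endpoint path, payload fields, and secret handling are simplified assumptions; Typeform's exact signature header and payload format should be checked against their documentation:

```python
import base64
import hashlib
import hmac
import os

from flask import Flask, request, abort

app = Flask(__name__)
WEBHOOK_SECRET = os.environ["TYPEFORM_WEBHOOK_SECRET"]  # shared secret configured in Typeform

def signature_is_valid(payload: bytes, received_sig: str) -> bool:
    """Verify the HMAC signature attached to each webhook delivery."""
    digest = hmac.new(WEBHOOK_SECRET.encode(), payload, hashlib.sha256).digest()
    expected = "sha256=" + base64.b64encode(digest).decode()
    return hmac.compare_digest(expected, received_sig or "")

def standardize(answers: list[dict]) -> dict:
    """Map raw answer objects onto a fixed symptom schema (greatly simplified)."""
    return {a["field"]["ref"]: a.get("choice", {}).get("label") or a.get("text")
            for a in answers}

def store(symptoms: dict) -> None:
    print("storing", symptoms)  # stand-in for the real persistence layer (DB + S3)

@app.post("/webhooks/quiz")
def quiz_webhook():
    # Health data must stay private: reject any unsigned or tampered call.
    if not signature_is_valid(request.get_data(), request.headers.get("Typeform-Signature")):
        abort(401)
    form_response = request.get_json()["form_response"]
    symptoms = standardize(form_response["answers"])  # non-standardized answers -> fixed schema
    store(symptoms)                                   # later consumed by the per-biomarker models
    return "", 204
```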
How to Deal with Unreliable User Input
Here you have a GIF showing more or less how we ask for user symptoms. This is a multiplatform problem. As you can imagine, we have our webpage. Then we have our Typeform integration, which pushes data through webhooks to our API. Once Typeform pushes the data to our API, it reaches our backend, where our machine learning model lives; we are going to explain in a bit how it is built. Then there is our mobile app, in which we ask the user every week how they are feeling.
ML Model Overview
Moving on to our machine learning model, here we have the inputs and outputs of the model. First we have the symptoms; this data comes from Typeform, from the quiz. Then there is the health profile input, for example whether they have a condition such as diabetes, high blood pressure, or anything else. With these two inputs combined, we expect an output. We have one model per biomarker, so for each biomarker we get the probability of how likely that test is to come back ok. As an example, we might see that testosterone looks great, but the thyroid panel is only 20% likely to be ok, so it's most likely to be off. Then the suggestion from the algorithm will be to test thyroid.
Priego: I think it's important to explain why we have one model per biomarker. In these cases where we are dealing with health conditions, it's key that we have high accuracy. For that reason, we prefer not to optimize for the efficiency of our systems. In this case, even though it's a little bit more costly, we want to have one model per biomarker. This is also relevant because each of these biomarkers has different biological properties, so they will behave a little bit differently. That's why you want to build individual mathematical models for each of them, instead of building one big model that outputs a regression for every biomarker deficiency prediction. It was really important to us to increase the accuracy of the model per biomarker. We also have certain considerations, like certain tests being prescribed a little bit more often or being potentially more important for someone's health, that reinforced this decision.
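A hedged sketch of what "one model per biomarker" can look like in code is below. The scikit-learn logistic models and biomarker names are an illustration only; the production models are trained and served in SageMaker:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

BIOMARKERS = ["thyroid_panel", "vitamin_d", "iron", "testosterone"]  # illustrative subset

class PerBiomarkerModels:
    """One independent model per biomarker: each learns P(result is optimal)
    from the same feature vector (encoded symptoms + health profile)."""

    def __init__(self):
        self.models = {b: LogisticRegression(max_iter=1000) for b in BIOMARKERS}

    def fit(self, features: np.ndarray, labels: dict[str, np.ndarray]) -> None:
        # labels[b][i] = 1 if user i's result for biomarker b was optimal, else 0
        for biomarker, model in self.models.items():
            model.fit(features, labels[biomarker])

    def predict_ok_probability(self, features: np.ndarray) -> dict[str, float]:
        # For a single user row, returns e.g. {"testosterone": 0.95, "thyroid_panel": 0.20, ...}
        return {b: float(m.predict_proba(features)[0, 1]) for b, m in self.models.items()}
```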
ML Model – Dealing with Pricing
Del Pozo: This machine learning model deals with pricing. We don't want to be biased toward a cheaper test, so we sort the output by efficiency, and then the output is just one test. We also give the user an option to upsell to an expanded test, in order to test a full panel in the area where the algorithm suggested the user might be off. This gives us more data and helps us not bias the model toward a cheaper test, because if we let the user decide, maybe they would just choose the cheaper test, and we would only have results for those tests. Then our model would be biased toward some specific tests.
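A minimal sketch of that cost versus probability trade-off follows; the scoring rule and prices are invented for illustration and are not Base's actual formula:

```python
# Illustrative prices in dollars; real test prices differ.
TEST_PRICE = {"thyroid_panel": 89.0, "vitamin_d": 49.0, "iron": 59.0, "testosterone": 59.0}

def rank_tests(ok_probability: dict[str, float]) -> list[tuple[str, float]]:
    """Rank tests by 'efficiency': how likely we are to find a deficiency
    per dollar spent, so the cheapest test doesn't automatically win."""
    scores = {
        test: (1.0 - p_ok) / TEST_PRICE[test]   # probability of a deficiency per dollar
        for test, p_ok in ok_probability.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_tests({"testosterone": 0.95, "thyroid_panel": 0.20, "vitamin_d": 0.60, "iron": 0.70})
suggested_test = ranked[0][0]   # the single test shown to the user
# The user can still upsell to the expanded panel covering the suggested area.
print(suggested_test)
```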
ML Model – Overview
The success metric is finding user deficiencies, that is, when we find non-optimal results for a test. If a user has some symptoms, the model outputs a suggested test, and when the user tests and the results are not optimal, we have found a deficiency. If we don't find those, for us that's not a success, so we need a feedback loop. That feedback loop is executed every time a user receives results, in order to learn how the algorithm is doing. We also had a cold start at the beginning. At the beginning we didn't have any user symptoms, any user health profiles, or any results to analyze. That's why our first suggested tests were based on science knowledge.
Preventing Biases
Priego: I'm going to talk a little bit more here about how to prevent biases. Let's say that our model gets used to prescribing thyroid and vitamin D tests when people are fatigued, which is common in general. Then there are potential issues with certain demographics, like women around 30 years old, who tend to be iron deficient. The model can get biased by the rest of the demographics, who tend to have thyroid or vitamin D deficiencies, but for the cohort I mentioned, iron would be the more relevant test. In that case, you can bias your model, because you could be prescribing thyroid and vitamin D tests for this cohort of users for whom iron would be more relevant, and you will still find somewhat non-optimal results for those tests. Then the model itself will believe that it was successful, when in reality we were missing a more important test to prescribe here.
In that case, the closed feedback loop is where your model gives you a test prescription prediction based on what we think is going to be deficient. From there, the user gets tested and they get their results back. We normalize the data and so forth. Then with that you train on the data, including the user symptoms that they input through the quiz and that we keep polling through the mobile app as well. With that, you build or rebuild your model. Now, if we really want to open this loop, we include the concept of a full panel test. It is important from time to time to suggest this to users. That way you can see exactly, if you were to test everything for a user, which test is the most deficient. That is how you break the bias of the model itself.
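One hedged way to picture that loop-opening step is an occasional exploration branch, where a small fraction of suggestions is the full panel so the models see labels for every biomarker. The 10% rate below is arbitrary and not a figure from the talk:

```python
import random

FULL_PANEL = "full_panel"
EXPLORATION_RATE = 0.10  # arbitrary: how often we "open the loop" with a full panel

def suggest(ok_probability: dict[str, float]) -> str:
    """Usually exploit the models' output; occasionally suggest the full panel
    so every biomarker gets observed and systematic bias can be corrected."""
    if random.random() < EXPLORATION_RATE:
        return FULL_PANEL
    # Exploit: suggest the test most likely to be off (pricing handled as in the earlier sketch).
    return min(ok_probability, key=ok_probability.get)

print(suggest({"testosterone": 0.95, "thyroid_panel": 0.20, "iron": 0.60}))
```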
MVP Architecture via AWS
We thought that this audience might benefit from seeing the MVP architecture that we built on AWS, including the healthcare component and the data quality issues that we mentioned. Most importantly, we wanted to talk about it from the perspective of a startup. Something that I went through when I transitioned from big tech into a smaller company is that it's sometimes harder when you don't have internal tools or things that are already built; to build your solution, you really have to go from zero to one. In some mid-sized businesses this is also quite common, because you don't want to build a rocket to cross the street. You want to build a prototype in some cases, to see your success metric and see how a model or a new system does in production, in the wild, before you actually build something more complex. So we take pride in building the simplest solutions first. We really leverage AWS to build our entire system. In the architecture that Jose walked you through, the model sat behind the API, in the backend. What we really do is use SageMaker to train our model. We talked about the feedback loop; currently, we are evaluating training this model weekly, specifically because we're getting out of the cold start problem now that we're collecting data. Then you can deploy this to a development endpoint where you can validate it, and you can run different pathways that you do QA against.
Again, we talked about the importance of balancing tech and medical expertise in our model, when your system's outputs or behaviors imply some potential danger, or they need to stick to regulations like driving regulations or healthcare regulations. In this case, having a development endpoint allows our medical team to QA the algorithm with different pathways. Say you have a person with a certain profile who should be prescribed a thyroid test. The medical team can go and evaluate what tests this model is actually prescribing for different profiles that they know really well, just to make sure that things are stable and good to go. Once that's approved, you can deploy it through code pipelines to our backend, which is mostly Spring Java on AWS, and use Amazon API Gateway to communicate with our SageMaker model.
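The talk doesn't show the training or deployment code. As a rough sketch, assuming the SageMaker Python SDK's scikit-learn estimator, a weekly retraining job plus a development endpoint could look like this; the role ARN, S3 paths, entry-point script, and endpoint name are all placeholders:

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role ARN

# Weekly retraining job: train.py (not shown) would fit one model per biomarker
# on the normalized quiz + lab dataset exported to S3. Paths are illustrative.
estimator = SKLearn(
    entry_point="train.py",
    role=role,
    instance_type="ml.m5.large",
    instance_count=1,
    framework_version="1.2-1",
    py_version="py3",
    sagemaker_session=session,
)
estimator.fit({"train": "s3://base-ml-data/training/latest/"})

# Deploy to a development endpoint first, so the medical team can run known
# pathways against it before the model is promoted to production.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
    endpoint_name="biomarker-models-dev",
)
```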
Summary
Del Pozo: First of all, we have multiplatform polling to deal with unreliable self-reported user input. That is the first problem that we tackled. Then, we implemented a full testing panel to prevent bias and to enrich our feedback loop. The third thing that we would like you to think about is normalizing training data via a mathematical distance for the different lab results; we need to normalize in order to compare the data. Finally, we need medical QA to review model performance prior to deployment, because at this point, tech alone is not enough for medical and healthcare issues.
Questions and Answers
Breck: I've been tracking my sleep for a good while, and the impact it has on my health. Accounting for how I feel each day and what my blood says is very interesting to me. I think no matter what our industry is here, as we move into more predictive models, whether in manufacturing or in financials, or what have you, the devil is really in the details in terms of data quality and how that drives our models.
I’m curious that once you have a lot of data, and you can rely more on your empirical models, will you abandon the scientific models that you use to bootstrap or will you always have a balance of the two?
Priego: Basically, when you are at the cold start stage, you have to rely purely on data from clinical trials and medical expertise, basically on what the science is telling you. Then we start collecting and feeding in data, and you start to be able to build models, and suddenly those are giving you more information, or more helpful information, than what doctors originally had in their heads. Then, to our point on deploying a model and getting it to production, the models still need to be supervised by doctors, and if a pathway seems off or is one that they really disagree with, that model cannot be fully deployed to production. Because at the end of the day, a doctor needs to approve the algorithm. It is challenging to cover all potential dimensions or pathways, although that's possible, because at the end of the day, those are finite. You take the most common ones and some critical scenarios, and you get those in front of doctors. If a doctor says, listen, for this case, we cannot say or prescribe X, Y, or Z, or, you're missing out on this case, then you still have to patch it with some manual rules, or go back and adjust your dataset.
Breck: I’m sure there’s lots of people who maybe want to get into more predictive models, and this isn’t something that they’ve done yet. Especially as a small company, building an MVP, was Amazon SageMaker an absolutely obvious choice, or did you weigh other options? Are there any details you can provide on the actual ML model that you’re using?
Priego: AWS SageMaker was a no-brainer. A, because they have models out of the box, so you don't have to go hunting for libraries that plug into your backend. They have all kinds of libraries for all kinds of popular models. At the end of the day, as a startup, you don't have the data science and mathematical resources to go and build your own, so you're going to have to use common libraries. Something that I also saw in big tech is that when you're trying to build the first iteration of a model, you often have to go with something out of the box, because it takes a long time for data scientists to come up with actual custom mathematical models for whatever system you're trying to build. SageMaker was a no-brainer because all of the libraries for popular models are there. They have out-of-the-box tools that you can use for pre-deployment, testing and QA, and then eventually deploying to your production endpoints.
In addition, to the question of what models we are using: linear regression is the one that seems to be doing best out of the box. Of course, at the end of the day, for any system you're trying to build, the first thing you do is run your inputs and outputs through different models and see which one has the best performance with your training data. In our case, that was regression models. As for other options we evaluated, I looked into a popular one, Scale AI, which is becoming popular with startups and mid-size businesses, but they mostly work with self-driving cars. They're doing some cool stuff around e-commerce. They help you with labeling data, but again, you have a conflict there, because our data is medical, so we cannot give external tools access to label the data or polish the data, or anything in relation to that. We couldn't leverage that component. After looking into Scale AI, they seem like a solid solution, but not applicable to the healthcare industry.
Breck: Do you need to do any testing in production, like giving your production models a known set of inputs and looking for a known set of outputs as a way to continuously validate your system in production? Is this something that you do or have thought about?
Priego: Again, you need to have that medical QA; testing in production is something that we cannot afford with this type of dataset. I don't know how people who are working with, for example, self-driving cars go about it, but in my head there will always be some parameters that you need to control tightly just because of regulations. The Department of Health does force you to have certain pathways and to have medical supervision over the AI systems that are actually making medical decisions. For us, it is really about having those pathways prior to production. Then there are other scenarios that you can test, for example, outliers that maybe you don't have covered in your testing yet. Someone who comes to you with a given medical condition and a given symptom that you still don't have in your dataset, that's one that we'll have to go and supervise.
Breck: Someone’s asking about the types of DevOps tools that you leverage. Maybe building on that question, how do you version different models, keep track of different models, promote them from your engineering QA environment into production? Do you have any tips and tricks in managing?
Priego: Yes. For the models, again, everything is out of the box from SageMaker as of today. Then there are some interesting pieces that we had to develop ourselves, because the challenging part for us is the multiplatform edge, when you have to start bringing datasets from different types of inputs into your model. In our case, the user was very far removed from AWS, so in order to capture that input, we had to figure out how to build user input tools that get the data into AWS in a way that is standardized and normalized.
Jose, I don't know if you want to elaborate a little bit more on the DevOps tools leveraged for Typeform: how we capture the user input and how we normalize it. I know that we didn't go super deep, for example, into how the lab result values coming from the labs are normalized, which is also interesting. Because on one hand you have users giving you input about their symptoms, and on the other hand you have different labs, from Quest to small local labs that don't have a lot of technology, giving you data in a spreadsheet on S3.
Del Pozo: We talked about Typeform, which is how we get user input. Typeform pushes the data, the user symptoms, through webhooks to our backend, and there we standardize the users' answers. We store those in our database, and we store those in S3. The same happens for the frequent app polling that we do every week. Then, for the labs, they have an internal portal in which they upload user results. All of that data comes to the backend, where it is standardized, normalized, and then stored. Then our models can grab that data for training.
Priego: Something to note on the lab side as well is that we have to leverage some human supervision, just to make sure that everything looks good. Of course, at the end of the day, even humans make errors, so what you do is normalize your data and build guards. For example, for certain results, you will not believe it, but labs do have a tendency, in about 1% of cases, to add an extra zero to a result, which could be a deal breaker for a person in terms of what it means for their health. So we also build guards around the data, flagging what looks a bit off or abnormal, and what's critical, because for critical results the doctor has to follow up manually with the person. In those cases, that's when a human will go and supervise the data, and potentially follow up with the lab.
The way that looks is: the lab uploads the data, and we ingest it through S3. They literally upload a spreadsheet daily. Then we run a Python script daily as well to standardize and normalize the data. If everything looks good, it gets uploaded to the server. For the values that look off, an analyst will go and review those separately, and the rest will get uploaded. We ingest that daily directly into our backend, and it goes into DynamoDB. Again, we leverage a lot of AWS; that's also why we chose SageMaker. Those are the DevOps tools that we're leveraging in that case. Then you can, of course, schedule all of your cron jobs on an Elastic Beanstalk instance, or any of the out-of-the-box solutions from AWS.
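A hedged sketch of what that daily guard-and-ingest script could look like is below. The table name, file path, biomarker names, and plausibility bounds are all illustrative, not Base's real values:

```python
import boto3
import pandas as pd

dynamodb = boto3.resource("dynamodb")
RESULTS_TABLE = dynamodb.Table("lab_results")  # illustrative table name

# Illustrative plausibility and critical bounds per biomarker, in canonical units.
GUARDS = {
    "vitamin_d": {"plausible": (1.0, 200.0), "critical_below": 12.0},
    "ferritin":  {"plausible": (1.0, 1500.0), "critical_below": 10.0},
}

def classify(biomarker: str, value: float) -> str:
    """Route each result: implausible values (e.g. a likely extra zero) go to an
    analyst, critical values trigger a manual doctor follow-up, the rest pass."""
    guard = GUARDS.get(biomarker)
    if guard is None:
        return "needs_review"
    low, high = guard["plausible"]
    if not (low <= value <= high):
        return "needs_review"
    if value < guard["critical_below"]:
        return "critical"
    return "ok"

def ingest_daily_spreadsheet(path: str) -> None:
    """Daily job: read the lab's uploaded spreadsheet, apply the guards,
    store clean rows in DynamoDB, and flag suspicious ones for an analyst."""
    df = pd.read_csv(path)  # an S3 URI works here if s3fs is installed
    for _, row in df.iterrows():
        status = classify(row["biomarker"], float(row["value"]))
        if status == "needs_review":
            print("flagged for analyst:", dict(row))
            continue
        RESULTS_TABLE.put_item(Item={
            "user_id": str(row["user_id"]),
            "biomarker": row["biomarker"],
            "value": str(row["value"]),  # DynamoDB has no float type; pass as string/Decimal
            "status": status,
        })
```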