Mobile Monitoring Solutions

39 Statistical Concepts Explained in Simple English – Part 17


Article originally posted on Data Science Central. Visit Data Science Central

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, decision trees, ensembles, correlation, Python, R, TensorFlow, SVM, data reduction, feature selection, experimental design, cross-validation, model fitting, and many more. To keep receiving these articles, sign up on DSC.

39 Statistical Concepts Explained in Simple English

Previous editions, in alphabetical order, can be accessed here:

Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6 | Part 7 | Part 8

Part 9 | Part 10 | Part 11 | Part 12 | Part 13 | Part 14 | Part 15 | Part 16



28 Statistical Concepts Explained in Simple English – Part 18


Article originally posted on Data Science Central. Visit Data Science Central

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, decision trees, ensembles, correlation, Python, R, TensorFlow, SVM, data reduction, feature selection, experimental design, cross-validation, model fitting, and many more. To keep receiving these articles, sign up on DSC.

Below is the last article in the series Statistical Concepts Explained in Simple English. The full series is accessible here.


28 Statistical Concepts Explained in Simple English – Part 18



Presentation: Panel: Predictive Architectures in Practice


Article originally posted on InfoQ. Visit InfoQ

Transcript

Moderator: Thank you all for being here at the tail end of the “Predictive Architectures in The Real World” track. I hope you learned a lot of cool things and enjoyed yourself in general. My brain is bursting from all the things I’ve heard, I’m sure for you too. I hope you have some questions for the people who taught us all those amazing things today.

I’m going to kick it off with my own question because that’s track host privilege. I’ll let the panel answer and then I’ll turn it to you, you can think of your questions, scribble them, and then if you run out of questions, I have a bunch of my own, so don’t feel under the gun that you have to ask something, or there will be a very boring session, although my questions may be more boring than yours, who knows?

Just to start by introducing our experts, even though you've probably seen them earlier. We have Eric Chen, who is a software engineering manager from Uber. He talked about feature engineering and the Michelangelo Palette, so he can probably answer a bunch of questions about that. We have Sumit from LinkedIn, who worked on People You May Know, the transition from batch to streaming, and the specially optimized databases that they have built. We have Emily and Anil from Spotify; I totally missed their talk, but I heard it's awesome and all kinds of good things about real-time recommendations. Then we have Josh Wills, who knows everything about how to monitor your machine learning pipeline so it will work at 2 a.m. just as well as it does at 2 p.m., which may be not at all.

Evaluating Success

My question for this esteemed panel of experts is, how do you know that what you did actually worked? How do you evaluate your success? If your manager asks you, “Well, did you do a good job?” How do you know?

Chen: Short answer, I don't know, some people tell me. Longer answer, the customers usually come back to us and tell us their stories. For example, the Eats and Maps examples, the whole set of examples we're showing here, actually come from our interactions with our teams. They are data scientists, they have their own KPIs, they know, by productionizing this particular model, how much money they saved or how much efficiency they improved. That's how we know, "Ok, we're actually making some impact here."

Rangwala: Yes, for us it's always the metrics, we always measure our success based on the metrics. I happen to be in a machine learning team, so at the end of the day, those are our true north metrics. If we don't move them, then most of the time that means we didn't succeed.

Moderator: What’s your most important metric?

Rangwala: Our most important metric is what we call, "How many members invited another member and were accepted?" Invites and accepts are basically the key metrics. We have other metrics as well; at LinkedIn, pretty much all the metrics are tracked in real time. At any time, we can go and see how we are performing on any of the metrics; I think the report card is out there.

Samuels: I would also like to add that another thing I think about is how easy it is to experiment on our platform. We have metrics to determine how people are listening to home and whether they're happy with our recommendations, but we also always want to be trying out new things and always be improving. One of the ways that we know that our architecture is good is if we're able to easily experiment.

Wills: I'll echo that and I guess I'll throw in reliability as being a major concern for me; it's speed times reliability. I studied operations research in college, so either I'm trying to optimize models per week subject to given reliability and uptime constraints, or vice versa, or I'm trying to optimize a product of the two, with a sort of logarithmic measure of reliability, times iteration speed. That's more or less what I'm after: speed subject to reliability, that's my jam.

Limitations of Tooling

Participant 1: In the areas that you work on, what are the areas that you feel that you’re limited in and you wish you had more power, or you had more tools to build up better [inaudible 00:05:36]?

Moderator: Do you want to repeat the question because of the lack of mic?

Wills: Sure. The question is, in what areas do we feel limited by our tooling, I think is the core of the question. What tooling do we wish we had? I feel somewhat limited by the monitoring. I guess my talk was primarily about monitoring and alerts, counters and logs, and stuff like that. I do feel somewhat constrained, I think, by my monitoring tools in the sense that they weren't really built with machine learning in mind. One of the ways we have screwed up Prometheus in the past is by inadvertently passing basically Infinity to one of our counters, incrementing by Infinity, and Prometheus just doesn't handle Infinity very well. I don't blame it, I can't handle Infinity that well either.

It's the kind of thing that's an occupational hazard in fitting models: singularities and dividing by zero. You have to kind of adapt the tool to be aware of that use case. I fantasize, if there are any VCs here who want to give me money, about machine learning monitoring tools that are designed with machine learning as a first-class citizen, as opposed to performance and errors.
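As a rough illustration of the kind of guard this calls for, the model code can refuse to feed non-finite values into its metrics in the first place. A minimal sketch, assuming the standard prometheus_client Python library; the metric name and the drop-on-bad-value policy are invented for illustration, not anything Slack actually runs:

import math
from prometheus_client import Counter

# Hypothetical metric name, for illustration only.
PREDICTION_LOSS_TOTAL = Counter(
    "model_prediction_loss_total",
    "Cumulative model loss, guarded against non-finite values",
)

def record_loss(loss: float) -> None:
    # Model code occasionally produces inf or NaN (division by zero,
    # numerical blow-ups); passing those straight into a counter
    # corrupts the time series, so drop them here instead.
    if not math.isfinite(loss) or loss < 0:
        return  # a real system would also count how often this happens
    PREDICTION_LOSS_TOTAL.inc(loss)

The point is not this particular helper but keeping the check next to the model code, so the monitoring system only ever sees finite increments.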

Samuels: I feel like a lot of times we’re fighting against the data that we have to feed into the models, and if you put garbage in you get garbage out, so a lot of times, we have to do a lot of work upfront to make sure that the data that we’re putting in is of good quality. That’s something that can be limiting.

Rangwala: For us, I can think of two things. One of them is the impedance mismatch between the offline and online worlds. Whenever you have to train a model, you work in the offline world, but then when you have to actually use that model, you have to come to the online world and serve it, and making sure that both worlds are exactly the same is actually a very big problem. The offline world is fairly mature, but there are lots of gaps in the online world. Trying to do exactly what you're doing offline, online, is extremely hard at this point. It also involves moving data from offline to online; Uber talked about how to move features, but there is also the problem of how you move the models. How do you make the whole system seamless, so that training happens automatically and the result gets picked up automatically by the online system, without affecting performance?
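One common way to shrink that gap, sketched below, is to keep the feature computation behind a single function that both the offline training job and the online serving path import, so the two cannot drift apart silently. The feature names and scaling constants here are invented for illustration, not anything LinkedIn or Uber actually uses:

from typing import Mapping

def featurize(raw: Mapping[str, float]) -> list[float]:
    # Single source of truth for feature computation. The batch training
    # pipeline applies this row by row over historical data; the online
    # service applies it to the live request payload before scoring.
    return [
        raw.get("clicks_7d", 0.0) / 7.0,                       # average daily clicks
        min(raw.get("account_age_days", 0.0), 365.0) / 365.0,  # capped and scaled
        1.0 if raw.get("is_premium", 0.0) > 0 else 0.0,        # boolean as float
    ]

Sharing this one function does not solve model shipping or serving performance, but it removes one whole class of offline/online inconsistencies.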

The second thing is, maybe because of the nature of the company I work at, we have solved many of the problems, and now the only big problem we deal with is the size of the data. Again and again, we are limited in what we want to do; algorithms that work at a smaller scale don't work at a larger scale, and we have to figure out new ways of doing things.

Chen: First of all, one thing we're trying to emphasize is the whole experience, from the time you have some experimental idea you want to try out to the time you can deploy it in real production. These things are related, and you need a lot of tools at hand to analyze your model, on two fronts. One is, how can you, as a user of the system, interactively define all your needs? The other one is analytics and evals: how can I provide a bunch of evals that help you understand online/offline consistency, and how your model behaves in the real situation versus in a simulation system? Then, eventually, how your model can be pushed into production, with monitoring on top of it. That's one big thing. The other thing we're fighting for is reliability; the size of the data keeps growing. How can we build a system that can magically grow with the size of your data? That's what we're imagining as well.

A/B Testing Systems

Participant 2: I have a question for the entire panel. I'm interested in learning what A/B testing system you use, and whether it's proprietary, open source, or a combination. I'm really interested in the entire A/B testing workflow: bucketizing users, tracking their results, visualizing the experiment results.

Moderator: Ok, what tools do you use for A/B testing?

Chen: Short answer, it's an internal tool we built ourselves; it's not part of Michelangelo, because A/B testing is everywhere in the world. Longer story, even A/B testing itself requires different tools behind the scenes. Some of them, as you mentioned, are about bucketizing; some of them are about having a default, turning something on a little bit, and then turning it off in a controlled way, probably by time. It's also related to how you organize multiple experiments, because you want to isolate their impact. Some of these things are actually operational, say, because we're operating in different cities. That's why our A/B testing is internal; it's very complicated.
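For reference, the bucketizing step the question asks about is usually just deterministic hashing of the user ID with a per-experiment salt, so a user always lands in the same arm of a given experiment but independently across experiments. A minimal sketch; the experiment name and 50/50 split are hypothetical:

import hashlib

def assign_bucket(user_id: str, experiment: str, buckets: int = 100) -> int:
    # Hash user_id together with the experiment name so assignment is
    # stable within an experiment and independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def arm(user_id: str) -> str:
    # Hypothetical 50/50 split for an experiment called "eta_model_v2".
    return "treatment" if assign_bucket(user_id, "eta_model_v2") < 50 else "control"

Everything the panelists describe on top of this, like ramping, isolating experiments from each other, and metric pipelines, is where the real complexity lives.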

Rangwala: I think it is fairly well known that LinkedIn has a fairly sophisticated A/B testing system, which we use extensively; every time we are ramping any model or any feature, it goes through an A/B test. There is a whole bunch of pipelines, which come for free for anyone who is ramping their model, that will calculate all the metrics we care about, product metrics and metrics at the company level. We come to know if our experiment is causing any problem anywhere.

It's a very big team, and it's almost state of the art at this point. One thing it is still missing is something like what we do for software systems, where whenever you do a code commit, there is an automatic pipeline where the code gets deployed to the next stage and, if there are no issues, it automatically moves on to the next stage. I would like to have the same system for models, where the A/B testing system is the baseline, the information you use to figure out when to move anything automatically to the next stage.

Muppalla: At Spotify we have a complex way of doing A/B testing, we have a team that’s built a platform to generalize how you divide and bucketize users. For home, we have our own way of testing and using this infrastructure that we have and also, we have a couple of data scientists who see the exposures for each treatment that we expose the users to and come up with, “How is this affecting our core metrics that we care about for the homepage?”

Wills: Yes, echoing what other folks said, at Slack we built our own. I was not actually involved in building it; we had some ex-Facebook people who built a Facebook-centric experiment framework. I had my own biases here because I wrote the Java version of Google's experiment framework, which is unified across C++, Go, Java, different binaries; all have the same experiment configuration and diversion criteria, bucketing, logging, all that kind of good stuff. It was interesting to me to learn how the Facebook folks did it; it was educational, the pros and cons. These sorts of systems almost always end up being homegrown for the machine learning use cases, outside of the marketing team using Optimizely or one of those kinds of systems. I think another factor is the fact that, A, these things are super fun to write, so it's an awesome thing to do. Why would I pay someone to do something that I enjoy doing? I'm just going to do it myself.

Two, they’re almost always built on top of a bunch of other assumptions you’ve made. The same thing is true for monitoring stuff, I’m not generally bullish on machine learning as a service, because I want my machine learning to be monitored exactly the same way the rest of my application is monitored, whether I’m using ELK and Grafana, StatsD, Sumo, Datadog, whatever it is I’ve chosen, I want to use the exact same thing that I use for everything else.

One of my lessons in my old age is that there's no such thing as a greenfield opportunity. You're always hemmed in by the constraints of choices other people have made, and an experiment framework is not generally built in the first week of a company's life. It's built after years and years, and so you build it in the context of everything else you've built. I loved the Uber talk, by the way, and I totally get building the Palette work on top of Flink, and on top of Hive, and on top of all these other systems that are built and trusted and used for lots of different things; building something de novo on top of Spark would be insane.

Accountability and Protecting Your Systems

Participant 3: I wanted to ask you a question. Sometimes recommendations can have real consequences; with self-driving cars, you can imagine. Our company makes recommendations to farmers with digital agriculture; if somebody follows a recommendation and they don't get what they want, you could be open to being sued or something. What keeps you up at night? In what ways do you audit or keep track of your system so that you can replay, or explain to someone, that your system actually worked correctly?

Moderator: “What keeps you up at night?” is a good question on its own right. Then, how do you audit and protect yourself against lawsuits because the recommendations went wrong or went right?

Wills: I know what keeps me up at night, I don’t have a good answer for the auditing stuff.

Moderator: You have a three-year-old son.

Wills: Definitely my three-year-old son keeps me up, I actually haven’t slept in about four days. The thing that keeps Slack up at night more than anything else, is serving messages to someone that they should not be able to see. There’s a private channel somewhere, there’s a DM conversation, something like that, and if you somehow did a search, and the search resulted in you getting back messages that you should not be able to see from a privacy perspective, that is existential threat level terror to the company. It 100% keeps me up at night, and is the thing I obsess over.

We have written triple redundancy checks to prevent it from happening. We are paranoid about it to the point that we’re willing to sacrifice 100 milliseconds of latency to ensure that it never happens. That is the thing that honestly keeps me up at night more than anything else, more than any recommendation ranking relevance explanation, that’s my personal terror. I’m sure everyone has their own, that’s mine. I’ll give some thought to the other question though, that was a good one.

Muppalla: The thing that I am scared or worried about, and that my team constantly thinks about, is a user not seeing anything on the homepage. That's a bit scary for us. We also have a lot of checks. With respect to checking whether a recommendation actually works when we make it, we have different users we're trying to target and different use cases they belong to. We have a nice way of creating a user that fits a given profile, seeing what happens, and checking whether the recommendation fits that user.

Samuels: I’ll just add that I think at Spotify in general, one of the things is if you hit play, and you don’t hear anything, that’s scary, too. Making sure that if nothing else works, at least you can play a song.

Rangwala: I think for LinkedIn there are existential threats if the website goes down, but we have a lot of systems in place to take care of that. I would rather answer the question about explainability. Many times people see something which they find offensive; we have systems which go and find anything that could be low quality or offensive, and we remove it from the feed for everyone. That aside, for every machine learning product there is always the question: if we recommend something to a user which the user does not find ok, or it leads to a huge amount of traffic to the user, what do we do? The general approach we use at LinkedIn is to do a lot of tracking of the data. Everything that has been shown to the user is tracked, with fine-grained information about what the reason was, what the score was, and which features we used in order to recommend it. So whenever a problem is reported to us, we can go back, replay what happened, and come up with an explanation of why that recommendation was generated.
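A stripped-down version of that kind of tracking record might look like the sketch below. The exact fields LinkedIn logs are not public, so these are illustrative; the point is that the features, score, and model version that produced a recommendation are captured at serving time so the decision can be replayed later:

import json
import time

def log_recommendation(user_id: str, item_id: str, score: float,
                       features: dict, model_version: str) -> str:
    # Serialize everything needed to replay the decision later: which
    # model produced it, with which feature values, and what it scored.
    event = {
        "ts": time.time(),
        "user_id": user_id,
        "item_id": item_id,
        "score": score,
        "features": features,
        "model_version": model_version,
    }
    line = json.dumps(event, sort_keys=True)
    # In production this would go to a tracking stream, not stdout.
    print(line)
    return line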

Chen: To answer the question, I think it's really a layered question. Michelangelo is really the inferencing layer; in some sense, we don't know where you're serving, so we don't answer the question directly. The way we are solving the problem is by providing tools for our customers, so they can understand their particular business domain and then explain to their customers. So what tools are we building? For example, when things are not behaving exactly the way they did at training time, there could be several reasons. One is that your data is not behaving exactly the same; in our world, that's more or less data evaluation tools. Is your data drifting? Are you sending a lot of noise to us, because then we'll fall back to something. Those things are built into our system, and we serve them to our customers.

The other one is to understand your model itself. Models usually end up being a linear regression model, or a decision tree model, or, more popularly, a deep learning model. In some sense, we don't like deep learning models, because it's very hard to interpret what's going on. For example, we build tools to help you understand how a decision tree is working: at what point which particular decision is made, so that the tree behaves this way. We have visualization tools to help you understand those.

Then, back to your question, I think it really becomes: is this a single instance, or is this a trend? If it's a single instance, then you probably need those model evaluation tools to understand what exactly is happening with your particular instance. If it's a trend, it's probably more about, "Do I see some special trends in recent data coming into the system?" Different problems we solve differently.
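The "is your data drifting?" check mentioned above is often implemented as a simple distribution comparison between training-time and recent serving-time values of each feature, for example a Population Stability Index. A minimal sketch; the 0.2 alert level in the comment is a common rule of thumb, not Uber's actual setting:

import math

def population_stability_index(expected: list[float], actual: list[float],
                               bins: int = 10) -> float:
    # expected: training-time sample of one feature
    # actual:   recent serving-time sample of the same feature
    # Larger values mean more drift; around 0.2+ is often treated as meaningful.
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def histogram(xs: list[float]) -> list[float]:
        counts = [0] * bins
        for x in xs:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Small floor so empty buckets don't blow up the log term.
        return [max(c / len(xs), 1e-6) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))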

Moderator: I have a question for you because Uber did have a bunch of lawsuits. Did you ever find yourself, or one of the data scientists, in a situation where you had to explain a machine learning model to a lawyer?

Chen: Not so far as a manager.

Moderator: Because I think it’s coming for all of us.

Chen: I don’t hear any particular request coming to the team yet.

Understanding Predictions & Dealing with Feedback

Participant 4: Once you've acted on behalf of a machine learning model, how do you know that the predictions were correct?

Moderator: That was a feedback question. How do you know your predictions were correct? You made a prediction, someone took action, how do you know if it turned out right?

Chen: In Uber's case, a lot of those are not so hard. Let's say I'm trying to make an ETA guess for your meal. When your meal arrives, I already know the answer; it's really a feedback collection loop.

Participant 4: What if you were prioritizing those people who have a longer ETA?

Chen: Sorry, your question is, what if I keep giving longer ETA?

Moderator: Yes, if you notice that right now all your ETAs missed the mark and meals actually arrived too late. Have you built a feedback loop into your system?

Chen: There are two parts. One is, what action do you need to take immediately? That's monitoring. Let's say I know my ETA is supposed to be in this range on this day, during this time slot. Now I see my ETA having a different time pattern; that's purely anomaly detection, and it's actually unrelated to the model itself. It's basically, "Here is my trend, am I still following the trend?" The second part is the modeling, when the result really comes back, and that takes a longer time. I think the two problems need to be solved separately.
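A rough sketch of the monitoring half of that answer: compare the current window's ETA error against what is normal for the same time slot and alert when it deviates too far. The z-score approach and threshold are illustrative choices, not Uber's actual implementation:

from statistics import mean, stdev

def eta_anomaly(recent_errors_min: list[float],
                baseline_errors_min: list[float],
                z_threshold: float = 3.0) -> bool:
    # baseline_errors_min: historical (actual - predicted) ETA errors, in
    # minutes, for this city/time slot; recent_errors_min: the live window.
    # This watches the metric's trend, independent of the model itself.
    if len(baseline_errors_min) < 2 or not recent_errors_min:
        return False
    mu, sigma = mean(baseline_errors_min), stdev(baseline_errors_min)
    if sigma == 0:
        return mean(recent_errors_min) != mu
    z = abs(mean(recent_errors_min) - mu) / sigma
    return z > z_threshold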

Moderator: Maybe someone else has different ways of dealing with feedback? If you predicted that two people are related but they keep telling you, "No, we never met each other"? Or people hate the music you recommend, how do you handle those?

Rangwala: At LinkedIn, it's fairly similar to Uber. The metrics that matter are all measured, and we also have an anomaly detection system on top of the metrics. Individual teams can choose to put their metrics into this anomaly detection system, and it will flag whenever something bad is happening to their system.

Samuels: For music recommendations, it can be a bit more subjective: if you open up your phone and look, it can be hard to tell whether this is right or wrong. You can know for yourself whether something makes sense, but it's hard to look at somebody else's data and see whether it makes sense for them. We do have a lot of metrics that we measure, like how often people are coming to home and whether they are consuming things from home, and things like that. We look at those metrics to make sure that we're doing a good job, and if they're tanking then we know we have to change something.

Wills: I would just throw in that search ranking is much the same. I generally know within seconds whether we've done a good job of finding what you were looking for or not. Either you click on something or, honestly, at Slack, sharing a message you found back into a channel is the single strongest positive signal that we did a good job of finding what you were looking for. At Google, on the ads system, it's the same kind of thing; you get very fast feedback. I have not personally worked at Netflix, but I have friends who have, and my understanding there is that, ultimately, all they care about is churn. Generally speaking, they need on the order of 30 days to get a full understanding of the impact of an experiment, since they're waiting to see how many people in treatment A versus treatment B churn.

From their perspective, they have things that correlate with churn that they can detect early, which they use as a kill-switch mechanism, but their iteration time on experiments has to be long because their ground truth is long. I imagine folks doing fraud detection for credit cards at places like Stripe, again, have very long lead times, with similar kill-switch signals early on, but they let the thing run to fully understand its impact. That sounds unpleasant to me, and I would not like to work in that kind of environment. I mean, fast feedback is nice.

Best Practices

Participant 5: A question for the whole panel. You have built successful ML systems, and this is a pretty new space. The question is, what have you learned, what are the basic principles you can share with us? Like, when it comes to data, the data must be immutable. What principles like that apply to ML-based systems?

Moderator: One top best practice each?

Wills: What was hard for me in coming to Slack (I was at Google before, I may have mentioned that) was that I very much had it in my head that there was the Google way of doing things, and the Google way was the truth. It was the received wisdom, the one and only way. What was hard for me at Slack was learning that that was not the case. Getting to work with folks who had been at Facebook and Twitter and seeing other things that worked helped me understand the difference between good principles and just path-dependent coincidence. There's a lot of stuff at Google that's been built up around the decision that Kevin, the intern, made in the year 2000, to make it seem like it was the greatest decision ever, when it wasn't. It was just a random decision, and you could have done it another way and it would have been totally fine.

The only thing I take as absolute law, and imposed in my first week at Slack during my management days, was having evolvable log file formats. I got rid of the JSON in the data warehouse and moved everyone over to Thrift. That was, for me, the equivalent of the guy who on September 10th mandated that every airplane had to have a locking cockpit door. I have saved Slack hundreds of millions of dollars with that decision, and no one can tell; it's just invisible. From a machine learning perspective, being able to go back and replay your logs for all time using the same ETL code, without having to worry about "How the hell do I parse this? What does this field mean? Oh my God, the type changed," not worrying about that nonsense is the only 100% always-true thing everyone should do everywhere right now.
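The property being described, replaying years of logs with today's ETL code, comes from schemas whose later fields are optional and defaulted, which is what Thrift or Protobuf give you at the IDL level. A toy Python illustration of the same contract, with made-up field names rather than Slack's actual schemas:

from dataclasses import dataclass
from typing import Optional

@dataclass
class SearchEvent:
    # Every field added after v1 is optional with a default, so old log
    # lines still parse and new readers never crash on old data.
    user_id: str
    query: str
    clicked_result: Optional[str] = None   # added in a later version
    latency_ms: Optional[int] = None       # added later still

    @classmethod
    def from_record(cls, record: dict) -> "SearchEvent":
        return cls(
            user_id=record["user_id"],
            query=record["query"],
            clicked_result=record.get("clicked_result"),
            latency_ms=record.get("latency_ms"),
        )

# A record from years ago and a current record both parse fine:
old = SearchEvent.from_record({"user_id": "u1", "query": "roadmap"})
new = SearchEvent.from_record({"user_id": "u2", "query": "okr", "latency_ms": 42})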

Muppalla: To add to what Josh mentioned, when you're running ML experiments, the quality of the data is something you have to be able to rely on to retrain and iterate on your models. That's something you have to make sure is good: some counters, or some way of measuring that the data you're training on is reliable. I think that's very important.

Samuels: Then, to add to that, the ability to experiment really fast. Having a good offline evaluation system has been really important for us to be able to try out a lot of different experiments, rather than having to run so many A/B tests that take time to set up and you have to wait a couple of weeks to get the results. The offline evaluation allows you to experiment much faster, that’s been really important for us.
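Offline evaluation in this sense usually means replaying historical sessions through a candidate ranker and scoring it against what users actually consumed, before anything goes near an A/B test. A minimal sketch; the hit-rate metric and cutoff are illustrative choices, not Spotify's actual evaluation suite:

def hit_rate_at_k(ranked_items: list[str], consumed: set[str], k: int = 10) -> float:
    # Did anything the user actually played appear in the top k?
    return 1.0 if any(item in consumed for item in ranked_items[:k]) else 0.0

def offline_eval(ranker, sessions: list[dict], k: int = 10) -> float:
    # Each session holds the candidate pool shown to the user and the items
    # they actually consumed; no live traffic is needed, so many variants
    # can be compared in minutes rather than weeks of A/B testing.
    scores = [
        hit_rate_at_k(ranker(s["candidates"]), set(s["consumed"]), k)
        for s in sessions
    ]
    return sum(scores) / len(scores) if scores else 0.0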

Rangwala: Ok, I'll take a different stand here. In a previous life, I was an infrastructure engineer; I have even written networking protocols. One of my key takeaways when I started working with machine learning folks is that when you're building the architecture, don't think like an infrastructure engineer. The key is that the metrics are all that matters, and you have to understand that machine learning algorithms are by their very nature statistical, so your infrastructure does not need to be fully deterministic either. Sometimes there are a lot of variations you can try out if you understand that your true north metric is statistical in nature.

I have done that a few times in my career, where I have understood that part and exploited it to build systems which otherwise couldn't have been built if I had only thought about full consistency or absolutely deterministic systems. In our talk, we mentioned that at a certain scale, you have to start thinking about machine learning and infra in conjunction with each other. Without giving a specific example, I can think of a problem that can be solved in two different ways using machine learning, where one of them is much more amenable to the infrastructure than the other. It's actually very important to go as far up the stack as you can, to figure out what can be changed. True north metrics are all that matters; beyond that, everything should be rethought whenever you hit a scaling problem.

Chen: To my point, I would say learn with your partners, and be simple and stupid in the beginning. I can say this: three years back, while building Michelangelo, nobody on the team even understood what supervised learning meant, or why categorical features need string indexing. We thought, "A decision tree? How can a decision tree handle categorical features? We don't know." Our data scientists know that. I think to make a system really useful: one, learn with your partners, understand their needs; and two, advocate once you understand that, so you can educate more people. More or less, you eventually become a central hub: you absorb knowledge from different users, and then you advocate the same to more users. Then you make the practice more unified across the whole company. That makes your life easier, and makes your customers' lives easier as well.

Quantifying the Impact of Platforms

Participant 6: Some of you have built platforms for machine learning. I was going to ask, how do you quantify the impact of those platforms you’ve built on the productivity of your practitioners, your data scientists, or ML people, or whatever, and what part of that platform had the biggest impact on that matter?

Moderator: That’s a good one. What part of your platform had the biggest impact, and how do you know?

Chen: Yes, how do you quantify it? I don't think this is really a machine learning problem; I'm coming from the data org. Originally, the whole data org idea, because we work very closely with the data scientists, was discussed in terms of data scientist to data engineer ratios, and I think that still applies. Basically, as one single data engineer, how many different data scientists can you support? How many different use cases can you support?

Rangwala: I would just second that. Most of the time it's either the metrics, or the productivity of the machine learning engineers that you can improve. At times, it's measured by how long it takes to actually build a model or iterate on a model. One of the metrics sometimes used is the number of experiments, or the number of models, that you can put into production and iterate on over a given period of time. Those are the usual metrics to measure success.

Samuels: Yes, I don't have much else to add, other than another thing we were looking at is how fast you can add features to the model. That was another big effort on our part to improve our infrastructure, and when we got that down to a shorter amount of time, that was how we knew we were doing a better job.

Muppalla: One more thing is, once you start getting the logs that are the results of your experiment, how soon can you make them available for research and analysis, and are there tools that support this automatically? The smaller the number of systems an engineer has to touch to get results, the better.

Wills: I hate panel questions where everyone agrees, I’m going to disagree vociferously. Everyone else is totally wrong. No, I’m kidding, everyone else is totally right. The thing I will chime in is, data scientists are great, but they don’t necessarily write the most efficient code in the world, I think the dollars and cents impact of bringing in data engineering and infrastructure was to watch the AWS costs fall precipitously. We stopped letting them write some of the ridiculous things they were trying to do. We didn’t really need quite so many whatever x32 758 terabyte instances to fit models anymore. That was a good thing.

Moderator: Yes, I like the idea of AWS costs, it’s the ultimate metric we try to optimize.

Wills: It’s very satisfying, it’s more money.

Machine Learning with Different Amounts of Data

Moderator: We have five minutes, so I’m going to ask a last question for the panel. A lot of you went here and did presentations about how you do machine learning at this humongous scale. Is this inherent to the problem? Is machine learning always part of the machine learning at scale sentence structure? Or, if I’m a small company and I have a bit of data, can I still do machine learning on my tiny bits of data?

Wills: The answer is yes, 100% yes. Those people are not at these conferences, though, because they're just trying to keep their company open. They're just trying not to run out of money; if they were here, I'd be like, "What the hell are you doing here? Get back to work." All of us have the fortune that I don't think any of our companies would go out of business if we stepped away for the day.

I think it’s a biased sample we’re getting. I would say, honestly, my friends who do that who have small series A, C companies doing machine learning on small data, are far cleverer and work much harder than I do to do things on large data. Large data makes a lot of problems much easier, definitely not without challenges, not without hard stuff, but I think, generally speaking, it’s honestly just way easier. Yes, it very much can be done, and I’m sure there’ll be an incredibly valuable next wave of startups that’ll come out of that. Then in a couple of years, we’ll be able to come to QCon and talk about it.

Moderator: Can you tell which ones so we can all go look for jobs while the options are still good?

Wills: I guess it's hard. Personally, I'm a big fan of Robin Healthcare, which is doing some very cool stuff. Go check out Robin, a tiny Series A startup out in Berkeley that is doing automatic transcription of doctors' notes to enter the information into the EMR for them, so the doctor doesn't have to spend half their week entering information into the EMR. This is a very hard problem; they have a very small amount of data, and they have to be much cleverer than I am to figure out how to do it.

Moderator: Oh yes, doctor handwriting is the worst, that’s an amazing problem to solve.

Wills: Precisely, it’s general A.I. level stuff. Yes, I’m trying to figure out how to profit from the fact that I just mentioned them right now. Nevertheless. Yes, I mean there’s a lot of that stuff, I don’t really know half of it.

Samuels: I don’t have too much to add here, because I’ve just mostly been working on things at scale, but it seems like there’s a lot of really interesting problems when you get to this scale, in dealing with all of the data, and wrangling it. I feel sometimes the hard problems are also the organizations that you have to deal with at your company, in terms of who you have to interface with, and who you have to work with, and what are their systems doing, and how do they interact with yours? How can you coordinate with other teams so that you’re all doing the same thing, instead of reinventing things in different pockets of your company? At scale, I feel like those are the other types of problems you have to deal with, not just the technology, but the org and the people.

Moderator: The organizational scale.

Samuels: Yes.

Rangwala: I think it's both easy and difficult. Generally, what we have seen is that when you have a lot of data, sometimes even simple models are able to perform really well. With a small amount of data, you will probably have to use different machine learning techniques compared to the ones you use when you have large amounts of data. In a way, the problem shifts: when you have a small amount of data, the machine learning techniques that you use to learn from it are the challenging part. When you are at high scale, sometimes it is dealing with the large amount of data that becomes the bigger challenge. However, you also reach the point of diminishing returns, which means that once you have tried a certain model, even at high scale, you won't be able to discover new things; you would probably have to try new techniques, like deep learning, to discover them.

Moderator: Can you even do feature engineering at small scale? Is that even a thing?

Chen: Sure, why not? To me, it's not about big or small, it's about the thinking style, about how you solve your problem. I think we're on a journey to changing people's minds about how to understand and solve their problems. I see machine learning really as a part of data-driven applications. We're here changing people's minds, brainwashing; that's how I see it.




The Importance of Community in Data Science


Article originally posted on Data Science Central. Visit Data Science Central

Nobody is an island. Even less so a data scientist. Assembling predictive analytics workflows benefits from help and reviews: on processes and algorithms by data science colleagues; on IT infrastructure to deploy, manage and monitor the AI-based solutions by IT professionals; on dashboards and reporting features to communicate the final results by data visualization experts; as well as on automation features for workflow execution by system administrators. It really seems that a data scientist can benefit from a community of experts!

 

The need for a community of experts to support the work of a data scientist has ignited a number of forums and blogs where help can be sought online. This is not surprising, because data science techniques and tools are constantly evolving, and it is mainly the online resources that can keep up with the pace. Of course, you can still draw on traditional publications, like books and journals. However, they help in explaining and understanding fundamental concepts rather than in answering simple questions on the fly.

 

It doesn't matter what the topic is, you'll always find a forum to post your question and wait for the answer. If you have trouble training a model, head over to the Kaggle Forum or Data Science Reddit. If you are coding a particular function in Python or R, you can refer to Stack Overflow to seek help. In most cases, there will actually be no need to post any questions because someone else is likely to have had the same or a similar query, and the answer will be there waiting for you.

 

Sometimes, though, for complex topics, threads on a forum might not be enough to get the answer you seek. In these cases, some blogs could provide the full and detailed explanation on that brand new data science practice. On Medium, you can find many known authors freely sharing their knowledge and experience without any constraints posed by the platform owner. If you prefer blogs with moderated content, check out online magazines such as Data Science Central, KDnuggets or the KNIME Blog.

 

There are also a number of data science platforms out there to easily share your work with others. The most popular example is definitely GitHub, where lots of code and open source tools are shared and constantly updated by many data scientists and developers.

 

Despite all of those examples, inspiring data science communities do not need to be online, as you can often connect with other experts offline as well. For instance, you could join free events in your city via Meetup or go to conferences like ODSC or Strata, which take place on different continents several times each year.

 

I am sure there are many more examples of data science communities which should be mentioned, but now that we have seen some of them, can you tell what a data scientist actually looks for in all those different platforms?

 

To answer this question, we will explore four basic needs data scientists rely on to accomplish their daily work.

1. Examples to Learn From

Data scientists are constantly updating their skill set: algorithm explanations, advice on techniques, hints on best practices, and most of all, recommendations about the process to follow. What we learn in schools and courses is often the standard data analytics process. However, in real life, many unexpected situations arise, and we need to figure out how to best solve them. This is where help and advice from the community become precious.

 

Junior data scientists exploit the community even more to learn. The community is where they hope to find exercises, example datasets, and prepackaged solutions to practice and learn. There are a number of community hubs where junior data scientists can learn more about algorithms and best practices through courses on site, online, or even a combination of the two, starting with the dataset repository at UC Irvine, continuing with the datasets and knowledge-award competitions on Kaggle, through to educational online platforms such as Coursera or Udemy. There, junior data scientists can find a variety of datasets, problems and ready-to-use solutions.

 

However, blind trust in the community has often been identified as a problem of the modern web-connected world. Such examples and training exercises must bear some degree of trustworthiness, either from a moderated community, where the moderator is responsible for the quality of the material, or via some kind of review system self-fueled by community members. In the latter, the members of the community evaluate and rate the quality of the training material offered, example by example. Junior data scientists can therefore rely on the previous experience of other data scientists and start from the highest-rated workflow to learn new skills. If the forum or dataset repository is not moderated, a review system is necessary for orientation.

2. Blueprints to Jump-Start the Next Project

Example workflows and scripts, however, are not limited to junior data scientists. Seasoned data scientists need them too! More precisely, seasoned data scientists need blueprint workflows or scripts that they can quickly adapt to their new project. Building everything from scratch for each new project is quite expensive in terms of time and resources. Relying on a repository of close, adaptable prototypes speeds up the proof-of-concept (PoC) phase as well as the implementation of the early prototype.

 

As is the case for junior data scientists, seasoned data scientists make use of the data science community, too, to download, discuss and review blueprint applications. Again, rating and reviewing by the community produces a measure for the quality of each single blueprint.

3. Giving Back to the Community

It is actually not true that users are only interested in the free ride, in this case meaning free solutions. Users have a genuine wish to contribute back to the community with material from their own work. Often, users are more than willing to share and discuss their scripts and workflows with other users in the community. The upload of a solution and the discussion that can ensue have the additional benefit of revealing bugs or improving the data flow, making it more efficient. One mind, as brilliant as it may be, can only achieve so much. Many minds working together can go much farther!

 

This concept reflects the open source approach of many data science projects in recent years: Jupyter Notebook, Apache Spark, Apache Hadoop, KNIME, TensorFlow, scikit-learn and more. Most of those projects developed even faster and more successfully just because they were leveraging the help of community members by providing free and open access to their code.

 

Modern data scientists need an easy way to upload and share their example workflows and projects, in addition to, of course, an option to easily download, rate and discuss existing ones already published online. When you offer an easy way for users to share their work, you’d be surprised by the amount of contributions you will receive from community users. If we are talking about code, GitHub is a good example.

4. A Space for Discussions

As we pointed out, the main advantage for an average data scientist of uploading his/her own examples to a public repository, besides, of course, the pride and self-fulfillment of being a generous and active member of the community, lies primarily in the corrections and improvements suggested by fellow data scientists.

 

Assembling a prototype solution to solve the problem might take a relatively short time. Improving that solution to be faster and more scalable, and to achieve those few additional percentage points of accuracy, might take longer. More research, study of best practices, and comparison with other people's work are usually involved, and that takes time, with the risk of missing a few important works in the field.

 

Therefore, data scientists need an easy way to discuss with other experts within the community to significantly shorten the time for solution improvement and optimization. A community environment to exchange opinions and discuss solutions would serve the purpose. This could take place online on websites like the KNIME Forum or offline at free local Meetup events.

 

A Community Data Science Platform

Those are the four important social features that data scientists rely on while building and improving their data science projects.

 

Data scientists could definitely use a project repository interfaced with a social platform to learn the basics of data science, jump-start the work for their current project, discuss best practices and improvements, and last but not least, contribute back to the community with their knowledge and experience.

 

Project implementation is often tied to a specific tool. Wouldn’t it be great if every data science tool could offer such a community platform?

Paolo Tamagnini contributed to this article. He is a data scientist at KNIME, holds a master’s degree in data science from Sapienza University of Rome and has research experience from NYU in data visualization techniques for machine learning interpretability. Follow Paolo on LinkedIn.



Presentation: Papers in Production Lightning Talks


Article originally posted on InfoQ. Visit InfoQ

Transcript

Shoup: I'm going to share very little of my personal knowledge, in fact, none of it, but I'm going to talk about a cool paper that I really like. Then Gwen [Shapira] is going to talk about another cool paper and Roland [Meertens] is going to talk about yet another cool paper. The one I want to talk about is a paper around using machine learning to do database indexing better. This is a picture of my bookshelf at home. A while ago, I bought myself a box set of "The Art of Computer Programming", which has basically all of computer science's algorithms, written or assembled by Don Knuth. There's volume 4A, so he's still working on completing the thing; hopefully, that will happen. Anyway, there are a lot of algorithms out there in the world and a lot of data structures.

When we’re choosing a data structure, typically we’re choosing it in this way, we are trying to look for time complexity, how fast is it going to run, and space complexity, how big is it going to be? We typically evaluate those things asymptotically, we’re not looking as much at real-world workloads, but looking at what are the complexity characteristics of this thing at the limit when things get very large? We’re also, and this is critical, looking at those things without having seen the data and without having seen typically the usage pattern.

What we're doing is saying, what is the least-worst time and space complexity, given an arbitrary data distribution and an arbitrary usage pattern? It seems like we could do a little better than that; that's what this paper is about. What we'd like to be able to ask, and to answer, is how we could achieve the best time/space complexity given a specific real-world data distribution and a specific real-world usage pattern. That's one of the things we're going to talk about in this 10-minute section.

The Case for Learned Index Structures

The fantastic paper is from a bunch of people at Google, and people visiting Google; it's called "The Case for Learned Index Structures". There are three things that I would like you to take away from this paper. The first is the idea that we're going to be able to replace indexes with models; that's going to be pretty interesting: there's a duality between an index or a data structure and a machine-learned model. Next, we're going to talk about how we might be able to use machine learning to synthesize a custom index that takes the specific characteristics of a specific data distribution and a specific usage pattern into account, and how we might be able to combine them using machine learning. Then finally, the thing that's most exciting to me is that this really opens up a whole new class of approaches to fundamental computer science algorithms.

Let’s first talk about replacing indexes with models. The first key idea of the paper which is actually in the first sentence of the abstract, is an index is a model. What do they mean? Let’s use a couple of examples of really common data structures, a B-tree which we use to search and store things in database indexes, you can think of that as a predictive model that given a key, finds a position in a sorted collection. We can think of that as those things are dual. You can think of a hash-map as, again, a predictive model that takes a key and finds a position of something in an unsorted collection. Then you could look at a bloom filter as simply a binary classifier, a bloom filter is a way of deciding cheaply, is something in our set or not in our set, and you can have false positives but not false negatives, etc.

The wonderful idea, though, is that there's this duality between indexes and models. How they show it in the paper is this: you might think of a B-tree index as, given a key, you go into the B-tree and you find the position of the record that you're interested in; a learned index could replace that triangular B-tree symbol with a square model, which could be a neural net, a regression, or something else. That's a wonderful thing. You go, "All right, that's kind of cool, but what am I going to do with that?"
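To make the duality concrete, here is a minimal learned index sketch in the spirit of the paper: fit a simple model that maps a key to its approximate position in a sorted array, remember the worst-case error over the data, and fall back to a short local search within that error band. This is a toy single-model version for illustration, not the paper's recursive model index:

import bisect

class LearnedIndex:
    # Toy learned index over a sorted list of numeric keys: a least-squares
    # line approximates the CDF (key -> position), and the maximum training
    # error bounds how far the true position can be from the prediction.

    def __init__(self, sorted_keys: list[float]):
        self.keys = sorted_keys
        n = len(sorted_keys)
        xs, ys = sorted_keys, list(range(n))
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs) or 1.0
        self.slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
        self.intercept = mean_y - self.slope * mean_x
        self.max_err = max(abs(self._predict(x) - y) for x, y in zip(xs, ys))

    def _predict(self, key: float) -> int:
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key: float) -> int:
        # Predict, then do a bounded local search within the error band.
        guess = self._predict(key)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        pos = bisect.bisect_left(self.keys, key, lo, hi)
        return pos if pos < len(self.keys) and self.keys[pos] == key else -1

The paper's version replaces the single line with a small hierarchy of models, but the structure of predict-then-bounded-search is the same.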

Let's go to the next step, which is automatically synthesizing custom indexes using machine learning. One of the things we could use machine learning for, which is pretty straightforward, and a bunch of other people have talked about it here, is to select which data structure we're using; that's pretty cool. Then we might be able to think about how to use it to tune the data structure; hyperparameter tuning is very much the same idea applied to machine-learned models, and we can use machine learning to see how best to tune our other machine-learned models. Given a traditional data structure, we might be able to use machine learning to help us figure out something better. Even cooler than that, we could use machine learning to assemble a custom structure that takes advantage of a particular data distribution.

We might be able to choose different data structures for different parts of the distribution. One part of the distribution might be very sparse, another part might be very dense; one part might be integers, another part might be something else. We could assemble a custom structure that puts all those things together. Then we could actually use a hierarchy of models, a kind of ensemble or mixture-of-experts approach, where the hierarchy first chooses the next model, and then the next model, and then finally, at the end, the actual structure.

Cleverly, we could default to the traditional structure if we weren't able to improve upon it with machine learning. This is what it looks like in the paper: imagine, in this example, a three-tiered situation, where a key comes into an initial model, it chooses the next model, those choose the next model, and then finally we find the position. Well, this seems like a lot of work to replace some really fast, highly optimized data structure that we use all the time. Is there any way this could work in a performant way? It turns out, yes. The paper uses a specific example of a highly cache-optimized, production-quality B-tree implementation versus a two-stage learned index with a neural net on the top and linear models at the bottom, and what they got was actually a net increase of 70% in speed and a 10x to 100x improvement in size. That's pretty cool, that's not a bad day's work.

The thing that I think is even more exciting is just the idea of replacing the guts of aspects of important parts of a database with machine-learned approaches, so an entirely new approach to how we might improve on performance characteristics of stuff inside a database. One of the things they point out is that what we’re actually doing with that tree of models is we’re learning a model for the cumulative distribution function, the CDF of the input. We can use that CDF model to build an indexing strategy like I talked about, we could also use it to build an insertion strategy, a storage layout, we could do cardinality estimation and join optimization. This is all stuff that database researchers have been doing for a long time, we could also actually compute approximate queries.

The same authors did a subsequent paper later in the year, which they called SageDB, where they are using this idea of building a model in the center and using it for query optimization, for data access improvements, for sorting, for advanced analytics, all this great stuff. What's very cool about using machine-learned approaches here is that not only will this stuff improve as we improve the machine learning techniques (that's one dimension of improvement), it will also be able to take advantage of improvements in algorithms and data structures in the traditional computer science world, because we'll just have better base things to assemble. Even more importantly, this stuff scales with GPU and TPU performance, which is continuing to improve versus Moore's law and CPUs, which are flat. I leave you with this totally awesome paper, "The Case for Learned Index Structures".

Toward a Solution to the Red Wedding Problem

Shapira: I'm going to talk about a paper called "Toward a Solution to the Red Wedding Problem". The paper is almost as good as the name, but you really cannot top that name. The content of the paper is around solving a very unique problem around compute at the edge, so I'm going to start by giving some background about what we're doing at the edge at all, talk about a specific problem, talk about the proposed solution, and then talk about some things we can do in the future, all of that in 10 minutes.

What do we mean when we talk about the edge? If you think about most web architecture these days, there are the main data centers that have databases, and we call them origins. If everyone connected to that main database, we would have a lot of problems. It would be slow, because some people live very far away from the database. It would also become a big bottleneck; our entire global capacity would be limited to what one database can provide, and that's a problem. So people take different locations around the world, put a big cache there, and now people can talk to the cache. That's fantastic for reading, because you read data that is nearby.

If you're writing, on the other hand, the writes still have to end up in the origin data center for consistency. Usually, you write to the cache there and the request is rerouted to the data center, so all the problems that we have with one data center still exist. These days, we have a new solution and a new problem: we added compute to the edge. Most CDN providers, and AWS CloudFront, allow you to run small serverless functions on the edge, so you can do cool stuff like resize images or reroute requests for AB testing or add some headers, etc., but it's all limited to small and, most importantly, stateless stuff, because there is no state at the edge. You have some cache and you have no control over what is actually in there.

What is the Red Wedding Problem? The Red Wedding Problem is what happens after a particularly popular episode of "Game of Thrones" airs on TV, as happened this Sunday, I think it was Sunday, I don't actually watch it myself. Immediately after the episode airs, everyone apparently runs to the wiki and tries to update the episode page or their favorite character page and starts editing. Now, there is also a read spike, but we already know how to handle a read spike; what about the write spike? How do we handle a huge spike in writes? This is the Red Wedding problem: the Red Wedding episode just aired; what do I do with all those writes?

In a nutshell, the solution is very simple: we need to allow concurrent updates on the edge. That's the way to solve it; don't take all the writes to the data center, because we said that's going to have problems. The devil is clearly in the details. The theoretical solution is absolutely awesome, because if we can do that, it means we have low-latency writes, everyone writes locally, and we get rid of a big honking bottleneck in our database. It sounds amazing, if only we can solve it.

There are obviously some challenges. One of them is that, as we said, the edge only runs serverless, stateless stuff; how can we maintain state there? The other part is that if everyone's updating at their own edge location, and everyone is working on the same character page, how are we going to merge all those things together? It turns out that the second problem is actually way easier, which was incredibly unexpected when I read the paper. It turns out that there is something called CRDTs, which are basically data structures made for merging. As long as you use CRDTs for editing, you're guaranteed to be able to merge them almost conflict-free at the end. I'll talk about which CRDTs to use when I get to the end of this talk. Surprisingly, the hardest part was how to actually maintain state on the edge.
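The simplest example of the merge idea is a grow-only counter. This sketch is not the JSON CRDT the paper relies on, just an illustration of why CRDT merges commute and converge; the replica names are made up.

class GCounter:
    # Grow-only counter CRDT: each replica increments only its own slot;
    # merge takes the element-wise max, so merging in any order converges.
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

# Two edge replicas accept writes independently, then merge conflict-free.
a, b = GCounter("edge-us"), GCounter("edge-eu")
a.increment(3)
b.increment(5)
a.merge(b)
assert a.value() == 8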

Here's what the solution looks like – I'm sorry for the mess – reads work as usual. Writes are now handled at the edge, where we're running all those Lambda functions; each of them maintains its own state, but they also communicate, so you talk to the other Lambda functions in the same point of presence, in your same location, and you do it through a queue, so you still propagate your state. If we're in the same location, I have my state, you have yours, and we communicate with each other, which means that when a new serverless function starts, it can very rapidly recover state from someone else running in the same location. Every once in a while, we take all this data accumulated on the edge and merge it back into our data center, and that's how we maintain state in the data center. Those batches are what make it efficient.

In a bit more detail, we use containers to deploy our application and database. In this case, and I'll show you in a second, the application and database are actually the same thing; it's embedded. Then we recover state from other replicas running in the same location, we use CRDTs for merging, and we also use anti-entropy to make sure that mistakes don't accumulate and propagate. Implementation – that's all I'm going to say about implementation – they've taken a distributed peer-to-peer key-value store written in Erlang, embedded it in Node.js, and used AWS Lambda to run it on the edge. This thing should be illegal and probably is, in many states; whatever you do, do not take this paper to production. This is not the right paper to take to production, it needs some productizing first.

The main problems they ran into while implementing come from the fact that they're running it on AWS Lambda on the edge; half the paper is one limitation after another and how that impacts the solution. You basically have stateless functions, that's all you have. You cannot bind to a port, which means the original plan of how we talk to each other in the same location went completely out the window, and they talk to each other using a queue. As you can see in the last point, there is also not really a queue on the edge, so they talk to each other via a queue in the nearest data center. Then you also don't control concurrency; ideally, you would have a lot of Lambdas running at the same time to serve the workload, but they basically had zero control over that, so they could theoretically lose their entire state accidentally.

Invocations don't last long; they found weird ways to keep the Lambda functions running for longer so they could actually perform their anti-entropy cleanup work. If you want to hear hack after hack on top of a really good idea, a really amazing idea, this is the paper for you. How is it actually useful? Well, for one thing, a lot of IoT and AI is moving to the edge because that's where all the data is, so the idea that you can use CRDTs to merge state on the edge and avoid this global bottleneck seems incredibly useful, something that I'm totally going to play with.

If you need CRDTs, there are JSON CRDTs, and since JSON is kind of the global data structure we all sadly converged upon, this allows us to basically use CRDTs for whatever it is we need. The JSON CRDT paper by Martin Kleppmann is dense, but amazing in its own right, and he also has a good talk about it. Lambda@Edge is clearly terrible and awful, and I hope either AWS fixes it or someone else will show up and provide a service that has good serverless computing on the edge but actually allows you to maintain state, bind to ports, and do normal stuff.

As I read the paper, I was just thinking, "Hey, I have Kafka Streams, it has a local store with RocksDB, and RocksDB supports merges. We have Kafka as a broker, so I can actually recreate this in a better way than what they did with Lambda@Edge." If anyone wants to beat me to it, it sounds like a fun weekend project. CRDTs on the edge for the win, everyone. Next up is Ronald [Meertens], who builds autonomous vehicles, self-driving cars, for a living; that has to be interesting.

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

Meertens: I am indeed Ronald Meertens and I work on self-driving vehicles; I think it's the most interesting AI problem of this century. If you agree, come talk to me; if you don't agree, come also talk to me. I'm going to present a paper called A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Let me first show you what kind of data this can work on.

Imagine that you have this kind of data. As a human, it's pretty clear to see that these things probably belong together, but it's not like a normal distribution you expect. Same for these kinds of pictures, where you note that these data points probably belong together, or that those data points belong together. You can't really fit it to a normal distribution; this is a case where you can see what belongs together by looking at the points around each point. That's why the DBSCAN algorithm is so nice and elegant, because it's an unsupervised algorithm designed to find clusters of points that are close together and thus likely belong together. Because it's unsupervised, there's no need for any training step; you just have to select the right parameters.

In case you're now thinking, "Oh man, it must be really difficult to find these parameters" – ha-ha, no. The other thing which is really good is that it is great at finding outliers or anomalies, because it is not really influenced by them, unlike, for example, k-means, where if you have an anomaly, your mean for a cluster is dragged away. The only two parameters you have are the distance within which you want to search for neighboring points and the number of points that need to be close together to be part of the same cluster.

Let's go into that; it's clustering by simply counting, and this is basically the whole DBSCAN algorithm. Points are either core points, border points, or noise/anomalies. If a point, like point A in that picture, has N or more points around it, it is a core point. Points which can be reached from a core point, but don't have enough points around themselves to reach enough other points, are border points, and they belong to the same cluster. Points that are not close to any core point and don't have N or more points around themselves are classified as noise or anomalies. DBSCAN, the density-based scan algorithm, recurses through the connected components to assign the same label to connected points: once you notice that a point is a core point, with N or more points around it, you basically recurse to each of the points around it and see whether they belong to the same cluster or not. That's basically the whole algorithm.
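As a sketch of that counting-and-recursing idea, here is a minimal, unoptimized DBSCAN in Python; eps is the search distance and min_pts the neighbor count, the two parameters mentioned above. In practice you would reach for a library implementation such as scikit-learn's.

import numpy as np

def dbscan(points, eps, min_pts):
    # Returns a cluster label per point; -1 marks noise/outliers.
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)

    def neighbors(i):
        return np.flatnonzero(np.linalg.norm(pts - pts[i], axis=1) <= eps)

    cluster = -1
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            continue                  # not a core point; stays noise unless a cluster reaches it
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:                  # flood-fill through connected core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border points join the cluster too
            if not visited[j]:
                visited[j] = True
                j_neighbors = neighbors(j)
                if len(j_neighbors) >= min_pts:
                    queue.extend(j_neighbors)
    return labels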

Because I have some time left, I also wanted to go into some of the applications and how you can use it, because the cool thing about this algorithm is not only that it's really simple, it's also super versatile and can be used for many applications. For example, you could use it as an outlier detector for a group of servers, where you try to group them together based on how much RAM or CPU they use, and you can find some outliers there.

You can use it for clustering cells in biology, for example. You could cluster movies for a recommendation engine: if people like this movie and there are many movies close to it, probably you like these other movies as well. You can cluster crops in satellite or drone images, and you can find separate objects on the road in LiDAR data. Overall, it's a very powerful tool to have in your software engineering toolbox.

I wanted to go into this example of finding objects on the road. One thing you have in self-driving cars is this really cool sensor setup on the top. Those things are called LiDARs, and they consist of multiple lasers which basically spin around very fast. These give you a very large number of accurate measurements, but they are unstructured points in space, or you can treat them that way. The main challenge for self-driving cars is, of course, what objects are on the road and where they are. The first challenge is to find which separate objects are on the road; as you probably guessed, you can use DBSCAN for that. Then there is a bonus challenge of determining what these objects are. If you just run DBSCAN, on the left you see the output; I gave every connected cluster a different color. You can see the DBSCAN algorithm is able to find – I'm sorry, you have to have a lesson in how to read these things – that every group of points is actually one car, or a tree in this case. That's the nice thing about the DBSCAN algorithm; it's pretty versatile, so you can also use these points as input for another algorithm.

I asked a student, Christian, to do that, and he came up with this, where he tried to draw or paint every car green, every tree orange, and everything which is not related to any relevant object blue. He also made a video, so that's cool; let me see if I can actually play that. Here you see a car driving on the road, it's in the center actually, and you can clearly see the trees in this image, which are all colored orange, and you can see that all the cars become green. You can even use DBSCAN as input for your cooler deep learning algorithm if you want.

To summarize, I think it's a really powerful tool in a machine learning toolbox; I use it very often for a lot of the machine learning problems I encounter. It needs no training, and it's easy to reason about the arguments you want to pass to it. For example, in this LiDAR case, you could say, "Well, I want to find all the objects whose points have no more than 20 centimeters of space between them, and I want to remove all the noise, so if there are only three LiDAR points hitting an object, it's likely noise." It's really easy to reason about that. It gives you the clusters and the outliers, making the algorithm even more versatile, and it can be used as a first step in a longer chain of processing.
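That 20-centimeter reasoning translates directly into the parameters of scikit-learn's implementation, assuming a point cloud measured in meters; the synthetic "car" and "tree" blobs below are made up for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

# Fake LiDAR-style point cloud: two dense objects plus scattered noise (x, y, z in meters).
rng = np.random.default_rng(0)
car = rng.normal([10.0, 2.0, 0.5], 0.1, size=(200, 3))
tree = rng.normal([15.0, -3.0, 2.0], 0.15, size=(150, 3))
noise = rng.uniform([0.0, -10.0, 0.0], [30.0, 10.0, 5.0], size=(20, 3))
cloud = np.vstack([car, tree, noise])

# eps=0.2 -> points within 20 cm belong together; min_samples=4 -> tiny blobs count as noise.
labels = DBSCAN(eps=0.2, min_samples=4).fit_predict(cloud)
print("clusters found:", labels.max() + 1, "| noise points:", int(np.sum(labels == -1)))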

Questions and Answers

Participant 1: My question is for Randy [Shoup]. You talked about time complexity that you showed.

Shoup: That’s the table of results, that was an empirical result, yes.

Participant 1: Did they think about how to analyze time complexity for that design in a classic way? Come up with an asymptotic analysis for those, so that we have tight bounds in the more theoretical way, as opposed to just, "Oh, we'll use this real data and this is how much faster it is"?

Shoup: I think I understand the question; it was real world, and you should read the paper. They used real-world data, actually three separate datasets, one of them from Google, I forget which, and a real-world production implementation of a B-tree, and then obviously a real-world implementation of their two-layer model. I think they're not asserting a different complexity class, if that's what you're saying. They're just asserting that they were able to beat the performance of something that was good.

Participant 2: My question was more about that, because the anxiety that I have about using that approach is how can I guarantee that my running time is going to be [inaudible 00:26:05]?

Shoup: Oh, yes, we should talk about it. You're worried about the worst case, I see what you're saying. You ran it 10 times, what if you ran it an 11th time and it was longer? Yes, they do address that a little bit in the paper. We should talk about that offline, it's a great question.

Participant 3: I’m wondering if you use a model as an index, which sounds like a fascinating idea, what about the problem of drift? As data accumulates in your database, the structure of it may change over time.

Shoup: I think there are two aspects to that. About the drift, that can actually be an advantage. If you think about it (again, in 10 minutes I didn't summarize all the cool ideas of the paper), one of the other cool ideas is that you can imagine retraining the algorithm, retraining your mixed model or hybrid model. You could imagine, "Ok, at some point in time, I've got something that feels optimal", and then do it again the next day.

The cool thing that I implied in the abstract but didn't say explicitly is that usually the only way to get a custom data structure is to hand-write it, and that's why people don't do it; but if you can automate this process and use machine learning to do it, then you can run it however often you like. That's one part of the drift, so you can actually take advantage of that. The other thing you might be implying is that machine learning models are inexact, versus traditional data structures. They do talk about ways of using auxiliary structures to deal with that. If I didn't answer your question, then we can talk more about it.

Participant 3: Well, I was thinking more of the first one: in the beginning, you have this database and you put things in it, and maybe because you have beta customers, they're all in a certain geographical region or there's some characteristic where the indexes work for them. Then all of a sudden you have all these new users, or there's a Red Wedding and suddenly something happens, so now the indexes, I mean the virtual indexes, the learned way of finding things quickly, are failing, because whatever you're using as your features in that model is not as relevant anymore.

Shoup: Yes, you're basically saying it's overfit, so I think all the techniques around being resilient to overfitting would work. There's another thing that they talk about in a subsequent paper; the examples that they give are read-only structures. What they're showing is for static data, so we'll start with that. What you're really saying is, which they talk about in the subsequent paper, what if you inserted stuff into it and changed it over time? The idea there, which I'm not sure I'm going to be able to articulate well, is that if you could model the ideal CDF, the real domain from which the input is coming, that's what you're looking for in the insert/update case. You can model the actual CDF in the read-only case and then you can finish; it doesn't matter that it's overfit because the data is not changing. I don't know if that made any sense, but we can talk more if it doesn't.

Participant 4: I had a question. Are they updatable models? Like the B-tree …?

Shoup: That’s what I’m saying.

Participant 4: Like for each update.

Shoup: In the first paper, they only talk about read-only structures. Then obviously the orange-site people are like, "This is useless because I update my data structures". I don't think I did a good job of explaining it, but because I don't fully understand, I'll be open. In the read-only case, you can model the very specific, exact data distribution and get a near-optimal thing. That's clear. What you're saying is, "Hey, man, that thing's going to change over time". What you're looking for then is a harder-to-train model, but ultimately the model is about predicting the theoretical or general CDF as opposed to the actual one. Does that make sense?

Participant 4: Basically, if it’s read-only, the training accuracy is 100, perfect. Perfect training accuracy, right?

Shoup: You overfit like, yes, dude, I overfit. That’s the goal.

Participant 5: It’s a question for Gwen [Shapira]. Can you describe the mod, [inaudible 00:30:55] and talk about, I didn’t catch exactly what it is?

Shapira: The CRDTs?

Participant 5: Yes.

Shapira: Yes, I love to talk about CRDTs. Do we have an hour? The general idea is that, if you think about it, we're all here pretty much engineers: I make changes to my code, and you make changes to your code, and we both git commit. And 90% of the time when we run git merge, it just seamlessly, magically merges. If we edited two different files, not a problem; if we edited two parts of the same file, not a problem; unless we really edited the same line, still not a problem. You can think about this generalized: if you have an array and someone inserts something and someone else inserts something else, you can figure out that, hey, it's two inserts, and so on.

Basically, CRDTs are data structures that, in addition to the data structure, support the merge operation. The JSON CRDTs support that on generalized JSON, which is pretty cool. You can imagine that if I were the owner of a wiki and really, really worried about concurrent writes, I would deal with it this way: if someone edited two different parts of the page, not a problem at all. If two people actually edit the exact same sentence and I try to merge it, I will probably end up with a mess; no matter how I try to solve it, there is no good solution for two people editing the same sentence.

It's a wiki, so store the mess; somebody will notice it and fix it an hour later. It's not a bank, so it's not a big deal, basically. Some stuff can also be smoothed over by just shrugging and saying "not a big deal". I want to again refer people to a talk that Martin Kleppmann did, I think both at QCon and GOTO, on the subject of JSON CRDTs and how he's basically building a text editor that is distributed and supports the same idea. That's a good example of how I imagine the whole wiki CRDT would work.

Participant 6: I have a couple of questions. Regarding self-driving, is there any hope of solving self-driving without LiDARs, or is that crazy?

Meertens: That's a difficult question, because they are trying, so they obviously believe that there is. The big thing is that if you work with camera only, everything you see is an estimation; if you use LiDAR, you have a measurement. The big problem with these LiDARs is that they are super expensive. That's why Tesla wants to try to make something without them, because obviously they want to sell consumer cars. The big difference is that we, at our company, want to build robo-taxis, where you can offset the cost of a single vehicle over much higher usage throughout the day than Tesla can, where you only use your Tesla to drive to work and back. Obviously, we as humans can do it without any exact measurements, but we use a lot of intrinsic knowledge about how the world looks, how it behaves, etc. I think in the short term, using LiDARs is a very good bet. In the long term, we don't know.

Participant 7: I also have a question for Gwen [Shapira], which is maybe more of a funny question. Do you think serverless functions are popular because they are the enterprise version of Ethereum?

Shapira: I think they're popular because they're kind of easy to write, they're kind of easy to run, and at 99.99% of scales, it's actually quite cheap. It's not cheap when you reach the tail end, especially if you do what they did in the paper, where these functions continuously communicate from the edge to a data center; then the costs totally pile up easily. But when you begin, it's cheap, it's easy, and everyone loves cheap and easy.

Participant 8: I have a question for Randy [Shoup]. I think this goes back to the question that was asked before. If you're building a model in order to optimize your data structure, by definition your model is statistical in nature. It seems like there is a fundamental problem here, because you cannot provide guarantees, because it's statistical. If you do try to provide a guarantee, you will have to overfit your model, which means a new data point that comes in for access will not be able to get the same kind of performance, so it's not just a problem where you can get away with it. You can only have either of the two.

Shoup: Yes, I do suggest you read the paper, I’ll separate these two ideas. If you’re in a read-only situation, overfitting is your goal. We’ll start with that, I have a static set of data and all I want to do is serve read-only stuff. Actually, overfitting is my goal there. Do we agree on that?

Participant 8: Again, we don't exactly know what the read pattern is. The key here is that the distribution of usage should not change from the distribution over which the model was trained, because if the distribution of usage changes, then…

Shoup: Then you might get something slightly less optimal, or there might be something that's better. Nobody is asserting that it's- maybe I used the word optimal- nobody's asserting that it can't be improved upon. Obviously, to your point, if anything changes, all bets are off, or a lot of the bets are off. But the point that you make, and the reason why I separated this, is that this other part is dealt with in the paper. You said statistical, but I'm going to say back approximate: you're saying, and it's true, that the vast majority of machine-learned models give me approximate answers, not exact answers.

This is not my paper, but part of the beauty of the paper is that they describe that. In a B-tree, the prediction that we're making is to find a page; we're not finding the record, we're finding the page, and then we look in the page. Ditto for a hash table: what do we do with hash collisions? Lots of keys hash to the same bucket, but then we have the linked list, or whatever of the million cuckoo-hashing-style techniques, to deal with those collisions. In both cases, the existing data structures are already approximate. Does what I'm saying make sense? Yet another, for me, totally mind-blowing, "Oh, my God, you're right." Yes. I had the same reaction: there's no way you could do that with a model. "Oh my God, you can't do that."

I can't encourage you enough to read the paper, because that's dealt with in there. In the Bloom filter section, the Bloom filter does need to be exact or else it's totally useless: exact in the sense that I can get false positives, but I cannot get false negatives, or else I shouldn't do it at all. What they suggest there, because the only approaches I could think of were approximate, is that there would be a side Bloom filter for the outliers. Anyway, you should read the paper. All your points are well taken; I encourage you to read the paper because they're dealt with in one way or another in there. Awesome.

Participant 8: Also, I think you showed that it’s like the cache optimized B-tree, but there are clustered indexes, probably passing the model. I’m assuming cluster B-tree [inaudible 00:39:45]

Shoup: Yes, the orange-site people are like, "There are compressed structures." Yes, and the machine-learned model can take care of the compressed structure. "There are succinct structures." Yes, we can take advantage of that too. Anyway, read the paper, it's a great read; the takeaway is, "Wow, what a cool paper. Let's read it."




DevOps Needs Continuous Improvement to Succeed

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Continuous improvement is not a new thing and is often misunderstood. To be successful, we can take guidance from agile principles and apply them to the DevOps world, argued Mirco Hering, managing director at Accenture. At Agile Portugal 2019 he spoke about DevOps leadership in the age of agile.

Hering believes we are way beyond the point where DevOps and the associated practices like Continuous Delivery are not well enough understood, yet we don’t see many organisations reach a mature state of DevOps. He mentioned two themes that are preventing organisations from making faster progress.

The first one is a tendency to declare success too early; adopting DevOps practices is hard work and a lot of the benefits come with the last 20% rather than the first 80%. After all, a semi-automated process is not that much better than a manual process, argued Hering.

The second theme is around organisations adopting the old approach of transformation to DevOps, which means they either try to reorganise themselves out of their problems, or do the equivalent of an As-is To-Be analysis to define their transformation plan for DevOps. Both approaches are destined to fail, said Hering. Rather than charting out the whole DevOps journey, we need to accept that only rigorous continuous improvement will allow you to make significant progress; things will change all the time and you need to be able to react to it in an agile fashion.

Hering stated that organisations tend to lack two things in order for continuous improvement to really work: the discipline to follow the more rigorous process, and the data to set baselines and measure improvements.

Hering stated that continuous improvement is often misunderstood. Many organisations create a list of things to improve, track whether or not they have implemented them, and call that continuous improvement. "This is just not good enough", he said, "it often leads to increasingly long checklists for people to follow because adding a step in that list is an easy way to feel like you improved something. Does this actually improve anything? Who knows…"

We are getting better though, said Hering, as there's a lot more talk about gathering and using data from our DevOps tools now. Open source solutions like Hygieia and efforts by established vendors indicate increased investment in this space. Hering encouraged companies to take control of their data and find ways to use it during their transformation.

Hering spoke about applying ideas from the agile world to DevOps at Agile Portugal 2019. In his talk, he gave some examples:

When I first worked with Agile, one thing that resonated strongly with me was the principle-based approach. If you understand the principles, then methods will follow. I prefer this very much over the "war of frameworks" that seems to have broken out in some parts of the Agile world.

Systems thinking was such an eye-opener for me when I got introduced to it in my first agile trainings. Here is where this applies to DevOps I feel: if we allow every team to create their own tooling and approaches to DevOps then we have optimised for those teams to the detriment of the overall organisation. It is very difficult to maintain an architecture when every team can do what they want. So what constraints and alignment do we require to optimise the whole system instead of just improving each team.

Continuous improvement is not a new thing; there are practices such as the Deming cycle dating back to the previous century. Hering shared how to use the Deming cycle as a scientific method for continuous improvement:

We have an idea about what can improve the situation. We are designing an experiment to see whether we are right and agree what data we need to validate that. Then we implement our idea and compare the results afterwards. Whether it worked or not, in either case we have learned something that influences the next experiment in our continuous improvement journey. For example, that increasing code coverage is not actually improving our quality any longer at some stage.



Benefits of Microsoft’s New Versions of Azure Application Gateway and the Web Application Firewall

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

In a recent blog post, Microsoft discusses the benefits of the generally available releases of the Azure Application Gateway V2 Standard SKU and the Web Application Firewall (WAF) V2 SKU. Microsoft fully supports them with a 99.95% SLA, and they bring significant improvements and new capabilities.

Azure Application Gateway is part of the Microsoft Azure networking services portfolio that includes Azure Load Balancer, Azure Traffic Manager, and Azure Front Door. The service is a web traffic load balancer and Application Delivery Controller (ADC) in Azure that enables customers to manage traffic to their web applications. The service provides customers with layer 7 load balancing, security and WAF functionality. 

Next, the Web Application Firewall is integrated with the Azure Application Gateway core offering and further strengthens the security portfolio and posture of applications, protecting them from many of the most common web vulnerabilities, as identified by the Open Web Application Security Project (OWASP) top 10.

In April, Microsoft announced the release of both Azure Application Gateway V2 and Web Application Firewall (WAF) V2. Subra Sarma, principal program manager, Microsoft Azure, states in the blog that customers should move forward to the V2 SKUs:

We highly recommend that customers use the V2 SKUs instead of the V1 SKU for new applications/workloads.

To help customers with migrating to V2 gateways with the same configuration, Microsoft created a PowerShell script along with documentation that helps replicate the settings on a V1 gateway to a new V2 gateway.

With the new V2 releases, customers can benefit from:

  • Autoscaling – allowing elasticity for their applications by scaling the application gateway as needed based on the application’s traffic pattern.
  • Zone redundancy – enabling their application gateway to survive zonal failures and offering better resilience.
  • Static VIP – ensuring that their endpoint address will not change over its lifecycle.
  • Header Rewrite – allowing them to add, remove, or update HTTP request and response headers on their application gateway.
  • Faster provisioning and configuration update time.
  • Improved performance for their application gateway, helping them to reduce overall costs.

Source: https://azure.microsoft.com/en-us/blog/taking-advantage-of-the-new-azure-application-gateway-v2/

Note that if customers use WAF functionality for their web applications with the Azure Application Gateway V2 Standard SKU, they will need to have one per region and manually scale it as well. Alternatively, customers can use Azure Front Door to get global reach and autoscaling.

The Standard_v2 and WAF_v2 SKUs are available in the Americas (North Central US, South Central US, West US, West US 2, East US, East US 2, Central US), Europe (North Europe, West Europe, Southeast Asia, France Central, UK West), and Asia (Japan East, Japan West) regions. Furthermore, Microsoft will add more regions in the future.

Lastly, Sarma states in the blog that customers will have lower costs with the V2 SKUs. The pricing details are available on the pricing page.



A Comprehensive Guide to Data Science With Python

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

A Hearty Welcome to You!

I am so thrilled to welcome you to the absolutely awesome world of data science. It is an interesting subject, sometimes difficult, sometimes a struggle but always hugely rewarding at the end of your work. While data science is not as tough as, say, quantum mechanics, it is not high-school algebra either.

It requires knowledge of Statistics, some Mathematics (Linear Algebra, Multivariable Calculus, Vector Algebra, and of course Discrete Mathematics), Operations Research (Linear and Non-Linear Optimization and some more topics including Markov Processes), Python, R, Tableau, and basic analytical and logical programming skills.

Now if you are new to data science, that last sentence might seem more like pure Greek than simple plain English. Don't worry about it. If you are studying the Data Science course at Dimensionless Technologies, you are in the right place. This course covers the practical working knowledge of all the topics given above, distilled and extracted into a beginner-friendly form by the talented course material preparation team.

This course has turned ordinary people into skilled data scientists and landed them with excellent placement as a result of the course, so, my basic message is, don’t worry. You are in the right place and with the right people at the right time.

What is Data Science?


To quote Wikipedia:

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. Data science is the same concept as data mining and big data: “use the most powerful hardware, the most powerful programming systems, and the most efficient algorithms to solve problems.”

From Source

More Greek again, you might say.

Hence my definition:

Data Science is the art of extracting critical knowledge from raw data that provides significant increases in profits for your organization.

We are surrounded by data (Google "data deluge" and you'll see what I mean). More data has been created in the last two years than in the last 5,000 years of human existence.

The companies that use all this data to gain insights into their business and optimize their processing power will come out on top with the maximum profits in their market.

Companies like Facebook, Amazon, Microsoft, Google, and Apple (FAMGA), and every serious IT enterprise have realized this fact.

Hence the demand for talented data scientists.

I have much more to share with you on this topic, but to keep this article short, I’ll just share the links below which you can go through in your free time (everyone’s time is valuable because it is a strictly finite resource):

You can refer to An Introduction to Data Science.

Article Organization


Now as I was planning this article a number of ideas came to my mind. I thought I could do a textbook-like reference to the field, with Python examples.

But then I realized that true competence in data science doesn’t come when you read an article.

True competence in data science begins when you take the programming concepts you have learned, type them into a computer, and run it on your machine.

And then; of course, modify it, play with it, experiment, run single lines by themselves, see for yourselves how Python and R work.

That is how you fall in love with coding in data science.

At least, that’s how I fell in love with simple C coding. Back in my UG in 2003. And then C++. And then Java. And then .NET. And then SQL and Oracle. And then… And then… And then… And so on.

If you want to know, I first started working in back-propagation neural networks in the year 2006. Long before the concept of data science came along! Back then, we called it artificial intelligence and soft computing. And my final-year project was coded by hand in Java.

Having come so far, what have I learned?

That it’s a vast massive uncharted ocean out there.

The more you learn, the more you know, the more you become aware of how little you know and how vast the ocean is.

But we digress!

To get back to my point –

My final decision was to construct a beginner project, explain it inside out, and give you source code that you can experiment with, play with, enjoy running, and modify here and there referring to the documentation and seeing what everything in the code actually does.

Kaggle – Your Home For Data Science


 

If you are in the data science field, this site should be on your browser bookmark bar. Even in multiple folders, if you have them.

Kaggle is the go-to site for every serious machine learning practitioner. They hold competitions in data science (which have a massive participation), have fantastic tutorials for beginners, and free source code open-sourced under the Apache license (See this link for more on the Apache open source software license – don’t skip reading this, because as a data scientist this is something about software products that you must know).

As I was browsing this site the other day, a kernel that was attracting a lot of attention and upvotes caught my eye.

This kernel is by a professional data scientist by the name of Fatma Kurçun from Istanbul (the funny-looking ç symbol is called c with cedilla and is pronounced with an s sound).

It was quickly clear why it was so popular. It was well-written, had excellent visualizations, and a clear logical train of thought. Her professionalism at her art is obvious.

Since it is open source software released under the Apache license, I have modified her code quite a lot (a diff tool shows over 100 changes) to come up with the following Python classification example.

But before we dive into that, we need to know what a data science project entails and what classification means.

Let’s explore that next.

Classification and Data Science


 

So supervised classification basically means mapping data values to a category defined in advance. In the image above, we have a set of customers who have certain data values (records). So one dot above corresponds with one customer with around 10-20 odd fields.

Now, how do we ascertain whether a customer is likely to default on a loan, and which customer is likely to be a non-defaulter? This is an incredibly important question in the finance field! You can understand the word, “classification”, here. We classify a customer into a defaulter (red dot) class (category) and a non-defaulter (green dot) class.

This problem is not solvable by standard methods. You cannot create and analyze a closed-form solution to this problem with classical methods. But – with data science – we can approximate the function that captures or models this problem, and give a solution with an accuracy range of 90-95%. Quite remarkable!

Now, again we can have a blog article on classification alone, but to keep this article short, I’ll refer you to the following excellent articles as references:

Link 1 and Link 2

 

Steps involved in a Data Science Project

A data science project is typically composed of the following components:

  1. Defining the Problem
  2. Collecting Data from Sources
  3. Data Preprocessing
  4. Feature Engineering
  5. Algorithm Selection
  6. Hyperparameter Tuning
  7. Repeat steps 4–6 until error levels are low enough.
  8. Data Visualization
  9. Interpretation of Results

I could explain each of these terms, but for the sake of brevity, I will ask you to refer to the following articles:

and:

Steps to perform data science with Python- Medium

At some time in your machine learning career, you will need to go through the article above to understand what a machine learning project entails (the bread-and-butter of every data scientist).

Jupyter Notebooks


To run the exercises in this section, we use a Jupyter notebook. Jupyter is short for Julia, Python, and R. This environment uses kernels of any of these languages and has an interactive format. It is commonly used by data science professionals and is also good for collaboration and for sharing work.

To know more about Jupyter notebooks, I can suggest the following article (read when you are curious or have the time):

 

Data Science Libraries in Python


The standard data science stack for Python has the scikit-learn Python library as a basic lowest-level foundation.

 

The scikit-learn Python library is the standard library most commonly used in data science with Python. Along with numpy, pandas, matplotlib, and sometimes seaborn, this toolset is known as the standard Python data science stack. To know more, I can direct you to the documentation for scikit-learn, which is excellent: the text is lucid and clear, and every page contains a working live example as source code. Refer to the following links for more:

Link 1 and Link 2

This last link is like a bible for machine learning in Python. And yes, it belongs on your browser bookmarks bar. Reading and applying these concepts and running and modifying the source code can help you go a long way towards becoming a data scientist.

And now, on to the problem we will apply this stack to.

Our Problem Definition

This is the classification standard data science beginner problem that we will consider. To quote Kaggle.com:

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


From: Kaggle

We’ll be trying to predict a person’s category as a binary classification problem – survived or died after the Titanic sank.

So now, we go through the popular source code, explaining every step.

Import Libraries

The import lines given below are standard for nearly every Python data stack problem:
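A minimal sketch of those imports; the np, pd, plt, and sns aliases are the usual conventions, not requirements.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns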

 

Pandas refers to the data frame manipulation library. NumPy is a vectorized implementation of Python matrix manipulation operations that is optimized to run at high speed. Matplotlib is a visualization library typically used in this context. Seaborn is another visualization library, at a slightly higher level of abstraction than matplotlib.

The Problem Data Set

We read the CSV file:
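A sketch of the loading step, assuming the Kaggle Titanic train.csv file sits next to the notebook and that the data frame is named train:

# Load the Titanic training data downloaded from Kaggle.
train = pd.read_csv('train.csv')
train.head()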

 

Exploratory Data Analysis

Now, if you’ve gone through the links given in the heading ‘Steps involved in Data Science Projects’ section, you’ll know that real-world data is messy, has missing values, and is often in need of normalization to adjust for the needs of our different scikit-learn algorithms. This CSV file is no different, as we see below:

Missing Data

This line uses seaborn to create a heatmap of our data set which shows the missing values:
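A sketch of what that line most likely looks like, reusing the train data frame from above:

# Yellow cells mark missing values; Age and Cabin stand out.
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show()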

Output:

 

Interpretation

The yellow bars indicate missing data. From the figure, we can see that a fifth of the Age data is missing. And the Cabin column has so many missing values that we should drop it.

Graphing the Survived vs. the Deceased in the Titanic shipwreck:
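A sketch of the count plot, assuming the Survived column of train.csv (0 = died, 1 = survived):

sns.countplot(x='Survived', data=train)
plt.show()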

 

Output:

As we can see in the sample of the data contained in train.csv, more than 500 people lost their lives and fewer than 350 people survived.

When we graph Gender Ratio, this is the result.
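A sketch of the same plot split by gender, using the Sex column as the hue:

sns.countplot(x='Survived', hue='Sex', data=train)
plt.show()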

Output

Over 400 men died, and around 100 survived. For women, less than a hundred died, and around 230 odd survived. Clearly, there is an imbalance here, as we expect.

Data Cleaning

The missing age data can easily be filled with the average of the age values for a chosen category of the dataset. This has to be done because the classification algorithm cannot handle missing values and will fail if the data is not complete.

Output


We use these average values to impute the missing values (impute – a fancy word for filling in missing data values with values that allow the algorithm to run without affecting or changing its performance).
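A sketch of one common way to do this; grouping by passenger class (Pclass) is an assumption borrowed from the usual Titanic kernels, not necessarily the category the original code used:

# Mean age per passenger class, used to fill in missing Age values.
class_means = train.groupby('Pclass')['Age'].mean()

def impute_age(row):
    if pd.isnull(row['Age']):
        return class_means[row['Pclass']]
    return row['Age']

train['Age'] = train.apply(impute_age, axis=1)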

 

 

Missing values heatmap:

 

Output:

 

 

We drop the Cabin column since it's mostly empty.

We convert categorical features like Sex and Name to dummy variables using pandas, so that the algorithm runs properly (it requires the data to be numeric).
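A sketch of the cleaning and encoding steps; the exact columns dropped (Name, Ticket) and encoded (Sex, Embarked) are assumptions based on how this is usually done, not a copy of the original kernel:

# Drop the mostly empty Cabin column and the free-text columns,
# then one-hot encode the remaining categorical columns so everything is numeric.
train = train.drop(['Cabin', 'Name', 'Ticket'], axis=1).dropna()
train = pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)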

 

Output:

 

More Data Preprocessing

We use one-hot encoding to convert the categorical attributes to numerical equivalents. One-hot encoding is yet another data preprocessing method that has various forms. For more information on it, see the link 

 

 

Finally, we check the heatmap of features again:

 

Output

 


No missing data and all text converted accurately to a numeric representation means that we can now build our classification model.

Building a Gradient Boosted Classifier model

Gradient Boosted Classification Trees are a type of ensemble model that has consistently accurate performance over many dataset distributions. 
I could write another blog article on how they work but for brevity, I’ll just provide the link here and link 2 here:

We split our data into a training set and test set.
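A sketch of the split; the 70/30 ratio and the random seed are arbitrary choices:

from sklearn.model_selection import train_test_split

X = train.drop('Survived', axis=1)
y = train['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)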

 

Training:
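A sketch of the training step, using scikit-learn's GradientBoostingClassifier with default hyperparameters:

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()
model.fit(X_train, y_train)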

 

Output:

 

Predicting:
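A sketch of the prediction step on the held-out test set:

predictions = model.predict(X_test)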

 

Output

 

Performance

The performance of a classifier can be determined in a number of ways. Again, to keep this article short, I'll link to the pages that explain the confusion matrix, the classification report function of scikit-learn, and classification evaluation in data science in general:

Confusion Matrix

Predictive Model Evaluation

A wonderful article by one of our most talented writers. Skip to the section on the confusion matrix and classification accuracy to understand what the numbers below mean.

For a more concise, mathematical and formulaic description, read here

 

 

So as not to make this article too disjointed, let me explain at least the confusion matrix to you.

The confusion matrix has the following form:

[[ TP FP ]
 [ FN TN ]]

The abbreviations mean:

TP – True Positive – The model correctly classified this person as deceased.

FP – False Positive – The model incorrectly classified this person as deceased.

FN – False Negative – The model incorrectly classified this person as a survivor

TN – True Negative – The model correctly classified this person as a survivor.

So, in this model published on Kaggle, there were:

89 True Positives

16 False Positives

29 False Negatives

44 True Negatives

Classification Report

You can refer to the link here to learn everything you need to know about the classification report.
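A sketch of how both reports are produced with scikit-learn, using the predictions from the previous step:

from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))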

 

 

So the model, when used with Gradient Boosted Classification Decision Trees, has a precision of 75% (the original used Logistic Regression).

Wrap-Up

I have attached the dataset and the Python program to this document, you can download it by clicking on these links. Run it, play with it, manipulate the code, view the scikit-learn documentation. As a starting point, you should at least:

  1. Use other algorithms (say LogisticRegression / RandomForestClassifier at the very least)
  2. Refer to the following link for classifiers to use: 
    Sections 1.1 onwards – every algorithm that has 'Classifier' at the end of its name can be used – that's almost 30-50 odd models!
  3. Try to compare the performance of different algorithms (see the sketch after this list)
  4. Try to combine the performance comparison into one single program, but keep it modular.
  5. Make a list of the names of the classifiers you wish to use, apply them all, and tabulate the results. Refer to the following link:
  6. Use XGBoost instead of Gradient Boosting
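As a starting point for items 3 to 5, here is a sketch of a small comparison loop; the candidate list and the use of plain accuracy as the metric are arbitrary choices:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForestClassifier': RandomForestClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    score = accuracy_score(y_test, clf.predict(X_test))
    print(f'{name}: {score:.3f}')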

Titanic Training Dataset (here used for training and testing):

Address of my GitHub Public Repo with the Notebook and code used in this article:

Github Code

Clone with Git (use TortoiseGit for simplicity rather than the command-line) and enjoy.

To use Git, take the help of a software engineer or developer who has worked with it before. I’ll try to cover the relevance of Git for data science in a future article.

But for now, refer to the following article here

You can install Git from Git-SCM and TortoiseGit 

To clone,

  1. Install Git and TortoiseGit (the latter only if necessary)
  2. Open the command line with Run… cmd.exe
  3. Create an empty directory.
  4. Copy paste the following string into the command prompt and watch the magic after pressing Enter: “git clone https://github.com/thomascherickal/datasciencewithpython-article-src.git” without the double quotes, of course.

Use Anaconda (a common data science development environment with Python, R, Jupyter, and much more) for best results.

Cheers! All the best in your wonderful new adventure of beginning to explore data science!

Learning done right can be awesome fun! (Unsplash)

If you want to read more about data science, read our Data Science Blogs



Front End Architecture in a World of AI

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

At QCon New York 2019, Front End Software engineer Thijs Bernolet of Oqton explained some of the challenges in creating front end architectures influenced by machine learning.

Looking back at Eternal Moonwalk, created in 2009 by Bernolet and his team, he remarked that at the time it was not possible to easily manage, tag, and edit the uploading of 15,000 videos in three days. But in today's world of machine learning, there are many possibilities and challenges.

According to Bernolet, the primary challenge with machine learning influencing the user interface is the sharing of state between UI code and machine learning logic, overlapping with data models representing users. Foundations of a sound user interface logic typically rely on principles of loose coupling and high cohesion. Machine learning agents tend to impact infrastructure, data models, and business logic, breaking the UI foundational paradigms.

Bernolet explains that traditional UI models such as MVC degrade due to the introduction of tight coupling between the model and view layers. His team started investigating Redux and asked whether Redux could also be used by machine learning agents, with action sequences as training actions.

Bernolet demonstrated his proof of concept for a Redux CLI, and he appreciates the Redux ecosystem support for features like undo/redo, time travel, handling of side effects, and Redux devtools.

Bernolet ran into issues in managing distributed state with Redux, including merging states and race conditions. Explorations included operational transforms (OT) and conflict-free replicated data types (CRDTs). His team started to consider that these challenges might be solved by leveraging git rebase-style operations with OT in the browser, which resulted in a git-js proof of concept.

The talk highlights some of the challenges of working with Redux in distributed state systems. Alternatives to Redux that might solve similar challenges could include solutions based on JSON patch such as @dojo/framework/stores and json-patch-ot.

The combination of Redux and client-side git with OT solves Bernolet and his team’s use case for optimizing the manufacturing process through a combination of user and machine learning inputs. And this approach, had it existed ten years earlier, might have simplified the development of Eternal Moonwalk.



Mini book: The InfoQ eMag: DevOps for the Database

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

A robust DevOps environment requires having continuous integration for every component of the system. But far too often, the database is omitted from the equation, leading to problems from fragile production releases and inefficient development practices to simply making it harder to onboard new programmers. In this eMag, we discuss the unique aspects of databases, both relational and NoSQL, in a successful continuous integration environment.

Free download
