
Build Your Intelligent Enterprise through a Data Fabric

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

The future holds interesting and exciting times for most businesses. With data becoming a major influence on the enterprise of the future, it is only a matter of time before we enter the era of intelligent enterprises.

Intelligent enterprises will be organizations that offer an exemplary customer experience and run efficient operations. The ability to give clients what they want throughout their experience with an organization is what sets an intelligent enterprise apart from the rest.

With competition between organizations expected to increase over time, becoming an intelligent enterprise will be a necessity. Organizations will want to operate efficiently and deliver the most to customers at every stage of their experience.

Ronald van Loon recently had the opportunity to attend SAP Sapphire and speak to some of the most renowned Chief Data Officers (CDO) from across the globe. As part of this venture, both SAP and Ronald van Loon learned a great deal about today's data culture and how the intelligent enterprise of the future can be built through a data fabric.

Main Challenges CDOs Face

CDOs currently face numerous challenges when it comes to building a proper intelligent enterprise. While the aims of the intelligent enterprise of the future are clear in theory, the process of achieving them is complicated in practice.

Some of the challenges that CDOs currently face are:

Data Silos

The Chief Data Officers at the conference identified data silos as one of their major challenges. Data silos impede progress towards the intelligent enterprise, and CDOs must look for solutions.

To minimize the damage caused by data silos, organizations need to curate a proper master data management plan. The plan is meant to get better results from the data at hand and to ensure that those results are consistent with the organization's overall plans.

Data Quality

Data quality is another problem troubling CDOs today. Chief Data Officers feel that the analytics generated from data will only be as good as the data on hand. So, it is imperative for data to be of high quality and properly defined. Quality is best increased or ensured by addressing it at the source of your data.

Data Migration

Moving the data you currently have towards an integrated landscape comes with its own problems. For starters, many organizations are unable to gather all their data in a single, integrated source, and the lack of a single source then leads to the creation of new silos.

Additionally, the data you currently have might be structured for legacy systems. Converting that data can be a tough ask, knowing that you will ultimately have to use it for integration.

Data Volume and Variety

The volume and variety of the data coming through from different sources can be quite intimidating. The sheer volume of the different data types requires proper data analytics mechanisms to make sense of it all.

Understanding a Data Fabric

A data fabric is the culmination of everything required for generating actionable insights from the data you have. The data should be simple, and it must have one source. It should also be intelligent so that it can be used immediately. Finally, the analytics should be fast, so that insights arrive before the data loses its value.

Core Items Needed for Building an Intelligent Enterprise

The process towards building an intelligent enterprise starts with the introduction of the following core items:

Get Rid of Silos

Getting rid of silos is crucial; it frees up your data resources and makes cloud management easier for your organization. You should ensure that all your company data is accessible through one particular source, without any issues whatsoever. By accessing all data through one source, you cut down on inefficiencies and speed up the analytics process.

A combination of cloud, hybrid, and on-premise solutions is a prerequisite for most organizations because of their existing systems. It offers fast access to all of your enterprise data and positions it in one place for you to see.

Centralize Governance and Ensure Data Quality

Data governance is extremely important for an intelligent enterprise. Focus your data management at the source, and make sure that you have quality data inflow coming your way. You can also practice data enrichment methods as part of centralized governance. Data enrichment will help give you superior quality data, without any irregularities whatsoever.

Data quality is an important factor in the overall success of the plan, as it helps your business get the best results from analytics. Your analytics will only be as good as the data you pour into your systems. Poor quality data will ultimately lead to poor analytics, which then leads to poor quality decisions. The whole process is at risk of being derailed if the data that you initially put into the system cannot be trusted for quality.

Data Warehouse in the Cloud

All your business users should be able to access all of the data within your organization in a governed environment. Your business users should also be able to dig deep into a granular level for further understanding of the data and the systems in play.

You can follow some of the renowned examples from the industry here to get a template for what is required of you. With the right moves, you can secure an intelligent enterprise of your choice.

Intelligence

Intelligence is one of the core requirements of an intelligent enterprise. This includes the presence of Machine Learning and Artificial Intelligence systems that help you in making decisions that benefit your cause. This intelligence can augment your drive towards an intelligent enterprise for the future and can ensure that you’re on track to achieve your goals.

How Does a Data Fabric Solve Key Issues for CDOs?

A data fabric can help solve the key issues complicating a CDO's work. To begin, a data fabric helps minimize the presence of data silos: centralized governance grants fast access and limits secluded data. Additionally, metadata management performs the required quality checks and ensures the provision of quality data that can be trusted.

Moreover, organizations may also gain the ability to handle any variety and volume of data without complications, which minimizes the difficulties involved in data migration.

Developments in the Market


SAP HANA Cloud Services | Business Data Platform for the Intelligent Enterprise

SAP launched their enterprise data fabric at Sapphire. It supports companies and their Chief Data Officers in launching a companywide management plan for data in the cloud, hybrid cloud and on premise. The centralization of data helps with the management and creation of an efficient plan of operations. Since efficiency is the most sought-after result from the intelligent enterprise of the future, a data fabric will help get you the desired results. Business users have real-time, easy and governed access, with real-time data anonymization capabilities, to all enterprise data via a data warehouse in the cloud. Fast access and processing are supported by the SAP HANA in-memory database and its new persistent 6 TB memory. This efficiency and intelligence will allow you to give customers their desired experience across the user cycle.

Learn more about intelligent data management and other related details by clicking here.



6 Important Steps to Building a Successful Factory of the Future

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

What is the factory of the future? Is it a synonym for Industry 4.0, or is it a different concept in its own right? Industry 4.0 and the factory of the future might sound similar, but they differ in some ways. To begin with, the factory of the future is an elusive concept that isn't as widely discussed as Industry 4.0.

The factory of the future is meant to be your gateway to the future of AI and IoT. Through exploring my interest in the topic and as an Oracle ambassador, I had the opportunity to speak to Hans Michael Krause from Bosch Rexroth about the future of IoT systems. Michael, who is the Director of Product Management PLC and IoT systems, is an industry expert and knows quite a bit about the subject.

Talking to me, Michael said: "The factory of the future has elements of Industry 4.0, but it is integrally different from it. The factory of the future also talks about distributing your energy via an inductive factory floor. It involves 5G connectivity, and is much more than the bland digitization offered by Industry 4.0. The factory of the future also involves collaborative robots, which count as a new trend in this realm. These are just some of the many dreams that are turning out to be true through the collective efforts of scientists working on the factory of the future."

Michael and I then went on to discuss other topics that will have a bearing on the industry of the future, for example the factory of the future bringing exceptional connectivity. One example in this instance could be that of Oracle ERP Cloud, which connects factory data with back office solutions. Additionally, the concept also entails connectivity from the manufacturing to the selling of the product. Manufacturers, hence, should educate themselves to be on par with the trends around them.

Challenges

There are going to be challenges impeding the path towards the overall implementation of the factory of the future. These challenges include understanding the new technologies ready to make their way into organizations. The necessary priority and resources will be required to better understand, embrace and ascertain how to effectively leverage technologies like blockchain, IoT and AI in ways that create business differentiation and advantage for customers. Moreover, the presence of siloed data within business operations can lead to numerous complications, including difficulty tracking and tracing the life and availability of parts and products within the supply chain. Finally, limited connectivity between the user and the producer can also hinder progress for the factory of the future.

6 Steps for Factory of the Future

As part of our conversation, Michael pointed out the 6 steps Bosch Rexroth defined for the seamless integration of the factory of the future within your setup:

  1. Lean Manufacturing: Lean manufacturing is an efficient system of production that finds its roots in Japan. The manufacturing process is based on limiting any inefficiencies and wastages within the business processes.
  2. Connectivity: Connect manufacturing, production and assembly data to give enhanced visibility of processes.
  3. Generating Data: Generate data with a closed loop approach. Gather data at different phases, and make sure that it is trustable. Closing the loop will drive efficiency of your equipment.
  4. Obtaining Info from Data: Obtaining information from your data is the most important step after you have gathered and stored it. Raw data alone doesn’t hold much business value. It’s the information and patterns extracted from data that gives you an advantage over competitors.
  5. Insights from Information: Once you have obtained information from your data, you can start extracting insights. Use these insights to take decisions that make sense within the business structure.
  6. Fully Flexible Factory: A highly flexible factory in which everything except floor, roof and walls is changeable within days or even hours. The production is independently adjusted to the product to be manufactured.

Jumpstart the Journey to Factory of the Future

Tech-savvy manufacturers, like Bosch Rexroth, have begun to invest in the factory of the future, taking steps towards modernizing them and embracing innovations that drive connectivity between people, processes and machine data. Organizations taking this path can accelerate business, provide improved customer experience and gain competitive advantage, placing themselves in a good position within the manufacturing industry of the future.

You can learn more about the journey required for fully embracing the factory of the future from Oracle.



Microsoft Launches Blockchain-based Decentralized Identity System

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Recently launched in preview by Microsoft, ION is a Decentralized Identifier (DID) network that runs on top of Bitcoin and aims to provide a decentralized identity system and PKI at scale.

Identities are key for most of the things we do in the digital world. As Microsoft vice president Alex Simons writes,

Today, the most common digital identifiers we use are email addresses and usernames, provided to us by apps, services, and organizations. This puts identity providers in a place of control, between us and every digital interaction in our lives.

Decentralized identities promise to change this picture by enabling an ecosystem where large numbers of organizations and individuals can operate securely, while fully preserving their privacy, without any company, organization or group deciding who may participate by controlling identifiers and PKI entries.

ION, short for Identity Overlay Network, is based on the Sidetree protocol, which is backed by the Digital Identity Foundation, a consortium of more than 60 members, including Microsoft, IBM, NEC and others. ION builds on previous efforts that led to the creation of Identity Hubs. The main appeal of ION is its capacity of achieving tens-of-thousands of operations per second, thus overcoming a fundamental limitation of other decentralized identity systems and other systems based on blockchain transactions.

In contrast to bitcoins, identities are not meant to be exchanged nor traded. This made it possible to define the Sidetree protocol in such a way that it can operate at scale without requiring a Layer 2 consensus scheme, trusted validator lists, or other solutions aiming to improve blockchain transaction performance. This is key to understand how Sidetree and ION can achieve such a high operation throughput.

All nodes of the network are able to arrive at the same Decentralized Public Key Infrastructure (DPKI) state for an identifier based solely on applying deterministic protocol rules to chronologically ordered batches of operations anchored on the blockchain, which ION nodes replicate and store via IPFS.
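
The determinism described above is the core idea: any node that replays the same chronologically ordered operation batches arrives at the same identifier state. The sketch below is a simplified, hypothetical illustration of that replay logic written in Python; the operation format and rules here are placeholders and are far simpler than the actual Sidetree/ION protocol.

```python
# Simplified illustration of deterministic state resolution from ordered
# operation batches (hypothetical rules; not the real Sidetree/ION protocol).

def resolve_did_state(batches):
    """Replay chronologically ordered operation batches into a DID -> keys map."""
    state = {}
    for batch in batches:                      # batches are anchored on-chain in order
        for op in batch:                       # each op: {"type", "did", "key"}
            if op["type"] == "create" and op["did"] not in state:
                state[op["did"]] = {"keys": [op["key"]]}
            elif op["type"] == "update" and op["did"] in state:
                state[op["did"]]["keys"].append(op["key"])
            elif op["type"] == "deactivate" and op["did"] in state:
                state[op["did"]]["keys"] = []
    return state

# Any node replaying the same batches in the same order derives the same state.
batches = [
    [{"type": "create", "did": "did:ion:alice", "key": "key-1"}],
    [{"type": "update", "did": "did:ion:alice", "key": "key-2"}],
]
print(resolve_did_state(batches))
```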

ION is just a step towards Microsoft’s vision for decentralized identity, which rests on a few tenets including user-owned identifiers, secure and user-controlled, “off-the-chain” datastores for actual identity data, and more.

Work on ION is not complete: ION is being released as a preview and is still best suited for use by experienced developers only, says Microsoft. In the coming months, though, the ION codebase is expected to evolve rapidly and mature to the point where it can be released publicly on the Bitcoin mainnet.

If you want to play with ION, you can set up an ION node on any machine or use it on Azure.



Data science Coding in a weekend series of books …

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

After testing this idea for the last few months, we have formally launched this concept. The idea of 'Data Science Coding in a weekend' originated from meetups we conducted in London.

The idea is simple but effective: we choose a complex section of code and try to learn it in detail over the weekend. We work backwards, i.e. we drill down into the concepts behind the main ideas. This led to the philosophy I articulated in "Learn machine learning coding basics in a weekend – a new approach", and to the first free book, "Classification and Regression in a Weekend".

The "in a weekend" series of books on Data Science Central can be seen as an online version of our London-based meetups. All the books have a single community HERE. Like a meetup, the books are free to use and the code is open source. We have drawn upon many sources, which we have referenced in the books.

 

For this first book, the steps in the code are:

 

Regression

  • Load and describe the data
  • Exploratory Data Analysis
      • Exploratory data analysis – numerical
      • Exploratory data analysis – visual
      • Analyse the target variable
      • Compute the correlation
  • Pre-process the data
      • Dealing with missing values
      • Treatment of categorical values
      • Remove the outliers
      • Normalise the data
  • Split the data
  • Choose a Baseline algorithm
  • Defining / instantiating the baseline model
  • Fitting the model we have developed to our training set
  • Define the evaluation metric
  • Predict scores against our test set and assess how good it is
  • Refine our dataset with additional columns
  • Test Alternative Models
  • Choose the best model and optimise its parameters
  • Gridsearch
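
These regression steps map naturally onto a short scikit-learn script. The sketch below is only a rough illustration of a few of them (loading, splitting, fitting a baseline, evaluating, and a grid search) on a bundled toy dataset; it is not the code from the book, and the dataset, model, and metric choices are placeholders.

```python
# Minimal sketch of the regression workflow above (illustrative only).
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load and describe the data
data = load_diabetes(as_frame=True)
print(data.frame.describe())

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Choose and fit a baseline algorithm
baseline = LinearRegression()
baseline.fit(X_train, y_train)

# Evaluate against the test set with a chosen metric
preds = baseline.predict(X_test)
print("Baseline MSE:", mean_squared_error(y_test, preds))

# Test an alternative model and optimise its parameters with a grid search
grid = GridSearchCV(RandomForestRegressor(random_state=42),
                    {"n_estimators": [50, 100]}, cv=3)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
```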

 

Classification

  • Load the data
  • Exploratory data analysis
      • Analyse the target variable
      • Check if the data is balanced
      • Check the correlations
  • Split the data
  • Choose a Baseline algorithm
  • Train and Test the Model
  • Choose an evaluation metric
  • Refine our dataset
  • Feature engineering
  • Test Alternative Models
  • Ensemble models
  • Choose the best model and optimise its parameters
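
The classification steps can be sketched in the same spirit. Again, this is only an illustrative outline on a bundled toy dataset, not the book's code; the dataset, models, and metric are placeholder choices.

```python
# Minimal sketch of the classification workflow above (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

# Load the data and check if it is balanced
data = load_breast_cancer(as_frame=True)
print(data.target.value_counts(normalize=True))

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Baseline algorithm, evaluated with a chosen metric (F1 here)
baseline = LogisticRegression(max_iter=5000)
baseline.fit(X_train, y_train)
print("Baseline F1:", f1_score(y_test, baseline.predict(X_test)))

# An alternative (ensemble) model for comparison
ensemble = GradientBoostingClassifier(random_state=42)
ensemble.fit(X_train, y_train)
print("Ensemble F1:", f1_score(y_test, ensemble.predict(X_test)))
```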

 

The second book – coming by next week – is entitled “Azure machine learning in a weekend”.

 

I introduced the book in this blog – Azure machine learning concepts – an introduction. Most of us start learning development using a language like Python or R, but when you work professionally, you typically end up working with a cloud platform. The top three cloud platforms today in terms of market share are AWS, Azure and GCP (Google). These platforms are similar. Our goal is to learn how to develop for these platforms. We start with Azure and then with Google next month.

 

We welcome your comments on the books and approach.

You can download the first free book, "Classification and Regression in a Weekend", and join the community HERE.



10 Areas of Expertise in Data Science

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

The analytics market is booming, and so is the use of the keyword – Data Science. Professionals from different disciplines are using data in their day-to-day activities, and feel the need to master state-of-the-art technology in order to get maximum insights from the data and subsequently help the business grow.

Moreover, there are professionals who want to keep themselves updated with the latest skills, such as Machine Learning, Deep Learning, and Data Science, either to elevate their career or to move to a different career altogether. The role of a Data Scientist has been called the sexiest job of the 21st century, making it too lucrative for most people to turn down.

However, making a transition to Data Science, or starting a career in it as a fresher, is not an easy task. The supply-demand gap is gradually diminishing as more and more people master this technology. There is often a misconception among professionals and companies as to what Data Science is, and in many scenarios the term has been misused for various small-scale tasks.

To be a Data Scientist, you need a passion and zeal for playing with data, and a desire to make numbers talk. It is a mixture of various things, and there is a plethora of skills one has to master to be called a full-stack Data Scientist. The list of skills can be overwhelming enough to make an individual quit, given the enormity of its applications and the continuous-learning mindset the field of Data Science demands.

In this article, we walk you through ten areas of Data Science that are a key part of a project, and that you need to master to be able to work as a Data Scientist in most big organizations.

  • Data Engineering – In any Data Science project, the most important aspect is the data. You need to understand which data to use, how to organize it, and so on. This manipulation of the data is done by a Data Engineer in a Data Science team. Data engineering is a superset of Data Warehousing and Business Intelligence that adds the concept of big data into the mix.

Building and maintaining a data warehouse is a key skill that a Data Engineer must have. They prepare the structured and unstructured data to be used by the analytics team for model building. They build pipelines that extract data from multiple sources and then manipulate it to make it usable.

Python, SQL, Scala, Hadoop, Spark, etc., are some of the skills that a Data Engineer has. They should also understand the concept of ETL. Data lakes in Hadoop are one of the key areas of work for a Data Engineer, and NoSQL databases are often used as part of the data workflows. A Lambda architecture allows both batch and real-time processing.

Some of the job roles available in the data engineering domain are Database Developer, Data Engineer, etc.

  • Data Mining – Data mining is the process of extracting insights from data using certain methodologies so the business can make smart decisions. It uncovers previously unknown patterns and relationships in the data. Through data mining, one can transform data into various meaningful structures in accordance with the business. The application of data mining depends on the industry: in finance, it is used for risk or fraud analytics; in manufacturing, product safety and quality issues can be analyzed with accurate mining. Some of the techniques in data mining are path analysis, forecasting, clustering, and so on. Business Analyst and Statistician are some of the related jobs in the data mining space.
  • Cloud Computing – A lot of companies these days are migrating their infrastructure from local servers to the cloud, largely because of the ready availability of resources and the huge computational power that is not always available on a local system. Cloud computing generally refers to the implementation of platforms for distributed computing. System requirements are analyzed to ensure seamless integration with present applications. Cloud Architect and Platform Engineer are some of the jobs related to it.
  • Database Management – Rapidly changing data makes it imperative for companies to ensure accuracy in tracking the data on a regular basis. This granular data can empower the business to make timely strategic decisions and maintain a systematic workflow. The collected data is used to generate reports and is made available to management in the form of relational databases. The database management system maintains links among the data and also allows newer updates. The structured format of databases helps management look for data in an efficient manner. Data Specialist and Database Administrator are some of the jobs in this area.
  • Business Intelligence – The area of business intelligence refers to finding patterns in the historical data of a business. Business Intelligence analysts find the trends for a data scientist to build predictive models upon. It is about answering not-so-obvious questions; Business Intelligence answers the 'what' of a business. Business Intelligence is about creating dashboards and drawing insights from the data. For a BI analyst, it is important to learn data handling and to master tools like Tableau, Power BI, SQL, and so on. Additionally, proficiency in Excel is a must in business intelligence.
  • Machine Learning – Machine Learning is the state-of-the-art methodology for making predictions from data and helping the business make better decisions. Once the data is curated by the Data Engineer and analyzed by a Business Intelligence Analyst, it is provided to a Machine Learning Engineer to build predictive models based on the use case at hand. The field of machine learning is categorized into supervised, unsupervised, and reinforcement learning. The dataset is labeled in supervised learning, unlike in unsupervised learning. To build a model, it is first trained with data so that it can identify patterns and learn from them to make predictions on unseen data. The accuracy of the model is determined by the metric and the KPI, which are decided by the business beforehand.
  • Deep Learning – Deep Learning is a branch of Machine Learning which uses neural networks to make predictions. Neural networks work in a way loosely inspired by our brain and build predictive models differently from traditional ML systems. Unlike classical Machine Learning, Deep Learning requires no manual feature selection, but huge volumes of data and enormous computational power are needed to run deep learning frameworks. Some common Deep Learning frameworks are TensorFlow, Keras, and PyTorch (a short illustrative sketch follows this list).
  • Natural Language Processing – NLP or Natural Language Processing is a specialization in Data Science which deals with raw text. Natural language or speech is processed using several NLP libraries, and various hidden insights can be extracted from it. NLP has gained popularity in recent times with the amount of unstructured raw text being generated from a plethora of sources, and the unprecedented information that such natural data carries. Some applications of Natural Language Processing are Amazon's Alexa and Apple's Siri. Many companies also use NLP for sentiment analysis, resume parsing, and so on.
  • Data Visualization – Needless to say, presenting your insights, either through scripting or with the help of various visualization tools, is essential. A lot of Data Science tasks can be addressed with accurate data visualization, as charts and graphs surface enough hidden information for the business to take relevant decisions. Often, it is difficult for an organization to build predictive models, and so they rely only on visualizing the data for their workflow. Moreover, one needs to understand which graphs or charts to use for a particular business, and keep the visualization simple as well as informative.
  • Domain Expertise – As mentioned earlier, professionals from different disciplines are using data in their business, and its wide range of applications makes it imperative for people to understand the domain to which they are applying their Data Science skills. The domain knowledge could be operations-related, where you would leverage the tools to improve business operations focused on financials, logistics, etc. It could also be sector-specific, such as Finance, Healthcare, etc.
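
As a rough, hedged illustration of the Deep Learning point above, the sketch below defines and trains a tiny Keras network on synthetic data; the architecture, data, and hyperparameters are placeholders chosen only to keep the example self-contained, not a recommended setup.

```python
# Tiny illustrative Keras network (placeholder example, not a tuned model).
import numpy as np
import tensorflow as tf

# Toy data: 1,000 samples with 20 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

# A small fully connected network; features are learned, not hand-selected
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy]
```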

Conclusion –

Data Science is a broad field with a multitude of skills, and technology that needs to be mastered. It is a life-long learning journey, and with frequent arrival of new technologies, one has to update themselves constantly.

Often it can be challenging to keep up with such frequent changes. Thus it is advisable to learn all these skills, and to be a master of at least one particular skill. In a big corporation, a Data Science team would comprise people assigned to different roles such as data engineering, modeling, and so on. Thus focusing on one particular area would give you an edge over others in finding a role within a Data Science team in an organization.

Data Scientist is the most sought-after job of this decade, and it will continue to be so in years to come. Now is the right time to enter this field, and Dimensionless has several blogs and training courses to get you started with Data Science.

Follow this link, if you are looking to learn more about data science online!



Presentation: Federated Learning: Rewards & Challenges of Distributed Private ML

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Transcript

Tramel: Today we’re going to be talking about federated learning. What this is going to start with is just a very high-level birds-eye view of federated learning and what it can mean, and then we’re going to jump into some of the challenges about what happens on the ground. Like Mike was saying, I’m really excited to be talking about this here in this crowd, because as you heard in the introductory speech this morning, this is really this intersection between software engineering, ops, machine learning, and trying to find that nice space in the middle, that nice interface.

Federated learning is a real crucible because it brings together even more, so it’s really an interface between data science, machine learning, engineering, DevOps, software data, and security engineering, and bringing all this together in one spot. If you don’t think you have any organizational weaknesses between these skill sets in your company, you’re going to find out real quick that you just haven’t found them yet, once you start taking on a federated project. Despite some of the challenges that we’re going to talk about today, I think you’ll see that there’s some real rewards for this.

Just a little bit of introduction. Who am I, why am I talking about this? I’m Eric Tramel, I’m a Ph.D. researcher, and I lead our federated learning R&D team at Owkin. What is Owkin? I can give you just one-minute pitch, we’re like the littlest best biotech that you’ve never heard of. We’re located in Europe and here in the U.S., and one of the things that we’re really proud of building through a lot of hard work of our team is our Owkin loop, which is a medical research network that brings together 30 hospitals and research institutions. Our goal at Owkin is really to advance clinical research and improve the pace of research, especially in oncology, by bringing AI and ML to researchers. Federated learning is one of the ways that we see to do that, so, how can we bring together high-quality clinical research data sets?

Machine Learning Today

let’s talk a little bit about federated learning, so, to start with, what is machine learning today? Machine learning today looks like gather and analyze the situation. For a little bit of this introductory part of the talk, I’m going to be framing things in terms of mobile devices, but this applies broadly if you have computation available on the edge of your network. What does machine learning look like today? We start with acquisition, we have a bunch of devices, we have users, participants in our systems that are providing data. Maybe we’re generating the data ourselves, but somehow, we have data that’s split between many different sources.

To run conventional machine learning pipelines, we need to bring all this data together because the compute needs to happen where our data is. If we take a look at the mobile setting, what do we need to do? We need to acquire this data, so, we need to get our users to agree to send us that data, which may be more or less complicated, and then, we need to pay the cost of bringing and gathering all that data in one spot. Then also too, if we want fresh data, we need to continually do this, bring more and more data to us.

Once we have the data, we have a number of costs and risks that come up in just maintaining data that we’ve gathered from multiple sources. If we’re dealing with data from users, we have, for example, GDPR compliance in the EU where we work a lot, which is saying, a user can ask, “Oh, I don’t want my data to be used in your system anymore, please remove it.” Now you have to dig through all of your infrastructure and find out where that data is, remove it from everything, and then give a certification that you’ve actually done that.

If you’re storing personally identifying information on your side, then you need to make sure that you are following all the proper security protocols, and maintaining that, because there are always threats that loom from people trying to take advantage of those situations. Also, you just have to pay the maintenance of keeping all of this data. Additionally, once you have all this data in one spot, you’re going to need some big compute, potentially, if you have a lot of data and a very complicated problem, then you’re going to need a lot of compute, and one of the common themes in machine learning is you start with the little data, then you do something simple. You get some more data and then you think you might do something a little bit more exciting, so, things start to get more complicated. You need more horsepower and then the cycle continues, you end up using more and more resources, so, you need to be able to scale your computes along with your data.

After that, you have some other things to consider in this gather and analyze. FAANG has kind of gotten a little bit of a bad rap in recent years, potential PR blowback of vacuuming up a lot of user data, and then people wonder, “Well, what happens with it once it’s in your cloud? Where does it go, how is it used? Is it passed on to third parties?” Some particularly big companies got into to some particular hot water about this. Then also too, if you’re dealing with certain kinds of data that are very strongly regulated like health data, maybe this is an extra burden that you don’t really want to take on, but maybe there’s some particular challenges, some tasks in machine learning, something important for your business that you want to do with this data, but the regulations around moving that data and storing that data on your servers make it difficult to manage.

The Federated Approach

All of these give rise to a different kind of approach, which is, what can we do if we don’t move the data? That’s where the federated approach is, so let’s take a look at these same challenges in the federated setting. In the federated setting, the idea is to keep the data where it originated, so, if it’s user data that’s generated on a mobile device, keep it on the mobile device, never take it off that device. If it’s data that’s generated in a hospital, never take it out of the hospital, leave it there.

For machine learning test, the compute is happening where the data is, so now we have some other extra complications, we need to bring the compute to the data. We’ll go into those challenges in just a bit, but let’s think about what is the potential of this federated setting if we can do all of the compute on the edge on the devices and still get high-quality machine learning prediction models?

In terms of acquisition what cost do you pay? Not so much, you’re not moving the data, you’re not uploading it to somewhere, the data is always fresh because you can make requests to the user device, they have their fresh data there. Also, you can limit your user agreement because they’re not passing it off to somewhere and what happens up there, the user can exactly specify, “Oh, I’m ready to use my data in your system now and not tomorrow, but yes, on Thursday.” so, that helps.

In terms of maintenance, you're not maintaining a data store because your users and their devices are the data store. Also you're not moving the PII to yourself, so you're not taking on that extra liability. In terms of compute, while mobile devices, let's say, might be quite limited if you compare them to big beefy GPU clusters, there's still a lot of idle CPU that's available on mobile devices that isn't being used all the time, especially while you sleep, so, there's a lot of extra compute that's available on the edge, and we can use this to train models.

The other nice feature of this is that as you get more devices in your system that have data, they come with their own compute, so you get a little bit of scaling for free. If you get another user, you have more data and more compute. More data and more compute, it goes on and on and on, so you get this “free” auto-scaling. Then because of this limited agreement for the users, they can decide to not participate or to join in the network, and this gives more control to the data owners. This is a real big PR win because you don’t have to appear as the one that’s taking all the data from everybody.

Use Cases of Federated Learning

Let’s talk about a few use cases of federated learning, and maybe you’re already thinking about a few of them. What I’m going to introduce is just a few use cases that I’ve seen a publicized and that I think are particularly interesting. One of the first ones is the use case that was originally popularized by Google, maybe you’ve seen their original blog post on federated learning, and this is what they’re already pushing out to Android devices right now. What they’ve advertised is doing or training language models on personal text data.

Here when we think about potentially hot data that you don’t want to move, your personal text history is not something that you probably feel very comfortable with sending off to Google to do with what they will. A technique like federated learning really offers a lot because if you can train a general nice language model for text prediction on the mobile device, but without moving all that data off, then you suddenly gain a lot. You gain access to potentially one of the largest data sets in the world, and you can use that without having to incur all of the burden of moving that data.

Another nice use case was in the Firefox browser, where there was a study done by an intern on URL ranking prediction. This is another situation where you might have potentially hot data and you don't want to share all of your browsing history with Firefox all the time. What's done here is a very similar situation, one wants to learn a ranking model, but this model can be trained locally on your computer, and then this information is aggregated to train a better overall model.

Another task that we’ve been looking at is collaboration between pharmaceutical companies in the context of drug discovery. Here you have competitors that they say, “We understand that if we had all of our data in one spot together, we could train some really nice representation models that we could use to really speed up drug discovery. However, our data is our lifeblood and we don’t want to share it.” This is billions of dollars’ worth of investment, so, there’s really high incentive to not collaborate, but also some incentive to collaborate. Technologies like federated learning can really offer a lot here, federated learning has enabled one of the first of its kind, grants from the EU that we’ll be leading, which is on providing a federated learning system between 10 different pharmaceutical companies in Europe, and this is really cool to see.

Another use case is in the hospital, maybe trying to train very large models, maybe U-Nets for doing tissue segmentation, and then trying to train these collaboratively between multiple hospitals. Intel had a very interesting paper on this, and in the U.K., King’s College is actually putting together an initiative called their AI Center where they’re going to be doing a federated learning deployment between four different NHS trusts to do these very large radiology machine learning model training.

Finally, another nice one was another [inaudible 00:13:11] startup working on wake-word detection for digital assistance. Here, many people have different ways of pronouncing particular wake-word for these assistants and from all of these, you want to train one very nice model, but we don’t want these devices recording us all the time in our homes. All of these things could be powered by federated learning.

Federated Averaging

How does that work in practice? What are the actual machine learning operations that are happening under the hood? What we’re going to do is take a look at one simple approach, it’s simple, but it’s actually very effective, and in fact, it’s the algorithm that’s powering all of these use cases that I’ve just shown you, and you’re going to be amazed how simple this is. It works like this, we’re just going to break it down really quick. You’re going to start with a model, some initial state, maybe it’s just your random initial state, maybe it’s some other model checkpoint, so, you’re going to start with this at some model server. Quite simply, you’re going to ask for some available workers “Who’s ready to do some training?” Maybe you have a population of devices, if you’re Google, it’s millions, tens of millions of devices, if you’re in our case, it’s tens of research institutions. You’re going to ask which ones are ready and available for training. After this, you’re going to just send down the model checkpoint that you want to train, you just synchronize this to all of your available workers. Subsequently, you’re going to do some steps of some local fit, this could be many, it could be few. This is, in fact, a hyperparameter of your federated learning training.

Based on the data that’s available at each of these different sites, each worker, or mobile device, or research institution, this is going to lead to subtly different models, which are going to be a little bit biased towards the data that’s available in each one of these places, but it’s going to give some representation of the overall data set. After you perform this local training, you’re simply going to take all of these model updates, how much they moved from their initial setting, and report these back to your central model server who’s going to take a look at everything that has been given and average it, the federated averaging. Congratulations, you’re done. After this, you can repeat this for many rounds.
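
To make the round just described a bit more concrete, here is a minimal sketch of federated averaging, assuming a toy linear model represented as a NumPy weight vector and synthetic worker data; it is purely illustrative and not the production implementation of any of the systems mentioned above.

```python
# Minimal federated averaging (FedAvg) sketch: illustrative only.
import numpy as np

def local_fit(weights, X, y, lr=0.1, steps=5):
    """A few steps of local gradient descent on a linear model (squared loss)."""
    w = weights.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w - weights            # report only the update (delta), never the raw data

def federated_round(global_weights, workers):
    """Send the model to each worker, collect local updates, and average them."""
    updates = [local_fit(global_weights, X, y) for X, y in workers]
    return global_weights + np.mean(updates, axis=0)

# Toy setup: three "workers", each holding its own private data
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
workers = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    workers.append((X, X @ true_w + 0.1 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(20):               # repeat for many rounds
    w = federated_round(w, workers)
print(w)                          # approaches true_w without ever pooling the data
```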

Challenges in FL

All of this looks really great on paper, but there are some challenges. Maybe you’ve already seen some of the bottlenecks that exist in a system like this and some potential risks, and I’m going to go through some of those that we’ve found as we’ve been working on these systems. There’s a lot of very practical challenges, and some of these look a little bit daunting, but I’ll say that a lot of these have already been surmounted in production systems. From here it’s a matter of how do we find the most elegant way of dealing with these challenges, and then, how do we provide toolsets that really open up this technique for even more companies? This leads to a no free lunch theorem of federated learning, which is anything that’s worthwhile is worth working for, which you’ve probably heard your father say at some point.

One of the first challenges associated with federated learning is workflows. What does the standard workflow look like for a machine learning engineer? A machine learning engineer, depending on your organization, maybe you also call them a data scientist, is going to sit down and look at a problem and think about how to solve it. What architecture do I need to use? What loss? What am I going to measure, what optimizer? Are there hyperparameters? What’s the pipeline for producing batches? What’s the right features? How am I going to need to augment my data? All of the things that you need to be able to produce a machine learning solution.

They’re going to come up with all this. Maybe they’re going to use PyTorch, maybe they’re going to use TensorFlow, maybe they’re going to use something else that hasn’t been invented yet or something that has, and they’re going to take all of these things together and then say, “Great. I’ve written it. Now run it.” Okay, maybe that works. you can take this, you can run it locally, do a little debug, that’s fine. Maybe you put in a little bit more work, and you containerize this whole thing, you deployed on some other machine learning infrastructure, you deployed on the cloud even, for training continuously on new data, and you do a lot of work here.

What I should say is that these workflows are really focused on a single machine learning runtime. Maybe if you’re doing some extra work in TF Distributed, and you want to spread this out over many GPUs, it gets a little bit more complex, but overall, you’re writing a piece of code that you expect to run in one place. You’re just going to take that code and run it somewhere somehow.

When it comes to federated learning, you’re writing code, but where is it going to run? It’s running remotely somewhere else in an environment that the ML engineer doesn’t necessarily have direct control over or quite understands the specifications for. In this you say, “Okay. Well, also what do they need to produce? Are they producing a script? Are they producing some other kind of a set of code? What is it exactly that they need to build to be able to use it on a federated learning infrastructure?”

This also introduces another interesting point of who is the person that controls the federated learning specific design choices that you’re doing, which is how often do you need to communicate? What style of federated algorithm are you using? Is this the ML engineer that produces the original solution for your problem, or is it somebody else in your organization? Is it someone more on the off-site? Who is making these decisions? This is when I say that you’re going to find out about these organizational weaknesses. It’s this question about, where are the roles when it comes to federated learning that really bring a lot of this stuff to the surface?

What this means is that you need some kind of intermediate description to try to decouple the work of the ML engineer from the work of the federated learning ops side of things. This intermediate description, what it can allow for is saying, “Oh, I want to write my code in PyTorch, I’m going to describe my model in PyTorch. I want to describe my optimizers and PyTorch, and I want to just pass that code off to somewhere, and I just want this to be the one description that I write and have it work everywhere,” or TensorFlow, of course.

One really needs this intermediate description, not just of the training or experiment that you want to run, but really of the operations that you want to do here, operations in terms of the actual machine learning code that’s running. We heard a talk earlier that mentioned the need for a byte code for machine learning, this is where this would be really helpful in the setting of federated learning, some kind of standardization here for intermediate descriptions of machine learning code.

There are some approaches to this already that you can find in the different open-source frameworks. There is the recently released TensorFlow Federated, where some Google engineers invented a whole new language, it isn't even the graph, it's not C code, it's yet another language for describing machine learning operations in a MapReduce format as this intermediate description layer, so the complexity of these systems starts to go up a lot. There's another approach taken in the context of PyTorch by the OpenMined project and their [inaudible 00:21:53] module, where again, it's about trying to abstract the operations that you want to do.

Once you have this intermediate description of your federated learning problem, you also need to think about where it’s getting run. What is actually catching these operations and machine learning operations that you want to do and then actually running them? You have to think about, “Ok, what is the device? What is its context? How does it get created? What are the interpreters that you need to build?” Unfortunately, there’s nothing quite off the shelf for this in the context of federated learning, so, this is something that you have to put together on your own, which is one of the challenges to implementing these systems in the wild.

This elucidates the need for this abstraction layer away from the original ML engineer that described the algorithm that you want to run, because if you're having to do very specific things on your device, like using a kind of TF Lite runtime or some very device-specific libraries to make this stuff happen, this is not something that you want your original data scientists or ML engineer to have to work with or try to muck about with. This is really where the power of that abstraction comes in.

Then finally, you have this other big problem of this orchestrator itself. What is this orchestrator? When a federated learning job is submitted, there’s something that needs to catch that, that needs to go out and find what devices are available on the network. Which ones to accept right now, which ones to reject, how to send to them the commands that need to be run, and this is the role of the orchestrator, this is a pretty heavy task. Just like with the device context, there’s nothing that’s built out there that you can just take off the shelf and use. There’s a very nice description and a reference architecture that Google’s given and they had a technical paper that you can check out in the slides after this talk, it’s linked here at the bottom, where they describe how to build this service, but it’s something that you have to start from the beginning. It can be more or less complicated depending on what exactly you’re trying to build.

Once you’ve done all that and you have your system deployed, and you say, “Ok, we’re ready to start doing some federated learning,” now you have to start thinking about, “Ok, now what’s the workflow for the data scientist that’s working on this kind of a system?” The conventional approach is, you say, “Oh, ok, I want to start working on a particular problem. I’ve got unknown data I’m going to start taking a look at that data, I’m going to try to understand the features. I’m going to try to understand the distributions. I’m going to dig into it so that I can find the right approach to use.” In this setting, this is problematic because you can’t pick up that data and bring it back to your servers to analyze it in such a nice and neat format for your data scientists, so, you need to bootstrap in some way.

A more bootstrapping approach would be to go out to public data to try to make a hypothesis about the kind of algorithm or architecture that you want to use, and it changes the mindset a bit. Working in the federated context, your data scientists can’t have a very rapid iteration of looking at the data and then going back and forth. You really have to take this like a scientific approach of hypothesizing and then making a prediction about what the system would do with this given architectural algorithm, and then making the observations about what it comes back with. It’s a little bit of a slower process, it takes a little bit more work on the front end to think about the kinds of things that you want to deploy to your federated system, which necessitates the need for parallel experimentation. The kind of system that you want to build, you want to build it in such a way that it can really support not just one experiment running on your system, but many, because you’re going to try to say, “Well, maybe this algorithm will work, but I don’t exactly know and it’s going to take some time to understand if it does. So, what about this other architecture, or this one, or maybe at this rate, or with that batch size?” You need to have this parallel experimentation going on in your system.

Then it comes to privacy, and now you think about, "Ok, so what are we moving?" If you've accomplished everything before this, you say, "All right, well, the data is not leaving the device. The data is not leaving the hospital. It's not leaving our mobile device, so, everything is safe." No, you have to think about everything that you're moving, so what are you moving? From the user device is going out some kind of update to a model, which, if you're training some deep conv net, maybe it's hard to try to understand what the specific features are that could be used to identify your users, but maybe you're trying to train a different kind of a model. Maybe it's user sentiment, and you're trying to understand, they like this thing more or not that thing. They really like sushi, they hate dog parks, and this could end up revealing some kind of information.

What one has to do is to try to restrict the access to all kinds of information that are leaving the device, that sounds a little bit challenging. How do you restrict all information coming from this device, because who’s going to see it? Is it a man in the middle? Well, you can encrypt the communications and you can sort of mitigate that risk, but then what about the server at the end? You need your users to trust you and your server that you’re not digging into their information, so how can they trust you?

One of the ways that you can do that is through multiparty computation. This is a technique to encrypt all of the information, in this case, model updates, which are coming from the individual devices, and encrypt them all with individual keys such that the server on the other side can’t decrypt these individual contributions. However, with the right secure multiparty computation protocol, like secure aggregation, what you can decrypt is their average. There are many different flavors of NPC that you could be using and for different kinds of tasks, but what you can guarantee with this kind of approach is you say, “Well, we’re never looking at an individual contribution. We can only see their aggregate or their average. So, we can only see the overall statistic but not the individual.”

This is the way that you can show that you’re not digging into any single one’s personal data, except for the fact that if you only had two users, it’s not such a strong guarantee, but as you have more and more participants in your network, then this aggregation becomes less and less sensitive to an individual. This is actually the kind of system that Google uses in their federated learning production system. The cost of doing this is really just in time, the cryptography is mostly lossless, but there are multiple rounds to this protocol, so you spend a little bit extra time to do this.
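
A toy way to picture secure aggregation is pairwise additive masking: each pair of clients agrees on a random mask that one adds and the other subtracts, so the masks cancel in the sum. The sketch below is a deliberately simplified illustration of that idea, not the actual protocol used by Google; it omits key agreement, dropout handling, and arithmetic over a finite field.

```python
# Toy additive-masking illustration of secure aggregation (not the real protocol).
import numpy as np

rng = np.random.default_rng(0)
updates = [rng.normal(size=4) for _ in range(3)]   # each client's private model update

# Each ordered pair of clients (i, j), i < j, shares a random mask;
# client i adds it, client j subtracts it, so all masks cancel in the sum.
n = len(updates)
masked = [u.copy() for u in updates]
for i in range(n):
    for j in range(i + 1, n):
        mask = rng.normal(size=4)
        masked[i] += mask
        masked[j] -= mask

# The server sees only masked updates; individually they look random,
# but their average equals the average of the true updates.
server_avg = np.mean(masked, axis=0)
true_avg = np.mean(updates, axis=0)
print(np.allclose(server_avg, true_avg))   # True
```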

Collaborative FL in Health

A lot of challenges that we see, a lot of things to try to apply, but I want to talk now a little bit about how these sorts of systems get used in healthcare, and the different kinds of challenges that we have to think about here. This is really what we’ve been seeing and addressing in our experience at Owkin. Why would one want to approach collaborative federated learning in health care? What are the benefits? The first one is quite clear, if you have more data, there’s more that you can do, quite bluntly, but specifically in health, you have situations where, like for rare disease, there’s maybe a few available patients or pieces of data inside each of the hospitals, and so, maybe within one region, or one hospital network, or even within one country, you might not have enough data to build up a very powerful machine learning model to investigate these diseases, but by joining together many more institutions, you can do something really useful.

Another use for this, especially for drug development is to try to generalize across multiple population centers. You can pick up data or make a contract with one single hospital and try to understand how a drug responds in one population center, but this isn’t going to do very well for the drug company that wants to sell its therapies across the entire world. You really need to join data from multiple different sites, but this starts to become quite a legal burden in trying to make many, many individual contracts to do this.

The other thing that federated learning and the hospitals can do is incentivizes hospitals to try to make their health records machine learning ready, because with the right systems, you can give attribution back to the hospitals and say, “Hey, your data was used like this, and it helped in this way, and now your data that’s just been sitting on your servers suddenly has value.” It not just serves the operator of the federated learning system, but it also serves the hospital and the researchers inside of that research hospital too because now suddenly, they have more data sets and that they can make use of. There are still a lot of non-technical challenges associated with this setting just in terms of health regulations, the legal and contracting, fighting innovation departments around intellectual property is complicated. I talked earlier about competition between pharmaceutical companies, but you’d be surprised how much competition there is between hospitals. It’s just astounding, I wouldn’t have thought it was that way, but it is.

What are some of the challenges that we face here? One of the first ones that we saw early on was large data and large models. The data is not moving, but say you're working with volumetric imaging, doing segmentation tasks on whole volumes of MRI or CT; now you're dealing with quite large models, certainly upwards of 500 MB just for the model weights. Now you want to do distributed training, so what do you need to do? Maybe you need to push those weights one way or another, and now you're starting to incur a lot of bandwidth. The thing is that bandwidth in the data center comes for free in some sense, it's just a little bit of time, but bandwidth has other implications when you're in the hospital setting, because that bandwidth is not just powering your machine learning task but other critical systems in the hospital, so you really need to restrict your usage.

What you have to do here is try to develop algorithms that communicate as little as possible: quantization applied to the model, applied to the gradients, changing how you do the distributed learning strategy to reduce the number of communication rounds that you need. All of this is a very big problem in the health setting. The nice thing is that there's some very cool research that's been happening over the past couple of years showing that if you're training deep CNNs on something like ImageNet, you can get something like 10,000x compression. Apparently, there's a lot of information produced when you train a neural network that's really useless, so it's a nice feature.
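
The kind of communication reduction he alludes to is usually some mix of sparsification and quantization of the update before it is sent. Here is a toy sketch of that general idea (my own illustration, not Owkin's method or any specific paper's): keep only the top-k entries of the gradient by magnitude and quantize them to 8 bits before shipping them over the wire.

```python
import numpy as np

def compress_update(grad, k_frac=0.01):
    """Toy top-k sparsification + 8-bit quantization of a gradient array."""
    flat = grad.ravel()
    k = max(1, int(k_frac * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]       # indices of the top-k magnitudes
    vals = flat[idx]
    scale = np.abs(vals).max() or 1.0                  # avoid division by zero
    q = np.round(vals / scale * 127).astype(np.int8)   # 8-bit quantization
    return idx.astype(np.int32), q, scale, grad.shape

def decompress_update(idx, q, scale, shape):
    flat = np.zeros(int(np.prod(shape)), dtype=np.float32)
    flat[idx] = q.astype(np.float32) / 127 * scale
    return flat.reshape(shape)

grad = np.random.randn(1000, 500).astype(np.float32)   # ~2 MB of float32 gradients
idx, q, scale, shape = compress_update(grad, k_frac=0.001)
approx = decompress_update(idx, q, scale, shape)
print("sent values:", q.size, "of", grad.size)          # only ~0.1% of the entries
```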

Another problem in this setting is that the networks are a lot smaller, so it's not tens of millions of users, but really tens of research institutions. Here the effects of biases and heterogeneities can cause bigger problems in your model training, so you really need to take special care in monitoring the progress of training, trying to understand which institutions are pushing in one way or another, and developing your algorithms in such a way that they're robust to these kinds of heterogeneities. This is especially prevalent in hospitals where you have different lab procedures and different procedures for coding. This is like, what are you going to call one disease versus another? All of this can change from hospital to hospital and create big problems when you're trying to do machine learning in health. Another big step here is to make sure that you have consistency in how the data is generated, which, if you own the app or mobile device and you produce your own data, is quite nice, but in this setting, it's quite difficult.

The second-to-last problem is traceability. This doesn't crop up so much in a big consumer setting for federated learning, but when you're dealing with health institutions that have tight regulations on data use, you need to be able to produce logs of what data was accessed, when, and how it was used, and you need to do this in an unforgeable way. At Owkin, the approach that we've taken here is to use technologies like Hyperledger Fabric to track and trace each of the machine learning operations that's happening in the different hospitals, so we can provide that guaranteed record of everything that's occurring. This is also what allows us to give value back to the hospitals, because they can see, "Oh, my data was used like this. Oh, and it boosted that model that way. Nice."

The other big problem here is having to go even the extra mile in terms of privacy. Here it's not enough to give a hand-waving answer like, "We just average over many users and therefore don't worry about it"; one actually needs to be able to give a demonstration of privacy. The challenge is really about saying, "Ok, if patient A is in the data set, we need to be able to demonstrate that the gradients produced and transmitted by that model, and also the final model itself, can't somehow be used to identify that patient A was in that data set." This is what we call membership risk, trying to show that there's no risk in participating in the training set. This can be approached through smart application of differential privacy, some other strong permissioning around the model use, and ideally secure enclaves for all of your machine learning operations, so a big challenge there.

Takeaways

The takeaways for this are: federated learning is a real sea change in terms of approach to machine learning operations, and in my opinion, it's really going to be the future of machine learning on personal data. The implementation is non-trivial, the challenges are there, but they're certainly surmountable, and they already have been surmounted by companies, so it's something that can be done. Lastly, get ahead of the curve: there are some open-source frameworks for starting to toy around with this. If you start to see some ways that federated learning can work in your business model, go ahead, dive in, have your ML engineers or data scientists start taking a look at it using those open-source frameworks and see what can be done, because the tooling is only going to improve as time goes on, and I think it's a bright future for federated learning.

Questions and Answers

Participant 1: You mentioned GDPR requires the person to be able to remove their data from a system. Have you really removed that data from a system if you’ve trained a machine learning model with it?

Tramel: This is something that’s nebulous in terms of both the legislation as well as the legal opinion about what that means, and you’re going to find different interpretations. On the first part, can you remove their data from your data lakes? Yes, you can find that record, trace everything down in the backups, and then get everything out. In the other case of how it’s used for machine learning model, I haven’t seen anything yet that says, “Ok, you need to get rid of your neural network because this one user’s data was used in it,” because there are many other users that also contributed to that model at the same time, and so, what was that one user’s contribution, it’s hard to measure. In this case, you can’t necessarily take the data out of the model, but you could certainly try to retrain the model without it, which happens naturally in the federated learning system if they say, “I’m out. I’m not using it again.” If you relaunch your training, again it’s going to be without the data.

Participant 2: I’m unsure if there’s actually research on this, but is there ability to look at a privacy budget for an individual over time, so to actually implement differential privacy over time with sequential learning in a federated situation?

Tramel: There’ve been some interesting applications of differential privacy to SGD. There’s a very nice SGD paper, and the thing is that the privacy budget is going to be given on a data set by data set basis. You can show the differential privacy accounting for one data set, and see when you want to restrict access to that data set afterwards, but, you do get into some complicated situations because you start to train your model, you measure the privacy budget that was given away for that single model by doing that training on that data set, and then, you need to log that, keep up with it, and this is where a ledger system helps because you can have an unforgeable record of that. Then after think about, “Okay, well, can we continue training on that or not?” It’s curious for me to understand what it means on individual records, but I don’t have anything to say about the individual record case other than I think it’s cool and I’d like to have it.

See more presentations with transcripts



Presentation: Michelangelo Palette: A Feature Engineering Platform at Uber

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Transcript

Nene: Let me start by quickly introducing ourselves. My name is Amit Nene, and here with me is Eric Chen; we work on Uber's Michelangelo machine learning platform. I've been working on the platform for a little less than a year, and Eric has been a veteran on it. We're going to talk today about Michelangelo Palette, which is a system within the greater Michelangelo platform that enables feature engineering for various use cases at Uber. Feature engineering is the use of domain knowledge to prepare features, or data attributes, for use in machine learning.

Michelangelo at Uber

Before that, I want to give a small overview of the Michelangelo platform and what it does. Michelangelo is a proprietary platform that was built at Uber, and our mission is to enable engineers and data scientists across the company to easily build and deploy machine learning solutions at scale. Essentially, you can think of it as ML as a service within Uber, and we give you all the building blocks for creating your end-to-end machine learning workflows. For instance, the first part is about managing your data, mostly features; we give you the tools and the APIs to manage those. The second part is creating your workflows: APIs to create end-to-end workflows for your supervised, unsupervised, and deep learning models. We support a variety of different tools because machine learning at Uber uses a wide variety of libraries and tools across the company, so we want to make sure our workflows are built in a heterogeneous way.

Having trained your models, we give you the ability to deploy them in production in various different environments: a batch environment for offline predictions, real-time online predictions, and mobile serving environments. More recently, we've been working on tools to detect drift in your features or your data, and also in your models and the model performance, both offline tools as well as online tools.

That was a quick overview of what Michelangelo is at Uber. It’s not an open source platform, but the purpose of this talk was to really share ideas around how we approach feature engineering, which I’ll get to in a bit and also give you guys an overview of the architecture of the subsystem involved here.

Feature Engineering

Let's get started with the feature engineering part of it, starting with a real-world example inspired by Uber Eats. Imagine you have the Uber Eats app, you search for your favorite restaurant, you find the dishes, you place the order. Let's say you're not satisfied with the ETAs that are reported, and you want to build a model to give you more accurate ETAs. The first thing that comes to mind is that you need additional data features, so you start thinking of what you can make use of. A few things come to mind: how large is the order, the number of dishes you ordered, how busy is the restaurant at that moment in time, how quick is the restaurant generally speaking, and what are the traffic conditions at that point in time?

Michelangelo Palette is essentially a platform that lets you easily create these features for your production models. It turns out that managing features, in our experience, is one of the biggest bottlenecks in productizing your ML models, and why is that? The first problem is starting with good features: there's abundant data everywhere, and data scientists or engineers just wish they had features ready to go so that they could use them for their models, so they struggle with that. Even if they end up finding or creating features for their experimental models, it is very difficult to make that data work in production and at scale, and also in real-time environments where low latencies are expected.

The third problem, very relevant to ML, is that a lot of training is done in an offline way, while serving often happens, at least at Uber, in real-time environments. It is absolutely important to make sure that the data you're using in real time, at serving time, matches the data that was used at training time; a previous presentation also hinted at this. A lot of times this is done in an ad hoc way, which can result in the so-called training/serving skew, which is extremely hard to debug in our experience and should be avoided.

The fourth area, which makes feature engineering generally difficult, and we're seeing more and more of this, is a shift towards real-time features. Features are based more and more on the latest state of the world, and traditional tools simply don't work. Data scientists and engineers are used to working with tables and SQL, and this world looks very different: there are Kafka streams, there are microservices.

Palette Feature Store

This is where Palette comes in. The first important component of Palette is the feature store. Our goal there is to be a centralized database in the company, to make your curated as well as crowd-sourced features (by crowd-sourced, I mean features contributed by various teams across Uber) available in one single place. You can think of it as a one-stop shop where you can search for features, in our case the restaurant features, trip features, rider features, in one single place. When you don't find those features, we provide you the tools to create new features that are production-ready. I'm going to talk a little bit about that in a second.

The other important goal is for teams to share features. Often we find various teams building the same features again and again, like the number of trips a driver took daily; you would find 10 different versions of this feature. Not only is this an extremely redundant waste of resources, but it is also important to share the underlying data so that models are really acting on the same data consistently and producing better results.

The fourth goal is to enable an ecosystem of tools around the feature store and the features. For instance, we're working on data drift detection, both in real time and offline, and we're also experimenting with automatic feature selection capabilities, so that not only will we help you search for features, but we'll also let you point your labels at our feature store and help you pick the features that are related to, or have an impact on, the labels. This is work in progress, but it is our eventual goal.

Feature Store Organization

A little bit about how the feature store is organized. At the highest level, it's organized by entities, so we're talking about the riders, the drivers, restaurants, and so forth. The second level is feature groups; these are logical groups of features that are associated with each other, typically features or data originating from the same pipeline or job. The third is the feature name, the data attribute name, and the fourth is a thing called a join key. It's meta information in the feature store that tells the feature store how to look up that feature, so if you're talking about a restaurant, the restaurant ID is an example of a join key. There's a Palette expression that uniquely identifies every feature in the Palette feature store: starting with the entity and the feature group (restaurant, real time), then the feature name (orders placed in the last 30 minutes), and the join key (restaurant ID).
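
As a sketch of what that addressing scheme might look like in practice, here is a hypothetical reference string and a toy lookup against an in-memory store. The delimiter, field names, and resolve function are mine for illustration; Palette's actual syntax and API are internal to Uber.

```python
# Hypothetical feature reference, loosely following the scheme described above:
#   <entity>:<feature_group>:<feature_name>:<join_key_column>
REF = "restaurant:realtime_stats:orders_placed_last_30_min:restaurant_id"

# Toy in-memory "online store": {(entity, group, name): {join_key_value: value}}
ONLINE_STORE = {
    ("restaurant", "realtime_stats", "orders_placed_last_30_min"): {"r-42": 17},
}

def resolve(ref: str, request: dict):
    """Parse a feature reference and look up its value for one request."""
    entity, group, name, join_key = ref.split(":")
    key_value = request[join_key]          # e.g. the restaurant ID on the order
    return ONLINE_STORE[(entity, group, name)].get(key_value)

print(resolve(REF, {"restaurant_id": "r-42"}))   # -> 17
```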

Behind the scenes a little bit, how is it implemented? It's supported by a dual data store system; you can think of it as a lambda architecture, but not quite. The offline data system is based on the warehouse used at Uber, which is Hive. It essentially saves daily snapshots of features, and it's primarily designed for consumption by training jobs. It can consume features in bulk at any given point in time, so the concept of time is really important here. It's designed for bulk retrieval: you can access any number of features, and the system joins these features for you and stitches them into a common feature vector.

The offline store is paired with an online store. The purpose of the online store is to serve those same features in real time in a low-latency way; today we use a KV store there, Cassandra in particular. Unlike the offline store, which stores historical data, the goal of the online store is to serve the latest known values of these features. Behind the scenes, we carry out the data synchronization between the two stores, so if you have new features that you ingested into the offline store, they will automatically get copied to the online store. If you ingest real-time features into the online store, they will get ETL'd back to the offline store. That is fundamentally how we achieve the consistency of data online and offline; it's the training/serving skew thing that I talked about earlier.

Moving on a little bit, going back to the example I started with, we want to build this model. It's a simple model, and we want four features. How large is the order? Is that something we can build in the feature store? It doesn't seem like it, because it's very specific to the context of what the user ordered, so it's probably an input feature; in our world, it's called a basis feature. It comes in with the request or it's present in the training table. How busy is the restaurant? That's something we should be able to look for in the feature store. How quick is the restaurant? Same thing here. How busy is the traffic? These seem like generally good features applicable to a wide variety of models.

Creating Batch Features

Let's see how we can go about looking for these. We go to the feature store and find they're not there, so now the feature store offers you options to create these features; let's take a look at the options we have. Starting with the batch features: these are the most popularly used just because they're easy to use. There's a warehouse, and there's a SQL query you can express the ETL with. The general idea is that you would write a Hive query, or you would write a custom Spark job, and you would put in a little bit of meta information in the Palette feature store, and we would automatically productionize the pipelines for you.

Let's walk through this example. There was a feature, "how quick is the restaurant?" This feature doesn't strictly need to be real time in nature, so what we did was find some warehouse tables that had this information, and we decided to run a query to simply aggregate some of it. We went into this feature spec and wrote a small snippet of Hive query, and the Palette feature store, using Uber's pipeline workflow infrastructure known as Piper, automatically created the Piper job for you. Not only did we do that, we also automatically created the alerts and monitoring, so that if there's any data outage, the on-call would be alerted; this is all built in, out of the box.
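
A hypothetical version of such a batch feature spec might look like the following: a short Hive query plus enough metadata for the platform to generate the scheduled pipeline, ingestion, and alerting. The field names, the table name, and the registration call are invented for illustration; the real Palette spec format is not public.

```python
# Hypothetical batch feature spec: a Hive query plus metadata the platform
# can use to generate the scheduled pipeline, ingestion, and alerting.
restaurant_prep_time_spec = {
    "entity": "restaurant",
    "feature_group": "batch_stats",
    "feature_name": "avg_prep_time_7d",
    "join_key": "restaurant_id",
    "schedule": "daily",
    "hive_query": """
        SELECT restaurant_id,
               AVG(prep_time_minutes) AS avg_prep_time_7d
        FROM   eats.order_facts                       -- hypothetical warehouse table
        WHERE  ds >= date_sub(current_date, 7)
        GROUP  BY restaurant_id
    """,
}

# feature_store.register(restaurant_prep_time_spec)   # hypothetical registration call
```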

The way it works is that the job produced from the feature spec started ingesting the data, using the Palette APIs, into the offline feature store. From there, it was automatically copied to the online store, and then this whole feature store system is available for both training and scoring. Similarly, we have support for real-time, or rather near-real-time, features. Here, we leverage Uber's near-real-time computation infrastructure; Flink, in particular, is a first-class citizen here. The way it works is that you can write Flink SQL that works on a real-time stream, perform some transformations, and produce those features into the feature store. Going back to our example, how busy is the restaurant? It turns out that there is a Kafka topic which is publishing some of this data, so it might be interesting if I were to build a feature out of this. It turns out, however, that it's a one-minute aggregation and I need to turn it into a five-minute aggregate.

I use Flink, I perform some aggregation via Flink in real time using streaming SQL, and then I turn it into a feature. In this particular case, I'm calling the feature nMeal; it's a restaurant feature, and it can be looked up by restaurant ID. The way it works is very similar to the batch case: I go into the feature spec, and I don't have to write a custom Flink job, although that's supported for more complex feature engineering. I can do it with simple Flink SQL; Uber's Flink system offers a service model where you can create a job on demand. We use the Palette system to automatically produce the production pipeline, compute these features, and ingest them into the online store, so unlike the batch case, you first ingest into the online store. Very symmetrical to the batch case, we automatically ETL this data behind the scenes to the offline store, so that it's available for training.

Whenever possible, we also support backfills, which means that if you just onboarded a real-time feature, you can make it available offline for training; that is a capability we leverage from the underlying Flink support. For training and serving, the flows are similar: they specify the features that they want, and the features all get pulled in from the appropriate stores. At training and serving time, they get merged together and the feature vector is shipped for training and serving.

Bringing Your Own Features

Moving on a little bit, what if your features are so odd that you cannot use any of the Palette feature store tools? This is what we call bring your own features, or external features. The expectation is that the customer is managing their own online and offline data here, so that is the caveat. This is like using the Palette feature store in an unmanaged way, meaning that the features are still registered in the feature store. The feature store still knows how to read these features, but the features are routed to external endpoints for retrieval.

Going back to our example: how busy is the region? There was no warehouse table and no Kafka topic that had this information, but there was one microservice that talked to a system outside of Uber to get this information, and maybe I can make use of that; here we're specifically talking about the traffic conditions. What I decided to do is go into the feature spec and put in an entry for a service endpoint, which I know has the traffic information. With the small DSL that we give you, you can extract the features from the RPC result, and the features are then made available for serving.

Remember that managing the data in this mode is my responsibility, which means that I decided to log the RPC results and ETL them into this custom store, so that the data is available for offline training. I did run it for a bit, a couple of weeks, until I had enough training data.

Palette Feature Joins

I mentioned a few times how the features are joined. At a very high level, the algorithm simply is that you start with your input features; recall from the example, there was an order, and there was the number of dishes in the order, for example, and the various keys were supplied as part of the basis training job. The algorithm essentially takes this, joins it against all the Palette features, the three different features that we talked about, and produces the final table, which is used for your training. Scoring is very similar, but unlike the table format here, you're working on conceptually fewer rows at a time; essentially and logically, Palette does the same exact operation. This is a critical function of the Palette store, and we've invested a lot in getting this right from a scalability standpoint. Essentially, you want to take feature rows and join them, not only using a given key, but also at a given point in time.

If there was a feature from a month ago, you want exactly the state from that point in time, so it's very important to get this right. Scalability is extremely important; you end up joining billions and billions of rows during training, and we've put a lot of engineering effort there to optimize this process. Serving is similar: features are scattered across the board, across multiple feature stores. There's a lot of parallelism and caching that goes into ensuring that features are available and can be retrieved in single-digit P99 latencies, for example.
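
The point-in-time semantics described here can be sketched with an as-of join: for each training row, pick the most recent feature snapshot at or before that row's timestamp. Below is a toy pandas version of the idea (Palette does this at far larger scale on Hive/Spark; the data here is made up).

```python
import pandas as pd

# Basis (training) rows: one per order, with the order's timestamp.
orders = pd.DataFrame({
    "ts": pd.to_datetime(["2019-05-01 12:00", "2019-05-03 19:30"]),
    "restaurant_id": ["r-42", "r-42"],
    "n_dishes": [2, 5],
}).sort_values("ts")

# Daily feature snapshots from the offline store.
snapshots = pd.DataFrame({
    "ts": pd.to_datetime(["2019-04-30", "2019-05-02", "2019-05-03"]),
    "restaurant_id": ["r-42", "r-42", "r-42"],
    "avg_prep_time_7d": [14.0, 16.5, 15.2],
}).sort_values("ts")

# As-of join: each order gets the latest snapshot at or before its timestamp.
training_table = pd.merge_asof(
    orders, snapshots, on="ts", by="restaurant_id", direction="backward"
)
print(training_table)
```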

That was a quick example of how we put together these features: we used an input feature for the order, and nMeal came from a near-real-time feature. There's a prep time feature, which came from batch, and for busy, we did bring your own features. But the question at this point is, are you done with the end-to-end feature engineering? It turns out that the answer is no, because there are still customizations you need to do before those features can be consumed in the machine learning models. There is model-specific feature engineering, or there could be chains of feature engineering that you may apply. For example, you may look up one feature from Palette, and use that to look up another feature, so we've built a framework called Feature Transformers, which lets you do this last part of feature engineering. I'm going to hand it over to Eric [Chen], who's going to talk all about that.

Feature Consumption

Chen: As Amit [Nene] just mentioned, the ways to consume those features are actually pretty arbitrary, so let's really look into the problem and see what we have. We have four features here. One is the number of orders; as we know, it's coming from your input, so we call that a basis feature, and there's not too much trouble there. The other three are actually all rather troublesome. The number of meals, nMeal, represents how busy this restaurant is; let's look into the whole querying and feature transformation chain for it. As input, we have the restaurant ID, so it's not too hard; you can make a one-level query and get your nMeal based on our syntax. Prep time is different, because for some new restaurants there's no prep time, so for this particular case you actually need two steps. The first step, you can still go to the feature store and get the prep time, but it could actually be null. You need to introduce some feature impute, so one possible way is, I'm going to use all the restaurants in the same training set and use the average as the prep time for this restaurant; so you need some feature impute after you grab the feature from the store.

The busyness of the region is actually even more complicated, because that is not a feature associated with the restaurant; instead, it's associated with the region. If you look into this, based on the restaurant ID, we can find the location of the restaurant, which has a lat/long. Then we need to do a conversion into the region; not too complex, you can imagine this is a geohash. Based on the lat/long, I can get a geohash, and that geohash is the entity. Based on this entity, I need to do another round of querying to figure out how busy this particular region is. Here, even though we're only trying to transform three features, these features are taking somewhat arbitrary paths to make the final feature available to the model. How do we make these things happen?
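
That chain, restaurant ID to lat/long to region to region busyness, can be written down as a tiny two-hop lookup. A toy Python sketch follows; the grid-cell function stands in for the geohash step, and all data and names are made up for illustration.

```python
# Toy stores: restaurant location features and region-level busyness features.
RESTAURANT_LOCATION = {"r-42": (37.7749, -122.4194)}   # lat, long
REGION_BUSYNESS = {"37.77,-122.42": 0.83}              # busyness score per grid cell

def grid_cell(lat: float, lng: float) -> str:
    """Stand-in for a geohash: bucket lat/long into a coarse grid-cell ID."""
    return f"{round(lat, 2)},{round(lng, 2)}"

def region_busyness(restaurant_id: str) -> float:
    # Hop 1: restaurant ID -> location feature.
    lat, lng = RESTAURANT_LOCATION[restaurant_id]
    # Manipulation: location -> region ID.
    region_id = grid_cell(lat, lng)
    # Hop 2: region ID -> busyness feature.
    return REGION_BUSYNESS.get(region_id, 0.0)

print(region_busyness("r-42"))   # -> 0.83
```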

Michelangelo Transformers

In order to answer this question, let's just take one step back and try to understand what a model is. This concept is not really new; it was introduced in Michelangelo, and we actually built on top of the Spark ML framework. Spark ML has this Transformer/Estimator pattern: a particular transformer is one stage in your whole pipeline that will take your input and modify, add, or remove some fields in your record, so it's doing some manipulation of your data.

What really is a model? A model is basically just a sequence of those transformations. Those transformations could fetch some feature from the Palette feature store, do some feature manipulation, or do model inferencing. What Michelangelo extends from standard Spark is to make them servable in both online and offline settings. In the offline path, this is actually exactly the same, it is the Spark transformer, but in the online path, they behave differently, and we introduce some extensions to make it happen. The Estimator is also part of this framework; usually it's called the Estimator/Transformer pattern. The estimator is used during training time: during the fit, it will generate a transformer, and you have a sequence of transformers, which is what's used at serving time.

Let's just try to see what is really there; for this particular talk, we're trying to focus more on feature engineering. Just on the feature engineering and consumption side, there are already several types. Feature extraction: that's the Palette feature store we talked about. There's also feature manipulation: for example, I have the lat/long and I need to translate it into the geohash, which is a UDF function I define. Feature impute: you can also think of that as a feature manipulation, whereas if you write it as a Scala line, it's more like saying, "If null, use this value, otherwise use the other," so it's a code snippet in some sense.

The other two things are more centered on the modeling side, and I'm actually not going to go too deep into them. If you have a categorical feature, you need to convert it into numerical values; if you have linear regression, you may want to turn on a one-hot encoder. If you have binary classification, you may want to define your decision threshold. All these things can be seen as transformers as well.

Here is a real example where we extend the standard Spark transformers. On the estimator side, it's actually exactly the same, nothing really special; the only special thing happens on the transformer side. On the transformer side we use standard Spark ML model things like MLReadable and MLWritable. The only difference is here: it is an MATransformer, where MATransformer stands for Michelangelo Transformer. What it introduces is one function called scoreInstance, which is used in the online case. The input is nothing related to a DataFrame at all; it's purely just a map, and the output is also a map.

Then the question is, because we need to serve the same thing in the online and offline systems, how do we guarantee consistency? Usually, the way to write a transform function is to define it as a UDF and register that UDF in the map function here. That's how you can guarantee online and offline consistency, because under the hood, they're using the same lines of code all the time, and then we might have [inaudible 00:26:20] on top of it to make it happen.
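
To make that online/offline trick concrete, here is a rough PySpark sketch of the pattern as I understand it from the talk; Michelangelo's real MATransformer is Scala and internal to Uber, so the class name, method name, and details below are illustrative only. The key point is that the same pure-Python function backs both the Spark UDF used for offline DataFrames and a scoreInstance-style map-in/map-out method used online.

```python
from pyspark.ml import Transformer
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def impute_prep_time(prep_time, fallback):
    """Shared logic: use the stored prep time, or a fallback value."""
    return float(prep_time) if prep_time is not None else float(fallback)

class PrepTimeImputer(Transformer):
    """Illustrative Michelangelo-style transformer: one _transform for offline
    Spark DataFrames and one score_instance for online dict-shaped requests,
    both delegating to the same function to avoid training/serving skew."""

    def __init__(self, input_col, output_col, fallback):
        super().__init__()
        self.input_col, self.output_col, self.fallback = input_col, output_col, fallback

    def _transform(self, df):                      # offline / batch path
        udf = F.udf(lambda v: impute_prep_time(v, self.fallback), DoubleType())
        return df.withColumn(self.output_col, udf(F.col(self.input_col)))

    def score_instance(self, features: dict) -> dict:   # online path: map in, map out
        out = dict(features)
        out[self.output_col] = impute_prep_time(features.get(self.input_col), self.fallback)
        return out
```

Note that this sketch stores plain attributes rather than proper Spark Params, so it skips the MLReadable/MLWritable persistence the talk mentions; it is only meant to show the dual code path.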

Palette Retrieval as a Transformer

Let's get deeper into the two examples we want to understand. We talked about feature retrieval: we have that particular syntax, and we want to get the feature out of the feature store. We have feature manipulation: for example, convert lat/long into the geohash, or figure out the feature impute. Let's do them one by one. First of all, let's try to understand what the Palette feature transformer is. More or less, this is the syntax we already introduced on the previous page; it says which features we are interested in at this particular transformation stage.

Here is what we're trying to do: in order to do things in the transformer pattern, we need to figure out, for each transformer, what type of thing we want it to do. The idea here is that we want to [inaudible 00:27:22] between Palette feature retrieval and feature manipulation. Here are some examples. For the first feature, using the restaurant ID, I'm trying to get the number of meals. For the second feature, I'm using the restaurant ID to get the prep time. But for the third feature, how busy the region is, I can't finish it in one step. The first step is to get the lat/long for this restaurant. Imagine there's an interleaved step after that: I do another round of feature retrieval, and then I have the region ID. I query from that service feature and figure out how busy this region is. So this needs to be interleaved with a feature manipulation. Feature manipulation in our system is called a DSL, because it's a type of domain-specific language. The idea here is you can say how you want to transform a feature. Remember, we said we get a lat/long and we want to transform that feature into the region ID, so in the next step we can use the region ID to do another round of querying; this transformation can be written like this.

I'm trying to get the region ID, and the way to compute it is that we provide some built-in functions in our system; it's really just a function where you can use the two features you specified in your previous stage, and it will translate them into this feature. Similarly, after we grab all the features in the system, we want to do a bunch of impute feature manipulations. We clean up the prep time: we use it as a numerical value, and if it doesn't exist, we fill it with the average prep time. The number of meals we just use as a numerical value, and we convert the number of orders similarly, and then the busy [inaudible 00:29:30], similar thing.

This is called an estimator. Why is that? It's really because of this fallback logic: when we need to do the fallback, where are your fallback values coming from? They're actually coming from your training dataset. You want your fallback values computed the same way at training and serving time, so that's why it's considered an estimator. In the estimation stage, it will fit on your data and learn all the stats, and that becomes the transformer to use. Behind the scenes, this actually goes through some code generation, so all the things you see here [inaudible 00:30:10] will be converted into a code snippet, go through a compiler, and become Java bytecode at serving time. That's how we can guarantee the speed; we don't do the parsing for every single query you have.
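
The estimator-versus-transformer distinction described here boils down to: the fit step computes the fallback statistics on the training set, and the resulting transformer carries those frozen values to serving. Here is a plain-Python sketch of that idea (not the actual DSL or its code generation):

```python
class MeanImputeEstimator:
    """Fit: learn the training-set mean of a column. The fitted transformer
    reuses that frozen mean at serving time, so training and serving fall
    back to exactly the same value."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        values = [r[self.column] for r in rows if r.get(self.column) is not None]
        mean = sum(values) / len(values) if values else 0.0
        return MeanImputeTransformer(self.column, mean)

class MeanImputeTransformer:
    def __init__(self, column, fill_value):
        self.column, self.fill_value = column, fill_value

    def transform(self, row):
        out = dict(row)
        if out.get(self.column) is None:
            out[self.column] = self.fill_value
        return out

# Training time: the estimator learns a mean prep time of 15.0 from the data.
imputer = MeanImputeEstimator("prep_time").fit(
    [{"prep_time": 10.0}, {"prep_time": 20.0}, {"prep_time": None}]
)
# Serving time: a new restaurant with no prep time gets the training-set mean.
print(imputer.transform({"prep_time": None}))   # -> {'prep_time': 15.0}
```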

Let's now try to pull the picture together. We have the number of meals, we have prep time, we have the busy scale, and we're trying to interleave them with two Palette feature retrievals and two DSL transformer feature manipulations. In the first stage, we query the number of meals, prep time, and restaurant lat/long from the restaurant ID. In the manipulation, we change the lat/long into a region ID. In the second round, based on the region ID, we [inaudible 00:30:58] busyness, and in the final stage we do some cleanup of all the features. After this is done, what comes next is actually the standard training part, so you start talking about string indexers and one-hot encoders. If it's binary classification, you may have one estimator at the end trying to fit your affluence score, fit your decision threshold based on [inaudible 00:31:21]. That's more or less the whole picture of how we can make this pipeline ready.

Dev Tools: Authoring and Debugging a Pipeline

But remember, we are a platform team, not a per-project engineering team, so we cannot just code this one up and say, "Hey, done," because we have to handle these things daily from different customers. How can we provide a tool for them, so that we can give this power back to them, and they can do all this work by themselves without help from us? We already talked about the Hive part coming from a Hive query or a Flink query. Here it's about how you can represent this as a customized pipeline in an interactive way.

We actually leverage this directly from Spark. Spark has MLReadable and MLWritable, so it provides serialization and deserialization. We gave users a tool where they can describe the pipeline they want, in this particular case interleaved between Palette retrieval and the DSL estimator, and the rest is the regular pipeline definition. You can fit the model, use the fitted model to try it on your data frame, and then Michelangelo provides you with an upload store so you can upload it. The interesting part is that you can upload both: the model you trained for your one-off test, and also the pipeline you want to train, which is how you can do re-training through the system. We have a workflow system behind the scenes which does all the optimization behind this simple line of code to make it scalable and re-trainable.
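
In PySpark terms, the authoring flow described here looks roughly like the following sketch (Spark 3.x APIs, placeholder stage names, and an assumed existing `training_df` DataFrame; the real Palette and DSL stages are internal to Michelangelo): compose the stages, fit on a training DataFrame, and persist both the unfitted pipeline for retraining and the fitted model for serving.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Placeholder stages: in Michelangelo these would interleave Palette feature
# retrieval stages and DSL estimators ahead of the standard Spark ML stages.
indexer = StringIndexer(inputCol="restaurant_id", outputCol="restaurant_idx")
encoder = OneHotEncoder(inputCols=["restaurant_idx"], outputCols=["restaurant_vec"])

pipeline = Pipeline(stages=[indexer, encoder])

# training_df is assumed to be an existing Spark DataFrame of basis features.
model = pipeline.fit(training_df)

pipeline.save("/models/eta/pipeline")   # unfitted definition, for re-training
model.save("/models/eta/model")         # fitted stages, for serving
```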

Takeaways

That's more or less all we have, so let me try to highlight the takeaways. We talked about a particular feature engineering platform. There are two pieces: one is how you manage and organize all these features, especially across the two stores, online and offline. We also talked about how we manipulate those features, and for the offline case, that's many joins. Both online and offline are about speed, so they each have a different way to optimize the queries, and they also need different stores to make it happen.

About the transformers: the bigger point is the MATransformer, where we introduce the function scoreInstance; that's the place where we guarantee online and offline consistency. I actually would phrase it another way: "It's a guaranteed, controlled inconsistency." What do I mean? For the feature store, we said that the online store implies I'm interested in the feature right now. In the offline case, I'm saying, "I'm interested in the feature when this particular record happened." That's indeed not online/offline consistency; it's online/offline controlled inconsistency. How can we make all these things happen? It's all modeled by our transformer framework. We also run all these pipelines behind the scenes, so we provide the out-of-the-box reliability and monitoring, and put operations back in the hands of our users.

Michelangelo, as a whole, is not only about feature engineering; if you look further, you can hear other stories. For example, how do we do [inaudible 00:35:07]? How do we do model understanding? How do we do other things, like Python-related modeling? There are a lot of interesting stories there.

Questions and Answers

Participant 1: A quick question is, what if teams use TensorFlow to build a model? How do they leverage this platform?

Chen: TensorFlow is actually hiding behind this transformer framework, so your TensorFlow model can be hooked up with any other transformation stages you define yourself; that's what happens behind the scenes. I think you also probably want to say that TensorFlow has some feature engineering stuff as well. TensorFlow feature engineering is usually about [inaudible 00:36:12] coding. It's a different kind of feature engineering from the transformer engineering I talked about; it's a different scope, fused inside the TensorFlow part.

Participant 2: I noticed that you had different tools for the feature generation part- you had Hive QL, you had Flink QL. As we go further down, you had Spark for the transformers and the estimators. Have you guys thought about having a unified API, something like Spark Structured Streaming, which sort of works across the board? That way you have data frames to work from start to finish?

Chen: A unified API - the way we see it is actually not a unified API; we see it as a package of APIs. The way we're trying to open this up is through IPython Notebook, because we are trying to unify the experience, mainly the terminology. When you say a Palette feature, it's supposed to mean exactly the same thing when you're doing a Hive query or when you're trying to join features. We're actually at the stage of trying to unify all the different Python packages under the same naming patterns. We're exposing them all through this IPython Notebook provided by Uber [inaudible 00:37:38]; that's how we are unifying everything.

Nene: I also want to add that these systems that you mentioned all have their own sweet spots. Hive will thrive at queries at large scale, Flink will thrive at processing Kafka streams in near real time, and then the feature [inaudible 00:37:57] here has a slightly different goal. These systems have matured and they all have their optimal place. We try to leverage what is available at Uber and make it available via a common and sort of central place.

Participant 3: I have actually three questions. The first question I wanted to ask is, how often does it happen that a feature is being used across teams at Uber? I think the whole purpose of this platform is feature sharing, and in my personal experience, outside a team, there is very little feature sharing that happens. That's question one. My second question is, when you build a platform like this, how do you prevent the proliferation of features created by different teams when they are exactly the same features but with different names? Sometimes they could be exactly the same, sometimes they could have partial correlation; that's question number two. My third question is, you said that there are a bunch of applications for which accessing the feature online cannot be done by a key-value store, so you let people have their own custom solution. I want you to describe what the use cases are where you have to use custom solutions, because I can think of three where key-value stores will not work.

Nene: You've got a lot of questions there, so I will very quickly answer you in just one minute. I think the first part was about feature sharing. That is indeed very much the goal, and although there are some trends where teams contribute to the feature store but only use their own features, that is true, we see that, but there are teams who actually do use the same features, so in our case it has actually been true. Not to the extent that we'd like, but that is definitely a goal. Now, this is a complex problem, and part of it is technical, part of it is process and how teams work with each other, but one of the things we have found is that there's a trust element there.

Once you build your ecosystem where you can actually trust the features, where you actually build the tools to show the data distribution, the anomalies, how reliable it is, the confidence of using that feature actually increases, that is what actually we see. If you just throw some key values or features there and you don’t know anything about it, that is where the problem originates, in my experience. We are actually focusing on actually building the tooling so that users start trusting the features more.

Another example is the Auto Feature Selection tools that we’ve been building, not our team but a related team. We are actually finding that unrelated teams are finding features built by the other team by this particular tool, which is actually showing, “Hey, this is a feature that is correlated to your labels.”, so they are actually finding it interesting and then coming to us and saying, “How can we use this? Who’s owning the pipelines?” Some of it is really related to just having poor tooling and once you invest in the tooling that ought to change, is my view. The third question was pretty complex. I don’t know, maybe we should just discuss it like offline.

The tool we're building for automatic feature selection actually shows redundancies across features, so it's a work in progress, but that is one [inaudible 00:41:32] step. Without such tools, you're at the mercy of just reviews, so every check-in to the Palette feature store actually goes through a review. We try our best there, and we try to enforce conventions around entities and rules, but because it's purely a human process, things can sometimes slip through the cracks.

See more presentations with transcripts



Writing Web Applications in Java – a Study of Alternatives

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

With the increasing popularity of compile-to-JavaScript languages, developers familiar with Java Virtual Machine (JVM) languages who want to develop web applications without the difficulties of a JavaScript development stack have an increasing array of alternatives to JavaScript to choose from. A recently published analysis of the performance and payload characteristics of GWT, TeaVM, JSweet, CheerpJ, and Vaadin Flow on a trivial application seems to indicate that the performance penalty vs. native JavaScript web applications is shrinking.

Java developer Renato Athaydes recently compared JVM alternatives to JavaScript. The target application that serves as reference for the comparison is a simple counter, implemented with the React JavaScript front-end library:

Simple React counter app

The counter app was implemented in five Java frameworks: GWT, TeaVM, JSweet, CheerpJ, and Vaadin Flow. The comparison between the five Java alternatives and the native JavaScript application followed a simple methodology: create the counter app using the most basic tools provided by the Java product/framework, and subsequently measure the application's size and performance. The application size is obtained through the browser's network tab (counting all types of resources to avoid bias in favor of frameworks that rely on heavy, non-JavaScript resources). The performance is measured leveraging Google Chrome's built-in Lighthouse auditing tool. Lighthouse's performance score is evaluated on six weighted metrics, presented here in decreasing order of importance: Time to Interactive, Speed Index, First Contentful Paint, First CPU Idle, First Meaningful Paint, and Estimated Input Latency.

The study shows that JSweet and TeaVM beat the native JavaScript React application on the First Contentful Paint (FCP) measure as reported by Lighthouse. Google reports that First Contentful Paint "measures the time from navigation to the time when the browser renders the first bit of content from the DOM. This is an important milestone for users because it provides feedback that the page is actually loading". While most implementations obtain what is deemed by Lighthouse to be a good global performance score (above 90 out of a maximum of 100), the CheerpJ framework performed poorly in comparison to its peers.

performance comparison

The resulting Java framework ranking can be explained in large part by the size of the produced JavaScript:

[Chart: size of the produced JavaScript per framework]

Here, the Java frameworks rank in the same order. This is to be expected, as the time for the browser to parse the produced JavaScript correlates positively to size and negatively to performance.

The study does not include J2CL, a recent Google-backed Java-to-JavaScript transpiler. Furthermore, the study relies on a trivial web application. While the results cannot be extrapolated to a large web application, it serves nonetheless to analyze the differences between the Java frameworks considered in the study.

CheerpJ's large size and subsequent poor performance can, for instance, be explained by the fact that it is not really a Java-to-JavaScript transpiler but a complete Java runtime shipped to the browser. On the bright side, CheerpJ allows developers to take a compiled jar containing a Swing app and run it in the browser, without plugins.

GWT (released in 2006) is a mature open-source technology used for instance by Google AdWords. GWT ships with a series of widgets and panels for user interface construction.

TeaVM claims to be an ahead-of-time compiler for Java bytecode that emits JavaScript and WebAssembly that runs in a browser. TeaVM does not require source code, only compiled class files. Moreover, the source code is not required to be Java, so TeaVM may compile Kotlin and Scala.

Jsweet self-describes as a transpiler from Java to TypeScript/JavaScript with 1000+ well-typed JavaScript libraries available from Java. Jsweet claims to allow developers to access generated APIs/objects from JavaScript, without requiring additional tooling or runtime. However, Jsweet does not fully emulate Java. Existing Java applications may have to be modified in some measure, in particular those which use Java-specific APIs such as Swing for the user interface.

Vaadin Flow is a set of components and tools for building web apps in Java. Vaadin Flow features built-in Spring support and automatic server-client communication with WebSocket support. Web applications may be written in pure Java, or a mix of Java and HTML. Vaadin's components are web components self-described as mobile-first, fine-tuned for UX, performance, and accessibility. However, Vaadin Flow ships a client-side engine weighing over "300k compressed (regardless of app size)". The latter explains the performance profile of Vaadin Flow.

The study methodology is inspired from the blog post A real-world comparison of front-end frameworks, which InfoQ previously reported on.



Presentation: Instrumentation, Observability & Monitoring of Machine Learning Models

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Transcript

Wills: I'm so excited to see so many people coming to my boringly titled talk on "Monitoring and Machine Learning"; that makes me very happy. It's one of those topics that's super boring until it's 2:00 in the morning and Slack Search doesn't work, and you can't figure out why; then, suddenly, it's very interesting.

About Me

My name is Josh Wills, I am an engineer at Slack, I work on our Search and Machine Learning and Infrastructure team, and I've been at Slack for about three and a half years. My first couple of years at Slack, I was building out the data engineering infrastructure, so I built the data team. I was the first director of data engineering, hired the early data engineers, and built the early data infrastructure: all the logging systems, Hadoop, Spark, Airflow. I built the team to about 12 people and discovered in the process that I did not really care for management that much. There's an old joke that law school is kind of like a pie-eating contest where first prize is more pie, and I found the same thing basically applies to management: the more management you do, the higher you go in the management hierarchy, and it's just more management, roughly speaking.

Before that, I had a great fake job. I worked as Cloudera's director of Data Science. I was there for about four years, and I just went around and talked to people about Hadoop, and MapReduce, and machine learning, and data science, these sorts of things; it was a pretty great job, I really enjoyed that. Before that, I worked at Google, and my first job at Google was working on the ad auction, so if you've ever done a search on Google and an ad showed up, at least in roughly 2008, 2009, that was me, you're welcome. At least some of you must work on ad systems here, I mean given the audience, that's fine. I did that for a couple of years, and then I worked on Google's data infrastructure systems for a long time. I worked, in particular, on Google's experiments framework, which they use for A/B testing, and configuration deployments, and all that good stuff. I worked on a lot of machine learning models, and I worked on a lot of systems for doing news recommendations and friend recommendations for what eventually became Google+. That's roughly me, that's roughly what I've done.

Before I go too much further, I’m going to be talking a lot about Slack today. Does anyone not know what Slack is? Are there any Google engineers in the audience right now? If you don’t use it, Slack is sort of like Kafka, but for people, at least there are channels, and you can work in them. Slack is really great, and it’s very popular, people seem to like using it. The other company I’m going to be talking about a lot is Google. Does anyone not know what Google is? It’s always good to double check. Google is a search engine, you can search for things on it, we’ll show you ads sometimes.

The Genesis of this Talk

Gwen and I are friends; Gwen is the coach here, and she's running this track. A year ago, I gave a talk with a very similar title at a meetup for my friends at LaunchDarkly. For some reason, the talk was recorded, and it ended up on this orange hell site, and it made it to the front page, and I got lots of comments on there. Against my better judgment, I went and read the comments, which is generally a terrible idea. This is my favorite one, "Apart from the occasional pun or allegorical comments, frankly, that presentation ended very abruptly with very little actual substance, or maybe I was just expecting more."

For context, it was a lightning talk, 10 minutes; all I was doing was presenting links to resources, and papers to read, and stuff like that, so that's why there was no actual substance in the talk. Gwen asked me to come talk at QCon.ai about whatever I wanted to talk about. I was like, "Man, this is my opportunity. I can go into deep substance on this really interesting topic that I love talking about," which is how we do monitoring for machine learning. Anyway, I hope that YouTube commenter, or Hacker News commenter, same horrible difference, if InfiniteONE is watching, I hope you enjoy this one a lot more.

Machine Learning in the Wild

First, to set the stage, there’s a lot of different stuff that is called machine learning, there’s a lot of different components to building machine learning systems. I am talking about a very specific context, which is, machine learning in production systems, in online serving systems where a request comes in to the machine learning model, the machine learning model needs to generate a prediction, or a ranking, or a recommendation and that response needs to be returned ideally very quickly, on the order of milliseconds in order to make some sort of subsequent decision.

Building complex machine learning systems includes offline systems, online systems, it includes testing, and it includes lots of monitoring; I'm going to be talking primarily about the monitoring aspect. This is not a way of saying that the testing stuff is unimportant or the infrastructure stuff is not important. It's super important, but there aren't enough hours in the day to talk about all these different things, so I'm focusing on this one relatively small niche area. If you were coming to this expecting me to talk about the other stuff, really sorry about that, I'll try to talk about it at some other conferences in the future.

In the context of Slack, the system that I work on the most is our search ranking system. We have a dedicated microservice at Slack that does nothing but fetch and retrieve search results from our SolrCloud cluster and then apply various machine learning models to rerank them. What I'm going to talk about today is mostly about how we monitor and ensure that system is reliable in the face of failure.

Data Science Meets DevOps

What this really is about is data science meeting DevOps. Data science and DevOps are terms that grew up and became things roughly around the same time, 2009, 2010, right around then. What it really is about is software kind of eating the world: data science is software eating statistics, DevOps is software eating ops, and basically everything is code. Historically speaking, we haven't really gotten data science and DevOps together. There's this sort of thing where people keep trying to say DataOps, or MLOps, or whatever; the broader culture has not converged on a pithy name for this stuff yet.

What I really want to talk about is, what does it mean for data scientists to be doing monitoring? What exactly is entailed when the monitoring visibility team needs to sit down and talk to the machine learning engineers or the data scientists about how to handle monitoring and production? What I want to do is create a common sort of set of terminology, not a new terminology, I just want to explain to both sides what the other one is talking about and what sort of problems they’re facing to get this kind of conversation going, as we figure out what the best practices need to be.

Here's a little bit of history on the DevOps side of things. Back in 2009 at the Velocity conference, one of my coworkers, Paul Hammond, and another engineer named John Allspaw gave a talk about 10 deploys per day at Flickr. Flickr was deployed 10 times a day way back in 2009, and I know it sounds like craziness now, but back in 2009, that was insane; deploying something more than once a month was crazy. They talked about the processes they developed in order to deploy that often, and since that time, iterating and deploying faster has become a thing. Etsy wrote a blog post in 2014 about deploying 50+ times a day. Amazon Web Services has written blog posts saying that they deploy a service to production every 11.6 seconds, which is both terrifying and very comforting to me, roughly at the same time.

The question is, if you are deploying stuff to production this quickly, if you are making changes all the time, how do you know if the changes are doing any good? How do you know if the changes are fixing things or improving things? That is fundamentally where monitoring and visibility originated and grew out of. These are the tools we use at Slack for knowing, as we deploy roughly 15 to 20 times a day right now, are we doing a good job? Are we actually making the systems better?

Logs via the ELK Stack

Tool number one, the ELK Stack: Elasticsearch, Logstash, Kibana. Kibana is the UI, Elasticsearch is the search engine, and Logstash is the log processing, structuring, and extraction system. The primary use case for us with the ELK Stack is that whenever one of our services throws some kind of error message, be it an exception, or a 500, or whatever it happens to be (we run PHP systems, we run Go systems, we run Java systems), we want to grab that event, grab the exception, grab a collected set of metadata around that exception, and write it off to Elasticsearch so that we can search for it later and find it if it turns out to be part of an overall problem.

The stuff we write to Elasticsearch is a little bit different from the logs you might be used to as a data scientist. Data scientists like very structured logs, ideally with a schema associated with them; Logstash logs are generally not like that, they're generally loosely structured JSON logs. We write them primarily in order to throw them away; they're only really designed to be consumed by the people who wrote them in the first place. They're not designed to be consumed downstream, they don't necessarily last for more than a couple of weeks, and we don't really care about them after a little while. They mix and match unstructured data, which is typically some sort of stack trace, with a relatively small amount of structured data, which gives us some context around where the exception was generated: What machine was the process running on? What time did the event happen? They can also include, in a very convenient way, a bunch of high-cardinality fields that can be used for uniquely identifying specific events. In the ELK Stack, every log record we keep has a unique identifier, a GUID, so you can recover a specific record later on if you're trying to figure out what's been going on and debug something.

In addition to your timestamp fields, your hostname fields, and information about what version of your code was running when the exception was thrown, you can also include these very high-cardinality dimensions. You use the ELK Stack basically like a search engine; it works like Splunk or Sumo Logic, all following the same general search-oriented paradigm. It's an incredibly powerful way to very quickly drill in, debug, and figure out what's going wrong with your system.
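
To make that concrete, here is a minimal sketch in Python of the kind of semi-structured record described above: a little low-cardinality context, a unique identifier per event, and the unstructured stack trace along for the ride. The field names and the "logstash" logger are hypothetical, not Slack's actual schema.

```python
import json
import logging
import socket
import time
import traceback
import uuid

# Assumes a logging handler elsewhere ships these JSON lines to Logstash.
logstash = logging.getLogger("logstash")

def log_exception(exc, request_id, build):
    logstash.error(json.dumps({
        "event_id": str(uuid.uuid4()),       # high-cardinality: unique per record
        "request_id": request_id,            # high-cardinality: unique per request
        "timestamp": time.time(),
        "host": socket.gethostname(),        # low-cardinality context
        "build": build,                      # which version of the code was running
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),                                   # the unstructured part
    }))
```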

The only major problem we run into in using the ELK Stack to figure out what's going wrong is that logs are not necessarily great for alerting. What I mean by that is that it is entirely possible, and in fact happens fairly frequently, for one of our services to start spamming logs like crazy, sending an enormous quantity of logs to our centralized log collection system. When that happens, the log service can get a bit behind, so I might not be able to see what's happening in my system right now because my log collection system is 15 minutes behind reality.

This is endemic to any kind of system based on pushing events, where you actually want to record every single thing that happens and have access to it later on. As a result, we need a different kind of tool to complement Elasticsearch, Logstash, and Kibana, one that helps us alert on and identify problems quickly.

Metrics with Prometheus

The tool we use at Slack is called Prometheus. Prometheus was originally developed at SoundCloud in 2012 by a couple of ex-Google engineers who based the design on an internal Google system called Borgmon. Everyone here is familiar with Spark and MapReduce-style pipelines; MapReduce and Spark have concepts called counters and accumulators, which you can use for tracking metrics about your jobs and pipelines as they run in real time. Prometheus is exactly the same thing, but for your online production systems. You can create counters, you can create a gauge, which is like a counter where the aggregation function takes a max as opposed to a sum, and you can even create some pretty cool simple histograms. Each one of these counters you construct (I'm just going to call them all counters from now on, because that's primarily what I use) can have a set of tags associated with it. In the same way that you can tag log records with structured metadata, you can tag your counters: which hostname generated this counter? What version of my code generated this counter? What was the request doing when this counter was incremented? Think of it like a very high-dimensional OLAP cube that you can use for querying and finding out what is going on with all of your real-time systems as things happen.
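
As a rough illustration, here is what that looks like with the open-source Python client for Prometheus; the metric names, label names, and the handler function are made up for the example, not taken from Slack's codebase.

```python
from prometheus_client import Counter, Histogram

# Hypothetical metrics for a search ranking service.
RANKING_REQUESTS = Counter(
    "search_ranking_requests_total",
    "Ranking requests served, broken out by model version and outcome",
    ["model_version", "outcome"],   # keep these label sets low-cardinality
)
RANKING_LATENCY = Histogram(
    "search_ranking_latency_seconds",
    "Time spent reranking a request",
)

def handle_request(model_version):
    with RANKING_LATENCY.time():    # observes elapsed seconds on exit
        # ... fetch results, rerank them ...
        RANKING_REQUESTS.labels(model_version=model_version, outcome="ok").inc()
```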

I mentioned that the Elasticsearch log processing pipelines are push-based: the servers push out logs, and they get aggregated. Prometheus is actually very clever in that it is pull-based, not push-based. The way it works is that your service publishes a page, usually called /metrics, which is just a list of the counters your server is tracking and their current values. The Prometheus agent scrapes this page at some cadence, it can be every 10 seconds, it can be every 15 seconds, and tracks changes in the metrics over time. This has a number of advantages. The first is that Prometheus is an automatic health check for your system: if the Prometheus agent queries this page and doesn't get any data back, that's generally a bad sign; the system is probably in bad shape and you need to alert when that happens. At the same time, you don't run into the problem of a service spamming metrics at Prometheus and knocking the central Prometheus servers over, because, by definition, Prometheus controls its own ingestion rate via its sampling strategy.
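
With the same Python client, exposing that pull-based metrics page is one line; the port is arbitrary here, and the Prometheus server is configured separately to scrape the endpoint on its own cadence.

```python
from prometheus_client import start_http_server

# Serves the current value of every registered counter, gauge, and histogram
# at http://localhost:8000/metrics for the Prometheus agent to scrape.
start_http_server(8000)
```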

There are some negative consequences to this. One is that Prometheus does not give you high-fidelity metrics: you can't guarantee you're going to capture every single thing that happens, because Prometheus is just sampling data over time. If you need every single record, every single event, you need to write that stuff into Logstash so that you can get at it later. The other is that Prometheus can't handle relatively large-cardinality fields. If you want to push it a little bit, you can have a dimension with around 100 distinct values, but it cannot handle the arbitrarily large cardinality that Logstash can. You can't put a unique ID for every request on every counter you create and expect Prometheus to keep working. You have to keep the cardinality space relatively small for Prometheus to aggregate your metrics successfully and perform well.

Traces

The third thing I want to mention is something we're deploying at Slack right now. The new hotness in the monitoring world is traces. Traces were born out of necessity for microservice architectures. If you have ever used a profiler to find hotspots in your code, creating a flame graph that shows where in your code base a particular request spends most of its time, traces are that idea applied to a microservice infrastructure, where any given request might involve 10 or 100 different services. A trace allows you to pass a unique identifier for a given request around to a bunch of services and have a common, structured way of figuring out: where is this request spending its time? Where are failures happening? What's the relationship between all of my different microservices? The two major ways of doing this right now are both open source: Zipkin, which is the one we use at Slack and was developed at Twitter, and Jaeger, which was developed at Uber; they're very similar conceptually.

From a machine learning perspective, we do not use traces per se; however, we do pass identifiers all over the place. For the initial request that says, "Please give me back a search ranking," or some other kind of prediction, we'll pass a unique identifier, the model will pass a unique identifier back down, and that gets propagated to our client. Then, when a client takes an action in response to that model, whether they click on something or mark something as spam, we use that identifier to give feedback to the model, so we know whether the model is doing a good job at what it's doing. I think traces are a good conceptual idea to be aware of when you're developing machine learning models, even if you're not using one of the official fancy frameworks like Zipkin or Jaeger and are just doing poor man's tracing yourself using common identifiers.
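
A minimal sketch of that "poor man's tracing", assuming nothing about Slack's actual implementation: generate an identifier when the ranking is served, hand it back with the response, and require the client to echo it when reporting feedback, so the two events can be joined later. The function and field names are hypothetical.

```python
import json
import logging
import uuid

log = logging.getLogger("search_ranking")

def serve_ranking(query, results):
    request_id = str(uuid.uuid4())
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    log.info(json.dumps({"event": "ranking_served", "request_id": request_id,
                         "result_count": len(ranked)}))
    # The client keeps request_id and sends it back with any click or spam report.
    return request_id, ranked

def record_feedback(request_id, action, result_id):
    log.info(json.dumps({"event": "ranking_feedback", "request_id": request_id,
                         "action": action, "result": result_id}))
```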

A Word about Cardinality

I'll talk briefly about cardinality. I mentioned that one of the great things about Logstash is that you can have very high-cardinality fields and everything will just work, and that Prometheus is great because you can do very fast aggregations, but at the cost of not being able to use very large-cardinality dimensions and tags on the metrics you create. Some people at Facebook asked, basically, "Why can't we have both of those things simultaneously? Why can't I have high-cardinality fields and be able to do fast, reliable aggregations?", so they built a system called Scuba, which lets them do exactly that.

A lot of folks who worked on or used Scuba at Facebook have now spun off into different companies, there's Interana, there's Honeycomb, there's Rockset, and they are all trying to apply these principles in practice. We are experimenting with Honeycomb ourselves at Slack; I'm very excited to use it and figure out how to apply it to machine learning models. For the time being, our workflow over the past year has been driving alerts and dashboards off of our Prometheus system; then, once those alerts fire to let us know that failures have spiked or errors are being thrown, we turn to Logstash to quickly drill down and figure out exactly what the source of the problem is, fix it, watch the numbers go back down, rinse and repeat, essentially forever. I'm optimistic that these newer systems will ultimately be able to create a better workflow, especially for machine learning engineers, by unifying the alert, notify, debug, fix cycle into a single pane of glass, which is the dream for a lot of people.

Make Good Decisions by Avoiding Bad Decisions

That's a lot about monitoring; let's talk about how we apply these tools to machine learning specifically. Fundamentally, when we are thinking about monitoring, we are thinking about how we want to handle failure. Assume that something bad has happened, assume that a request has failed, assume that a user has had a bad experience, and figure out, roughly speaking, why that happened. We're basically trying to invert the way we typically think about problems. Typically, we're trying to optimize some response surface, we're trying to maximize revenue, we're trying to cause something good to happen. In this case, our mentality shifts, and we need to think about how we handle it when something bad happens. It's basically inverting our problem.

My favorite example of this kind of thinking is the great story from World War II where they gathered all this data about planes that had returned after flying sorties and missions over enemy territory. They analyzed it to see: where were the planes getting shot? Where were the bullet holes in the planes? The general who gathered this data said, "Ok, here are all the places where our planes keep getting shot. Let's add extra armor to protect those areas more." Then a data scientist of his time looked at the data and said, "No, those areas are actually fine. The areas we need to add more armor to are the areas on the planes where there are no bullet holes, because those are the planes that are getting shot down. Those are the planes that aren't coming back. We need to add extra armor there; the places where we see bullet holes are not actually a problem, we don't really need to worry about them." In doing this sort of work, we're trying to make good decisions just by avoiding bad decisions; that is fundamentally our goal.

The reference paper to read if you are interested in this stuff is "The ML Test Score" rubric, written by Eric Breck, D. Sculley, and a few other people at Google. I want to caution you that there are two versions of this paper, a 2016 version and a 2017 version. The 2017 version is vastly better than the 2016 version. It was submitted to an IEEE conference, and whoever the reviewers were, they did a phenomenally good job of giving the team feedback on the 2016 version. The 2017 version, even though it has essentially the same title, is vastly more detailed, vastly more actionable, and has a better categorization and scoring system for evaluating the quality of the models.

The thing I want monitoring folks, visibility and infrastructure folks, to understand about machine learning is that the problems we face in deploying and monitoring machine learning models are harder because the failure modes are different. We're not just monitoring for 500s, we're not just monitoring for null pointer exceptions. By definition, when we deploy models to production, we cannot properly anticipate and specify all of the behavior we expect to see. If we could do that, we would just write code to do it and wouldn't bother doing machine learning, so the unexpected is a given.

If it helps explain it to them, there's a concept in monitoring and visibility right now called "testing in production." It sounds like a vaguely terrifying idea to a lot of people, but what I want to say is that, from a machine learning perspective, testing in production is just reality, not an option. There is no way to test every single input your system could possibly see before you launch the model in production. You have to have state-of-the-art monitoring and visibility in place so that when things do go wrong in production, and they will, you have the tools you need to figure out what's wrong and see about fixing it.

The Map is Not the Territory

I want to focus on the monitoring-specific aspects of this paper. There's also material on testing, both data testing and infrastructure testing, and it's all excellent and worth reading, but from a monitoring perspective, I broke the seven items the paper lists as important to monitor down into three conceptual categories, at least as I interpreted them. The first one is: the map is not the territory. The map is not the territory; the model is not the world it describes. Confusing the map with the territory, confusing the model with reality, is one of the classic mental errors we make, and we're trying to make good decisions by not making bad decisions. The first thing to remember is that hoary chestnut that all models are wrong and some are useful: there is no such thing as the perfect model, and in particular, for every model that matters, the model's performance will decay over time. Given a fixed set of features and a fixed set of training data, your model is going to get worse and worse as time goes on.

One of the most important things you can do, before you even remotely consider putting a model in production, is to train and evaluate in your offline environment: "If I train a model using this set of features on data from six months ago and apply it to data generated today, how much worse is it than a model trained on data from a month ago and applied to today?" I need to understand, as time goes on, how my model is getting worse, how quickly it is getting worse, and at what point I am going to need to replace the current model with a new one.
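
One way to run that exercise offline, sketched here with scikit-learn on a hypothetical dataframe with a date column, feature columns, and a clicked label: train one model on stale data and one on recent data, evaluate both on today's traffic, and the gap between them tells you roughly how fast the model decays. The column names and cutoffs are invented for the example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def decay_gap(df, feature_cols, label_col="clicked",
              stale_cutoff="2019-01-01", recent_cutoff="2019-06-01"):
    stale_train = df[df["date"] < stale_cutoff]
    recent_train = df[(df["date"] >= stale_cutoff) & (df["date"] < recent_cutoff)]
    today = df[df["date"] >= recent_cutoff]   # evaluation slice only

    def auc_on_today(train):
        model = LogisticRegression(max_iter=1000).fit(train[feature_cols], train[label_col])
        return roc_auc_score(today[label_col], model.predict_proba(today[feature_cols])[:, 1])

    recent_auc, stale_auc = auc_on_today(recent_train), auc_on_today(stale_train)
    return {"recent_auc": recent_auc, "stale_auc": stale_auc, "decay": recent_auc - stale_auc}
```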

That gives me a rough idea of how quickly I have to iterate and retrain my models. At Slack, we try to publish a new search ranking model once a day; roughly speaking, that is our goal. We are iterating just as fast as we can, trying new features, trying new algorithms, trying new parameters; we're always trying to bring new models into production just as fast as humanly possible. In fact, as a design goal, building an assembly line for models, building as many models as you can, pays all kinds of dividends and is, to me, the number one design principle for doing modeling in production. Don't ever do one model in production; do thousands of models or zero models. If you're working on a problem, and you need to deploy it to production, but you're never actually going to rebuild the model, that is a strong signal that the problem is not actually worth your time.

There's probably a vastly more important problem you should be working on instead; problems need to be solved over and over again, or not at all. Data science time, machine learning time, monitoring and visibility time is far too precious to waste on problems that are not important enough to keep getting better at, over and over again. When you're designing your engines, your algorithms, and your APIs, assume you're going to have multiple models, and assume you're going to be running multiple models on every single request.

Deploy Your Models like They Are Code

Along those lines, deploy your models like they are code. In a microservices framework, this actually gets a lot easier. When we're deploying a new Slack search ranking model, or set of models, we bundle up the binary, the feature engineering code, and the logistic regression coefficients, or trees, or whatever it is we're serving. We bundle the whole thing up, push it out to S3, and deploy it to all our servers as one atomic entity. Not everyone can do that, but if you can, it's phenomenally great because you can leverage the same rollback infrastructure your production code systems use for rolling back models in case things go wrong.

At Google, that was not really how models were deployed; models were deployed as artifacts themselves, data files that would be pushed out and loaded up onto the serving systems. If that's the way you have to deploy, please build a way to roll things back in case stuff goes bad. Let's just assume we've done this correctly, and we can roll stuff back when we need to.

Once you have the capability to have multiple models running in production servicing a given request, you get all kinds of awesome stuff. First and foremost, you get to use ensembles of models, and ensembles are almost always better than individual models by themselves. You can run experiments: not only A/B test experiments, but interleaved experiments, which are incredibly powerful for search ranking problems, where you can take the ranking from model A and the ranking from model B, mix them together, and see what people actually use and click on.

You can run dark tests to figure out whether a new model is too fast, or too slow, or too good, or too bad before you put it into production. Finally, you can use the results of one model to monitor another model. I am a big fan of always doing the dumbest thing possible, and in search ranking, the dumbest thing possible is a model called BM25, which is a very simple TF-IDF-style model. At Slack, we use BM25 as a sanity check for our fancy, crazy, cool XGBoost-based ranking model, which incorporates a bunch of different signals. If the ranking model diverges too significantly from BM25, that is a red flag; something has generally gone horribly wrong in the model if we are way off from what TF-IDF would predict would have happened in the absence of a model entirely. You can use an old, good, trusted model as a way of checking, validating, and verifying a new model that you don't quite trust yet. That is the primary virtue of being able to run these models side by side from a monitoring perspective.
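
A sketch of that kind of sanity check, assuming you already have the BM25 ordering and the model's ordering for the same set of documents: compute a rank correlation between the two and treat a very low value as a red flag. The threshold is made up, and this is not Slack's actual check, just one plausible way to implement the idea.

```python
from scipy.stats import kendalltau

def rank_agreement(bm25_order, model_order):
    """Kendall's tau between two orderings of the same document ids."""
    bm25_rank = {doc_id: i for i, doc_id in enumerate(bm25_order)}
    model_positions = list(range(len(model_order)))
    bm25_positions = [bm25_rank[doc_id] for doc_id in model_order]
    tau, _ = kendalltau(model_positions, bm25_positions)
    return tau

def ranking_looks_sane(bm25_order, model_order, threshold=0.2):
    # A ranking that barely correlates with plain TF-IDF ordering is suspicious.
    return rank_agreement(bm25_order, model_order) >= threshold
```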

Tag All the Things

When I talk about Prometheus and Logstash, I talk about incorporating structured data. Some of the structured data you need to incorporate on any given counter or in any given log is an identifier for the model, or models, associated with the request. At Slack, we have a single Jenkins system with an incrementing counter, so every single build we push has a unique identifier associated with it.

When we load the model up into the server, the model grabs that counter and adds it as a tag to every single thing the server emits to Prometheus or to Logstash, so we can very quickly see: is there a problem with a specific model? Is there a problem with a specific server? Whenever an error is happening, we can very quickly tie it to the code and the model associated with generating it. Generally speaking, using a git SHA is the dream and the ideal here; that's not always possible, but if you can, create a unique identifier for your model that you associate with all of its future requests.
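
In practice this can be as simple as reading the build identifier once at startup and stamping it onto every metric the service emits. Here is a sketch where the identifier arrives via an environment variable; that mechanism and the names are assumptions for illustration, not Slack's actual setup.

```python
import os
from prometheus_client import Counter

# Assume the deploy pipeline exports the Jenkins build number or git SHA at startup.
MODEL_BUILD = os.environ.get("MODEL_BUILD", "unknown")

RANKING_ERRORS = Counter(
    "ranking_errors_total",
    "Ranking errors, tagged with the model build that produced them",
    ["model_build"],
)

def record_error():
    RANKING_ERRORS.labels(model_build=MODEL_BUILD).inc()
```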

Circle of Competence

Models are dumb; they don't know what they don't know. We have to know what the model doesn't know on its behalf, and we have to be able to detect when the model is making predictions on top of things it doesn't know about. There's another famous hoary computer science chestnut, "garbage in, garbage out," and it's generally true. However, at least in machine learning, you can do some surprising things with garbage; machine learning can be surprisingly resilient to garbage. It's not nuclear fusion, but it's somewhere on that spectrum.

Machine learning models can be very robust to losing some of their inputs, and that is great, that is a good thing, but if it is happening, you need to know about it. At Google, for every machine learning model they ran, there was something called an ablation model. An ablation model basically asks, "What would happen, what would this model look like, if we did not have this signal anymore? How much worse would all of our metrics get?", so you can understand the consequence of losing some particular input to your system for your ultimate end performance.

Along those lines is the extent to which you can link your online and offline metrics together. When you are building and training models, if you have a set of counters you use to understand the distribution of inputs you are seeing, and you can have that exact same set of counters, with that exact same parameterization, in your online model, you can quickly tell when the input data you are receiving has diverged significantly from the data your model was trained on. This is very important for spam detection, and especially for fraud detection; fraud generally relies on someone sending you a bunch of data in a region of the parameter space where your model is not very good at making correct predictions. That's fundamentally how fraud and spam work. This is another aspect of your system: understanding what your model has been trained on will help you understand, in an online setting, when your model is operating outside of its circle of competence and is potentially making bad decisions.
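
One cheap way to compare the input distribution the live model is seeing against the distribution it was trained on is to keep the same bucketed counts in both places and compute something like a population stability index over them. A sketch, with hypothetical bucket dictionaries:

```python
import math

def population_stability_index(train_counts, online_counts, eps=1e-6):
    """PSI between two dicts of bucket -> count; larger means more drift."""
    buckets = set(train_counts) | set(online_counts)
    train_total = sum(train_counts.values()) or 1
    online_total = sum(online_counts.values()) or 1
    psi = 0.0
    for bucket in buckets:
        p = train_counts.get(bucket, 0) / train_total + eps
        q = online_counts.get(bucket, 0) / online_total + eps
        psi += (p - q) * math.log(p / q)
    return psi

# A common rule of thumb treats a PSI above roughly 0.2 as a meaningful shift
# worth alerting on, though the right threshold depends on the feature.
```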

Handling Cross-Language Feature Engineering

Another very common challenge that I've seen become a source of problems is: if your production environment is written in Go and your offline data pipeline is written in Java, how do I translate all of my feature engineering code from the Java system to the Go system? It's a terrible problem. We ran into it at Slack, although we ran into a much more egregious version of it, where for a long time the online system was written in PHP and the offline system was written in Java. The initial solution we came up with was to simply do all feature engineering in PHP: generate all the features in PHP, log them out to our data warehouse, and then run training off of that. It is far better to have the code once, even if it's in PHP, than to have two slightly different versions of the code doing two slightly different things that have to be tested.

Ultimately, what we ended up doing was moving the search ranking module out of PHP entirely and putting the whole thing in Java. This is my great hope for the future: that going forward, if your feature engineering and your offline training logic are in Python, your online scoring logic can be in Python too (or everything in Java), and we don't run into this problem of having different offline and online feature engineering and model scoring environments, because everything is designed to be the same. That is the virtue. I don't know of a better way to solve this problem right now; it's the simplest way to eliminate an entire class of bugs.
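
The spirit of that fix, sketched in Python: put the feature engineering in one shared module and import it from both the offline training pipeline and the online scoring service, so there is nothing to keep in sync. The feature function itself is invented for the example.

```python
# features.py -- a single source of truth, imported by both pipelines.
def message_features(message: str, channel_member_count: int) -> dict:
    return {
        "message_length": len(message),
        "has_mention": int("@" in message),
        "channel_size_bits": (channel_member_count or 1).bit_length(),  # cheap log2 proxy
    }

# offline_training.py and online_scoring.py both do:
#   from features import message_features
# so the exact same code produces features at training time and at serving time.
```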

Know Your Dependencies

Know your dependencies; become best friends with any upstream system that feeds input to your model. One of my favorite Google outages was the day the search engineering team decided to change the values of the language encoding string they passed over to the ad server. There's this thing in search: "What language is the person speaking?", and for a long time there were two-letter codes, like "en" for English, "es" for Spanish, and so on, that would signal to the ad system, "Here is what we think the language of the speaker is." One day, back in 2009, the search team decided, "Hey, we're going to change this from two-letter codes to four-letter codes so we can include dialects and creoles." They passed it over to the ad system without telling the ad system they were going to do this. The ad system sees a string, sends the string to the machine learning model, and the machine learning model has no idea what to do with the feature. All the language-related features became instantaneously useless; they all went to zero because we had no training data on what "en_us" versus "en_uk" meant.

I hate to show off for my company here, but honestly, the thing I love most about Slack is being able to hop into some other team's channel, see what they're doing, and catch when they're planning this kind of change. Definitely know when your dependencies are failing, and have timeouts for all the clients you rely on; this is all just good standard monitoring practice. But the problem when you're doing machine learning is that you are susceptible to someone keeping everything working just perfectly while changing it just a little bit: changing the definition of an enum, changing something small in a way that causes your model to degrade a little, not enough to trigger an alert, just a little bit worse. Become best friends, to the point of being creepy stalkers, with the teams whose systems you depend on for training your machine learning models.
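
A lightweight guard for exactly the language-code scenario above: keep the set of values the model was trained on, and count anything new, so a quiet upstream change shows up on a dashboard instead of silently zeroing out features. The known set and metric name here are illustrative, not from the actual incident.

```python
from prometheus_client import Counter

# Values of the upstream feature that existed in the training data.
KNOWN_LANGUAGES = {"en", "es", "fr", "de", "ja"}

UNSEEN_FEATURE_VALUES = Counter(
    "unseen_feature_values_total",
    "Feature values that never appeared in the training data",
    ["feature"],
)

def observe_language(code):
    if code not in KNOWN_LANGUAGES:
        UNSEEN_FEATURE_VALUES.labels(feature="language").inc()
    return code
```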

Monitoring for Critical Slices

At Slack, we have a handful of very large customers who are very important to us, partly because they pay us a lot of money, and partly because they are so much larger than everyone else that any kind of performance problem or issue will crop up with them long before it crops up with a tiny little 5 or 10-person team somewhere. For these very large customers, whom we desperately want to keep happy, we create dedicated monitoring slices and dedicated views of our data so that we can see if they are having a worse experience than everyone else, even if their issues would otherwise get drowned out in the noise; we look at their results very closely.

If you're working on recommendations at YouTube, you might want a special handler for things like the fire at Notre-Dame, to see if the system is making ridiculous associations with September 11th. For any kind of issue where you know that a problem here, even if everything else is fine, is going to be a big PR problem, a big customer problem, or a big systems problem, creating dedicated monitoring, logs, and metrics just for those slices is an exceedingly good use of your time; highly recommended.
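
One way to carve out those slices without blowing up metric cardinality is to label metrics with the critical customer's id only when it is on a small allowlist and lump everyone else together. The ids and metric name below are placeholders, not real Slack identifiers.

```python
from prometheus_client import Histogram

CRITICAL_TEAMS = {"T_BIGCO", "T_MEGACORP"}   # hypothetical team ids on the allowlist

SEARCH_LATENCY = Histogram(
    "search_latency_seconds",
    "Search latency, split out for critical customers",
    ["slice"],
)

def record_latency(team_id, seconds):
    slice_name = team_id if team_id in CRITICAL_TEAMS else "everyone_else"
    SEARCH_LATENCY.labels(slice=slice_name).observe(seconds)
```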

Second-Order Thinking

This is my favorite outage by far, because it happened right when I got to Google, in 2008. People remember 2008: financial crisis, things were going badly. Not long after I started, about a month in, Google's ad system started showing fewer ads, on a slow, steady decline. People got a little freaked out by this, but they were like, "Okay, well, the economy, people are freaking out. Advertisers are probably pulling back their budgets. We have all these knobs we can use for deciding how many ads to show. Let's just turn the knobs a little bit, and we'll crank up the ads", and so they did.

They turned the knobs, and they cranked up the ads, and the ads spiked for about a day, and then they started going down again. It continued like this for about two weeks, and people really started freaking out, because in the end Google lost control of their ad system for about two months, and the reason was a feedback loop. Google is not just one machine learning engine; it's about 16 different machine learning engines that all feed into each other. Machine learning algorithm A sends inputs that are used by machine learning algorithm B, which go into machine learning algorithm C, which feeds back into machine learning algorithm A, and a feedback loop in this process was slowly but surely killing the ads off. It took about two months of hair-on-fire panic emergency work to figure out what was wrong. At one point, I said to my mentor, a guy named Daniel Wright, "Daniel, is it possible the ad system has become self-aware, and that it doesn't like ads?"

The reason we had this panic fire drill was twofold. One, this stuff is really hard to detect. In particular, it was really hard to detect because we had not done any of the work I just described to monitor the individual systems and understand what their world looked like. Our assumption was always Occam's razor: the simplest possible explanation is probably the right one. We don't assume there are feedback loops from machine learning system A to B to C causing these kinds of systemic problems. We had to spend a good solid six weeks just doing simple, basic monitoring to understand what was going on before we were even remotely in a position to discover that this was the one time Occam's razor didn't apply, and that the answer was actually fairly complicated. That was kind of the trick.

My advice in all of this is: don't be like Google. Do your monitoring upfront, ahead of time, not later on when you're in panic mode. Do it from the very beginning, and bake it into every single thing you do.




Simple Trick to Remove Serial Correlation in Regression Models

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

Here is a simple trick that can solve a lot of problems.

You cannot trust a linear or logistic regression performed on data if the error terms (residuals) are auto-correlated. There are different approaches to de-correlate the observations, but they usually involve introducing a new matrix to take care of the resulting bias. See for instance here.

Requirements for linear regression

A radically different and much simpler approach is to re-shuffle the observations randomly. If that does not take care of the issue (the auto-correlations are weakened but remain significant after re-shuffling), it means that there is something fundamentally wrong with the data set, perhaps with the way the data was collected. In that case, cleaning the data or getting new data is the solution. But usually, re-shuffling, if done randomly, will eliminate these pesky correlations.

The trick

Reshuffling is done as follows:

  • Add one column to your data set, consisting of pseudo-random numbers, for instance generated with the RAND function in Excel.
  • Sort the entire data set (all the columns, plus the new column containing the pseudo random numbers) according to the values in the newly added column.

Then do the regression again and look at improvements in model performance. R-squared may not be a good indicator; techniques based on cross-validation should be used instead.

Actually, any regression technique where the order of the observations does not matter will not be sensitive to these auto-correlations. If you want to stick to standard, matrix-based regression techniques, then re-shuffling all your observations 10 times (to generate 10 new data sets, each one with the same observations but ordered differently) is the solution. You will end up with 10 different sets of estimates and predictors, one for each data set, and you can compare them; if they differ significantly, there is something wrong with your data, unless auto-correlations are expected, as in time series models (in that case, you might want to use different techniques anyway, for instance techniques adapted to time series, see here).
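
A small sketch of the trick with pandas: append a column of pseudo-random numbers, sort on it, and repeat to get several independently reshuffled copies of the data set. The function and column names are arbitrary.

```python
import numpy as np
import pandas as pd

def reshuffle(df: pd.DataFrame, seed=None) -> pd.DataFrame:
    """Return a copy of df with its rows in random order."""
    rng = np.random.default_rng(seed)
    shuffled = df.copy()
    shuffled["_rand"] = rng.random(len(df))       # the added pseudo-random column
    return (shuffled.sort_values("_rand")         # sort on it, as in the trick above
                    .drop(columns="_rand")
                    .reset_index(drop=True))

# Given an existing DataFrame df with your observations, ten reshuffled data
# sets to refit and compare would be:
#   reshuffled_sets = [reshuffle(df, seed=i) for i in range(10)]
```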

Testing for auto-correlations in the observations

If you have n observations and p variables, there is no single global auto-correlation coefficient that measures the association between one observation and the next. One way to do it is to compute the coefficient for each variable (column) separately. This gives you p lag-1 auto-correlation coefficients; you can then look at the minimum or the maximum among these p coefficients, in absolute value, and ask whether it is high. You can also check lag-2, lag-3 auto-correlations and so on. While auto-correlation between observations is not the same as auto-correlation between residuals, they are linked, and it is still a useful indicator of the quality of your data. For instance, if the data comes from sampling and consists of successive blocks of observations, each block corresponding to a segment, then you are likely to find auto-correlations, both in the observations and in the residuals. Or if there is a data glitch and some observations are duplicated, you can experience the same issue.
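
A quick way to run that check with pandas: compute the lag-k autocorrelation of every numeric column and look at the largest ones in absolute value. This is a generic sketch, not code from the article.

```python
import pandas as pd

def column_autocorrelations(df: pd.DataFrame, lag: int = 1) -> pd.Series:
    """Lag-`lag` autocorrelation of each numeric column, sorted by magnitude."""
    numeric = df.select_dtypes(include="number")
    coeffs = numeric.apply(lambda col: col.autocorr(lag=lag))
    return coeffs.reindex(coeffs.abs().sort_values(ascending=False).index)

# Repeat for lag=2, lag=3, ... and again after reshuffling, to confirm that the
# suspicious correlations have disappeared.
```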

To not miss this type of content in the future, subscribe to our newsletter. For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn, or visit my old web page here.
