Mobile Monitoring Solutions


Google Introduces Spinnaker Simplifying Continuous Delivery on Their Cloud Platform

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Spinnaker is an open-source multi-cloud continuous delivery platform co-developed by Google and Netflix. In a recent blog post, Google introduced the Spinnaker for Google Cloud Platform solution, which allows customers to install and run Spinnaker in the Google Cloud Platform (GCP).

The search giant now supports Spinnaker on its cloud platform. The solution comes with built-in deployment best practices and spans Google’s Kubernetes, Compute and App Engine services, as well as other clouds such as AWS and Azure, and on-prem targets. Furthermore, it integrates with other relevant Google services, including Cloud Build, Binary Authorization and Stackdriver – allowing customers to expand their CI/CD pipeline and integrate monitoring, compliance and security into the process.

According to the blog post, installing Spinnaker is fast and straightforward, requiring only a few clicks. Once the service is up and running, the installation includes all the core tools, as well as Deck, the user interface for the service.


Source: https://cloud.google.com/blog/products/devops-sre/introducing-spinnaker-for-google-cloud-platform-continuous-delivery-made-easy

With a production-ready configuration of Spinnaker, customers can benefit from:

  • Secure installation – support for one-click HTTPS configuration with Cloud Identity Aware Proxy (IAP), providing control over who can access the Spinnaker installation.
  • Automatic backups – a configuration of a Spinnaker installation is automatically backed up securely, for auditing and fast recovery.
  • Integrated auditing and monitoring – as mentioned earlier, Spinnaker integrates with Stackdriver, which simplifies monitoring, troubleshooting and auditing of changes and deployments.
  • Simplified maintenance – there are various helpers available to simplify and automate maintenance of Spinnaker installations, including configuring Spinnaker to deploy to new GKE clusters and GCE or GAE in other GCP projects.

Furthermore, customers can add sample pipelines and applications to Spinnaker that demonstrate best practices for deployments to Kubernetes, VMs and more. Also, customer DevOps teams can use those pipelines as starting points to provide “golden path” deployment pipelines tailored to their requirements.


Source: https://www.bmc.com/blogs/kubernetes-spinnaker-continuous-delivery/

Spinnaker for GCP could have broad appeal, as the open-source version is a popular solution in its own right – hundreds of software teams at companies such as Samsung Electronics Co. Ltd., Cisco Systems Inc. and Box Inc. use this version. Furthermore, Spinnaker for GCP is one of the latest offerings from Google with a focus on simplifying enterprises’ development workflows. For instance, last April, Google introduced a set of plugins collectively known as Cloud Code – making it easier for developers to move code from their workstations to GCP.

Lastly, existing Spinnaker users can migrate to Spinnaker for GCP today if they are already using Spinnaker’s Halyard tool to manage their Spinnaker installations.



C++20 Feature List Now Frozen: Modules, Coroutines, and Concepts are In; Contracts Out

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

The ISO C++ Committee has closed the feature list for the next C++ standard, dubbed C++20, scheduled to be published by February 2020. C++20 will be a significant revision of C++, bringing modules, coroutines, and concepts, among its major new features.

At its latest meeting in Cologne, the ISO C++ Committee agreed on the last changes to the C++20 draft before submitting it to all the national standard bodies to gather their feedback. Among the latest additions are std::format, the C++20 Synchronization library, and better threading. Contracts, on the contrary, have fallen out of the draft and have been postponed to C++23/C++26.

C++20 will introduce a new text formatting API, std::format, with the goal of combining the flexibility of the printf family of functions – its natural call style and separation of message and arguments – with the safety and extensibility of iostreams. std::format will make it possible to write:

std::string message = std::format("The answer is {}.", 42);  // message == "The answer is 42."

C++20 will also improve synchronization and thread coordination, including support for efficient atomic waiting and semaphores, latches, barriers, lock-free integral types, and more.

In previous meetings, the standard committee had already agreed on the inclusion of a few major features that promise to radically change the way developers use the language, including modules, coroutines, and concepts.

Module support will be orthogonal to namespaces and will enable the structuring of large codebases into logical parts without requiring the separation of header and source files. This is how you can define a simple module exporting a function and its usage in a different file:

// math.cppm
export module mod1;

export int identity(int arg) {
    return arg;
}

// main.cpp

import mod1;

int main() {
  identity(100);
}

Coroutines are functions that can be stopped and resumed. They are stackless, meaning they return to the caller when suspended. Coroutines bring three new operators: co_await to suspend execution without returning a value to the caller; co_yield to suspend while returning a value; and co_return to finalize the execution and return a value. These three operators enable the creation of asynchronous tasks, generators, and lazy functions. The following is an example of a generator:

generator<int> iota(int n = 0) {
  while(true)
    co_yield n++;
}

Another major new feature in C++20 will be concepts, which provide the foundation for equational reasoning in programs. For example, concepts can be used to perform compile-time validation of template arguments and to dispatch functions based on properties of types.

Besides contracts, C++20 will not include a number of other major language features that have been deferred to C++23/C++26, including reflection, metaclasses, executors, properties, and others.

There is, of course, a lot more to C++20 than can be covered in a short post, so make sure you read the full trip report for the complete details.



What does the Machine Learning process look like?

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

This is the second article about Machine Learning. If you would like to start from the introduction, you can find the first article here.

I have mentioned that every Machine Learning process is built from several steps like:

  • what would you like to achieve (define the goal)
  • prepare the data
  • select an algorithm(s)
  • build and train the model
  • test the model (and score it)

Let’s review them one by one. I should mention that elsewhere on the internet you will find the process broken into a different number of steps than on this blog. For example, you can separate building and testing a model into distinct steps, but in the end you need to do both, no matter whether you count them as one step or more. The same can be said about testing and evaluating your model.

What would you like to achieve (define the goal)

Well, this is the most important part of the process! At least from the business (or problem-solving) point of view. Please do not be angry that I am using the word “business” – someone ultimately has to pay for your work as a data scientist.

IMPORTANT DISCLAIMER – NOT ALL PROBLEMS WILL AND SHOULD BE SOLVED BY MACHINE LEARNING!

Think about it. Imagine you are a car driver and see a traffic light. There are three colors – red, yellow and green. When the light is red you should wait. When the light is yellow you should be careful and not enter the crossroads. When the light is green you can drive (safely). There is a fourth state of the light as well – the red and yellow lights are on together, which means the green light will come on in a moment. The yellow light can also pulse, which is an error state. You also know the sequence: green -> yellow -> red -> red and yellow -> green…

Now try to implement a system that, knowing the light color, tells you whether you can go, prepare to go or wait. Do you need to create a Machine Learning model? Or maybe a neural network? No – the system is just a pretty simple algorithm based on a few rules, as the sketch below shows.
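
A minimal sketch of that rule-based approach (my own illustration, not code from the article) – a handful of if statements is all the traffic-light problem needs:

def traffic_light_action(red: bool, yellow: bool, green: bool) -> str:
    """Map the state of the three lights to a driver action."""
    if red and yellow:
        return "prepare to go"   # green will come on in a moment
    if red:
        return "wait"
    if yellow:
        return "be careful, do not enter the crossroads"
    if green:
        return "go"
    return "error state"         # e.g. no lights at all

print(traffic_light_action(red=True, yellow=True, green=False))  # prepare to go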

Now you know – you would like to solve some problem. The better it is defined, the greater the chances of success. It can be a simple question like:

  • Based on my current medical examination results will I live over 100 years with 95% probability?
  • Is this mushroom edible or poisonous?
  • Is this email a spam?
  • How much is my car worth?

The problem can, however, be more complicated. Look at the picture below and guess which one is a chihuahua and which one is a blueberry muffin.

We as humans can see the difference, but for a program this can be tricky and extremely challenging. What about a task that asks: “Is this body cell malignant or healthy?” Machine Learning helps here.

Prepare the data

Wait! What data? Do I have the data already? You should have! Someone has already defined the problem, and based on that knowledge the data sets should be identified.

You can have files, relational databases, NoSQL databases, graph data… whatever it can be!

Believe it or not, this is where the problems really start! The first question should be – what is the data quality? I have data from my company – can I trust it?

What about public data sources like the ones you can find on the internet? Take a look at the one-minute video I made for you. It is all about public data.

I did not shuffle the deck. The cards were there all the time, just as you saw. Sometimes you need good data and think that a public data set can provide it. With public data you can easily get cheated, and then the quality of the entire process will be very, very low. It does not necessarily have to be that way, but… you know, it can be.

You have successfully gathered the data and now need to do some preparation. I will post not one but many articles about data preparation techniques. There are lots of methods here, but you should know your data set – what is its origin, what information it contains, which attributes are important? If you do not know the data set very well – how do you get to know it better (how do you perform exploratory analysis)? Can we reduce the number of attributes (PCA)? Can we remove some data without creating a skew? Can we introduce new features by combining the existing ones? How do we map string data to numerical data (one-hot encoding)? Should we perform some regularization or data standardization?

Oh boy, so many topics to cover!
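
To give a taste of just one of those techniques, here is a small sketch of one-hot encoding with pandas (my own example; the columns and values are made up):

import pandas as pd

df = pd.DataFrame({
    "color": ["white", "black", "white", "red"],
    "age":   [3, 5, 2, 7],
    "price": [20000, 12000, 25000, 6000],
})

# Map the string column to numerical columns; numeric columns stay as they are.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.head())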

Select an algorithm(s)

Based on the question you have been given (the goal), you should consider not one but several algorithms. There are dozens of them, so how do you pick a good set of algorithms? The simplest approach is to know whether you are doing classification or regression.

Classification is when you assign the output to one of a set of groups – like in our example, an email can be spam or not spam. The classification process takes into account all the input features and decides whether a new email (never seen before) is spam or not.

A regression algorithm predicts (estimates, or guesstimates if you will) a number based on the input features’ values. For example, how much will my car be worth next year if it is now 3 years old, has a 6.8-litre diesel engine and is white (and many more features…)?

Let me name some classification algorithms here so we can play with them later:

  • Naive Bayes
  • k-Nearest Neighbors
  • SVM – Support Vector Machines
  • Decision Trees
  • Random Forest
  • Logistic Regression (yes, it is a classification algorithm)
  • Neural Network – wait! Is it an algorithm?

Here you are – some regression algorithms:

  • Linear Regression Model
  • Lasso Regression
  • Ridge Regression
  • Polynomial Regression
  • ElasticNet Regression

But how do you pick the one?

Build and train the model

You have a data set that contains input features and information about an outcome (an output feature). With this in mind, let’s build a model.

To do so, you need to split your data set into two parts, called the training and testing data sets. Typically the training data set contains 70% of your data and the testing set the remaining part. Of course it is not always like this, and you can assign less data to the training data set, especially if you have a lot of data.

Now you ask yourself – how do you divide the data set correctly? Not in terms of numbers (a 70%-30% split) but in terms of data quality. The good thing is that existing frameworks like scikit-learn help us in many respects. I will concentrate on this part in a later post.
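
As a quick, hedged sketch of what that split looks like in scikit-learn (using a built-in example data set as a stand-in for your own):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # example data, stands in for your own data set

# 70% for training, 30% for testing; stratify keeps the class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(len(X_train), len(X_test))    # 105 45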

Once you have the training and testing data sets, you can choose a model you would like to build. This is really the easy part once you have chosen an algorithm and know a framework like scikit-learn a bit.

Building a model means creating an object of a specific type and feeding it with data from the training data set. Sometimes a model is trained just once and sometimes it is done iteratively, as in the k-fold cross-validation process (more on this later).
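
Continuing the sketch above (logistic regression is used here only as an example of “an object of a specific type”):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)   # create an object of a specific type...
model.fit(X_train, y_train)                 # ...and feed it with the training data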

Test the model (and score it)

Once the model has been trained, you need to test it on data that has never been seen by the model. It is analogous to an exam. You prepare for an exam by studying books or doing research. Then you go to the exam and your knowledge is tested. The result of the test shows how good your knowledge is. The higher the score, the better an expert you are. But if your score is not so good, you need to learn more or change your approach.

You should apply the same process to your model. You need to evaluate it and see whether it can really answer the question you defined in the initial step of this process.
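
In the same sketch, the “exam” is simply scoring the model on the held-out test set:

from sklearn.metrics import accuracy_score

predictions = model.predict(X_test)                    # data the model has never seen
print("accuracy:", accuracy_score(y_test, predictions))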

But what if the model is not working as expected? Then you have two options:

  • run the learning process once again on the same model but tune its parameters (this is called hyperparameter tuning – see the sketch after this list)
  • change the model and find a better one
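
A hedged sketch of the first option, again with scikit-learn – a grid search simply tries every parameter combination using cross-validation (the SVM and the parameter values below are only an example):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)                 # X_train / y_train from the earlier split
print(search.best_params_, search.best_score_)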

In one of the next articles I will show you how to automate this process in Python.

What’s next?

Are you overwhelmed by this post? I will explain all the main steps in the next articles, so do not get confused! You can relax, as we will be using existing frameworks that speed up the Machine Learning & AI steps I have described.

There will be even more new topics to cover, so please stay tuned. For example, I will be discussing (apart from the many other things I have mentioned above):

  • automated Machine Learning in the Microsoft Azure Cloud
  • manual deployments
  • consuming Machine Learning models in the apps

Cheers,
Damian

Originally posted here



Introduction to Machine Learning

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

What is Machine Learning? This story started in the mid-1960s. Scientists and engineers found a lot of problems that were so complicated for traditional algorithms that it was not possible to create a program that could solve them.

Imagine a case where you have an object described by 21 properties (as input), and based on these properties you need to classify it into group A or group B. How would you solve the problem? Would you need to analyze all 21 attributes and, for every possible combination, say whether your object is in group A or B?

21! = 5.1090942e+19 combinations. That is a lot! It means you would need to write something like 5.1090942e+19 instructions, such as IF statements. That is simply not possible!

If you wonder what object I am referring to – here it is: the data set, with a beautiful picture of it below.

This is why people started to think that at some point traditional programming is not enough and this approach will no longer work. We need to use other methods to find insights in the data. There should be new algorithms that can learn from data. That means the algorithms should look for patterns (among other things). This is a boring task, so we (humans) let the machines do it. We (humans) prefer creative tasks.

How much data do we need for Machine Learning? The more the better! The more you have, the more insights you can find. Let’s pause here, as it is not so simple and requires some knowledge of statistics as well. You also need some knowledge of linear algebra, probability theory and optimization methods. I will explain this in the next articles.

How to start with Machine Learning?

Every Machine Learning process is built from several steps like:

  • what would you like to achieve (define the goal)
  • prepare the data
  • select an algorithm(s)
  • build and train the model
  • test the model (and score it)

I will go through all the mentioned steps in detail in the next articles. They are important and you need to understand them very well.

What types of Machine Learning do we have?

There are three types of ML:

  • supervised
  • unsupervised
  • reinforcement

Supervised Machine Learning is when you know your data set – you know the label of each of the attributes in the data set. You know the names and the purpose of each attribute. In the end you can say what the inputs and outputs are – so you know what to look for.

In Unsupervised Machine Learning the data set contains attributes with no labels. The algorithms try to find patterns and knowledge in the data. This might not seem very intuitive right now, but we will go into it as well.

Reinforcement Machine Learning allows algorithms to learn by trial and error. When an algorithm makes an error, it is penalized. When an algorithm does not make errors, it is rewarded. Think about a chess game, the very complex game of Go, or maybe autonomous cars.
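
A rough sketch (my own example, using scikit-learn) contrasting the first two types – supervised learning gets the labels y, unsupervised learning only gets the data X:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

supervised = LogisticRegression(max_iter=1000).fit(X, y)   # learns from labelled examples
unsupervised = KMeans(n_clusters=3, n_init=10).fit(X)      # looks for structure, no labels

print(supervised.predict(X[:3]))    # predicted labels
print(unsupervised.labels_[:3])     # discovered cluster assignments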

Read more here



The tools you should know for the Machine Learning projects

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

I have frequently been asked about tools for Machine Learning projects. There are a lot of them on the market, so in my newest post you will find my view on them. I would like to start my first Machine Learning project, but I do not have the tools. What should I do? What tools could I use?

I will give you some hints and advice based on the toolbox I use. Of course there are more great tools, but you should pick the ones you like. You should also use the tools that make your work productive, which sometimes means you need to pay for them (though not always – I use free tools as well).

The first and most important thing is that there are lots of options! Just pick what works for you!

I have divided this post into several parts: the environments, the languages and the libraries.

THE ENVIRONMENT

The decision about which environment to choose is really fundamental. I have three environments and use them as needed. The first, and the one I like most, is Anaconda. It is an enterprise data science platform with lots of tools, designed for data scientists, IT professionals and business leaders alike. You can configure it for your project so that it contains only the tools and libraries needed. This can make your deployments easier (I am not saying it will be easy).


Anaconda – home page and the tools

Creating an environment is super easy! That is assuming, of course, that you know what you need, but it is also possible to reconfigure the environment later. I think of an environment as a project.


The Anaconda environments

Anaconda also offers shortcuts to the Learning portal, where you can find not only the documentation but also a lot of useful materials like videos and blog posts. This is really a great place to learn how to start working with a tool or to gain more knowledge.


The Anaconda learning

The last thing I would like to show here is the Anaconda Community tab. The community is really what makes our lives easier. You can share thoughts, learn or just ask questions. As a proud member of the #SQLFamily community, I know what I am talking about… The community is the heart of the whole learning process, so do not forget to take part and share your knowledge!


The Anaconda community tab

By the way – you can install Miniconda (a minimal installation of Anaconda) and install everything else from the command line, as I have shown here:

cmd
cd documents
md project_name
cd project_name
conda create --name project_name
activate project_name
conda install --name project_name spyder

What have I done with the code above? I started the cmd tool, then created a new project folder named project_name in the documents folder. Then I created an environment and activated it. The last line shows an example of how to install libraries or tools – here, how to install Spyder.

I use Jupyter Notebook along with other tools (Orange, Spyder, etc.) to do the modelling. The advantage of Jupyter Notebooks over the other tools is that you can write code and immediately run it without compiling anything. Looks great, doesn’t it? That is not all – I always like to document my code, and this is exactly what you can do here. Take a look at the picture below – code and documentation live together peacefully!


Jupyter Notebook in action

Now let’s move on to Visual Studio Code. I have been using Visual Studio since it was first released, so it should be no surprise that Visual Studio Code is my natural choice for many projects, including Machine Learning and AI.

Visual Studio Code is released once a month which makes this product unique.

You can customize your Visual Studio Code the way you need – just install all the extensions and start working with the code.

Visual Studio Code – my installed extensions

But that is not all. With Visual Studio Code you also get a powerful debugger, IntelliSense (!!!!) and built-in Git.

Visual Studio Code IntelliSense for a Machine Learning project

What about the Visual Studio Code community? Yes, there is one! It is also strong, so you will not get lost and can get help if needed.

The last tool I would like to present is Azure Machine Learning Studio. This is a graphical tool and it does not require any programming knowledge at all. You need to log in to the Azure Portal and create a Machine Learning workspace.


Machine Learning Studio Workspace

There is a free version for developers, so you can start immediately. I suggest you start with the examples in the Gallery. Take a look at the one I have just picked and opened in the Studio:


Machine Learning Studio

As you can see, Machine Learning Studio is oriented more to the Machine Learning process (take a look at my recent article) than to coding. Of course, you can add as much code as you wish there as well.

THE LANGUAGES

I prefer to use Python, but the R language is also in scope. What I see is that R is mostly used by people from universities, whilst Python is used by data engineers and programmers. This is how it usually looks, but I am not making any assumptions. Please use the language you like and feel comfortable coding in. I will use both of them on the blog.

Both Python and R are powerful languages. They can easily manipulate data sets and perform complex operations on them.

Wait, do you know any other language that can handle data sets? Yes – good old T-SQL! You should at least know that SQL Server can mix T-SQL, Python and R! You can create powerful Machine Learning and AI solutions using SQL Server, and I will definitely show you how to do this later!

THE LIBRARIES

Now we move to the heart of Machine Learning modelling: the libraries that give you everything you need. You can prepare your data set, clean it, standardize it, perform regularization, pick an algorithm, create training/testing splits, train the model, perform scoring, plot the data and much more…

The decision about which library to use is really important. It is also driven by the language you use, as libraries are not transferable between Python and R.

I am going to describe some well-known (free of charge) libraries below, but we will learn more about them in the next posts, where I will be discussing the code itself.

PANDAS

This is one of the most popular libraries for data loading and preparation. It is frequently used together with scikit-learn. It supports loading data from different sources like SQL databases, flat files (text, CSV, JSON, XML, Excel) and many more. It can do SQL-like operations, for example joining, grouping, aggregating, reshaping, etc. You can also clean the data set, perform transformations and deal with missing values.
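
A quick sketch of the kind of work pandas does (my own example – in practice you would load the data with pd.read_csv, pd.read_sql, pd.read_excel and so on; a tiny inline frame keeps the sketch runnable):

import pandas as pd

df = pd.DataFrame({
    "color": ["white", "black", "white", "red"],
    "age":   [3, 5, None, 7],                        # a missing value to deal with
    "price": [20000, 12000, 25000, 6000],
})

df["age"] = df["age"].fillna(df["age"].median())     # deal with missing values
by_color = df.groupby("color")["price"].mean()       # SQL-like GROUP BY + aggregate
print(by_color)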

NUMPY

This is all about multidimensional arrays and matrices, and it is used for linear algebra operations. It is a core component of both pandas and scikit-learn.

SCIKIT-LEARN

This is one of the most popular libraries today. You can find a lot of both supervised and unsupervised learning algorithms in it, like clustering, linear and logistic regression, gradient boosting, SVM, Naive Bayes, k-means and many more.

It also provides helpful functions for data preprocessing and scoring.

You should not use it for neural networks, as it is designed for classical Machine Learning.

PYTORCH

This is the Deep Learning library built by Facebook. It supports CPU and GPU computation. It can help you solve problems from the Deep Learning area, like medical image analysis, recommender systems, bioinformatics, image restoration, etc.

PyTorch provides features like interactive debugging and dynamic graph definition.

TENSORFLOW

It was built by Google. It is both a Machine Learning and a Deep Learning library. It supports many Machine Learning algorithms for classification and regression analysis. The great benefit is that it also supports Deep Learning tasks.

KERAS

It is a popular high-level Deep Learning library which uses various low-level libraries like TensorFlow, CNTK or Theano as the backend. It should be easier to learn than TensorFlow and can use TensorFlow under the hood (which, for example, PyTorch cannot do).

XGBOOST

This library implements algorithms under the Gradient Boosting framework. It provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.

WEKA

I have used the Weka library in my R code when testing how association rules work. It is a powerful library for data preparation and many types of algorithms, like classification and regression. It can also do clustering and perform visualization.

MATPLOTLIB AND SEABORN

These two libraries are used for data visualization. They are easy to use and help you create both very basic and very complex plots. You do not need to be an artist or a talented coder to make beautiful visualizations anymore.
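
A tiny sketch of what they do (the data below is made up just for illustration):

import matplotlib.pyplot as plt
import seaborn as sns

ages = [1, 2, 3, 4, 5, 6, 7]
prices = [30000, 26000, 22000, 19000, 16000, 14000, 12000]

sns.set_theme()                       # nicer default styling from seaborn
plt.plot(ages, prices, marker="o")
plt.xlabel("car age (years)")
plt.ylabel("price")
plt.title("A very basic plot")
plt.show()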

WHAT ABOUT THE CLOUD SOLUTIONS?

Everything lives in the cloud now. This is also true for Machine Learning solutions. There are many cloud providers you can choose from, but I will be showing most of my cloud solutions on Microsoft Azure. There is everything you need to get started. You can start from scratch and build your solution step by step, having control over everything. But you can also use so-called Automated Machine Learning (yes, I will show you both ways!!!) to concentrate on the solution and not on the infrastructure. Think about how powerful this can be – you develop a model and Azure will deploy it for you – in a containerized solution!

SUMMARY

Now you know the tools – environments, languages and libraries. We can move forward to Machine Learning. The next post will be dedicated to a very simple but powerful example of a Machine Learning solution.

Please let me know if you need me to elaborate on a specific tool more. I will be very happy to do so in one of the further posts!

Originally posted here.



Presentation: Making 'npm install' Safe

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Transcript

Sills: Today I want to talk about making NPM install safe. Code has power, there’s this great quote from Structure and Interpretation of Computer Programs and it says, “In effect, we conjure the spirits of the computer with our spells.” As programmers, we have tremendous power, and this power can be used for good and it can be used for evil. Code can read our files, it can delete our files, it can send all of our data over the network, it can steal our identity, drain our bank account, and much more.

My name is Kate Sills and I work for a startup called Agoric. There are two things that you should know about Agoric. Number one, we’re building a smart contract framework that we hope can handle millions of assets, or millions of dollars in assets. Number two, we’re doing it in JavaScript. This probably sounds like a terrible idea and we’ll talk about how we’re able to do that.

At Agoric, we’re at this intersection of cryptocurrencies and third-party JavaScript code. We use a lot of JavaScript packages, and it turns out that this intersection is just really an A-plus target for attackers. It’s like an Italian-chef-kissing-fingers type of situation. What exactly is going on here? NPM has some great stats, they say that there are over 1.3 billion downloads of NPM packages on an average Tuesday. That’s a lot of downloads. JavaScript has this rich culture of code reuse. Here are some more stats from NPM, there are over 800,000 packages in NPM, making it the largest open source code repository in the world. The average modern web application has over 1,000 dependencies. If we look at something like a create-react-app, which is supposedly bundling together all of the dependencies that you need as a beginning React developer, that has over 1,700 dependencies, so there’s a lot of code reuse going on here.

NPM has this great quote, they say that 97% of the code in a modern web application comes from NPM, and an individual developer is responsible only for the final 3% that makes their application useful and unique. We can think of all the time that’s saved by not having to reinvent the wheel. That’s hundreds of millions of coding hours saved. I think this is a really beautiful example of the kind of cooperation that humankind is capable of. We should be proud, NPM should be proud of everything that they’ve accomplished. We have this rich civilization of code, we have specialization, we have experts. We don’t have to build everything ourselves. As long as we can rely on all of these package developers to be good people, we’re fine. But not everyone is good, not everyone is good all the time, and people make mistakes. What happens when it goes bad?

When It Goes Bad

Using other people’s code is risky, and it’s risky because whatever package we install can do whatever it wants. There are no restrictions and we may not find out what happens until it’s too late. In real life, we also have this rich ecosystem. We’re able to buy and use the things that other people create. We don’t have to grow our own food or sew our own clothes. Imagine if everything that you bought, all of the interactions that you had throughout the day, that coffee that you bought at the airport, imagine if that had the power to completely take over your life. That would be ridiculous, it’d be absurd, but that’s the situation that we’re in right now with JavaScript packages. We can’t safely interact with the things that other people have made without it potentially ruining us.

Let’s look at how authority works in Node.js. I’m going to focus on Node.js for a bit here. By authority, I mean the ability to do things. Authority in Node comes through imports – through require – and it comes through global variables. It turns out that anyone or anything can import modules, especially the Node built-in modules, and they can use global variables. The effects of this use are often opaque to the user. There’s no notification that says, “Alert. Your files are being sent over the network.” It’s all opaque to the user. Imports can happen in dependencies that are many levels deep. This means that all packages are potentially risky, so something that seems like it’s doing something simple and confined could have malicious code that’s actually doing a lot more. Node provides no mechanisms to prevent this kind of access.

To illustrate this, let’s imagine that we’re building a web application and we want to add a little excitement. Let’s say we install this package, add-excitement, that just takes in a string and adds an exclamation point to it. If this notation isn’t familiar, that’s just a template literal, so it’s adding an exclamation point to a string. Our “hello” turns into “hello!”, this is very simple, it’s just string manipulation. It turns out that add-excitement could actually be written like this. We have the same functionality, we’re adding an exclamation point to a string. From the user’s perspective, we get the same effects. “Hello” gives us “hello!” but we are also importing the file system. fs is the file system in Node, and we are also importing https, so both fs and https are Node built-in modules.

What we’re able to do, or what this attacker is able to do, is to say “fs.readFile” and let’s say that’s our bitcoin private key file. It’s able to read that and send it over the network. This code is simplified a little bit, but I think you can see what’s going on here. Just through access to fs and access to https, any attacker through the installation of any package could be able to read all of our data and send all of it over the network. In the cryptocurrency world where we’re dealing with bitcoin wallets that have private keys, this is a big deal. This is a problem.

Let’s go over the steps to read any file. All we have to do is get the user or another package to install our package. Then step two, we have to import the Node built-in module fs. Step three, we have to know, or we can guess, the file path, and there are no penalties for guessing. That’s it, that’s all we have to do. When we look at this, we see that all we had to do was import fs, and then we were able to read the file and send it over the network.

A Pattern of Attacks

This actually is a pattern of attacks. You might have heard of the event-stream incident that happened last November. What happened there was that there was an open source JavaScript developer, Dominic Tarr, who had over a hundred open source packages that he was maintaining. A volunteer came up to him and said, “You know what? I would really like to take over the event-stream package. I’ll take that off your hands.” Dominic Tarr said, “Great. That sounds great.” This volunteer made some changes and he actually added a malicious dependency, and that malicious dependency to the event-stream package targeted a specific cryptocurrency wallet. What it did was that it tried to take the bitcoin private keys, send them over the network, exfiltrate them and ultimately steal bitcoin.

We saw this pattern again just this month. The electron-native-notify package also had a malicious dependency that tried to target cryptocurrency wallets. It added a malicious package as a dependency and it required access to the file system and the network. NPM was able to stop this attack, luckily, but I’m sure there are many attacks like this that we’re going to see in the future, and if we don’t do something to prevent it, we’re going to see this pattern again and again.

Solutions

What are the solutions? One of the solutions is to write everything yourself. Believe it or not, there are actually companies out there that are doing just this in the cryptocurrency world because the stakes are so high. They are writing everything themselves, this is a real solution. Obviously, it’s not a very good one and it’s not very practical for all of us here. It’s not very scalable, if I had to write everything myself, I would not get very much done, and we would lose the millions of coding hours that we’ve saved by reusing other people’s code. I don’t think this is a very practical solution.

Another solution that people have proposed is paying open-source maintainers, so at least there’s someone responsible for the security of the package. I think this is a good solution, I’m not against it. Even when someone is paid to maintain a package, they still may be compromised, people make mistakes. I think we will still see that this attack will happen even if we have people who are responsible for the security of the packages. Then lastly, the solution that people recommend is code audits. Code audits are great, but they’re not a panacea. They don’t solve everything, there are things that code audits miss.

Here’s some code courtesy of David Gilbertson. Does anyone know what this code does? Any guesses? What is this code actually doing? It’s doing a fetch, it’s doing a window.fetch, this is accessing the network. If we were doing a code review, we would probably never guess that. How it’s able to do this is that the “const i” is actually “fetch” shifted over one character. Then self is an alias for window. If we have thousands of dependencies and we’re just trying to do a code audit, maybe we’ve even gotten better and we have manual tools that are grepping for certain things, we would probably never find this. Code audits alone are not going to solve our problem.

If we go back to the steps to read any file, we see that a lot of these solutions are focused on number one. They want us to only install packages that we trust, and they want us to be able to trust the open-source code maintainers. I think this is admirable and I don’t want to stop people from doing this, but I think what we should really be focusing on is step number two and step number three. What would happen if the malicious code tries to import fs and it just can’t? Or, it knows the file path, it knows the file name, but it can’t use it? It just can’t access those files. There’s a great quote from Alan Karp that goes towards this point. He says, “The mistake is in asking, ‘How can we prevent attacks?’ when we should be asking, ‘How can we limit the damage that can be done when an attack succeeds?’” The former assumes infallibility, the latter recognizes that building systems is a human process.

What We Need: Code Isolation

If we were actually able to focus on preventing the import of fs and making sure that even if someone does know the file path or they can guess the file path, that it’s of no use to them, how would we do that? What we actually need is code isolation. It turns out, through an accident of history, JavaScript is especially good at code isolation. JavaScript has a clear separation between pure computation and access to the outside world. If we sever that connection to the outside world, we can actually get rid of a lot of the harmful effects. This is not true of other languages, if you look at something like Java, it doesn’t have this clear separation. As JavaScript developers, we’re actually in a really good place here for being able to do this code isolation.

In JavaScript, we already have the concept of a realm. Each webpage is its own realm, and JavaScript that executes in one webpage can’t affect JavaScript that executes in another. A realm is roughly the environment in which code gets executed. It consists of objects that must exist before code starts running. These are objects like object, object.prototype, array.prototype.push. It also consists of a global object and global scope. Historically, one way that people have isolated third-party code was to isolate it in a same origin, iframe, but using iframe is clunky, and we need to be able to create realms without the overhead of iframes.

This was the concept behind TC39 standards proposal called realms. What if we could actually create these realms without the overhead of the iframe? Without the DOM? What if we could do it in a very lightweight and easy way? Creating a new realm creates a new set of these primordials, these objects that we get at the start of our code running as well as a new global object and global scope. A realm is an almost perfect sandbox, whatever code is isolated in a realm, is isolated from the rest of the world. It has no effect or no ability to cause effects in the world outside itself. Malicious code that runs in a realm can do no damage.

The realms API allows us to create another type of realm, known as a featherweight compartment. Rather than duplicating all of the primordials – array, object, prototype, that sort of thing – a featherweight compartment just shares them, and this makes the compartment much lighter. This is a proposal before TC39, which is the JavaScript standards committee, it’s at stage two. This is the stage at which it’s still a draft. We’re looking for input on the syntax and the semantics and things like that. We’re hoping to push it forward so that it actually gets put into the JavaScript language itself. Even though it’s still at the proposal stage, there’s a realm Shim, and you can use this now. The realm Shim has really been a team effort between the company that I worked for Agoric and Salesforce. Mark Miller, JF [Paradis] and Caridy [Patiño] have been working on this, so you can use this Realm Shim now.

Let’s see if this will work. I want to show what happens when we try to execute that piece of code that we saw earlier, inside a realm. We’ve stringified it, let’s just evaluate it to see what happens. That actually worked fine, it was just that the attacker URL wasn’t defined, but that attack succeeded. Now, if we use realm and we make a compartment, and then we evaluate that code in that compartment, it turns out that self, window, all of those things, they’re just not defined. That fetch just doesn’t work.

We have these featherweight compartments and we’re sharing these primordials, but we still have a problem, because we’re sharing these primordials and there’s a thing called prototype poisoning. What prototype poisoning does is that it actually resets these objects that we get, these primordials, and it sets them to other values. Here’s an example of an attack. Our attacker is taking an array.prototype.map and it’s setting it to something else and it’s saving the original functionality. From the user’s perspective everything looks fine, our array map is still working, no problems here. What it’s actually doing in the background is that in addition to actually doing the mapping, it’s sending all of the data in the array over the network.

Prototype poisoning is a huge problem. To solve the problem of prototype poisoning, there’s a thing called SES or a Secure EcmaScript, and you can think of SES as realms plus transitive freezing or hardening. What this means is that when someone tries to make changes to the primordial, when they try to do the attack that we saw, when they try to set array.prototype.map to something else, it just simply won’t work. This object is frozen, you can’t change the values.

Using SES is very easy, it’s a package right now, so you can do NPM install SES, and then you just require it. You say makeSESRootRealm and then you evaluate the code. You might have noticed in the video that I showed that you actually have to stringify the code. This is a developer ergonomics issue that we’re still working out, but you do have to stringify anything that you want to try to evaluate in a safe way. You might be saying, “Well, that’s great, isolation is great, but my code needs to do things. It needs access to the file system, and it needs access to the network.”

POLA

What do we do when our code actually does need a lot of authority? There’s this thing called POLA, POLA stands for the Principle of Least Authority. You may have also heard this as the Principle of Least Privilege, but POLP doesn’t sound as great, so we’re going to stick with POLA. What POLA means is that we should grant only the authority that is needed and no more. We should eliminate ambient and excess authority.

What does no ambient authority mean? Ambient authority is easy access without any specific grants. In our example with the add-excitement package that we installed, we saw that for access to fs and access to https, because they were built-in node modules, we didn’t have to ask anyone. The attacker could just say import, or require, and it got that authority. We also should not have excess authority; in that example, the add-excitement example, it didn’t need any authority at all. It was taking in a string and it was returning a string. It didn’t need access to anything, but it had access to the file system and the network. So if we were to actually rewrite this under POLA, it wouldn’t have the authority to do any of that.

To illustrate this, let’s use an example, let’s use a command-line to-do app. It’s a very simple, pretty stupid app. What it does is that it adds and displays tasks, these tasks are just saved to a file, and it uses two packages, Chalk and Minimist. You might have heard of them, they’re very widely used JavaScript packages, they have over 25 million downloads each. What Chalk does is that it just adds color to whatever you’re logging to the terminal, and Minimist parses command line arguments.

Here’s an example of how you might use it, I thought it was very simple and pretty stupid. First, we want to add a to-do, pay bills, and then we want to add another to-do, do laundry, and we add a third to-do, pack for QCon. This is priority high, because it’s really important. We’re able to display it, and when we display it through Chalk, we see we have pay bills, do laundry, and pack for QCon is in red because it’s really important. If we analyze the authority, things get a little interesting. Our simple command-line to-do app needs to use fs. It needs to use the file system because it’s saving to the file and reading from the file. That makes sense. It also uses Minimist, and Minimist just takes in arguments. It’s pretty much a pure function, it doesn’t need a lot of authority.

Chalk, on the other hand, does something really interesting, and this is a very widely used NPM package. It uses a package, “supports-color,” which needs to use process and it needs to use OS. Process is a global variable provided by Node, and OS is one of the built-in Node modules. These are both a little dangerous, so let’s look at that. If you have access to process, then you can call process.kill. You can kill any process that you know the process ID of, and you can send various signals as well. That sounds great – this simple package that’s just supposed to turn the color of our logs to green is able to do this. If we have access to OS, we’re also able to set the priority of any process. All we need to do is just know the process ID and we can set the priority. This is crazy, but this is how the JavaScript package ecosystem works right now: things that we think are really simple can be doing all of these things.

If we want to enforce POLA, then we’re going to need to attenuate access, and we want to attenuate access in two ways. We want to attenuate our own access to fs and we want to attenuate Chalk’s access to OS and process. To attenuate our own access to fs, let’s first create a function that just checks the file name. If it’s not the file that we expect, if it’s not the to-do app path, then we’ll throw an error, that’s pretty simple. We’ll use this function when we actually attenuate fs. This is a pattern that you’ll see again and again in this kind of attenuation practice. What it’s doing is it’s taking the original fs from Node and it’s saying, “You know what? We don’t need the whole fs. What we actually need is appendFile and we need createReadStream, we only need those two things. Not only do we only need those two things, we only need them for this one file.”

What we’re able to do is we’re able to create a new object. We harden it using SES. That means that it’s frozen, it can’t be modified. It only has two methods, it has appendFile and it has createReadStream. Within those methods, it checks for the file name. It checks to see if it’s what we expect. Then if it’s not, it throws an error. This is an attenuation pattern that we’re actually able to use on our own access to fs. You might ask, “Why do we actually want to restrict our own access? That seems crazy. I trust myself.” It turns out that people slip up, I may do something that compromises my computer. It’s important to attenuate even your own access to things.

How do we attenuate Chalk’s access to OS and process? First, let’s solve the ambient authority issue. Instead of Chalk or supports-color just being able to import OS and use process, let’s actually pass those in. Let’s make it a functional design here. Let’s change Chalk to take in OS and process, and let’s change supports-color to take in OS and process. It turns out that Chalk only needs OS. This really powerful package only needs OS to know the release, and release is just a function that returns a string identifying the operating system release. The reason why it does this is because there’s a certain color of blue that doesn’t show up well on certain Windows computers. That’s the only reason why it needs this authority. That means that we can attenuate this, and we can reduce it down to a much smaller authority. We can say that we’re going to attenuate OS, we take the original OS module and we create a new object that just gives that string, that just gives the release.

We can do the same thing to process, but it does need a bit more of process. We see the same pattern. We provide it with platform, with versions and so forth. There’s something else interesting that we can do here: we can lie. We don’t have to tell the truth. I have a Mac right now, but we can say that our platform is actually win32 and this dependency won’t know the difference. We can virtualize things, which is another really important pattern in enforcing POLA.

If you’ve liked these patterns, if you’ve liked the attenuation and the virtualization, you might really like a programming paradigm known as object capabilities. The main message in object capabilities is that we don’t separate designation from authority. It’s an access control model that is not identity-based. You can think of object capabilities as a key – versus having a list of people who are approved to do certain things, it’s like a car key.

What’s really great about object capabilities is that it makes it very easy for us to enforce POLA and to do it in creative ways. We can give people just the authority that is needed, and we don’t have to have complicated permission lists, we can just do it with objects. It also makes it really easy to reason about authority, because it turns out that the reference graph – what things have access to what – is the graph of authority, so we can see that if this has no access to that, it doesn’t have that authority, and we’re protected from that. If you’re interested in more on object capabilities, there’s this great post by Chip Morningstar at habitatchronicles.com, I really encourage you to check it out. Object capabilities allow us to do some really cool things in enforcing POLA.

SES as Used Today

I want to go over how SES is used today, it’s a stage two proposal at TC39, so it’s still in progress, but there is a shim for Realms and SES, and people have already started using it. First, Moddable. Moddable is a company that does JavaScript for the internet of things. Your Whirlpool washing machine might have JavaScript running on it, and they have an engine, XS. It’s the only complete EcmaScript 2018 engine optimized for embedded devices. They’re the first engine to implement SES, or Secure EcmaScript. What’s really cool about this is that because they have the security, because they have this code isolation, they can actually allow users to safely install apps written in JavaScript on their IoT devices. You can imagine you have your oven, your washer, your light bulb, and you’re able to control all of those things with code that you write. The manufacturer can let you do that because it’s safe, because they’ve isolated your code and it can only do certain things, like make the light change color. That’s really cool, that’s really exciting.

Another company that has been using Realms and SES is MetaMask. MetaMask is one of the main Ethereum wallets. What they’re able to do is that they’re able to allow you to run Ethereum dapps without having to run an Ethereum full node, and they actually have over 200,000 JavaScript dependencies. You can tell that they’re very eager to be able to isolate this third-party code and make this safe. What they’ve done is that they’ve created a Browserify plugin that puts every dependency into its own SES realm. They also have permissions that are tightly confined to a declarative access file. If you’re interested in Browserify, if you use Browserify, check out Sesify.

Lastly, Salesforce, who has been one of the primary co-authors of Realms and SES, is using SES right now in their Locker Service. This is an ecosystem of over five million developers. Their Locker Service is a plugin platform. You can create third-party apps, and people can install them in Salesforce. It’s really important to them, since this handles a lot of people’s business data, that these third-party apps are safe. They’re using a version of this right now in Locker Service.

I want to be clear about the limitations, this is a work in progress, we’re still solidifying the API, we’re still working on performance. There are definitely developer ergonomics issues, you actually have to stringify the modules to be able to use them right now, we’re working on fixing that. It’s still stage two in the TC39 process, what that means is that SES is able to provide nearly perfect code isolation. It’s scalable, it’s resilient, it doesn’t depend on trust. It doesn’t depend on us trusting these open source developers. It enables object capability patterns like attenuation that I think are very powerful.

Most importantly, it allows us to safely interact with other people’s code, and we could use your help. Here are the three repos that, if you’re interested, you can look at: there’s the Realms proposal at TC39, there’s the Realm shim at Agoric, and SES at Agoric. Please take a look at these things, play around with them. We would appreciate and love your comments, your pull requests, all of that. We could definitely use your help.

Questions and Answers

Participant 1: With content security policy like report only, I can say, here are places where I’ve set a policy, but we’re asking for something that’s beyond, and that’s going to trigger potentially problems down the road. Is there anything with SES where I could say, here are the things that are requesting file system or whatnot?

Sills: Yes. If I understand the question, it’s: is there any way with SES that we can record the intention, or what we’re allowing things to do? It’s still a work in progress, but people have been building tools to be able to do that. There is a tool called TOFU, or Trusted On First Use. What it actually does is it creates a declarative manifest of all of the authorities that all of the packages that you’re using actually use. When something uses fs, it says this package uses fs.

It’s like a package-lock.json or something like that. You can see where, if a package does all of a sudden request more authority, well, first of all, it won’t get it; it’s restricted only to what’s declared in the file. If you do want to try to grant it that authority, then, if you’re doing a pull request or something like that, that’ll show up in the diff and you can say, “Hold on, do we really want this simple string function to have access to this?” People are working on that. It’s still a work in progress, but if you Google SES TOFU, it should come up.

Participant 2: Thanks for your talk. I'm new to JavaScript, so maybe this question won't make sense, but how are some of the key capabilities of SES different from what we already have? I'm used to using Object.freeze to do something similar to harden here, I guess. The second part of my question is, for the particular package that you talked about, Chalk, how do you figure out that the package itself only needs os.release without going through the package's code?

Sills: First, the question was how is SES different from Object.freeze? Object.freeze only freezes that particular object. JavaScript has prototypal inheritance, and Object.freeze doesn't go up the prototype chain to actually freeze any of the primordials or the prototypes, things like that. When we see prototype poisoning, doing Object.freeze doesn't actually prevent that. What you need is transitive freezing: you freeze that object, then freeze everything that it's connected to, and so forth, until you've frozen everything. Can you repeat the second question?

Participant 2: You talked about the Chalk package and that we could prevent access to os.release by just virtually telling it that you're on Win32 or something. How do I know that that's the only thing it needs the OS package for, without going through the code? Would that mean that I need to analyze the entire code of all the packages that I'm importing?

Sills: The question was, how do I know the authorities that the packages are using? Do I have to go through all of the code in that package, or is there a simpler way that I can do that? We're still developing the patterns of use for these kinds of tools. I think this goes back to the TOFU tool that has been built. Let's say that you don't want to spend any time on security whatsoever; you just want to lock things down to where they are, so that when Chalk is using OS, we don't actually give it OS, we just give it os.release. Something like TOFU, something that creates this declarative manifest based on the authorities that your packages are using, would actually be able to attenuate it to that point automatically, which is great.

The other thing that you can do if you want to try to really attenuate things further is that you can just try running it in complete isolation and see what fails. It’s not a perfect solution, it’ll take some time. You may have to rewrite the module, you may have to ask the maintainers to rewrite the module in a way that actually allows you to attenuate it. That is what you can do.

Participant 3: I’m not at JavaScript developer as well, but I have some basic questions. When you said the three things that we can do to prevent malicious code from running on our machine, one was to write code ourselves, others to pay or maintain, and third is to limit permissions. With NPM, how we do NPM install and add a dependency and run? In other language types, for example, Python, we do a PIP install, add a dependency and run. We don’t hear this with Python as much as we hear it with NPM. What is Python or PIP doing in this industry that NPM can?

Sills: The question was, we hear a lot about problems with NPM install, but there are other languages that have other package managers, so why don't we hear about attacks that go through them? I think it's a number of things. The first thing is that those languages and those package managers are not protected. They still have problems; they haven't actually solved the problem. What's more is that, given the language, it may be harder to achieve this kind of code isolation. I mentioned Java, for instance; it doesn't have this clear separation of pure computation and outside effects. It can be harder to achieve that kind of code isolation.

I think the reason why we hear about it so much in the news, why we hear about NPM install attacks so much, is that JavaScript is so widely used. It's the most widely used language, and it has this culture of creating really tiny packages and creating a lot of them. There's a lot of code reuse, which I think is a really good thing; I don't think we should get rid of that practice. The fact that it's so widely used and the fact that it has so many dependencies, I think, contributes.

Participant 4: Was this always possible? Is there nothing additional being done in the industry to prevent malicious code from being [inaudible 00:40:00]?

Sills: Not that I know of, but I am not the subject matter expert on that.

Participant 5: I want to respond to that question for Java. Java does have some protection. Sonatype, which is one of the large binary repositories, monitors Maven Central and other large feeder repositories. It's part of their business model to do it for certain customers, and that winds up protecting the ecosystem. While they can't protect it from getting into the repository, they can protect it from spreading very quickly. I do not work for Sonatype, but it's a nice benefit.

Sills: Thank you.

Participant 6: Do you think that the security will be pushed to the package maintainers, to be able to say, "My package requires only these few things, and you can see that when you're installing"? Or do you think it will be on the consumer to say, "These are the only things I want to allow," like an SELinux kind of thing? Or do you think it could be a combination of both going forward?

Sills: I actually think it will be both. There's this quote that says – I'm going to butcher it – someone who can look into the future and see cars and all those things, that's great; but someone who can see into the future and see traffic is really looking ahead. I think you're really looking ahead here. I think there's going to be this burden on the package maintainers to explain the authorities that they're using, and it's going to be seen as really sloppy, just really gross, to be doing something simple and to be using all of these authorities and allowing that kind of access, the ambient and excess authority that I was talking about.

I think we’re going to see a programming practice come to pass, where if you’re not doing this kind of attenuation, if you’re not being careful about the authorities that you’re using, then you’re going to be a bad developer. I think that’s going to be driven by usage. We could imagine a world in which NPM highlights what kinds of authorities are being used. If you have a choice between two packages as a user, and one of them takes all of these authorities and then other doesn’t, you’re going to choose the one that’s safer. I think different code repositories, whether it’s NPM or GitHub, or what have you, they can surface the information about the authorities that are being used and that will help us make those decisions and push the process forward.

Participant 7: You answered my question with the previous question, but just elaborating on that: our web stack and our team are very Angular-focused. Do you see any potential for this type of security to be bolted into web frameworks like that?

Sills: The question was, will this kind of security be bolted onto web frameworks, like Angular or React, that sort of thing? I haven't seen development towards that point, but I hope it will happen. If we see this pattern in the greater JavaScript ecosystem, then anywhere there's the use of third-party code, whether it's in a spreadsheet that just uses JavaScript for manipulating cells, or a plugin architecture like Salesforce's Locker Service, anything that could possibly use packages or third-party code, we're going to see a push towards practices like this. I hope so.

Participant 8: I tried Googling SES TOFU and it didn’t work. I’m guessing I spelled it wrong. Can you either spell it or add more words?

Sills: Let’s see. You probably did spell it right. It’s just TOFU. I can try to provide a link in the slides afterwards.

Participant 9: I saw that recently classes were added to JavaScript and TC39 approved it. Does that help with the abstraction and maybe writing secure code? Do you have any opinion on that?

Sills: The question was that classes have been added to the JavaScript standard recently, and does that help with code security? It's funny that you should mention that. Our chief scientist at Agoric, Mark Miller, who was previously at Google, is on the TC39 standards committee, and Agoric is part of the TC39 standards committee. Mark really hates class, for some reason. I'm not exactly sure why, but I can tell you that from his perspective, it's not helping security-wise. I wish I could give a better answer, but we'll have to ask him.

Participant 10: It’s still prototype within the [inaudible 00:45:09].

Sills: It’s still prototyped under the hood so that explains it. You could do something like prototype poisoning through class. It’s obscuring certain attacks that could happen.

See more presentations with transcripts



Presentation: Driving Technology Transformation at @WeWork

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

[Note: please be advised that this transcript contains strong language]

Transcript

Haas: My name is Hugo Haas, I am a fellow engineer at WeWork. I work on developer platform which is all the infrastructure and tooling that WeWork software is built upon. In my career I’ve worked on a lot of large distributed systems at different companies before joining WeWork. What I wanted to tell you about today is why technology is important to WeWork, and I’m realizing that this may not be completely obvious, why we’re going through a transformation, and as Randy hinted, talk to you about a couple of problems concretely that we have because of our journey, because of our business, and how we’re tackling them.

Let’s dive in, and I need to say just a few words about WeWork to level set. A lot of people have heard about WeWork. There’s a good chance that when you think about WeWork, you think shared office space, startups, freelancers. It turns out that – and it’s important background for what I’m going to be saying next – WeWork is a lot more than that. First, we are a global platform for organizations of all sizes. We do have lots of customers who are freelancers and small businesses. We also have large enterprises as our customers, and we don’t normally offer space to those customers, to our members, we also build a community, put them all in touch, and we offer services to them. That’s starting to hint why we would need some technology to do this.

A couple of numbers that I also wanted to start this talk with are about our scale today. We have 466,000 members globally, 485 physical locations, and those locations are in 105 cities in 28 different countries. It's a fairly broad footprint in multiple countries. Finally, engineering: we are now roughly 1,000 people in technology in four different hubs globally. This has grown quite a bit over the past few years.

Why a Technology Transformation?

With all that said, let me start talking about why a technology transformation. The first thing to understand about technology at WeWork is that we have a very broad set of problems that we solve for. I’m going to list a few things, real estate management, efficiently sourcing where we’re going to be placing our new locations, how to price them.

I’ll give you an example, there are right now 60 WeWork locations, in New York City. Should we open a 61st? Yes? No? If so, where? How should it be priced so that obviously the company makes money, but at the same time, we attract customers, and when we open the doors, we already have a fair amount of booking in them, and the offices are not empty? That requires gathering and crushing quite a bit of information to do that type of stuff. Then there’s obviously technology to help with sales, billing. I talked about community management. That was the second thing that was mentioned in tools for the community.

One of the things that we do is that we learn about our members and we help them get in touch with one another. For example, you may be in a WeWork office, need help with some logo design, and it may turn out that down the hall or in another office two blocks down, there’s somebody who does freelance design work, and so we can help with those connections, and our members really like that type of stuff.

As we also do business with enterprises, we offer insights about how employees use our space and are able to surface what works, what doesn't work, and some inefficiencies. There is actually quite a lot going on when you think about what type of technology WeWork has.

The last aspect of our journey, and why I'm going to be talking about this transformation, is our growth through the years. WeWork has been in existence for about nine years. If you look at those numbers, and you look at the locations number and the memberships number, every year those numbers have doubled. What that means is that whatever solution you were using two years ago to figure out where you wanted to put new locations, that solution from two years ago may not be working so well anymore, we'll say. Some of the things that we used to do initially very manually with spreadsheets, we have really needed to move away from.

Here I feel I need to pause a little bit. I talked about coming up from very few locations, spreadsheets, doing things manually. Who here has been in a company which started with a business and then they realized, "Oh, shit, we need to do more technology and we need to become a technology company"? A fair amount of people have done this, and yes, this is the situation in which WeWork ended up. I like to think of WeWork not that long ago as the Wild West. Each of those domains that I described evolved pretty independently, worked a little bit like their own startups, made their own choices, and we ended up with very different decisions; I would say, a balkanized set of technologies.

The other piece of this puzzle, in terms of being a technology organization versus not, is that initially, technology wasn't the point for the company. We were providing office space, and yes, we needed to help with room booking. We needed to put people in touch, but the primary thing here was, "Let's find an office and let's open a new WeWork office." What that means is that operational excellence, engineering excellence, was not a primary concern for where we started. If you now look at what I'm talking about in terms of the scale of the real estate sourcing, workplace insights, these are some real engineering problems and we really have needed to step up our game.

In order to tell you more about how we've done this, I'm going to be using an overly simplified initial stack, and I'm going to tell you essentially three stories about how we're evolving things. When we started, we basically ran in a public cloud, U.S.-based; some stuff was running straight on AWS, other applications on top of Heroku. They were each a little bit in their own silo, one app talking to its own database, and they were doing their own thing. I mentioned each of those made their own decisions, and I'll use here the example of observability, how they were doing logging and metrics collection. They all chose different solutions.

One of the problems that came out of this, in addition to pure vendor management, is that if you are trying to figure out what's going on across multiple systems, you need to figure out where everything lands and try to join all of it. That was an interesting problem. The second one is that, as we realized, "We cannot do everything by hand and it's time to gather all of our data," the company naturally evolved to using a warehouse, taking all this data and putting it into the warehouse to do analysis on it. But the data hadn't been designed to join nicely, and extracting insights from it was not straightforward.

The three things I wanted to cover today: I'm going to tell you about the role of developer platform, the group that I work on, about how we are consolidating and unifying some of the solutions that we're using, and how we are measuring the progress we're making. I'm going to be talking to you about how we are dealing with all of our data and moving from thinking about data after the fact to thinking KPI first. The last thing that I'll talk about is how we're moving away from a pure U.S.-based cloud footprint.

Developer Platform as a Transformation

Let’s dive in with developer platform. I talked about all those domains, and very similar to Julia Grace this morning who said that when she joined Slack, there was no infrastructure group, there was no developer platform group. Developer platform is under all of this, and very similarly, about two years ago, this is about when the developer platform group showed up. We had to go and look at all those different solutions and how we wanted to consolidate all of those. The mission that we decided to set for ourselves, the vision, was and still is, enable WeWork to build better product faster. I’m going to be talking a little bit more about what that means and how we’re looking at this.

The way we thought about doing this was, "We have a lot of different tools and it's very hard for an engineer to reason about all of this, or even know everything that exists, so we're going to provide tooling and infrastructure to enable our engineers to refocus on their domain problems." I also talked about our lack of an engineering excellence or operations excellence culture, and we thought, "What we're going to do is that, as we choose those solutions, we are going to either enforce some behaviors or put in good defaults so that this changes." One example is that we're working on letting teams define service level objectives the same way that you define your service. As you deploy something on our platform, you then have some alerting based on some goals that you have. All of a sudden, the developers are much closer to production and what's happening in the real world.

In terms of the types of solutions that we offer, we cover everything from developer experience, testing, CI, infrastructure, data platform, to observability. I wanted to highlight a few of the choices that we've made here, and I'm going to be talking about a couple of those pieces in more detail. We have gone down the GitOps road, and everything, whether obviously, [inaudible 00:13:45] code changes, but also deployment configuration changes, happens through git commits, with CI/CD building and sending the artifacts to our runtime platform. I talked about the infrastructure, so compute, storage, and our data platform on top. I'm going to be talking about compute and data platform in more detail in the next couple of sections.

Then, providing out-of-the-box observability so that our engineers are part of the loop, they really own what is running in production, the DevOps model. That’s the pattern that we’ve been using to build our platform and continue down our journey.

I wanted to go back to one thing that I said earlier, which was this goal that we set for ourselves, enabling WeWork to build better product faster. We talked quite a bit about what does that mean and how to measure it. We thought about it in terms of iterations, especially a lean startup model and measuring. What we ended up doing actually, is leveraging some research that was published fairly recently in the Accelerate book about building and scaling high-performance technology organizations. I’m not going to go into a lot of details about what this book says, and I’ll just jump straight to the conclusions from the book. I do encourage you to read the book. It’s very interesting and eye-opening.

If you think about developer platform, I said that we were providing tooling from developer experience to runtime, to observability and monitoring. Basically, we are providing tooling for the entire development life cycle, and that gives us the opportunity to instrument all of our pieces and get some insights for each of the projects and each of the applications that run on top of the developer platform. This is the path we are going down: extracting engineering excellence metrics out of the developer platform for each of those applications and reporting them to the teams. I'll jump straight to the conclusion of the book, spoiler alert, but I still recommend reading the book.

There are four metrics that have been discovered to measure how well you do as a high-performing technology organization. Lead time for change, meaning how long it takes between the time you commit code until it's running in production. You can just think about this as, "If there's a lot of manual QA, this is going to be long." Deployment frequency, how often you deploy software. Here you can think of it as: if you deploy less frequently, either you just do fewer changes or you batch them all up more, and that probably brings some bigger risks of breakages.

Time to restore service: if something bad happens, how long will it take for you to recover from it? This touches on things like monitoring, and then how fast you can actually diagnose the problem and launch a fixed version. Change failure rate, which is how often you released something and it actually broke.

These were the four metrics that appear to be really good indicators of a high performing technology organization. We are instrumenting our developer platform to be able to track all this, and obviously, developer platform helps with some of those. How teams also behave and perform is another factor, but the goal is to improve on all of those four.

Data Ecosystem in a Fast-Paced Environment

I’ve been talking about developer platform as a whole and a journey that we’re on. I said I would touch on data platform and compute as a couple of examples of choices that we’re making. Let’s go and start with the data platform piece. I call this data ecosystem in a fast-paced environment. If you think back to what I described earlier, the situation where we landed a few years back was everybody developed their own application. They had their little database, or big database depending on the service, and then we after the fact, took all of this on the warehouse and did some processing there. There were some problems with this.

How do we fix this? We’re working on building a data platform so that we can think about KPIs first, what are the events that we care about, capturing those events, processing them, deriving insights from them, maybe near real-time, maybe after the fact. This is what the data platform does, and I’m going to go through some pieces at a fairly high level of the data platform that we are building and some of the choices that we’re making. First step, capturing the events, and we have this event collector API, which is a fairly thin API on top of Kafka. The second building block is storing those events after you’ve collected them. There are two types of storage. One is in motion, and this is where we use Kafka. Then it’s at rest, that can be either in a data lake or in a warehouse. I put an asterisk on things that we are still discussing here, we’re looking at things like Delta Lake, Iceberg.

Now that we have the data somewhere, either in motion or stored, we need to process it with some compute. There are three types of compute here: stream processing, batch processing, and a SQL interface. Lots of discussions are going on about which ones we're going to do. Finally, consuming all of this either through exports or through some querying interface or visualization interface. On top of all this, we have a scheduler, and we use Airflow so that we can trigger jobs, and trigger some other jobs in case there are some dependencies there. We built all of this – or we're in the process of building all of this, I should say – as a good foundation so that we can say, "Hey, these are the metrics that I want to keep track of and I want to have more visibility into everything that's going on, either in real-time or analyzing a whole lot more data than initially."

This is all well and good, but back to where we started, which is every domain does its own thing and everything lands on the warehouse, this doesn’t prevent you from getting into a similar problem where there’s lots of data that flows through this, but you don’t know exactly what it is, where it’s coming from, if it’s going to disappear because it was something from a researcher. One thing that we are doing on top which is different from your traditional just pure data platform solution, is focusing on metadata management to keep track of all of this. I’m going to talk about how we do this.

First, I’m going to be talking about why. Hopefully, I talked about what can happen if you don’t have this. Our motivations here is that we really want to move to a data-driven culture. To do that, we want to democratize the use of data. Part of democratizing the data is making it super easy for everybody to use, and that’s the self-service aspect, and also decentralize it so that you don’t have the data platform folks who are a bottleneck in the middle.

The second piece is building trust in data. I'm sure that this has happened to some folks: you build some processing on top of some data, and then the data schema changes, and you didn't own the data that you were working on top of, so everything breaks on your end, and all of a sudden it's, "Well, my stuff is completely broken." Those are the types of scenarios that we want to avoid. We want to be able to provide some guarantees around the quality of the data, so that I can independently know that I'm working on this dataset and this is its schema, and I can build on top of the schema and we'll be good going forward. The last piece of the puzzle is that I talked about knowing where the data is coming from, and it's important to give context to the data. Who created this data? Where is this data coming from? Is it production quality or research? How fresh is it? How often does it get refreshed? There are lots of aspects that need to be surfaced here in order to provide all this.

In order to do this, we’ve built this metadata service which sits at the top, which is called Marquez, and there are both an API and a UI. The idea here is that you have stuff going on in your data platform, batch processing, stream processing, you’re just collecting events. You’re recording all of this, the metadata about, “What’s the schema? Who are you?” Marquez takes all the records of all of these, and then with the UI as a human, you can see what’s going on and sift through a directory and get some visibility about everything that is in your data platform. One thing which is interesting here is that going down this path, we’re working on tagging of specific attributes in data sets so that it allows you to keep track of things like PII and GDPR compliance type of things.

From the way we’re talking about it, you know that I haven’t called out any specific technology here, we’re building it in a very modular way. Whatever technology you’re using, you can build a module that registers a data set or a job that is running, and Marquez will happily keep track of it. One thing that we’ve done differently from a lot of different projects, is that we’ve been doing this open source from day one. If you go to github.com/MarquezProject, you will find the core metadata service with the UI and the API. You will find Java and Python clients. I mentioned that we use Airflow, you will find a module, a library, that allows you to register metadata about all the Airflow DAGs that you have.

If you’re interested, we’ve been participating in contributing to this, obviously. Stitch Fix has also been participating in it, and if you have an interest, I strongly encourage you to go and check it out and play with it. That’s how, in addition to stepping up how we process and store data, we are taking care of not going back to being the Wild West and keeping it healthy and well-organized.

Infrastructure Needs for WeWork’s Footprint

The second thing that I wanted to touch on, which is even more specific to WeWork, are our needs around the infrastructure that are linked to our footprint. I want to go back to the slide where I showed where we have all of our locations. As you can see, we are in twenty-eight countries, we’re in lots of different places. There are a couple of callouts that I want to make. First, we have offices in Australia, we’ve announced an office in South Africa. These are two places which are very far from a U.S.-based cloud footprint. I don’t know if anybody here has dealt with serving media to Australia and dealt with the latency and bandwidth issues. There are some significant problems around serving places far away like these.

Our second callout is China. We are currently in China, and I'm sure that there are people in the room who have dealt with looking at bringing their service into China; the tech stack and the vendors there are different. When we think about providing our services, we think about providing them globally the same to everybody in the world, which means all those little dots need to have a similar experience with the same type of services. This is not completely novel. I'm sure there are lots of companies here who have global concerns. Why would you want to go from just a U.S.-based footprint to a more global, international one? Because of availability concerns.

Sometimes it's the quality of fiber connections between continents and maybe flaky performance issues; sometimes you have data residency problems, where the data needs to live in a particular country. I started by saying that we began with a U.S.-based footprint. This forces us to go to a global footprint. What that means is that as we look into all of those different geos, using a single cloud vendor becomes a problem, because there are places where you cannot use the same vendor as another place, and China is a prime example of that.

How are we approaching this problem? I talked earlier about developer platform providing this compute using containerization and providing this compute box. We created this thing called WeK8s, and I'm sure that if you have a Kubernetes project, you may have a similar naming convention of some kind; I've seen it in a couple of companies. WeK8s is our managed Kubernetes offering, and what we're providing with it is Kubernetes in order to schedule containers, Helm in order to define packages and what you're going to be deploying, and a number of additional services on top: currently secrets management with Vault, service mesh with Istio, and for observability we use Grafana and Prometheus. With those building blocks running on WeK8s, you can start running most of our applications. I am not covering the storage aspects here, but focusing only on the compute piece.

We have our engineers deploy their software, currently using Argo CD, as applications on top of WeK8s, and as far as they're concerned, they are deploying on top of Kubernetes using Helm. What's under WeK8s? WeK8s and Kubernetes are just abstracting whatever is under them, and if a cloud provider has a managed Kubernetes offering, we can essentially use it, whether it's AWS, GCP, [inaudible 00:33:31] cloud, run WeK8s on top, and provide configuration in order to make all the rest work well. The engineers don't need to know about the bottom layer. Another benefit is that if you go on your laptop with the Docker Desktop application, you can also bring up a Kubernetes cluster, so we can also run WeK8s on developers' desktops. That's our approach to multi-cloud.

Randy hinted at multi-cloud and hybrid. Let me talk about a second thing, which is even more specific to WeWork. WeWork is not just a cloud company; we have offices. These are hard things that you can touch, and we have problems to solve inside those offices. I want to use one business problem specifically to illustrate the type of concerns that we have. I'm going to be talking about space access. A couple of numbers I'm going to repeat: 466,000 members and 485 locations. The bottom picture shows a WeWork card, this black card, that allows you to get into a WeWork office. Some of our members have access to one building; other members have global access and have access to all of the buildings. The bottom line is that that's a lot of cards to have recognized by a whole lot of badge readers around the world, and this is stretching the state of the art in terms of scale.

The second example that I wanted to give is, Meetup is a WeWork company. Meetups are typically happening after hours, and it won’t be a big surprise if I said that there are a fair amount of meetups that happen in WeWork offices. The interesting thing here is that it’s after-hours and typically community managers are gone after 5:00, so an interesting question is, “How do you get into the building?” You can always have the organizer hold the door, etc., and let people in, but wouldn’t it be nice if in your application you would say, “Hey I’m going to attend this this meetup,” the organizer says, “Yes. You are in.” Then you can just use this app and for the three hours that the meetup is scheduled for, you can just get into the building, get into the room, and then you’re just locked out. That type of dynamic access again is stretching the state of the art, especially if you think about the scale at which we need to do it.

These are the types of problems that we need to solve, and you could think about doing this with a cloud-only solution. One of the problems here is that we've all been in the office and lost the internet. It tends to suck: you can't do email, you can't do Slack. These days, everything is in the cloud, so when you lose the internet, things typically grind to a halt. Imagine if all of your badge readers are being driven from the internet, and all of a sudden you cannot get in and out of your office. We're trying to avoid that type of problem. This is what is leading us towards bringing some logic, compute, and storage into buildings. This is an example of a growing number of use cases that we have, and a little bit similarly to why we're looking at a global footprint, we may need to move some of this processing and storage onsite because of availability concerns, latency concerns, and bandwidth concerns.

We do have a few challenges here. Number one, these are offices, which means these are not data centers. We have IT closets, and it's not like tons of racks and great cooling; there are space and cooling issues that we need to consider. The second problem that we may be facing is that there is no onsite technician. In our buildings we have community managers, and those community managers do a great job at managing the office, being good hosts, connecting people, but they are not going to log onto servers if something goes wrong, or respond to, "Can you open up this unit and see if you could replace the hard drive?" We need to think about how we handle this problem.

The last piece of this puzzle is that, if you think about the scale of this, you're going from a handful or a couple of handfuls of public cloud locations globally to one of those clusters in each of the buildings. All of a sudden, the scale of your deployment goes up one or two orders of magnitude. This also brings interesting challenges.

How are we doing this? The beauty of WeK8s and using Kubernetes as this abstraction is that we're also working on bringing in a third type of substrate, which is an on-prem substrate that we can run on our hardware in IT closets. I mentioned all of our problems, or potential challenges, with the fact that we don't have technicians. As we are doing this, we're thinking very carefully about automating everything, such that you take one of those computers, bring it into an office, and it can get imaged from the cloud completely automatically with PXE booting, and if something goes wrong, we can just reimage it. The type of maintenance that would need to be done would probably be either something we can recover from, or we just swap the machine. That's how we're thinking about our hybrid infrastructure, and how our footprint is very different from that of a purely cloud-based company.

Takeaways

Conclusions, a few takeaways. We see developer platform as a cornerstone of our technology transformation. We are providing solutions to our engineers to simplify their lives, and this is also how we're measuring that we are making progress. With regard to managing our data and all the data that flows into the data platform, we are building a metadata service, Marquez, in an open-source fashion. The final piece of the puzzle that I presented today is that we are moving from a U.S.-based infrastructure to a global and hybrid infrastructure to run all of our applications. A lot of this is still early days, and these are interesting problems. If you get excited, we are hiring in all of our hubs, and I will be happy to stop here and hear the questions that you guys have.

Questions and Answers

Participant 1: You talked about your application infrastructure being Kubernetes based and the whole platform put together. My question would be, you’re deploying it in different environments – do you use any of the higher-level cloud services, like inter-process communication, managed databases, and so forth? That’s one question. The second is you’ve got a very large data environment that you’re building out, and yet you have many clouds in different places. Do you bring all that data back to one place, or are they analyzed in a regional location?

Haas: These are great questions, I’m not going to have very good answers yet. One of the reasons is that we are in the process of going there. A couple of things that we have, we have multiple clouds today because we are in China and in the U.S., so we have at least two clouds. We do join some of this data. Your first question was around, how do we keep all the data in sync and managed databases.

Participant 1: Just leveraging the other cloud services. Inter-process communication, managed databases, notifications.

Haas: Right now a lot of this is work in progress. Applications tend to run right now scoped to a single cloud/location. In terms of alerting, we do bring all of this back to a single and central alerting and observability solution that provides us with a global view.

Participant 2: The engineering excellence metrics, what is their use? Who’s responsible for understanding and reacting to them? Is it the platform team, the individual development teams, the whole organization?

Haas: I’m going to give two answers here. We are currently rolling them out. We haven’t gotten through the process of how to act on them. Right now we’re just trying to surface them because in a lot of places we were driving blind. To answer your question maybe slightly differently, these are quantitative signals that we have. In addition to those metrics, we’re actually rolling out to qualitative assessment by each of the scrum teams, each with a framework – we call the level of framework – which allows them to reflect on, “How am I doing on development? How am I doing on testing? How am I doing on instrumentation?” whatever it is. Here, it helps them reflect, and we also have a broad view of how our organization is going.

Participant 3: For those metrics, do you have any tools that you can share with us that help you to gather the data? For instance, the cycle time?

Haas: You’re talking about the engineering excellence metrics. We are instrumenting a few things. Basically, the idea here is to be able to tie what’s going on in Github with what’s going down the CI pipeline, then lending on Kubernetes which is really when something is up in production. We’re taking this and instrumenting CircleCI Kubernetes to get signals when something gets deployed, when something gets built, and keep track of those. That only covers part of the problems, because that gives you the deployment frequency. Based on the commit and what’s in the repository, you can know how long the lead time was.

That doesn’t help you with the problems of change, failure rate, for that part we are going to need to use GRI. We use GRI internally, and through some work Randy’s doing around incident management, are able to tie some of the deployments to incidents that we’ve had and how fast we’re recovering. Some are easier and will lead a stronger signal, and others we need to figure out, and it’s a little bit of having the right processes in place so that we can get these data.

See more presentations with transcripts



Calibrated Quantum Mesh – Better Than Deep Learning for Natural Language Search

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

Summary:  Move over RNN/LSTM, there’s a new algorithm called Calibrated Quantum Mesh that promises to bring new levels of accuracy to natural language search and without labeled training data.

 

There’s a brand new algorithm for natural language search (NLS) and natural language understanding (NLU) that not only outperforms traditional RNN/LSTM or even CNN algos, but is self-training and doesn’t require labeled training data.  Sounds almost too good to be true but the original results are quite impressive.

Calibrated Quantum Mesh (CQM) is the handiwork of Praful Krishna and his team at Coseer in the Bay Area.  While the company is still small, they have been working with several Fortune 500 companies and have started making the rounds of the technical conferences.

Here’s where they hope to shine:

Accuracy:  According to Krishna, the average NLS feature in the ever more important chatbot is typically only about 70% accurate.  The Coseer initial applications have achieved accuracy above 95% in returning the correct meaningful information.  No keywords required.

No Labeled Training Data Required:  We're all aware that labeled training data is the cost and time sink that limits the accuracy of our chatbots.  A few years back, M.D. Anderson abandoned its expensive and longtime experiment with IBM Watson for oncology because of accuracy.  What was holding back accuracy was the need for very skilled cancer researchers to annotate the documents in the corpus instead of tending to their research.

Speed of Implementation:  Without training data, Coseer says most implementations can be up within 4 to 12 weeks, with the user exposing the system to their in-house documents on top of the pretrained system.

Also different from the current major providers using traditional deep learning algos, Coseer chooses to implement either on-prem or with private clouds for data security.  All of the 'evidence' used to come to any conclusion is stored in a log that can be used to demonstrate transparency and compliance with data security regulations like GDPR.

 

How Does This Work

Coseer talks about three principles that define CQM:

  1. Words (variables) have different meanings. Consider "report", which can be either noun or verb. Or "poor", which can mean "having little money" or "of substandard quality", or its homonym "pour", itself either noun or verb.  Deep learning solutions including RNN/LSTM or even CNN for text can look only so far forward or back to determine 'context'.  Coseer allows for all possible meanings of a word and applies a statistical likelihood to each based on the entire document or corpus.  The use of the term 'quantum' in this case relates only to the possibility of multiple meanings, not to the more exotic superposition of quantum computing.

  2. Everything is Correlated in a Mesh of Meanings: Extracting from all the available words (variables) all of their possible relationships is the second principle. CQM creates a mesh of possible meanings among which the real meaning will be found.  Using this approach allows the identification of much broader interconnections between previous or following phrases than traditional DL can provide.  Although the number of words may be limited, their interrelationships may number in the hundreds of thousands.

  3. Use All Available Information Sequentially to Converge the Mesh to a Single Meaning: This process of calibration rapidly identifies missing words or concepts and enables very fast and accurate training. CQM models use training data, contextual data, reference data, and other facts known about the problem to define these calibrating data layers.

Unfortunately, Coseer has released very little in the public domain to explain the technical aspects of the algorithm.  Based on the repeated references to 'relationships' and 'nodes', we can probably infer this is a graph DB application, and I would bet it uses a DNN architecture to work through all the permutations in a reasonable amount of time.

Any breakthrough in eliminating labeled training data is to be applauded and certainly the increased accuracy will result in many more happy customers using your chatbot.

Other articles by Bill Vorhies

 

About the author:  Bill is Contributing Editor for Data Science Central.  Bill is also President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist since 2001.  His articles have been read more than 2 million times.

He can be reached at:

Bill@DataScienceCentral.com or Bill@Data-Magnum.com

 

 



Natural Language Processing

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

Deploying Natural Language Processing for Product Reviews

Introduction

We have data all around us, and it comes in two forms, namely tabular and text. With good statistical tools, tabular data has a lot to convey. But it is really hard to get something out of text, especially natural language text. So what is natural language? We humans have a very complex language, and natural language is the form of human language as it is actually spoken or written, often without strict regard for grammatical rules. One of the best examples of where you can find this language is in reviews. You write a review mainly for two reasons: either you are very happy with the product or very disappointed with it, and, with your reviews and a machine learning algorithm, entities like Amazon can figure out whether the product they are selling is good or bad. Depending on the results of the analysis of the reviews, they can make further decisions about that product.

Consider an example where we have thousands of comments in the reviews of a product sold by Amazon. It is very time-consuming and hard for an individual to sit back, read all the thousands of comments, and make a decision about whether people like the product or not. We can simply make machines do this work by implementing a machine learning algorithm from which machines can learn natural human language and make decisions. One approach is for the machine to "bag" words of basically two types, namely positive and negative words. After bagging these words, the machine applies a majority rule to identify whether there are more positive or negative words, thus helping us identify the acceptance of the product sold.

The technique described above is called the "Bag of Words (BoW)" Natural Language Processing (NLP) machine learning approach, which is what we are going to learn today; it could also have been the complex name of this blog, but phew, I managed with a better one. There are other techniques too, such as word embedding NLP, and there might be more, but as far as I know these are the two main ones. Today we are focusing only on the BoW type of NLP in this article.

BoW NLP Implementation Road Map

Implementing this algorithm is a little complex and can be confusing, as it is longer than the usual algorithm process. So, below is a block diagram, or road map, to give you an idea of the steps to implement this algorithm and finally make predictions with it. You can always fall back on this diagram if you are confused about why a particular thing is being done!


Flow chart of implementing BoW NLP algorithm

For machines to learn our natural language, the data needs to be provided in a certain format. Thus, we need to do data pre-processing so that we present the data in the way the machine expects. In pre-processing, we have to make sure the data is in strictly two columns: one for Rating and the other for RatingText. Rating is either 1 or 2 (negative and positive respectively), and RatingText is the corresponding rating text or review. RatingText should have no unusual spacing and no punctuation. Below are two snippets showing the data before and after pre-processing so that you understand better.


Data before pre-processing


Data after pre-processing

The next step is creating the corpus (bag of words), which can easily be done by a function in R (addressed later in the listings). From this corpus, we then create a Document Term Matrix (DTM), a matrix that records the number of times each word appears in each document of the corpus. Once we have the document term matrix, we then split our data into training and testing sets.

The last step is implementing our model. Here we deploy the Naive Bayes model; once the model is fitted on the training data set, we make predictions on the test data set using the model created. In the end, we compute the accuracy of our predictions to estimate how well our model performed.

Model Data

The data is the Amazon review data set, an open data set made available by Xiang Zhang. It was constructed with a review score of 1 for negative and 2 for positive ratings, and it has in total 1,800,000 training and 200,000 testing samples. Don't worry, we are using just 1,000 of the instances to avoid computational complexity. The data is available for download at the link below:

https://drive.google.com/file/d/1SncEr1yOBW4OmxWIYaKsNw_MS1Rh8S8j/view?usp=sharing

Once you have the file downloaded, just make sure to convert it into .csv format so that you can use it for analysis in R.

Model

Pre-processing Data

First things first: as mentioned before, we will begin with our data pre-processing. We will load our data first, then convert it from a character string into a data frame, and then separate the data into two columns, viz. "Rating" and "RatingText". We can do the separation using an R function from the library "tidyr". Below are the listings.


Listing for loading 1001 lines of Amazon Review Data set
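
The original listing is shown as a screenshot; here is a minimal sketch of what it likely does, assuming the downloaded file has been saved as train.csv in the working directory (the file name is an assumption):

# Read the first 1001 lines of the Amazon review file as character strings
raw_lines <- readLines("train.csv", n = 1001)

# Wrap the character vector in a one-column data frame for further processing
reviews <- data.frame(reviewText = raw_lines, stringsAsFactors = FALSE)
head(reviews, 10)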


View of the first 10 rows of the loaded data


Listing for Separating data into 2 columns “Rating” & “RatingText”
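
A hedged sketch of the separation step, assuming the rating is the field before the first comma on each line and using tidyr's separate() (the column names come from the text, everything else is an assumption):

library(tidyr)

# Split each line into the rating (before the first comma) and the review text
reviews <- separate(reviews, col = reviewText,
                    into = c("Rating", "RatingText"),
                    sep = ",", extra = "merge")
head(reviews, 10)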


View of first 10 rows of data separated

Now we need to make sure we retain only alphanumeric characters in the data, removing anything else. Below is the listing to do that for both of the separated columns.


Retaining only alphanumeric entries in the data
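
The listing itself is an image; one plausible way to do this is with gsub(), keeping letters, digits, and spaces only:

# Keep only alphanumeric characters and spaces in both columns
reviews$Rating     <- gsub("[^[:alnum:] ]", "", reviews$Rating)
reviews$RatingText <- gsub("[^[:alnum:] ]", "", reviews$RatingText)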

Lastly, we need to remove unwanted spacing in the text. Below is the listing for removing the unusual spacing and maintaining a single spacing throughout


Removing unusual spacing
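
A small sketch of the spacing clean-up, again assuming gsub() is used to collapse repeated whitespace:

# Collapse runs of whitespace into a single space and trim leading/trailing spaces
reviews$RatingText <- gsub("\\s+", " ", reviews$RatingText)
reviews$RatingText <- trimws(reviews$RatingText)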


View of removed spacing data

We now simply put our data into an object named "text" and check the data dimensions to be sure that the complete data was written to the object.


Console view of defining an object “text” and storing our data
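
This step likely amounts to something as simple as the following (the object name text comes from the article; the rest is assumed):

# Store the cleaned data in an object named "text" and check its dimensions
text <- reviews
dim(text)   # expected: 1001 rows and 2 columns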

The dimensions printed above confirm that we have the entire data in the object text.

Creating Corpus

Our next step is creating the corpus, which can easily be done with the function VCorpus (Volatile Corpus) from the library tm. To access this function, one can download the package tm with install.packages("tm"), followed by loading the library into the R environment using library(tm). Once loaded, we can use the VCorpus() function, which will convert our review text into a corpus.


Listing for creating the corpus using the VCorpus() function
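
An illustrative version of this step using the tm API (the corpus object name is an assumption):

library(tm)

# Build a volatile, in-memory corpus from the review text column
train_corpus <- VCorpus(VectorSource(text$RatingText))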

From the volatile corpus, we create a Document Term Matrix (DTM). A DTM is a sparse matrix that is created using the tm library’s DocumentTermMatrix function. The rows of the matrix indicate documents and the columns indicate features, that is, words. The matrix is sparse because all unique unigram sets of the dataset become columns in DTM and, as each review comment does not have all elements of the unigram set, most cells will have a 0, indicating the absence of the unigram.

While it is possible to extract n-grams (unigrams, bigrams, trigrams, and so on) as part of the BoW approach, the tokenize parameter can be set and passed as part of the control list in the Document Term Matrix function to accomplish n-grams in DTM. It must be noted that using n-grams as part of the DTM creates a very high number of columns in the DTM. This is one of the demerits of the BoW approach, and, in some cases, it could stall the execution of the project due to limited memory. As our specific case is also limited by hardware infrastructure, we restrict ourselves by including only the unigrams in DTM in this project. Apart from just generating unigrams, we also perform some additional processing on the reviews text document by passing parameters to the control list in the tm library’s DocumentTermMatrix function. The processing we do on the review text documents during the creation of the DTM are:

  1. All the text will be in lower case
  2. Removing numbers, if any
  3. Removing stop words such as a, an, in, and the, as stop words do not add any value when analyzing sentiment
  4. Removing punctuation, as again it does not add any meaning when analyzing sentiment
  5. Performing stemming, that is, converting a word into its root form, for example removing the s from plurals

Below is the listing for creating DTM from our Volatile corpus and doing the above mentioned pre-processing


Listing for creating DTM and pre-processing
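
A hedged sketch of the DTM creation, with the pre-processing options listed above passed through the control list (stemming additionally requires the SnowballC package):

# Create the document-term matrix with the pre-processing described above
train_dtm <- DocumentTermMatrix(train_corpus, control = list(
  tolower           = TRUE,   # 1. lower-case all text
  removeNumbers     = TRUE,   # 2. drop numbers
  stopwords         = TRUE,   # 3. drop common stop words
  removePunctuation = TRUE,   # 4. drop punctuation
  stemming          = TRUE    # 5. stem words to their root form
))
inspect(train_dtm)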


Inspect View of our train_dtm

Above we can see how our DTM is created, where rows indicate documents and columns indicate features, as explained before. We see the output has 1001 documents with 5806 columns representing unique unigrams from the reviews. We also see our DTM is 99% sparse and has only 34,557 non-zero cells. The non-zero cells represent the frequency of occurrence of the word in the column within the document represented by the row of the DTM. The DTM tends to get very big, even for normal-sized datasets. Removing sparse terms, that is, terms occurring only in very few documents, is a technique that can be tried to reduce the size of the matrix without losing significant relations inherent to the matrix. Let's remove sparse columns from the matrix. We will attempt to remove those terms that have at least 99% sparse elements with the following listings:


Removing 99% of sparse elements
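
The sparse-term removal is a one-liner in tm; a sketch:

# Drop terms that are sparse in at least 99% of documents
train_dtm <- removeSparseTerms(train_dtm, sparse = 0.99)
inspect(train_dtm)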


Inspection after removing Sparse

We now see the sparsity being reduced to 97% and the number of unigrams reduced to 688. We can now implement our algorithm but first let us split our data into training and testing sets.


Splitting data into Training and Testing sets for implementing algorithm
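
The article does not state the split ratio or the random seed, so both are assumptions in this sketch of the train/test split:

set.seed(123)                                             # assumed seed for reproducibility
train_index <- sample(nrow(train_dtm),
                      size = floor(0.7 * nrow(train_dtm)))  # assumed 70/30 split

dtm_train <- train_dtm[train_index, ]
dtm_test  <- train_dtm[-train_index, ]

labels_train <- factor(text$Rating[train_index])
labels_test  <- factor(text$Rating[-train_index])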

Once we have the data split we can now build our model on the Training set and later we can test our model for prediction on the testing set.

Implementing Naive Bayes Model

One point to note is that the Naive Bayes model works on nominal data. Our data currently has entries that count the number of times a word is repeated, and the number of times a word gets repeated does not affect the sentiment, so we can convert our data into nominal data with the following rule: if a count exists, then Y; if 0, then N. Below is the listing to do this:


Listing for converting data to Nominal Data
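
A sketch of the counts-to-nominal conversion, following the Y/N rule described above (the helper name convert_counts is just for illustration):

# Convert raw term counts into a nominal Y/N indicator
convert_counts <- function(x) {
  ifelse(x > 0, "Y", "N")
}

nominal_train <- apply(dtm_train, MARGIN = 2, convert_counts)
nominal_test  <- apply(dtm_test,  MARGIN = 2, convert_counts)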


Snippet view of converted nominal data

Now that we have nominal data, we can implement a Naive Bayes model. The Naive Bayes algorithm can be implemented with the function naiveBayes() in the R library e1071. If you do not have this library, you can use the following code to install the package and then load it into the R environment:


Listing to Install and use e1071 package
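
Installing and loading e1071 is just:

install.packages("e1071")   # only needed once
library(e1071)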

Below is the listing to implement Naive Bayes model


Listing to implement Naive Bayes model
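A minimal sketch, assuming the nominal data frames and label vectors created above; the Laplace smoothing value of 1 is an illustrative choice, not taken from the original listing.

# Train the Naive Bayes classifier on the nominal training data
nb_model <- naiveBayes(train_nominal, factor(labels_train), laplace = 1)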

Finally, with the model in hand, we make predictions for the test data set and check the accuracy of the predictions made by the model. Below is the listing for computing the predictions and checking accuracy.


Listing for computing predictions and checking accuracy
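A minimal sketch of the prediction and accuracy check, using a simple confusion table rather than any particular evaluation package:

# Predict sentiment for the held-out test documents
nb_pred <- predict(nb_model, test_nominal)

# Confusion matrix and overall accuracy
table(Predicted = nb_pred, Actual = labels_test)
mean(nb_pred == labels_test)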


Accuracy obtained from the model

With this quick implementation of BoW alongside a Naive Bayes model, we obtained an accuracy of 79.5%. The accuracy can be increased further by focusing on additions such as new feature creation and so on.

BoW NLP – Takeaways

Bag of Words (BoW) Natural Language Processing (NLP) with a Naive Bayes model is a very simple algorithm to implement when it comes to having machines examine natural language. Without much effort, the model gives us a prediction accuracy of 79.5%, which is really good for such a simple model implementation. The main drawback of BoW is that it does not capture the semantics of words; for example, "buy a used bike" and "purchase old motorcycle" will be treated by BoW as two very different sentences. To overcome this, we need to take another method, known as Word Embedding, into consideration.

Read more here



Presentation: PID Loops and the Art of Keeping Systems Stable

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Transcript

MacCarthaigh: My name is Colm, I'm an engineer at Amazon Web Services and I'm here to talk about PID loops and about how we use them to keep systems stable. I have a subtitle for this talk, which I went back and forth on whether I should give it or not, which is that PID loops come from control theory. This is my third or fourth different talk about control theory in the last year. Back in January, I did a workshop where we looked at control theory with a bunch of practitioners and academics who do formal verification of systems.

The conclusion I’ve reached in that year is that control theory is just an unbelievably rich vein of knowledge and expertise and insight into how to control systems, how to keep systems stable, which are very much part of what I do. I work at Amazon Web Services where we have all these cloud services, big distributed systems and keeping them live and running is obviously a very important task, and really digging into control theory, reading books about it, getting into the literature, talking to experts, it’s like finding literally a hundred years of written-down expertise that just all applied to my field that we seem not to know about.

The title for this whole track that we’re doing here is about CS in the modern world. I think it’s a travesty that we don’t teach control theory as part of computer science generally. It’s commonly part of engineering and physics and that’s where I first learned it. I did physics in college, and it amazes me how much relevancy there is in there. Like it says here, the fruit is so low hanging; there are so many things to learn that it’s touching the ground. We could just take these lessons away. In fact, you can spend just a week or two reading about control theory, and you will be in the 99th percentile of all people in computer science who know anything about control theory, and you’ll be an instant expert and it gives you superpowers, some of which I’m going to talk about, to really get insights into how systems can be built better. It’s fascinating and nothing excites me more right now.

I’m going to start with something that’s a little meta. Back in late last year, AWS has a big conference called re:Invent. I gave a talk which was one of the series of talks there, too. We actually gave some let’s-open-the-curtains talks about how we build things internally, how we design systems, how we analyze them. My colleague Peter Vosshall gave one about how we think about blast radius and I gave one about just how we control systems, and just a small bit in the middle I was talking about this stuff, control theory, because I didn’t have time to go into too much detail.

Cindy [Sridharan] reached out after and contacted me and said, "That material looks pretty interesting. Let's get some more on that," and here I am at QCon. I did something, I gave a talk, somebody observed it, gave me feedback, and now I'm reacting and giving that talk. Congratulations, that is 99% of control theory. It applies to almost everything in life, not just how we control stable systems and analyze them and make sure that they're safe, but business processes too. I've seen it crop up in really strange places.

I was talking to a sales team and they were telling me the first thing they do when they go into a new field area and they have to get leads and sell products and so on, is they just start measuring. They just start measuring their success rate so that they can identify their wins and build on them and identify their losses and correct whatever they need to correct. It just fascinated me; they're basically taking a control theory type of process to even a very high-level task, like how to sell things. It doesn't just apply to engineering things.

Control Theory

If you’re not familiar with control theory, if you’ve never heard about it, didn’t even know it existed, it’s about a hundred-year-old field. It evolved independently in several different branches of science. There are a few competing arguments as to where the first discoverers, where the chemists claim it was first discovered when analyzing chemical reactions, the physicists claim it was first discovered when studying thermodynamics and engineers have been using it for mechanical control systems and so on. Eventually, all these different fields realized, “Hey, we have these same equations and same approaches that turn out are very general.” They’re about controlling things, getting intent. We want to make the world a certain way and getting the world into that state. It turns out there’s a whole branch of science around it.

I’m going to give you real examples, places where we’re using this just to give you a flavor of what control theory does. My real goal is I’m going to try and give you jumping-off points to places where you can take things and then take them on your own and maybe dig in deeper if this excites you and interests you. There’s lots of Prior Art. At QCon in San Francisco just a few months ago, was a great talk on control theory where Valerie goes into a really good formal approach into how some container scaling systems can be modeled and approached in a control theory framework, and it’s talk well worth watching. There’s more math in that talk than there will be in my talk. If you’re really excited by calculus, look up that talk. It’s a really good one.

There are books on this subject, thankfully; I've read both these books. They're pretty good. The feedback control book is very directly applicable to our field, if you're controlling computers and distributed systems and so on. The second book, Designing Distributed Control Systems, was not really written with computer systems in mind. When they talk about their distributed systems, they're actually talking about big real-world machines where not everything's connected and talking to one another. As you see, there's a logging machine on the front cover. Obviously, it's got nothing to do with, say, running S3 or the systems I work on. It turns out it has a lot of patterns and a lot of lessons in it that directly apply to our field. It's amazing how much I keep coming back to it.

I see control theory crop up directly in my job, because I go to a lot of design reviews. I’m a principal engineer at AWS and one of my jobs is a team wants to build something or is in the process of building something, they’ll have some design reviews, and they’ll invite some people, and often I’m there, and we’re looking at the system and whether it’s going to work and what we can do with it, and I do the same with customers. There are sets of customers I meet pretty frequently where we were talking about how they’re going to move to the cloud, what are they building in the process, what did their systems look like?

I see a lot of places where control theory is directly applicable but rarely applied. Auto-scaling and placement are really obvious examples, we're going to walk through some, but another is fairness algorithms. A really common fairness algorithm is how TCP achieves fairness. You've got all these network users and you want to give them all a fair slice. It turns out that a PID loop is what's happening. In system stability, how do we absorb errors, recover from those errors? I'll give you some examples.

The Furnace

No talk focused on control theory would be complete without a really classic example, which is a furnace. If you were learning control theory at the university, this would be the very first example anyone would pick out and I’m not going to do anything any different, but it’s a good example because it’s so easy to follow. Imagine we’ve got a tank of water and we’ve got a heat source, and we want to get the tank to a known temperature. We can vary the heat source. We can turn it up, we can turn it down, we can make it hotter, we can make it cooler. How you would go about doing this is obvious, you would measure the temperature of the water and then if the water is too cold, you turn the heat up, and if the water is too hot, you turn the heat off or you turn the heat down if it’s approaching our temperature, really simple stuff.

It turns out that control theory has a lot to say about how this can be done stably, there are a lot of very naive approaches. If you just have a really big fire and you just keep it under the furnace for a long time, until exactly you measure the right temperature and you just remove the heat, it’ll probably start cooling down really quickly. Or you can overheat in some cases where there’s lag, because the place you’re measuring in the tank of water is maybe not being convected too efficiently, and the rest of the water isn’t at that temperature.

Control theory is the science and analysis of figuring out how to get this right, how to make it stable. It has a lot to say; the biggest thing it has to say is that you should approach the system in terms of measuring its error. I've got a desired temperature, let's say I want to get this water to 100 degrees Celsius – I'm European, so someone will have to translate – which is boiling temperature, and it starts off at room temperature, which is about 20 degrees Celsius. In that case the error is 80 degrees Celsius. We focus on the error: what is the distance from the desired state that we want it to get to.

A simple kind of controller is a P controller. That's a proportional controller, where all we do is take some action proportionate to the error. The error is 80, so I keep the heat high, then it goes to 70, 60, 50, and I gradually turn down the heat as it goes. That's a proportional controller, a really simple control system. You would think that would just get to the line and do really just fine. In reality, a proportional controller will tend to oscillate because there are natural lags in the system and there's no perfect way to measure things, as we all know, so it'll just hover around the line and it won't be a perfectly stable system.

To improve on that, we don't just act in proportion to the error, we use an integral of the error, too. That's just fancy math for saying we take the whole area under the curve of that error. We take some action that's proportionate to how much error we've had over a period of time, and when you add that you've got a PI controller. You've got a proportional component and you've got an integral component. That'll still tend to oscillate, but far less. It'll tend to asymptote toward the line you're trying to hit, trying to get things to a target temperature.

Many real-world systems are PI controllers, potentially the thermostat and HVAC system in this room or the cruise control in your car and so on. Control theory in its really refined sense actually says, well, even a PI controller isn't perfectly stable, and the reason is that if this system were to suffer a big shock, if a lot of water were suddenly taken out of the tank, it won't react well; there'll be just too much noise in the measuring signal, and it'll overdrive or underdrive the system.

To correct for that, you need the derivative component. That's where PID comes from: a proportional, an integral, and a derivative component. Control theory actually says you can't have a completely stable system without all three. Even in the real world, where there are a lot of PI controllers, there are theoretical shocks they may not be able to absorb. If you can build a mental model of analyzing systems through this framework – we're going to go through a few – it can give you really simple insights, and you can very quickly determine that a system may not be stable or safe or needs correction in some way.
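To make the difference between P-only and full PID control concrete, here is a minimal simulation sketch in R. It is not from the talk: the gains, the time step, and the crude heat-loss model of the tank are all illustrative assumptions.

# Heat a tank of water towards a 100 C setpoint using a PID controller.
# error -> P term, accumulated error -> I term, rate of change of error -> D term.
simulate_pid <- function(kp, ki, kd, setpoint = 100, start_temp = 20,
                         steps = 300, dt = 1) {
  temp <- start_temp
  integral <- 0
  prev_error <- setpoint - temp
  history <- numeric(steps)
  for (t in seq_len(steps)) {
    error <- setpoint - temp
    integral <- integral + error * dt
    derivative <- (error - prev_error) / dt
    heat <- kp * error + ki * integral + kd * derivative  # controller output
    heat <- max(0, heat)                                  # cannot apply negative heat
    # crude plant model: applied heat warms the tank, ambient losses cool it
    temp <- temp + 0.01 * heat * dt - 0.02 * (temp - 20) * dt
    prev_error <- error
    history[t] <- temp
  }
  history
}

p_only <- simulate_pid(kp = 5, ki = 0, kd = 0)      # settles short of the setpoint
pid    <- simulate_pid(kp = 5, ki = 0.05, kd = 1)   # integral term closes the gap

Plotting the two histories shows the proportional-only run hovering below 100 degrees while the run with integral and derivative terms closes in on the target.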

Autoscaling

This is the example of a furnace; it's pretty much the exact same graph and the exact same response for an auto-scaling system. Autoscaling is a fancy furnace, there's not much else going on. In the case of an autoscaling system, we're measuring some target metric, say CPU utilization on a fleet. Let's say I've got 10 EC2 Instances and my goal is that none of them should be at more than 20% CPU or something like that, which would be a typical value. It seems low, but people pick those values because they want very responsive latencies, and they don't want queues building up, or garbage collection and all those kinds of things.

We measure its state and we see what the CPU is. If it's way below 20, then the system can say, "You've got too many hosts, I'm going to scale you down," just like turning the heat off. Then the CPU starts to rise a bit because the fleet gets a little bit more contended now that there are fewer boxes, and you can see what's going to happen. Eventually, it'll get small enough that the CPU will go above 20% on whatever the remaining number of hosts are, and we'll start scaling out again, or vertically or horizontally, whatever way the system has been tuned to scale. The autoscaling system that we've built – we have a direct EC2 autoscaling system, you can use it to provision instances – this is exactly how it works. We also have autoscaling systems that are built into our elastic load balancer, that are built into how we provision storage, or if you're using a service like DynamoDB, how many storage slices to use to give you a certain amount of IOPS, and so on. This is the exact kind of control system that's going on behind the scenes.

They get fancier over time; a simple PI controller will do a pretty good job, but it's maybe not as optimal or as efficient as something with more knowledge can be. As an example, something we launched pretty recently as part of our auto-scaling service is support for machine learning-based forecasting, which is incredible. What we can do is actually look at your metric and say, "This is the metric that you want to hit over time. You want to be at this 20%," and so on. We can analyze that metric using techniques like Fourier analysis that break it down into constituent frequencies. A lot of the metrics that we look at, a lot of people's load patterns, are daily load patterns. They also have a weekly load pattern. Every day is maybe busy in the evenings, if it's, say, an entertainment service, people watching videos or something like that. Then every week, maybe Friday night is your busiest night, because that's your busiest day of the week. Then you might have busier days or quieter days across a year because of holidays and sporting events.

That kind of analysis can find these patterns and then they can be forecast. We can replay them into the future and see what's going to happen. We can apply machine learning techniques to it. We have inference models that can figure out what they think is going to happen next and prescale before it happens. This looks like a machine learning feature, but to me, looking at this system through the lens of a PID controller, this is just a fancy integral. This is the "I" component of a PID controller. What I'm trying to do is look at my total history and project a little into the future. Integration is a form of forecasting, so it still fits in my PID control model and can still be analyzed in that way, and I've tested it out, I've done a few things. Along those lines, you can play with different values and see how the system responds as you would expect a PID controller to work, and it does.

Placement and Fairness

That’s a really simple straightforward application of a PID controller. A less simple one and less common one is, we can auto-scale when we’ve got elastic capacity. We can take users wherever they are in the world and send them to their closest best region that I’m operating in. We can scale elastically so that everybody can be serviced by their absolutely best region. That’s great when I’ve got elastic capacity, but when I’ve got fixed capacity, we also have a CDN, Amazon CloudFront, and that’s got fixed relatively static capacity. When we build a cloud for insight, it goes in there with a certain number of machines, we put some racks in a room, and it’s got those machines and actually the count might go slightly down over time because there’s a failure rate and not everything gets replaced instantaneously.

They have the capacity they have. Unlike launching an EC2 Instance, I can't just go in tomorrow and add another rack. It takes a bit of planning and there's a whole procedure for that. You might think, for figuring out which site can take which users, how could a PID controller help there? One easy way to build a system here would be: figure out what the capacity of each site is, figure out your peak time of every single day, and then make sure that at that peak time, no site would be overwhelmed. Only allocate each site enough close users so that you're not going to overwhelm each site. That's very inefficient, because at night when things are a bit less busy, there are lots of users who might be slightly better serviced by a site, but they're going to maybe their second or their third-best site, which is just not optimal, and actually, because of the way time zones work and people having different periods of activity across the world, it ends up leading to suboptimal packing.

Something we do in CloudFront is we run a control system. We're constantly measuring the utilization of each site and, depending on that utilization, we figure out what our error is, how far we are from optimized. We change the mass or radius of effect of each site, so that at our really busy time of day, really close to peak, it's servicing everybody in that city, everybody directly around it, drawing those in, but at our quieter time of day it can extend a little further and go out. It's a big system of dynamic springs all interconnected, all with PID loops. It's amazing how optimal a system like that can be, and how applying a system like that has increased our effectiveness as a CDN provider. We now stream quite a lot of the video that people are watching at home. A pretty high percentage of it is hosted on CloudFront these days and I think a lot of that's due in part directly to that control system.

X-Ray Vision: Open Loops

These are straightforward examples of where we've applied PID loops; I'm going to go through some more. I wanted to give us more of a mental model and some jumping-off points and ways of utilizing PID theory more directly in our daily lives and our daily jobs. When I gave this talk back in November, I said that if you go to the effort of really understanding control theory, it can be like a superpower. It can really help you deep dive into these systems, and the superpower I think is X-Ray vision, because you can really nail some things. I'm going to go through five patterns: what I see as common anti-patterns, or common lessons from control theory, that I see us miss in real-world designs. The first is Open Loops.

It’s very natural when you’re building a control system, something needs to do something. I’ve got to get a config onto 10 boxes, I’ve got to launch 10 instances. It’s natural to have a script or write a system that just does those things in that order. We’ve probably all seen scripts like this, at small scale they work fine. If you’re doing things manually, this is how we do manual actions as well, but systems like this are an Open Loop. They’re just doing actions, and nothing is really checking if those actions occur. We’ve gotten really good at building reliable infrastructure, and in some ways too good. It works so reliably, you just do it time after time, you never really notice. Well, what if it failed one day? Something’s going to go wrong. That can creep into systems in very dangerous ways.

A surprising number of control systems are just like this, they're just Open Loops. I can't count the number of customers I've gone through control systems with who told me, "We have this system that pushes out some state, some configuration, and sometimes it doesn't do it." They don't really know why, but they have built this other button that they press that basically starts everything all over, and it gets there the next time, and in some cases they even have their support personnel at the end of a phone line. That's what they do: they get a complaint from one of their customers saying, "I took an action, I set a setting and it didn't happen," and they have this magic button, they press it and it syncs all the config out again and it's fixed.

I find that scary, because what it's saying is nothing's actually monitoring the system. Nothing's really checking that everything is as it should be. Already every day they're getting this creep from what the state should be, and if they ever had a really big problem, like a big shock in the system, it clearly wouldn't be able to self-repair, which is not what you want. Another common reason for Open Loops is when actions are just infrequent. If it's an action that you're not taking really often, odds are it's just an Open Loop. It's relying on people to fix things, not the system itself.

There are two complementary tools in the chest that we all have these days that really help combat Open Loops. The first is Chaos Engineering. If you actually deliberately go break things a lot, that tends to find a lot of Open Loops and make it obvious that they have to be fixed. It drives me crazy a little, because I find that you can probably just think through a lot of the things Chaos Engineering will find, and that can be quicker. The other is observability: what this problem space demands is that we've got to be monitoring things. We've really got to be keeping an eye on it.

We have two observability systems at AWS, CloudWatch and X-Ray. One of the things I didn't appreciate until I joined AWS – it was a bit like going into Charlie and the chocolate factory and seeing the insides. I expected to see all sorts of cool algorithms and all sorts of fancy techniques and things that I just never imagined. There was a little bit of that once I got inside working, but mostly what I found was really mundane: people were just doing a lot of things at scale that I didn't realize. One of those things was just the sheer volume of monitoring. The number of metrics we keep on every single host, every single system, I still find staggering.

We’ve made collecting those metrics extremely cheap and then we collect them as much as possible and that helps us close all these loops. That helps us build systems where we can detect drift and we can make sure they’re heading for stable states and not heading for unstable states. We try to avoid infrequent actions. If we have a process that’s happening once a year, it’s almost certainly doomed to failure. If it’s happening once a day, it’s probably ok.

My classic favorite example of this as an Open Loop process, is certificate rotation. I happened to work on TLS a lot, it’s something I spent a lot of my time on. Not a week goes by without some major website having a certificate outage. Often it’s because they’ve got a three-year or a one-year certificate and they put it on their boxes. They’re, “I don’t have to think about it for another year or three”, and then the person leaves and there’s nothing really monitoring it, nobody’s getting an email, and then that day comes and the certificate isn’t valid anymore and we’re in trouble.

A really common Open Loop is credential management in general, of which this is just one example. Everything else we've built, we have this problem too; we have to keep a lot of certificates in sync, so we built our own certificate management system and we actually made it public. We use this ourselves; it's tied into ELB and CloudFront and a bunch of things. AWS Certificate Manager is monitoring certificates as well, looking for any drift, making sure that the load balancer that should have things really has them, that a CloudFront distribution that should have it really has it.

It’s even possible to monitor and alert and so on and uncertified expiry times if you’ve loaded them in manually. The magic to fixing these Open Loops is to really think about measuring first, like I said with that earliest example about just taking feedback and integrating that, but approaching systems design as, “I’m not going to write a script or a control system that just does X, then Y, then Z.” Instead, I’m going to approach it as, “I’m going to describe my desired state, so it’s a bit more declarative, and then I’m going to write a system that drives everything to that desired state.” It was a very different shape of systems. You’ll just write your code very differently when you’ve got that mental model.

In my experience, that model is far better, because it is a closed loop from day one; far better because it tends to be a more succinct way to describe these systems; and far better because it can also be dual-purpose. Often your provisioning system can be the same as your control system. For example, you show up on day one and you've got a new region to build, in our case, or an availability zone to build. You can just run your control plane and it starts off, and the error is "I don't have any of my hosts, so provision them," because that's what it's meant to do. That's what it's built to do, and it ends up having a dual role, which is cool.

If you’re into form of verification like I am, it is vastly easier when you get to the stage where you can formally verify something to verify declarative systems, systems that have their state described like that. It’s just easier, the tools are more structured that way. We’ve been doing a bunch of this, we’ve got papers out where we’ve shown how we use TLA plus, CryptoWall, and you can search for those if you’re interested. We also use tools like F-star and Coke, which are pretty awesome and useful for verifying these kinds of systems. That’s the first X-Ray vision, superpower, look for open loops.

X-Ray Vision: Power Laws

The second is about Power Laws, about looking for power laws in systems and how to fight them. The reason for this is to do with another mental model that I have, which is around how errors propagate. Imagine your distributed system, in my case the cloud, as this one big uniform collection of boxes. These boxes might represent hosts or applications or processes or however granular you want. These could be every tiny little bit of memory on every device in your entire system. Someday an error happens somewhere that you didn't really plan for – an exception is thrown, a box chokes up, a piece of hardware fails. Well, in distributed systems, that box has dependencies, and they tend to fail then too, especially if they're synchronous systems.

Now these other things that we're talking to are failing in unpredictable ways and that spreads. Like any network effect, when things are interconnected, they tend to exhibit power laws in how they spread, because things just exponentially increase at each layer. That's what we're fighting in system stability; that's the problem, and it's a really tough one. Our primary mechanism for fighting this at Amazon Web Services is that we compartmentalize. We just don't let ourselves have really big systems. We instead try to have cellular systems, like our availability zones and regions, that are really strongly isolated, so errors we can't even think about just won't spread.

That’s been working pretty well for us and also as a lesson from control theory. A lot of the literature in control theory talks about if you’ve got a critical system, dividing it up into compartments and controlling them independently at the cost of maybe some optimality, tends to be worth that. A good example there is a nuclear power plant. Typically, each reactor is controlled independently, because the risk of having one control system for all of them just isn’t worth it. That helps, but that still means you could have an error that propagates to infect the whole compartment, but we want to be able to fight that. We need our own power laws that can fight back, things that can push in the other direction. These are like those integral and derivative components. These are things that can drive the system into the state we want.

Exponential back-off is a really strong example. Exponential back-off is basically an integral: an error happens and we retry a second later; if that fails, then we wait an exponentially increasing amount of time – three seconds, then 10 seconds, then 100 seconds. It's clearly exponential and it's clearly acting based on the total history, and that really helps. In fact, the only way to drive an overloaded system back to stability is with some exponential back-off. Very powerful.

Rate limiters are like derivatives: they're just rate estimators on what's going on, deciding what to let in and what to let out. We've built both of these into the AWS SDKs. If you're using the S3 client or DynamoDB client, these are built in and we've really finely tuned these things. We've tuned them, got the appropriate amounts of jitter, the appropriate amounts of back-off. Then we also rate limit the retries themselves so that we don't get crazy distributed retry storms that can take down entire systems. We've really focused on making the back pressure as efficient and effective as we think it needs to be.
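As a rough illustration of the idea (not the actual AWS SDK implementation), here is a minimal sketch of a retry helper with capped exponential back-off and full jitter; the base delay, cap, and attempt limit are illustrative assumptions:

# Retry an operation, waiting a random ("full jitter") amount of time between
# 0 and min(cap, base * 2^attempt) before each retry.
retry_with_backoff <- function(operation, max_attempts = 5,
                               base_seconds = 0.1, cap_seconds = 20) {
  for (attempt in seq_len(max_attempts)) {
    result <- try(operation(), silent = TRUE)
    if (!inherits(result, "try-error")) return(result)
    if (attempt == max_attempts) stop("all retries exhausted")
    ceiling_s <- min(cap_seconds, base_seconds * 2^(attempt - 1))
    Sys.sleep(runif(1, min = 0, max = ceiling_s))
  }
}

# Example usage with a hypothetical flaky call:
# retry_with_backoff(function() read_from_remote_service())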

This is not an easy problem, there’s a whole science and control theory called loop tuning, about getting all these little parameters perfectly optimal. I think there are about four or five years of history now gone into AWS SDKs, and how we’re trying to get this right and we’re constantly trying to improve it. It’s pretty cool and worth copying.

We’ve got other back pressure strategies too, we’ve got systems where servers can tell clients, “Back off, please, I’m a little busy right now,” all those things working together. If I look at system design and it doesn’t have any of this, if it doesn’t have exponential back-off, if it doesn’t have rate-limiters in some place, if it’s not able to fight some power-law that I think might arise due to errors propagating, that tells me I need to be a bit more worried and start digging deeper. There’s also a great paper – or a blog post, I should say – by Marc Brooker, my colleague, which you can search for, where he goes into some really fine-grain detail about how the stuff we put into the SDKs actually works. That’s the second of five patterns.

X-Ray Vision: Liveliness and Lag

The third is about Liveliness and Lag. Pretty much any control system is doomed to failure if it’s operating on old information. Old information can be even worse than no information. If we were controlling the temperature in this room based on the temperature 30 minutes ago, that’s just not going to work because there weren’t so many people in the room at that time and the information’s just false and it’s going to heat it up too much, it’s no good. This can crop up a lot, and the reason this can crop up a lot in distributed systems is we often use workflows to do things.

Workflows can start just taking variable amounts of time. As the workflow grows and we put more work in it, it can just start taking longer to do things, and also reporting back metrics. Getting information back can also become laggy, especially when you’ve got a really busy day or a really chaotic event, lots of stuff going on, and that can result in ephemeral shocks to the system. Huge spikes in load or a sudden decrease in capacity can become unrecoverable, because the system just becomes overwhelmed and then the lag starts driving the system, and it can never really get back into the state it wants to be.

The underlying reason for a lot of this is because we use FIFOs for most things, I’ll get to that in a second. The bulletproof fix for these systems is impractical and very expensive, but it’s to do everything everywhere in constant time. A simple example of that is, let’s say I’ve got a configuration file and it’s got some user parameters in it that they can set. One way to build that system is, user sets parameter, we’ll just push that little diff or delta out to the system. Works great until lots of users changed their settings at the same time, if you’ve got some correlated condition. Now the system gets super laggy. A different way to build it would be we’ll just push all the state every time, especially if it’s not too big, that can be practical, but it gets expensive for really big systems. The same for measuring things; measure everything all the time. Well, we’re a little bit better about that, but that tends to be how monitoring systems work.

We have some systems at AWS – our most critical systems – where we've built this pattern in. Our systems for doing DNS health checks and networking health checks are so critical to availability. They have to work, even during the most critical events, even when there's total chaos. They absolutely have to work, so we've built those as completely constant-time systems. When you set up a Route 53 health check that's pinging your website to see if it's healthy or not, deciding whether your traffic should go to this web server in this zone or this web server in this other zone, that check is happening at the same rate all the time and the information about it is being relayed back all the time. It's relayed as healthy or unhealthy, not as "it went from healthy to unhealthy, so only send that delta back." That makes it really robust and reliable.

If you do need a queue or a workflow, think carefully about how deep it should be allowed to grow. In general, we want to keep them really short; if they get too long, it's best to return errors. That's a lesson I've heard restated from so many different places that I think it's an extremely deep one. For information channels, if you're relaying back metrics and so on, LIFO is commonly overlooked, and it's a great strategy. A LIFO queue does exactly what we want here: it will always prioritize liveliness, it will always give you the most recent information, and it'll backfill any previous information when it has some capacity. It's the best of both worlds. It's rarely seen; you rarely see LIFOs in information systems. Ours are built that way, but I don't know why it's not more common.
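A minimal sketch of the idea, using a plain R list as a bounded last-in-first-out buffer for metric samples; the size limit and function names are illustrative, not anything described in the talk:

# Newest sample goes to the front, so readers always see the freshest data first;
# older samples are kept (up to max_size) and can be backfilled later.
lifo_push <- function(buffer, sample, max_size = 1000) {
  buffer <- c(list(sample), buffer)
  buffer[seq_len(min(length(buffer), max_size))]
}

lifo_pop <- function(buffer) {
  if (length(buffer) == 0) return(list(sample = NULL, rest = buffer))
  list(sample = buffer[[1]], rest = buffer[-1])
}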

X-Ray Vision: False Functions

My fourth pattern, and a short and simple one, is to look for False Functions. The thing you’re measuring, you want it to be like a real function. You want it to be something that moves in a predictable way, that is something you’re actually trying to control. It’s common for there to be many dimensions of utilization in a distributed system. As a simple example, let’s say we’ve got a web server and it’s taking requests. As load goes up, CPU goes up at a certain rate because of SSL handshakes and whatnot. Memory consumption goes up at a certain rate because of overloads, and maybe I’ve got a caching system and my cache utilization goes up at a different rate, all different rates. They all have their own functions.

What can happen is that we don't always perceive that these are different things, and we instead measure some synthetic variable that's a fake function of all three. It's like looking at the max of all three, and that doesn't work; it turns out not to be predictable. You haven't really dimensionalized the system. This is all a complicated process-control-theory way of saying that the Unix load metric is evil and will bite you, because the Unix load metric – or network latency, or queue depth metrics in general; Unix load is a queue depth metric – is the compound of so many other things that are going on that it doesn't really behave in a linear way. You've got to get at the underlying variables.

A lot of systems that are built just on measuring load tend to be chaotic and not really able to control correctly. We've found that it's best to measure some of the underlying things. CPU turns out to be surprisingly reliable, which is surprising to me because I spend a lot of time working on low-level CPU architecture and I know how complicated CPU pipelines are, but it turns out that at the macro level, just measuring CPU percentage can be very effective.

X-Ray Vision: Edge Triggering

My last pattern is about edge triggering. What edge triggering is: you've got a system, we're heating our furnace. Let's say we just keep the heat on and it gets all the way to the target temperature, and then once it gets to the target temperature, we turn the heat off. That is an edge-triggered system; we triggered at the edge, and only at the edge. There's a lot of control theory and a lot of debate about edge-triggered systems versus level-triggered systems. They can even be modeled in terms of one another.

I like to watch out for edge triggering in systems; it tends to be an anti-pattern. One reason is because edge triggering seems to imply a modal behavior. You cross the line, you kick into a new mode, and that mode is probably rarely tested and is now being kicked into at a time of high stress; that's really dangerous. Another is that edge triggering, when you think about it, needs us to solve the deliver-exactly-once problem, which nobody has solved or ever will solve, because what if you missed that message? What if your system is, "I just send an alarm when I cross the line"? What if that gets dropped? Now I've got to retry it.

Your system has to be idempotent, and if you're going to build an idempotent system, you might as well make a level-triggered system in the first place, because generally, the only benefit of building an edge-triggered system is it doesn't have to be idempotent. I like to see edge triggering only for humans – if I'm alerting a human, actually sending them a page or something like that. For control systems, it's usually an anti-pattern. It's better instead to be measuring that level, like we said.

That gets us to the end of all my patterns. The biggest one is to look for whether the system is being measured. Honestly, you'll be surprised how many times you will just notice that the system isn't really being measured or observed, and that alone is enough to really improve the stability a lot. If you learn and look into all these techniques, they're highly leveraged; like I said, the fruit is touching the ground. It's pretty cool.

Questions and Answers

Participant 1: Because we have used machine learning for the first four patterns, why can we not use that in the edge triggering system to lower the edge just before the high-stress point, so that it becomes a pattern again? Why is it an anti-pattern then?

MacCarthaigh: You’re asking, if we just give ourselves some margin of error, if we put the line even a little bit lower, wouldn’t that improve safety? You can absolutely do that, but in high shock or high-stress situations, you’ll never guarantee that there isn’t too much lag between those things, that you just don’t have time to correct. Another is, there’s an entire area of control theory designed around exactly what you’re talking about, which is called hysteresis, which is just almost exactly that property and you can totally do it. It just gets really complicated really quickly, and level triggering tends to be far simpler. Your intuition is right though, it can be done.

Participant 2: You mentioned three variables, P, I and D. Can you give me an example of a system that will not recover from a stress? I missed this part, I don’t understand, what does it mean? The system is, let’s say, auto-scaling scenario. Can you give an example of a stressful scenario where a similar system would not recover?

MacCarthaigh: Yes. If we merely had a P controller, for example, for auto-scaling, and we suddenly lost half the capacity, half the servers died, what would happen is the error would go up so quickly on the remaining hosts that a P controller would way overscale the system, and then load would go down so much, because there were so many more hosts, that it would scale back in again, and it would just oscillate like that for quite a while. That's what we mean by instability in the context of a control system.

Participant 3: Thanks for the talk, really good insights. We quite often have this situation, which was just asked, when we have a relatively stable load and then a sudden spike which we can’t actually predict. My question will be, what would be the direction to look into, maybe in these control systems, how to properly respond? Usually, the problem is that the response time is too slow. We can’t scale that fast, and we need to know before that and it’s not that predictable.

MacCarthaigh: I don’t have an answer, we have not yet built pre-cognition, which I would love. Our main focus for that AWS has been to build systems that are just inherently at more capacity. For example, one of the differences between application load bouncer, our main layer seven load bouncer, and our network load balancer, is their network load bouncers are scaled to five gigabits per second, minimum millions of connections per second, can do terabytes ultimately. We barely even need to control them, just because the headroom is so high. That’s how we’re trying to fix that, and then meanwhile for other systems, while we just have to react and scale as quickly as possible, and we’ve been trying to get our launch times down as low as possible. We can now scale and launch EC2 Instances in seconds which helps, but there’s no other fix for that. There’s nothing we can do in a controller that can magic away the unpredictable.

Participant 4: You talked about back-off retry, all that stuff being really important. What sort of process frameworks, tools, do you have to have to make it really easy for teams to do that?

MacCarthaigh: At AWS we try to bake those into our requests libraries directly. We have an internal library system called CAL, which we build all of our clients and servers on and we just make it the default, it’s in there. Then for our customers and ourselves, because we use the SDKs as well, we’ve got our SDKs and we bake it all in there and the SDK team that maintains all that, they think a lot about this stuff. We just try to make it the default that comes out of the box, and not have to educate customers or expect anybody to do anything different than the default. That’s been very successful.

Participant 5: In control theory, the most gain that we can get is to move the control points as close to the pole as possible. I think in resource management, that's not the case. It'd be great to have insight into where we actually take the point in the trade-off between stability and performance.

MacCarthaigh: I’m not sure I’ve heard all the questions, sorry.

Participant 5: From the given resources that we can get, for example computation resources, when we pursue more and more stable systems, we tend to lose the maximum capacity to which we can expand.

MacCarthaigh: There is definitely tension between stability and optimality, and in general, the more finely tuned you want to make a system to achieve absolute optimality, the more risk you have of being able to drive it into an unstable state. There are people who do entire PhDs on nothing else than finding that balance for one system. Oil refineries are a good example, where the oil industry will pay people a lot of money just to optimize that, even very slightly. Computer science, in my opinion, and distributed systems, are nowhere near that level of advanced control theory practice yet. We have a long way to go. We're still down at the baby steps of, "We'll at least measure it."

See more presentations with transcripts
