Mobile Monitoring Solutions


Weekly Digest, January 28

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

Monday newsletter published by Data Science Central. Previous editions can be found here. The contribution flagged with a + is our selection for the picture of the week. To subscribe, follow this link.  

Featured Resources and Technical Contributions 

Featured Articles and Forum Questions

Picture of the Week

Source for picture: contribution marked with a + 





Presentation: Dissecting Kubernetes (K8s) – An Intro to Main Components

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Bio

Joshua Sheppard is an enthusiastic software engineering manager with over 15 years of experience working on web applications of different flavors. He cut his teeth on regular expressions in Perl for a meta search engine and had the misfortune of porting those to Java in 2001. Lately he has been building a data science team, writing Python, and exploring the benefits and challenges of container orchestration with Kubernetes.

About the conference

This year, we will kick off the conference with full day workshops (pre-compilers) where attendees will be able to get “hands-on” and exchange experiences with their peers in the industry.



Best dynamically-typed programming languages for data analysis

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

One can seriously argue about which programming language is best for data analysis, but there is one universal metric that can guide the choice: speed of calculation. Therefore, the word "best" in the title means the languages that lead to the most performant applications. If the most performant program can also be written in an easy-to-use, easy-to-learn, dynamically-typed scripting language, that points to our best choice. Luckily, there is an objective way to check this: run a benchmark of the same algorithm implemented in different dynamically typed languages and compare the execution times.

There is an interesting discussion of the execution time of a simple Monte Carlo algorithm that estimates the value of PI, posted in a StackOverflow thread. That page lists implementations of the algorithm in different languages (Java, Python, Groovy, JRuby, Jython, etc.).
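
The algorithm itself is tiny. Purely to illustrate what is being timed (sketched here in TypeScript rather than one of the benchmarked languages), it amounts to:

```typescript
// Estimate PI by sampling random points in the unit square and counting
// how many fall inside the quarter circle of radius 1.
function estimatePi(samples: number): number {
  let inside = 0;
  for (let i = 0; i < samples; i++) {
    const x = Math.random();
    const y = Math.random();
    if (x * x + y * y <= 1) {
      inside++;
    }
  }
  return (4 * inside) / samples;
}

console.log(estimatePi(10_000_000)); // ~3.1415…
```

A tight numeric loop like this is exactly the kind of workload where differences between compiled, JIT-compiled, and interpreted implementations show up most clearly.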

The bottom line of this benchmark is:

(1) Java gives the most performant implementation (I suspect C/C++ would show a similar performance).

(2) The most performant dynamically-typed implementation is Groovy. Execution of the Groovy script on JDK 9 shows the same speed as the Java/JDK 9 version itself. This is a staggering observation: you can use a dynamically-typed language and still have the code execute as fast as a full-blown Java application. However, Groovy code with loose (untyped) variables is about a factor of 4 slower than the Java code.

(3) Python is about 10 times slower than Java and Groovy (with strictly defined types), and about a factor of 3 slower than Groovy with loose types. However, Python code can be as fast as Groovy code when run on PyPy, which is another interesting observation. Python can also be optimized further, but this requires external libraries (NumPy); even in that case, Groovy is far more performant than Python with NumPy.

(4) JRuby running on JDK 9 and standard Python (CPython) show very similar performance.

(5) Jython and BeanShell are the slowest in code execution. If you want to get the most out of Jython or BeanShell, use these languages to call external Java libraries, which gives you the same execution speed as native Java.

This analysis shows that the most performant easy-to-use scripting language is Groovy. You can use Groovy to develop very fast calculations that use simple types. At the same time, Groovy can be used as a glue language for calling sophisticated Java libraries, thus providing a very rich multiplatform computing environment for data analysis.

The PyPy implementation of the Python language comes second. It is fast, but it still does not support all extension modules of standard Python.

Third place goes to Python (CPython) and JRuby. These scripting languages show very similar performance.

K.Jonasmon (M.S. in computer science)



Presentation: Learning to Love Type Systems

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Transcript

Today I want to talk about type systems in JavaScript. I think you’ve heard a lot about it today, and you may have heard about this very surprising nugget that more than 46% of users on npm are using TypeScript. So today we’re going to look at why investigating a type system, whether or not you use TypeScript or Flow, is useful, and how it can help you write better code.

So I’m really excited about this talk. I spent way too many hours down the rabbit hole researching arcane things like type theory, but I’ve distilled it down, and I have some pro tips to share to help you write better programs. It’s a very deep topic, and I’ve personally learned a lot about type systems as well as functional programming, and I hope you will too. Before we begin, there is a small warning. There is quite a bit of code and math in the talk. I’ll try my best to avoid using jargon unless I absolutely have to, then I’ll explain it so you will know what I’m talking about.

So can I get a quick show of hands, how many of you have shipped a bug into production before? Well, I know I have, and it’s okay, right? We’re all human, programming is hard, and it’s really easy to add bugs into software, especially in the Wild West that is JavaScript. And I don’t know about you, but when bugs inevitably happen, I feel pretty terrible about it. And it’s very easy to experience a real crisis of confidence, especially if you can’t figure out what the cause of the bug was. But when you do figure it out, all is good in the world, and your confidence is restored.

Now, I’ve been programming for a while now, and I’ve seen my fair share of JavaScript errors, as I’m sure you have too. Here we have everyone’s favorite, “undefined is not a function.” Recently, I’ve grown to really love working with types on top of JavaScript, and primarily I use TypeScript a lot for my personal projects as well as at work, and I’m in love. Flow is also a really attractive option. It has a number of differences compared to TypeScript, but you can also choose it to statically type your code, but in this talk I will be using TypeScript examples. I also wanted to briefly talk about GraphQL, which is a really interesting way to deliver a type system over the network.

My name is Lauren and I am an engineering manager at Netflix and I ship bugs into production. Confusingly, I have many internet handles. On Twitter I am sugarpirate_, on GitHub I'm poteto, but with an "e". So clearly I'm not very good at the internet.

How Many Ways Can This Program Fail?

So what is a type system other than something that yells at you a lot? There are many different definitions out there, but my explanation is a type system is a way to add constraints to your code. And the type system is really there to help you enforce logic that is consistent within those constraints. But why? Why are constraints useful in code? So here’s my bold claim. Constraints are useful because they limit the number of bugs in your programs. And let me explain with a simple example.

Here we have a function that takes one argument and divides it by two. Can’t get any simpler than that. But how many ways can this function fail? I came up with some possible inputs that I could pass into this function and here we have our usual suspects like null, undefined, numbers, strings, and so on. Now, I’m going to enumerate over all of those inputs, and after running this code, we should see an array of test results. So let’s run that in our console and see what happens. However, unfortunately for us, null and undefined do not have methods on them. So my code doesn’t even run.
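
The exact snippet isn't reproduced in this transcript, but a rough sketch of the situation being described might look like this:

```typescript
// Effectively untyped: `any` opts out of checking, mirroring plain JavaScript.
const half = (x: any) => x / 2;

// Some of the "usual suspects" we could pass in.
const inputs = [null, undefined, 1, '5', 'hello', {}, [], true];

// Division silently produces nonsense for many of these
// (null -> 0, undefined -> NaN, {} -> NaN, true -> 0.5),
// and any variant that calls a method on x, e.g. x.toFixed(1),
// throws a TypeError at runtime for null and undefined.
const results = inputs.map((x) => half(x));
console.log(results); // [0, NaN, 0.5, 2.5, NaN, NaN, 0, 0.5]
```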

But wouldn’t it be nice if the compiler could tell you that? I don’t want to have to run this code in my browser to find these bugs manually, or have to write a million unit tests, because I’m pretty lazy. And this is really only a simple function. It’s just division by two. Now, in JavaScript, your first instinct might be to do something like add checks for null or undefined, or if you want to get even extreme, you can define undefined as a function, goodbye to that error. But you end up having very defensive code, and you’re checking these argument types in your functions, which are only checked at run time, so you’re shipping more and more code to your browser. Which is fine I guess, but I know it’s frustrating not being able to catch simple bugs like these before you get into production. So if we fix the errors again and we run our program, we’ll see that most of these inputs throw an error, a type error specifically, or result in some kind of nonsensical value like not a number.

So let me ask you again, how many ways can that function fail? Think about all the possible inputs that you could pass into that function, like every single one of those. There is an infinite number of permutations of garbage you could pass into this function. You could give it an array of eleven 1s, array of a million 8s, 100,000 character string, it doesn’t matter, they would all cause problems.

But using TypeScript, let's just add the simplest type annotation to drastically constrain this function down to accepting only one set of types. So now, x will only be a number, at least at compile time, and the compiler will yell at me if I try to do anything else. And if you run this function with every single number that's possible in JavaScript, in theory, none of them should cause an error, because you have now typed your program correctly, and in most cases, this function is what we call total.
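
With that annotation, the sketch above becomes, roughly:

```typescript
// The only change is the type annotation: x must be a number.
const half = (x: number): number => x / 2;

half(42);      // OK: 21
// half('5');  // compile-time error: string is not assignable to number
// half(null); // compile-time error under strictNullChecks
```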

So my bold claim today is that types and constraints will reduce the bugs in your program. Today we’re going to look at a very brief introduction to TypeScript, some background into type systems, and how it can help you write better programs. And finally, we will explore what it means to have a type system over the network.

A Gentle Introduction to Types

So we briefly saw what types in TypeScript look like. Let’s learn some of the basics just to establish a foundation, so everyone’s on the same footing for the next topic. And I’m not going to go through everything here, so don’t worry too much about syntax. Focus on the ideas and you can always look up syntax later. So first, let’s look at variables, something I hope you use all the time. In TypeScript, you specify the type after the variable name, and these two examples are actually both equivalent, since TypeScript can infer the type, even if it’s not explicitly declared. For this, you have to tell TypeScript what kind of array it is. So here we have number[], which means an array of numbers.

You can also define your own types, and here I've made my own type called ServerStacks. It is a union of string literals, meaning that anything that has a type of ServerStacks can only be one of these exact strings. You can also define things like interfaces and other types like that, but here my User interface describes the shape of some arbitrary user object, and you can also use interfaces to describe functions. This is a function that takes an argument of type User. So as the earlier example suggests, because isAdmin is a boolean, we know that it's not possible to assign that property the number one, which is a number. And this is all caught for you at compile time, so if you use something like VSCode or an IDE, you'll also get a nice squiggly line and an error message while you're programming.
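
A condensed sketch of the declarations being described (ServerStacks, User, and isAdmin come from the talk; the union members and the remaining field names are assumed):

```typescript
// Explicit annotation and inference are equivalent here.
const port: number = 3000;
let host = 'localhost'; // inferred as string

// An array type has to say what it contains.
const latencies: number[] = [12, 45, 7];

// A union of string literals: only these exact strings are allowed.
type ServerStacks = 'node' | 'rails' | 'django';

// An interface describes the shape of an object.
interface User {
  name: string;
  isAdmin: boolean;
}

// A function that takes an argument of type User.
function promote(user: User): void {
  // user.isAdmin = 1; // compile-time error: number is not assignable to boolean
  user.isAdmin = true;
}
```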

So those are just the basic examples, and again, don’t worry about the exact syntax. Remember the idea is everything has a type. It could be a primitive type like a number or a string, it could be a function, it could be a user-defined type. It could be an interface. These are all the things that describe your application.

Now, I really want to talk about how you can make these types reusable, just like how you make your code reusable. So, bear with this contrived example. makeArray is a very simple function that takes a value X and then puts that into an array as the first element. But what we have here only covers numbers, strings, and booleans. What about objects? What about the types that I defined earlier? What about symbols? What about instances? How can you type this function so you can accept any type, without defining all the possible variations, but still have the type safety that you want in your program?

The answer is to make it reusable by making it a generic function. Now, notice that the function logic itself has not changed; only the type signature has. So let’s walk through this. The first thing you’ll notice here is that this single letter T in the pointy brackets. This is a type variable, and much like a regular variable, it’s for types. This tells TypeScript that the function is generic and can work over more than one type. If you want to get fancy, this is also called parametric polymorphism, but most people call it generics.

Now, in your function arguments, we say that x is of type T, and this will capture the type of whatever was passed into the function, and says, “let T be the type of that argument x.” So this could be a string, it could be a number, it could be a user, whatever you pass into this function. Then we can use T later in the type signature. Finally, we specify the return type of this function, which in our case is an array filled with items of type T. So now when you use your function, TypeScript will know how to guide you. If you pass in a number, it knows that it will return an array of numbers. And if you pass in a string, it will give you back an array of strings. This is a very simple example, but also very powerful, because it lets you reuse your types and helps you not to repeat yourself, because when you use a typed language, it very quickly gets pretty verbose.
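
A sketch of the generic version being described:

```typescript
// Non-generic: works only for numbers.
function makeNumberArray(x: number): number[] {
  return [x];
}

// Generic: the type variable T is captured from whatever argument is passed in,
// and reused in the return type.
function makeArray<T>(x: T): T[] {
  return [x];
}

const a = makeArray(1);         // inferred as number[]
const b = makeArray('hello');   // inferred as string[]
const c = makeArray({ id: 7 }); // inferred as { id: number }[]
```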

Now, here's another example to drive things home. This map function is generic and takes two type arguments, A and B, which we will use later. map takes two arguments, the first is a function, and the second, a list of items. The first argument is a function which takes a single argument of type A and then returns a single value of type B. The second argument to this function is an array of items that are of type A. Now, the return value of this function is an array of items that are of type B. So just based on the type signature alone, it should be fairly obvious what this function does. And as you might have guessed, this function is one of the built-in methods on the array prototype, called map. And with this type signature, TypeScript will know how to type your functions accordingly based on what you pass into it. And you'll be able to tell what is returned without actually having to evaluate the function, and what this really enables you to do is to start thinking about how your functions compose, purely based on the inputs and outputs that go into them.
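
Written out, the signature being described looks roughly like this:

```typescript
// A standalone map: a function from A to B, applied to a list of A, yields a list of B.
declare function map<A, B>(fn: (item: A) => B, items: A[]): B[];

// From the signature alone we can tell what comes out:
// map(n => n.toString(), [1, 2, 3]) has the type string[]
```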

Why Less is More

So we've looked at some high-level concepts and again, don't worry about the exact syntax. Focus on the ideas. If you recall earlier, we talked about how many ways that program could fail. Having types means that your functions and the classes and bits of your code can specify constraints about what you can pass into them. And there is really a ton of things that functional programming can teach us here. And if you're interested in learning more, there are a lot of videos in this playlist, which I link to at the bottom of this slide.

Now, type systems are grounded in mathematics, and so is functional programming. Interestingly, type systems can also be applied to object oriented languages, as I’m sure you’re all aware. And the concepts behind this are actually pretty interesting. And I think it can help us understand why type systems help us write better code, without all the jargon that you see up here.

So let’s start with some quick high level overview. Functional programming is really the practical implementation of a bunch of different mathematical theories. We have proof theory, which is about constructive logic and mathematical proofs. We have category theory, which is about algebra and how things compose, about abstractions, and if you’ve ever dealt with things like monoids or functors or monads, these are all examples of categories within category theory. And finally, we have type theory, which is about programs and the idea that propositions are types. So Curry, Howard and Lambek are logicians and mathematicians who independently found that these three different mathematical theories were describing the same idea, just in different ways. And the correspondence between them is really the relationship between these theories, proof theory, category theory, and type theory.

Now, I promised if this sounds arcane to you, don’t worry, I’m going to explain all that crazy talk. The key takeaway here is that propositions are types. For example, my type signature here states that given two numbers x and y, a number exists, and that programs are proofs of those propositions. Because just like in mathematics or in real life, you don’t just get to say that things are true. You have to prove it or you have to show your work. And in this case, x plus y is a proof that a number exists. We’re going to revisit that concept again later to see why it’s useful, but first let’s take a look at functions not in the programming sense, but in the mathematical sense.

So this is a function as described in mathematics. In plain English, this function maps an object of type A to an object of type B. Pure functions, if you have ever heard of the term, are really mappings. For a given input, you always get an output, and the same output for that input. And note that object here doesn't refer to a JavaScript object, but an ordinary object unrelated to programming. If you remember what we said earlier, types are propositions. They assert that something is true or exists, for example, this function will return a number. And our programs are proofs. A number exists because here I have produced the number one in my code. Therefore, type checking is really proof checking, and if you work backwards from that intuition, what this means is that if you have good types in your system, you can really let the type system suggest the implementation for you; basically, almost like type-driven development. And in theory, depending on how advanced the type system you're using is, you could declare all of your types and function type signatures without writing a single line of actual implementation code. And if it compiles, your program should just work, trademark. Then it's really a matter of filling in the details, proving that the propositions that you've defined are true.
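
The x-plus-y example from a moment ago, written out: the signature is the proposition, the body is the proof.

```typescript
// Proposition: given two numbers x and y, a number exists.
// Proof: here is one.
function add(x: number, y: number): number {
  return x + y;
}
```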

So let's start with a concrete example. This function takes a list of items and returns the first one. Once you've written the type signature down, the implementation can sometimes seem pretty obvious. For example, if you return the list itself, you get TypeScript immediately yelling at you that, "Hey, this isn't the right type," because you're returning an array instead of a single item, as your type signature suggests. So if you try returning the first item in the list, then that's really all you can do, because there's very little else that would make this proposition true. And you might actually notice that you don't have to return the first index, you could return any one, and this would still type check. And this here is really an example of a limitation of a lot of type systems, that you cannot actually check these kinds of things.
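
A sketch of that example:

```typescript
// Proposition: given a list of T, a T exists.
function first<T>(items: T[]): T {
  // return items;   // compile-time error: T[] is not assignable to T
  return items[0];   // type-checks -- though so would items[1]; the type
                     // system can't express "specifically the first element"
}
```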

And other languages like Idris do have more advanced type systems where they have concepts like dependent types, where this sort of type checking can be possible. But what I want to highlight is that TypeScript and Flow are really not silver bullets. They are a pragmatic way to balance productivity and type safety within your code. And if we look at our map function again, if you try this out iteratively and return the items, if you just return the items that you got, you’ll see some compile errors, since, of course, the type signature does not match. It does not prove that our proposition or type signature is true.

If you call the callback function on a single item in the list of items, then you’re getting a little bit closer, except you’re not returning a list of objects of type B; you’re only returning a single B. So logically, the only thing you can do here is to take that callback function and run it on every item in the list. And this seems pretty obvious once you’ve done it a few times.
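
Filling in the proof for the map signature from earlier, roughly:

```typescript
function map<A, B>(fn: (item: A) => B, items: A[]): B[] {
  // return items;          // error: A[] is not assignable to B[]
  // return fn(items[0]);   // error: B is not assignable to B[]
  return items.map((item) => fn(item)); // run the callback on every item
}
```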

And what about React, our favorite library? The nice thing about React is its very elegant mental model: the view is a function of your state. And in React, functional components, as their name suggests, are functions, because they take props and they give you back a ReactNode. So another fairly contrived example, if you try this out with the number one, obviously it will not work, because Lauren, one is a number, not a ReactNode. If you make it a function that returns a one, of course, it doesn't compile either. So logically, again, the only thing you can do is to return a JSX element, and your types will pass. Thinking in type signatures and pure functions makes it much easier to compose functions and to start thinking about creating higher order components and higher order abstractions that you can reuse throughout your code.

So let's revisit the implications of the Curry-Howard-Lambek correspondence, the correspondence between proof theory, category theory, and type theory, and learn how they can help you write better functions in your JavaScript applications today. Here we have a graph of the function f of x equals x squared, and just like in Hannah's talk, what I want to call out here is that the functions that you are very familiar with using in JavaScript are really similar, if not the same, as the kind of functions that you learned in high school math; they're the same conceptually.

And some terminology before we go further, just so we’re all on the same page: pure functions as I explained earlier, map an input to an output. And when you’re talking about functions like these, you also talk about things like the domain and the codomain. The domain of a function is the set of all possible inputs that can go into the function, for example, in this function, it’s all the numbers possible. And the codomain of this function is the set of all possible outputs, which in this case is also numbers.

Now, has anyone heard of total or partial functions? I'm not talking about partially applied functions. That's a different concept. Now, I don't think it's very commonly talked about, but functions can be pure or impure, and that is really independent of whether they are total or partial. A partial function is a function that is not defined for all possible input values. So think about the functions in your code base that don't always return a value, or they can return undefined, or they could even throw exceptions, or worse, hit an infinite loop or crash your entire computer. If you take that half function again, we can see that this is a pure function. It doesn't have any side effects, and it always gives you the same output for the input that you pass into this function. However, it is a partial function, because there are inputs that could cause it to throw errors.

So if we look at all the possible domains and codomains for this function, we know that if we pass this a number, it will give us back a number. And if we pass this function a string, it could either return a number or not a number, because JavaScript. And this is actually a feature, not a bug. JavaScript was designed to work in the worst of environments and it will try its hardest to make your code at least run even if it’s terrible. And this understanding of how JavaScript was designed is really the core design philosophy behind many typed languages that compile to JavaScript, like TypeScript or Flow. It’s never about achieving 100% type safety in JavaScript, because it’s pretty much impossible, but you can try your best.

And getting back to this example, you also have your other types like null or undefined, you have objects, you have arrays, which are really special objects, and these all return not a number, which by the way, is a number. Thanks again, JavaScript. And if you pass in a symbol, you’ll get a type error, which is not actually an output or a codomain, it is an exception that’s thrown by the runtime.

Now, with a type annotation, we can restrict the domain of this function to only numbers. And therefore, this function is now total, at least at compile time, because it should never be undefined or throw an exception. And what is a total function? As the name suggests, and from the previous example, a total function is a function that is defined for all possible values of its input, which means that it always terminates and it always returns a value, never undefined or an exception.

Let’s take a look at another function. fetchUser takes a user name as a string. It hits some API behind the scenes and returns a promise. And this is a partial function because the fetch could fail. I’d like to introduce you to something to make this function total. The promise now resolves with a user wrapped in an Either type. And now this function can only return a promise that resolves to this Either type. Whether or not it fails, it will always return this type. And the Either type really encodes the two possible outcomes from this code. It returns a Left, which represents a failure, or a Right, which represents success. What we have here is actually a monad, but I promised no jargon, so don’t worry about what that means just yet. The idea here is that regardless of what you get back as output, you want to have something with a very similar interface, so you don’t have to care if it was successful or if it was a failure.

Now, here's one example implementation, or proof, of this type signature; there are many different kinds of code that you could write to fulfill it. So first, I'm just going to fetch my user from my API, and if the fetch is not a 200 OK, I will return a left, which again represents failure. I also pass the left function an error object, which I can use later. And note that the error here is not thrown, it's just passed into the left function as an argument.

If the fetch was successful, we want to return a right, which in this case is an object that matches the user interface that we defined earlier. When we use this function, note that regardless of whether the request fails, the return value of that function has methods that you can call safely. So this code on screen won't cause any issues even if the fetch fails, and you'll see that we're actually handling both the success and the failure scenario in the same block of code. And when you get into functional programming, you'll learn how to compose these concepts to do even cooler things, so you can either look up category theory, or come speak to me later after my talk.
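
The talk's exact code isn't reproduced here, but a hand-rolled sketch of the idea, using a tagged union rather than method-bearing Left/Right objects, might look like this (the URL and the User fields are assumptions):

```typescript
// A minimal Either: Left carries a failure value, Right carries a success value.
type Either<L, R> =
  | { kind: 'left'; value: L }
  | { kind: 'right'; value: R };

const left = <L>(value: L): Either<L, never> => ({ kind: 'left', value });
const right = <R>(value: R): Either<never, R> => ({ kind: 'right', value });

interface User {
  name: string;
}

// Total: the promise always resolves to an Either, whether or not the fetch succeeds.
async function fetchUser(username: string): Promise<Either<Error, User>> {
  try {
    const response = await fetch(`/api/users/${username}`); // hypothetical endpoint
    if (!response.ok) {
      // The error is not thrown, just passed along as a value.
      return left(new Error(`Request failed with status ${response.status}`));
    }
    return right((await response.json()) as User);
  } catch (err) {
    return left(err instanceof Error ? err : new Error(String(err)));
  }
}

// Success and failure are handled in the same block of code.
async function demo(): Promise<void> {
  const result = await fetchUser('lauren');
  if (result.kind === 'left') {
    console.error(result.value.message); // e.g. show an on-screen message
  } else {
    console.log(result.value.name);
  }
}

demo();
```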

The cool thing about the Either type is that you can capture the failure and then do something with it. So if I were building a UI, maybe I could alert the user with an on-screen message telling them that something went wrong. Or if I went down the RxJS path, then maybe I could do that with different observable handlers as well. So let's take a look at one more example. This function detects the first element in the viewport inside of a scrollable container element, and because this function can return undefined, it is a partial function as well. For example, what if I gave it a selector that yielded no elements? We can reduce the codomain of this function and make it total by wrapping the result in an Option type, so that every input to this function gives us an output in this codomain, and you will never get undefined returned from this function.

The Either type and the Option type are quite similar. The main difference is that the Either type also captures the failure value, while the Option type does not. And the Option type, or monad, represents either some value or nothing. Now, this is the actual function implementation, but don't worry about the details. You can see that there are two places where this function can give you an empty result. querySelector is one of the HTML DOM methods, and it will return null if it finds nothing. And Array.find will return undefined if it does not find anything within the array. But unlike the Either type, we're not going to capture the reason for failure, we're just going to return the value none. And if we do find the first visible element, we'll return that element.

So when you use this function, what does it look like? Regardless of whether an element was found or not, we're going to get back a consistent interface, whether the type is something or nothing. And both of these objects, some and none, have the map function defined, so my code will not throw an error like "undefined is not a function," our favorite error.
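
Again not the talk's code, but a minimal Option sketch of the same idea (the container selector and the visibility check are simplified stand-ins):

```typescript
// A minimal Option: Some carries a value, None carries nothing.
type Option<T> =
  | { kind: 'some'; value: T }
  | { kind: 'none' };

const some = <T>(value: T): Option<T> => ({ kind: 'some', value });
const none: Option<never> = { kind: 'none' };

// Map over an Option without caring which case it is.
function mapOption<T, U>(opt: Option<T>, fn: (value: T) => U): Option<U> {
  return opt.kind === 'some' ? some(fn(opt.value)) : none;
}

// Total: every input yields an Option, never a bare undefined.
function firstVisibleElement(containerSelector: string): Option<Element> {
  const container = document.querySelector(containerSelector);
  if (container === null) {
    return none; // unlike Either, the reason for failure is not captured
  }
  // Crude stand-in for "visible": the first child whose top edge is in view.
  const visible = Array.from(container.children).find(
    (el) => el.getBoundingClientRect().top >= 0
  );
  return visible === undefined ? none : some(visible);
}

// The calling code looks the same whether something was found or not.
mapOption(firstVisibleElement('.scroll-container'), (el) => el.scrollIntoView());
```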

And finally, here's an example of a React component that fetches data. Because functional components are pure functions, you can also make them total. So this component lets me define four possible states that could be present in my component: no data, the fetch is pending, the fetch has failed, and the successful case with data. Writing total functions means that you think about these scenarios up front, and you also think about the way that these abstractions can compose with each other so you can do more with less.
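
A sketch of the four-state idea as a discriminated union (the component and data shapes here are assumed, not the talk's code):

```tsx
import * as React from 'react';

interface Movie {
  title: string;
}

// The four states the component has to handle, encoded as a discriminated union.
type FetchState =
  | { status: 'empty' }
  | { status: 'pending' }
  | { status: 'failed'; error: Error }
  | { status: 'success'; data: Movie[] };

// A pure, total functional component: every possible state returns a ReactNode.
export function MovieList({ state }: { state: FetchState }): React.ReactNode {
  if (state.status === 'empty') return <p>No data yet.</p>;
  if (state.status === 'pending') return <p>Loading…</p>;
  if (state.status === 'failed') return <p>Something went wrong: {state.error.message}</p>;
  return (
    <ul>
      {state.data.map((m) => (
        <li key={m.title}>{m.title}</li>
      ))}
    </ul>
  );
}
```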

I mentioned that domains and codomains describe the set of all inputs to and outputs of your functions. So whenever you talk about things like domains and codomains, you're also talking about set theory. And in set theory, cardinality means the number of items in the set. Let me just explain why this is a useful concept and not some arcane math thing. I did make a bold claim earlier that the more specific your types and the more constraints you have in your code, the less likely you are to experience bugs. And this is not a silver bullet by any means, but lowering the cardinality of the code does mean constraining the number of possible inputs in the function domain, and, therefore, it should prevent bugs from happening.

So if we start from the basics, a set is informally a collection of objects. In our context, we want to think about the set of all values and the types that we’re dealing with in our functions. Here, we have a type called Conferences. Note that this type is not a string, which would match every possible string, but because there are only three possible values that I passed into the type, we say that the cardinality of this type is three. But again, if it was just a string type, then there could be an infinite amount of strings that would match this type.
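
For instance (the conference names here are placeholders):

```typescript
// Cardinality 3: exactly three values inhabit this type.
type Conferences = 'QCon SF' | 'QCon London' | 'QCon NY';

// Cardinality is infinite: any string at all matches.
type ConferenceName = string;
```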

So what I’m trying to say is that primitive types are not precise. If you look at the cardinality of all the primitive types that you have in JavaScript, the only types with finite cardinality are booleans, nulls, and undefined, and you probably don’t want to write a program just using those types, unless you are very hardcore. Even the non-primitive type object has an infinite cardinality, and any possible object that you have in your system you can think of would match this type. So defining precise types can really help reduce bugs in your code, so let’s look at some examples.

Here we have a generic function, toString, which calls the toString method on the object that was passed into it. And the strange thing here is that even though null and undefined don’t have the toString method defined on them, as we saw earlier, this will still be type checked, or this will still be considered type safe by TypeScript. And what this means is that you have lost type safety, because this will cause a type error if we run this code and that kind of totally defeats the purpose of using TypeScript in the first place. So this is a function with unbounded polymorphism, which basically means that T, the T type variable, can be literally any type, and you almost always want to stay away from the any type, because it’s not precise.

We know in advance that null and undefined are going to give us type errors, so we want to make this a bounded generic function. To do this, we’re going to add a constraint to our generic type. If you look at the code here, you’ll see that NonNullable is not actually JavaScript. It is TypeScript syntax, and it is a type that will constrain down that generic type T into something that’s much narrower. And what it does is it tells TypeScript to exclude null and undefined from all of the possible types that could be captured by the type variable T.

And now, assuming everything else in your code has toString defined, you have made this function total, since it will always return for every possible input. If you want to think about this visually, imagine a Venn diagram with two sets; one being the set of nullable types, null or undefined, and the other the set of all types, which is basically every single type in your code base. Here I've only listed the default JavaScript types, but the types that you've defined in your code could also be members of this second set. And what we really want here is the difference between these two sets: all of the types minus the nullable types, which will leave you with the non-nullable types.
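
A sketch of the two versions; here a plain {} constraint on the type parameter is used to get the same effect as the NonNullable helper the talk mentions:

```typescript
// Unbounded: the cast to any silently opts out of checking, so passing
// null or undefined compiles fine but throws a TypeError at runtime.
function unsafeToString<T>(value: T): string {
  return (value as any).toString();
}

// Bounded: the {} constraint excludes null and undefined (under strictNullChecks),
// which is the same effect NonNullable<T> gives you when stripping
// null | undefined out of a type.
function safeToString<T extends {}>(value: T): string {
  return value.toString();
}

safeToString(42);           // OK: "42"
// safeToString(null);      // compile-time error
// safeToString(undefined); // compile-time error
```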

The NonNullable type is implemented with conditional types, a relatively new feature in TypeScript 2.8, and it is the basis for many of the new helper types that they've added. The great thing about TypeScript is that it now ships with a lot of these little helper utilities built in. So they're very useful when you start thinking about adding constraints to your code and making it more strict. And again, there's a link at the bottom if you want to check out more details in the TypeScript source.
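
For reference, this is how the helper was originally defined in the TypeScript standard library, as a one-line conditional type:

```typescript
// Equivalent to the built-in helper: a conditional type that distributes over
// unions, so NonNullable<string | null | undefined> resolves to string.
type NonNullable<T> = T extends null | undefined ? never : T;

export {}; // make this snippet a module so the local alias shadows the built-in
```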

So we’ve talked a lot about functions and functional programming concepts and arcane jargon that I promised I wouldn’t talk about. But at the end of the day, just keep in mind that what you really want is to be pragmatic. JavaScript is a multi-paradigm language, so you can do whatever you want. We can do object-oriented. You can do functional programming. You can do functional-reactive programming. You don’t have to get too religious about it. That’s what’s great about JavaScript.

And TypeScript is pretty similar. It’s really about pragmatism. And it wants to strike a balance between productivity and type safety, because you don’t have to go all-in in order to benefit from it. And again, it’s not going to solve all your problems. You’re not going to become a Fortune 500 company if you use TypeScript, because, you know, TypeScript will help you catch the obvious bugs, but it can’t catch every single bug for you. The nice thing about taking this approach is that you really start thinking about the different kinds of constraints that, or even invariants in your code, that always have to hold true. And you think about things like fault tolerance and making your functions total, but only where it makes sense.

Types over the Network

I want to close off by talking about GraphQL and explore the idea of an end-to-end type system. GraphQL is a query language and specification, and just like TypeScript and Flow, it's also strongly typed, and it allows you to define a schema that describes your entire object graph. The benefit of this approach is that your clients ask for exactly what they want, and they get predictable results every time. GraphQL APIs are organized in terms of types and fields and not endpoints, and this is great because types ensure that your client applications only ask for what's possible, and they provide you clear and helpful errors when they try something that's not implemented. They also help you avoid writing manual parsing code, since you have the type information.

And speaking about exposing type information, this also represents a lot of opportunities for interesting tools to be built on top of this platform. For example, here we have a tool called GraphiQL, which is a visual playground that you can mount on top of any GraphQL-compliant API. And what this means is that if you're using GraphQL anywhere in your tech stack, you get documentation almost for free. And this is a really great developer experience for clients of your service.

GraphQL inverts control back to your client, and they request only the fields they want, nothing more, nothing less. And the backend developers here may be pretty alarmed by that, since the clients can just ask for anything they want, any combination of fields, so, you know, what about things like N+1 queries, and that's definitely a very valid concern. But GraphQL on its own is really decoupled from the way that you retrieve your data. So this is not a problem that is unique to GraphQL. And tools already exist out there that can solve these kinds of concerns. For example, you may want to look at Facebook's DataLoader module, which will coalesce requests into a batch before you execute them.

But the nice thing about GraphQL and its type schema is that you get a lot of the same developer ergonomics that you have in TypeScript. And the only difference really is that the type system is now communicating to you over the wire instead of in your editor. And this is really great when you’re working in very large distributed teams. The GraphQL type system is still evolving, and there’s so much potential in where it could go. But that led me to a very interesting idea. As Neha introduced me, I work at Netflix, and we do have a lot of microservices. So I was thinking to myself recently, what if we had a distributed type system? What would that look like? Not just type safety in the editor while I’m creating my client applications, but type safety across multiple teams in a very large organization.

So this is a lot of what we’re thinking about in my team at the moment. We’re building tools that power the Netflix studio, and we’re trying to make it more efficient and to empower better decision-making. And the goal is really to help with content creation, which is a very highly complex domain, spanning multiple business functions like finance, legal, business, staffing, creative, which are also very tightly connected as a graph. And there are a lot of different core entities that we deal with everyday, such as talent, movies, vendors, deals, projects, what have you, across multiple systems.

So the challenge here is really in connecting diverse sources of enterprise data across multiple fast-moving domains. And we’re investigating a couple of things here. We are investigating the idea of a GraphQL API gateway in the form of a hybrid between service orchestration and choreography. We’re also exploring, as I mentioned earlier, the distributed type system, which can enable developer productivity across your organization in real time, allowing you to move quickly while building up a larger graph that describes your entire business, but allow you to stay consistent as the different data models and the different domains in your business evolve.

What’s really exciting here are the possibilities at the various levels in your organization. For example, as a UI engineer, one of the benefits of this approach is that you can now think about things like autocompletion in your editor or your IDE while you’re building client UIs. And you can see even new fields come online as other teams are evolving the graph, right in your editor. Another thing you may be able to do with this technology is to know things like the impact of page load time by field, because you can tell exactly how long it takes for a field to be returned by this graph.

As a backend engineer, imagine if you had pull request feedback, or even build failures, whenever a schema change in your service would break clients of that data. That's something that we encounter pretty often. And especially if you're building microservices that are heavily depended on by different people, you do want to make sure that any interface changes are properly communicated. And also, what about visibility into the aggregated graph of all of your different data models as it evolves over time?

So to do this, we're thinking about combining gRPC, GraphQL, and TypeScript, which are all strongly typed, combining them to create the ultimate end-to-end type system, like a kind of a strange Voltron. This is not perfect at the moment. Although Node has the best support for GraphQL tooling currently, the generated gRPC code in NodeJS isn't the greatest. And this kind of architecture still has many rough edges. For example, code generation between the three formats is not one to one, since type information from the different formats cannot be cleanly transformed from one to the other.

Now, this all sounds nice, but how do you use it? Unfortunately, a cohesive solution doesn't quite exist yet, so you do have to glue many pieces together. But I will say that TypeScript, Flow, and typed languages in general help make this possible. It's always tough, when you're making architectural changes like this, to have multiple teams move towards a common goal. But because JavaScript can be refactored towards TypeScript in an incremental fashion, adoption hasn't been that bad.

There are also some SaaS products today that can help you get close to this pattern if you're interested. For example, this is Apollo Engine. It has a feature that lets you track the evolution of your GraphQL schema, and do things like raise alerts when breaking changes are introduced. There are also many open source tools to make translating the schemas between these different technologies and types easier. And lastly, another interesting project that you might want to look at is this unofficial Google project called Rejoiner, which allows you to stitch together gRPC microservices across multiple languages and expose them as a single GraphQL API.

So we've discussed a lot today. First we looked at an introduction to basic types in TypeScript. We looked at things like generics and how to write type signatures; a reminder that generics help you write reusable types, just like you do with reusable components and functions. Then we looked at functional programming to see how it can help us, or inspire us, to write better functions that are both pure and total. We also explored a high-level introduction to the Curry-Howard-Lambek correspondence; the idea that propositions are types, programs are proofs of those propositions, and that type checking is really the same as proof checking.

And lastly, we looked briefly at GraphQL and we explored what type systems can give you, even when they’re delivered over the network. I’m really excited to see how this idea evolves and enables greater productivity in large microservice-heavy organizations. We’ve only scratched the surface today, but I hope you learned a little bit about type systems and functional programming, and how it can help you write better JavaScript. Thank you all so much for listening, you’ve been great.

Questions and Answers

Participant 1: What would you recommend to help people start integrating a type system into their existing projects? Any type system.

Tan: Any type system into an existing code base? I think TypeScript and Flow are really great tools, because they let you incrementally adopt TypeScript for example, which I have the most experience with. You can get started by simply renaming your JavaScript file to a .ts file, and you can literally do nothing else but you get some benefits already from TypeScript. And some recent news that actually makes this even easier is that now Babel has an integration with TypeScript, so Babel can actually compile your TypeScript for you without even having to do anything special.

So I think that's a really great way to just incrementally refactor your existing JavaScript into TypeScript, whereas with other typed languages, like ReasonML or PureScript, you kind of have to rewrite your whole system from scratch, which isn't really that incremental. So, hopefully that helps.

Participant 2: I have a question for you that’s related to that, which is when you started loving type systems, was this a change that was rolled out slowly over the team as well, and how was the type system integrated into the codebase?

Tan: The question was really how did we get started adopting TypeScript in our organization? And I think it started very incrementally. One of the early ways we started was just simply by adding type definitions to some of the libraries that we share among the different teams. And if you use an editor like VSCode, you actually can benefit from those type declarations, whether or not your project was written in TypeScript, because one of the really interesting things about the TypeScript language server, from what I understand, is that it will actually attempt to install the type definition packages for you, whether or not you’re using TypeScript. And that’s how we can give you autocompletion in your editor for certain things like, you npm install Lodash, and suddenly you have Lodash autocompletion. How does it do that? That’s actually through TypeScript. So that was one of the ways we started and it was a great way to start thinking about all the different interfaces that we were implicitly relying on and now making them more concrete.

Participant 3: So GraphQL is not actually a JSON body that we send; it has its own specification. So what is the media type that we set for sending the payload?

Tan: It is still a JSON blob, but the main body is captured as a string. So you’re right in the sense that it’s a little bit harder to work with; especially if you’re doing service-to-service communication, you now have to implement a GraphQL parser to be able to translate that request into something else. Which is why we, for example, in the example I showed earlier, we’re still using things like gRPC or Thrift for service-to-service communication. And then we expose GraphQL as a kind of product-facing API for our clients.

Participant 3: In enterprises, teams have spent money on building something like a gateway, wherein we can define an interface and validate it before it goes into production. So do we have any such product for GraphQL? Open source?

Tan: In terms of validating your interfaces before you go into production?

Participant 3: Yes, something like Swagger, that validates the interface.

Tan: GraphQL is only a specification, so I don't think GraphQL itself will provide that for you, and, off the top of my head, I don't know if there are packages like that. But I'm going to guess that there are. But I'm not sure.

Participant 4: Yes, I was just wondering if you have any preferred libraries for algebraic data types playing well with TypeScript, and also, TypeScript doesn’t have type inference. At least it doesn’t seem, it’s pretty verbose, kind of like Java or something? And it’s just aesthetically not pleasing to me. Do you know of any other type systems that compile to JavaScript that have type inference?

Tan: So first of all, to answer your first question, which was about what I prefer for algebraic data types, there is a really great package called fp-ts. It's by a mathematician; I can't remember his name, but his GitHub username is gcanti, and he also has a very interesting package called io-ts, which is an interesting way to actually add run time type checking. I haven't really used it myself, but it's a very interesting exploration into that. So fp-ts is Fantasy Land compliant, so you can use it if you're familiar with that style of functional programming, and it's pretty well-written. It's a TypeScript-first project, so it even has support for higher kinded types, and he goes into quite great detail about how it supports that. So that's definitely where I would start.

And then in terms of type inference, TypeScript does have type inference, it's just not very advanced. So it has a basic understanding of some types. And here is really one of the main differences between TypeScript and Flow. The way the TypeScript compiler is created, it's based on the AST, an abstract syntax tree. So it can really only understand the types that it's aware of, and once it encounters a type that it has no idea what to do with, it just breaks. But Flow's compiler is graph-based, so when it encounters a type that it doesn't yet know, it can just wait, come back to it later, and then try to infer what that type is. So that's really one of the major differences between Flow and TypeScript. But, yes, TypeScript does have type inference, it's just not super great.

See more presentations with transcripts



Server and Network Operations Automation at Dropbox

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Dropbox's engineering team wrote about their network and server provisioning and validation automation tool called Pirlo. Pirlo has a pluggable architecture based on a MySQL-backed custom job queue implementation.

Dropbox runs its own datacenters. The Pirlo set of tools consists of a TOR (top of rack) switch initializer and a server configurator and validator. It runs as worker processes on a generic distributed job queue, built in-house on top of MySQL, with a UI to track the progress of running jobs and visualize past data. Pirlo is built as pluggable modules, with plenty of logging at every stage to debug and analyze automation runs. Dropbox has an NRE (Network Reliability Engineering) team that works on building, automating, and monitoring its production network. Most of Dropbox's code is in Python, although it is unclear if Pirlo is written in Python as well.

Both the switch and the server provisioner use the job queue, with the specific implementation in the workers. The workflow is similar in both cases, with a client request resulting in the Queue Manager choosing the correct job handler. The job handler implementation runs plugins registered with it, which carry out the actual checks and commands. Each plugin performs a specific job, emits status codes, and publishes status to a database log, including events that record the command that was run. This is how most job queues work, so it is natural to ask why the team did not opt for an existing one like Celery. The authors of the article explain that

We didn’t need the whole feature set, nor the complexity of a third-party tool. Leveraging in-house primitives gave us more flexibility in the design and allows us to both develop and operate the Pirlo service with a very small group of SREs.
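
The article does not include code, and Pirlo itself is an in-house system (most of Dropbox's code is Python). Purely as an illustration of the handler-runs-registered-plugins pattern described above, a minimal sketch might look like this; none of the names come from Pirlo:

```typescript
// Purely illustrative: a minimal version of the "handler runs registered plugins,
// each plugin emits a status and logs the command it ran" pattern.

interface PluginResult {
  status: 'ok' | 'failed';
  command: string; // recorded in the event log
  detail?: string;
}

interface Plugin {
  name: string;
  run(target: string): Promise<PluginResult>;
}

class JobHandler {
  private plugins: Plugin[] = [];

  register(plugin: Plugin): void {
    this.plugins.push(plugin);
  }

  // Run every plugin in order, persisting each result to a log
  // (stdout here; a MySQL-backed table in the system described).
  async handle(target: string): Promise<boolean> {
    for (const plugin of this.plugins) {
      const result = await plugin.run(target);
      console.log(`[${plugin.name}] ${result.status}: ran "${result.command}"`);
      if (result.status === 'failed') {
        return false; // stop at the first failing check
      }
    }
    return true;
  }
}

// Example plugin: a basic connectivity check before any configuration step.
const connectivityCheck: Plugin = {
  name: 'connectivity',
  async run(target) {
    return { status: 'ok', command: `ping -c 1 ${target}` };
  },
};

const handler = new JobHandler();
handler.register(connectivityCheck);
handler.handle('switch-01.example').then((ok) => console.log('job succeeded:', ok));
```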

The switch provisioner, called TOR starter, kicks off when a request is received from a starter client. A TOR switch is part of a network design where the server equipment in a rack is connected to a network switch in the same rack, usually at the top. The client attempts to find a healthy server using service discovery via gRPC, and the queue manager chooses a job handler for the job. Switch validation and configuration is a multi-step process: it starts with establishing basic connectivity, continues by executing each plugin, and culminates in downloading the switch configuration and rebooting the switch.

The server provisioning and validation process is similar. The validator is launched on the server machine in an OS image created with Debirf, which can create a RAM-based filesystem to run Debian systems entirely from memory. Nicknamed Hotdog, it is an Ubuntu-based image that can boot over the network and runs validation, benchmarks, and stress tests. The results are pushed to the database and analyzed later. The tests include validation of various hardware and firmware components against predefined lists of configurations approved by the Hardware Engineering team. Repaired machines also go through this test suite before they are put back in production.

Pirlo's UI shows the progress of both currently running and completed jobs. Dropbox previously used playbooks (or runbooks) to perform provisioning and configuration. Other engineering teams that run their own datacenters have also moved from runbook-based provisioning to Zero Touch Provisioning (ZTP), albeit using different methods.



Data-driven Marketing Strategy: Spatial Analytics for Micro-marketing

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

Data-driven Marketing Strategy: Spatial Analytics for Micro-marketing

Organizations, often in their me-too hurry to adopt a new technology, simply pour their old wine (data) into a new bottle. What was originally called a 'sales information system' in the good old days went through many avatars before it became BI (business intelligence), and of late it is time to switch again. The latest avatar is the data-visualization tool, and every CIO worth the name has a budget for one. Data-visualization tools undoubtedly offer a much better interface for sales and marketing teams to analyze data in their quest to do better, besides serving as a platform for "self-service analytics".

The purpose of this article is not to list the ten best data-visualization tools, or how wonderful they are in transforming the way the marketing manager devises his strategy. This is about the data that resides inside the visualization tool, which ultimately should provide actionable insights.

The data that you need vs. The data that you have

The spoiler, as usual, is “the data” or the lack of it!

The typical sales information system has limited value, as it essentially slices and dices the same internal sales data: last year vs. current year, or drill-downs by product and geography to the extent transactional data exists. Almost all of it is post-facto analysis, often used for forecasting based on historical data.

A typical marketing or sales manager uses the sales information system for his reporting requirements, usually when he needs to submit his monthly report or must prepare a presentation for someone, but rarely ever as a tool to draw the much-needed insights for devising his market strategy. For example, evaluation of a dealer's performance or a field sales executive's performance is done purely on internal target vs. actual, never on actual opportunity size vs. performance. Critical decisions, such as the top 10 micro-markets in each state where the majority of ad spend should be focused, or where exactly the market size justifies recruiting more dealers or expanding the field sales force, are all based on collective gut.

The reasons are not difficult to see…

The critical information needed for devising a market strategy is data available elsewhere, outside the organization's ERP: namely, the market opportunity size by product and the ever-changing customer preferences in each micro-market, in each town and village, every year.

Customers' profiles, their preferences, local factors and so on differ widely across micro-markets (small towns and villages), and hence determine the market size for a specific product and the relative market share of each brand in each micro-market. The marketing strategy (product, promotion, pricing, etc.) therefore needs to be customized for each micro-market.

The source of all data: The field sales executive

A company's performance is usually only as good as its field sales force: often a low-paid, nameless, faceless person who operates out of remote areas and is never seen at regional headquarters except once a year during the budgeting season.

The field sales executive is supposed to have his ear to the ground, know the local markets well, visit the key end-customers and meet the dealers regularly.

The field sales executive is at the bottom of the pyramid, where the rubber meets the road. Hence his mobile is tracked and he is asked to file status reports monthly, weekly and in a few cases daily; in essence, he is a “source” of ground-level information.

In our experience, we have not seen any sales information reports generated specifically to enable a field sales executive to do a better job. We have not seen many companies talking about enabling the field sales force with useful information that actually helps him maximize sales and use his time productively.

The sales information system, whatever the mode or the nomenclature, is almost always built for higher management to review the performance of sales executives. In most cases even basic transaction-level drill-down is missing, let alone a drill-down view of market intelligence by each village.

Marketing Strategy: The Anatomy of Hitherto Unresolved Problems

Most organizations devise a marketing strategy at a macro, aggregate level, which seldom percolates down customized to the specific needs of each micro-market. Geography-specific investments are rarely rational, data-driven decisions.

Many organizations struggle to answer questions such as where to locate and recruit dealers, where to locate warehouses, how to allocate ad spend, and how to do village-level territory planning. Metrics collection and reporting is almost always bottom-up: the field sales executive or the dealer (where the rubber meets the road) is the chief source of market intelligence, which gets collated and supplied on a dashboard to the CxOs.

As explained earlier in this article, the typical Sales-Information-System has limited value because it essentially slices and dices the same internal sales data: last year vs. current year, or drill-down by product and geography to the extent that transactional data exists. Almost all of it tends to be post-facto analysis, best used for forecasting based on historical data or sometimes for root-cause analysis; there is little that provides a field sales executive with actionable insights to help him do a better job.

Operating Blind:

In a good many organizations, there is a lag between the date of a transaction and the date by which it gets reflected in the central IT system or ERP of the company. The lag can be anywhere from a few days to a few months. The reasons?

  • Anything from invoicing in a separate, disconnected PoS application to plain indiscipline.
  • The tell-tale signs that all is not well with the IT systems are visible when the company takes an enormously long time to declare its audited financials every quarter.
  • Multiple disconnected applications mean multiple hand-offs in the information chain and, more disturbingly, multiple data standards, which make tasks like the corporate consolidation of financials every quarter a nightmarishly difficult exercise.
  • Other signs include hundreds of Excel sheets floating around, and sales and finance teams burning the midnight oil each time top management asks for a slightly different report.

 

The marketing manager may fail to find the right expression, but instinctively knows what he wants. He prefers the field sales force to spend more time actually selling on the ground rather than sitting in the branch office filing reports. He prefers the IT department to be the source of information on demand for all reports, and he knows that the right information, i.e. actionable insights delivered on demand, can help the field sales force improve their performance by leaps and bounds.

However, more often than not, the marketing manager operates at the mercy of the CIO. Not being sufficiently tech-savvy, he blindly accepts whatever system the CIO pushes as the latest miraculous cure.

Marketing managers eternally hope for “the solution” that finally provides them with actionable insights and lets them spend more time actually selling rather than hunting for data.

CIOs, on the other hand, complain that the systems are perfect but the so-called marketing managers are like cave-dwellers, painfully slow to adopt and use such wonderful systems.

The truth perhaps lies in between.

And the Blind leading the Blind:

In our experience, marketing managers deeply distrust the data provided by internal IT departments. Often one finds them developing a parallel information system on Excel sheets, with data painstakingly collected from the sales team or from dealers.

The reasons once again, are not difficult to see…

The critical information needed for devising a market strategy is data available elsewhere, outside the organization's ERP and CRM: namely, the market opportunity size by product and geography, and the ever-changing customer preferences in each micro-market, in each town and village, every year.

Other than a half-hearted attempt at collating social-media data, we have not seen any CIO actually try to collate the market opportunity size by product for each geography or each micro-market. Micro-marketing as a concept is still in a nascent stage of development, and I have not seen any software remotely related to micro-marketing in Gartner's hype cycle, not as yet.

There is absolutely no doubt that any organization which implements spatial-analytics and micro-marketing solutions will develop a tremendous competitive advantage.

The cure for Operating Blind: Micro-marketing & Spatial analytics

I personally used Infruid's Vizard as the software of choice for the first spatial-analytics implementation I was involved with. But you may imagine a typical data-visualization tool like Tableau or QlikView, or whatever other tool you are familiar with, that comes preloaded with complete market data for each of the 650,000 villages in India, for each of the products your customers mass-distribute across the country. Then all you have to do is upload your historical, product-specific sales data into the visualization tool, and voilà, you have micro-market insights you have only dreamed of but never seen before. More details below:

For example, our Pathfinder solution comes preloaded with market-data for each of the 650,000 villages in India.

  • Complete Socio-Economic data for each Village, Town and City – including Population & Gender
  • Total Income levels & Agri-Income
  • Net-sown Area & Irrigation percentage
  • Top-3 Crops & Average yield per acre
  • Economy Size & Income Levels
  • Number & Locations of Agri-market yards, Cold-storages etc.
  • Roads & Accessibility for each Village and Town
  • Rural Market Potential Index
  • Accessibility Index.

 

Custom-content to be loaded

  • Location & Vital Stats for Existing Dealers
  • Location & Vital Stats for New Dealers
  • Historical Sales by product by District/Town/Village
  • Actual Business by product, Geography through Time
  • Location & Coverage of Assigned Territories by each Field-Sales-Executive.

What we estimate through a custom algorithm

  • Market potential for different categories of products for each of the 650,000 villages
  • Top-10 villages for each product in each sub-district and in each district
  • The relative growth of each micro-market
  • Under-served markets (a rough sketch of this computation follows the list)
  • The distance of each micro-market from the nearest dealer
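To make the idea concrete, here is a rough sketch of the kind of computation involved. This is not the actual Pathfinder algorithm, just an illustration: the column names, toy coordinates, thresholds and the simplified distance formula are all assumptions.

```python
import pandas as pd
import numpy as np

# Hypothetical inputs: village-level market potential, historical sales, dealer locations.
villages = pd.DataFrame({
    "village": ["A", "B", "C"],
    "lat": [17.4, 17.9, 18.2], "lon": [78.5, 78.1, 79.0],
    "market_potential": [120.0, 80.0, 200.0],   # e.g. lakh INR per year
})
sales = pd.DataFrame({"village": ["A", "B", "C"], "actual_sales": [90.0, 10.0, 30.0]})
dealers = pd.DataFrame({"lat": [17.5, 18.0], "lon": [78.6, 78.9]})

df = villages.merge(sales, on="village")

# Deliberately simplified distance to the nearest dealer (degrees scaled to ~km).
def nearest_dealer_km(row):
    d = np.sqrt((dealers.lat - row.lat) ** 2 + (dealers.lon - row.lon) ** 2) * 111.0
    return d.min()

df["nearest_dealer_km"] = df.apply(nearest_dealer_km, axis=1)
df["fulfilment"] = df.actual_sales / df.market_potential
# Flag large markets where actual sales capture less than 30% of the estimated potential.
df["under_served"] = (df.fulfilment < 0.3) & (df.market_potential > 100)

print(df.sort_values("market_potential", ascending=False))
```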

Benefits: Deep insights & Data-driven Decisions

Deep, drill-down, village-level market intelligence and precision analytics enable decisions such as:

  • Locating your new dealers, retailers or showrooms closer to your largest markets
  • Locating your sales executives closer to the largest or most under-served markets
  • Locating your new manufacturing plants and warehouses
  • Territory planning
  • Sales performance measurement based on actual service levels and opportunity fulfilment in each market

 

Please write to me at krishna@proyojana.com if you are interested in knowing more.

*The infographic above was sourced from Adverity, the Austrian data analytics company.



New Home Sales Projection: A Time Series Forecasting

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

1. Background

New home construction plays a significant role in the housing economy, while simultaneously impacting other sectors such as timber, furniture and home appliances. New home sales are also an important indicator of a country's overall economic health and direction. In the last 50 years there have been a few significant bumps and turning points in this sector that shaped the trajectory of the overall economy. Here I review the historical patterns and make a short-term (5-year) projection of new home sales.

2. Data & methods

The Census Bureau (census.gov) is the source of the dataset for this analysis (side note: it's a great source for social, economic and business time series datasets). This is a non-seasonally-adjusted monthly sales report. I did the analysis in R. Although Python has great time series resources, for forecasting it is not even close to R, thanks to the great forecast package developed by Rob J Hyndman (which I often refer to as a one-stop shop for all forecasting problems). The decomposed time series shows a predictable seasonality and a trend component. Holt-Winters exponential smoothing should work just fine for data of this kind and shape. Nevertheless, I'm using three different methods to compare: ETS, HW and ARIMA. There are plenty of resources on the internet on each of these forecasting methods, so I'm not going to discuss them here (and there is another good reason for not discussing theories too much!).
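The analysis itself is done in R with the forecast package; purely as an illustration of the modeling idea, a comparable Holt-Winters fit and 5-year projection could be sketched in Python with statsmodels. The file name and column names below are hypothetical, and the model settings (additive trend and monthly seasonality) are assumptions in the spirit of the approach described above.

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical CSV of monthly, non-seasonally-adjusted new home sales from census.gov.
sales = pd.read_csv("new_home_sales.csv", parse_dates=["date"], index_col="date")["sales"]

# Additive trend plus additive monthly seasonality, Holt-Winters style.
model = ExponentialSmoothing(sales, trend="add", seasonal="add", seasonal_periods=12)
fit = model.fit()

forecast = fit.forecast(60)   # 5-year (60-month) projection
print(forecast.tail(12))
```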

3. Key results

  • New home sales were steady for a long period of time, with no growth for roughly 30 years starting in the 1960s. Sales then picked up in the beginning of the ’90s and kept growing steadily for about 15 years, until 2005.
  • New home sales started to decline in 2005, and in just 5 years dropped by 75%.
  • Sales have been recovering since 2012, but are still far from catching up with pre-crash levels.
  • Current sales are about 630k new homes per year.
  • The 5-year projection to 2023, in a business-as-usual scenario, shows total sales of 870k: a total growth of about 40% (roughly 7% per year).
  • The projected growth is not even close to the pre-2005 level (>1,200k/year). At the current trend it could be around 2035 before sales catch up to the 2005 level.
  • There seems to be a 5-10 year cycle in the historical trend. If this holds, we may see some de-growth in the near future.

4. Discussion

As revealed in a recently released report, and also reported in the Wall Street Journal, the seasonally adjusted rate of new home sales declined by 8.9% in October 2018, signaling a market slowdown. Some market analysts expect this to continue, predicting that the post-2012 boom may be over. It will take a couple of years of data to understand the trend before we can say with higher certainty what the future holds for this important economic sector.

[The R code and further analysis on this topic are here. Follow me on Twitter for updates on new analyses.]



How to Flourish in Industry 4.0, the Fourth Industrial Revolution

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

Call it a “Forrest Gump moment”: an instance of being in the right place at the right time for no other reason than plain luck. A “Forrest Gump moment” is named after Tom Hanks’ character in the movie “Forrest Gump,” a guy who always seemed to be in the right place at the right time, meeting Presidents Kennedy, Johnson and Nixon at critical points in American history.

I too have had a Forrest Gump moment in meeting President Reagan; however, my deeper Forrest Gump moments have been my long association with the history of analytics. I was fortunate to be at the birth of the Business Intelligence and Data Warehouse era while working for Metaphor Computers to deploy Decision Support Systems across Procter & Gamble in the late 1980s, and fortunate again in the late 2000s to be at the launch of the Data Science era while building Advertiser Analytics at Yahoo.

Now I find myself at another Forrest Gump moment in my opportunity at Hitachi Vantara. We are at the cusp of the next Industrial Revolution, fueled by new technologies such as autonomous vehicles, VR/AR, AI, robotics, blockchain, 3D printing and IoT. The purpose of this blog is to provide some insights into how to properly prepare to derive and drive new sources of customer, product and operational value for this next Industrial Revolution: Industry 4.0.

Birth of Industry 4.0

A recent Deloitte report titled “Forces of Change: Industry 4.0” describes Industry 4.0 as such:

“Industry 4.0 signifies the promise of a new Industrial Revolution—one that marries advanced production and operations techniques with smart digital technologies to create a digital enterprise that would not only be interconnected and autonomous but could communicate, analyze, and use data to drive further intelligent action back in the physical world.”

The key aspect of Industry 4.0 is the melding of the physical and digital worlds around new sources of operational data that can be mined to uncover and monetize customer, product, service and operational insights (see Figure 1).

Figure 1: Deloitte “Forces of Change: Industry 4.0”

Analytic Profiles for the human players, and Digital Twins (see Figure 2) for the physical devices, will play a critical role in powering this “Physical to Digital to Physical” (PDP) loop:

  • Physical to Digital: Capture information from the physical world and create a digital record from physical data.
  • Digital to Digital: Share information and uncover meaningful insights using advanced analytics, scenario analysis, and artificial intelligence.
  • Digital to Physical: Apply algorithms to translate digital-world decisions to effective data, to spur action and change in the physical world.

Figure 2:  Digital Twin Example

A Digital Twin is a digital representation of an industrial asset that enables companies to better understand and predict the performance of their machines, find new revenue streams, and change the way their business operates.

In the world of Industry 4.0, the Digital Twin is the foundation for IoT monetization. It is around these Digital Twins that organizations will build intelligent IoT applications such as predictive maintenance, inventory optimization, quality assurance and supply chain optimization (see Figure 3).

Figure 3:  Monetizing Digital Twins Across Industrial Use Cases

See blog “Leveraging Agent-based Models (ABM) and Digital Twins to Prevent Injuries” for more details on Digital Twins.
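In data terms, a Digital Twin can be thought of as a continuously updated record keyed to a physical asset. A minimal, purely hypothetical sketch of what such a record might hold; the field names and the pump example are illustrative, not any vendor's schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DigitalTwin:
    """Hypothetical, minimal digital representation of an industrial asset."""
    asset_id: str
    model: str
    telemetry: dict = field(default_factory=dict)     # latest sensor readings
    predicted_failure_days: Optional[float] = None    # output of a predictive model

    def update_telemetry(self, readings: dict) -> None:
        """Merge the latest sensor readings into the twin's state."""
        self.telemetry.update(readings)

pump = DigitalTwin(asset_id="pump-042", model="HX-200")
pump.update_telemetry({"vibration_mm_s": 4.1, "temp_c": 71.5})
print(pump)
```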

Understanding Industry 4.0 Challenges

The biggest challenges for Industry 4.0 will be where and how to apply these new industrial technologies to derive and drive new sources of customer, product, operational and market value.  However, we are fortunate to have learnings from previous revolutions – Industrial and Information – that we can apply to Industry 4.0. 

In the blog “How History Can Prepare Us for Upcoming AI Revolution”, we discuss how those revolutions were fueled by organizations that exploited new technology innovations to identify and capture new sources of customer, operational and market value creation. The Industrial Revolution was driven by new technologies such as interchangeable parts (the famous ¼” bolt), specialization of labor, factory-floor assembly concepts, and the availability of steam power. The Information Revolution was powered by new technologies such as x86 and MS-DOS/Windows standardization, packaged database and transactional applications, and the availability of the internet.

And the driving force for capturing the economic benefits of each of these revolutions, and what we are seeing today with the Intelligence Revolution, was the transition from hand-crafted solutions to pre-packaged, mass-manufactured solutions. I expect the same pattern from Industry 4.0.

One other Industry 4.0 challenge organizations must wrestle with is the role of government (in the form of regulations that nurture both competition and collaboration) and universities (in preparing workers for Industry 4.0) in working with industrial concerns to accelerate the preparation for, and ultimate adoption of, Industry 4.0 (see Figure 4).

Figure 4: Source: “How State Governments Can Protect and Win with Big Data, AI and Privacy”

There must be strong collaboration, like what I am seeing from organizations such as Team NEO (Northeast Ohio) and JobsOhio, which are driving collaboration between government, universities and industrials like Hitachi.

Industry 4.0 – Creating Smart Products and Spaces

“Tomorrow’s market winners will win with the smartest products. It’s not enough to just build insanely great products; winners must have the smartest products!” – Bill Schmarzo

Smart is one of those over-used, poorly-defined terms that float around IoT and Artificial Intelligence (AI) conversations.  So what do we actually mean by smart?  For purposes of my teaching, I define “smart” as:

“Smart” is the sum of the decisions (optimized) in support of an entity’s business or operational objectives.

Organizations need to make the necessary investment (in techniques like stakeholder personas, stakeholder mapping, customer journey maps, prioritization matrix) in order to identify, validate, value and prioritize the decisions or use cases that comprise a “smart” entity.  For example, Figure 5 outlines some of the decisions or use cases that a city would need to optimize with respect to a “smart city” initiative.

Figure 5:  Example of the Decisions or Use Cases that Comprise a “Smart” City Initiative

See the blog “Internet of Things: Getting from Connected to Smart” for more details on creating “Smart” entities.

Once we have identified, validated, valued and prioritized the decisions or use cases needed to optimize our “smart” products and/or spaces initiative, we now need to develop the data management and analytics strategy to “operationalize smart”; that is, we want to create “smart” products and/or spaces that can self-monitor, self-diagnose and self-heal (see Figure 6).

Figure 6:  Understanding the 3 Stages of “Smart”

The 3 stages of a continuously learning “Smart” entity that we discussed in the blog “3 Stages of Creating Smart” are (a toy sketch in code follows the list):

  • Self-monitoring: An environment that continuously monitors operations for any unusual behaviors or outcomes (anomaly detection, performance degradation).
  • Self-diagnosis: Leverages Diagnostic Analytics to identify the variables and metrics that might be impacting performance, and Predictive Analytics to predict what is likely to happen and when it is likely to happen.
  • Self-healing: Applies Prescriptive Analytics to create actionable insights, and Preventative Analytics to recommend user or operator corrective actions that prevent problems such as unplanned operational downtime.
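As promised, here is a toy sketch of how these three stages could be wired together. The thresholds, metric names and corrective actions are entirely hypothetical stand-ins for real anomaly-detection, diagnostic and prescriptive models.

```python
def self_monitor(reading: float, expected: float, tolerance: float = 0.1) -> bool:
    """Stage 1: flag unusual behaviour (a crude stand-in for anomaly detection)."""
    return abs(reading - expected) / expected > tolerance

def self_diagnose(readings: dict) -> str:
    """Stage 2: pick the metric deviating most from its expected value."""
    return max(readings, key=lambda k: abs(readings[k]["value"] - readings[k]["expected"]))

def self_heal(culprit: str) -> str:
    """Stage 3: recommend a corrective action (a stand-in for prescriptive analytics)."""
    actions = {"vibration": "schedule bearing inspection", "temperature": "reduce load"}
    return actions.get(culprit, "open maintenance ticket")

readings = {
    "vibration": {"value": 6.2, "expected": 4.0},
    "temperature": {"value": 72.0, "expected": 70.0},
}
if any(self_monitor(r["value"], r["expected"]) for r in readings.values()):
    culprit = self_diagnose(readings)
    print(culprit, "->", self_heal(culprit))
```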

Preparing for Industry 4.0

There is much hard work that organizations need to do to prepare for Industry 4.0 including:

1) Begin with an End in Mind. Understand your organization’s key business initiatives: what’s important to the organization from a business, financial and/or customer perspective, and use that to frame and accelerate the monetization of these 4IR technologies. While we may not understand the technology journey we’ll experience trying to reach that end, the end point should not be a mystery.

2) Understand the Key Industry 4.0 Technology Capabilities…But Within a Business Frame. It’s important for IT to gain familiarity with how the 4IR technologies work, what’s required to support the technologies, and what sorts of business and/or operational opportunities can potentially be addressed with these 4IR technologies.

3) Build out the Solution Architecture. Organizations should embrace a holistic architecture that supports these 4IR technologies in order to deliver “intelligent” industrial applications (applications that get smarter with every customer interaction) and “smart” entities (that leverage edge-to-core IoT analytics to create “continuously learning” entities).

4) Use Design Thinking to Drive AI Organizational Alignment and Adoption.  Embrace Design Thinking as a way to drive organizational alignment and adoption with respect to where and how these 4IR technologies can be best deployed to drive meaningful business and operational value.

5) Build out the Organization’s Data and Analytics Capabilities. Become expert at acquiring, integrating, cleansing, enriching, protecting and analyzing/mining the data that is the source of the customer, product and operational insights that power the organization’s top-priority business initiatives.

6) Operationalize the Analytic Insights.  Embed analytic insights and evidence-based recommendations into smart products and spaces that learn from each customer and/or operational interaction.

7) Monetize the IoT Edge.  Leverage IoT edge to enable near-real time operational and product performance optimization that further enhances business decision-making and extends more value to customers and operations.

I think even Forrest Gump would agree that it’s a good time to be in the data and analytics business.



Fable 2 Interview with Fable Creator Alfonso García-Caro

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Fable entered its fourth year with a new major version that greatly improves its performance, code-generation, and stability. It took more than six months to go from the initial design discussions to a first beta of Fable 2, and after a few more months of beta testing, Fable 2.1 was finally released.

Among the initial design goals for Fable 2, one of the most relevant was implementing F# records as plain JS objects and unions as arrays. This was finally scrapped because it proved impossible to preserve full F# semantics, but in the process Fable's type system became lighter, mostly because reflection code is now generated only when actually used. Other notable changes for the programmer concern the POJO (Plain Old JavaScript Object) attribute, which is no longer required in Fable 2 for JavaScript interoperability, and JSON serialization, which is no longer included with Fable.Core thanks to the availability of community-provided serializers such as Fable.SimpleJson.

Fable 2 also includes many internal changes to make it play better with JavaScript tooling such as bundlers and minifiers, in particular in how class methods and nested modules are compiled. As a matter of fact, the effort to keep a one-to-one correspondence between F# and JavaScript language constructs wherever possible created a number of JavaScript anti-patterns that at times defeated JavaScript tools.

InfoQ has spoken with Alfonso García-Caro, Fable creator and maintainer.

What improvements can developers expect from Fable 2?

There have been many improvements in the Fable 2 release, especially around stability, bundle size and the performance of compiled applications. Basically all of the reported bugs were fixed and the tool is now very reliable. We've also worked a lot on reducing the size of the production bundles generated with Fable without having to sacrifice the benefits of the F# language and its core library, and we managed to do that by playing well with the JS module system and tools like Webpack. In general, Fable 2 applications can be expected to be around 40% smaller after minification.

About performance, there's still room to improve, but depending on your application we've seen cases of JS code running 2x faster compared to Fable 1. One of our test benches is the Fable REPL, which runs fully on the client side by turning the F# compiler and Fable itself into JS. It is a great entry point to F# and Fable, and thanks to the performance improvements it has become a very pleasant development environment (Firefox doesn't seem to like it very much, though).

Fable 2 is an almost complete rewrite of Fable 1. What was the reason for starting from scratch?

Because a rewrite is always the answer! Tell that to your managers… Just kidding. Writing compilers was something new for me, and I learned a lot about that and about the F# and JS languages while developing Fable 1. So this new major release was a great opportunity to do a cleanup, discard the ideas that didn't work, focus on those that did, and try a few new ones. Another very important principle was to make the code more accessible to contributors other than myself. This is not ideal yet, but Fable 2 development has been a much more collective task and I'm very happy with the result.

Can you share a few lessons you learned from the rewrite?

When writing a compiler you're always thinking about performance, both of the generated code and of the compiler itself, and it's very tempting to put performance above everything else. But nobody needs something that is very fast if it doesn't produce the desired results. So Fable 2 puts the focus on correctness, and only when everything works fine do we apply the optimizations. And the truth is, this actually gives better performance in most situations.

Another lesson I learned is how to make the ideas behind a program more explicit through the use of types, without relying on implementation details. That’s the whole point of using a language like F#, modeling your domain, but it’s easy to forget. In Fable 2, the language constructs that had special treatment in Fable, like options, lists or uncurried functions, have been promoted to get their own entries in the Fable AST. That way, if we need to change how options are implemented, for example, we only need to touch the last step without breaking anything else.

Fable 2 has a new architecture as well, which makes room for an explicit optimization phase between the AST generation and JavaScript generation phases. What are the advantages that this brought?

Well, the architecture is still more or less the same: we take the AST from the F# compiler, which parses and checks the code, then we transform it to Fable’s own AST so we can manipulate it while keeping the type information, and transform it again to a JSON AST we send to Babel to generate the actual JS code. But it’s true that in Fable 1 we didn’t have an intermediate optimization phase, so all the optimizations were like patches applied to either the F#-to-Fable or Fable-to-Babel step.

Now we do have this optimization phase, which itself consists of several isolated steps. This, together with the improved AST, is a much better way to apply optimizations (like beta reduction or, in the case of Fable, function uncurrying), because whenever you detect a problem you can just disable the optimization and quickly confirm whether or not it was causing the issue. And although it looks like this would make the compiler much slower because you need to traverse the AST more often, that's actually not the case (mainly because we apply beta reduction first, which greatly reduces the size of the AST).

The Fable ecosystem has also grown significantly in the last year thanks to many contributions from the community. Could you highlight two or three of the most significant of them and comment on their importance for Fable developers?

Only two or three? There are many great projects, although we're still a small community. Actually, more than the compiler itself, my current focus is on helping some of these projects and updating the website to provide better documentation and a tool to make the libraries more visible. Most Fable users already know about Elmish, which has become the standard architecture for Fable apps. I also use the tools by Maxime Mangel, like the Fulma and Thoth libraries, every day. Another prolific contributor is Zaid-Ajaj who, among many useful packages, has published Fable.Remoting: a tool to automatically generate a REST API between your client and your server. For a different approach, Diego Esmerio created Elmish.Bridge to automatically propagate the Elmish state between the server and the client through websockets. Finally we have Fable.Reaction to combine the power of ReactiveX with Elmish. And we shouldn't forget about SAFE to write web apps with full-stack F#.

Please visit the site of each project for more information or check the videos from the latest FableConf for more in-depth talks.

Finally, you just moved back to NPM for Fable distribution, after living temporarily in NuGet. What motivated this change and what benefits does it bring?

Fable has always had two hearts: .NET to communicate with the F# compiler and JS to communicate with Babel (although we already have an experimental version fully working in JS). This means two processes, but the key is who calls whom. Originally we had a Node process calling the dotnet one, but in the move to .NET Core we gave the new dotnet CLI tools a try. The evolution of these tools, however, has been a bit confusing and there are still some limitations. Additionally, since the most popular way to use Fable is together with Webpack through the fable-loader, it made sense to go back to NPM so Fable could easily be invoked from JS. It's still necessary to install the dotnet SDK on your computer, though. If you want to read the whole story, check this tweet out.

You can get Fable 2 from NPM. It requires dotnet SDK 2.0 or higher and yarn. If you want to port existing Fable 1 code to Fable 2, do not miss this transition guide.



SIP text log analysis using Pandas

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

Analysis of SIP application server (AS) text logs may help in detecting and, in some specific situations, predicting different types of issues within a VoIP network. SIP server text logs contain information that is difficult or even impossible to obtain from other sources, such as CDRs or signaling traffic captures.

The following parameters, among others, can help in estimating VoIP signaling network status:

  • SIP dialog length. A SIP dialog of hundreds or even thousands of messages points to a possible IP network problem, VoIP equipment malfunction, SIP signaling fraud, or abnormal subscriber behavior
  • Number and type of SIP message retransmissions. A relatively high number of retransmissions may be caused by AS hardware (HW) and/or software (SW) issues, IP network issues, or peer SIP entity issues
  • Request-response times for different SIP transactions in different SIP dialogs. Request-response time trends can help predict overloads, IP network issues, etc.

Depending on signaling load, a SIP AS can generate up to several tens of gigabytes of text logs per day, which is why analysis of SIP AS text logs is a time- and resource-consuming task. Pandas data frames (DF) can help in such analysis. Pandas provides powerful tools for working with large DFs; processing speed for a large DF can be maximized, in particular, by vectorizing all operations applied to the DF. All of the code for this post is available on GitHub.

SIP text log file processing steps (a condensed sketch in Python follows the list):

  1. Open a SIP text log file for reading. I recommend opening SIP text log files in the same order as they were created by the AS SW
  2. Read one line at a time from the opened SIP log file
  3. Extract SIP messages. Usually, these messages are located between specific delimiters
  4. Create a list of dictionaries. Each dictionary maps a SIP message timestamp (key) to the SIP message (value, as a list of lines)
  5. Save the list to a pickle file on an HDD or network storage; this file will be used for creating different DFs
  6. Create a DF containing specific information extracted from the SIP messages stored in the pickle file
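Here is the condensed sketch of these steps. The delimiters, timestamp handling and header extraction are simplified placeholders; the author's actual code is the version linked on GitHub.

```python
import pickle
import pandas as pd

BEGIN, END = "<<<SIP", "SIP>>>"          # placeholder message delimiters

def extract_messages(log_path):
    """Steps 2-4: collect one {timestamp: [sip lines]} dict per message."""
    records, current, ts = [], None, None
    with open(log_path) as log:
        for line in log:
            line = line.rstrip("\n")
            if line.startswith(BEGIN):               # e.g. "<<<SIP 2019-01-28 12:00:00.123"
                ts, current = line[len(BEGIN):].strip(), []
            elif line.startswith(END) and current is not None:
                records.append({ts: current})
                current = None
            elif current is not None:
                current.append(line)
    return records

def records_to_df(records):
    """Step 6: pull a few headers out of each stored message into a flat DataFrame."""
    rows = []
    for rec in records:
        for ts, lines in rec.items():
            headers = {l.split(":", 1)[0]: l.split(":", 1)[1].strip() for l in lines if ":" in l}
            rows.append({"Timestamp": ts,
                         "Call-ID": headers.get("Call-ID"),
                         "CSeq": headers.get("CSeq")})
    return pd.DataFrame(rows)

records = extract_messages("sip_as.log")          # steps 1-4
with open("sip_messages.pkl", "wb") as f:         # step 5
    pickle.dump(records, f)
df = records_to_df(records)                       # step 6
```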

In this concrete case, the SIP DF contains the following columns:

  • ‘Timestamp’ – added by SIP AS
  • ‘Call-ID’ – extracted from SIP ‘Call-ID’ header
  • ‘CSeq_num’, ‘CSeq_meth’ – derived from CSeq header of a SIP message
  • ‘Direction’ – transmitted (Tx->) or received (Rx<-) SIP message, added by SIP AS
  • ‘SIP method’ – SIP method name

Fig. 1. SIP DF example

Having such a SIP DF, we can extract a good deal of helpful information.

1. SIP dialog length

Fig. 2. The number of long SIP dialogs is very low; each dialog longer than 100 messages may be analyzed to clarify the particular call scenario.
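With the columns shown above, dialog length reduces to a groupby on ‘Call-ID’; a distribution like the one in Fig. 2 could be obtained along these lines:

```python
# Number of SIP messages per dialog, then the dialogs worth manual inspection.
dialog_len = df.groupby("Call-ID").size().sort_values(ascending=False)
long_dialogs = dialog_len[dialog_len > 100]
print(long_dialogs.head())
```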

2. Request-response times (in ms) for transmitted INFO or INVITE requests

Fig. 3. Resp_Req_Time plots show approximately the same distribution of request-response times for INFO and INVITE transactions for the same groups of SIP peers. Request-response times > 500 ms point to retransmits; 500 ms is the default value of the SIP T1 timer.

3. The number of retransmissions of INVITE or INFO requests.

A retransmission of a SIP request can be detected as a sequence of transmitted SIP requests with the same Call-ID, SIP method and CSeq sequence number.
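Following that definition, retransmissions can be counted by grouping transmitted requests on those keys. A sketch using the column names from the DF above; the exact filters may differ from the author's implementation.

```python
# Transmitted INVITE/INFO requests, grouped by the keys that identify one request.
tx_requests = df[(df["Direction"] == "Tx->") & (df["SIP method"].isin(["INVITE", "INFO"]))]
counts = tx_requests.groupby(["Call-ID", "CSeq_meth", "CSeq_num"]).size()
retransmits = counts[counts > 1] - 1     # extra copies beyond the first transmission
print("total retransmitted requests:", int(retransmits.sum()))
```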

4. Request-response times (in ms) for received INFO-requests.

We cannot use a Pandas groupby operation in this case for the following reasons:

  • Different INFO-200 OK transactions in a SIP dialog may share the same Call-ID and CSeq_num values
  • INFO-requests and 200 OK-responses belonging to different dialogs may arrive at arbitrary moments in time and, consequently, will be stored in the SIP DF in arbitrary order
  • Retransmissions of INFO-requests are possible, i.e. the ‘SIP method’ column may contain sequences of retransmitted INFO-requests and 200 OK-responses

One possible solution is splitting the DF into two separate data frames, df_req and df_resp. The ‘Timestamp’, ‘Call-ID’, ‘CSeq_num’ and ‘SIP method’ columns are the same for both DFs; ‘TS_req’ and ‘TS_resp’ are unique to df_req and df_resp respectively. The ‘Call-ID’ and ‘CSeq_num’ columns are necessary for further analysis of particular INFO-200 OK transactions.
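One way to implement this split-and-match, assuming ‘Timestamp’ has already been parsed to a datetime column, is an ordered as-of merge that pairs each received INFO with the next transmitted response sharing the same Call-ID and CSeq_num. The response filter below is an assumption about how responses appear in the ‘SIP method’ column, not necessarily the author's exact approach.

```python
import pandas as pd

# Received INFO requests and the corresponding transmitted responses
# (the exact filter for responses depends on how 'SIP method' encodes them).
df_req = df[(df["Direction"] == "Rx<-") & (df["SIP method"] == "INFO")].copy()
df_resp = df[(df["Direction"] == "Tx->") & (df["CSeq_meth"] == "INFO")
             & (df["SIP method"] != "INFO")].copy()

df_req = df_req.rename(columns={"Timestamp": "TS_req"}).sort_values("TS_req")
df_resp = df_resp.rename(columns={"Timestamp": "TS_resp"}).sort_values("TS_resp")

# Pair each request with the next response in time that shares Call-ID and CSeq_num.
matched = pd.merge_asof(df_req, df_resp[["TS_resp", "Call-ID", "CSeq_num"]],
                        left_on="TS_req", right_on="TS_resp",
                        by=["Call-ID", "CSeq_num"], direction="forward")

matched["resp_time_ms"] = (matched["TS_resp"] - matched["TS_req"]).dt.total_seconds() * 1000
```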

Fig. 4. Request-response time count plot for received INFO messages

Conclusion

Pandas DFs may be used as an additional tool for obtaining helpful information from SIP logs. Some of the methods described in this post may also be used to analyze the text log files of other protocols based on the request-response model.

Original post
