
How Jetstack Set Up a Global Load Balancer for Multiple Kubernetes Clusters

MMS Founder
MMS Hrishikesh Barua

Article originally posted on InfoQ. Visit InfoQ

Jetstack’s engineering team talked about setting up a global load balancer across multiple Google Kubernetes Engine (GKE) clusters, utilizing Google’s container-native load balancing (a GKE-specific optimization) together with Google Cloud Armor for DDoS protection.

One of Jetstack’s customers had an existing setup consisting of multiple Kubernetes clusters with DNS-based routing to different load balancer IPs. They wanted to incorporate Cloud Armor DDoS protection and use container-native load balancing to “improve traffic visibility and network performance”. The team went through multiple migration phases to introduce these features, along with a custom way to tie a single GLB to more than one Kubernetes cluster backend.

Kubernetes has three different ways of “load balancing” in its spec at the Service level, not including Ingress – ClusterIP, NodePort and LoadBalancer. Jetstack’s customer utilized the “LoadBalancer” service type, which translates to an implementation-specific LB based on the underlying cloud platform. In GKE, it is implemented by a network LB (NLB). However, to accept traffic from the internet, a Kubernetes cluster typically has an Ingress, which is implemented by a global LB (GLB) in GKE. The customer’s previous setup used geolocation-based IP routing in AWS Route 53, which can return different IP addresses depending on where the DNS queries originate.
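As a sketch, the Service type in question is selected with a single field in the manifest; the names and ports below are illustrative, not the customer’s actual configuration:

```yaml
# Illustrative Service of type LoadBalancer; on GKE this provisions a
# network load balancer (NLB). Name, selector, and ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```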

Google’s NLBs do not support the Cloud Armor DDoS protection service, although the Cloud Armor configuration supports Layer 3-7 network rules. Switching to an L7 LB – i.e. a Global Load Balancer (GLB) – was therefore necessary; creating an Ingress resource in GKE automatically creates one. An L7 GLB brings flexibility in URL-based routing and TLS termination at the load balancer itself, but it restricts traffic-serving ports to 80, 8080 and 443. The latter required some changes in the app, which previously used multiple other ports. There were still multiple L7 load balancers at the end of this phase, with DNS pointing to their IP addresses.
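An Ingress that triggers creation of such a GLB can be as small as the following sketch (API version as current in early 2020; all names are placeholders):

```yaml
# Illustrative GKE Ingress; creating this resource provisions a global
# L7 load balancer with an external IP. Names are placeholders.
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: web
spec:
  backend:
    serviceName: web
    servicePort: 80
```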

GKE has a feature called “container-native load balancing” which allows pods to receive traffic directly from a load balancer. This is not part of the Kubernetes spec but an optimization in GKE, and thus cannot be used in other vendors’ managed Kubernetes offerings. Without it, traffic from an LB to a pod takes a circuitous route inside GKE’s network, and the extra network hops involved can increase latency. Container-native load balancing requires creating Network Endpoint Groups (NEGs) – a Google-specific feature – which contain the IP addresses of the backend pods that serve traffic. Enabling this was the second phase of the migration.
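In a manifest, container-native load balancing is requested with an annotation on the Service. The sketch below uses the standalone-NEG form of the `cloud.google.com/neg` annotation, with placeholder names:

```yaml
# Illustrative Service asking GKE to manage standalone NEGs for port 80.
# GKE then keeps the NEG membership in sync with the backing pods.
apiVersion: v1
kind: Service
metadata:
  name: web-backend
  annotations:
    cloud.google.com/neg: '{"exposed_ports": {"80": {}}}'
spec:
  type: ClusterIP
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```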

In the third phase, the primary change was to use a single GLB IP address instead of using DNS to return region-specific IP addresses of different load balancers. Kubernetes does not have a mechanism to include multiple clusters behind a single Ingress. There is a beta tool from Google which attempts to do this but it is in an early stage. It is important to note that having a single GLB (or another Ingress LB) for multiple Kubernetes clusters is different from having multiple Kubernetes clusters working together. The former is about using a single endpoint through which global traffic gets routed to independent Kubernetes clusters. The latter is about using a single control plane for multiple Kubernetes clusters. Doing the former in other clouds is also not simple. Using Terraform’s Google provider for automation, Jetstack’s team created NEG resources and the GLB separately, and tied them together with annotations. There is another tool that purports to make this easier. Other companies have solved this in other ways – e.g. using an Envoy control plane, and by using Cluster Registry.
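A rough sketch of that Terraform wiring might look like the following; the resource names, NEG data sources, and rate limits are placeholders, not Jetstack’s actual configuration:

```hcl
# Sketch: one backend service fans out to standalone NEGs from two
# clusters, sitting behind a single global load balancer.
resource "google_compute_backend_service" "web" {
  name          = "web-backend"
  protocol      = "HTTP"
  health_checks = [google_compute_health_check.web.self_link]

  backend {
    group                 = data.google_compute_network_endpoint_group.cluster_a.self_link
    balancing_mode        = "RATE"
    max_rate_per_endpoint = 100
  }

  backend {
    group                 = data.google_compute_network_endpoint_group.cluster_b.self_link
    balancing_mode        = "RATE"
    max_rate_per_endpoint = 100
  }
}
```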



Evolving Architecture with DDD and Hypermedia: Einar Høst at DDD Europe

MMS Founder
MMS Jan Stenberg

Article originally posted on InfoQ. Visit InfoQ

Hypermedia is an enabler for a better architecture, Einar Høst claimed in his presentation at the recent DDD Europe 2020 conference in Amsterdam. In his talk he described the architecture challenges at NRK TV, the TV streaming service at the Norwegian public broadcaster, and how they migrated their monolithic architecture into a more modular design and implemented hypermedia in their Player API.

Høst, working for NRK TV, started by noting that TV streaming is a competitive and rapidly evolving domain. Since linear broadcast is declining, online TV streaming services become more important, and as a public broadcaster they must take on more of the responsibility, which means that stability becomes more important.

The Player API serves metadata and manifests to all the different clients people use to watch TV. Høst groups the clients into two categories: progressive clients, like browsers, mobile phones and others that update frequently to include new features, and legacy clients, for instance clients running on smart TVs, which are upgraded less often. The two categories have radically different deployment cycles, which has an impact on the architecture.

Back in 2016 they had a monolithic API serving the clients. It was running in two data centres on Azure, backed by a relational database and various background jobs. The complexity was overwhelming, which made it hard to reason about the system. Changes had unforeseen effects and caused intricate failure modes. All these problems resulted in a fear of change, and stagnation.

Høst claims that the main reason they ended up with such a complex monolith was the Entity – for NRK TV, the TV show. There are many valid aspects that relate to a TV show depending on which context you are in, and the entity inherits the complexity from all the involved domains and becomes a monolith, or a big ball of mud.

To be able to discuss a TV show from different perspectives, they started a functional decomposition of the monolith with a Domain-Driven Design (DDD) mindset and using bounded contexts. One example of a context is the catalogue where the focus is on describing the media content. Another one is the playback context with a focus on playing the actual content. Other contexts include recommendations, personalization and search. The main reason for this decomposition was to contain complexity, but also to be able to focus on different critical features in different contexts. With one large context everything becomes equally critical; it’s for example not possible to have redundancy in one part, but not in the rest of the application.

After this decomposition they started to use bounded contexts as service boundaries, and when they wrote new things, they also created new services. An API gateway was introduced as an architectural seam enabling them to route requests to different endpoints. With this in place they started to use the strangler pattern, gradually moving functionality from the monolith to services.

From a client perspective this decomposition into services is not interesting. End users especially want a coherent story which allows them to navigate across boundaries. They therefore started to recompose from a client’s perspective and used hypermedia to enable a user to seamlessly move between services.

Høst defines hypermedia as media with hyperlinks and emphasizes that he is not talking about REST. They are just using links, where each link includes a relation to describe what the link refers to. These links show the possible ways a client can navigate from one resource to another, with the relation describing the relationship between the two resources. There should be a link for each reasonable next step, thus offering a client the things it can do next. Together these links connect resources to form coherent narratives that a client can follow to achieve the desired goal. Links also enable support for multiple paths through the API and more than one way to reach the goal.

For hypermedia format, they are using the Hypertext Application Language (HAL) and Høst notes that the reason is that it’s lightweight, and allows them to add links gradually, which is important for them since they are in the process of shrinking the monolith.
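For illustration, a HAL-style response for an episode resource might look like the following; the fields and link relations here are invented, not NRK TV’s actual API:

```json
{
  "title": "Example episode",
  "_links": {
    "self": { "href": "/tv/episodes/example-1" },
    "next-episode": { "href": "/tv/episodes/example-2" },
    "playback": { "href": "/playback/example-1" }
  }
}
```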

For Høst and the teams, versioning of their APIs has not been an issue since they normally don’t replace individual endpoints. Instead they add new and improved narratives using links, with link relations describing the new narratives. Users can then gradually move over from the old to the new narratives. Since the clients have very different deployment rates, it’s important that they can switch over at their own pace. By tracking which narratives are used, it’s possible to remove links when they are no longer in use.

Looking at the present situation, Høst notes that the monolith is shrinking; they can now use independent deploys and standard HTTP caching techniques. All new resources use links, and many of the old ones have been retrofitted with links. They also have a completely new playback solution, and he specifically notes a new personalization solution that uses the Orleans virtual actor framework.

The slides from Høst’s presentation are available for download. Most presentations at the conference were recorded and will be published during the coming months.



Working-Draft CSS Motion Path Now Supported in Most Browsers

MMS Founder
MMS Bruno Couriol

Article originally posted on InfoQ. Visit InfoQ

With the release early this year of Firefox 72, the CSS Motion Path specification is now implemented in most browsers. With CSS Motion Path, developers can implement a larger range of complex animations without resorting to JavaScript, or importing full-featured animation libraries like GSAP (GreenSock Animation Platform).

The CSS Motion Path specification describes a set of properties that allow authors to position any graphical object and animate it along an author-specified path. The object can be positioned, transitioned and animated along this path over a period of time. The time may be static if no animation is specified.

example of motion path
[Source: CSS Motion Path Working Draft]

The CSS Motion Path specification defines the following properties:

The Motion Path module allows specifying the position of a box as the distance (offset-distance) of the box’s anchor point (offset-anchor) along a geometrical path (offset-path) defined against the coordinates (offset-position) of its containing block. The box’s orientation can optionally be specified as a rotation (offset-rotate) with respect to the direction of the path at that point.

The offset-path property represents the geometrical path objects will follow. Of the few values proposed in the specification for the property, only path() (which takes a path string with SVG coordinate syntax) seems to be supported uniformly:

.rocket {
        offset-path: path('M1345.7 2.6l2.4-4.4');
}

An animation can then be defined to move the element along the path by animating the offset-distance property between different values. A value of 0% means the object is positioned at the start of the path; a value of 100% means the object is positioned at the end of the path:

@keyframes move {
 0% { offset-distance: 0%; } 
 100% { offset-distance: 100%; } 
}

With the keyframes defined, the animation can be configured with the animation property:

animation: move 3000ms infinite alternate ease-in-out;

The offset-rotate property allows developers to specify how much to rotate the object’s box being animated. By default, the center of an object’s box is positioned on the path. The offset-anchor property allows developers to position another point of the box on the path. The property offset-path can also be animated.
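The orientation properties can be combined roughly as follows; the selector and path coordinates are made up for illustration:

```css
/* Orient the moving box along the path and change which point of the
   box rides the path; all values here are illustrative. */
.rocket {
  offset-path: path('M0 0 C 50 -100, 150 -100, 200 0');
  offset-rotate: auto 90deg;  /* follow the path direction, plus 90° */
  offset-anchor: 50% 100%;    /* bottom-center of the box sits on the path */
  animation: move 3000ms infinite alternate ease-in-out;
}
```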

The following two animations by Facundo Corradini illustrate the animations which can be achieved with CSS Motion Path without any JavaScript.

CSS Motion Path demo animation on hover

CSS Motion Path demo animation of the offset-path property

Before CSS Motion Path, moving elements through a path required a conscious and precise use of simultaneous translation and rotation, skills in handling browser discrepancies and often resulted in a very complex set of keyframes. JavaScript developers could also resort to JavaScript animation libraries like GSAP. While CSS Motion Path does not entirely eliminate the need for animation libraries (it is supported for three-quarters of user browsers and not supported by Safari), it does raise the threshold for usage of such libraries.

CSS Motion Path still has Working Draft status. Firefox support landed in January this year. Chrome support has been effective since version 46 (behind the #enable-experimental-web-platform-features flag). With the Chromium-based release, also in January this year, the Edge browser also supports CSS Motion Path.



MIT CSAIL TextFooler Framework Tricks Leading NLP Systems

MMS Founder
MMS Patrick Kelly

Article originally posted on InfoQ. Visit InfoQ

A team of researchers at the MIT Computer Science & Artificial Intelligence Lab (CSAIL) recently released a framework called TextFooler which successfully tricked state-of-the-art NLP models (such as BERT) into making incorrect predictions. Before modification, these models exhibit accuracy above 90% on tasks such as text classification and text entailment. After TextFooler’s modifications, which changed less than 10% of the input data, accuracy fell below 20%.

Extensive research has gone into understanding how adversarial attacks are handled by ML models that interpret speech and images. Less attention has been given to text, despite the fact that many applications related to internet safety rely on the robustness of language models. Engineers and researchers can incorporate TextFooler into their workflows in order to test the boundaries of hate-speech flagging, fake-news detection, spam filters, and other important NLP-powered applications.

TextFooler identifies the most important words in the input data and replaces those words with grammatically correct synonyms until the model changes its prediction. CSAIL evaluated TextFooler on three state-of-the-art deep-learning models over five popular text classification tasks and two textual entailment tasks. The team proposed a four-way automatic and three-way human evaluation of language adversarial attacks to evaluate the effectiveness, efficiency, and utility-preserving properties of the system.

CSAIL describes the core algorithm in depth in the research paper they released in January. First, the algorithm discovers the important words in the input data with a selection mechanism. The selection mechanism works by giving each word an importance score by calculating the prediction change before and after deleting that word. All words with a high importance score are then processed through a word replacement mechanism. This replacement mechanism uses word embeddings to identify the top synonyms of the same part of speech, then it generates new text with these replacements, only keeping texts which are above a certain semantic similarity threshold. Finally, if any generated text exists which can alter the prediction of the target model, then the word with the highest semantic similarity score is selected to be used for the attack.
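The word-importance step of the algorithm can be sketched in a few lines of Ruby. The `predict` function below is a toy lexicon-based stand-in for the real target model (TextFooler would query BERT or a similar classifier instead), and all names here are invented for illustration:

```ruby
# Toy stand-in for the target classifier: the fraction of tokens that
# appear in a small "negative" lexicon. TextFooler queries a real model.
NEGATIVE_WORDS = %w[contrived estranged boring].freeze

def predict(tokens)
  return 0.0 if tokens.empty?
  tokens.count { |t| NEGATIVE_WORDS.include?(t) }.to_f / tokens.size
end

# Score each word by how much deleting it moves the prediction,
# then rank words from most to least important.
def importance_ranking(tokens)
  base = predict(tokens)
  scored = tokens.each_with_index.map do |word, i|
    reduced = tokens[0...i] + tokens[i + 1..-1]
    [word, (base - predict(reduced)).abs]
  end
  scored.sort_by { |_, score| -score }
end

tokens = "the characters are totally contrived and estranged".split
puts importance_ranking(tokens).first.first  # a lexicon word ranks highest
```

The real system then feeds the top-ranked words to the synonym-replacement mechanism described above.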

The below example shows how TextFooler modified the input data to change model interpretation of a movie review from negative to positive.

Original:  The characters, cast in impossibly contrived situations, are totally estranged from reality.

Attack: The characters, cast in impossibly engineered circumstances, are fully estranged from reality.

“The system can be used or extended to attack any classification-based NLP models to test their robustness,” said lead researcher Di Jin. “On the other hand, the generated adversaries can be used to improve the robustness and generalization of deep-learning models via adversarial training, which is a critical direction of this work.”

At UC Irvine, Assistant Professor of Computer Science Sameer Singh focuses heavily on adversaries for NLP. Singh acknowledges that the research has successfully attacked best-in-class NLP models, but also notes that any attack with an architecture like TextFooler’s, which has to repeatedly probe the target model, might be detected by security programs.

The code, pre-trained target models, and test samples are now available on Di Jin’s GitHub as part of the TextFooler project.



Presentation: Sorbet: Why and How We Built a Typechecker for Ruby

MMS Founder
MMS Dmitry Petrashko

Article originally posted on InfoQ. Visit InfoQ

Transcript

Petrashko: This is a talk about Sorbet. Sorbet is a type system and a typechecker that was built at Stripe for Ruby. In this talk we’ll discuss why we did that, and how we did that.

First of all, who am I? I’m Dmitry [Petrashko]. I earned a PhD in compiler construction working with Martin Odersky. My PhD thesis was Dotty, which is going to be called Scala 3, and now I’m working on developer tooling at Stripe, which includes everything from processes, core standard libraries, coding conventions, CI, everything. My work is to make sure that engineers at Stripe have the most productive years of their career there.

Here’s an outline of this talk. During the talk please at any moment feel free to stop me and ask questions. I’m happy to answer all of them. We’ll start with the context in which this is possible, which is Stripe.

About Stripe

Stripe is a platform that external developers use to accept payments. If you want to accept payments on the internet, there are a lot of things that need to be handled to do it correctly. You want to make sure that you’re compliant. You want to make sure that you’re correctly doing things with credit cards [inaudible 00:01:10] them. That’s hard, and companies use us to solve this problem for them.

We run in 32 countries and millions of businesses worldwide use us. There are billions of dollars processed every year through Stripe, and more than 80% of American adults bought something via Stripe in 2017. We have hundreds of people in 10 offices around the world. Customers report more developer productivity after deploying Stripe, and as always, we’re hiring.

Now, Ruby at Stripe – Ruby is the primary language used at Stripe. It’s an enforced subset of Ruby: you cannot use everything from Ruby, only the things that we consider to be sane in a company with a lot of engineers. We want our codebase to be maintainable and uniform. We’re not using Rails. We’re using our own framework, our own [inaudible 00:01:57], our own [inaudible 00:01:58] layer, our own things that we believe work best for us.

Most of our product is a monorepo, and that’s intentional. We believe that you get benefits by having a single versioning scheme, while having a clear notion of dependencies, and while being able to do all the changes in the same repo and the same PR. The majority of the code lives in 10 macroservices, and the majority of new code goes into those. Now, the scale of engineering at Stripe – again, hundreds of engineers, thousands of commits per day, millions of lines of code.

Problems Being Solved

In this environment, what was the problem that the type system was supposed to solve? Here’s an email from Pre-Sorbet times at Stripe. This was a discussion about some specific user feature. As the discussion happened, the discussion was that the most common way it breaks is by seeing something called NoMethodError in production. NoMethodError is what happens in Ruby if you’re trying to invoke a method on a class which doesn’t have this method. For example, this will happen if you have the wrong class, not the one that you expected. You expected to have a string, you have an integer. Or, you mistype a method. This is the right class but you’re just calling a method that doesn’t exist in it. This was the most common kind of error seen in production.
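For illustration, NoMethodError is easy to reproduce in plain Ruby – here the wrong-class case, calling a String method on an Integer:

```ruby
# NoMethodError: the receiver's class exists but lacks the method –
# e.g. we assumed a String but actually have an Integer.
begin
  42.upcase
rescue NoMethodError => e
  puts e.class  # NoMethodError
end
```

Without a typechecker, this only surfaces when the offending line actually runs.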

The second one is NameErrors. This is slightly different but it’s in the same vein. NameErrors is what happens when you refer to classes, not methods, while having a typo somewhere. At the time those were the most common problems seen in production, and in order to address them we went towards building a type system and deploying it at Stripe. Here are the design principles that we had behind it in order to make it work well specifically at Stripe.

The first one was explicitness. We want to write type annotations. In fact, we see them as beneficial, the reason being that they make code readable and predictable. As somebody who’s not writing this code but rather reading this code, it makes it easier to understand what to expect of this method. What we mean by non-explicit here – the alternative that we could have chosen is to not write type annotations but have a type system that’s smart enough to figure them out on its own, and do it across methods. For example, [inaudible 00:04:18] does it, in particular for a [inaudible 00:04:20]. We’ve intentionally decided not to do this, because in a company that’s big, where the majority of people read somebody else’s code, it’s beneficial to understand not what the code does now but what it was expected to do – to have explicit intent and make sure that the implementation fulfills this intent, rather than trying to back-solve the intent from the implementation.

The second one is effectively a counterbalance to the previous one. While we want our users to write signatures for methods and describe boundaries this way, we don’t want this to feel burdensome for them. In particular, when you’re writing code inside a method we can actually figure out the majority of types for you. For example, here in the very first line, A is an integer. We don’t want you to write that it’s an integer; we can figure it out on our own. Similarly, on the second line it’s a string, but here, if you want, we allow you to declare it as a string. We don’t require you to do this, but if you want to be explicit about it you have the option.

The next one is more of an internal rather than user-visible design constraint. We want the type system to be simple, but not simpler. What we mean by this is we want it to be as simple as possible while fulfilling the needs of Stripe. Here’s a list of features in the order they were added to the type system. Every next feature was added not because we thought the feature was fancy or because we always wanted to implement it. It’s because we saw real code at Stripe that’s super common – not one method, but hundreds of methods, written by hundreds of engineers – that needed this thing to be modeled. Based on this we started iterating with a minimal set of features, adding features one by one, building the exact set that was needed to support the Stripe codebase that pre-existed.

Then the next requirement was: we didn’t build the type system just for the Stripe of that time. We built a type system that we wanted to live for a long time and continue delivering impact, which means we want to make decisions that scale with the size of engineering – both the size of engineering of our users and the size of the internal team who builds the typechecker. We also wanted to make sure we can address the many needs of different users at Stripe. Different people need different amounts of rigor. Somebody is working on experimental features and they still have no clue what it should do, what design it should have, what infrastructure it has, what structure it has. They want to go YOLO.

Somebody else wants to make sure that their code is super rigid, doesn’t have a single problem, and can be proven to be correct. In the real world you can’t make both of them perfectly happy, but you can make both of them at least not grumpy with you: you can prove the majority of the useful properties for those who want rigor without constraining somebody who wants to go YOLO, and you can also allow some people some degree of going YOLO.

The next one is: we want to scale with codebase size. From our experience, the codebases of big companies – Stripe, Google, Facebook – grow non-linearly with time. There’s an argument whether that’s a cubic function or an exponential function; in practice it’s a function that grows super fast. We need to make sure that the tools we’re building can be fast and still provide a pleasant experience for users, and also allow our users to isolate the complexity of the codebase, so they no longer need to know the entire codebase and understand action at a distance. Rather, this tool can be used to create the packages, interfaces, and abstractions that make it easier for users to reason about their code.

We also wanted to make sure that our project scales with time, and by this we mean if there are some decisions that can be much easier addressed early in the project, they would better address earlier rather than being postponed.

To give you some numbers: the thing most users want to assess here is performance. Our current performance is around 100k lines per second per CPU core, and we scale linearly up to 64 cores [inaudible 00:08:30]. For comparison, to give some baseline: if you compare us with javac, we’re around 10x faster than javac per core, and we’re around 100 times faster than the existing tooling for Ruby, which is rubocop – and that only does syntax checking; it doesn’t do any kind of semantic modeling.

From our experience so far, today our tool is the fastest way at Stripe, at Shopify, at Coinbase, and all those companies to get iterative feedback. We’ll go into this deeper, but the short summary is: it’s integrated in the IDE today, and the current latency for a response is milliseconds.

The next one is: compatible with Ruby. We did not intend to deviate from Ruby, in either semantics or syntax. We want to continue using the tools that Ruby has. We want to continue using the standard IDEs. We want to continue getting the value, and we want to provide the value for other companies. The whole point here is to improve the existing Ruby codebase, rather than creating a new codebase that might have followed better rules. We want to be able to adopt it in an existing codebase, adopt it incrementally – and thus the code stays Ruby.

Going deeper into this one: we want to be able to adopt it gradually. What this means is that different teams, different people, will be adopting at a different pace. Thus, we introduced something called strictness levels. The basic level does basic checks, such as syntax-level errors; for all the code in our codebase, even if you didn’t give it a strictness level, we want it to at least parse. The next one is typed: true, which enables basic type checking. It makes sure that methods for which you have a signature are correctly using their arguments, and that when you’re calling other methods you’re correctly using the results.

Typed: strict enforces that every method that you write in this code has a type signature. It effectively says: not only do I want my bodies checked, I also want to make a promise to people who consume me that I’m going to fully annotate my entire interface. This means that, as somebody using this code, I get all the awesome features: I have autocompletion, I have Jump to Definition, I have type at hover. This is the level that you want to use if you’re providing a typed API for your users.

Finally, there is level strong. This level disallows you to ever do something untyped. Every intermediate expression – not only your API, everything that you do inside your implementations – has to be typed and has to be statically verified to run correctly. Having all those in place, we want to adopt Sorbet.
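The levels are selected per file with a magic comment, a sigil, at the top of the file. Plain Ruby treats the sigil as an ordinary comment, so the sketch below runs even without Sorbet installed (class and method names are invented); with Sorbet, `srb tc` would use the sigil to decide how strictly to check the file:

```ruby
# typed: true
# With Sorbet installed, the sigil above opts this file into basic type
# checking; "# typed: strict" would additionally require a signature on
# every method, and "# typed: strong" would forbid anything untyped.
class Greeting
  def greet(name)
    "Hello, #{name}!"
  end
end

puts Greeting.new.greet("world")  # prints "Hello, world!"
```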

Adopting Sorbet at Stripe

As a team who’s building this for internal usage our goal is not to just build the tool. It’s to actually make sure that the tool is useful, which means make Stripe use it.

Here’s a timeline of the adoption. It took us 8 months to build the thing to a level that we believed was good enough to start adopting it widely. Those 8 months included pairing with two specific teams to make sure that their codebases could be typed. The first team was the one who we believed had the simplest codebase, and the second was the one that we believed had the hardest codebase. We first wanted to make sure we could handle the simple cases, because we were just starting, and then wanted to make sure we could handle the most complex ones. Then we spent 7 months rolling this out, and from there we’ve been working on editor and open-source tooling.

What does this provide us? I’ll illustrate this with a bunch of examples of Ruby code, and I’ll describe what would happen before at Stripe and what happens now. If you have a look at this Ruby code and look carefully, there’s a typo: the Hello inside the main method is mistyped, one letter “L” is missing. If you were to run this you would see this error at runtime. In practice, if you didn’t discover it yourself, it would take a trip through CI, which means 5 to 15 minutes, to get feedback that you had a typo there – or maybe worse: you’d hear about it later when you deploy the change, the deploy starts failing, and it gets rolled back. Or even worse: it won’t fail instantaneously but will fail subtly in its own way, and you’ll get paged at 4 a.m. in production.

If you were to do this at Stripe now, you wouldn’t be able to even [inaudible 00:13:02] this. You will get this error and it will tell you that there’s no such thing as Helo. This is true for 100% of code at Stripe. One hundred percent of code at Stripe currently cannot have a typo in a class name.

The next one is method names. Similarly, in this one somebody wanted to have a method called greeting, but the call site forgot that the name is [inaudible 00:13:26] and they just called greet for brevity. Similarly, in order to [inaudible 00:13:30] this error in baseline Ruby you run this code, and you’ll see this error. Unless you found this error yourself, you’ll find it either in tests, or in deploy, or worse, in production. Now, with our thing, you can find it statically. You add the annotation typed: true, which means we’ll start type checking your code, and then we’ll be able to tell you that the method greet doesn’t exist. I’m actually not showing the entire error message. The error messages actually include suggestions – “What did you mean?” They didn’t fit in the slides, but we’re also trying to be helpful in suggesting what was the thing that you wanted to do, and you can actually pass us a flag to make us auto-fix it.

This is true for 85% of code at Stripe today. You may ask why, and for the remaining 15% of the code it takes effort to do this. The majority of the code that we migrated so far, we migrated through automatic migrations, where we can go and fix [inaudible 00:14:34] bugs through automatic refactoring tools. We’re a small team. When we started we were three people, and we did the majority of the migration ourselves. We built tools that automatically do the restructuring and handle error cases for you. The remaining 15% now is mostly bespoke things which are one-offs, and there are commonly one or two errors that prevent a file from being [inaudible 00:14:57].

Increasingly, this is code where, when a team goes and touches it, they'll see that the majority of the features in their IDE don't work, because they only work in typed files. Then they have a [inaudible 00:15:09] in front of them to get it typed, and the majority of typing that has been happening over the last year is driven by users: I work on this file, I want the features I'm used to working, so I'm just going to go and type it.

Yes, this is more than just errors. It allows you to express intent, and expressing intent is super useful when you have a big organization and you have a problem. Is it one team who doesn't understand how to use the API that another team provides? Or is it the team that provides the API who didn't handle the use case that the user expected to be handled?

Here's an example of this. You have a method do_thing, a very descriptive name. Can you pass a string to it? Maybe. What happens if you pass nil to it, which is Ruby speak for "null" in Java? Maybe it works. How would you expect it to work? Should it handle it? Who should handle it? Having type signatures changed the culture and the way collaboration between teams works. Now you explicitly describe your intent, and you know that the first one is ok, the method is expected to take a string, and the second one is not. This is a good illustration of another thing, which is that we didn't want to postpone hard decisions. One of the hard decisions that a lot of type systems postpone is whether nil inhabits every type. For example, in Java nil is a valid String, and thus you can have errors in production when you pass nil. Kotlin is trying to fix it after the fact, and Scala 3 will be trying to fix this after the fact, but it's taking years for them. For us, from the start we had this figured out, and our engineers don't know that this could have been a problem that could have been [inaudible 00:16:54] them and paging them from production at 4 a.m.
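The do_thing example isn't reproduced in the transcript; a hypothetical reconstruction is below. The method body runs in plain Ruby, and the Sorbet signatures that would make the nil question explicit are shown as comments (they would require the sorbet-runtime gem):

```ruby
# Untyped Ruby: the contract is invisible, so nil sneaks in until something explodes.
def do_thing(x)
  x.upcase  # fine for a String; NoMethodError if a caller passes nil
end

# With Sorbet the intent is written down (sketch; needs sorbet-runtime):
#   sig { params(x: String).returns(String) }             # nil rejected: caller's problem
#   sig { params(x: T.nilable(String)).returns(String) }  # nil allowed: callee must handle it

puts do_thing("hello")  # prints HELLO
```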

Now, a recap. What have we achieved? In 100% of files we can catch uninitialized constants. In 85% of files we can catch NoMethodErrors, and here's one more metric: at 75% of call sites, we know the specific method that you're calling. While this is not necessarily useful for type checking, this is the metric that tells you in how many locations we can do auto complete, in how many locations you can hover over something and we'll show documentation about what's called there and which types have been evaluated there. This is the metric that tells you how useful this tool is as something you're iteratively talking to as part of your development.

It's like somebody who knows your codebase really well is able to answer every question about it in milliseconds. Rather than you taking time, going deep into a thing, trying to figure out whether it handles something, or asking a team whether it's supposed to handle it, with that team being on a different continent in a different time zone, so you need to wait 24 hours for an answer.

What Our Users Say

Did I say we're in 10 offices? All of these were just numbers, and they're numbers coming from somebody who chose them, so they're metrics. They might be representative, or maybe I'm just trying to tell you this is an awesome metric while people actually hate it.

Here are some screenshots of what users say. This is an example of a person who is describing that they would use the type system to annotate the code that they rarely use. They touch it once every year or so. They always forget what it does. Going once there and describing what it was supposed to do makes it easier for them to go back to this code in a few years. Here a user is describing that they like the pair programming experience that they get from the tool. It’s super useful that we can work on programs that are incomplete, programs that don’t parse, programs that are still being written, which means developers can get early feedback and can adapt their design to be better modular, better readable, better in almost every way.

In the past, somebody might write a method that takes either a string, or an integer, or an array, or a Boolean. Now, if they were to write this as an explicit signature, they would feel bad. Before, they could think, implicitly, "I guess maybe it will work," but now they have to actually write this down and think about it, and when they think about it, they start to question whether that's the right design. Similarly, as part of code review, reviewers now see this, and if somebody is trying to introduce a method that serves 55 purposes and takes 55 different kinds of arguments in specific combinations, it's now explicit and people will [inaudible 00:19:54].

This is a message from an early user who found an undocumented flag to enable hover. We do internal feature releases where some IDE features are enabled for some users and some are not, but by passing magical flags you can enable them for yourself. We grew a group of users that was digging up those flags in order to enable features. They also broke our metrics, because some of those features are not enabled because they're not fully stable, because they crash. At some moment we became so useful that people would prefer to enable features that crash once a day, because they're so useful despite crashing from time to time. Since then we stopped having those flags, because we want to make sure everything doesn't crash. We're polishing it, and we have metrics around it.

Being written in C++, crashes are scary, because every crash can [inaudible 00:20:55] something like this. We don't intend to have crashes.

Finally, it's fast. We have millions of lines of code and the tool completes in seconds when starting from scratch, and when you use incremental mode in the IDE, it's milliseconds. Depending on what you do, it can be single-digit milliseconds, or a few more.

What We Learned

Ok, so what have we learned? Sorbet is a powerful tool that meets many needs. We have different users. Some of them are building new projects and new products; they want to move fast, they haven't yet figured out what their things are supposed to do, and thus they don't care if they're broken. We have other users that are literally moving money, literally moving big piles of money, and they want to make sure they move the money correctly, because otherwise, well, big money is at stake.

People love using Sorbet. Originally we had people who were pushing back, people who said something along the lines of Sorbet stopping them from doing the thing they want. This commonly comes from two reasons. One is, a lot of people came from teams at previous companies that didn't grow as fast as we do: they had a team of five people, and it was the same five people over four years. Thus, making it easy to onboard new engineers and having code that's easy to understand was not a big constraint of theirs. Stripe is growing fast, and thus for us it's super important to make sure people can easily understand the code. The majority of the code that Stripe is writing today will be maintained by more people as we grow.

The second reason is that at the time they were complaining about this, to be honest, Sorbet only had a stick. We were saying, "You're wrong, your program is wrong. Don't do this." Now, we also have candies. We can give you information about what your code does. We can do refactoring for you. We can do jump to definition. Previously, if you had a method with a common name and you wanted to figure out which method with this name specifically you were calling, and there are hundreds of methods with the same name in the codebase, you needed to figure out which of the hundred is actually here. This was hard. Now, you can just Command-click and you're there.

The final part, which is more of an internal part: this could not have been possible without automating the migration. The majority of the typing, the majority of making the code actually follow the rules, fixing the common code patterns, fixing up [inaudible 00:23:39] null checks, or at least making them explicit, was done by our team. It wouldn't have been possible to stop product development at Stripe and ask everybody, "Please go type your code or rewrite code to do something about this." The biggest value proposition is that this happened underneath people, underneath the majority of developers, without them needing to do much. They were doing feature development, and a year later it became much easier to do feature development. It became much easier to maintain their code, so they love it.

This was the majority of their love. The rollout was the most important part. In retrospect, it was absolutely the right call to have the very same team who developed the tool do the rollout, because it allowed us to understand the use cases. It allowed us to figure out what should actually be allowed and prohibited, and which features we need.

Now, we're in editor and OSS tooling mode. We have integrations with VS Code. We implement the language server protocol; the reference implementation is for VS Code, but there are other implementations that work with it. People at Stripe have implemented support for Vim. There is also an [inaudible 00:24:52] implementation lurking around. Yes, it seems to be doing well. We have errors, we have hover. We have Go to Definition. We have auto complete and documentation: as you're writing a method call, you can see which methods have this name, and you can also see the documentation for them and figure out which one you wanted.

Try It Yourself!

Also, in case you saw the previous talk here, which was about WebAssembly: this thing is written in C++, we compile it to WebAssembly, and it runs in the browser. If you go to sorbet.run you can actually try it. Unfortunately, with the way it works, it doesn't show all the features on your phone. Specifically, auto complete will not work, because we're actually using a small version of VS Code there, and Microsoft did not intend to cover the use case of running VS Code on a phone.

If you go there from your phone you'll have the basic experience, and if you go from your computer you'll get auto complete and Jump to Definition. You can also Jump to Definition into the standard library; you can see which methods exist on standard Ruby things. It's the way a lot of people now prototype small things, because it allows them to [inaudible 00:26:02]. Increasingly, we see people from both inside Stripe and outside Stripe use it as a replacement for [inaudible 00:26:10]. You can explore this code, you can understand it better. It's easier to write than [inaudible 00:26:15] because it has auto complete.

State of Sorbet

What's the current state of it? We're collaborating closely with the Ruby core team. Ruby 3 will have types. We're settling the details of what the syntax is going to be. The syntax might be slightly different, but Ruby 3 will have types, and that's awesome.

It's open source. We open sourced it after an extensive private beta, with more than 40 companies adopting it. At this moment we know of a lot of big and small companies using it, both internally and for their recruitment. It provides a better user experience, and developers want to be productive, so being able to say "in our codebase you can use Sorbet" helps. Sorbet is trendy; it's a good way to get people to enjoy, and want to work on, your codebase.

You can check it out. The companies who have blogged about it include Coinbase, Shopify, Heroku, [inaudible 00:27:17], and Ruby. Just to move into closing: Sorbet has moved error discovery from test or production to development, which makes for a much faster iteration loop for developers. It's open sourced; you can use it. It works on common Ruby code. I started by saying that we don't use Rails. The majority of [inaudible 00:27:38] does use Rails, and that's why I'm putting so much emphasis on the fact that other companies also do this: they do use Rails and they made it work for them. Specifically, CZI, the Chan Zuckerberg Initiative, has created a huge project called sorbet-rails, which adds enough extensions to Sorbet to support Rails codebases and all the people using those. Docs are live at sorbet.org.

Questions and Answers

Participant 1: Ruby has a lot of really dynamic features, like whether I can re-open a class wherever I want. I can do instance_eval, OpenStruct, MethodMissing, just to name a couple. How does Sorbet deal with that?

Petrashko: The question was, Ruby has a lot of dynamic features, including being able to re-open a class, which is Ruby speak for defining new methods on an existing class, or declaring that a class implements new interfaces that its previous definition didn't say it does, or being able to change the scope so that you're evaluating something with a different "this" pointer, which in Ruby [inaudible 00:29:03] is called class_eval and instance_eval. Those are very dynamic features, very [inaudible 00:29:07] programming features, and do we, and how do we, support them?

The answer is two-fold. First of all, there is always something called T.unsafe. There is always a way for you to say, "I am intentionally doing something that you don't know what it is. I know that it's right." There is a backdoor that you can always open. Sometimes it's the right tool, sometimes it's not. For every feature, the question is, "Do we want to support it and make it official? Or do we want to effectively say that this is an unsafe feature: you can still continue using it, but we believe there are better practices around this."

For class_eval and instance_eval, we consider them unsafe. For reopening classes, we consider it safe. Sorbet natively supports finding all definitions that reopen a class, and finding all the interfaces they implement across those definitions. From the experience of deploying this at Stripe and in other companies, we believe we found a balance where some of the features are natively supported, and some of the features can be supported via backdoors.
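Reopening a class is ordinary Ruby, so here is a quick sketch of what Sorbet tracks natively (T.unsafe, the escape hatch mentioned above, lives in the sorbet-runtime gem and is only described here, not shown):

```ruby
# Reopening the built-in String class to add a method; Sorbet follows all
# such reopenings, so the new method is known at every call site.
class String
  def shout
    upcase + "!"
  end
end

puts "hi".shout  # prints HI!
```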

Participant 1: One other question. Do you enable Sorbet in your test suite?

Petrashko: Sorbet has two components. I didn't go deep into this in the slides, but I'll dive deeper for questions. Sorbet has a runtime component and a static component. The static component is the static typechecker. It runs as a concurrent job in our CI, and if your code doesn't type check, it can't be merged. We also have the runtime component to verify whether your types are correct at runtime, and we run it in CI, production, and development: everywhere.

Participant 1: I meant to say, do you check the types of your tests?

Petrashko: The question is funny, because we didn't intend to. We drew on experience from Facebook and Dropbox, who did not type check their tests. Our users went and type checked those tests for themselves. Our team, the majority of the [inaudible 00:31:12], were saying you should go and do this yourself for your users. Our users at some point found it so useful that they wanted the same features to work with tests, and they made it work. Some tests do. We still don't officially require it, and we neither encourage nor prohibit it. The thing about tests is we find that tests are typically more fragile, and thus sometimes the type system is not the thing that you want. In particular, if you like [inaudible 00:31:38], the type system doesn't want to model this.

In particular, our type system models what happens in production, and sometimes tests change things to not do what production does, in an incompatible way. It's increasingly rare, but existing tests do this. Thus, our type system is the thing that most closely mimics production, and some tests don't want that, so for those tests it's not accurate.

Participant 2: Does Sorbet have the interface where you might have three or four methods and you want a class to implement to [inaudible 00:32:18]? You might have multiple classes [inaudible 00:32:20]?

Petrashko: The question was, does Sorbet have interfaces, as in, will you be able to say that this is some structure that you want multiple classes to implement and thus follow? Ruby has something called modules, which we use as interfaces. We extended this to define a notion of abstract methods. You can say that this is a module which has a method, and we describe this method as abstract in the signature. We will prohibit classes from being instantiated, both at runtime and statically, unless they implement all their abstract methods. This effectively allowed us to have interfaces. We had this pretty early. More recently, around four months ago, we also implemented the notion of sealed, where not only can you define an interface, you can define an interface that knows every class that implements it.

When you pattern match over it, we can do exhaustiveness checking, that you do handle all of them. Yes, interfaces are super awesome. We're using them a lot. They're a great feature.
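In plain Ruby the module-as-interface pattern looks like the sketch below, with a hand-rolled NotImplementedError standing in for what Sorbet expresses with abstract! and abstract sigs (and enforces statically as well as at instantiation time); the class names here are invented:

```ruby
# A module used as an interface, enforced by hand in plain Ruby.
module Greeting
  def greet
    raise NotImplementedError, "#{self.class} must implement greet"
  end
end

class English
  include Greeting
  def greet
    "hello"
  end
end

class Forgetful
  include Greeting  # never implements greet
end

puts English.new.greet  # prints hello
```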

Participant 3: I may have missed this earlier. Is the typechecker implemented in Ruby itself?

Petrashko: The answer is no. The runtime type system is a Ruby library. It works on baseline, normal Ruby. It doesn't need any patches. You can use it in any Ruby codebase. The static type system is a separate program that's written in C++. The reason it's written in C++ comes from my prior experience working on compilers. I believed that the thing that defines the performance of compilers is good work on memory locality.

Compilers effectively have a bunch of core data structures that are huge hash maps, and the thing that will define whether you're fast or not is whether you can quickly find stuff there. Thus, the thing that's important for you is not your CPU performance, not how many threads you have; it's how much memory you can read through your CPU, and how much memory you waste reading through the CPU. We chose a language which allows us to control memory layout. The original team all knew C++, a lot of existing typecheckers and compilers have built a lot of tools for it, and we just built on them. In retrospect, we believe this was the right choice.

It also allowed us to compile to WebAssembly and have a very nice website, but at the time we didn't consider this as part of the choice. The website was built on a plane to Japan, and it's awesome, so it was super easy.

Participant 4: How does this work with existing Ruby gems? Is there a mechanism to provide external [inaudible 00:35:04]?

Petrashko: The question was, with Ruby libraries being called gems, how do we type check something that calls methods in gems? The answer is, Stripe had a different solution for this, which is we actually [inaudible 00:35:21] the majority of the things, but the other companies didn't. Shopify and Coinbase independently built a way to write an RBI, an interface file, for a gem. At this moment I believe the majority of people in open source, including Shopify, are converging on using the one implemented by Coinbase. You can point it at a gem and it will generate a type signature file for it.

The majority of the time the signatures are going to be just untyped, because it can't guess the types, but the idea is that given a gem you get the skeleton.

Then, if you want to go [inaudible 00:35:57] types to some methods, you can do this manually, commit it to your source control, and upstream it to a common repository called sorbet-typed, where companies exchange these typed interfaces with each other. All of this is built by Coinbase. They did an awesome job.
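An RBI file is just Ruby-shaped declarations with empty method bodies; below is a hypothetical sketch for an imaginary gem class (all names here are invented, and generated skeletons start out untyped until someone tightens them):

```ruby
# foo.rbi -- declarations only, no implementations; consumed by the static checker.
class Foo
  sig { params(name: String).returns(String) }
  def greet(name); end

  # What a freshly generated, untyped skeleton typically looks like:
  sig { params(arg0: T.untyped).returns(T.untyped) }
  def do_stuff(arg0); end
end
```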

Participant 4: Is it open source?

Petrashko: All of this is open source.

Participant 5: Are there any advantages to using this for a new project over using a typed language?

Petrashko: It's a tricky question. The question is, do you want to have the same codebase where people can be both strict and YOLO in different parts? This is the value proposition for new codebases, and somebody may want this, somebody may not. Some companies decide to, let's say, write a prototype in one language, say Python, but write the actual implementation in another language, say C++. The value proposition here is that you can go from YOLO to strict while staying in the same codebase.

There might be other reasons why, as part of a migration, you may decide to choose a different language, let's say performance. In the current situation, the value proposition is basically this. I don't think it matters much for a brand new project, but it's an interesting consideration if you intend to build a big company. I don't think anybody is in the situation of thinking that far ahead, but to the best of my knowledge this is the fastest IDE integration and the fastest typechecker that I know of for a practical language.

RubyMine, in our codebase, takes a few minutes to start, and double or triple-digit seconds to do Jump to Definition. RubyMine has pretty much the same performance for Ruby and Java here; they use the same infrastructure for both. So if you're starting a small company, you should probably include in your planning how you are going to work at the scale of millions of lines of code. This is also a unique proposition of this project. At the same time, I feel like there are 10 companies in the world who care about it.

Participant 6: How did you get such fast performance?

Petrashko: I talked about this at the JVM Language Summit; it's available on the internet. The short summary is, the most important tool for you is memory locality. Knowing what you want to do, you can define data structures that work well memory-locality-wise for the specific transformations that are performance sensitive. At the same time, you want to encode enough extension points, because some people will want to extend [inaudible 00:39:14], and you want to make sure that when they extend [inaudible 00:39:16], your performance still stays good. It's a balance between having things entirely locked down where performance matters, with very good data structures designed for it, rather than around the common concerns of CPU performance.

There is an entire field of algorithms [inaudible 00:39:38] called external memory algorithms, which solve problems like: how do I store a petabyte of data given a single computer which has a gigabyte of memory? You effectively have the same question between your CPU caches and your memory. When you intend to read a single byte from memory, you actually read an entire row from the actual RAM into your cache. This capacity, this throughput, is the one you want to utilize fully. The short answer is, there is a set of algorithms, external memory algorithms, and if you start from there you'll be able to write software that uses caches effectively. With contemporary hardware having many layers of caches, if you use them effectively you get 10x speedups, 100x speedups.

Participant 7: I was at RubyConf two years ago maybe, and there were a couple of talks about type checking Ruby. Do you feel this is becoming a standard that will be adopted [inaudible 00:40:52] industry-wide, or is it just you [inaudible 00:40:55] because you wrote it?

Petrashko: I'm presenting a talk at RubyConf in a week, and there's a separate track on typing, with me opening the track, then Shopify presenting their experience, then one of the core Ruby [inaudible 00:41:23] presenting his tool for typing. I think where this stands is that this is one of the tools that Ruby companies and Ruby communities can decide to use. In some cases, it's clearly beneficial. In particular, we strongly believe this is a beneficial tool in a codebase that's big, where the notion of interfaces is useful, being explicit is useful, and setting up explicit expectations is useful.

It's also super good when you want to discover your way through the codebase. That said, this tool doesn't stop untyped Ruby from existing. There are use cases where companies benefit from having a super magical DSL that just solves 90% of their business problem, and getting rid of this DSL for the sake of type checking, or trying to support it, is just not worth it. My belief is that in many companies it's useful. I don't believe it's universally useful for everybody. There are use cases where something Sorbet wouldn't like might be better for you, and maybe you choose that.

Participant 7: I was more asking about, type checking is good or not depending on the use case, but Sorbet specifically. It seems it’s used by a lot of the big companies. Do you envision this solution to type checking becoming the standard, or are there other competing alternatives that you also think are really good?

Petrashko: For context, the other implementations of type checking for Ruby are RDL, which is [inaudible 00:43:05] of Jeff Foster in his lab for around 15 years. There is work by [inaudible 00:43:12] Ruby. It’s a different person who just happens to have the same name. There has been a project by GitHub called TypedRuby, and there has been a project by IntelliJ folks. I don’t think it had a public name.

The current state is: RDL suggests using us if you want something production ready; they're doing research, and they're good. [inaudible 00:43:34] suggests using us if you want something that's fast. I don't know what's happening with the IntelliJ one. I think their approach was more similar to feedback-driven profiling, where instead of doing it statically they gather feedback from the tests and crowdsource from everybody. I think the tools synergize with each other, where you can use that tool to infer types for us, and vice versa.

The basic question is, RDL is really good in the sense that it has advanced language features that we don't. You can express complex types and do type-level computations. You can do proofs. You can have your types access your database and do logic based on that. The [inaudible 00:44:18] that comes with this is that its type checking speed is around 80 lines per second. Sometimes one is the better tool, sometimes the other. We're all working together; we're all part of the Ruby types working group for Ruby 3. All of the people I listed are in the same meeting every month, and depending on the set of use cases you want to handle, different tools work better.

The experience so far is that we're the only one that has IDE support, and we're substantially faster than the others. That said, we're less expressive than the others. So the question is, do you value a smaller language that is supported better, or do you want to be able to do type-level computations? Sometimes not using Sorbet and using something else is a better answer for you.




Presentation: Continuous Monitoring with JDK Flight Recorder (JFR)

MMS Founder
MMS Mikael Vidstedt

Article originally posted on InfoQ. Visit InfoQ

Transcript

Vidstedt: Let's see if we can learn something about Flight Recorder today. I'm hoping that's why you're here, because that's what I'm going to talk about. I'm Mikael Vidstedt, I work for Oracle, I run the JVM team at Oracle, and I've been working with JVMs my whole career, a couple of different ones. This is Oracle's not-so-short way of saying that everything I'm about to say is lies, so don't take anything for real.

Let's imagine that you have a product. It's running as a service somewhere, it's continuously running, and you obviously want to keep it that way. Then one night, probably Saturday night, 2:00 a.m., the middle of the night, there's a problem somewhere and an SLA gets breached. Obviously, your job now, in the middle of the night, is to figure out what went wrong, so you start looking through all the data you can find: logs and whatever else you have set up in your system. It may well be the case that some data is available. You see the general area where it's crashing or not behaving, whatever it is. You can see that, but you can't get to the actual root cause of the problem. What do you do? At some point, you may be tempted to add some more logging, or add some specific code that tries to capture more data so that you can see what actually went wrong. That's great; the problem is, obviously, that over time you run it and nothing happens, so you don't get more data.

As a matter of fact, the code you just added probably has some performance overhead or some problem associated with it, so at some point you remove it again. You need to remove it because it's adding overhead to your service. Guess what happens? Saturday, 2:00 a.m., you have a problem again. What the flight industry has learned from this is: let's capture data continuously, and let's do it in the background so you don't even notice it's happening. They call it a flight recorder, and that's effectively what I'm going to talk about. The flight recorder we have worked on is built into the Java runtime itself, but it behaves very much like a real flight recorder, and over the next 40 or so minutes I'm going to talk a bit about what that means at our level.

The agenda looks something like this. I'll give an overview of what JDK Flight Recorder actually is. I will talk about events, because it turns out that those are central to how JFR works. By the way, I'm going to use the abbreviation JFR to refer to this functionality a lot as well. I'll talk about the fact that JDK Flight Recorder is meant for production. If you leave this room without remembering that this is built and designed for use in production, I'll have failed, so I will try to weave that into a few different slides. Then I'm going to quickly cover how you can use JFR, and obviously some future work, because that's normally how you wrap up a presentation.

What Is JDK Flight Recorder?

What is Flight Recorder? JFR is short for JDK Flight Recorder. It is available now in a JDK near you. I'm saying that because it does depend a bit on exactly which JDK you use, but chances are that the JDK you're using has JFR in it. It's been there since JDK 7. It was in another JVM before that, and I'll touch on that in a few slides as well, but it is probably in the JDK you're using. It is event-based. I'll go into more detail about what I mean by that, but capturing events is central to the concept of JFR. Again, I'll talk more about that later.

It is built into the Java runtime itself, and the nice part about that is that it's an unfair advantage for us. We can piggyback on a lot of the stuff that is happening inside the JVM and inside the libraries that make up the JDK, and capture and store a lot of the information that the JVM already has. The fact that this is built into the Java runtime itself makes it very efficient; it has very low overhead. As a matter of fact, it's designed to have less than 1% overhead in the common cases. That means, again: production. This is meant for production. You should be able to run with it turned on all the time and not even notice that it's there in the background, and then if something goes wrong, you already have the data available.

The other nice part about having this in the Java runtime is that we can collect events from many different levels: all the way down from the operating system and the CPU, through the JVM, through the libraries, and also at the application level. You can get events from all these different levels and very easily correlate them, so if something goes wrong at your application level, say an SLA is breached, you can look into all the details of what else was happening at the same time. We also have very powerful APIs that you can use both to produce your own events and to consume the events and make sense of the data around them.

That is JFR in a nutshell, and now we'll go through all of that in more detail. I could start talking more about theory, but I'd like to try to just throw a demo at you and see how that works. I'm going to show you a very simple use case for how you can use JFR. In this case, we have all the events already in the Java Runtime itself and what I've done is to build a very small agent – a Java agent, in this case. It's a small program that runs in process next to the application that I'm going to run. It is monitoring the application, so it's taking the event stream, the data stream that is being produced, and it's analyzing it and displaying it in a simplistic way. In this case, it's going to be a text-based representation, but it could be something else, of course.

I'm also going to use a very old demo, Java 2D. How many people know what a Java agent is? Ok, some of you at least. Just as you have your main – the normal application main – there's a way to say, "Here's a JAR file," and inside of that, there's a class with another sort of main that's called premain. It's like you are running in parallel with your application, so you get the premain call first and you can do stuff in it. It's a handy way to just run something on the side of an application. Basically, what I'm saying is run my health report thing here, my special JFR analysis agent, and run that in parallel with the application.

If I fire that up, basically, what you're seeing here is JFR in action. In the background in the JVM, in the Java Runtime, we're collecting all these events. Some of them are related to GC, some to the environment we're running in, with the amount of memory and all that; we have the allocation rate at the top. There are details around what's going on in the JVM and then there are two different categories here.

The first one is "Top Allocations," so where is the application allocating memory. The second one is "Hot Methods," so where is execution time actually being spent. This is just a snapshot of what JFR can provide; if I show you the code, you'll see that it's very tight and streamlined. I'll go more into the details about this later, but just to get a feeling for the kind of information that we're collecting, hopefully, this will help put it in perspective. That was the demo.

Some history – I’m not going to go into all the details but JFR has been around for a long time. As a matter of fact, I worked on another JVM before called JRockit and we had a problem at some point where we were running into challenges ourselves understanding what was actually going on inside the JVM. We developed the JVM but we wanted to know what’s going on so that we can improve on it and make it better.

Also, our support team, they were sitting next to us, and from time to time, they brought up the fact that, “We could sure use more insight into what your code actually is doing.” That idea was the foundation for what later became Flight Recorder. It has changed its name over the years a bit, but it has been there for 15 years or so. In, let’s say, the standard JDK, the HotSpot JVM, it’s been around since JDK 7. The very first version was ported over to HotSpot in 2012 and we’ve made a number of improvements on it ever since. In JDK 9, we published APIs, made the APIs available for people to use both for producing and consuming events.

One of the biggest things that happened to JFR was that in 11, which is now a year ago, we open sourced JFR, so now it's available in open source and probably in more or less every JDK out there. You don't need to know all the details about when exactly what happened, but that said, it's been around for a long time, so it's very mature technology. My point is, you can use it in production. I'll stop saying that at some point.

JFR Events

Let's look at what events actually are because, again, events are very fundamental to how JFR works. An event is basically this; it is a small blob of data. I'll go more into the storage format of it but, logically, it has an event ID, basically some kind of unique identifier telling which event this is. There is a timestamp, so when did this happen? There's a duration, so how long did this take? Not all events have a duration, so some can be instant and therefore have no duration. There's a thread ID, so which thread generated this event, or in which thread was this operation performed? Not all events have a thread ID – some of them are not tied to a thread specifically – but many are.

There's a StackTrace ID – associated with every event, there can be a StackTrace, I should say. But, in order to keep the footprint of this down, instead of storing the whole StackTrace with every event, we are putting the StackTraces on the side, because they tend to be almost the same. We're basically storing them separately and we have an identifier to tell which StackTrace to use for this event. Then, there's event-specific payload, so depending on which type of event you have, you have various other fields or data points in there as well. That's the layout; let's look at it from a Java perspective.

This is basically what it would look like if you wanted to generate your own event. There are two sources of events and I'll get to that on a later slide as well, but there are events that are generated from Java code. If you start picking this up and playing around with it, this is what you will see. There are also events that are being generated from inside the VM itself, so that's native C++ code. Very similarly to the next few slides that I'm going to talk about on the Java level, there's corresponding functionality inside of the JVM itself to capture events. If you want to create your own event, this is what you have to do. It's literally that easy. You inherit from, or extend, the class called Event and, lo and behold, you have your own event.
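For example, a minimal custom event could look like this (the class name DemoEvent is my own, not from the talk):

```java
import jdk.jfr.Event;

// Extending jdk.jfr.Event is all it takes to define your own event type.
public class DemoEvent extends Event {
}
```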

How do we actually make use of that? Let's say that you have some really business-critical code. This is your application, it does something really important and you want to put an event around it to get some more information. I don't know what it is; maybe it's processing an HTTP request or it's calculating some prime or whatever it is. Basically, what you do is you add this around it: you allocate an instance of the event you just created and you start the event, and the "begin" here is basically taking a timestamp, so it's saying, "This is when the event started."

You do your business logic and then you call end, which is also just taking a timestamp, so it now has the start and the end point for the duration of the event. Then, you say, "commit," and that's what actually takes the event data, puts it into the event stream, and stores it off inside of the VM, and we'll see more about how that works. The first question you might ask is, "Why would I call end and then commit? Couldn't we just coalesce them?" It turns out that you don't have to call end; if you don't, commit will do the work for you. There is a case where it's useful to make a distinction between the two, and that's when you want to have the end timestamp but there's some kind of computation you want to do before storing the event into the stream.

In some cases, it may make sense to call end explicitly, but if you don't, commit has your back. Typically, what you'll see is something like this, but this isn't very useful in itself. Now you have an event and the only thing you know is that the event happened. You may want to put some of your own data into the event. The way you do that is by adding fields to the class, not surprisingly, perhaps, and you initialize those fields as part of creating the event. You can do this whenever you want as long as it's before the commit – well, after the allocation, of course. For example, in this case, I'm storing a constant before the event is even generated, but maybe what you want to do is to take the business logic or whatever, and maybe some value comes out of that, so you want to store that in a field instead; that's totally fine. Do it whenever you want, as long as it's before the commit. That's basically it.
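Putting the pieces together, a sketch of the begin/commit pattern with custom fields might look like this (the names TransactionEvent, orderId, and status are illustrative, not from the talk):

```java
import jdk.jfr.Event;

public class TransactionEvent extends Event {
    // Plain fields on the event class become event-specific payload.
    long orderId;
    String status;

    static long process() {
        TransactionEvent event = new TransactionEvent();
        event.begin();              // takes the start timestamp
        long result = 21 * 2;       // stand-in for the business-critical logic
        event.orderId = result;     // fill fields any time before commit
        event.status = "ok";
        event.commit();             // takes the end timestamp (if end() wasn't called) and stores the event
        return result;
    }

    public static void main(String[] args) {
        System.out.println(process());
    }
}
```

Note that commit is safe to call even when no recording is active; it simply does nothing in that case.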

There are some additional things that are helpful to do with your event. There are a number of annotations that you can make use of. These things help when presenting the data in a GUI. They don't modify how things are being produced, but if you want to visualize this later, you may want to provide some useful hints around what a field value actually is. I'm not sure I did my very best naming them here, it's just a message and a value. Maybe you at least want to signify that in some way.

You can also annotate the event itself, the class. I'll go through a few more annotations on the next slide, but the default name of an event is actually the full package name plus the class name. In some cases, that works out, but in some cases, you'll find that your class is in a package called "com.internal.secret-something$," and it just isn't very nice when you want to present the data in a GUI; you may want to have a name that is more easily understandable, let's say. Name the event in a more human-friendly way, so to speak. That's basically it; now you've created your event, you've put it into the event stream, and you can consume it later, and we'll see what that can look like.

A few additional annotations that you can use are these. I mentioned Name and Label; there is also Description. Label is supposed to be very short, like a word or two. Description is not a full essay, but maybe you want to add some color to it, so a few sentences; try to keep it short and interesting. There's also other metadata that you can annotate your events with. As I mentioned, most events have durations, and one of the ways we're dealing with the amount of data is by filtering the events. Instead of just capturing everything all the time, chances are that some events are only relevant when the duration is long enough, so you can set a threshold on your event.

If you, for example, know that HTTP requests that take less than 100 milliseconds are not a problem and you don't care about them, but the ones that take longer are typically the ones you want to capture, then you can say, "Threshold: here's the filtering threshold for duration." Similarly, there's an Enabled annotation, which basically says whether this event is enabled by default. There's also an annotation for saying whether the stack trace should be included or not, so those are some of the configuration options for your events.
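A sketch of what those annotations can look like on an event class (the event and its name are hypothetical; the annotations themselves are the real ones from the jdk.jfr package):

```java
import jdk.jfr.Category;
import jdk.jfr.Description;
import jdk.jfr.Enabled;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.StackTrace;
import jdk.jfr.Threshold;

@Name("com.example.HttpRequest")   // overrides the default package+class name
@Label("HTTP Request")             // short, human-friendly label
@Description("A request served by the demo HTTP endpoint")
@Category("Demo")                  // groups the event with related events
@Threshold("100 ms")               // only record events lasting at least 100 ms
@Enabled(true)                     // enabled by default
@StackTrace(true)                  // capture a stack trace with each event
public class HttpRequestEvent extends Event {
    @Label("URI")
    String uri;
}
```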

This list is in no way exhaustive; there are more annotations and more information. The Javadoc for the jdk.jfr package is very helpful, so if you want to know more about which annotations exist and when and where to use them, I suggest you go look there. I mentioned that the Java Runtime itself generates a lot of events. This, likewise, is not an exhaustive list, but it does cover some of the key things that we are collecting inside of the JVM and the libraries. There are right now around 140 different events, and the number is growing, but I showed you a bit earlier in the demo that we are capturing things like the environment – which CPU are you running on, how much memory do you have, that kind of thing. Command line information, version information, both from the Java Runtime itself and the operating system, things like that. There's I/O, both on the file and network side. GC, JIT compilation – more or less the expected stuff, I guess, is in there. Again, this is something we're constantly working on and expanding.

In the background, this is how it works. I mentioned that you’re getting two sources for events, one is generated from Java code. That’s what we saw an example of just in the few previous slides here, that’s the top cloud thing there. There’s also JVM-level events, so things that we feed in from the JVM through the native code, more or less. These both go into buffers and those buffers are thread local.

That's very powerful because what it means is that there's no cross communication across sockets or CPUs or even threads. They all go into thread-local buffers first and that's very efficient. Storing things to a thread-local buffer is something we're very good at optimizing. What then happens is that as the thread-local buffers fill up, we take those buffers and put them into a global list of buffers. This is, obviously, the uncommon case and also something that can happen in the background. This is being processed by another thread, and once the global buffers fill up, we store things into what we call the repository, which you can think of as a recording file in the background.

Obviously, we store that data to file when the buffers fill up. The other case, which is new with the event streaming functionality that we'll touch on later, is that we now also store things to the repository approximately every second. I'll go into this in more detail later, but we want to make sure that the repository can constantly be observed, that it's in an observable state and has the most recent data in it, but there's a trade-off between putting the data there all the time and getting data in a timely manner. Basically, what you'll see going forward is that roughly once a second, we're going to store the data into the repository. If you take anything away from this slide, it's that the overhead of producing events is, from the application perspective, more or less zero.

What does a flight recording file look like? We store things into the repository and the end data format that we use for this is the flight recording format. It's a binary representation and it's trying to be very compact. It is not compressed, which is something we're looking into potentially doing in the future, but it is very compact, and this is something we've improved on over time as well. For a lot of the identifiers and the other data I mentioned that go into an event, you don't need the full 64 bits or whatever it is; most of the bits are zero. Varints happen to be a very efficient representation for that, so we're using them extensively in the flight recorder format.

The files are also self-describing. There's a high-level format you need to know about in terms of how JFR recordings work, but the events themselves are self-describing. We capture all the important information about what an event looks like, which fields it has, how to interpret those fields, what types they have; all of that is captured in the recording itself. If you add new events, for example, you don't have to have side information about which fields they have; that's all in the recording. The recording is self-describing; it stands on its own feet. I'm not going into all that much more detail around what exactly the flight recording file looks like, but hopefully, you get a rough understanding of it.

I mentioned that events can be filtered, the obvious one is on a type or a name basis, so you can say that, “These events, I want to include, these other events I don’t want to include,” and I also mentioned that you can filter by duration and we find that the combination is very powerful. Most of the time, as I mentioned, you’re probably not interested in the nanoseconds or even necessarily microsecond events. It’s when it’s starting to come up in milliseconds or seconds, that’s where it gets more interesting but you can filter on different durations for different events.

I also mentioned that the fact that we are capturing events from many different places can be very powerful. We capture things everywhere from the operating system – what operating system are you using, the versions of things, the CPU load, that kind of information – through the JVM, so JIT compilation, GC, class loading, locks, synchronization, and all the way up through the libraries. If you do application-specific events, you get all of these events in the same stream and it can be very powerful to start from the top. You see that your HTTP request took a while to execute. I'm actually going to show a demo of this later as well. In order to figure out what actually happened, you can dive in and go as deep as you want to find what actually led up to that problem: why did it take so long in the end?

Designed for Use in Production

I have a whole section dedicated to the fact that you can use, and should use, this in production. A lot of this I already covered, but it has extremely low overhead. We have the thread-local buffers, and the JVM piggybacks on the fact that a lot of the GC and JIT data is something it already collects; we can just take that data and store it into the event stream. That means that this does have very low overhead.

We generate the events into these thread-local buffers, which means that the application won't block; it can continue executing, it's just doing some thread-local storage and that's very fast. If you still aren't convinced, maybe this will help you. We have JFR on by default for our Oracle Fusion apps. I'm not even sure what kinds of applications we have. We have customer relationship stuff and I'm a low-level Java guy, but we have cool applications and they're all running with JFR enabled. We also see customers who make extensive use of JFR throughout their whole deployments. These are really huge companies: hundreds of thousands of machines or instances, that kind of thing.

I'll try to stop rubbing it in now, but this is for production, it's not just for development. "Still, Michael, you keep talking about this performance thing. You keep telling us that it doesn't actually have overhead. Surely that can't be true. If we go back to our example here, I've now sprinkled all this code around my business logic and surely that will mess things up." To start with, let's have a look at what actually happens here, and this is going to be a journey through how the JVM does optimizations.

I add a few lines of code, I will admit that. The begin thing – we'll look into what it does. Commit may be the obvious candidate for "how can that possibly be efficient?" Let's look at what commit looks like. The first thing I'll say is that this is not what commit looks like, but it's sort of what commit looks like. It actually is, for various reasons, bytecode-generated, not an actual method, but it sort of looks like this. The first thing it does is ask, "Is the event enabled?" If it isn't, there's an early out to say, "Don't do anything else." There are other checks, so, for example, I mentioned duration. If the duration isn't big enough, if this event didn't take long enough, then we'll also early out and say, "This event isn't interesting."

There are also other checks that are being performed to see if this event actually should be stored into the event stream. These checks are all relatively cheap but obviously, they are there, and in the end, if the event is actually to be committed, then we call the equivalent of, let’s say, actuallyCommit down there. Remember, even that is actually cheap, and goes into thread local buffers but let’s have a look at what happens if this event isn’t enabled, at least the first thing. If we can’t get that right, surely the performance will not be good. So going back to our example.
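In plain Java, the shape of that generated commit logic might be sketched like this (this is my paraphrase of the slide, not the actual generated bytecode; field names and structure are illustrative):

```java
// A hand-written sketch of the checks the generated commit() performs;
// the real implementation is bytecode-generated inside the JDK.
public class CommitSketch {
    boolean enabled;
    long startTicks, endTicks, thresholdTicks;
    boolean stored;

    void commit() {
        if (!enabled) {
            return;                                  // early out: event type disabled
        }
        if (endTicks - startTicks < thresholdTicks) {
            return;                                  // early out: duration below threshold
        }
        actuallyCommit();
    }

    void actuallyCommit() {
        stored = true;  // stands in for the cheap thread-local buffer write
    }
}
```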

This is what the code looks like right now, you have your really important business logic there in the middle and obviously, you want to make sure that that’s executing and it doesn’t have the rest of the overhead. The first thing we do inside of JIT compiler is we use one of our secret weapons, which is inlining. We start looking at all these method calls and we look at what they actually execute and then we put that in the method that is calling it. In the case of begin, I mentioned that the only thing it actually does is take a timestamp, so we inline the timestamp thing.

In the next step, it turns out that that is actually a JVM intrinsic, something the JVM knows really well how to get. It's actually reading a CPU register – there's a read time stamp counter instruction in there – so it's very cheap and we know how to optimize that, so we put that there also.

Commit, likewise, we inline, and I didn't inline all of commit here, but I did inline the top check. We're saying, "If this is enabled, do stuff," and we can inline that as well. If it turns out that the event actually isn't enabled, we'll get the value false, and the second secret trick we have in the JVM is dead code elimination. We know that that's not going to be executed, so we just remove it. Now look at the event: we allocate it and we store a value in it. What we can do is use our third secret trick, which is scalarization, also known as scalar replacement. We don't have to store things in an actual instance on the heap if nobody is going to be around to read it. This event clearly isn't used anywhere else, it's local to this stack frame, it's local to this thread, so therefore, we don't actually need to store the value on the heap. Great, now we have a useless allocation; let's remove that because nobody will see it.

Guess what we'll do then? We'll use our second weapon again, dead code elimination, and we're left with just the business logic. The story is similar for the other checks. In the end, basically what this comes down to is that it may look like a lot of code, it may look scary, it may look like it won't perform, but the JIT compiler is really good at this and basically, what you'll end up with is something that is extremely cheap. Zero if it's not enabled, really cheap if it is.

Here are some bar charts, because everybody loves bar charts. This is showing JFR disabled, which is zero, as I said. I will come back to the stack depth part in a bit, but it does have an overhead. Yes, I'll admit, it's not zero when it's turned on, but it's very small, as you'll see. There are a few other comparisons here: there's log4j with logging off, which still has an overhead, and log4j with INFO, which as you can see is pretty costly compared to the alternatives.

Anybody dare to guess how big java.util.logging at INFO and redirected System.out will be? They didn't fit into this chart, so it changes the scale a bit. Yes, they have ridiculous overhead. Now, it says, "Your Mileage May Vary," at the top; this does depend on what you're logging, where you're logging it, all of that. But just to get a feeling for the kind of performance you can expect from JFR: it is very close to zero and that's what we designed it for.

Less than 1% overhead is the goal; we're trying to keep it there. There are ways you can mess that up. If you actually start generating events with long stack traces, you'll see it – one of the key things that is costly for us is taking the stack trace of the running thread. If you have very deep stacks and you generate these events in some innermost hot loop, then you will see overhead, but in the typical case, you won't. The default configuration is tuned so that it doesn't have more than 1% overhead. There are other configurations you can run with, so if you start collecting allocation profiles, for example, or just capture a lot more data, then, yes, there is an overhead, but we get a lot of detailed information using the default profile.

Using JFR

If you want to use this, how do you get started? I'm going to cover the JDK 11+ case here, because I'd like to talk about JDK 11+. There are options for older JDKs as well, but this is what it looks like in 11+. The option that you're looking for is -XX:StartFlightRecording, and there are some options to that. The second example here shows how you can start the recording and ask it to store the data to a specific file. Otherwise, it will just record in the background and you'll have to go in and get the data out explicitly. If you start it as the second line says, then you store the data to a file called tmp/foo.jfr, and you have your application arguments after that. If you already have a VM up and running and you didn't specify StartFlightRecording but you still want to get data out of it, there's a command in your bin directory called jcmd, or "J command", and it has options to start a recording and to get the data out as a file as well, so that's JFR.start. You can, by the way, point jcmd at either a process identifier or the name of the main class. Obviously, that doesn't work if you have many applications running the same main class, but those are the options, let's say. There are also some options that you can specify to limit the amount of data you get out, and so on and so forth.
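Besides the command-line flag and jcmd, a recording can also be started from Java code through the jdk.jfr.Recording API; here is a minimal sketch (the temp-file name and the choice of built-in event are my own assumptions, not from the talk):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import jdk.jfr.Recording;

public class RecordAndDump {
    public static Path record() throws Exception {
        Path file = Files.createTempFile("demo", ".jfr");
        try (Recording recording = new Recording()) {
            recording.enable("jdk.JVMInformation");  // any built-in event will do
            recording.start();
            // ... run the workload you want to observe ...
            recording.stop();
            recording.dump(file);                    // write the flight recording file
        }
        return file;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(record());
    }
}
```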

I’ll show you a quick example of what that looks like. I’m going to still use the Java 2D demo and I’m going to run it and say, “Take the default profile, these default events, and store those into a file called /temp/j2d.jfr.” I should remove this file to make it clear that I haven’t cheated. I’ll start this up and I’m going to run it for a short while to have it collect some information in the background. We’ve run it for a while and if we now go and look, we will see that we have a file.

There's another handy tool that you can use to look at this. Inside of the bin directory – if I go look in my JAVA_HOME, in the bin – I will find a jfr command. This is also new since, I want to say, 11; I could be off by a release or two. This is actually one of the challenges with the release cadence we have right now. We're releasing one release every six months and it's really hard to remember what went into which release, but I think it was 11; it should be close at least. If I run that, I'll get some handy information about what I can do with it, so it's also saying, "If you actually want to capture data, here's how you do that; you can use jcmd as well." It's telling you that if you want to print out and analyze what actually went into the recording, you can do something along these lines, so print the events.

There's a large number of events scrolling by now. As you can see, they're human-readable and nice. There are other ways you can get the data out. You can, for example, say, "I want that exact same information but as JSON, because JSON is cool nowadays," and you will get it as JSON instead. You can look at summary information – maybe I should have started with that. I can say, "summary," and I will get information about what version of the JFR file format is used here and how many chunks it has, which takes us back to how the data is stored. Inside of that file, we are storing things into chunks because we do want to deliver data continuously in some way, so there are chunks that each stand on their own feet. Inside of this file specifically, we only have one of them. We see when the recording was started, how long it was going for, and the events that went in. This is a histogram of the kinds of events that came into that recording.
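Reading a recording back can also be done in Java through jdk.jfr.consumer.RecordingFile, which is roughly what jfr print does under the hood. A self-contained sketch that produces a small recording and then reads it back (the event choice and file naming are my assumptions):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class PrintRecording {
    public static int dumpAndCount() throws Exception {
        // Produce a small recording so we have something to read back.
        Path file = Files.createTempFile("demo", ".jfr");
        try (Recording recording = new Recording()) {
            recording.enable("jdk.JVMInformation");
            recording.start();
            recording.stop();
            recording.dump(file);
        }
        // The file is self-describing: event types and fields come from the file itself.
        List<RecordedEvent> events = RecordingFile.readAllEvents(file);
        for (RecordedEvent event : events) {
            System.out.println(event.getEventType().getName());
        }
        return events.size();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(dumpAndCount() + " events");
    }
}
```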

Those are all cool ways of looking at JFR, repeated on this slide for simplicity. You can also filter by categories; one of the annotations that I think I had on the slide was Category. That's another way of saying that this event is logically part of a group of other events, so we have one category called GC, for example. If you only want to print out GC events, you can say, "--categories GC," and you can define your own categories as well. I'm not going to go into the details.

Ok, so what is JFR useful for? Production. Apart from production, you can also use it for other things – the obvious one is development. You want to know, while you're developing your application: what methods are actually hot? What is allocating memory? How can I make this run faster and more stably? Development is another obvious use case. The one we found interesting was actually testing. It turns out that as part of implementing JFR, and especially the event streaming stuff I'll touch on in a minute, we found that using JFR to test JFR was actually extremely powerful. We can capture data that we otherwise just wouldn't be able to, and then verify that whatever operation we expected to do a certain thing actually did that thing in the end.

This is true for things like allocation or lock profiling, like the profile suddenly changes. You make a change and all of a sudden, you have lock contention and everything grinds to a halt – or maybe not a halt, it just goes more slowly, maybe 10% more slowly so that it’s very hard to see. JFR can potentially help you figure things like that out.

That’s in a nutshell what JFR has been for a long time. One of the key use cases we’ve been missing is continuous monitoring.

Future Work

For better or worse, the way JFR was implemented was more for, let’s say, profiling, spending time, like a minute or three, collecting data from an application, dumping out that data, and then analyzing it. That’s powerful for many use cases but one of the problems is if you actually want to use it continuously. If you have to wait for a full minute and then go through a number of steps, then that’s cumbersome.

I'm going to talk a bit about the future. Some of the future is almost here already, but I never promise that anything goes into releases, so I can only say that the event streaming stuff is in mainline right now and is therefore highly likely to make it into the next release. Duke here is holding a beaker because that's what we do in the labs. If you watch the beaker carefully, it will now change to a coffee beaker thing – you wouldn't believe how much time we spent internally working on the color of that coffee, by the way.

As I mentioned, today, as in all the shipped versions of JFR, you basically have to go through the cycle of starting the recording, stopping the recording, and dumping out the file every time you want to look at the data, and that's not exactly ideal. It can work well for the development use case where you're just doing profiling, but if you want to actually look at your running service somewhere, that's not what we want. We worked on something called event streaming; it has a JEP number associated with it. JEPs are JDK Enhancement Proposals, so they're a description of what the functionality is with some pointers to more data around what's going on.

The goal here was to provide functionality that you can use to continuously monitor what’s happening inside of the instance and also provide the necessary APIs to make that easy. Our goal was to make it more or less a one liner to consume data and act on various things. We didn’t get it down to a one liner but I’ll show you on the next few slides that it’s at least relatively straightforward. This is enabling you to do continuous monitoring while the recording is still in progress. We’re continuously recording things and given that we now dump the data out to the repository approximately once every second, you can observe it, you can consume it and act on it approximately every second.

What can consuming look like? This is the canonical example of consuming events. It is opening the recording stream. I should have included the package here; I think it's jdk.jfr.something. We're opening up the recording and we're saying, "Enable one event" – in this case, jdk.JavaMonitorEnter, which is lock synchronization – and do that with a threshold of 10 milliseconds. Enable the JavaMonitorEnter event with a 10 millisecond threshold, so any event that takes longer than that will be captured in the stream.

Then we're going to say, "If one of those events occurs, if we see it in the event stream, then do this" – in this case, print it out to standard out. I'm not going to go into the details here, but basically, one of the fields in that event is called monitor class, and it's the type we're synchronizing on; what you get back is not a java.lang.Class. You're getting a representation, a JFR-specific representation, of a class or a type. I'm not going to go into the details, but in any case, it's going to print out the event, the type associated with the event. Then you say, "start," which is a synchronous call; it's going to block until the event stream is closed at the other end.

There are asynchronous calls as well that you can use, methods for processing this in a separate thread, but it's going to do all of this and it's going to print it out. If you want to add another event, it's very similar, so in this case, we're saying, “CPU load, enable that, do that with a period of one second, and print it out when you get it.”
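A minimal sketch of what those two subscriptions can look like, assuming JDK 14+ with the jdk.jfr.consumer API; the class name MonitorWatch and the three-second observation window are my own choices, not from the talk:

```java
import java.time.Duration;
import jdk.jfr.consumer.RecordingStream;

public class MonitorWatch {
    public static void main(String[] args) throws InterruptedException {
        // Open an in-process stream over this JVM's flight recorder data
        try (RecordingStream rs = new RecordingStream()) {
            // Capture lock contention that lasts longer than 10 ms
            rs.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(10));
            rs.onEvent("jdk.JavaMonitorEnter",
                e -> System.out.println("blocked on " + e.getClass("monitorClass").getName()));
            // Sample total machine CPU load once per second
            rs.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
            rs.onEvent("jdk.CPULoad",
                e -> System.out.printf("machine CPU: %.1f%%%n", e.getFloat("machineTotal") * 100));
            rs.startAsync();        // start() would block until the stream closes
            Thread.sleep(3_000);    // observe for a few seconds, then close
        }
    }
}
```

Here startAsync() is the non-blocking variant the talk alludes to; the blocking start() would be used by a dedicated agent.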

I'm going to try a much more complicated demo. I know literally nothing about Spring Boot, but I'm going to use that anyway. I've implemented a small monitoring agent, and this time it's not going to run in process. It's going to run on the side, so it's going to run on the same machine, my laptop, but in a different process.

What I'm going to do is start up the Spring application; I'll show you the code in a minute. I'm saying, “Start flight recording,” and I start up my Spring Boot thing. This could be any app. I'll fire that up, and meanwhile in the background, I have this application going. I can't really talk to this anyway because it's Spring Boot and I don't know Spring Boot. There is a small application here; basically what it does is set up a few REST endpoints, I think they're called, and very specifically, they look like this. There's a /hello endpoint. If you go to this web service at /hello, it will simply produce something that says hello.

There are a few other ones here, and that's what we'll be demoing. hello1 is almost the same thing, it just does something CPU-intensive. Imagine that I have some weird request here and it's consuming CPU; I'd obviously want to know what's going on. Keep that in mind: hello1 is CPU-intensive. hello2 is GC-intensive, so it does something weird with allocation. It should maybe be clear from the name. hello3 will be using lock contention, so these are three use cases, three weird things that can happen.
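As a rough idea of what such endpoint handlers might do behind the scenes (hypothetical stand-ins I wrote, not the talk's actual Spring Boot code), each one provokes exactly the symptom that JFR is meant to surface:

```java
public class Load {
    static final Object LOCK = new Object();

    // hello1: burn CPU in a tight loop (shows up in jdk.ExecutionSample)
    static long cpuIntensive() {
        long acc = 0;
        for (int i = 0; i < 20_000_000; i++) {
            acc += (long) i * 31;
        }
        return acc;
    }

    // hello2: churn through short-lived allocations (drives GC pauses)
    static int gcIntensive() {
        int total = 0;
        for (int i = 0; i < 100_000; i++) {
            total += new byte[1024].length;
        }
        return total;
    }

    // hello3: block on a lock another thread holds (jdk.JavaMonitorEnter)
    static void lockContended() throws InterruptedException {
        Thread holder = new Thread(() -> {
            synchronized (LOCK) {
                try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            }
        });
        holder.start();
        Thread.sleep(10);            // give the holder time to take the lock
        synchronized (LOCK) { }      // this monitor entry is what JFR records
        holder.join();
    }
}
```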

Obviously, this is a fake application, but in theory, this could be your production code. I have a small agent on the side, and it basically does what I just showed you on the slide, so it's opening up the recording stream. It doesn't do that in the same process; instead, it looks around for JVMs that are running on the same machine and opens up the recording directly from the repositories, from the outside. Basically, what is happening now is that I have the Spring Boot instance that is producing data into the repository, into the file stream, and I have this other process that is observing that exact data.

What we made sure with event streaming is that the data is always consumable, always consistent. You can always get to it and observe it on the side. What that looks like is something like this. By the way, if you're paying attention, you'll see that I'm running java on a .java file; I'm not compiling it first. This is new functionality from a relatively recent release, I've forgotten exactly which one, but if you have a single main class with a main method in it, you don't need to run javac; you can just run it directly with java.

As I mentioned, what this is doing now is looking around for JVM processes on the same machine, and it's found one, the Spring Boot thing, not surprisingly. What I'll do now is first run hello just to show you what's happening. The monitoring agent did notice that the hello endpoint was being accessed. hello1, if you remember, is the CPU-intensive one, so now it's going to run the CPU-intensive thing and hopefully, as you can see, it took slightly longer to access, and this looks like a crash but it isn't. It's actually printing out that one event took a long time, the CPU event here. What I've done is open the recording stream, add a subscription to the CPU load event, or the jdk.ExecutionSample event, and say, “If it takes longer than this, print it out for me.”

Basically, that's what we're seeing here: it found that something was taking time, and you can see that it's the CPU-intensive method, much like we'd expect. hello2 is the GC-intensive one, and I'm not sure why it started accessing the error thing in there, but in any case, we see that there's a GC pause, the reason being that we have a GC-intensive thing running. Then, finally, we have the lock contention case, which is going to tell you that there's a wait here in the lock contention method, and you can also see the class we're synchronizing on, this object.

These are examples of very simple monitoring. I'll show you very quickly what the monitoring agent actually looks like, just to give you a quick view of it. It should be relatively similar to what you'd expect. Here's an event stream; we're opening it up and we're saying, “On the HTTPRequest event, if the request takes a long time, then get more information and show that.” That's in a nutshell what happens. That's event streaming.

I want to say that with event streaming, in my mind, we have the first real version of JFR. It’s always been powerful but the fact that we now have event streaming in there is providing that use case that we’ve been missing, in my mind. Obviously, there’s more stuff that we can work on but that’s one of the key things that we needed to do to have the full story in place.

Some of the other things we're working on are making it possible to access the events over JMX. We want to, as I mentioned earlier, keep adding events. There are other things, especially at the JDK libraries level, that we want to feed into the same stream: which crypto algorithms are being used, certificate expiration dates, that sort of thing, high-level stuff. There's a project called Loom, which is looking at providing fibers and continuations. I'm not going to go into the details, but it changes the concept of what a thread is a bit, and since threads are so key to how JFR works, that's something we need to look at supporting with JFR as well.

The command-line configuration has always been hard to use, but I think we can do some tuning to make that slightly easier. There are a couple of other things that we're looking into. Event throttling, so instead of just filtering on name and duration, maybe you want to record a sample of every nth event of a certain type. Deep tracing, in the sense that maybe for a short duration, you actually want to capture everything just to get a very detailed view of what's going on. This does put some stress on the code that is capturing the events, making that efficient and all that.

There are a lot of integration opportunities here. The stuff I've shown you is obviously very low level and provides the primitives for getting the data out. The next step that we'd like to see is more of the big libraries, frameworks, and IDEs picking this up and making use of it, visualizing it, providing it to the user in a well-shaped form. We're providing the primitives for doing this. As I showed, hopefully getting specific data out isn't that hard, it's a few lines of code, but what we're hoping to see is this being picked up by a lot of the vendors out there.

If you want to try this out, and please do try it out, we are relying on your help to figure out what to do next and to make sure that everything works as you'd expect. We have obviously shipped releases with JFR in them. We are also providing early access binaries of the next version of Java, in this case JDK 14. You can pick them up, try them out, and let us know what you think; there's an email list here that you can send your feedback to.

Just as a summary, this is actually the same in-a-nutshell slide that I showed at the very start. Flight Recorder, JFR, the JDK Flight Recorder, is there right now already. Event streaming is not in a released version yet, but it's coming soon. It's event-based, built into the Java runtime with all the unfair advantages that come with it. It has very low overhead and it's for production use. Remember that: production. You can correlate events from many different levels, and we have APIs both for producing and consuming the events that we like to think are very simple to work with.


Presentation: MakeCode: Types, Games and Machine Code


Transcript

Moskal: Hi, my name is Michal Moskal, I'm at Microsoft Research, and we work on this platform called MakeCode. There are quite a few people behind it, not only me; I'm just giving the talk. Microsoft MakeCode is a platform for building educational programming experiences, so think of it as something like an IDE for kids for programming a specific thing. Maybe the platform itself is general-purpose, but usually you target it towards a specific experience, and usually we program different hardware devices like this. The main feature from the technical side is that it all works everywhere as a web app, so you just need a web browser, which is actually quite unusual when it comes to programming these devices. It's an open-source platform that can be extended by others, so we have several editors that people use, and other people can build their own.

The other thing about it is that we try to make the barrier to entry as low as possible. You should be able to get started in a minute, but we also try not to limit what you can do within the platform, and I'm going to show that in a second. It's also quite a big project as far as education initiatives go. It was first deployed on the BBC micro:bit, a device like this one, and there were 4 million units of that one. It's also fairly cheap, so you can buy one and try it for yourself.

Demo: MakeCode for micro:bit

I'm going to show you how it looks so you get an idea. If you go to makecode.com, you will see a bunch of cards for different editors. This one is for the micro:bit, this device that I was showing you. Most of them are for different hardware devices, including the Lego Mindstorms and the Wonder robot. This one is for Minecraft, so we can also program different agents in Minecraft. We have a new one called Arcade, which I'm going to show in the second half of the talk, which is for building these retro-style games that you can then run on hardware devices like this one.

Let's go to the micro:bit. When you come to the site, what you can see is that those are your projects, but actually most of the site is about different tutorials and different projects that you can do with the device. This is essentially for the kids to get started, though we mostly see usage of MakeCode in the classroom and that's what we optimize for, so there isn't really so much of this self-guided material for someone who's just starting out; we're still really reliant on the teachers here.

Ok, so let's start a new project. Now, I'm loading the editor here, and what you can see is the primary interface that most of our users use, this block interface where you just snap blocks together, so let's see how it looks. We go to this basic category and, for example, we get an icon and put it up in here. You can see the simulator here on the left already reloaded and ran this program. Let's say we take another one and then we change the icon to something else. What will happen now? It will be animated, because this is a loop that will run forever, and there is a delay built into show icon of, I guess, about half a second so that you see them. You can also do it once at the beginning of the program and, of course, then you get the expected result.

The way most of those things are programmed is using events. For example, when I press the A button, I want it to do something. I go to “Basic” and say I want to show “Hello” on the screen. It runs my program, I press the button, I get Hello. If you're wondering about the resolution of the screen, this is because that's what the device looks like. It's HD, five by five. Now, buttons are great, but actually there are some sensors on this device. This is actually something we found quite important: there are many electronics platforms where you can build stuff, but usually you have to connect a bunch of sensors to get anything running. In the classroom, this is really difficult, especially with younger kids; the cables are a mess and so forth.

We actually work mostly with devices that are much more integrated, that have some sensors and some actuators built in. This one has this screen and it also has a bunch of sensors, including an accelerometer, so I can, for example, react to a shake. When I shake it, I want to show a number or whatever. Because I am using the shake block, it's simulating the accelerometer with the mouse. If I was on the phone, it would actually use the phone's accelerometer, but here on the computer, it's simulating it with the mouse, so if I shake it, I get zero. If you find it hard, we also give you this button for doing the shake, but it's just fake.

Now, how do we get it on the device? I hit “Download” here and that gives me a file. This file shows up here, and the micro:bit itself shows up as a USB drive, so I just need to drag this file to the device and it's flashing the device. I mean, it's actually flashing an LED, which means that it is programming the device. Actually, dragging the file is usually the most difficult part of the process. I mean, this is easier. We get the heart, we get the other thing. If I press A, we get Hello. Once we get to the end of the Hello, if we shake it, we get zero. As I told you, the median time to blink is quite quick here.

The other cool thing about the micro:bit is that it has a radio. That radio can be used for Bluetooth, but we usually don't use Bluetooth, because Bluetooth is very difficult to set up in the classroom: you have to do all the pairing, and it will pair with your phone or it won't, and so forth. Usually, what we use the radio for is communicating between micro:bits, because we have a whole classroom of kids with these devices. What can we do? We go to the “Radio” category here, and then we say we want to send a number; we take the acceleration on the x-axis and keep sending that number.

Because I'm using the radio block, I'm getting two micro:bits in my simulator so I can see how it would work. I go to my radio and I get an event when I receive a number. Then what I want to do is plot that number on the screen. Now, if I'm moving this micro:bit, it's sending the number to the other one and the other one is plotting it on the screen. Similarly here. We can now drop this file on the device. I don't know what it's doing. Antivirus scanning, I suppose. It's usually fairly immediate. I have another one here. The micro:bit comes with a battery, so you can actually see that it works.

I'm going to flash the other one. We're going to flash the same program on it. After a few seconds, I'm moving this one and you see the other one is displaying the numbers. The cool thing about it is that, for example, with this you can do something like a remote-control car, because you have these connectors here, so you can connect different things. They are big, so you can put these big crocodile clips on them. I go to “Advanced,” and I find my Servo here. We have this block called “Map”; I want to map the received number, which is in this range, to 0 to 360. Now, I've connected this over here and it's actually showing it to me on the screen. When I move this one, I would get the little remote-control car.
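The “Map” block is just a linear rescaling. A quick sketch of the arithmetic in plain Java (the function name and the -1024..1024 accelerometer range are my assumptions, not MakeCode's actual implementation):

```java
public class MapBlock {
    // Project x from [inLow, inHigh] linearly onto [outLow, outHigh],
    // like MakeCode's "map" block
    static double map(double x, double inLow, double inHigh,
                      double outLow, double outHigh) {
        return outLow + (x - inLow) * (outHigh - outLow) / (inHigh - inLow);
    }
}
```

So an x-axis acceleration of 0 would land in the middle of the output range.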

I was showing you the blocks all the time, and this is because that's where most of our users are, but really what happens under the hood is that we have a TypeScript program here, so let me hide that so that you can see it. The TypeScript program corresponds pretty much one-to-one to those blocks. What you can do is go to the TypeScript and change it a little bit. There is full IntelliSense; this is the Monaco Editor that's used in VS Code. You can go ahead and change it, go back to blocks, and it will update the blocks. If you say something here that is not supported on the block side, then you'll get these grey blocks that you cannot edit.

The idea here is that users can get started with those blocks and upgrade themselves to TypeScript. This TypeScript here is labeled JavaScript, because teachers have absolutely no idea what TypeScript is; we are lucky if they know what JavaScript is, so it's to keep it simple. Also, when you look at these programs, they actually are JavaScript; there is not a single use of types here. You can use types, but most of these programs are quite simple and the users don't even use types. This class syntax, for example, is the same as in ECMAScript 6.

If you look here on the left, under the simulator, there's suddenly a file explorer, and that lets you look under the hood of the platform. This is, again, what I was talking about: the barrier to entry is very low, but where you can go is actually quite high. You can go ahead and write all those classes in TypeScript and so forth, and you can even go ahead and look at how we implemented all of that. Most of the runtime for the micro:bit, actually, you will see here.

If we go to the core library, for example, you see all the files that we use. For example, we use this showNumber block, and here it is implemented: it rounds the number to two digits after the dot and shows it as a string. Here, there is a bunch of comments that describe how the blocks look. They describe the color, the shape, and so forth. This is actually how you define the blocks for the users: you just define the TypeScript API, and if you just put a block annotation on it, it will show you some block. It will try to split the API name into words and so forth, but you can also go ahead and modify it so that it's more readable for the users.

This simulator is, of course, run in JavaScript, so you can even go ahead and look at the JavaScript that we generate from that code. One thing that you'll note in this program is that these things, for example, block for half a second, but there's no await or anything like that. We essentially adopt single-threaded execution, and at the lowest level, when an API blocks for a given amount of time or waits for an event, it just blocks, and during that time, other events can run. Otherwise, you cannot be interrupted, so it's not really like a multi-threaded program where you have to worry too much about other threads changing your data. It's more well-defined in that sense.

The other thing is that this is an extensible platform where you can add different extensions, which are just libraries or packages. We can load a bunch of popular extensions here, and you can go ahead and search for “Robot,” and you get different extensions for different robots. Usually those extensions are provided by hardware manufacturers that manufacture different accessories for the device, but they could be arbitrary pieces of code. Let's say that we add one of those, like this GiggleBot here. When we add it, we get a new category of blocks here for what this robot does, and we can also look at it here.

If we go back to JavaScript, we can look at another extension here; it's added, and it's localized in French. Localization is actually quite important in schools, as you can imagine. There is also another package that it depends on, these giggle sensors. When you look at some of those things, you see that these are maybe not the greatest programmers. It doesn't matter, it still works. The most important thing is that they expose these low-level APIs as blocks in a much higher-level way. I think that's about it for the micro:bit demo; let's go back to the talk.

Why Should You Care?

People do different projects with it. It's not just the device; you embed it in a project. That actually makes it much more accessible, because you combine a skill that you already have, like building a cardboard robot and putting the thing inside, with the skill that you don't, which is programming, and that makes you feel more empowered at the one you don't know how to do, which is what we really want. There are gazillions of projects. For example, down there is a step counter: you put it on your leg and it detects the shake event; that's your step counter. It's very simple. People do drones, and also things like humidity sensors and weather sensors for different science projects.

Why should we care? One answer is that the IoT is coming for real this time, apparently. Arm expects 1 trillion devices in 15 years; there are now maybe 5 billion of them, maybe 20, it's hard to find exact data. Someone has to program those things. Nowadays, most of those devices are programmed using C and C++, and there are probably at least 10 times as many web programmers as there are embedded programmers. The other thing is that those MCUs are getting cheaper and cheaper.

People used to use 8-bit or 4-bit MCUs, but the 32-bit ones that we run on are getting cheaper and cheaper. Because of that, high-level languages are entering this level of embedded programming. This includes us, but also Python via CircuitPython and MicroPython. There are also plain JavaScript interpreters like Espruino, iot.js, and so forth.

Not Convinced?

I know you're not convinced, so who thinks that it's easy to hire a developer today? That's essentially because there aren't enough of them, and all those boot camps and universities training graduates are not enough. Moreover, in the future many jobs will involve programming, even though those people won't be straight-up programmers; they will do some other things and have some programming on the side.

When you think about it, these high-level APIs and this high-level way of programming where you layer abstractions, this is how programming will look more and more in the future. Think about it: it's much easier to teach people how to put those blocks together. Then it's also quite important not to limit them. You have to let them actually do what they want in the same environment and lay one abstraction on top of another. Here, you could see that you could go from those blocks down to TypeScript. I'm not going to show it to you, but you can even go and look at and edit the C++ files of the runtime. When you think about it, you have this progression where we program at higher and higher levels of abstraction. People started programming with machine code, punching cards; they went to assembly, then they went to languages like Python and C, then they went to C++, then they introduced garbage collection.

That's maybe Java; then you have more of this dynamic programming with JavaScript. Nowadays, you mostly program using an even higher level of abstraction: you just go to npm, which has, whatever, half a million, a million packages, and find the one that you need. If you extrapolate, of course, you see where this is going. In some of those transitions, like between machine code and assembly, between C and C++, between JavaScript and npm, you can actually go to the lower level; you don't lose any performance because you can always go to the lower level. In some of them, you do lose some.

Here we're talking about a very low floor to start and a reasonably high ceiling, so this is actually what MakeCode is about. We usually pay for all those abstractions with some performance, but the question is how much we pay.

Richard’s Benchmark

This is a benchmark called the Richards benchmark; it moves stuff between queues and so forth. When you look at these numbers, what they show is how much slower each implementation is than the C implementation of that benchmark, on an embedded system, I think on this one. MicroPython is 300 times slower than C, and Duktape, which is a JavaScript interpreter, is 600 times slower than C. That seems quite a bit slower; how much are people willing to pay?

If you try with Node, it's about nine times slower than C on this particular benchmark. It seems that this is how much people are willing to pay, because they're using all of this. Static TypeScript, which is the one that we use here, is 16 times slower than C, which is actually quite a lot when you think about it, at least I feel like that. The other thing is that MicroPython is actually not so bad, because if you run CPython, it's about 1,000 times slower than C, which is probably why most people don't really use Python for performance-sensitive areas, unless, of course, you offload to a GPU; that's different.

It's not always that great. Here, for example, is a different benchmark where we just move stuff in an array. Here, Node is only three times slower than C, and if you get the benchmark simple enough, you can get Node pretty much to the level of C. Node is 3 times slower than C; we are 21 times slower than C, but still an order of magnitude faster than those interpreters.

Static TypeScript

How do we do it? We cheat a little bit. We don't really run a full-fledged JavaScript engine on that thing, because this little device here has 16 kilobytes of RAM, and if you also try to run Bluetooth on it, you are left with 3 kilobytes of RAM. You can pretty much forget about a full-fledged JavaScript engine.

What we do is compile the JavaScript, or the TypeScript, into machine code directly in the browser. It all runs as a web app and we don't go to the cloud to compile; we compile inside the browser. This is critical, again, in the schools, because the internet connection goes up and down, there are limits, and also, if you compile in the cloud, you have to pay for the cloud, and with millions of users, that becomes difficult. What we do is remove eval, because we compile ahead of time, and we also don't do prototype inheritance; we treat classes just like you would in Java. You just define a class like you can in ECMAScript 6, but while in ECMAScript 6 that just translates to prototype inheritance, we actually treat a class as a class.

You can still do higher-order functions like closures, we have a garbage collector, we have the “any” type, so it's not like we follow some very strict type discipline. The control flow is static in the sense that there is no eval and there is no prototype inheritance, but there are still virtual calls, for example, and interfaces. Data handling is often dynamic, meaning that there is only one number type, and it's the double number type, but usually on those devices, you don't really want to compute with doubles all the time because they have no FPU. If you try to do double addition, that's like 400 cycles, which is a bit slow. It's one [inaudible 00:25:29] machine without an FPU.

What we do is use 31-bit integers. If they overflow that, then we go to boxed doubles; those are typical optimization techniques used in things like this. We mostly target these 32-bit ARM Cortex-M microcontrollers with about that much RAM. We get decent performance, as you've seen on those slides, we have some more data, and sometimes it's close to V8. It's always slower than V8, but then again, forget about running V8 on that thing. We are always much faster than the interpreters that are usually used on those platforms. It's an open-source compiler and assembler, all implemented in TypeScript, and it all runs inside of that web app.
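The tagging scheme can be illustrated in plain Java (a hypothetical sketch of the general technique, not MakeCode's actual C++ runtime): a value that fits in 31 bits is stored shifted left with the low bit set as a tag, and anything else falls back to a boxed double:

```java
public class TaggedNum {
    static final long MIN = -(1L << 30), MAX = (1L << 30) - 1;

    // Encode a number: tagged 31-bit int when possible, boxed Double otherwise
    static Object encode(double v) {
        long i = (long) v;
        if (i == v && i >= MIN && i <= MAX) {
            return (int) ((i << 1) | 1);   // low bit = 1 marks "tagged integer"
        }
        return v;                          // overflow or fraction: boxed double
    }

    // Decode either representation back to a double
    static double decode(Object o) {
        if (o instanceof Integer) {
            return ((Integer) o) >> 1;     // arithmetic shift strips the tag
        }
        return (Double) o;
    }
}
```

The point is that the common case, small integer arithmetic, never touches the slow software floating-point path.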

TypeScript here was actually really critical to get the whole project running, because this is quite a complex web app. It is what allowed us to build a complex application, including the compiler, but also all the UI, and then we could also pull in components like the Monaco Editor, which is also written in TypeScript. The static typing and the ability to refactor are really what made this possible.

We do have a custom runtime. As I said, it's more Java-like classes, not prototype inheritance; I was talking about that already. We do support UTF-8, because many of the companies that make these things are Chinese and they're actually quite interested in UTF-8, or Unicode.

Does Performance Matter?

Does performance actually matter on those things? Well, eventually. When you look at the programs that I was showing on the micro:bit, it doesn't really matter how fast it blinks; it's fast enough. But given enough time and enough users, you will eventually find someone who hits against the limits that you have there, and they're actually a power user and they will complain loudly.

Moreover, if you think about IoT devices, most of the time those devices sleep, and then they wake up and do something. The periods in which they're awake are what determines your power usage, your energy usage. So performance matters for that as well; performance is really battery usage in that case. For the micro:bit in particular, memory consumption matters because you have very little memory; speed, not so much. In some places, faster is better, and that's why I'm showing you the gaming thing.

Arcade

I think this is the coolest feature that we ever made. It's here, it's called Arcade. Arcade is a system for writing these retro-looking games. It targets the hardware specs of this thing, so this is a 160 by 120 pixel screen, again, really HD compared to the micro:bit. We let you use 16 colors, but we let you change the palette. The screen is really capable of something like 16-bit color, but we don't have much memory, so we limit it to 16 colors, and it also gives the games this old-school look. Here you can see different tutorials, different example games, little platformers, little shooters, and so forth.

Let me just show you quickly how we write those games. You can go ahead and create a little sprite here. Kids love doing the sprites. You can also go ahead and find one from the gallery; let's take the beautiful duck, but then you can also modify it and put makeup on the duck. Again, we can save that. You get a little duck here, and then the other thing that we can do is, for example, set the acceleration in y to 100, and that should have the duck fall down. The basic programming model is the same, it's event-based, so we get an event when we press the button. Let's take our sprite and set the velocity in y to minus 100. Now I can have the duck fly. Then you have the things that come under it, and eventually, you end up with a game like that. All within blocks.

When you look at the actual code of this thing, it's about 60 lines, and most of it deals with the duck flapping its wings. That's actually not critical for the game, so you can get the game running in 20 lines of code. Again, very high level, but if you want to, you can go all the way down to the pixel level and change particular pixels.

You can actually go and do something quite involved. This is JavaScript, no blocks anymore. You get a little Wolfenstein game and it runs on the hardware as well. Actually, on this one, 27 FPS. I know because the version that really does every single pixel in TypeScript runs at 14 frames per second; then I added a new API to draw a line of pixels and scale it, and that makes it run at 27. Here, when you think about these games, the performance actually matters. If you now try to run them 10 times slower, the 14 FPS becomes 1.4 FPS and it's not fun anymore. That's an example where it actually matters.

If I hit download here, it shows me the different hardware devices that are supported. All of those things that you see here, you can buy. They are between $25 and $50. They could be cheaper, but no one has come up with cheaper ones. The microcontroller that we use to run this thing is about $2, so this board is probably maybe $5 to produce, which means it should be $15-$20 to sell. Again, cost matters for the schools. If you buy one, you’re going to be ok. If you buy 500, different story.

You can select the hardware that you want, and you can also run it on the Raspberry Pi – you can build the controller, connect it to a TV, and this is pretty cool, again, for a classroom demo.

We can download this one and it’s the same process as before. The device shows up as a USB drive, you drag the file over, and it just runs. What I actually wanted to show you is that you can even go here and look at the generated assembly code if you really want to. This is how low we let you go. The game library here is actually quite a bit more complicated than the micro:bit library. You can see that there are a bunch of classes – these are different pixels that fly on the screen. There are different classes, and we actually use quite a bit more of the language here than we did on the micro:bit.

The micro:bit has a very simple library, and most of the programs are simple. The good thing about the games is that they let you go in deep, in some sense, because on the micro:bit, most of the programs are five lines long – they’re either that simple, or they’re quite complicated, where you do these radio protocols, mesh protocols, and whatnot – and there isn’t much in between. With games, you can actually go up very gradually in complexity.

Takeaways

The Internet of Things keeps coming, and it will be programmed in high-level languages for the most part, because that’s where the programmers are. This has actually been the trend with all the programming that we’ve done – it went higher and higher level. This will happen for all the programming environments in the future; they will become more and more high level. We have to make it easy for people to program, because otherwise there won’t be enough people to program. Performance matters eventually. Static types are good for you: they let you compile programs for these devices, they let you build applications, and so forth.

Questions and Answers

Participant 1: You showed the radio with them communicating. Do they have any ID? You mentioned radio mesh, can you connect them through Java and stuff like that?

Moskal: Yes. It’s a very low-level radio protocol, it’s a broadcast protocol. The devices themselves have a MAC address, so you can use that for an ID.

Participant 1: A follow up on that – is there a gateway or something that you could connect to a computer and then do MQTT or something to the cloud?

Moskal: Yes, it has been done.

Participant 2: Is there a standard protocol for the manufacturers to follow? These devices – does just one company make them, or multiple companies?

Moskal: There are multiple companies making these things. In the past, we mostly worked with an existing device like the micro:bit and then built the editor for it, and so forth. With Arcade, we have a hardware specification that tells you what you’re supposed to do. There are different sources of MCUs that you can use, and we’ve actually seen all those companies just come to us and tell us, “We built this device,” and then we ask them to send us one or two so we can test them, and provided that they can sell them, we can put them up on the website. It’s fairly generic; we don’t have to add anything for it to work on your device. We have a way of specifying the actual hardware configuration inside the hardware itself, so that we auto-detect it and so forth.

Participant 3: Does the IDE ever warn you about performance, or warn you that RAM is about to be used up?

Moskal: I recently added memory tracing. Because we compile everything, we have full visibility over all the garbage collector roots, so we can actually trace all the objects that you have alive, and we can show you the number of objects. It’s not really exposed, especially to kids, not yet. We are working on it. The speed of the MCU doesn’t seem to be a problem so much, but the memory actually is, so people often run into issues when they write programs at a high level and run out of memory. The browser has infinite memory compared to those things. We’re working on it, not quite there yet.




Presentation: High Resolution Performance Telemetry at Scale

MMS Founder
MMS Brian Martin

Article originally posted on InfoQ. Visit InfoQ

Transcript

Martin: Welcome to the Bare Knuckles Performance track, and to my talk about high-resolution performance telemetry at scale. First, a little bit about me. I’m Brian [Martin], I work at Twitter. I’m a Staff Site Reliability Engineer. I’ve been at Twitter for over five years now. I work on a team called IOP, which is Infrastructure Optimization and Performance. My background is in telemetry, performance tuning, benchmarking, that kind of stuff. I really like nerding out about making things go faster. I’m also heavily involved in our Open Source program at Twitter. I have two Open Source projects currently, hopefully more in the future. I also work on our process of helping engineers actually publish open-source software at Twitter.

What are we going to talk about today? Obviously, high-resolution performance telemetry, but specifically we’re going to talk about various sources of telemetry, sampling resolution, and the issues that come along with it. We’re going to talk about how to achieve low cost and high resolution at the same time, and some of the practical benefits that we’ve seen at Twitter from having high-resolution telemetry.

Telemetry Sources

Let’s begin by talking about various telemetry sources. What do we even mean by telemetry? Telemetry is the process of recording and transmitting readings of an instrument. With systems performance telemetry, we look at instrumentation that helps us to quantify the run-time characteristics of our workload, and the amount of resource it takes to execute. Essentially, we’re trying to quantify what our systems are doing, and how well they are doing it. By understanding these aspects of performance, we’re then able to optimize for increased performance and efficiency. If you can’t measure it, you can’t improve it.

What do we want to know when we’re talking about systems performance? In Brendan Gregg’s “System Performance” book, he talks about the USE method. The USE method is used to quickly identify resource bottlenecks or errors. USE focuses on the closely related concepts of utilization and saturation, and also errors. When we’re talking about utilization and saturation, they are different sides of the same coin. Utilization directly correlates to making the most use of the amount of resource that our infrastructure has. We actually want to maximize this, within the constraint of saturation though. Once we have hit saturation, we have hit a bottleneck in the system, and our system’s going to start to have degraded performance as load increases. We want to try and avoid that, because that leads to bad user experience. Errors are maybe somewhat self-explanatory. Things like packet drops, retransmits, and other errors can have a huge impact on distributed systems, so we want to be able to quantify those as well.

Beyond USE, we begin to look at things like efficiency. This is a measure of how well our workload is running on a given platform or resource. We can think of efficiency in probably a variety of ways. It might be the amount of QPS per dollar that we’re getting. Or, we might be trying to optimize a target like power efficiency, so trying to get the most workload for the least amount of power expense. The other way to look at efficiency is how well we’re utilizing a specific hardware resource. We might look at things like our CPU cache hit rate, our branch predictor performance, etc. Another really important aspect of systems performance, particularly in distributed microservice architecture stuff, is latency though, which is how long it takes to perform a given operation.

There are a bunch of traditional telemetry sources that are generally exposed through command-line tools and common monitoring agents. These help us get a high-level understanding of utilization, saturation, and errors. Things like top, like the basic stuff that you would start to look at when you’re looking at systems performance. Brendan actually has a really good rundown of systems performance diagnostics in 60 seconds or something like that. These are the types of things that you might use for that.

You are looking at things like CPU time running, which can tell us both utilization and saturation, because we know that the CPU can only run for so many nanoseconds within each second, times the number of cores that you have. We also know that disks have a maximum bandwidth, and a maximum number of IOPS that they are able to perform. Network also has limits there. Network is where it becomes obvious when there are errors, because you have these nice little protocol statistics, and packets dropped, and all sorts of wonderful things. These are the traditional things that a lot of telemetry agents expose.
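As a rough sketch of what reading those traditional sources looks like, here is how you might derive CPU utilization from two /proc/stat-style snapshots. The counter values below are made-up sample data; on a real Linux box you would read the `cpu` line from /proc/stat twice, some interval apart.

```python
# Sketch: deriving CPU utilization from /proc/stat-style counters.
# The two snapshot strings are invented sample data for illustration.

def parse_cpu_line(line):
    # Fields after "cpu": user nice system idle iowait irq softirq steal ...
    fields = [int(x) for x in line.split()[1:]]
    idle = fields[3] + fields[4]  # idle + iowait count as not-busy
    return idle, sum(fields)

def utilization(snap_before, snap_after):
    idle0, total0 = parse_cpu_line(snap_before)
    idle1, total1 = parse_cpu_line(snap_after)
    busy = (total1 - total0) - (idle1 - idle0)
    return busy / (total1 - total0)

t0 = "cpu  4705 150 1120 16250 520 30 45 0 0 0"
t1 = "cpu  4805 150 1180 16310 525 32 48 0 0 0"
print(f"CPU utilization over the interval: {utilization(t0, t1):.1%}")
```

The same pattern – snapshot a counter, wait, snapshot again, difference the values – applies to the disk and network counters mentioned above.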

The next evolution is starting to understand more about the actual hardware behaviors. We can do this by using performance counters, which allow us to instrument the CPU and some of the software behaviors. Performance events help us measure really granular things, like we can actually count the number of cycles that the CPU has run. We can count the number of instructions retired. We can count the number of cache accesses, at all these different cache levels within the CPU.

This telemetry is more typically exposed by profiling tools, things like Intel VTune and the Linux tool perf. However, some telemetry agents actually do start to expose metrics from these performance counters. They become really interesting if we’re starting to look at how the workload is running on the hardware. You can start to identify areas where tuning might make a difference. You might realize that you have a bunch of cache misses, and that by changing your code a little bit you can start to actually improve your code’s performance dramatically.

Performance events for efficiency. We can start to look at things like power. We can look at cycles per instruction, which is how many clock cycles it actually takes for each assembly instruction to execute. CPI is actually a pretty common metric to start to look at how well your workload is running on a CPU. You can look at, as I said, cache hit rates. You can also look at how well the branch predictor is performing in the CPU. Modern CPUs are crazy complicated, and they do all sorts of really interesting stuff behind the scenes to make your code run fast. On modern superscalar processors, we should actually expect less than one cycle per instruction, because they are able to retire multiple instructions in parallel, even on the same core, but in reality that’s rare to see on anything but a very compute-heavy workload. Anytime you’re accessing things like memory, it’s more likely that you’re going to see multiple cycles per instruction, as the CPU winds up essentially waiting for that data to come in.
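To make the CPI arithmetic concrete, here is a hedged sketch that parses `perf stat -x,`-style CSV lines and computes CPI and its inverse, IPC. The counter values are invented, and the exact CSV columns vary by perf version, so treat the parsing as illustrative rather than definitive.

```python
# Sketch: computing cycles-per-instruction from two hardware counters,
# as reported by something like `perf stat -x, -e cycles,instructions`.
# Sample output lines below are made up for illustration.

sample = """\
48291734,,cycles
61208455,,instructions
"""

counters = {}
for line in sample.strip().splitlines():
    value, _unit, event = line.split(",")[:3]
    counters[event] = int(value)

cpi = counters["cycles"] / counters["instructions"]
ipc = 1 / cpi
print(f"CPI: {cpi:.2f}  IPC: {ipc:.2f}")
```

Here CPI comes out below one, which, per the talk, is what you would hope for on a superscalar core retiring multiple instructions per cycle.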

Then, we have eBPF, which is really cool and it gives us superpowers. It’s arguably one of the most powerful tools that we have for understanding systems performance. Really, I think a lot of us are just starting to really tap into it to help get better telemetry. eBPF is the enhanced Berkeley Packet Filter. As you might guess, it evolved from tooling to filter packets, but it’s turned into a very powerful tracing tool. It gives us the ability to trace things that are occurring both in kernel space and userspace, and actually executes custom code in the kernel to provide summaries, which we can then pull into userspace. While traditional sources can measure the total number of packets transferred, or total bytes, with eBPF we can actually start to do things like get histograms of our packet size distribution. That’s really exciting. Similarly, we are able to get things like block I/O size distribution, block device latencies, all sorts of really cool stuff, that help us understand down to actual individual events. We’re doing actual tracing with eBPF. It gives us a really interesting view about how our systems are performing.

eBPF is very powerful for understanding latency. We can start to understand how long runnable tasks are waiting in the run queue before they get scheduled to run on a CPU. We can actually measure the distribution of file system operation latencies, like read, write, open, fsync, all that stuff. We can measure how long individual requests are waiting on the block device queue. That helps us to evaluate how different queueing algorithms actually wind up performing for our workload. We can also measure things like outbound network connect latencies – the time between sending a SYN and getting a SYN-ACK back. All very cool stuff.

eBPF is also insanely powerful for workload characterization, which is an often neglected aspect of systems performance telemetry. It becomes very difficult to work with a vendor if they pose the question of, “What are your block I/O sizes?” and you’re just, “I don’t know. It’s this type of workload, I guess.” You actually need to be able to give them numbers, so they can help you. With eBPF, you can start to get things like the block I/O size distribution for read and write operations. Now, you can actually be like, “It’s not 4K. It’s actually this size.” You can give them the full distribution, and that really starts to help things. You can also look at your packet size distribution, so you can understand actually, if you’re approaching full MTU packets, or if jumbo frames would help, or if you’re just sending a lot of very small packets. A lot of my background is from optimizing cache services, like Memcached and Redis type stuff, where we have very small packet sizes generally. Yes, it can be interesting that you can start to hit actual packet rate limits within the kernel, way before you hit actual bandwidth saturation.
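The size and latency distributions eBPF tools report are typically power-of-two histograms summarized in the kernel. As an illustration of that bucketing – a userspace reimplementation, not the in-kernel code – here is the log2 scheme applied to some hypothetical block I/O sizes:

```python
# Sketch: the power-of-two histogram bucketing used by eBPF tools
# (in the style of bitesize/biolatency), reimplemented for illustration.

def log2_bucket(value):
    # Bucket n holds values in [2**n, 2**(n+1)).
    return value.bit_length() - 1 if value > 0 else 0

def histogram(values):
    hist = {}
    for v in values:
        b = log2_bucket(v)
        hist[b] = hist.get(b, 0) + 1
    return hist

# Hypothetical block I/O sizes in bytes from a cache-like workload.
io_sizes = [512, 700, 4096, 4100, 5000, 8192, 512, 600]
for bucket, count in sorted(histogram(io_sizes).items()):
    lo, hi = 2 ** bucket, 2 ** (bucket + 1) - 1
    print(f"{lo:>6} -> {hi:>6} : {'*' * count} ({count})")
```

This is exactly the kind of answer – a full distribution, not a single average – that the vendor conversation above is asking for.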

Sampling Resolution

One of the most critical aspects about measuring systems performance is how you’re sampling it, and how often you’re sampling it. It all comes down to sampling resolution. The Nyquist-Shannon theorem basically states that when we’re sampling a signal, there is an inherent bandwidth limit, or a filter, that’s defined by our sampling rate. This imposes an upper bound on the highest frequency component that we can capture in our measured signal.

More practically, if we want to capture bursts that are happening, we need to sample at least twice within the duration of the shortest burst we wish to capture. When we’re talking about web-scale distributed systems type things, 200 milliseconds is a very long time actually. A lot of users will not think your site is responsive if it takes more than 200 milliseconds. When we’re talking about things like caches, 50 milliseconds is a very long time. You start to get into this interesting problem where now things that take a very long time on a computer scale are actually very short on a human scale. You actually need to sample very frequently, otherwise, you’re going to just miss these things that are happening.
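A tiny synthetic example of that low-pass effect: a per-second utilization series that idles at 20% with one 10-second burst to 95%. The minutely average smooths the burst almost entirely away. All numbers here are made up for illustration.

```python
# Sketch: how minutely averaging hides a sub-minutely burst.
secondly = [0.20] * 60
secondly[30:40] = [0.95] * 10  # a 10 s burst, well under one minute

minutely_avg = sum(secondly) / len(secondly)
print(f"minutely average: {minutely_avg:.1%}")  # looks calm
print(f"secondly peak:    {max(secondly):.1%}")  # the burst we'd miss
```

The minutely value lands around 32%, nowhere near the 95% peak – which is why the speaker argues you must sample at least twice within the shortest burst you care about.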

In order to demonstrate the effect that sampling rate has on telemetry, we’re going to take a look at CPU utilization on a random machine I sampled for an hour. There it is. This is sampled on a minutely basis, and is just the CPU utilization. The actual scale doesn’t matter for the purpose of our talk here. We can tell from this that our utilization is relatively constant. At the end, there’s like this brief period where it’s a little bit higher and then comes back down to normal. I would say that other than that, it’s pretty even.

No, that’s not actually how it is. Here, we’re seeing secondly data in the blue, with the minutely data in the red, and we now see a much different picture. We can now see that there are regular spikes and dips below this minutely average that we are otherwise capturing. Time series is also a lot fuzzier just in general, and arguably a lot harder to read. I wouldn’t want to look at this from thousands of servers at that resolution. It just becomes too much cognitive burden for me as a human.

Here, we’re actually looking at a histogram distribution of those values, normalized into utilization nanoseconds-per-second, so they’re comparable. Really, the most interesting thing is the skew of these distributions. The minutely data gives you this false impression that it’s all basically the same. The reason why the chart looks very boring on the righthand side is because the secondly data has a very long tail off to the right that just doesn’t show with this particular scaling. A log scale might have been better here. The secondly data, even though it’s right-skewed, is more of a normal distribution, whereas the minutely data is weird-looking.

This problem isn’t restricted to just sampling gauges and counters. It can also appear in things like latency histograms. rpc-perf, which is my cache benchmarking tool, has the ability to generate these waterfall plots. Essentially, you have time running downwards, and increased latency to the righthand side of the chart. The color intensity sweeps from black through all the shades of blue and then from blue all the way to red, to indicate the number of events that fell into that latency bucket.

Here, we’re looking at some synthetic testing, with rpc-perf against a cache instance. In this particular case, when I was looking at minutely data, my three-nines and four-nines latency was higher than I really expected it to be. When I took a look at this latency waterfall, you can see that there are these periodic spikes of higher latency that are actually skewing my minutely tail latencies. To me, this indicated that there was something we needed to dig into and see what was going on. It’s pretty obvious that these things are occurring on a minutely basis, and always at the same offset in a minute. These types of anomalies or deviations in performance can actually have a very huge impact on your systems.

How can we capture bursts? We just increase our sampling resolution. This comes at a cost. You have the overhead of just collecting the data, and that’s pretty hard to mitigate. Then, you also need to store it and analyze it. These are areas where I think we can improve.

Low Cost and High Resolution

How can we get the best of both worlds? We want to be able to capture these very short bursts, which necessitate a very high sampling rate, but we don’t want to pay for it. We need to think about what we’re really trying to achieve in our telemetry. We need to go back to the beginning, and think about what we’re trying to capture, and what we’re trying to measure. We want to capture these bursts, and we know that reporting percentiles for latency distributions is a pretty common way of being able to understand what’s happening at rates of thousands, millions of events per second, so very many.

Let’s see what we might be able to do by using distributions to help us understand our data. Instead of a minutely average as the red line here, we actually have a p50, the 50th percentile, across each of these minutes. This tells us that half the time within the minute the utilization is this red line value or less, but also that it’s higher than this the other half of the time. It actually looks pretty similar to the minutely average, just because of how this data is distributed within each minute.

If we shift that up to the 90th percentile, we’re now capturing everything except those spikes. This could be a much better time series to look at if we’re trying to do something like capacity planning. Particularly when you’re looking at network link utilization, you might have 80% utilization as your target for the network, so you can actually handle bursts that are happening. Looking at the p90 of a higher resolution time series might give you a better idea of what you’re actually using, while still allowing for some bursts that are happening outside of that.

Here, we’ve abandoned our secondly data entirely, but we’re reporting at multiple percentiles. We have the min, the p10, 50, 90, and the max. We can really get a sense of how the CPU utilization looked within each minute. You can tell that we still have this period of increased CPU utilization at the end. Now, we’re actually capturing these spikes. Our max is now reflecting those peaks in utilization. We can also tell that generally, it’s a lot lower than that. The distance between the max and the p90 is actually pretty intense. We can actually start telling that our workload is spiky just based on the ratio between these two time series. Particularly bursty workloads might be an indicator that there’s something to go in and look at, and areas where you could benefit from performance analysis and tuning.

The other really cool thing about this is that, instead of needing 60 times the amount of data to get secondly instead of minutely, now we only need five times the amount of data to be stored and aggregated, and we have still a pretty good indicator of what the sub-minutely behaviors are. Also, this is a lot easier to look at as a person. I can look at this and make sense of it. The fuzzy secondly time series is very hard, especially with thousands of computers. To me, this winds up being a very interesting tool for understanding our distribution without burning your eyes out looking at charts.

How can we make use of this? We want to be able to sample at a high resolution, and we want to be able to produce these histograms of what the data looks like within a given window. You’re doing a moving histogram across time, shoving your values in there, and then exporting these summaries about what the percentiles look like, what the distribution of your data looks like.
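As a sketch of that idea – not Rezolus’s actual implementation – here is a window of high-resolution samples reduced to the five summary values mentioned above (min, p10, p50, p90, max), using a simple nearest-rank percentile:

```python
# Sketch: in-process summarization — sample fast, keep a window of raw
# values, export only a handful of percentiles per reporting interval.

def percentile(sorted_vals, p):
    # Nearest-rank percentile on an already-sorted list.
    k = min(len(sorted_vals) - 1, int(p / 100 * len(sorted_vals)))
    return sorted_vals[k]

def summarize(window):
    s = sorted(window)
    return {
        "min": s[0],
        "p10": percentile(s, 10),
        "p50": percentile(s, 50),
        "p90": percentile(s, 90),
        "max": s[-1],
    }

# 60 hypothetical secondly CPU samples: steady near 30% with two spikes.
window = [0.30] * 58 + [0.85, 0.92]
print(summarize(window))  # five numbers instead of sixty
```

Note how the max captures the spikes that even the p90 misses – the max-to-p90 gap the talk uses as a burstiness signal.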

This leads me to Rezolus. I had to write a tool to do this. Rezolus does high-resolution sampling of our underlying sources. There is a metrics library that’s shared between Rezolus and my benchmarking tool, rpc-perf. It’s able to produce these summaries based on data that’s inserted into it. Rezolus gives us the ability to sample from all those sources that we talked about earlier, our traditional counters about CPU utilization, disk bandwidth, stuff like that. It can also sample performance counters, which gives us the ability to start to look at how efficiently our code is running on our compute platform. It has eBPF support, which is very cool for being able to look at our workload and latency of these very granular events.

Rezolus is able to produce insightful metrics. Even if we’re only externally aggregating on a minutely basis, we get those hints about what our sub-minutely data distribution looks like. The eBPF support has been very cool for helping us to actually measure what our workloads look like, and be able to capture the performance characteristics. These types of things would be unavailable to us otherwise. There is just no way to do it except to trace these things. eBPF is also very low overhead just because it’s running in the kernel, and then you are just pulling the summary data over, so it winds up being fairly cheap to instrument these things that are happening all the time. We’re able to measure our scheduler latency, what our packet size distribution is, our block I/O size distribution, very important stuff when you’re talking to a hardware vendor to try to optimize your workload on a given platform, or the platform for the hardware.

Rezolus is open-source. It’s available on GitHub. Issues and pull requests are welcome. I think that it’s been able to provide us a lot of interesting insight into how our systems are running. It’s helped us to capture some bursts that we wouldn’t have seen otherwise. I think it’s useful even for smaller environments. I’ve worked in small shops before I worked at Twitter, and I really wish I had a tool like this before. At small shops, often you don’t have the time to go write something like this, so it’s really great that Twitter allowed me to open source something like this so it can become a community thing. Oftentimes at smaller shops, your performance actually matters a lot, because you don’t necessarily have a huge budget to just throw money at the problem, so understanding how your system is performing and being able to diagnose run-time performance issues is actually very critical. At Twitter’s scale, even just a percent is a very large number of dollars, so we want to be able to squeeze the most out of our systems.

How Has It Helped in Practice?

In practice, how has Rezolus helped us? Going back to this terrible latency waterfall. We actually saw something like this in production. We measured CPU utilization when we were looking at it in production. Periodically, the CPU utilization of the cache instances was bursting. Rezolus, in addition to providing this histogram distribution thing, also looks for the peak within each rolling interval, and will report out the offset from the top of the second that the peak occurred. That can be very useful to start correlating with your logs and your tracing data within your environment, and help to really narrow down what you’re looking at.

Rezolus was able to identify this peak in CPU utilization, which was twice what the baseline was, and it had a fixed offset in the minute. We narrowed it down to a background task that was running and causing this impact. For the particular case of cache we were able to just eliminate this background task, and that made our latency go back down to normal.

Rezolus has also helped us detect CPU saturation that happened on a sub-minutely basis. This is something where if you have minutely data, you’re just not going to see this at all, because it winds up getting smoothed out by this low pass filter that the Nyquist-Shannon theorem talks about. Rezolus was able to detect this in production, and capture and reflect the CPU saturation that was happening. That helped the backend team actually go in and identify that our upstream is sending us this burst of traffic at that period of time. They were able to work with the upstream service to smooth that out and have it not be so spiky.

Summary

In summary, there are many sources of telemetry to understand our systems performance. Sampling resolution is very important, otherwise you’re going to miss these small things that actually do matter. In-process summarization can reduce your cost of aggregation and storage. Instead of taking 60 times the amount of data to store secondly time series, you can just export maybe five percentiles. The savings becomes even more apparent when you are sampling at even higher rates, you might want to sample every hundred millisecond, or 10 times per second, or something like that. You might want to sample even faster than that for certain things. Really, it all goes back to what is the smallest thing I want to be able to capture. As you start multiplying how much data you need within a second, the savings just becomes even more. Having high-resolution telemetry can help us diagnose run-time performance issues as well as steer optimization and performance tuning efforts. Rezolus has helped us at Twitter to address these needs, and it’s, again, available on GitHub.
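The storage arithmetic in that summary can be spelled out directly, assuming for illustration that every exported data point costs the same number of bytes:

```python
# Sketch: storage cost of raw secondly data vs five exported percentiles
# per minute. The 8-byte cost per point is an assumed figure.

BYTES_PER_POINT = 8
raw_secondly = 60 * BYTES_PER_POINT  # 60 samples per minute
summarized = 5 * BYTES_PER_POINT     # min, p10, p50, p90, max
print(f"raw: {raw_secondly} B/min, summarized: {summarized} B/min, "
      f"reduction: {raw_secondly // summarized}x")
```

At 10 samples per second the raw side grows to 600 points per minute while the summary stays at five, which is the "savings become even more apparent" point above.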

Questions and Answers

Participant 1: You mentioned this sub-second sampling. What are the tools you use to sample, say, CPU utilization in microseconds or disk usage in microseconds? What are those basic tools that you are using?

Martin: Those basic or traditional telemetry sources are exposed by procfs and sysfs, so you are able to just open those files and read out of them periodically.

Participant 1: Ok, procfs, but then you need root access maybe, or is it something that any person can access?

Martin: That kind of stuff actually does not need root access. When you start looking at things like perf events, you do need sysadmin privileges, essentially. You do need root access, or for the binary to have CAP_SYS_ADMIN, which would be the capability. eBPF also requires high-level access, because you’re injecting code into the kernel. Actually, there’s something in, I forget what recent kernel version, where they’re starting to lock down things like that by default; it seems like they’re still trying to work out how to deal with that exactly.

Participant 1: Another question is, how do you do it? I don’t think on production you can do it. You have to come up with a system that is very similar to production and where you run these tests to make sure that, or are you running on the production itself?

Martin: We run Rezolus on our production fleet. Having the tooling so we’re able to rapidly identify run-time performance issues, and give teams insights to help diagnose those issues, root cause them, and resolve them, is definitely worthwhile. Rezolus winds up being not super expensive to run, actually. It takes about 5% of a single core, and maybe 50 megabytes of resident memory, to sample at, I believe, 10 times per second with all the default samplers. So none of the eBPF functionality, but all of the perf, and all of the traditional systems performance telemetry, in what I think is a very small footprint. Definitely the insight is worth whatever cost that has to us.

Participant 2: In your example, does Rezolus record the diagnostics once something is abnormal in terms of statistics? If it does, what does the footprint look like, and the cost of it?

Martin: That’s a very interesting question about whether Rezolus records as it detects anomalies. That’s something that we’ve been talking about internally, and I think would be a very cool feature to add. One could imagine Rezolus being able to use that metrics library to easily detect that something abnormal is happening, and either increase its sampling resolution, or dump a trace out. Rezolus winds up in a very interesting position in terms of observability in telemetry and stuff like that, where it could pretty easily have an on-disk buffer of this telemetry available at very high resolution. This hasn’t been implemented yet, but is definitely a direction that I want to be able to take the project in.

Participant 3: Is it possible to extend Rezolus to provide telemetry on business KPIs as opposed to just systems KPIs?

Martin: The answer is yes. Rezolus also has this mode that we use on some of our production caches, the Memcache compatible servers. We use Twemcache at Twitter, which is our fork. There’s a stats thing built into the protocol. Rezolus can actually sit in between, and sample at higher resolution the underlying stats source, and then expose these percentiles about that. We actually use that on some of our larger production caches, to help us capture peaks in QPS, and what the offset is into the minute. Yes, one could imagine extending that to capture data from any standard metrics exposition endpoint, and ingest that into Rezolus, and then expose those percentiles.

Participant 3: Is there some development work that’s needed to do that, or can it do it as-is?

Martin: There would be probably some development work needed that would actually be, I would call that almost trivial. If one were familiar with the codebase, it would definitely be trivial. Yes, there’s already a little bit of that framework in there, just from being able to ingest stats from Twemcache, but it would need to be extended to pull from HTTP and parse JSON, or something like that, whatever the metrics exposition format is. This is actually another thing that we’ve been talking about internally, because Twitter uses Finagle for a lot of things, and our standard stack exposes metrics via an HTTP endpoint. The ability to sit in between our traditional collector and the application and provide this increased insight into things, without that high cost of aggregating, is very interesting to us. That I think is work that is likely to happen, although there are some competing priorities right now. That would be the stuff that I would definitely welcome a pull request on.

Participant 4: I have just a followup on the CPU and memory overhead of Rezolus itself. Do you consider that low enough that you can freely run it all the time on all the prod servers? Do you prefer to have a sample set of servers? Or would either of those approaches have an impact on the data volume and counts for you [inaudible 00:42:31]? What’s your approach to [inaudible 00:42:34] in production and when do you [inaudible 00:42:36]?

Martin: The question was about the resource footprint of Rezolus and the tradeoff between running it all the time and getting that telemetry from production systems versus the cost and do we do sampling, or run it everywhere all of the time.

My goal with the footprint that it has is that it could be run everywhere all the time in production. We always have telemetry agents running. I think that Rezolus can actually take over a lot of what our current agent is doing, and fit within that footprint, and still give us this increased resolution. Yes, I’m trying to think of what the current percentage of rollout is, but the goal is to run it everywhere all the time. In some resource-constrained areas, so if you think about like containerization environments, where you have basically a small system slice dedicated that isn’t running containers, that’s where things become a little more resource-constrained. Again, I believe that Rezolus can help take away some of that work that’s being done, and will be able to actually get rid of existing agents. The resource footprint of things like Python telemetry agents and stuff like that, it just gets really unpredictable. Rezolus is in Rust, so the memory footprint is actually very easy to predict. The CPU utilization is pretty constant. It doesn’t have any GC overheads, or stuff like that. I think that we’ll be able to shift where we’re spending resource, and be able to run Rezolus everywhere all the time. I think that we will be looking into things like dynamically increasing the sampling resolution, having an on-disk buffer, being able to capture tracing information, possibly integrated into our distributed tracing tool, Zipkin, stuff like that, and be able to start tying all these pieces together. Everywhere all the time.

Participant 5: You use Rezolus, you find a spike in performance. Then, how do we magically map that back to what’s going on? What application’s causing the performance?

Martin: That is where things become more art than science. Typically, I’ve had to go in and trace on the individual system and really look to understand what’s causing that. We can look at things like which container is using the resource at that point of time. I think at some point you wind up falling back onto other tools while you’re doing root cause analysis. Those might be things like getting perf-record data. That might be things like looking at your application logs, and just seeing what requests were coming in at that point in time, or doing more analysis of the log file.



Presentation: Data Mesh Paradigm Shift in Data Platform Architecture

MMS Founder
MMS Zhamak Dehghani

Article originally posted on InfoQ. Visit InfoQ

Transcript

Dehghani: For the next 50 minutes I’ll talk about data mesh, a long-overdue paradigm shift in data architecture. I did try to resist using the phrase “paradigm shift,” but I couldn’t. It ended up in the title, and it’s one of the most used and abused phrases in our industry. Have you heard the phrase? Do you know the origin of the phrase?

Participant 1: Thomas Kuhn, “The Structure of Scientific Revolutions.”

Dehghani: Thank you very much. You are one of the very few people who actually know the origin of this. The other person who knew the origin of this was our CTO, Rebecca Parsons. As you rightly said, in 1962, an American physicist, historian of science, and philosopher of science wrote this book, “The Structure of Scientific Revolutions.” He coined the term paradigm shift in this book, which was very controversial at the time. He actually made quite a few scientists upset.

What he shared in his book was his observations about how science progresses through the history. What he basically said was scientists start their journey in terms of progressing science in this phase he called normal science, where, essentially, scientists are working based on the assumptions and theories of the existing paradigm. They’re looking and doing observations to see what they expect to see, what they expect to prove. Not a whole lot of critical thinking is going on there, and you can imagine why scientists weren’t so happy about this book.

After that, they start running into anomalies. They’re making observations that don’t quite fit the current norm, and that’s when they go into the phase of crisis. They start doubting what they believe to be true, and they start thinking out of the box. That’s where the paradigm shift happens to the revolutionary science. Essentially, we’re going from incremental improvements in whatever scientific field we are to a completely new order. An example of that, when scientists couldn’t make sense of their observations in subatomic level, we had the paradigm shift from the Newtonian mechanics to quantum mechanics.

What does that have to do with modern data architecture? I think we are in that crisis phase of the Kuhnian observation. The paradigm that we have adopted for 30, 40, 50 years about how to manage data doesn’t really solve our problems today. The inconvenient truth is that companies are spending more and more on data. This is an annual survey, NewVantage, of Fortune 1000 companies, in which they surveyed the leaders.

What they found is that we’re seeing an immense increase in the pace of investment. Over the course of one year, the budgets being spent have grown to between 50 million and 500 million dollars and above, despite the fact that the leaders in those organizations are seeing a decline in their confidence that that money is actually giving measurable results. Even though there are pockets of innovation in terms of using data, we don’t have to go far. We just look around Silicon Valley, where we see how digital natives are using data to change their businesses.

The incumbents and a lot of large organizations are measuring themselves as failing on any transformational measure. Are they using data to compete? Are they using analytics to change their business? Have they changed their culture? While I don’t want to underestimate the amount of work that goes into the multifaceted change and transformation in organizations to actually use data to change the way we behave, changing our culture, changing our incentive structure, changing how we make decisions, technology has a big part in it. This is an architecture track, so that’s where I’m going to focus.

Data Technology Solutions Today

The current state is that the current accepted norm and paradigm has put this architectural landscape into these two different spheres with hardly much intersection. We have the sphere of operational systems. That’s where the microservice is happening. That’s where the systems running the business are operating, your e-commerce, your retail, your supply chain. Really, we’ve seen an immense amount of improvements over the last decade in how we run our operational businesses. You just have to go to microservices track or DevOps track to see how much we have moved forward.

Then on the other side of the organization, down the hall in the data department, we are dealing with the big data analytical architecture. Its purpose is, “How can I optimize the business? How can I run the business better so I can upsell, cross-sell, personalize the experience of my customer, find the best route for my drivers, see the trends of my business, BI, analytics, ML?”

That has very different architectural patterns and paradigms that we’ve accepted. If you think about that sphere of big data architecture, there are three big generations of technology that I’ve seen working with a lot of clients, starting with data warehousing. Do you know when the first writing, research, and implementations of the data warehouse entered the industry? In the 70s. The first research papers came in the late 60s, and the data marts and implementations of it were in the 70s. We had data warehousing. We improved. In 2010, we evolved to the data lake, and now the data lake on the cloud. If you look at the implementations of the existing paradigms, the job of data warehousing has always been: get the data from the operational systems, whether by running some job that goes into the guts of the database and extracts data.

Before you use the data, try to model it into this model that’s going to solve all the problems, like world hunger, so we can do all sorts of analysis on it: model it into snowflake schemas or star schemas, run a bunch of SQL-like queries over it so we can create dashboards and visualizations, and put a human behind the analytical system to see what the heck is going on around that business.

The type of technologies that we’ve seen in this space – by the way, disclaimer, this is no endorsement of any of these technologies; it’s just a random selection of things that you might see in the wild as representative technology to support data warehousing. You have things from the cloud providers, like BigQuery, or Power BI if you’re on Azure, that give you the full stack to get the data into hundreds of tables and be able to query them in different ways. Then you have your dashboard and analytics on top for your reporting.

Data warehousing, which we used for about 40 years, has been problematic at scale. This notion that we can get data from all these different complex domains and put it in one model, thousands of tables and thousands of reports, and then really use that in an agile and nimble way, has been an unfulfilled promise. We improved, we evolved, and we said, “You know what, don’t worry about that whole modeling we talked about, just get the data out of the operational systems, bring it to this big, fat data lake in its original form. Don’t do so much modeling; we’ll deal with modeling afterwards.”

Then we throw a few data scientists in to swim in this data lake and figure out what insights they can discover. Then we model the data for downstream consumption in a fit-for-purpose way, whether that’s specific databases or a data warehouse down the line. That has also been problematic at scale. The data department running Hadoop clusters or other ways of storing this big data hasn’t been that responsive to the data scientists that need to use that data.

The type of technology that we see around here: big storage like Blob Storage, because now we’re talking about storing data in its native format, so we go with plain Blob Storage. We have tools for processing the data, Spark and so on, to join, to filter, to model it, and then we have orchestrators, like Airflow and so on, to orchestrate these jobs. A lot of clients that I work with are still not satisfied. They still don’t get value at scale in a responsive way from the data lake.

Naturally, the answer to that is, get the lake onto the cloud. Cloud providers are racing and competing to get your data onto the cloud and provide services that are easier to manage. They’re doing a great job, but essentially, they’re following the same paradigm. This is a sample solution architecture from GCP. I can promise you, if you google AWS or Azure, they look pretty much the same. You’ve got, on the left-hand side, this idea that your operational systems, your OLTP, everything, through batch, through stream processing, throw it into the data lake and then downstream model it into BigQuery or Bigtable if you want to be faster, and so on.

Look Convincing?

That looks wonderfully convincing. We’re wiring fabulous technology together to shove the data from left to right into this big cloud. I want to step back for a minute, look from a 50,000-foot view at the essential characteristics that are commonly shared across these different solutions that we’ve built, and get to the root cause of why we’re not seeing the benefits that we need to see. At a 50,000-foot view, I can promise you that I’ve seen so many enterprise data architectures that pretty much look like this. Obviously, they’re drawn with fancier diagrams than my squiggly hand drawing.

Essentially, it’s one big data platform, data lake or data warehouse, and its job is consuming data from hundreds of systems, the yellow-orange boxes that are drawn across the organization or beyond the bounds of the organization; cleansing, processing, and serving it; and then satisfying the needs of hundreds of consumer use cases: feed the BI reports, empower the data scientists, train the machine learning algorithms, and so on.

If you look at that technology, the solution architecture that I showed you, there is nowhere a discussion of the domains, of the data itself. We always talk about throwing the data in one place. In this monolithic architecture, the idea of domains, the data itself, is completely lost. The job of the architects in an organization, when they find themselves with this big architecture, is to somehow break it down into pieces, so that they can assign different teams to implement the functionality of the different boxes here.

This is one of the ways that companies at scale are trying to break down their architecture into smaller pieces. They design ingestion services, services that are getting the data out of the devices or operational systems. They have the processing team that is building the pipelines to process that, and there are teams that are working on the APIs or downstream databases to serve them.

I’m very much simplifying. Behind this is actually a labyrinth of data pipelines stitched together. When you step back for a minute, what are we seeing here? We’re seeing a layered architecture as the top-level decomposition. It’s been decomposed based on its technical capability: serving, ingesting, and so on. The boundaries are the technical functionality. If you tilt your head 90 degrees, you have seen this before. We have seen layered enterprise architecture where we had UI, and business logic, and databases underneath.

What was wrong with that? We moved from that to microservices. Why? Because change doesn’t happen within, and is not constrained to, these boxes that we’ve drawn on the paper. Change happens orthogonally to these layers. If I want to introduce a new signal that I want to get from my device and then process, if I want to introduce a new source or introduce a new model, I pretty much have to change all of these pieces. That’s a very friction-full process: the handover, the handshake, making sure that change consistently happens across those layers. If you come down a little bit closer and look at the life of the people who actually build this architecture and support it, what do we see?

We see a group of people, siloed: data engineers and ML engineers in the middle, stuck between the world of operational systems that generate this data and the world of consumers that need to consume the data, without any domain expertise. I really don’t envy the life of the data engineers I work with. I’m hoping that we can change the life of these data engineers right here, right now, from here on. What happens is that the orange people that are running the operational systems have no incentive to provide their analytical data, those historical snapshots, the events and realities and facts of the business, to the rest of the organization in an easily consumable way.

They are incentivized to run their operational business. They are incentivized to run that e-commerce system and build a database that is optimized to run that e-commerce system. On the other side, the purple folks are just hungry for the data. They need the data to train the machine learning, and they’re frustrated because they constantly need to change it and modify it, and they’re dependent on the data engineers in the middle. The data engineers are under a lot of pressure because they don’t understand the data coming to them. They don’t really have the domain expertise. They don’t know how the data is being used.

They’ve been essentially siloed based on tools expertise. Yes, we are at that point in the evolution and growth of the technology where the data tooling is still a fairly niche space. Knowing Spark and Scala and Airflow is a much more niche skill set than general software engineering. We’ve seen these silos before. We saw the silo of DevOps and removed the wall. The wall came down, and we brought the folks together. We created a completely new generation of engineers, called them SREs, and that was wonderful, wasn’t it? With silos, we just have a very difficult process, full of friction.

The stats, just to show the skill set gap that we are facing and will continue to face with a wall in between, are stats that you can get from LinkedIn. The last time I searched was a few weeks back; I doubt things have changed much in three weeks. If you look for data jobs open today with the label “data engineer,” you find about 46,000 jobs open on LinkedIn. If you look for people who are claiming to be data engineers on the platform, you see 37,000 folks. I’m pretty sure all of them are in good jobs with good pay. There’s this huge gap in the skill set that we can’t close while we silo the people.

This centralized monolithic paradigm was great, maybe, at a smaller scale. The world we live in today is a world where data is ubiquitous. Every touchpoint, every action and interaction is generating data, and businesses are driven to innovate. That cycle of innovation: test, and learn, and observe, and change, requires constant change to the data, and modeling, and remodeling. This centralized system simply doesn’t scale: a centralized monolithic system that has divided the work based on the technical operation, implemented by a silo of folks.

Going back to Thomas Kuhn’s observation, we had the data warehouse, and the lake, and the lake on the cloud. What have we been doing for 40, 50 years? We’ve been stuck in that normal science. We believe that the only way we get use out of data is getting it into a big, fat data lake platform and getting our arms around it so we can make sense of it. This centralization was the dream of the CIOs of 30 years ago: “I have to get the data centralized because it’s siloed in these databases that I can’t get into.” That’s the paradigm shift I’m hoping to introduce.

Where Do We Go From Here?

Let’s talk about data mesh. Hopefully, so far, I’ve nudged you to question the existing paradigm. I’m going to go a bit top-down – my mental model is fairly top-down – talk about the principles that drive this change, and then go deeper into some of the implementations and hopefully leave you with a couple of next steps. The principles underpinning data mesh are basically the ingredients of the best and most successful projects that we have had globally at ThoughtWorks. It’s applying the learnings of modern architecture that we’ve seen in the adjacent world of operational systems and bringing that to data. The very first one is decentralization.

How can we apply domain-driven thinking and distributed architecture to data? How can we hide the complexity of the infrastructure that runs and operates the big data? I don’t want to trivialize that; it is very hard to operate a Kafka cluster at scale. It is very difficult to run your Spark cluster. How can we abstract that away into self-serve infrastructure with platform thinking? And to avoid those silos of hard-to-find, hard-to-use, meaningless, untrustworthy data, how can we apply product thinking to really treat data as an asset? Finally, to have a harmonious and well-played ecosystem, what sort of governance do we need to bring to the table? I’m going to go into each of these one by one, and hopefully, [inaudible 00:19:44] better.

Domain-driven distributed architecture. Raise your hand [inaudible 00:20:00] of Eric Evans’, “DDD.” About 10%. Go on Amazon, [inaudible 00:20:14] just through this and get the book, or go to [inaudible 00:20:20] website [inaudible 00:20:21] stake.

What domain-driven design, or domain-driven distributed architecture, introduces is this idea of breaking down monolithic systems into pieces that are designed around domains, the business domains that you have. Right now, what we discussed was the way we’re trying to break down these centralized monolithic data platforms around pipelines, the jobs of the different pipeline phases. Now we are applying a different approach. We’re saying, find the domains. The examples I put up here are from health insurance, because that’s where I am. We’re waist-deep right now with a client implementing their next-generation data platform.

When you think about the operational domains, a lot of organizations are already divided that way. In the healthcare domain, you have your claim systems that provide claims, like the pharmaceutical or medical claims that you’re putting together. You might have your biomarkers, lab results, and so on. These are the different domains that you see in this space. If you think about these data domains as a way to decompartmentalize your architecture, you often find either domains that are very much closer to the source, where the data originates, for example claims. You have the claim systems already either accepting, or rejecting, or processing different claims.

Those systems are generating historical analytical data about claims. There are domains that are closer to the facts of the business as they’re getting generated. We’re talking about immutable data. We’re talking about historical data that is just going to be infinitely forever and ever generated and stay there. These data domains hardly change because the facts of the business don’t change as much. Of course, there are industries where I get a new app, and my app features changes, so the signals coming from that app constantly changes. Normally, in bigger organizations, these are more permanent and static data domains.

Then you have domains that you are refining, that you’re basically creating based on the needs of your business. These are aggregate data domains. I put up the example of patients’ critical moments of intervention, which is actually a wonderful use case, a data set that the client I’m working with right now is generating by aggregating a lot of information about the members, the members’ behavior, their demographics, their changes of address, and applying machine learning to find out, “What are those moments where, as an insurance provider, I need to reach out to my members and say, ‘You need to do something about your health. You just changed your address. You haven’t seen a doctor for a while. You don’t have a support network. Probably you haven’t picked a doctor or done your dental checkups. Go and visit Dr. AOB,'” so creating these data sets.

These are aggregate views, or, the holy grail of healthcare data right now, longitudinal patient records: aggregating all of your clinical visits and lab results into some time-series data. These are more consumer-oriented, purposely designed domains. Theoretically, we should always be able to regenerate and recreate these from those native data products that we saw. Where did the pipelines go? The pipelines still exist. Each of those data domains still needs to ingest data from some upstream place, maybe just a service next door that is implementing the functionality, or the operational systems.

They still have to cleanse it and serve that data, but those pipelines become a second-class concern. They become the implementation details of these domain data sets or domain data products. As we go towards the right-hand side, the orange and red blobs, we see more of the cleansing, more of the integrated testing to get an accurate source of data out, built into the pipelines. As you go towards the consumer-facing and aggregate views, you see more of the modeling, and transformations, and joins, and filters, and so on.

In summary, with distributed domain-driven architecture, your first architectural partition becomes these domains and domain data products, which I’ll go into in detail towards the end. I really hope that we stop using the data pipeline as a first-class concern. Every time I ask one of our data engineers, “Can you draw your architecture?” he just talks about pipelines. Pipelines are just implementation details of the layers. What really matters is the data itself and the domain it belongs to. There’s this wonderful concept, the architectural quantum, that Neal Ford and Rebecca Parsons, the co-authors of the “Building Evolutionary Architectures” book, coined, which is the smallest piece of your architecture, a unit of your architecture, that has high cohesion and can be deployed independently of the rest.

We are moving to a world where the architectural quantum becomes these domain data products, which are immutable, showing the snapshots and the history of the business. How can we avoid the problem we moved to centralization to escape, this problem of having silos of databases and data stores, now spread across these domains, where nobody knows what is going on and how to get to them? That’s where product thinking helps us.

I think it’s actually become quite common for us to think about the technical platforms that we build as products, because the developers and the data scientists are the consumers and customers of those products, and we should treat them so. If you ask any data scientist today, they would tell you that they spend 80% to 90% of their time actually finding the data they need, making sense of it, and then cleansing it and modeling it to be able to use it. Why don’t we apply product thinking to really delight the experience of that data scientist and remove that 80%, 90% waste?

What does that mean? That means each of these domains that we talked about, like the claims domain, becomes a product: the historical analytical data for it becomes a product. Yes, it has multiple shapes; it’s a polyglot data set. You might have streams of claims for the users that prefer real-time or near real-time events about the claims. It might have buckets of batch or historical snapshots for data scientists, because they love bucket files and batch processing for 80% of their job. For data to be an asset and be treated as such, I think there are some characteristics that each of these data products needs to carry.

First and foremost, they need to be discoverable. Chris [Riccomini] mentioned in the previous talk that the world of data cataloging is on fire, in a good way, and there are tons of different applications, because data discoverability is the first and foremost characteristic of any healthy data platform. Once we discover the data, we need to be able to programmatically address it so we get access to the data easily. As a data scientist or data analyst, if I can’t trust the data, I will not use it. It’s really interesting, because in the world of APIs and microservices, running a microservice without announcing your uptime and having an SLO is crazy.

You have to have an understanding of what your commitment to the rest of the organization is in terms of your SLOs. Why can’t we apply the same thing to the data? If you have maybe real-time data with some missing events and some inconsistencies that’s acceptable, you just got to explicitly announce that and explicitly support that for people to trust the data they’re using. Good documentation, description of the schema where the owners are, anything that helps data scientists or data users to self serve using your product.

Interoperability – if I can’t correlate distributed data, if I can’t join the customer from the sales domain to the customer from the commerce domain, I really can’t use these pieces of data. That interoperability, unifying the IDs or some of the other field formats to allow joins, filters, and correlation, is another attribute of a data product.

Finally, security. It’s such a privilege to talk while Chris [Riccomini] is here, because I can just point to his talk. He talks about RBAC and applying access control in an automated way at every endpoint, at every data product. These things just don’t happen out of good intentions. We need to assign people specific roles, and one particular role that we’re defining where we’re building this data product or data mesh is the data product owner: someone whose job is to care about the quality, the future, and the lifecycle of a particular domain’s analytical data, and to really evangelize this to the rest of the organization, “Come and see. I’ve got this wonderful data you can tap into,” and show how that can create value.

In summary: treating data as a product, bringing the best practices of product development and product ownership to data. If I were putting one of these cross-functional teams together with the data product owner, I would start by asking for one success criterion, one KPI to measure, and that is delighting the experience of the data users: the decreased lead time for someone to come and find that data, make sense of it, and use it. That would be the only measure I track first, and then, of course, the growth in the number of users using it.
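The characteristics described above (discoverable, programmatically addressable, trustworthy with explicit SLOs, well documented, interoperable, secure, with a named owner) can be made concrete as a self-description that each data product publishes to a catalog. The sketch below is hypothetical; the field names and example values are invented for illustration and are not taken from any particular data mesh implementation.

```python
from dataclasses import dataclass

@dataclass
class DataProductDescriptor:
    """Hypothetical self-description a domain data product might publish
    to a catalog so it can be discovered and trusted. All field names
    and example values are illustrative."""
    domain: str         # owning domain, e.g. "claims"
    name: str
    owner: str          # the data product owner accountable for it
    output_ports: dict  # polyglot access, e.g. stream and batch endpoints
    schema_url: str     # documentation of shape and semantics
    slo: dict           # explicit, announced guarantees
    access_policy: str  # pointer to the automated access-control policy

claims = DataProductDescriptor(
    domain="claims",
    name="medical-claims-events",
    owner="claims-data-product-owner@example.com",
    output_ports={
        "stream": "kafka://claims.events",     # near real-time consumers
        "batch": "s3://claims/snapshots/",     # data scientists' bucket files
    },
    schema_url="https://catalog.example.com/claims/schema",
    slo={"freshness_minutes": 60, "completeness": 0.99},
    access_policy="policies/claims-access.rego",
)
```

Announcing the SLO explicitly, even one that admits missing events, is what lets consumers decide whether to trust the data, mirroring the uptime commitments expected of microservices.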

If you’ve been listening so far, you’re probably wondering, “What are you asking of me?” A question that a lot of CIOs, the people who actually spend the money, ask me is, “You’re telling us to distribute the analytical data ownership to different domains and create different teams. Then what happens with all that technical complexity, the stack that needs to implement each of these pipelines?” Each of these pipelines needs some data lake storage, needs to have the storage account set up, needs to have the clusters to run its jobs, probably has to have some services. There’s a lot of complexity that goes into that.

Also, there are decisions such as wanting to have your compute closer to your data, or wanting to have, perhaps, a consistent storage layer. If we just distribute all that, we create a lot of duplicated effort and probably inconsistencies. That’s where our experience in the operational world of creating infrastructure as a platform comes into play. We can apply the same thing here. Capabilities like data discovery, setting up the storage account, all of the technical metalwork that we have to do to spin up one of these data products, can be pushed down to a self-serve infrastructure with a group of data infrastructure engineers to support it. Just to give you a flavor of the type of complexity that exists and needs to be abstracted away, here’s a short list.

Out of this list, if I had a magic wand and could ask for one thing, it would be unified data access control. Right now, it’s actually a nightmare to set up unified, policy-based access control across different mediums of storage. Whether you’re providing access control to your buckets, or you’re on Azure ADLS, versus your Kafka, versus your relational database, every one of them has a proprietary way of supporting that. There are technologies coming into play to support this, like future extensions to Open Policy Agent, and so on. There’s a lot of complexity that goes into that.
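To illustrate what "unified, policy-based access control" might mean, here is a deliberately tiny Python sketch of a single policy decision point that every storage medium (bucket, Kafka topic, relational table) would consult, instead of each system enforcing its own proprietary ACLs. The roles, resources, and rule format are invented; a real system might express this in something like Open Policy Agent's Rego rather than Python.

```python
# One shared policy table consulted by every storage medium, so a grant
# for "claims/" covers the claims bucket, topic, and tables alike.
# All roles and resource names here are invented for illustration.
POLICIES = [
    {"role": "data-scientist", "resource_prefix": "claims/", "actions": {"read"}},
    {"role": "claims-team", "resource_prefix": "claims/", "actions": {"read", "write"}},
]

def is_allowed(role: str, resource: str, action: str) -> bool:
    """Evaluate every rule; allow only on an explicit match (default deny)."""
    return any(
        role == rule["role"]
        and resource.startswith(rule["resource_prefix"])
        and action in rule["actions"]
        for rule in POLICIES
    )
```

The point of the sketch is the shape, one decision function behind every endpoint, rather than the rule format: that is what makes it possible to change a policy once and have it enforced across polyglot storage.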

In summary, platform thinking, or self-serve data infrastructure, is set up to handle all of the domain-agnostic complexity needed to support data products. If I set up one of these teams – and often, very early on in the projects, we set up a data infrastructure team – the metric they get measured by is the amount of time it takes for a data product team to spin up a new product: how much complexity they can remove from the job of those data engineers or data product developers, so that it takes very little time to get data out of one domain and provide it in a polyglot form to the rest of the organization. That’s their measure of success.

Anybody who's worked on distributed systems knows that without interoperability, a distributed system will just fall on its face. If you think about microservices and the success of APIs, we had one thing that we all agreed on: HTTP and REST. We pretty much all agreed that was a good idea, so we could start getting these services talking to each other based on some standardization. That was the key to the revolution of APIs. We need something similar here when we talk about independent data products providing data from different domains, so that the data can be correlated, joined, and processed or aggregated.

What we are trying to do is create a nexus of folks coming from different domains, a federated governance team, to decide which standardizations we want to apply. Of course, there are always going to be one or two data products that are very unique, but most of the time you can agree upon a few standards. The area we standardize first and foremost is how each data product describes itself so that it can be self-discovered: the APIs to find, describe, and discover a data product, which I will share with you in a minute. The other area that we work on very early is federated identity management. In the world of domain-driven design, there are often entities that cross the boundaries of domains. Customer is one of them, member is one of them, and every domain has its own identity for these, what we call polysemes.

There are ways to build inference services with machine learning that identify a customer across different domains, each of which holds a subset of the attributes, and generate a global ID. Now, as part of publishing a data product out of my domain, I can internally do a transformation to a globally identifiable customer ID, so that my data product is consistent with the other data products that have the notion of customer in them. Most importantly, we try to really automate all of the governance capabilities, or the capabilities related to governance. Federated identity management is one of them. Access control is another: how can we abstract away the policy enforcement and policy configuration for accessing polyglot data into the infrastructure?
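As a much-simplified stand-in for the identity service described above, the sketch below derives a deterministic global ID from a normalized matching attribute (here, email), so two domains publishing the same customer agree on one ID. A real system would use ML-based entity resolution over many attributes; all names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.UUID;

// Illustrative global customer ID derivation. Every domain runs this same
// transformation before publishing, so the IDs line up across data products.
public class GlobalCustomerId {

    public static String fromEmail(String email) {
        // Normalize so superficial differences don't produce different IDs.
        String normalized = email.trim().toLowerCase(Locale.ROOT);
        // Name-based (type 3) UUID: deterministic for the same input bytes.
        return UUID.nameUUIDFromBytes(
                normalized.getBytes(StandardCharsets.UTF_8)).toString();
    }
}
```

With this in place, the claims domain and the members domain can each publish their own data while still joining on a shared customer key.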

Let's bring it together. What is data mesh? If I can say it in one breath, in one sentence: a decentralized architecture where the unit of architecture is a domain-driven data set, treated as a product, owned by the domains or teams that most intimately know that data, either because they're creating it or because they're consuming and re-sharing it. We allocate specific roles that have the accountability and the responsibility to provide that data as a product, abstracting away complexity into a self-serve infrastructure layer so that we can create these products much more easily.

Real World Example

This is a real-world example from the health insurance domain. In the top corner, we see a domain we call Call Center Claims. These organizations have been running for 50 years, and they usually have some legacy system. This one is the online call center application, a legacy system whose owners and writers are no longer with us. We had no other option but to run change data capture as an input into a data product that runs within that same domain; it's not something separate. What this data product does is provide daily snapshots of the call center claims, because that's the best representation of the data that legacy system can provide.

In the other corner of the organization, we have a brand-new set of microservices handling the online claims information. The microservice is new, the developers are sharp, and they're constantly changing it, so they provide the claims events as a stream of events. We bundle a data product within that domain, the Online Claims data product, that gets data from the event stream for the claims and provides polyglot data outputs. One output is, similarly, the events it's getting, with a bit of transformation: it unifies the IDs and a few field formats that we agreed upon. Also, for data scientists, it provides [inaudible 00:39:54] files in some data lake storage.

A lot of the downstream organizations don't want to deal with the duality of whether a claim came through online or through the call center, so we created a new data product, called just the Claims data product. It consumes from the ports of the upstream data products, from Online and from Call Center, and aggregates them together as one unified stream. Obviously, it provides a stream; we want to still maintain the real-timeness of the online side.

For the legacy system, the events are actually synthesized from the daily changes, so they're not as frequent. We also provide a snapshot output, so now we have the claims domain. We can play this game forever and ever, so let's continue. You've got the claims; on the other side of the organization you've got the members: people who deal with registration of new members, changes of address, changes of marital status, and so on. They happen to provide member information right now as buckets of file-based information.
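A minimal sketch of the aggregate Claims product described above might look like this: it consumes the output ports of two upstream products (online claim events, and events synthesized from the call-center daily snapshots) and serves one unified, time-ordered stream. The `ClaimEvent` shape and all names are illustrative, not from the actual implementation.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical aggregate data product: merges two upstream event sources
// into one time-ordered unified stream of claims.
public class UnifiedClaims {

    public record ClaimEvent(String source, long timestampMillis, String claimId) {}

    public static List<ClaimEvent> merge(List<ClaimEvent> online,
                                         List<ClaimEvent> callCenter) {
        // Concatenate both upstream ports, then order by event time so
        // consumers see a single coherent timeline.
        return Stream.concat(online.stream(), callCenter.stream())
                     .sorted(Comparator.comparingLong(ClaimEvent::timestampMillis))
                     .toList();
    }
}
```

In practice this would be a streaming job rather than a batch merge, but the responsibility is the same: hide the online/call-center duality behind one output port.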

We had this wonderfully ambitious plan to use machine learning to aggregate information from claims, from members, and from a bunch of other upstream data products, and create a new data product that can provide the staff with information about members who need some intervention, for better health, fewer claims, and less cost for the insurance company. That downstream data product, the member interventions data product, actually runs a machine learning model as part of its pipeline. As you saw in the previous diagram, some of these are native, closer-to-source data products; as you move downstream, you move towards aggregated, new models that are consumer-oriented.

Look Inside a Data Product

One of the puzzles for a lot of new clients is, "What is this data product? What does it look like? We can't really understand what it is, because you're inverting the mental model." The mental model has always been upstream into the lake, and then the lake converted into downstream data; the system is very much a pipeline model. Instead, it looks like this; it looks like a little bug. This is your unit of architecture. I have to say, this is the first incarnation of it. We've been building this for a year now, but hopefully you will take it away, make your own bug, and build a different model. This is what we're building. Every data product that I just showed, like Claims, Online Claims, and so on, has a bunch of input data ports that get configured to consume data from upstream streams, or file dumps, or CDC, or APIs, depending on how they're consuming data from the upstream systems or upstream data products.

They have a bunch of polyglot output data ports. Again, it could be streams; this is the data they're serving to the rest of the organization. It could be files, it could be SQL query interfaces, it could be APIs, whatever makes sense for that domain as a representation of its data. There are two other lollipops here, what we call control ports. Essentially, every data product is responsible for two other things beyond just consuming and providing data. The first is being able to describe itself: all of the lineage, the metadata, the addresses of the output ports that people care about, the schemas, everything comes from this endpoint. If there is a centralized discovery tool, it would call this endpoint to get the latest information. And because of GDPR, or CCPA, or some of the audit requirements that governance teams usually have in the organization, we also provide an audit port.
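A sketch of what the self-description control port might return is below. The field names are assumptions made for illustration; the point is that port addresses, ownership, and schema references are served by the data product itself, so a central discovery tool only needs to call this one endpoint.

```java
import java.util.List;

// Hypothetical payload for the self-description control port of a data product.
public class DataProductDescriptor {

    public record OutputPort(String name, String protocol,
                             String address, String schemaRef) {}

    public record Descriptor(String name, String ownerTeam,
                             List<OutputPort> outputPorts) {}

    // Hand-rolled JSON rendering, to keep the sketch dependency-free.
    public static String toJson(Descriptor d) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"name\":\"").append(d.name())
          .append("\",\"owner\":\"").append(d.ownerTeam())
          .append("\",\"outputPorts\":[");
        for (int i = 0; i < d.outputPorts().size(); i++) {
            OutputPort p = d.outputPorts().get(i);
            if (i > 0) sb.append(',');
            sb.append("{\"name\":\"").append(p.name())
              .append("\",\"protocol\":\"").append(p.protocol())
              .append("\",\"address\":\"").append(p.address())
              .append("\",\"schema\":\"").append(p.schemaRef()).append("\"}");
        }
        return sb.append("]}").toString();
    }
}
```

A discovery tool that periodically polls every data product's description endpoint gets lineage and schema information that is always as fresh as the product itself.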

If you think about on-prem to cloud movement, if your upstream happens to be on-prem and your downstream happens to be on cloud, the copying from on-prem to cloud happens in the port configuration. If you look inside, you will see a bunch of data pipelines that are really copying the data around, transforming it, snapshotting it, or whatever they need to do to serve it downstream. We also deploy a bunch of services or sidecars with each of these units to implement the APIs I just talked about, the audit API and the self-description API.

As you can see, this is quite a hairy little bug. The microservices world is wonderful: you build a Docker image, and inside it has all the complexity of implementing its behavior. Here, every CI/CD pipeline, and there is an independent CI/CD pipeline for every data product, actually deploys a bunch of different things. For example, on Azure, the input data ports are usually Azure Data Factory, which provides the data connectors to get the data in; the pipelines are Databricks Spark jobs; the storage is ADLS. There's a whole bunch of things that need to be configured together as a data product.

Discoverability is a first-class concern. You can hit a RESTful endpoint to get the general description of each of these data products; you can hit an endpoint to get the schemas for each of the output ports you care about, the documentation, and so on. So where's the lake? Where's the data warehouse? They're not really in this diagram. I think the data warehouse as a concept, like having BigQuery for running fast SQL queries over your big tables, can be a node on the mesh. The lake as storage can still exist; you can have a consistent storage layer underneath all of these data products, but it's no longer a centralized piece of architecture.

The paradigm shift we're talking about is from centralized to decentralized ownership of the data; from monolithic architecture to distributed architecture; from thinking about pipelines as a first-class concern to domain data as a first-class concern; from data being a byproduct of what we do, an exhaust of an existing system, to being a product that we serve.

We do that by replacing siloed data engineers and ML engineers with cross-functional teams that have data ownership, accountability, and responsibility. Every paradigm shift needs a language shift. Here are a few tips on how we can use different language in our everyday conversations to change the way we imagine data: from this big divide between operational systems and analytical systems to one piece of architecture that is the fabric of our organizations.

Hopefully, I've now nudged you to question the status quo, to question a 50-year paradigm of centralized data architecture, and to get started implementing this decentralized one. This is a blog post that, to be honest, I got really frustrated and angry and wrote in a week. I hope it helps.




Q&A with Martijn Verburg and Bruno Borges of Microsoft Regarding Contributing to the OpenJDK

MMS Founder
MMS Rags Srinivas

Article originally posted on InfoQ. Visit InfoQ

Microsoft announced a few months back a reaffirmation of its commitment to contribute to the OpenJDK.

Since details about the exact contribution were scant, InfoQ caught up with Martijn Verburg, Principal Group SWE Manager, Java, and Bruno Borges, Principal Program Manager, Java, at Microsoft about the specifics of Microsoft's intended contribution to the OpenJDK.

Martijn Verburg and Bruno Borges talk about specific areas of contribution to the OpenJDK, especially in the areas of performance, Garbage Collection and so on while still maintaining a flexible approach based on community needs. They provide more details about the Stack Allocation ‘patch’ — a series of patches targeted at reducing object allocation rate of Java programs automatically by performing stack allocation of objects that are known to be local to a method.

InfoQ: Cloud-Native development has shifted to containers and polyglot development. Given that, how is the recent announcement related to Microsoft contributions to OpenJDK relevant to developers and architects?

Borges: Developers and architects that use Java around the world today are looking to move their Java workloads to cloud-native and container-based environments. Naturally, they still want to enjoy the various Java-based programming models, Java’s strength in interoperability and of course continue to leverage the domain and technical expertise of their teams.

Microsoft supports these customers on Azure today (e.g. on Azure Kubernetes Services (AKS), App Service, Big Data, Azure Spring Cloud, Functions and others) and we will improve that experience by enhancing the underlying OpenJDK runtimes currently in use by these services, especially as it pertains to performance in cloud-native, container-based environments.

InfoQ: Can you outline some of the investments that Microsoft is planning to make in Java and OpenJDK for the benefit of all Java developers and users?

Verburg: Microsoft will be mainly investing time and technical expertise into OpenJDK's runtime performance. This includes, but is not limited to, cold startup time, warm startup time, reduction in memory pressure (which reduces Garbage Collections and footprint), improvements to various GC algorithms, and auto-configuration of the JVM for certain types of workloads.

Microsoft will also be helping out in tooling and telemetry and is already looking to help with Java Flight Recorder and related features. We see tremendous potential with the new JEP 349 JFR Event Stream API.
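For context, the JEP 349 API mentioned above (available since JDK 14) lets a program subscribe to JVM telemetry events in-process as they are emitted, rather than parsing a recording file after the fact. A minimal sketch, with illustrative class and method names of our own:

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import jdk.jfr.consumer.RecordingStream;

// Minimal JFR Event Streaming example: count jdk.CPULoad events observed
// during a short window.
public class JfrStreamSketch {

    public static int countCpuLoadEvents(Duration window) {
        AtomicInteger count = new AtomicInteger();
        try (RecordingStream rs = new RecordingStream()) {
            // Ask the JVM to emit CPU load samples frequently.
            rs.enable("jdk.CPULoad").withPeriod(Duration.ofMillis(100));
            rs.onEvent("jdk.CPULoad", event -> count.incrementAndGet());
            rs.startAsync(); // non-blocking; events arrive on a background thread
            try {
                Thread.sleep(window.toMillis());
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        } // close() stops the stream
        return count.get();
    }
}
```

The same pattern works for GC, allocation, and lock-contention events, which is why streaming JFR is attractive for the kind of telemetry work described in the answer above.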

InfoQ: At this point Microsoft seems to be taking a very flexible approach towards contribution. Can you comment a bit more on specifics, like what areas does Microsoft want to make an impact with the OpenJDK, timelines, etc.? Are Microsoft engineers already working on parts of OpenJDK?

Verburg: Yes, we've delivered our first 4-5 patches into OpenJDK and are working closely with the stewards of Java (Oracle) and other major OpenJDK players (e.g. Red Hat and IBM) to identify where we can best collaborate to serve our customers' needs.

Here are some of those patches:

We're also working on a Stack Allocation 'patch'. This series of patches targets reducing the object allocation rate of Java programs automatically by performing stack allocation of objects that are known to be local to a method (or in compiler terms, 'non-escaping'). We're working through various industry benchmarks to make sure we haven't caused regressions, and are also integrating some great suggestions from Oracle and other OpenJDK vendors' VM engineers. We expect to raise a JDK Enhancement Proposal (JEP) for this work soon.
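To illustrate the kind of code this work targets, consider the sketch below (our own example, not from the patches): the `Point` never escapes `distanceSquared`, so after escape analysis the JIT can avoid a heap allocation for it, via scalar replacement in HotSpot today, or via true stack allocation with the patches described above.

```java
// Illustrative example of a non-escaping allocation.
public class NonEscaping {

    record Point(double x, double y) {}

    public static double distanceSquared(double x1, double y1,
                                         double x2, double y2) {
        // p is local to this method and is never stored, returned, or passed
        // anywhere it could outlive the call: it is "non-escaping".
        Point p = new Point(x2 - x1, y2 - y1);
        return p.x() * p.x() + p.y() * p.y();
    }
}
```

Code in this shape is common in hot loops (temporary vectors, tuples, iterators), which is why eliminating those allocations can meaningfully reduce GC pressure.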

Primarily we’re starting with small fixes and improvements to HotSpot (Garbage Collection, Just In Time compilation, etc), but once we have gained some more experience in working with OpenJDK and more importantly its community, we’ll look to contribute more significant and impactful features.

InfoQ: Visual Studio has been around for decades, known and loved by developers. Is there a particular vested interest for Microsoft to contribute to OpenJDK and thereby enhance its own developer tools?

Borges: Microsoft invests in the Java extension for VS Code, which is proving to be extremely popular with Java developers who are building microservices and/or cloud-native applications. We will certainly continue to invest there.

InfoQ: GraalVM has been getting some attention lately amongst the Java community although it targets languages besides Java. Is Microsoft planning to work on GraalVM and its intention perhaps to add language support for C# for instance?

Verburg: For us, GraalVM looks like an early-stage technology and community when it comes to adoption among Azure customers, but we are watching it closely. We participated in its inaugural community workshop last year, and recently one of our engineers attended their summit in San Diego during CGO 2020, bringing back great findings and learnings. There's been tremendous progress, and with the recent formation of their Advisory Board with several community members, we believe the project will gain more attention and potentially more adoption. We certainly look forward to working with Oracle on it when the time comes.

InfoQ: What are the more exciting components of the OpenJDK (and the Java ecosystem) that InfoQ audience should pay close attention to? Anything else to add to developers in general and Java developers in particular?

Verburg: The Road to Valhalla, its background and object model, is of particular interest. In Java 14 we have records coming in, which is just one of the building blocks that will unlock a whole category of programming-productivity improvements as well as some serious performance enhancements.

In summary, Martijn Verburg and Bruno Borges outlined specific areas that Microsoft will be focusing on for contributions to the OpenJDK, especially in the areas of performance, Garbage Collection and so on, while still maintaining a flexible approach for the future based on community needs.
