Mobile Monitoring Solutions


3D Printed N95 Masks Not Viable

MMS Founder
MMS Erik Costlow

Article originally posted on InfoQ. Visit InfoQ

The 3D printing community can create many unique parts to assist with COVID-19 shortages, but plans for printed N95 masks pose a number of difficulties.

By Erik Costlow



Article: Spring Boot Tutorial: Building Microservices Deployed to Google Cloud

MMS Founder
MMS Sergio Felix

Article originally posted on InfoQ. Visit InfoQ

In this tutorial, the reader will get a chance to create a small Spring Boot application, containerize it and deploy it to Google Kubernetes Engine using Skaffold and the Cloud Code IntelliJ plugin.

By Sergio Felix



Web Components at Scale at Salesforce: Challenges Encountered, Lessons Learnt

MMS Founder
MMS Bruno Couriol

Article originally posted on InfoQ. Visit InfoQ

Diego Ferreiro Val, principal architect at Salesforce and co-creator of Lightning Web Components (LWC), spoke at WebComponentsSF about the challenges and lessons of building a platform that leverages web components at enterprise scale. Despite missing pieces, the web components standard was instrumental in achieving Salesforce’s interoperability, backward compatibility, and forward compatibility objectives at scale.

Ferreiro started by presenting Salesforce’s specific constraints. Among other services, Salesforce sells a platform on which customers implement their own enterprise applications, and as such it must be completely customizable. Customers may thus create their own components, objects, and user interfaces. Ferreiro listed the following business requirements for the platform:

  • multi-author,
  • multi-version,
  • backward compatibility indefinitely,
  • accessible,
  • personalizable,
  • localizable,
  • secure,
  • performant (including old browsers)

The third requirement is particularly meaningful in Salesforce’s context. Ferreiro emphasized:

If our customers write a component today, it has to be working in the next ten years.

Salesforce had the opportunity three years ago to start from scratch and looked around for available options to achieve its objectives while satisfying its constraints. Developers wanted to move away from proprietary technology; they favored a common component model that would let them use Salesforce or Adobe interchangeably, acquire transferable skills, and avoid being locked into a given technology, all while leveraging standard tooling.

After months of research and discussions with existing front-end frameworks, Salesforce observed that the web platform has managed to evolve without breaking existing content. Salesforce thus concluded that the best way to be future-proof is to align with standards as much as possible, and that the best option was to rely on web components. Ferreiro commented:

It took us a while to arrive at this conclusion but that is kind of at the core of what we are building.

Ferreiro observed, however, that the web components standard provided an API that was too low-level to be directly exposed to Salesforce customers, and Salesforce thus decided to add its own syntactic sugar on top of it. The result of that work, the Lightning Web Components (LWC) framework, added better ergonomics and filled gaps in accessibility, performance, and browser compatibility. An example component looks as follows:

<!-- myCounter.html: the component's template -->
<template>
  <p>Counter: {count}</p>
  <button onclick={increaseCounter}>Add</button>
</template>

// myCounter.js: the component's JavaScript class
import { LightningElement, track } from "lwc";

export default class MyCounter extends LightningElement {
  @track count = 0;

  increaseCounter() {
    this.count += 1;
  }
}

Adopting web components proved challenging, with two challenges taking on particular importance: IE11 support and Shadow DOM encapsulation. IE11 support was and remains very important, as IE11 accounted for 43% of the traffic as of September 2019.

Supporting web components in old browsers is thus a requirement for enterprise software. This means bringing ES7 features down to ES5 by means of polyfills and Babel plugins. Ferreiro emphasized that the team spent months implementing IE11 support, corrected a number of bugs in existing polyfills, resorted to ad-hoc Babel plugins for performance reasons, and was the first to implement Proxy support for IE11. Emulating Shadow DOM and CSS variables in IE11 in a way that was performant enough was another set of challenges, as were event retargeting, focus, and tab handling. No less importantly, the team had to fix Selenium and became the maintainer of the IE11 Selenium driver. The team also added web component support to JSDOM and fixed a few other issues that occurred in other browsers such as Safari and Chrome.

Shadow DOM encapsulation meant changing how testing and styling are done. Traversing the shadow root to access the encapsulated DOM meant rewriting thousands of tests. While Ferreiro described the effort as painful, he also mentioned that, as a result, components and tests became more reliable, resilient, and scalable. On the positive side, some testing frameworks implement a form of shadow-selector abstraction, and the Selenium WebDriver will soon have a shadow-piercing primitive.

Shadow DOM styling customization remains a work in progress, even though Shadow DOM improved style encapsulation. Ferreiro believes that in-progress specifications like CSS custom properties, ::part, and :state() will solve 90% of the style customization use cases. While the CSS Shadow Parts specification is still at the working-draft stage, it is already implemented in both Chrome and Firefox. Constructible Stylesheets may also help gain additional control over component styling in a controlled way.

Ferreiro concluded that although the path to production was hard, web components are helping Salesforce scale. Pages of the Salesforce CRM application typically feature over 8,000 components. Salesforce counts 5 million developers and 1 million created components. 73% of Salesforce developers use LWC, and 95% of Salesforce developers feel that web components are the right direction. While the web components standards are not finished, Ferreiro believes that the future is bright and signaled that Salesforce will keep pushing the web platform forward and representing enterprise use cases in the W3C and TC39 committees.

Salesforce.com, Inc. is a cloud-based software company providing customer relationship management services and selling a complementary suite of enterprise applications focused on customer service, marketing automation, analytics, and application development.

Lightning web components are available under the MIT open source license. LWC’s RFCs can be accessed on a dedicated website. Contributions and feedback are welcome and may be provided via the GitHub project.



Presentation: Next Generation Client APIs in Envoy Mobile

MMS Founder
MMS Jose Nino

Article originally posted on InfoQ. Visit InfoQ

Transcript

Nino: Let’s start with an idealized representation of networked applications. At its most basic we have clients that communicate over the network to a server with a request and then the server responding back to the client over the network. In the middle here we have APIs, and that’s what we’ve been talking about all day today. They are the contracts that establish how clients communicate with servers, and they’re this boundary that creates the standards for the communication. However, I’m sure most of you have experience with architectures that don’t look like this. They probably look a little bit closer to that, or maybe even nastier. Or, on a particularly bad day when you are on call, it probably looks like that.

My name is Jose Nino, and I am an infrastructure engineer at Lyft where I’ve worked for the past four years building systems that take this reality and try to make them feel more like this idealized version. For the first three years that I worked at Lyft I worked in the server networking team. There we recognized that there were two dimensions to API management. There was the what data do we send via our APIs – this is the shape of our APIs – and then there was the how do we send that data. I am going to call that the transport of our APIs.

Server-side, after some evolution, we standardized on an IDL. The technology that we used there is protobuf. Whenever a new service pops up at Lyft, the product engineers need to define their models in protobuf. Then we are able to compile the protobuf to generate both the client and server stubs. Here, by client I refer to another service in our backend infrastructure that’s trying to communicate across the network.

In the transport dimension, my team created Envoy a little bit over five years ago and open-sourced it a little bit over three. Envoy at its most basic is a network proxy, and it can be run standalone or as a sidecar. Importantly, for our backend infrastructure, we standardized the transport by running Envoy as a sidecar at every hop of the network. Every service runs with Envoy as a sidecar. What this has resulted in is a common network substrate that all services use to communicate with one another.

Together in the backend, what we’ve produced here is an ecosystem of strongly-typed APIs that are guaranteed by our proto models and a common universal network primitive guaranteed by Envoy.

This would have been the end of my talk circa the beginning of 2019. We felt pretty good about the state of our back end. Obviously there’s always more room for improvement, but what a group of us started thinking at the beginning of 2019 is that we had left the most important hop out of this ecosystem. That hop is the mobile client, because traditionally we treated clients as independent from the backend infrastructure. We’ve built unique solutions for what we thought were unique problems.

What we identified here was a technology gap because in spite of all the work that we had done server-side to increase consistency and reliability, we recognized that increasing reliability to 99% on the server side is really meaningless if we leave out the mobile clients because they are the ones that are being used by our end users to interact with the platform that we build.

We started thinking, “What do we want from our client APIs?” Really what we wanted were the same guarantees that we had given for the server. Most importantly, when problems do occur, we wanted the same tooling and observability in order to identify them.

In brief, what we were proposing is that we don’t need to treat the mobile clients any different from our backend infrastructure. We want the mobile clients to be another node in the network, another part of this mesh.

This is going to be the focus of the rest of the talk, hence the title of Next Generation APIs with Envoy Mobile. About a year ago I formed a new team called client networking. There we started evolving how client APIs were defined and shaped at Lyft, and we also started evolving how we transport them. Today I’m going to take you a little bit through that evolution of both the shape of the API and how that evolution culminated in Envoy Mobile, which is a new networking library that takes Envoy and brings it to the mobile clients.

API Shape

Let’s start with the evolution of the shape of the API at Lyft. This is the earliest workflow that engineers used at Lyft to define their client APIs. They discovered a new product feature that they wanted to have, wrote a tech spec about it, and then went and handwrote all the API code, both for the client and for the server.

We quickly realized that there were problems with this approach. For instance, programming errors could lead to different fields in the payloads where, for example, iOS might have a particular key but Android might have a misspelled key. Because this was free form JSON that we were sending over the wire, the JSON would end up in the server and then the server would catch on fire. Then our server engineers would say, “Don’t worry about it. We can fix this with just a feature flag tagging your client.” We recognized that this was not an ideal situation. We clearly had problems because we had no visibility over problems with our payload until they hit the server.

A few of our Android engineers took it upon themselves to try and fix this problem, and they introduced YAML schemas for our API definitions. What they tried to do with that was to have guarantees about the fields that should go in the payload, but this had a large problem, which was that it was only deployed to our Android clients. We still had inconsistencies between our iOS and Android clients. Really, what this meant was that there was no source of truth for our client APIs. There was no guarantee that both of our clients would behave the same way.

What we were shooting for here was consistency between our client APIs. We wanted a single source of truth that determined that the shape of the API was going to be the same between the two clients and the server. Really what we wanted was to do the same thing that we had done for server internal APIs. We wanted to provide that consistency guarantee.

What we envisioned was a workflow that looked similar to what the internal APIs looked like. We wanted to go from tech spec to a commit in our IDL repo. Then from our IDL repo we wanted to trigger CD to automatically generate both the client and the server stubs. We didn’t want to stub just at the server. We wanted to do that also for the client.

We used protobuf again to solve this problem. I want to stop and describe a little bit about why we chose protobuf. First, protobuf is strongly-typed. We had compile-time guarantees about what the runtime payload would be. Second, protobuf also gives us JSON serialization and deserialization. This means that we had a path for migrating the legacy APIs that were in free-form JSON into these new, strongly-typed APIs. Lastly, we were already using it server-side. We were going to have consistency guarantees not only between our two mobile clients but also between the mobile clients and the server. We could also benefit from all the operational knowledge that we had already derived by working with protobuf on the server side for years.

That’s what we did for our public client APIs. We made a pipeline so that when engineers wrote a commit into their proto, extending the models that a particular public API had, that would trigger generators that then would create not only our server stubs but also our client stubs. With the client stubs we went a little bit further and precompiled those files into modules based on the API package that they belonged to. We did this to reduce compilation times and also to better organize our APIs so that mobile engineers could work with them easily.

Using Generated APIs In Swift

Let’s take a look at what the code generators do once they run. Here I’m going to show you Swift code, but you can assume that the Kotlin code for Android clients looks basically the same.

First, we generate the models for the requests and the responses. What these strongly-typed models ensure is that we detect problems with the payload at compile time and not as a fire in the server. Second, we generate API stubs that are consistent between iOS and Android, and, extremely importantly, we abstracted the implementation of the transport behind these APIs. In other words, now that we’re migrating to these generated APIs, my team as a platform team is able to go and change the underlying infrastructure of this transport layer to bring in new changes. This is foreshadowing for the rest of the talk.

Lastly, we also generate mocks for our client APIs. These mocks conform to the same protocol as the production code. What this does is encourage our client engineers to go and test all of our public APIs. This is another piece where we’re trying to prevent runtime errors by having prevention at CI time when we run jobs on our pull requests.

I’ve been describing some problems that we had and a solution that we reached. What are some of the benefits that we got here? We finally have a single source of truth, not only between our Android and iOS client APIs but also between the mobile clients and the server. Now engineers don’t need to debate or investigate discrepancies when we have incidents.

This is because we have created highly testable, consistent, strongly-typed APIs that can be checked for errors at compile time rather than becoming runtime problems on the wire. Most importantly for the evolution of transport at Lyft, we have abstracted that implementation. Again, my team can go in and make improvements under the hood without worrying about massive migrations with long tails and having to disrupt the workflows of a lot of client engineers. By having this consistency and this guarantee, we could start, as a platform team, working on providing the other guarantees that we wanted for our client APIs.

Performance

One of the things that we did at first when we had this consistent platform is that we started optimizing the encoding of our APIs. Earlier today in the brief history of the APIs we heard about protobuf’s binary encoding. That’s something that we used at Lyft. Remember that before we had clients that talked over free-form JSON to services, and then services responded with JSON. Now that we had our generated APIs we could swap the encoding that we actually sent over the wire for a more efficient format.

What we did is we changed from JSON to binary encoding. Importantly, we did this without having client engineers worry about what was going on in the transport layer, because now that we generated these APIs we could actually negotiate the content type between client and server. The client could send a protobuf request, and if the server did not understand it, the two could negotiate a JSON request instead. This allowed us to decouple the migrations of our clients and our servers understanding the wire encoding.

What we saw were huge improvements here. On average we saw about a 40% reduction in payload sizes. In green are JSON payloads, and in blue are binary-encoded payloads. What this means in turn is that we saw a big improvement in request success and also a big reduction in request latency, because mobile networks are particularly sensitive to the size of the payload. It makes sense that if we reduce the payload size, then we get better performance out.

Most importantly, this change was transparent to our engineers. This means that we didn’t have to run migration after migration as we rolled out the improvements that we were working on. Now that we have the platform, again, we can go as a platform team and improve things under the hood.

Let’s take a step back and see where we are in the journey that I set out to describe. We achieved consistency across our mobile platforms and with the server via this protobuf defined, automatically generated APIs. We started to improve performance a little bit by experimenting with the wire format of the payloads that we sent over. By using these generated APIs, like I’ve said, we had a platform to start executing more on these guarantees that we wanted to provide.

Extensibility

Another thing that we wanted to do, for instance: protobuf allows us to define extensions on the fields. What these extensions allow us to do is create declarative APIs. Instead of just having fields in the API, we can start declaring what the behavior of the API should be. For example, if we have a polling API, we might be able to define the polling interval of that API in the message definition itself. We were already pretty satisfied with going from JSON to binary encoding, but if we want to go even further and reduce our payload sizes, we might enable gzip compression. With protobuf extensions we can do that declaratively.

To enact these behaviors we need to use the transport layer. On the server side we already have Envoy for our internal APIs, communicating between services and giving us this unified network primitive, but on the mobile clients we had two different libraries: historically we’ve used URLSession for iOS and OkHttp for Android. The problem with having these three places where we need to enact behavior is that it made it hard to focus our engineering efforts on solving these problems.

We went back to these guarantees and we asked ourselves, “How are we going to deliver these guarantees?” To be able to provide them, we couldn’t stop at just having a platform, a unified platform for the shape of our APIs. We needed an ecosystem on par with what we had with the server. We needed to control both the shape and the transport of our client APIs. In other words, we had to focus not only on the shape but also the transport.

API Transport

What we theorized here is that if we went from two different libraries in our mobile clients down to one, then we could more effectively use our resources. We’re going from three different implementations of our transport to two: the mobile clients and the server. Then we started thinking, “We could go one step further.” Given that we had worked to achieve true standardization on the server side, there was nothing that stopped us from believing that we could benefit from powering our clients with the same transport layer: Envoy.

At the beginning of 2019 my team started investing in this new networking library called Envoy Mobile. We believe that it has the potential to really define transport in mobile client applications the same way that Envoy did for backend infrastructure.

With this project, and by extending the last mile of our network topology to be part of this mesh, what we achieve is true standardization of the network. Similar to what Kubernetes has done for container orchestration, we want Envoy to do the same thing for the network transport. We believe that we can do that with Envoy Proxy on the server and with Envoy Mobile on the mobile clients.

Why is this standardization so important? We believe that, for the same reasons standardization was important on the server, it can also bring the same benefits between the client and the server. These are just some of the tenets that we used to think about this, the first one being: write once and deploy everywhere. This is what I’ve been talking about. Instead of having your engineering resources split amongst three implementations – iOS, Android, and the server – by having this one unified transport we can focus our resources on just that one platform.

Second, we can share common tooling for common problems. For example, with observability, instead of having one observability stack for your mobile clients and another for your server, by having the same emission of metrics from both places we can use the same infrastructure in both places.

Third, by homogenizing the network and the behavior guarantees that we provide, we make life easier for system operators. Instead of having to reason about three different systems and how they interact with each other, by having this universal network primitive we actually reduce the cognitive load that they have.

All these three reasons sounded very compelling to us to go ahead and build this new transport layer, so that’s what we did. For the past nine months we’ve been working on Envoy Mobile, and in that time we’ve had three releases. The first release, it was version 0.1. That was just a proof-of-concept release. We wanted to see, can we actually compile Envoy, which is meant to run on servers, and run it in our mobile applications? Can we actually route requests through them? Importantly, even when we released this initial proof-of-concept demo, we actually went and open-sourced it because we believe that this library has a lot of potential. The power of an open source community has been clear in Envoy. We wanted to do the same for Envoy Mobile.

With the v0.2 release we started to lay the foundation of a library. We actually started adapting how Envoy operates to provide a platform that mobile applications could use. Then, with the v0.3 release, which we’re going to cut in the next couple of weeks, we’re actually going to call it the first production-ready release, not only because we’ve hardened the library a lot but also because we’ve started running production experiments in the Lyft app.

Envoy

That raises the question: how do we take this thing that was a network proxy – Envoy – and turn it into a mobile networking library, Envoy Mobile? Let’s take a deeper look into the architecture of the library, starting with its build system. I know everyone loves to talk about build systems. We chose Bazel for two particular reasons.

First, it has actually pretty good cross-platform support. Here we needed to compile five languages and target two mobile platforms over a lot of different chips and architectures. Bazel provided us a toolchain that actually allowed us to do this. Second, and perhaps a more practical reason, was that Envoy was already built with Bazel. We could leverage a lot of the solutions that Envoy already had in Envoy Mobile.

This gives us a high level overview of the organization of the library. This organization also leads us to understand how the library is architected. On the left, in red, you have the platform targets. These are targets that are actually being compiled for iOS and for Android. It has the Swift and the Kotlin code that actually allow the mobile clients to interact with the Envoy Mobile engine.

Then in the middle, in blue, we have the C bridging code. We decided to write this bridging code in C because we see a lot of interoperability opportunities for Envoy Mobile to not only power mobile devices but perhaps other things that want to use Envoy as a library. That’s a topic for another talk. Then on the right we have the green targets. These are the native C++ targets: not only the C++ code that we have written in Envoy Mobile to adapt Envoy to become a platform on mobile clients but also Envoy itself. Envoy, remember, is at the core of Envoy Mobile.
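The talk does not show the bridging interface itself, but as a rough sketch of the layering just described – platform code on top, a thin C bridge in the middle, the C++ engine below – a hypothetical C surface might look like the following. All names and signatures here are illustrative, not Envoy Mobile’s actual API:

/* Hypothetical C bridging layer - illustrative only, not Envoy Mobile's real API. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { const uint8_t *bytes; size_t length; } bridge_data;

typedef struct {
  /* Invoked on a callback thread once the engine has a response. */
  void (*on_headers)(bridge_data headers, void *context);
  void (*on_data)(bridge_data body, bool end_stream, void *context);
  void (*on_error)(int error_code, void *context);
  void *context; /* opaque handle back into the Swift/Kotlin layer */
} bridge_callbacks;

typedef uint64_t bridge_stream;

/* Starts the single Envoy main thread with a configuration blob. */
int bridge_engine_start(const char *config_yaml);
/* Opens a stream; work is dispatched onto the Envoy thread internally. */
bridge_stream bridge_start_stream(bridge_callbacks callbacks);
int bridge_send_headers(bridge_stream stream, bridge_data headers, bool end_stream);
int bridge_send_data(bridge_stream stream, bridge_data body, bool end_stream);
/* Cancellation is the one synchronized operation, handled in native code. */
int bridge_cancel_stream(bridge_stream stream);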

That gives us an overview of one of the dimensions of the library: how is it organized, and how is it architected? Another important dimension is: how do we take this thing that was supposed to be a multi-threaded process (the Envoy proxy) and turn it into a single-threaded context in a sandboxed mobile application, Envoy Mobile? In other words, how do we turn something that was supposed to be a proxy into an engine?

That concern led us to this dimension, and that is the threading contexts that occur in this mobile library. The first threading context, up top, is the application threads. These are the threads that interact with the engine and actually issue network requests. In the middle we have the Envoy thread, which actually runs the Envoy engine. Then at the bottom we have the callback threads. These are threads that the engine can use to dispatch control back to the application when responses come in from the network.

If we overlay the architecture dimension on top of the threading dimension, we get this matrix that’s going to allow me to explain to you how the engine actually works. The first thing that happens, in the application threads and in the platform layer, is that we create the Envoy engine object. This is the object that allows the application to run Envoy and provide initial configuration. From the application’s Envoy engine we start the Envoy main thread. Unlike Envoy running on the server, Envoy Mobile does all of its work in the main thread.

This was enabled by one of Envoy’s key design principles, which is that most code in Envoy is actually single-threaded code. When Envoy is running on the server, Envoy’s main thread is mostly responsible for lifecycle updates, and then when requests come in they are actually attached to worker threads. If worker threads need to communicate with Envoy’s main thread, Envoy has an event dispatcher. This is what allows us to cross threading barriers both from the worker threads to the main thread and also from the main thread over to the worker threads.

The second important concept here is that when Envoy is making HTTP requests it uses an HTTP manager. This HTTP manager is the basis of Envoy’s HTTP stack. We’ll see how this becomes really important in a few slides.

What we did in Envoy Mobile is we hoisted these two constructs (the event dispatcher and the HTTP manager) and bolted them together into what we call the Envoy Mobile engine, because the event dispatcher allows us to cross the threading contexts that I’ve been describing, and the HTTP manager allows us to actually make HTTP calls out into the network. What I want to highlight here is that we have lifted these constructs from Envoy itself. While Envoy Mobile is a newer library, the real underlying implementation of our engine is the well-trodden paths of Envoy constructs that have a lot of industry production experience.

After we have the engine started, then the application threads can create HTTP streams. The HTTP streams then can issue calls across the application thread into the Envoy engine with event dispatcher and then into the HTTP manager. Then from the HTTP manager we go out into the internet.

I want to zoom here into the HTTP manager because, as I said, the HTTP manager is the foundational basis of the HTTP stack in Envoy. It’s also a big extension point for Envoy because it has this concept of HTTP filters that attach onto the HTTP manager. What these filters do is they enact behavior onto every single HTTP request that goes out of the engine and then every HTTP response that comes back in. Really, I want to put a pin here because this is exactly what we were looking for before: a place where we can enact those behaviors that we had declared onto our APIs.

After we have the request go out into the internet and our services do whatever they need to do, the response comes back into the engine. When the response comes back into the engine we have some bridging code that calls callbacks back into the platform. This is done via platform-specific callback mechanisms. It allows us to then, again, dispatch from the main thread into the callback threads and give control back to the application layer.

An important design decision that I’d like to discuss here is that we unsynchronized both the request path and the response path. There is at no point a moment of synchronization. We did this deliberately because we didn’t want any platform code to actually hang onto operations while we were issuing network requests.

The only point where we added synchronization is for stream cancellation, but even that is done in the native layer. This means that even the one part of the puzzle that is a little bit complex doesn’t have to have two different implementations in iOS and in Android. It’s all done in the shared native code. What this results in is a dramatically simpler implementation in the library, as well as easier usage for an end user of the library. This is an overview of how the engine works.

Let’s now take a look at what the platform code looks like when you’re actually using Envoy Mobile. Here I want to highlight again that while I’m showing you Swift code, the Kotlin code looks exactly the same. This was, again, deliberately done because we wanted the consistency guarantees between the two platforms.

The first thing that happens at the platform layer is that we build an Envoy client to make the network calls. Internally this starts the engine and all the processes that I just showed you. At this point of creating the Envoy client we also allow you to configure the client differently depending on what your needs might be. You might want different logging levels or different stats flushing intervals. I’ll get to why stats is up here.

Then you can build a request with the request builder. This is the object that we used to actually start the HTTP streams and make requests from your client out into the internet. With requests we have also exposed places that you can modify things. For example, you can add different headers to the requests in a programmatic way.

Lastly, you build a response handler that will receive the callbacks in that diagram that I showed you where the engine then gives back the response to the platform. It allows then the product engineers to write whatever the business logic is. For example, it has a callback for when headers are received. Then that gives control back to the business logic.

That was a deep dive into how Envoy Mobile works and the API that we expose for clients to use. Let’s go back at where we started the conversation of why we went ahead and wanted to create a new transport layer. It’s because we wanted to deliver consistent behavior, not only be able to describe it consistently but actually enact that behavior consistently across our mobile clients.

In our code generation pipeline, what I didn’t show you before is that we abstracted the implementation of the actual generated API. Initially, underneath that were API calls into NSURLSession for iOS and OkHttp for Android, but now that we had Envoy Mobile as this transport layer we could just go in and swap those two libraries for API calls into Envoy Mobile. Again, this was transparent to our mobile engineers. They didn’t have to know what was being generated under the hood.

We had reached a point in our mobile client development that we had reached with the server. We now had a full ecosystem where we controlled both the shape of our APIs with generated clients and also the transport of our APIs with Envoy Mobile.

Let’s get back to our chart. Let’s see how standardization of our transport layer is helping us achieve some of these missing goals that we had. Going back to the story, we already had a way to declare consistent behavior for our APIs via protobuf annotations. Like I said before, by having these annotations we could declare the behavior of our APIs. We could declare that a particular payload could be gzipped and get the benefits of that compression, but we didn’t have a way to enact it, because we had to implement compression in three different places: in the iOS client, in the Android client, and then decompression had to happen in the server.

Previously, we couldn’t execute on this. We had three different places where we needed to do that, but now, by having Envoy Mobile as our transport layer in the client and Envoy in the back end, we now finally had a place to actually enact this declarative behavior that we had attached to our client APIs.

Remember I put a pin in the HTTP filters, because this is the place where we could enact that behavior. Every single HTTP request coming out of Envoy Mobile and every response coming back into Envoy Mobile has to go through these HTTP filters, and these HTTP filters allowed us to really enact the behavior that we wanted from our mobile APIs.

We can focus our engineering resources because we have to implement it only once. Now we have a client that wants to send a request. In their API they have declared, “Please gzip this request. I want full compression.” That request goes out from the mobile client via Envoy Mobile, and it goes through a compression filter in this HTTP stack. The request is then compressed and sent over the wire. Then when it gets to our Edge infrastructure where Envoy is running we have, again, the compression filter. This compression filter understands that the payload has been compressed and is able to unzip this payload and pass it on to the service that requires it.

This is just the beginning of what we want to do with Envoy Mobile and the Envoy Mobile filters. What we’re creating here is an advanced suite of network behavior for our mobile clients. This is really unparalleled in scope in other places and in other libraries, especially because we’re bringing consistency in this behavior not only between our mobile clients but also between the mobile clients and the server. Let’s go and imagine some of its use cases.

Use Cases

For example, we might have deferred requests. What happens when a service goes down and then you have a user trying to use your mobile application? You might have to fail that request. It would be really nice if what we could do was actually notice that the network condition is poor and not send that request, then resolve things in the client to make your customer believe that the request has succeeded, and hold that request in an Envoy filter. Then later on when the service is back up and healthy we could actually dispatch all those requests and create consistency with the server. All of this without your customer noticing that they went through a tunnel or they lost connectivity when they went into an elevator.

Let’s imagine another use case, for security. We might want to have request signing. I know that at Lyft we receive a lot of spoofed requests from malicious clients that want to commit fraud on our platform. A common thing that we could do is sign those requests in our official mobile clients and then have the server verify that those requests are valid. Again, this could be really hard to implement if we had to implement not only a request signing library in Android and a request signing library in iOS, but also a way to verify them in the server. Now imagine if we could just implement one request signing filter in Envoy. By default we get not only the mobile client implementation to sign these requests but also the server-side implementation to verify that these requests are valid and not fraudulent.

This is just the beginning. This is just a list of some ideas that my team has and wants to start implementing. Again, this would be intractable if we had to implement this in three places rather than just one.

Observability

Let’s switch gears away from HTTP filters for a little bit and talk about observability. I said at the beginning that it isn’t enough to have consistent behavior when things are going well; it’s also important that when things do go wrong we have the right observability stack to analyze how things are going wrong and go fix them. One of Envoy’s main driving features is its unparalleled observability. At Lyft we use Envoy on the server side to observe network behavior end to end. We have dashboards that we auto-generate for every single service in order to know what’s happening. With Envoy Mobile we wanted to bring that same observability that we had in the backend infrastructure all the way to the missing hop, our mobile clients.

The goal is that an engineer that is curious about a particular thing in our network doesn’t have to worry about getting observability from all sorts of places. We want them to be able to reason about the network end-to-end, all the way from our mobile clients into the edge of our infrastructure and our service mesh using the same metrics. For example, you might imagine someone is interested in looking at request volume. Instead of having to go and do some analytics queries to know what the mobile clients are doing, you could just query for the same metrics that you would query on a backend service.

That is what we did with Envoy Mobile in the Lyft app. This is a place where Envoy’s extensibility also helped us, because HTTP filters are not Envoy’s only extension points; it also has a plethora of other extension points that we can use. One of these extension points is the stats sink. It allows us to determine different places where we might flush our time-series metrics.

One of the stats sinks that has been implemented by the community is the metrics service sink. This allowed us to build a very simple gRPC service that would then receive the influx of metrics from all of our mobile clients. Then it would aggregate them and flush them to our already existing statsd implementation. Here we not only leveraged the fact that Envoy has existing time-series metrics and existing observability extension points, but we were also leveraging already existing observability infrastructure that we had at Lyft for the back end.

This is what we got. I want to emphasize just how important this graph is. Up top we have metrics coming from Lyft’s mobile clients. Down below we have the exact same metric that we are receiving at the edge. This is insight that we did not have before leveraging Envoy Mobile for our mobile clients and the existing observability infrastructure that we have. This is really showing us the true power of having this unified network primitive all the way from our mobile clients to our server. We can understand the network end to end using the same concepts. This is the place where we start acting on that reduction of cognitive load that our operators have to have in order to understand the system.

Onwards

Let’s go back to this laundry list that I started with at the beginning. I want to posit and hopefully have shown to you that by having an ecosystem where we understand both the shape of our APIs and the transport, I’ve shown you how my team has started to provide the same guarantees that we have server-side also in our client APIs.

This is only the beginning because Envoy Mobile is the first open-source solution that provides the same software stack not only between the edge and the backend with Envoy Proxy but also now all the way to the mobile clients with Envoy Mobile. I hope that I’ve shown you some of the range of potential that we have with this paradigm, not only with the functionality of filters but we could also start doing protocol experimentation.

If you were in the panel earlier today, Richard Lee from Datawire and I were talking over lunch about QUIC experimentation. HTTP/3 over QUIC has shown historically, in experiments done by Google and Facebook and Uber, that there are really dramatic performance improvements, especially in networks with low connectivity and low bandwidth. These are the types of problems that only large companies like Google or Facebook were able to tackle before, because they had the engineering resources to go and deploy three different implementations of QUIC: one for iOS, one for Android, and one for the backend. Now that we have the same transport implementation in all three places, we can put a team of four on it, like my team is, and actually go and do these big projects.

Lastly, we also want to open source the code generators for our mobile client APIs that go from the proto files to the Swift and Kotlin stubs, because we want to give the community not only the transport layer but also the shape of the API. We want to give the community this whole ecosystem in order to go and enhance their client APIs. At the end of the day, what we believe about next generation client APIs is that they are model-based APIs defined in a strongly-typed IDL, like protobuf, so that platform engineers like myself can go and iterate on behavior by using a common transport layer, which is Envoy Mobile.

I wanted to go back down memory lane. This is a picture of my team taking their first Lyft ride ever dispatched through Envoy Mobile. At that point we actually had network requests going from the mobile client all the way to the database flowing through Envoy. Now, in just a couple weeks, we’re going to start doing production experiments with Envoy Mobile.

What we believe is that Envoy Proxy on the back end and Envoy Mobile on our mobile clients are the basis for this ecosystem of next generation APIs. We will continue investing heavily on it at Lyft. More importantly, however, we believe in the power of open-source community. That’s why we open-sourced Envoy Mobile from the very beginning. I hope you check the project out and join us.




Presentation: Understanding CPU Microarchitecture to Increase Performance

MMS Founder
MMS Alex Blewitt

Article originally posted on InfoQ. Visit InfoQ

Transcript

This is the talk on Understanding CPU Microarchitecture For Maximum Performance. What we’re going to be talking about is really what happens inside a CPU, what goes on in the bits and bobs in there, how does that hook in with the rest of the system, how does the memory subsystem work, how does caching work, and so on. We’re going to look at the tools that we have available for being able to do analysis of how we can make our CPU run faster.

We’re going to be focusing on this performance pyramid. We’re going to be talking about the instructions we use, the way the memory works, the way the CPU works, really at the top part of it here – this is what this talk covers. (The presentation, by the way, will be available afterwards; I’ll send out a tweet that’ll also be on the QCon website.) There are other things that you can do for performance. Specifically, if you’re looking at the performance of a distributed system, fix your distributed system first, fix your system architecture first, fix the algorithms first. Really, the top level is the last couple of percent of that particular process so other QCon talks are available.

Computers have been getting really quite complicated. Back when we were just talking about the 6502 and the BBC, Apple and other such systems – the Commodore 64 was mine – you had a single processor; it just did one thing and it did it very well. These days, server processors come in multi-socket configurations; these multiple sockets are connected to multiple memory chips and there’s a communication bandwidth path between them. In this talk I’m going to be focusing on Intel specifics and on Linux as far as operating systems are concerned: some of the details will be Linux- and Intel-specific, but the ideas will apply to other operating systems and other platforms as well.

You can have dual-socket servers; in this case, you’ve got two sockets talking to each other. You can get four-socket configurations and you can get eight-socket configurations. Each one of these sockets is connected to a bunch of RAM chips that are local to that socket, and the other RAM chips – whilst accessible – are further away and therefore slightly slower. This is known as Non-Uniform Memory Access (NUMA). Pretty much any serious server-side system is a non-uniform memory architecture these days.
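To make the NUMA point concrete, here is a minimal sketch using libnuma on Linux (assuming the libnuma development package is installed and you link with -lnuma) that places an allocation on a specific node, so that code running on that socket reads local rather than remote memory:

#include <stdio.h>
#include <numa.h>   /* libnuma; link with -lnuma */

int main(void) {
    if (numa_available() < 0) {
        puts("NUMA is not available on this machine");
        return 0;
    }
    printf("NUMA nodes: %d\n", numa_max_node() + 1);

    size_t len = 64UL << 20;                 /* a 64 MB buffer placed on node 0 */
    void *buf = numa_alloc_onnode(len, 0);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    /* Keep the threads that touch this buffer on the same node, e.g. with
     * numa_run_on_node(0), so their memory accesses stay local. */
    numa_free(buf, len);
    return 0;
}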

Inside the Socket

What happens inside the chip? It turns out that if you go down another level, you see the same pattern. This is what the Broadwell chips look like, with a ring bus around the outside for communicating backwards and forwards across the cores. Each one of these little squares, for those of you standing at the back, is an individual core with its cache and processing associated with it.

The 18-core die looks a little bit weird because it’s got a bidirectional pump between the left-hand side and the right-hand side. If you’re going from a core down at the bottom to one over the far side you then have increased traffic to be able to get over there and, therefore, a slower delay as well. The same thing is true of the 24-core as well.

These sockets and these cores are really getting quite complicated. Importantly, although we think of machines as a Von Neumann architecture where you can just reference any memory and get the result back, the time it takes to get that result back can vary dramatically depending on where the data is being loaded from.

The current generation of Intel processors has moved over to a mesh-like architecture. Things like the Cascade Lake and Skylake systems have a mesh for being able to move sideways. This gives more paths to be able to get from one place to another, and you can theoretically do it in less time. They come in 10-, 18-, and 28-core varieties. Each one of these chips has got bi-directional memory ports going out either side, so you can actually partition these things into left and right sub-NUMA clusters and say, “I want these things to run on this half and talk to this bank of memory,” whilst you have another set of processes that run on the other side and talk to that bank of memory.

In fact, there’s been a recent release of the Cascade Lake 56-core ‘die’ – actually it’s a package rather than a die, because what they’ve done is taken two of the existing dies, stuck them next to each other, and put the pipeline between them on the same piece of silicon.

Cache Is King

Things are getting complicated. If we drill down further into the cores then we see why, because we’ve got different levels of cache inside each one of these processors as well. They use a dollar sign because it’s a play on the word ‘cash’. For Skylake and Cascade Lake systems you’ve got a register file – the in-flight registers used to hold data – of 180 integer and 168 floating-point values. These can usually be accessed in one clock cycle, which is half a nanosecond if you’re running at two gigahertz – or slightly less if you’re running a little bit faster.

Those, in turn, delegate to a Level 1 cache, which is split into an instruction and a data half. The idea for splitting it is so that when you’re processing large amounts of data, your data isn’t pushing your program out of the space. In particular, most of the time, you’re just reading from the instruction cache, whereas the data is a two-way street. That access time is about 4 cycles.

That then delegates back to a Level 2 cache, which is shared between them; typically 12-15 cycles – something along those lines, depending on the architecture. Of course, these have different sizes. In the case of Skylake and Cascade Lake systems, it’s about a megabyte at the Level 2 cache that’s specific to that particular core.

If you want to talk to external memory, there’s a Level 3 cache as well – and that’s shared across all cores on the same die that you’re loading it from. There’s usually 16 megabytes or something of that size stored on the die itself, and each core can access memory from there. Of course, the time it takes to access it is really a function of how local it is to that data.

One thing I’ll also point out, the Level 3 cache on the Intel chips is non-inclusive at the L3 layer, but inclusive at the L2 layer. What that means is if you’ve got some data in the L1 it will also be in the L2 but it doesn’t actually have to be in the L3. AMD have just launched a new chip that’s got an absolutely massive L3 cache inside there: that has a number of performance advantages and we’ll probably see Intel coming with bigger L3 caches in the future as well.
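One rough way to see these cache levels from software is to time dependent loads over working sets of increasing size; the per-load latency steps up as the set spills out of L1, then L2, then L3, and finally into DRAM. A minimal sketch (sizes and timings are machine-dependent, and this is illustrative rather than a rigorous benchmark):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Chase a single random cycle through an array so every load depends on the
 * previous one; the average time per hop approximates the cache/memory latency
 * for that working-set size. */
static double ns_per_load(size_t n) {
    size_t *next = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {       /* Sattolo's algorithm: one big cycle */
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    size_t hops = 20 * 1000 * 1000, p = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < hops; i++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile size_t sink = p; (void)sink;      /* keep the loop from being optimized away */
    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / hops;
}

int main(void) {
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 4)
        printf("%6zu KiB working set: %5.1f ns per load\n",
               kb, ns_per_load(kb * 1024 / sizeof(size_t)));
    return 0;
}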

Thanks for the Memory

Of course, these delegate out and load data from RAM. In this case, DRAM – depending on how far away it is – could be anywhere from 150 to 300 cycles. It gets a little bit imprecise talking about cycles there, because it’s a function of both the processor’s cycle speed and also the memory speed as well.

In fact, if you use a program called lstopo, it will show you how your computer looks.

This was taken on my laptop: it’s a single-socket system with a bunch of cache split over various different levels inside there – and it’s actually reporting a Level 4 cache. It isn’t really a cache as such; it’s memory shared between the GPU and the CPU on my particular machine – that’s because it happens to be a laptop.

Actually, we’re seeing Level 4 cache turning up and, in particular, we’re seeing non-volatile RAM coming online at some of the Level 4 caches as well. It’ll be interesting to see how that evolves. The cores that are shown at the bottom, we’ve got a four-core processor with hyperthreading available.

Translation Lookaside Buffer

But there’s more than just the memory cache: when people talk about memory caches, they usually think of this Level 1, Level 2, Level 3 combination, but there’s a bunch of others that are inside there. One of them which is very important is called the Translation Lookaside Buffer, or TLB. The TLB is used to map the virtual addresses to where the physical addresses are on the system. The reason why this is important is because every time you do a process change or potentially every time you go into the kernel and back out again, you need to update what those tables are.

Those tables will have a listing essentially that says, “If you see an address that begins with 8000, then map it to this particular RAM chip, which is going to be somewhere on the system. If you see something beginning with ffff, then map it somewhere else.” This happens for every address you look up, and so it needs to be quite fast. Specifically, if you have something in that cache, great, you can access memory quickly; if it’s not in that cache, it’s going to take you a while.

Page Tables

That’s because it has to do what’s called a page table walk. Each process in your operating system has a page table, and a pointer to that page table is stored in the CR3 register – each time you do a process switch, it’s changed over to point at something else. Essentially, this is the map that your process is running with on that particular machine. It’s a tree, so I’ve demonstrated it as a two-level hierarchy here for being able to step between them, but actually, on modern processors, it’s four levels deep, and on Ice Lake, which is Intel’s next generation, it’ll be a five-level-deep page structure.

In particular, this level of page structure can give you a certain amount of memory space. At the moment, four-level page tables will take 47 bits, 48 bits’ worth of space for the virtual addresses; five levels will bring that up to 57 bits, which means you can address far more virtual memory than we need. Sixty-four terabytes should be enough for anyone.
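As a rough illustration of what that walk involves: with 4k pages and four levels, an x86-64 virtual address breaks down into a 12-bit page offset plus four 9-bit indices (512 entries per table), which is where the 48-bit figure comes from; a fifth level adds another 9 bits for 57. A small sketch of the split:

#include <stdio.h>
#include <stdint.h>

/* Split a 48-bit virtual address the way x86-64 four-level paging does with
 * 4k pages: a 9-bit index per table level plus a 12-bit offset into the page. */
int main(void) {
    uint64_t vaddr = 0x00007f1234567abcULL;        /* an example user-space address */

    unsigned offset = vaddr & 0xfff;               /* bits 0-11: offset within the page */
    unsigned pt     = (vaddr >> 12) & 0x1ff;       /* bits 12-20: page table index      */
    unsigned pd     = (vaddr >> 21) & 0x1ff;       /* bits 21-29: page directory index  */
    unsigned pdpt   = (vaddr >> 30) & 0x1ff;       /* bits 30-38                        */
    unsigned pml4   = (vaddr >> 39) & 0x1ff;       /* bits 39-47                        */

    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n", pml4, pdpt, pd, pt, offset);
    /* A TLB hit skips this walk entirely; a miss costs up to four dependent
     * memory reads (five with Ice Lake's five-level tables). */
    return 0;
}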

(Huge) Pages

These memory pages are split into 4k sizes. Now, this 4k size made sense back in the 386 days when virtual memory came along, but it’s not really great for systems that are taking hundreds of megabytes’ or hundreds of gigabytes’ worth of space. You can change the level of granularity that this mapping happens from a 4k size to a huge page. A huge page basically means something that isn’t a 4k size.

Most Intel systems will have two different sizes for this; they’ll have a two-megabyte and a one-gigabyte support. It’s under operating system control as to which one of those it uses, and each CPU will have flags to say which one it is. Different architectures have the same idea of large pages as well, but they’ll work in slightly different ways or have slightly different sizes by default.

The purpose of using huge pages is so that the TLB doesn’t need to store as many pointers. If you’ve just got one giant page for your process, then all of your lookups go through that one entry in the table and you can load it fairly quickly. It’s good from that point of view, but it does have some downsides. It might be slightly more complex to set up and use, and if you’re using hugetlbfs, then you need to configure it.

Hugetlbfs is a bit of a pain. This was the first thing that came out in Linux to be able to support large pages. What you had to do was specify ahead of time how many large pages you wanted and what the size was going to be, and then you had to be in a certain permission group, and then your application would then decide to use these things. Generally speaking, people tried it, people didn’t like it, people stopped using it.

Transparent Huge Pages

There was something called Transparent Huge Pages instead. Now, this has been slightly more successful, but not without its pain points. Transparent Huge Pages says: when you ask for a page, instead of giving back a default 4k one, give back a two-megabyte one or a one-gigabyte one depending on how it’s configured. However, most applications were written to assume that when you did an allocation of a page you’d just get a 4k size back. They’d only write 4k’s worth of data, and you’d end up allocating two megabytes’ worth of contiguous physical space and only using a small fraction of it.

When Transparent Huge Pages first came out, just giving everyone large pages by default didn’t really work. There are several configuration options. One of them, in /sys/kernel/mm/transparent_hugepage/enabled, is to set the policy to madvise. madvise is a system call with which your code can say, “Yes, I’d like to use huge pages for this region, please.” If you don’t make that call, you get small pages; if you do, you get a big page.
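As a minimal sketch of what that looks like in code (assuming the kernel’s THP policy is set to madvise, and picking an illustrative 64-megabyte buffer aligned to the 2-megabyte huge-page size), the request is just an madvise call on the region:

#include <stdlib.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64 * 1024 * 1024;                 /* illustrative 64 MB working set */
    void *buf;

    /* Align the region to the 2 MB huge-page size so the kernel can back it
       with huge pages. */
    if (posix_memalign(&buf, 2 * 1024 * 1024, len) != 0)
        return 1;

    /* "Yes, I'd like to use huge pages for this region, please." */
    madvise(buf, len, MADV_HUGEPAGE);

    /* ... fill and process buf ... */

    free(buf);
    return 0;
}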

One of the problems that happened for high-performance systems was that you would ask for a large page and the operating system would go, “I don’t have a large page yet. Stand by while I go and get one,” and it would then assemble a whole bunch of little pages, which takes some time; not good if you have a low-latency system.

A relatively new option that was added to this (within the last few releases of Linux) is the defer option. What the defer option says is, “I would prefer a large page, but if you don’t have one then that’s fine, I’ll just take a bunch of small pages and you can fix it up again afterwards.” That has, on the whole, reduced the blocking issue that you would see. You’ll still see a bunch of blog posts and Stack Overflow answers that say, “Don’t use huge pages.” Give it a try with madvise and with defer and see what happens.

Cache Lines

While the operating system deals with memory in units of a page, whether that’s 4k or two megabytes, the processor itself deals with memory at cache-line granularity. A cache line is, at the moment, about 64 bytes. Now, I say about 64 bytes: it’s exactly 64 bytes for the Intel processors you’re using in your laptops, but it may well be 128 bytes in the future. Don’t assume that it’s always going to be 64 bytes.

Particularly, if you’re working with mobile devices or ARM architectures, ARM has got something called big.LITTLE and you end up with processors with different size cache lines inside them. So be aware that different ones exist. For Intel servers mostly you’re looking at 64 bytes. By the time you’re watching this in two or three years time on InfoQ, it’ll probably be 128 bytes. You heard it here first.

When your process loads memory and iterates through it, the CPU will notice that you’re reaching out to memory and start getting the data in. When you’re striding through memory, the processor’s memory subsystem will automatically start fetching the next memory for you.

If you can arrange your processes to iterate through memory in a linear form, great, you’re going to be able to go through them. Bouncing around randomly like when you’re traversing an object heap, not so good for the memory system.

It notices other stride patterns as well. If you’re striding through every 32 bytes or something, the memory prefetcher will notice that and load the lines you’re going to need ahead of time.

There is something you can use in compilers, __builtin_prefetch (which ends up being a prefetch instruction under the covers), to request that some memory you’re going to be looking at soon please be made available. Only use this if you’ve got the data to show that it makes sense. You’re mostly going to make the wrong decisions about it, not because you can’t make the right decisions, but because you’ll either request it too early, so it pushes out stuff you were still using, or you’ll request it too late, so the data hasn’t arrived by the time you need it. It’s something you can use for tweaking, but not something I’d recommend jumping to as a first step.
Trying to organize your memory structure so that you can process through it linearly is going to be the way that you can improve performance at that layer.
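Here is a hedged sketch of the kind of case where a manual prefetch can pay off: an indirect gather that the hardware prefetcher cannot predict. The function name and the look-ahead distance of eight elements are purely illustrative.

#include <stddef.h>

long gather_sum(const long *values, const int *indices, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Ask for a value we'll need a few iterations from now; the hardware
           prefetcher cannot guess this address from the access pattern. */
        if (i + 8 < n)
            __builtin_prefetch(&values[indices[i + 8]], 0, 1);
        total += values[indices[i]];
    }
    return total;
}

Measure before and after: for a plain linear scan, the hardware prefetcher already does this for you and the extra instruction is pure overhead.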

False Sharing

You can also have something with cache lines called false sharing. False sharing is when you’ve got something of cache-line size and two threads reading and writing inside it. Although they’re writing to different locations, different variables in programming-language terms, if they’re reading and writing into the same cache line then there’s going to be a bit of a tug of war between the two cores.

If you’ve got those two cores on the other sides of a 56-core package or in another socket then you’re going to get contention inside here. You’re either going to get data loss if you haven’t used synchronization primitives [NB not specific to false sharing, just in general] or you’re going to get bad performance as they’re fighting over the exclusive ownership for that particular cache line.

To avoid this, if you are doing processing with multiple threads and you’re going to be reading and writing a lot of data, then have them separated by a couple of cache lines. I say a couple of cache lines because when the line fill buffer loads things, it’ll pull in a couple of cache lines at a time; although you’re not reading and writing that section, you might find that you’re still treading on each other’s toes. Oracle’s HotSpot has sun.misc.Contended [or jdk.internal.vm.annotation.Contended], which pads what you’re trying to read and write with 128 bytes of blank space so that it avoids this particular problem. Of course, that number will change as the cache line size changes as well.
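A minimal sketch of that padding idea in C: the 128 bytes mirrors what the Contended annotation uses and covers a pair of 64-byte lines, an assumption you’d revisit if the line size changes.

#include <stdalign.h>

/* Each counter sits in its own 128-byte-aligned slot, so two threads
   updating neighbouring counters never touch the same cache line. */
struct padded_counter {
    alignas(128) unsigned long value;
};

struct padded_counter counters[8];   /* one slot per worker thread */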

Memory Performance Strategies

In order to take the best advantage of the memory subsystem, try to design your data layout and your data structures so that they fit within an appropriate amount of space. In other words, if the hot data set that you’re processing and using a lot can fit inside the L1 cache, great, you’ll be able to process it very quickly indeed. Or you can buy an Ice Lake chip, which has got 48k’s worth of Level 1 instead of 32k. If it doesn’t fit, see if it can fit in the Level 2, and then the Level 3. These kinds of effects are visible from JavaScript or Java or Python or whatever you’re dealing with: if you measure performance as you round the sizes up, you’ll see step changes as you overflow out of one layer of cache into the next.

Consider how you structure your data as well. If you’ve got data in an array of structures, it may be better to pivot and think of it as a set of arrays, each holding one field. Let’s say you’re processing images: you may not need to process the alpha value, but you might want to process the red, green, and blue. If you have them as an array of reds, an array of greens, and an array of blues, then you’ll get better performance than if you iterate through red, green, blue, red, green, blue in memory.
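A quick sketch of that pivot in C, with a made-up pixel layout and image size:

/* Array of structures: r, g, b, a interleaved, so a red-only pass drags
   the other three channels through the cache as well. */
struct pixel { unsigned char r, g, b, a; };
struct pixel image_aos[1024 * 1024];

/* Structure of arrays: each channel is contiguous, so a pass over just
   the red channel streams linearly through memory. */
struct image_soa {
    unsigned char r[1024 * 1024];
    unsigned char g[1024 * 1024];
    unsigned char b[1024 * 1024];
    unsigned char a[1024 * 1024];
};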

Also, consider using thread-local or core-local data structures as well. If you’re doing some map-reduce operation over a whole bunch of data, treat it as a distributed system. In a map-reduce job you’d fire it out, you’d get a load of results, and then you’d merge them back in at the last step. Treat that the same with your cores as well. Get one core to calculate and do something, another core to calculate and do something else and then bring those results and then add them up rather than trying to fight over some shared variable in memory.

Consider compressing data. This happened in the JVM recently with compressed strings: instead of each character taking up two bytes’ worth of space, it was compressed down to one byte per character. Although it costs a few extra instructions to expand and contract on demand as the data is loaded, the fact that you’re shifting less data in and out of the caches, and giving the garbage collector less to process, made it a performance win. You can use other compression strategies too. HdrHistogram is a good way of compressing a lot of data down in a logarithmic form. Depending on your application, you’ll be able to think of something as well.

Pinning Memory and Threads

You can also pin where memory and threads live. If you’re dealing with a massively scalable system and you’ve got a massively scalable problem to solve, then try to pin threads to particular cores to do the processing. If you’re dealing with something like a Netty benchmark where you’re consuming events over a network socket, then pin your worker threads so that one worker lives on this core, another worker lives on that core, another worker lives on a different core. That way you’ll get the best performance, because they’ll never get migrated or swapped around the place.
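As a minimal sketch of how a worker might pin itself on Linux (the core number is an arbitrary choice, and error handling is left to the caller):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Restrict the calling thread to a single core; returns 0 on success. */
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Each worker would call this with its own core number when it starts up.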

If you are doing that, you might need to tell the Linux kernel to stay away from a subset of those cores using isolcpus, a boot-time option that tells the Linux kernel to do its housekeeping on only some of your cores. You can also use taskset when you start a program, a logging daemon or whatever, to create a CPU set that your application is going to use while other applications run elsewhere. These are things you can set up as a Linux sysadmin.

There’s also numactl and libnuma which you can use in order to be able to control programmatically where you’re going to allocate large chunks of memory. Again, you can use the (sub)NUMA clusters on the processors to be able to decide where that goes.
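As a hedged sketch of the programmatic side, assuming libnuma is installed and you link with -lnuma, and taking node 0 as an arbitrary example, placing an allocation on a particular NUMA node looks roughly like this:

#include <numa.h>
#include <stddef.h>

void *alloc_on_node_0(size_t bytes) {
    if (numa_available() < 0)
        return NULL;                       /* kernel has no NUMA support */
    return numa_alloc_onnode(bytes, 0);    /* back the pages from node 0 */
}

/* Release with numa_free(ptr, bytes) when finished. */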

Inside the Core

We’ve talked a lot about memory; what about the actual brains of the operation, the CPU? The CPU core is split into two parts, the frontend and the backend. This isn’t like frontend development; it doesn’t run JavaScript in there. The frontend’s job is to take a bunch of instructions in x86 format, decode them, figure out what they are, and then spit out micro-operations, or μops (often written “uops” because u is easier to type than μ), for the backend to process.

Frontend

A bunch of bytes comes in off the memory system and goes into the predecoder, which says, “This is where this instruction ends, this is where that instruction ends”; that then goes into the instruction decoder, which converts it into micro-ops.

We’re going to look at just this increment one here, because when you increment something in memory, what it really means is you’re going to load from memory, going to do an addition, and then you’re going to write it back to memory. That increment corresponds to three micro-ops.

Generally speaking, you’ll have roughly a one-to-one relationship between instructions and the micro-ops that reach the backend. If you’re using complex addressing modes, where complex means you’re adding an offset or scaling by some value, then there might be a few other micro-ops generated as well. Ultimately, the point of this is to spit out a stream of micro-ops for the backend to do.

At this point, we’re all operating in order. The instructions come in order, and the μops come out in the order we want them to execute as well. There are a few other things that go on in the frontend. One of them is the μop cache. As you decode Intel instructions, which are fairly complicated, into a set of these micro-operations, which are internal but slightly easier, the results can be cached, so the next time you see the same instructions it’ll spit out the same μops.

Importantly, we’ve got a loop stream detector, which spots loops, so that when you’re running through a very tight loop it will just serve the already-decoded μops rather than going through this parsing process again. That shaves a few stages off the pipeline. (Most pipelines on Intel processors are between 14 and 19 stages deep, depending on what they’re doing.)

Branch Predictors

One of the things that impacts it more than anything else is the branch predictor.

The branch predictor’s job is to figure out where we are going next. It’s like the sat nav for the CPU. It’s right most of the time. It probably has a better rating than me when I’m trying to do the driving navigations. It figures out where you’re trying to go, it assumes whether or not the branch is taken, and then it starts serving the instructions so that by the time the pipeline flows through, the instruction’s all ready and waiting to go.

Sometimes, that will fail. In that case, the parsing and the decoding is thrown away, it resets to where it should be, and then you have what’s called a pipeline stall or a bubble, where it waits for the instructions to go through and then catch up again. If you can reduce bad speculation on the branches, you’re going to get better performance in your application.

The branch prediction dynamically adapts to what your code is actually doing. This is also something that’s visible from the high level as well. If you’ve got a Java program or a JavaScript program that’s iterating through a bunch of arrays and maybe adding up, say, positive numbers into one counter and negative numbers into another counter, the branch predictor, figuring out which of the elements to go down, is going to be confused on random data. You’ll get reasonable performance, but the branch predictor won’t be able to help you, because it’ll be right about 50% of the time.

If you’re dealing with a sorted data set, in other words you see all the negative numbers first and then all the positive numbers afterwards, the branch predictor is going to be on top of its game and give you its best performance. Once it’s seen the first few negative numbers it will assume they’re all going to be negative; there’s a brief bit of slow performance at the changeover, and then the positive numbers carry on from there.
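A tiny sketch of that two-counter loop, with made-up names: the very same branch is nearly unpredictable on shuffled input and nearly free on sorted input.

#include <stddef.h>

void split_sums(const int *data, size_t n, long *pos, long *neg) {
    *pos = 0;
    *neg = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= 0)      /* ~50% mispredicted on random data,      */
            *pos += data[i];   /* almost perfectly predicted when sorted */
        else
            *neg += data[i];
    }
}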

The branch predictor, which decides whether you go down something, is only half of the story. There’s also a branch target predictor as well. This target predictor says where it is that you’re going to go. Now, in a lot of cases, the branch [target] predictor is going to know exactly because you’re going to jump to a specific memory location. You’re jumping to the system exit function or you’re jumping to this particular routine at the start of the loop again.

Sometimes, though, you’re jumping to the value of a register, and that register may not have been computed yet. The predictor will take a punt on where it thinks you’re going, but sometimes that guess is wrong and you need to rewind and do things again.

You’ll see the branch target predictor being confused a lot when you iterate through object-oriented code whether that’s C++, whether that’s your own object orientation, whether it’s the JVM or something. That’s because in order to figure out where to go, it has to be able to load what the class is, look in the class word, and then figure out from the class word where the vTable is and then jump to the implementation in the vTable. Those few jumps are going to confuse the branch target predictor.

One of the things you can do to speed this up is to add a check that says, “If this looks like an X class, jump to the X class implementation. Otherwise, fall back to dynamic dispatch.” That’s something you can implement very cheaply and very quickly. In fact, the JVM’s JIT does this, using monomorphic and bimorphic dispatch for the common cases, and some of the non-Oracle JITs do more than just two.

That works in C++ as well. If you’ve got a dynamic dispatch call, like if you’re implementing a VFS… I saw something a while ago that said a Linux VFS operation had been sped up by a factor of three or so simply because they said, “If you’re using the ext2 file system, then delegate this to the ext2 implementation directly.” Some of the recent mitigations that have been put into the Linux kernel to avoid things like Spectre and Meltdown have actually decreased the performance of Linux over time; by putting in your own monomorphic dispatch, you can gain some of that speed back again.
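As a hedged sketch of that guarded-dispatch trick in C: the shape and circle names are made up, and a real JIT guards on the class rather than comparing function pointers, but the idea is simply to test for the common target before falling back to the indirect call.

#include <stdio.h>

struct shape {
    double (*area)(const struct shape *);   /* vtable-style function pointer */
    double r;
};

static double circle_area(const struct shape *s) {
    return 3.14159265358979 * s->r * s->r;
}

/* Guarded dispatch: the common case becomes a direct, easily predicted call;
   anything else still goes through the indirect branch. */
static double fast_area(const struct shape *s) {
    if (s->area == circle_area)
        return circle_area(s);
    return s->area(s);
}

int main(void) {
    struct shape c = { circle_area, 2.0 };
    printf("%f\n", fast_area(&c));
    return 0;
}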

I think Martin has highlighted this a few times in the past: inlining is the master optimization, because when you inline you suddenly know much more about where it is that you’re going, and you also often lose an indirect call. Losing those function calls is a good way of optimizing performance, because then you don’t have to rely on the branch predictor or the target predictor in the CPU at all.

Backend

We’ve figured out where we’re going, we’ve got a whole bunch of μops, we’ve got this nice stream of them coming through. What happens next? It goes over to the backend, whose job is to take all of these μops, do the calculations, and then spit out any side effects to memory. In this particular case, we’re loading from a location in memory, incrementing it, and then writing that value back again.

Now, at this point, we don’t have to worry about things like eax, esi, rsi, and so on, because they’ve all been renamed to temporary registers. The x86 ISA defines a small set of architectural registers, but the core has a lot more physical ones inside. It says, “For this particular instruction, we’re going to stick the value that was in eax into, say, R99,” picking one of the physical registers that’s free.

As processors evolve and this register file gets larger, we can deal with more and more in-flight data. That’s important because inside the backend, once we’ve done the allocation, we are now off to the races. All of these μops are competing for the execution resources in the backend to actually do their work. As soon as their data dependencies are there, they’ll be scheduled, executed, and the results will go in.

Intel processors, and most other server-side processors these days, are out of order, because at this point any of those micro-ops can happen at any time. Importantly, as far as performance is concerned, this backend is capable of running multiple μops simultaneously. The frontend can dispatch four μops per cycle to the backend on Intel Skylake and Cascade Lake systems; Ice Lake can issue five μops per cycle. If they aren’t competing for the slots that they need to run on, then you can have a number of those μops running in parallel as well.

In this particular case, our internal ‘processor’ has got a number of ports, and those ports feed the execution units. Several of them can deal with arithmetic and logical operations; some of them can divide, some of them can multiply. There are different implementations for the integer and floating-point sides.

All of these will be able to take up 256 bit values inside them. Port 5 will be able to handle 512 bits on its own and you can combine ports 0 and 1 to have a 512-bit operation. If you are dealing with 512 bits, you’ve essentially got two paths, either port 5 or port 0 and 1. You don’t get to decide what happens, this is the CPU doing it for you, but you can at least execute those two things in parallel. There’s a bunch of other ones that you need to use as well for address generation, loading, storing and so on.

In this particular case, they get allocated to their appropriate ports, we figure out what the value is that comes in from the load, we figure out that the register now has value 2A. The increment is now ready to execute because its dependency has been met and so that does the update, and then once that’s there, the write then flows out at the end.

At this point, when it goes out the door, we’re back to in-order again. Lots of stuff happened, we may have made lots of mistakes, we may have knackered our cache up, and we can see that externally through Spectre and Meltdown, but between the μops going into the backend and the μops coming out of it or the writes going out to it, we’ve been out of order through that whole time.

perf

How do we know what’s going on inside, so that we can change things? I’m sure you’re all aware of perf. Perf is a general Linux performance tool that can interrogate counters kept inside the backend and the frontend of the core itself. There are subcommands like record, which lets you trace the execution of a program; annotate and report, which tell you what the results in the binary data file actually mean; and stat, for interrogating performance counters themselves.

When you run perf record, it will take an application and generate a perf data file with all the results. You can then schlep that file off to another machine for analysis if you want to, or you can do the data processing on the host, depending on what you’re doing.

When you do recordings in perf, it will capture the stack trace at each sample point; the stack trace tells you where you were, so you can amalgamate the samples and get results back again, much like the higher-level profilers you’re used to.

Skid

However, at this level, the profiler is subject to what’s called skid. Skid is when you say, “I want to record a sample here,” but the processor is still doing things, and by the time it notices the request it has moved on, so it effectively has to say, “We’re here now, so I guess that’s what you meant.”

There are some precision flags, the ‘:p’ suffixes that you can add for more precise values; they add more and more overhead but get you closer and closer to the precise location of what happened. Generally speaking, if you’re trying to identify bottlenecks in your system, you’ll get a rough overview first of all by running perf without them, and then, as you hone in and narrow down where the issues are, you may want to start turning up the precision to get exact values.

Backtraces

When perf records a sample, it will try to figure out what the backtrace is by walking back up the stack. If the code maintains a frame pointer, perf will be able to follow that back through the stack. Quite a lot of programs are compiled without frame pointer support, because back in the 32-bit days we didn’t have many registers, so it was useful to use the frame pointer register for something else.

These days there’s probably less of a reason not to have it in there, but if you’re dealing with a binary that has debug symbols, perf can use the DWARF format to figure out how to walk the stack back again. If you’ve ever seen incomplete stack traces, it’s probably because you don’t have the frame pointers in there and you don’t have the debug symbols to track it back. For running code, the perf command supports something called --call-graph. What the --call-graph option does is say which flavor of backtracing you want to use.

LBR is Intel’s Last Branch Record. What happens is that as the Intel processor is jumping around, it records where it’s been. If you take a snapshot of that every few samples, then you can build up a complete picture of where you’ve gone, even if you don’t have the debugging symbols or the frame pointers in there. That will give you accurate values. There’s also something on even newer Intel processors called Intel Processor Trace, which does the same thing but with a much lower overhead. There are a couple of LWN articles down at the bottom which you can find when you look at the slides later.

Processor Stats

As well as perf record, there’s also perf stat. Perf stat will give you an idea of how much work your program is actually doing, reading the counters from the processor to tell you that: how many instructions there were, how many branches you’ve taken. In this case, the branch misses are about 5%, so it means the branch predictor is working about 95% of the time, which is nice. Those are the kinds of things you can inspect.

One thing to look for is the instructions per cycle, or IPC. The IPC says how efficiently you’re doing work. You can have a program running at 100% CPU with an IPC of 0.5 that runs dog slow, and you can have the same program running at 100% CPU with an IPC of 2.5 and running five times faster. 100% times five is still 100%. It’s a great speedup if you can find it.

Depending on which processor you’re running on, the IPC is going to be somewhere in a region between below one, which means that we potentially have other issues that we need to investigate, and something closer to four, which is about the maximum number that you’ll see out of these systems. I think Ice Lake might start to reach into the five-IPC range. Larger numbers are better; less than one means that you probably need to dig in and find out what’s going on. These figures are read from what are called performance counters.

Performance Counters

Performance counters are model-specific registers that each CPU has, and as it goes through and executes instructions it will increment these counters. You get some for free: the number of branches, branch misses, the number of instructions executed, and so on. There are programmable ones as well. If you want to count how many TLB misses you’re taking, how many page walks you’re doing, or how many μops you’re running on port five of the processor itself, you can configure these counters to give you answers to those questions.

They’ve got names – perf list will tell you what they all are. If you don’t know what the name is but you know what the raw event code is, you can put that in and it will count something as well. Of course, I don’t know what that one is; I picked it up from somewhere but I can’t remember what it counts. That’s exactly why, when you’re doing analysis, you need to have a process to follow.

Top-down Microarchitecture Analysis Method

There’s something called the Top-down Microarchitecture Analysis Method, or TMAM, which was created by Ahmad Yasin. What it says is, “Let’s take the performance of our application and figure out where the bottlenecks are.” In the big diagram I had before with the frontend and the backend, that’s the separation that we have. Are we failing to supply enough μops from the frontend? Are we blocked in the backend from doing the processing? Are we retiring them – which doesn’t mean going off to live somewhere in the country; it means the work actually completing? Or is it bad speculation, because we’ve simply no idea where we’re going?

Each one of these has got various different perf counters that you can enable to say where you are. Again, this is described in the Intel software optimization manual. What it really boils down to is: was a μop allocated? If we have allocated a μop, does it retire? If it retires then great, that’s useful work done. If not, it falls down the bad speculation route. If the μop isn’t allocated, is that because we’ve stalled on the frontend or on the backend?

It all sounds very complicated, but as long as you can spell top-down, great, because perf does it for you. If you run perf stat --topdown it will run system-wide and give you an overview of which buckets your system is spending its time in. In this case, we’re not timing sleep; we’re just using it as a placeholder to keep perf running while it measures the system. In this particular case, I’ve got a six-core system running on a single socket (it’s probably three physical cores plus their hyperthreads). The retiring rate is somewhere between 15% and 35%.

In other words, there’s somewhere between a three- and four-fold speedup that we could get out of this. If retiring is at 100%, job done, it’s beer o’clock. Most of the time you’ll find it’s a smaller number, and you can improve it by figuring out why your program’s running slowly and then taking steps to optimize it. In this particular case, it’s highlighting in red that we may be frontend or backend bound across a whole bunch of processes.

To find out something more useful than that (this only gives you a snapshot of your system), you really want to find out what your particular process is doing.

Toplev

Andi Kleen has written something called Toplev. The Toplev tool will go through this process and give you ideas of where you can start looking. When you first run it, it downloads some configuration files from Intel’s website, download.01.org, for the processor that you’re running on, so that it knows what performance counters are available to it.

If you’re deploying this on an air gapped server, you can download these configuration files ahead of time and then deploy them with it. There’s documentation that shows you how to do that.

Essentially, this is a very fancy frontend to perf. That “cpu/event=0x3c,umask=0x0/” – I don’t know what that is, but Toplev does, and it will then be able to generate a perf command. There’s even an option you can use to show what perf command it would run, so that you can copy and paste it and run it somewhere else. This will do the work for you.

The other nice thing Toplev will do, if you have a repeatable workload and you’re repeatedly running a particular process, is let you use it in no-multiplex mode. We talked about multiplexing before: if you’ve got two events and you don’t have enough hardware counters for them, it will record one and then the other, alternating between them, and then essentially scale the counts up afterwards. In no-multiplex mode, it runs the entire program counting one counter, then it runs again counting the other counter. That will give you a much more accurate view of what’s happening in your system, assuming that it’s repeatable.

You can then run it with -l1. It will show you the same thing as perf’s top-down view did, but for that particular process only. Here, we’re creating a 16-megabyte random data file, and then running a single-threaded, no-multiplexing, level-one analysis of base64 encoding it, just because base64 is something pretty much everyone has, so you can play around with this example at home afterwards.

When we run it, it will do some calculations – and one of the things it will tell us is that “By the way, did you know this program is backend bound?” In other words, we’re getting the stuff in the memory, we’re doing the μop allocations, the branch predictor’s not failing dismally, we’re passing it to the backend, and then the backend isn’t going through it as quickly as we would like.

If you run it with -l2 it will say that it’s core bound, and if you run it again at the next level it will say it’s to do with port utilization. What this means is that we’re generating a whole bunch of instructions, they are running on one of the ports inside that diagram that I showed you several slides ago, but because there aren’t any other ports available to do that work, it’s all being bottlenecked on that port. For example, if we were doing a bunch of divisions, we would expect everything to queue up on the one port that has the integer divider, because that’s the only one that can do it. Different processors will give you different performance here, primarily because Intel keeps adding execution units.

Cascade Lake and Skylake are pretty much the same thing. (One of them is just a slightly optimized way of laying down the silicon.) Ice Lake is then adding new ports inside there and new functionality. You’d expect this program to run faster on Ice Lake simply because they’ve changed the internals of the CPU.

There are, of course, different results you can get out of this, and you have to drill further to find out where the time is going. If it’s frontend bound, then you need to look at how the code is being pulled in from memory and decoded.

If it’s backend bound, maybe there’s something you can do with your algorithm. I’m not really proposing to jump in and fix base64 live on stage, but my guess is that it’s doing a bunch of multiplies and those multiplies are being executed on a particular port. If we found another way of doing it then perhaps it would be slightly faster if we did it that way. One of those ways is vectorization and I’ll mention that a little bit later.

Code Layout

One of the other things that can affect the performance of your program is the code layout itself. If you’ve got a program with some ‘if’ logic, such as: has an error occurred, has a NullPointerException occurred, or whatever, there are two different ways of laying that code down in memory. You can have the error test jump over to the good case, with the error case falling through straight after the test; or you can do it the other way around, where the normal case falls through and the bad case jumps somewhere else.

As far as the branch predictor is concerned, as long as you don’t actually error out, it’s always going to guess the right branch either way. However, instructions are loaded in cache lines, and it’s going to be better if the code we want to run is already in the lines we’ve loaded. If you can punt your error code somewhere else, so that it doesn’t sit in the middle of the hot path, you will get better performance just because the memory subsystem has less work to do.

One way of doing this is to use the __builtin_expect function, and this is used in the Linux kernel: they’ve got likely and unlikely macros wrapping __builtin_expect. If you say __builtin_expect(error, 1), in other words we expect the error to be the common case, the code will be laid out that way. Back in the old days, this used to emit an instruction to tell the branch predictor which way to go. (That hasn’t been used since my hair wasn’t gray.) If the good case is the common one, you specify __builtin_expect(error, 0) instead and it will lay the code out the other way around.
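A minimal sketch of that pattern, using the kernel-style macro names mentioned above (the process function is a made-up example):

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int err) {
    if (unlikely(err)) {
        /* cold path: the compiler lays this out away from the hot code */
        return -1;
    }
    /* hot path: falls straight through without a taken branch */
    return 0;
}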

If you’re using profile-guided optimization, this is something the profile will learn for you. If you’ve got a CI pipeline that’s doing performance tests, gathering profile information, and then applying that, the job is already done in this particular case. If you aren’t, this could be something that enables your good-path or hot-path code to run slightly faster.

Loop Alignment

The other thing, and I mentioned it a little bit before, is the loop stream detector. In fact, about an hour ago, I saw a tweet going past of someone asking, “Why does this run faster at a particular alignment?”, and there’s a thread on Reddit that they pointed to. Intel has got something called the loop stream detector: when you get to the end of a loop, you jump back to the top and then go around the next iteration. That’s basically what all loops look like in code. The only difference between these two loops is that it just so happens they are in different places in memory, and that’s all.

If you’ve got a loop and it starts on a 32-byte boundary, then the Intel loop stream decoder will serve the instructions from the μop cache that we talked about earlier. The μop cache can hold something like 1500 μops. As long as your loop isn’t too big, if you’re serving things straight from the μop cache, then you will be executing that loop faster without having to do any extra work yourself. If it’s slightly less aligned, then you’ll go through the ordinary process which is to read the instructions, convert them into μops, dispatch those μops and so on, and it will be a little bit slower. Over large amounts of data or large number of times processing it, it can make a difference.

There are flags in LLVM that you can specify to say, “OK, for the targets of where we’re jumping to, align all of them to 32-byte boundaries, because then you’re not going to have this problem.” align-all-functions says: whenever you are creating a function, make the function start on a 32-byte boundary. align-all-nofallthru-blocks is just a very complicated way of saying that all of the branch targets, like the start of a loop, need to be on a 32-byte boundary. There was another option, align-all-blocks, which says throw everything onto a 32-byte boundary, and that’s overkill, because it’s only the blocks you jump to, such as the start of a loop, that need to be 32-byte aligned.

This is probably not going to help you; it may be able to help, and it’s useful to know about, but it’s almost certainly not something that’s going to give you a win for free. I believe that the JIT will automatically align functions on a 32-byte boundary as well, for this particular reason.

The important reason for mentioning this from a code layout perspective is that recompiling your code can give you different performance profiles. There was a blog post by Denis Bakhvalov where he talked about having a function and just adding another, completely uncalled function to the code; the mere presence of that function affected the alignment, and he saw a drop in performance. When you’re doing performance measurement, you need to be aware that just recompiling your code can shuffle the code layout, which can give different performance profiles because of things like this. In that particular case, these flags are useful because they will at least give you consistent performance as far as the loop stream detector is concerned.

Facebook have written a tool called BOLT, which stands for Binary Optimization and Layout Tool. What it will do is load in your program code, parse it, essentially reconstitute the basic blocks that it started with, and then defrag them. (Anyone remember defragging disks? That’s maybe a last-millennium thing.) Basically, it will run your code and figure out where the hotspots are, the same thing that a profile-guided optimizer will tell you, and then reorder all of the blocks so that all of the hotspots live in a smaller amount of memory. This doesn’t change the instructions themselves, they’re not running any faster; it’s just that all of the hot code which before maybe sat in, say, 10 pages of memory now fits into one or two pages. Therefore, you get much better utilization of the cache and of the pages that are brought in.

There’s an arXiv paper that they published and there’s a repository on the bottom there which you can follow from the slides.

Vectorisation

Another thing I mentioned is the SIMD JSON parser, as an example of vectorization. This is something that Daniel Lemire and a few other people have written, doing vectorized processing to parse JSON files.

Typically, when you parse JSON it ends up being in a switch loop. Is this character an open brace? If it is, then go down this particular path because we’re doing an object, if it’s a quote, we’re doing a string, and so on.

It turns out that instead of reading things a byte at a time, you can read 64 bytes at a time. Then you can get much better throughput for parsing your JSON and generating the object as a result. He and a few others worked on this approach for the JSON parser and compared it with the best known existing open-source ones. They saw a speedup of around two to three times, depending on which benchmark they were using.

There are a couple of reasons why it’s faster. One is that instead of processing one character at a time, it’s processing 64 characters at a time. The second thing is that they’ve got a mechanism for avoiding branches. One of the things that vector operations give you is a way of doing masking operations and then combining things together, because, as we talked about previously, the branch predictor’s job of guessing whether you went down a branch or not is made significantly easier when there isn’t a branch to go down. This particular case was branch-free and, therefore, was a lot faster for doing the processing.
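As a hedged sketch of the masking idea (not the simdjson implementation itself), here is a branch-free way to build a bitmask of which bytes in a 64-byte block are quote characters. A real SIMD version does the comparison with vector instructions; this plain-C inner loop is just written so a compiler can auto-vectorize it.

#include <stdint.h>

uint64_t quote_mask(const unsigned char block[64]) {
    uint64_t mask = 0;
    for (int i = 0; i < 64; i++)
        mask |= (uint64_t)(block[i] == '"') << i;   /* no per-character branch */
    return mask;
}

Downstream code can then act on the whole mask at once instead of branching on each character.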

Summary

We’ve talked about a lot of different ways that we can analyze programs, and some of the techniques and ideas that you can use to get programs to go faster. If you’re dealing with memory, you want to try to use cache-aligned memory or cache-aware data structures, so that things are based on 64-byte chunks, or page-sized chunks, because that’s realistically what’s happening under the covers. Data structures like B-trees, which have historically been used a lot in databases and on-disk formats, work relatively well with the way memory is laid out, so consider those too.

Consider compressing data, so that you compress and decompress on the fly inside the processor, because the cost of schlepping the memory around is usually where the performance problems lie.

If you can, avoid random memory access. Random memory access is the thing that will kill memory lookup performance at run time.

Have a look at whether huge pages can help. Certainly, you can get a 10% or 15% speedup on some types of applications simply by enabling huge pages. Databases tend to be very sensitive to this, so follow the database’s own guidance on whether huge pages are a good thing or a bad thing for it.

Configure how you source your memory by looking at the libnuma and the other controls that allow you to specify where your processes are run. The more local you can get your data to the program, the better it’s going to be.

As far as the CPU is concerned, each CPU is its own network data center. In the same way that you’ve been thinking about distributed processing systems, being able to move data, and summarize them across multiple network machines, think of that as what’s happening inside your server as well. If you can get it so that your data is closer, then it will be processed much faster than if the data has to come from somewhere else.

In particular, branch mis-speculation and memory cache misses are pretty costly. Earlier on, someone mentioned that when you’re thinking about the complexity of programs, don’t think of it as the number of instructions that you run, but how long you can run without a cache miss; really, count the complexity of algorithms in the number of cache misses that you hit.

Look at branch-free and lock-free algorithms. (We didn’t really talk about lock-free because when you have locks, you generally don’t have something that’s CPU intensive because it’s locked waiting on something else.) If you can use lock-free and branch-free programs then you’re going to get better performance out of it because you’re not going to be contending and the branch predictor is not going to get in your way.

Then, use performance counters, with the help of perf or Toplev, to indicate exactly where your program’s slowing down. Is it pulling data in from memory, is it the cache misses that we’re seeing, is it the fact that we’re waiting on this particular dispatch port?

Use vectorization where you can. Most compilers will give you auto-vectorization for free with things like loop unrolling to be able to give you the best possible performance, but it may be the case that dropping out into your own vectorized code or vectorized assembly makes sense for particular types of operations. Things like the JVM will use vector code for doing object array copy, for example, when it knows it can just copy whole swathes of data with vector instructions instead of as a byte by byte loop.

I’ve compiled all of the references that I used into the presentation on this slide, so if you don’t want to download the slide and be able to use it then you can take a photo of this and you’ll be able to find it from there.

I’ve also created a section of links to other people’s blogs and other items that you might be able to use as well including some of the ones which I’ve mentioned within this presentation. Those are good things for following up.

My links to presentations are down here at the bottom. That’s my blog at the top and my Twitter feed where I shall post a link to the slide. If you’re impatient you can go to Speaker Deck now. I’ve already published it and I’ll send out the link as soon as I finish speaking. There’s some other links to things like my GitHub repo and other narrated videos that I have done in the past.

See more presentations with transcripts

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


Babel 7.9 Reduces Bundle Sizes, Adds TypeScript 3.8 Support

MMS Founder
MMS Dylan Schiemann

Article originally posted on InfoQ. Visit InfoQ

The Babel 7.9 release decreases default bundle sizes when using the module/nomodule pattern and adds support for TypeScript 3.8 and its type-only imports and exports. Babel 7.9 also improves optimizations for JSX transforms and adds experimental parser support for the ES Record & Tuple proposal.

Before Babel 7.9, @babel/preset-env grouped ECMAScript syntax features into groups of related smaller features. Unfortunately, this approach led to increasingly large bundles over time due to the gradual introduction of new edge cases and fixes. Including one browser with a known issue would add a significant amount of code to fix a group of related issues.

The new bugfixes: true option instead transpiles the broken syntax to the closest non-broken modern syntax. With the Babel 7.9 release, the bugfixes option works best with the esmodules: true target for browsers with native ES module support. The Babel team expects to improve this approach over their next few releases and enable it by default in Babel 8.

With the TypeScript 3.8 release, developers can now explicitly mark imports and exports as type-only. Babel does not analyze type definitions, and before the 7.9 release, the @babel/plugin-transform-typescript plugin handled imports not used as values as if they were type-only. With Babel 7.9, the new type modifier can be used without any configuration changes. As explained by Nicolò Ribaudo of the Babel team:

We recommend configuring @babel/preset-typescript or @babel/plugin-transform-typescript so that it only considers imports as type-only when there is the explicit type keyword, similarly to TypeScript’s --importsNotUsedAsValues preserve option.

React is currently working on simplifying element creation. The goal is to support new functions for instantiating JSX elements as an alternative to the legacy general-purpose React.createElement function, providing for better optimization. This new approach to element creation is available today for React users via the experimental version of React. This new approach gets leveraged through the following configuration with Babel 7.9, and will become the default with the Babel 8.0 release:

{
  "presets": [
    ["@babel/preset-react", {
      "runtime": "automatic"
    }]
  ]
}

Development on Babel 8.0 is underway and a release is expected in the next few months.

Babel is available under the MIT open source license. Contributions are welcome via the Babel GitHub organization and should follow Babel’s contribution guidelines and code of conduct. Donations to support the project may also get made via Open Collective.

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


Improving Incident Management Through Role Assignments and Game Days

MMS Founder
MMS Matt Campbell

Article originally posted on InfoQ. Visit InfoQ

John Arundel, principal consultant at Bitfield Consulting, shared his thoughts on how to ensure incidents are handled smoothly and quickly. He suggests assigning specific roles to each team member responding to the incident. Red team versus blue team exercises can also be leveraged to ensure the team is prepared to respond accurately and quickly.

Within the incident response team, the incident commander has the most critical role. They are responsible for running the team’s designated incident response process. Arundel notes that:

The key thing is to have one person in charge. You need a decision maker. Often, this will be the team lead, but over time, you should make sure to give everybody a turn in that chair.

The next role that Arundel recommends is the communicator. The communicator’s job is to provide status updates both internally and externally. This includes updating management, project managers, and the impacted clients. Supporting the communicator is the records person, whose responsibility is to document everything as it happens, including taking notes, capturing screenshots, and collecting log data and metrics for future analysis. The final role that Arundel recommends is the researcher. Their responsibility is to hunt down answers to questions as they come up in the incident response process.

This matches closely with how Netflix runs incidents as seen in the recent open sourcing of their incident management tool Dispatch. Dispatch can automatically assign an incident commander based on the type, priority, or description of the incident. Dispatch can also facilitate communications by allowing for notifications to happen on a cadence removing the need to have a human remember to send them out.

As the team becomes better at resolving incidents and mitigating the issues that lead to them, they may need other ways to ensure they are prepared. As Arundel states, “The more reliable your systems, the less frequently real incidents happen, so the more you need to practice them.” This is where he recommends using red team versus blue team exercises. This concept, which originates in military exercises and is heavily used in information security, has one internal team take on the role of “attacker”. Their job is to create an incident that the blue team needs to respond to. This is similar to the concept of game days in which a failure is simulated within the environment to allow for testing systems, processes, and team responses.

Adrian Cockcroft, VP cloud architecture strategy at AWS, shares this sentiment and believes that adopting a “learning organization, disaster recovery testing, game days, and chaos engineering tools are all important components of a continuously resilient system.”

Arundel shares some tips for teams looking to host their first game day: “Keep it short and simple the first time round. Put together a basic plan for what you’re going to do: this is the first draft of your incident handling procedure.” As the team becomes more practiced, he recommends starting to assign the various roles. For the first attempts at practice incidents, he advises keeping the exercise to around one hour in length. Finally, he feels that moving the debrief to the day after will provide a better experience as the team will have had time to reflect on their actions and learnings.

Eugene Wu, director of customer experience at Gremlin, shares a number of the same tips. He also adds the importance of clearly identifying up front the purpose of the game day and which scenarios are going to be tested. This allows for clearly identifying the correct individuals to be involved, both on the execution and the response sides. He also suggests scoping out the test cases to better define the perceived impact and extent of the potential blast radius. Finally, he recommends having a clear exit strategy in case the experiment needs to be aborted quickly.

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


Article: Data-Driven Decision Making – Product Operations with Site Reliability Engineering

MMS Founder
MMS Vladyslav Ukis

Article originally posted on InfoQ. Visit InfoQ

The Data-Driven Decision Making Series provides an overview of how the three main activities in the software delivery – Product Management, Development and Operations – can be supported by data-driven decision making. In Operations, SRE’s SLIs and SLOs can be used to steer the reliability of services in production.

By Vladyslav Ukis

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.


Next.js 9.3 Released, Improves Static Site Generation

MMS Founder
MMS Bruno Couriol

Article originally posted on InfoQ. Visit InfoQ

The Next.js team recently released Next.js 9.3, featuring improved static website generation and preview, adding Sass support, while shipping a smaller runtime.

Next.js added support for static website generation two years ago in version 3. Next.js 9 introduced the concept of Automatic Static Optimization, allowing developers to have hybrid applications that contain both server-rendered (SSR) and statically generated pages (SSG). Automatic static optimization relies on whether a page has blocking data-fetching requirements, as indicated by a getInitialProps method in the code of the page.

Following an RFC dedicated to static site generation, Next.js 9.3 introduces three new methods that aim at making a clear distinction between what will become SSG vs. SSR.

The first method getStaticProps is an async function that computes props passed to the page component. The function runs in the Node.js context and receives a context parameter that contains the route parameters – necessary for pages using dynamic routes.

The second async method getStaticPaths is useful when handling dynamic routes. The function computes a list of paths that have to be rendered to HTML at build time, and specifies a fallback behavior when the user of the generated application is navigating to a route that is not included in the generated list of paths. If fallback is false, then any paths navigated to by the user that are not returned by getStaticPaths will result in a 404 error page. Otherwise, if the page path was not generated at build time (not in the computed list), it will be generated on-demand when a user requests the page. The release note provides the following example for a blog application featuring dynamic routes:


import fetch from 'node-fetch'

function Post({ post }) {
  // Render the post...
}

// This function runs at build time in the Node.js context
export async function getStaticPaths() {
  // Call an external API endpoint to get the list of posts
  const res = await fetch('https://.../posts')
  const posts = await res.json()

  // Compute the paths to pre-render based on the posts
  const paths = posts.map(post => `/posts/${post.id}`)

  // Only these paths are pre-rendered at build time;
  // fallback: false means any other path returns a 404 page
  return { paths, fallback: false }
}

// This also runs at build time, once per path returned above
export async function getStaticProps({ params }) {
  // params contains the route parameters for a dynamic route
  // such as /posts/[id], e.g. the post id
  const res = await fetch(`https://.../posts/${params.id}`)
  const post = await res.json()

  // Pass the fetched post to the page component via props
  return { props: { post } }
}

export default Post


With this code, Next.js will generate all the blog posts at build time (const paths = posts.map(post => `/posts/${post.id}`)), and an error page when the user navigates to a non-existent blog post. Generating the page for one blog post will lead to fetching the content of the post, and passing that as a post prop to the Post React component.

The last async method getServerSideProps will lead to Next.js rendering the page anew on each request (SSR). getServerSideProps thus always runs server-side. The function receives a context parameter that contains the route parameters, the request and response objects, and the query string.

Next.js 9.3’s preview mode allows users to bypass the statically generated page to render (SSR) a draft page on demand from, for example, a CMS. Next.js 9.3’s preview mode integrates directly with both getStaticProps and getServerSideProps and requires creating a preview API route. The preview mode allows developers to write a draft on a headless CMS and preview the draft immediately without regenerating the static page.

Next.js 9.3 additionally added both global and component-scoped Sass support (through CSS Modules). The syntax involves a simple import. The release note provides the following example for a global style sheet include:

import '../styles.scss'


export default function MyApp({ Component, pageProps }) {
  return <Component {...pageProps} />
}


CSS Modules for Sass files are only enabled for files with the .module.scss extension. The global Sass files must be in the root of the project and imported in the custom <App> component (./pages/_app.js file).

Next.js 9.3 features differential bundling by default and will only load the necessary polyfills in legacy browsers. The new practice eliminates over 32 kB from the first load size for all users with modern browsers (which do not require the polyfills).

Next.js’s new features have received positive reactions from numerous developers. A Twitter user said:

This is wonderful. I followed some of the discussion in the SSG RFC, and it is great to see this turn into a real feature! Congrats on everyone

Next.js 9.3’s new features are non-breaking and fully backward compatible. To update, developers need to install the latest version with npm:

npm i next@latest react@latest react-dom@latest

Next.js is available under the MIT open source license. Contributions and feedback are welcome and may be provided via the GitHub project.



Chrome Phasing out Support for User Agent

MMS Founder
MMS Guy Nesher

Article originally posted on InfoQ. Visit InfoQ

Google announced its decision to drop support for the User-Agent string in its Chrome browser. Instead, Chrome will offer a new API called Client Hints that will give the user greater control over which information is shared with websites.

The User-Agent string can be traced back to Mosaic, a popular browser in the early ’90s, which sent a simple string containing the browser name and its version. The string looked something like Mosaic/0.9 and saw little use.

When Netscape came out a few years later, it adopted the User-Agent string and added details such as the operating system, language, etc. These details helped websites deliver the right content to the user, though in reality, the primary use case for the User-Agent string became browser sniffing.

Since Mosaic and Netscape supported different sets of functionality, websites had to use the User-Agent string to determine the browser type and avoid using capabilities that were not supported (such as frames, which were only supported by Netscape).

Browser sniffing continued to play a significant part in determining browser capabilities for many years, with an unfortunate side effect: smaller browser vendors had to mimic popular User-Agents to get the correct website served, as many companies only supported the major User-Agent types.

As JavaScript’s popularity rose, most developers started using feature-detection libraries such as Modernizr, which detect the specific capabilities of the browser and provide much more accurate results.

As a result, the most significant remaining use of the User-Agent was in the advertising industry, where companies used it to ‘fingerprint’ users, a practice that many privacy advocates find problematic, mainly because most users have limited options to disable or mask those details.

To combat these two problems, the Chrome team will start phasing out the User-Agent beginning with Chrome 81.

While removing the User-Agent string completely was deemed problematic, as many sites still rely on it, Chrome will stop updating the browser version in the string and will only include a unified version of the OS data.

The move is scheduled to be complete by Chrome 85, which is expected to be released in mid-September 2020. Other browser vendors, including Mozilla Firefox, Microsoft Edge, and Apple Safari, have expressed their support for the move. However, it is still unclear when they will phase out the User-Agent string themselves.

You can read more about Chrome’s proposed alternative to the User-Agent in the ‘Client Hints’ explainer on the official GitHub repository. As with every proposal, the exact implementation may change before release, and developers are advised to keep track of the details within the repository as well as the release notes provided with new versions of Chrome.
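For readers who want a feel for the proposal, here is a minimal sketch of a server opting into UA Client Hints using only Node.js’s built-in http module; the header names (Accept-CH, Sec-CH-UA, Sec-CH-UA-Platform, Sec-CH-UA-Mobile) come from the current draft and may change before the feature ships.

const http = require('http')

http.createServer((req, res) => {
  // Ask the browser to send additional UA hints on subsequent requests
  res.setHeader('Accept-CH', 'Sec-CH-UA-Platform, Sec-CH-UA-Mobile')

  // Under the draft, low-entropy hints such as Sec-CH-UA are expected to be
  // sent by default; Node.js lower-cases incoming header names
  const brand = req.headers['sec-ch-ua'] || 'not provided'
  const platform = req.headers['sec-ch-ua-platform'] || 'not provided'

  res.end(`Browser: ${brand}\nPlatform: ${platform}\n`)
}).listen(3000)

Higher-entropy details would only be sent after a site explicitly requests them, which is the privacy improvement the proposal aims for.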
