Mobile Monitoring Solutions


Presentation: GraphQL + Apollo Server + Azure Functions = Magic

MMS Founder
MMS Erick Wendel

Article originally posted on InfoQ. Visit InfoQ

Erick Wendel introduces Azure Functions and discusses building a serverless solution with GraphQL and Apollo Server.

By Erick Wendel


Presentation: Micro-front-ends: The Golden Circle

MMS Founder
MMS Bruno Vercelino da Hora

Article originally posted on InfoQ. Visit InfoQ

Summary

Bruno Vercelino da Hora discusses micro-front-ends, what they are and what they are good for.

Bio

Bruno Vercelino da Hora works as a developer at Pipefy.

About the conference

THE CONF was founded in 2017 as an annual conference to fill a gap in Brazil: an international-level event where all speakers present in English. That way there is finally a body of presentations that anyone in the world can consume, and a venue that anyone in the world can attend. The goal is to showcase the interesting new tech, such as data science, that Brazilians are working on, and to integrate nearby countries and the rest of the world in the future.

Recorded at:

Dec 28, 2019


Podcast: Joseph Jacks on Commercial Open Source Software, RISC-V, and Disrupting the Application Layer

MMS Founder
MMS Joseph Jacks

Article originally posted on InfoQ. Visit InfoQ

In this podcast, Daniel Bryant spoke to Joseph Jacks, Founder of OSS Capital and the Open Core Summit, and discussed topics including the open source and open core models, innovations within open source hardware and the RISC-V instruction set architecture, and current opportunities for disruption using commercial open source software.


Google Publishes Its BeyondProd Cloud-native Security Model

MMS Founder
MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

Google's BeyondProd white-paper provides a model for cloud-native security in a containerized world. Google's model requires moving beyond the traditional perimeter-based security model, and leverages code provenance and service identity as its security cornerstones. Google also provided a list of open-source software that can be used to implement its security model.

Google uses BeyondProd to ensure the security of the billions of containers it deploys each week, writes Maya Kaczorowski, Google product manager for container security. Similar to Google's BeyondCorp model for enterprise security, BeyondProd's central idea is that organizations should not trust any entity, whether inside or outside their perimeter, following the principle of "never trust, always verify". In comparison with enterprise security, cloud-native security takes into account the use of containers, explains Kaczorowski:

The first big difference when using containers is due to scheduling. You can’t rely on IP addresses or host names for security. You need service identity

This idea has been gaining ever more traction in the last few years under the moniker of “Zero-Trust” networking. As independent cybersecurity consultant Michael Brunton-Spall says:

Just because you're on the network doesn't mean we trust you in the slightest. In fact, in many cases it probably means we should trust you less, I would argue. Most of the networks I've seen across government have been compromised at some point in the past. Being on the network is not a good indicator of trust.

In zero-trust networking, protection of the network at its outer perimeter remains essential. However, going from there to full zero-trust networking requires a number of additional provisions. This is by no means easy, given the lack of standard ways to do it, adds Brunton-Spall:

You can understand [it] from people who’ve done this, custom-built it. If you want to custom build your own, you should follow the same things they do. Go to conferences, learn from people who do it.

Filling this gap, Google's white-paper sets out a number of fundamental principles which complement the basic idea of no trust between services. Those include running code of known provenance on trusted machines, creating "choke points" to enforce security policies across services, defining a standard way to roll out changes, and isolating workloads. Most importantly,

These controls mean that containers and the microservices running inside them can be deployed, communicate with one another, and run next to each other, securely, without burdening individual microservice developers with the security and implementation details of the underlying infrastructure.

Applying these principles requires organizations to change their infrastructure and development processes with the aim of building security into their products as early as possible, while not burdening individual developers with security concerns – effectively transitioning from DevOps to a DevSecOps model.

This will not be straightforward or without costs for interested organizations; Google itself has been creating internal tools and working on its processes for years. A good starting point is the list of open-source software and other tools provided by Google, including Envoy, Traffic Director, Kubernetes admission controllers, and many more.


How to Scale Pilots into a Global IT Organization

MMS Founder
MMS Ben Linders

Article originally posted on InfoQ. Visit InfoQ

Scaling pilots into a global IT organization is doable, and if done right, it really works and can help transform entire companies, said Clemens Utschig. At DevOpsCon Munich 2019 he presented how they go from starting with an idea to scaling it up into the global organization.

BI X is the digital lab of Boehringer Ingelheim. In order to shape an idea into something that can be challenged from an early sketch, and to decide where and how to prove it, BI X runs central ideation, with a solid deliverable called the Pitch Deck, as Utschig explained:

It will clearly elaborate on the challenge, the opportunity as well as the startup horizon around it. Ideation plays a crucial role in scouting for startups & partners, because we want to ensure we don’t try to re-invent the wheel.

In this stage and the following, the product owner plays a super important role, Utschig said. The product owner moves with his idea to BI X for ~4-6 months, and in return gets a fully staffed product team to do user research, build, and test the MVP and potentially bring it to pilot stage, with real (end) users.

Assuming that the owning business really has an appetite for the product, the stage of handover begins, which can last anywhere between 3-6 months, and involves finding the new team, helping the new folks breathe the vision of the product, and getting them into the BI X spirit.

Utschig mentioned that they faced many challenges along the way, and often there was not one right way to deal with them, since every handover is a little different. He mentioned that having good documentation, from user research, UX design, all the way down to stories, bugs, and lastly code lineage, is really helpful in retaining knowledge.

The same goes for proper onboarding of the new team regarding agile working methods and mindset, he said. Absolutely crucial is also to transfer the product vision to the new team so that they really own it, just like the lab team did. Strict hierarchies and long processes will hinder product development, Utschig said. He suggested focusing on getting the timing right – the more cadence you lose, the less value you generate.

InfoQ interviewed Clemens Utschig, CTO & head of IT technology strategy at Boehringer Ingelheim, and Gerard Castillo, backend engineer at BI X Digital Lab, about scaling pilots into a global IT organization.

InfoQ: Why did Boehringer Ingelheim start BI X?

Clemens Utschig: BI X is the digital lab of Boehringer Ingelheim, a top 20 global researching pharmaceutical company. It was founded in 2017 with the idea to lead Boehringer Ingelheim’s digital transformation, through new means of working, ideating and building cutting edge disruptive digital products.

While most incubators locate themselves in one of the well-known startup hubs such as Berlin, Boston or Tel Aviv, we purposely picked Ingelheim, the global headquarters of Boehringer Ingelheim, in order to stay connected to our mother company. Thus, we are able to leverage the vast amounts of knowledge inside the company and ease the handover of pilots later.

InfoQ: How does BI X use and co-develop core IT services?

Utschig: As mentioned, proximity plays an important role for us. And this proximity is quite literal: IT and BI X work together at the platform level every day – be it on security, the UX stack, or the base platform, just to name a few examples. Remember – different products need different IT services (e.g. eCommerce, chatbots, etc.; these need to be supported later beyond one specific product). This way we ensure that when handover time starts, the technology is not unknown, and risk mitigations – where needed – are available.

Gerard Castillo: If you want to push/shape for digital transformation, you need to get involved with the different departments and existing teams, to get to know each other, to understand how mindsets and workflows are coexisting. One can view this as creating small communities based on different chapter topics such as PaaS, data science, UI, UX, etc, to go over the issues and align on next steps. Therefore, this is not only about digital transformation from a technical point of view, but also about aligning the mindsets to leverage such transformation.

It is crucial to meet regularly and assure everyone has a voice and gets involved in specific actions that are focused on a vision. Once that is happening and you feel the inertia, then you can say you are doing the transformation.

InfoQ: What challenges did you face when scaling up the results of your pilots?

Utschig: There were many issues, for example:

  • Searching for talent way too late
  • Needing to position/brand as an employer for digital job profiles – we had to rethink our recruiting approach and the way we hit the media
  • Outdated documentation – so knowledge got lost
  • Handing over technology and forgetting about product vision
  • Not bringing the new ways of working to the scale-up team, e.g. quick and direct feedback, no hierarchy etc.
  • Understanding that in the scale-up phase, sometimes the real complex work is only starting. Remember, the MVP/Pilot generates a lot of appetite and now it is time to deliver, which sometimes can be very tough.

InfoQ: What have you learned from scaling pilots into a global IT organization?

Utschig: There are learnings that we applied from our first waves of product developments:

  • Ensure we include our IT colleagues early on and with that, increase our mutual learning (new technologies versus operating large scale good practice quality guidelines and regulations)
  • Automation is key, especially in regards to Continuous Delivery and Regression Testing
  • Handling products for internal customers versus external customers is a vast difference, and we are still learning on this one
  • We are able to bring people closer together, reduce silo thinking and strengthen interfaces. We have proven this and it continues to change.

We still have a lot to learn and we know it, but we are up to this challenge on all levels as it is worth it – for our customers, and the speed and quality of all our products to market.


Google Cloud Team Releases AutoML Natural Language

MMS Founder
MMS Srini Penchikala

Article originally posted on InfoQ. Visit InfoQ

The Google Cloud team recently announced the generally available (GA) release of the AutoML Natural Language framework. AutoML Natural Language supports features for data processing and discovering insights from text data. It also supports common machine learning tasks like classification, sentiment analysis, and entity extraction. It's useful in the following types of applications:

  • Categorizing digital content (such as news, blogs, or tweets) in real-time to allow the users to see patterns and insights.
  • Identifying sentiment in customer feedback comments.
  • Turning dark, unstructured scanned data into classified and searchable content. Gartner defines dark data as the information assets organizations collect, process and store during business activities, but fail to use for other purposes like analytics.

AutoML Natural Language works with a wide range of content such as collections of articles, scanned PDFs, or previously archived records. There are three steps in how the tool works:

  1. Upload the documents: In the first step, you upload the documents using the AutoML Natural Language UI and label the text data based on your domain-specific keywords and phrases.
  2. Train the custom model: The tool then runs the machine learning tasks to classify, extract and detect sentiment.
  3. Evaluate the model: The last step involves getting insights that are relevant to the specific needs of the users.

AutoML Natural Language also supports analyzing PDF documents, including native PDFs and PDFs of scanned images. To help with challenging use cases such as understanding legal documents or classification of complex content taxonomies, AutoML Natural Language supports 5,000 classification labels, training up to one million documents, and document sizes up to 10 MB.

Chicory, a digital shopping and marketing solution provider for the grocery industry, uses the PDF scanning functionality. According to Asaf Klibansky, director of engineering at Chicory:

We are using AutoML to classify and translate recipe ingredient data across a network of 1,300 recipe websites into actual grocery products that consumers can purchase seamlessly through our partnerships with dozens of leading grocery retailers like Kroger, Amazon, and Instacart.

AutoML Natural Language also has some advanced features to help with understanding the documents better. AutoML Text & Document Entity Extraction incorporates the spatial structure and layout information of a document for model training and prediction. This leads to better understanding of the entire document, and is especially valuable in cases where both the text and its location on the page are important, such as invoices, receipts, resumes, and contracts.

The product is FedRAMP-authorized at the Moderate level, making it easier for federal agencies to benefit from Google AI technology.

To learn more about AutoML Natural Language and the Natural Language API, check out their website, Get Started documentation and a Quick Start tutorial.

Other products in Google Cloud AutoML include AutoML Vision, AutoML Video Intelligence (beta), AutoML Translation, and AutoML Tables.


Datawire Announces the Ambassador Edge Stack Early Access Program

MMS Founder
MMS Wesley Reisz

Article originally posted on InfoQ. Visit InfoQ

Datawire last week announced the release of the Ambassador Edge Stack 1.0, available as part of an early access program. The Ambassador Edge Stack is an integrated edge solution that empowers developer teams to rapidly configure the edge services required to build, deliver, and scale their applications running in Kubernetes. Datawire is the company behind Ambassador, the open source Kubernetes-native API gateway built on top of Envoy, and Telepresence, a CNCF-hosted tool enabling programmers to develop locally while connecting a service to a remote Kubernetes cluster.

The "edge", a heavily overloaded term in software today, here describes the edge of a Kubernetes cluster: the implicit boundary over which microservices are exposed to users outside the organization. Promising a comprehensive developer experience, the platform helps developers directly manage policies, implement a variety of security needs, facilitate troubleshooting through observability features, and configure traffic management (including TCP, HTTP, gRPC, and WebSocket). The Ambassador Edge Stack replaces the need for a separate layer 7 load balancer, API gateway, Kubernetes ingress controller, and developer portal.

The set of policy options for externally exposed microservices has extended well beyond API protocol and security choices to include options like automatic retries, timeouts, rate limits, canary testing, and the number of instances, said Richard Li, CEO of Datawire. As a result, developers are likely to make one or more edge changes with each release. The Ambassador Edge Stack provides developers with an easy-to-use interface to change any edge policy so they can develop, test, and execute with confidence, while freeing the platform team to play a more strategic role.

In a press release announcing the Ambassador Edge Stack, the company described the modern cloud-native application release impediment as one of the big reasons for the creation of the stack. Organizations are moving en masse to adopt Kubernetes as a way to boost developer productivity and increase the release velocity of their software projects. Leveraging microservices, teams are iterating quickly and releasing services at varying velocities. In these releases, teams define security, resilience, and traffic management policies for their services. However, the actual implementation of these definitions typically relies on an operations team. As teams expand, features increase, and velocity continues to climb, the operations team's backlog balloons, hurting the overall velocity of the system. The goal of Ambassador Edge Stack is to address this release impediment and eliminate the operations team's backlog by giving developers the power to implement (not just specify) the policy changes for their features. Some of the core capabilities of the Ambassador Edge Stack include:

  • Edge Policy Console. Configure, manage, and visualize edge policies via a graphical interface.
  • Easy-to-use security. Secure microservices with automatic HTTPS configuration via integrated ACME support, OAuth/OpenID Connect integration, rate limiting, and fine-grained access control.
  • Availability. Ensure microservice availability by configuring resilience strategies such as automatic retries, timeouts, circuit breakers, and rate limiting.
  • Developer-onboarding. Accelerate developer onboarding with an auto-updated API catalog, a customizable developer portal, and API documentation generated from Swagger/OpenAPI.
  • Observability. Facilitate troubleshooting via native Envoy-based support for distributed tracing, metrics collection, and logging.
  • Modern traffic management. Configure traffic routing across a wide variety of protocols including TCP, HTTP/1.x, HTTP/2, gRPC, gRPC-Web, and WebSockets. Utilize traffic management controls including traffic shadowing, canary routing, and cross-origin resource sharing (CORS).

The Ambassador Edge Stack is freely available today as part of their early access program. Going forward, the stack will be available in both a free community edition and an enterprise edition. To learn more about the Ambassador Edge Stack, visit the early access page.


Presentation: Build Your Own WebAssembly Compiler

MMS Founder
MMS Colin Eberhardt

Article originally posted on InfoQ. Visit InfoQ

Transcript

Eberhardt: My name’s Colin Eberhardt. I’m the technology director of a company called Scott Logic, which I doubt any of you have heard about. We’re a fairly small UK-based software consultancy that specializes in writing some large-scale applications for financial services. I spend a lot of time writing fairly large-scale JavaScript applications.

I'm also a bit of a fan of WebAssembly, as you might guess from the talk title. I run the WebAssembly Weekly newsletter. I've heard it's the most popular WebAssembly newsletter. It's actually the only one, but it's still the most popular. I published the 100th issue just yesterday, so I'm pretty chuffed with that. One of the reasons I'm quite interested in WebAssembly is that there's a lot going on in that community; there's a lot of innovation going on. In fact, just yesterday there was an announcement about a new group called the Bytecode Alliance, which is a new industry partnership between a number of different companies that are looking to make a secure-by-default WebAssembly ecosystem. This is actually intended to tackle some of the problems that were talked about in the previous talk around vulnerabilities in untrusted third-party NPM modules.

Why Do We Need WebAssembly?

Why do we need WebAssembly? To me, this slide sums it up really quite nicely. To give you a bit of context, JavaScript was invented about 25 years ago, and it was originally conceived as a relatively simple scripting language to add a little bit of interactivity to the web. These days we're not just writing a few hundred lines of JavaScript. Typically, we're writing thousands or tens of thousands of lines of JavaScript. The tooling we have is really quite advanced compared to that which we had 20 or so years ago. We've got things like TypeScript and Babel. We've got UI frameworks like React that are doing all these really quite clever things, and our tooling is doing all kinds of complex compilation, transpilation, and transforms. Yet the thing that it emits is still JavaScript – this kind of obfuscated, hard-to-read JavaScript. JavaScript is a compilation target. Some have called it the assembly language of the web. You might think, "Yes, that's fine." Well, it's not fine.

To understand why it’s not fine, you have to look at how JavaScript is executed within the browser. The first thing that happens is your browser engine receives the characters over HTTP and then it has to parse them into what’s known as an abstract syntax tree, and we’ll get onto that in a little bit. From there, it’s able to generate a bytecode, and at that point, it’s able to initially run your application using an interpreter. The thing is, interpreters are pretty slow. JavaScript of 10, 15 years ago was very slow because it was just interpreted. These days the runtime will monitor the execution of your application, and it will make some assumptions. It will make some assumptions about, for example, the types being used, and from there it’s able to, just in time, emit a compiled version of your code which runs faster.

Modern browsers have multiple levels of optimization, multiple tiers. Unfortunately, if some of the things they did to optimize your code prove to be invalid – if some of the assumptions are invalid – it has to bail out and go to a slightly lower performance tier. These days JavaScript runs really quite fast, but unfortunately, because of these steps you see in front of you, it takes a long time to get there. What this means is that from an end-user perspective, there's an impact. There's an impact on how long it takes for your JavaScript application to get up and running. The first thing, as I mentioned previously, is your code is parsed, it's compiled, optimized, re-optimized, and eventually executed at a pretty decent speed, and then it's eventually garbage-collected. The way that JavaScript is delivered to the browser today has an impact on end-users.

What is WebAssembly? On the WebAssembly website, they have this really nice one-line description: "WebAssembly is a new portable, size- and load-time-efficient format suitable for compilation to the web." I'm going to just pick this apart a little bit. It's portable; as you'd expect, it's a web technology. You expect it to work in Safari, in Chrome, in Edge, for example. It's a size- and load-time-efficient format. It's not an ASCII format, it's a binary format. It's load-time-efficient: it's designed to load and execute quickly. Finally, it's suitable for compilation to the web. JavaScript is not suitable for compilation to the web. It was never ever designed to do that, whereas WebAssembly was designed with compilation in mind from day one. Also, it was designed to be a compilation target for a wide range of languages, so not just JavaScript – in fact, JavaScript's tricky to compile to WebAssembly – but it was designed for C++, C#, Rust, Java, to bring all of those languages to the web as well.

Finally, to bring that home, the sort of timeline of execution of JavaScript is shown at the top. Contrast that with WebAssembly. All the browser has to do is decode, compile, optimize and execute. This happens much more rapidly when compared to JavaScript. That’s why we need WebAssembly.

Why Create a WebAssembly Compiler?

Why create a WebAssembly compiler? Why am I standing here telling you about how to create your own WebAssembly compiler? There are a couple of reasons. The first is something I read at the beginning of the year. Stack Overflow each year publishes a fascinating survey. I think they have around 20,000 respondents, where they ask people about the languages and the frameworks they use. They also ask them for their sentiments: what languages, tools, and frameworks do they love, what do they dread – maybe not hate – what do they enjoy using. Interestingly, WebAssembly was the fifth most loved programming language. I thought, "That's nuts. How many people out there are actually programming in WebAssembly?"

WebAssembly was the only compilation target listed under the most loved languages. I thought, "Maybe people want to know a little bit more about WebAssembly, the language itself." That was one of the ideas that made me think about writing this talk. The next one is that I've always had a bit of a programming bucket list. There are things that I've always wanted to do as a programmer, things like create an open-source project and meet Brendan Eich – which I did; he's a lovely guy. I came from a physics background, so I didn't study computer science, so I never got the opportunity to learn about compilers and so on. I've always wanted to do that. Those two things coupled together made me think, "OK, I'm going to create my own language. I'm going to write a compiler that compiles it to WebAssembly, and then I'm going to do a talk on it." It took me many weeks and months, and there's quite a lot to go through.

In order to constrain my experimentation – I had all kinds of crazy ideas – I thought, "I'm going to write a programming language that has enough structure to it to achieve a fairly modest goal. I want to create a programming language that allows me to render a simple fractal, a Mandelbrot set." This is an example of the language. It's not a very nice looking language, but it'll do.

Let's take the first step. Let's look at creating the simplest, the most trivial wasm module, in code. I'm using TypeScript here, so I'm going to construct the simplest wasm module possible. It turns out the simplest wasm module is just eight bytes long. It starts with a module header, and if you're any good with your ASCII codes, I'm sure you'll know that 61 73 6d is "asm". Next is the module version number, which at the moment is version one. Concatenate these together, and you get the simplest possible WebAssembly module. Typically, you wouldn't do this. Typically, you wouldn't construct this in memory. Typically, this would be downloaded over HTTP as a binary wasm module, but you can do either.
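
As a rough sketch – not the talk's exact code, but matching the description above – those eight bytes can be built up in TypeScript like this:

```typescript
// A minimal sketch of the eight-byte module described above.
const magicModuleHeader = [0x00, 0x61, 0x73, 0x6d]; // "\0asm"
const moduleVersion = [0x01, 0x00, 0x00, 0x00];     // version 1
const emptyModule = Uint8Array.from([
  ...magicModuleHeader,
  ...moduleVersion,
]);
```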

In order to run this, what you have to do is take your binary and instantiate it using the WebAssembly APIs. There's a new set of APIs for JavaScript that allow you to instantiate and interact with WebAssembly modules. This already illustrates some quite interesting properties of WebAssembly. You don't download it into the browser directly from a script tag. The only way to get a WebAssembly module instantiated and to interact with it is through the JavaScript host. At the moment, this WebAssembly module does absolutely nothing, but it's still a perfectly valid module. Let's make it a little bit more complicated. Let's try to create a module that does something vaguely useful – a simple add function.
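
A minimal sketch of that instantiation, assuming the `emptyModule` bytes from the sketch above:

```typescript
// Instantiate the hand-crafted bytes via the JavaScript WebAssembly API.
WebAssembly.instantiate(emptyModule).then(({ instance }) => {
  // A valid, if useless, module: no imports and no exports yet.
  console.log(instance.exports); // {}
});
```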

Rather than looking at it in the binary format and at hex codes, it's a little easier to look at it in the WebAssembly text format. If you've ever done any assembly language programming – typically, when you're doing low-level programming, you'll use assembly language, which is a slightly more human-readable version of the machine code it represents – WebAssembly has the same kind of two different views. We have the text format and the binary format. This is a very simple valid WebAssembly module that provides an add function. It's a function that has two parameters of type float, 32 bits. It returns a 32-bit result. The way that it works is it gets the two parameters using the get_local opcode, and then it adds the two together. Finally, this is exported to the host so that it can execute it.
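
For reference, a sketch of what that text-format module might look like, using the pre-1.0 get_local mnemonics the talk uses (embedded here as a TypeScript string; a tool such as wabt's wat2wasm would turn it into a binary):

```typescript
// The text-format add function described above, as a reference string.
const addWat = `
(module
  (func $add (param f32) (param f32) (result f32)
    get_local 0
    get_local 1
    f32.add)
  (export "add" (func $add)))
`;
```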

This gives us a little bit more insight into what WebAssembly is actually like. It has a relatively simple instruction set – an assembly-like instruction set. If you've ever done any assembly language programming, it's got that feel to it. It only has four numeric types: two integer types and two floating-point types, and that's it. It's a stack machine: you'll notice the add operation here – the two get_locals push two values onto the stack, then the add instruction pops those two values, adds them together, and pushes the result back onto the stack. Finally, the function returns the remaining value on the stack.

One other interesting thing is that WebAssembly has absolutely no built-in I/O. It cannot access the DOM, it can't access a file system, it can't do a console log. This is quite important for WebAssembly: it means that the attack surface is nonexistent. The only way for WebAssembly to interact with the DOM or its host is through exported and imported functions. To encode that in the binary format, the binaries are arranged into sections. You've seen the header and the version number. Following that, you have the type section, the import section, and the function section. These are packed together in sequence. I'm not going to go into too much detail; you don't need to know the ins and outs of all of these. The main reason it's split into these various different sections is to minimize the size of the WebAssembly module. If you have two functions with the same signature, it makes sense to have that signature encoded once in the type section and reference it later. That's the main difference between the text format and the binary format.

Let's construct an add function again in code, in this case using TypeScript. The code is really quite simple. I have an enumeration of my available opcodes. I'm just taking the get_local opcode. Following that, I'm encoding the zero index. This uses an unsigned LEB128 encoding. All you need to know is that that's a very standard variable-length encoding, between one and four bytes long. Next, I'm encoding my second get_local opcode and finally the add opcode, and that's it – that's my WebAssembly code. The next thing I need to do is package that up into a function body, and again, this is using some very simple encoding; all the encodeVector function does is prefix my vector with its length, and that's it.
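
A sketch of those encoding steps; the opcode values are the real MVP opcodes, while the helper names follow the talk's description but are otherwise assumptions:

```typescript
// Opcodes used by the add function (values from the WebAssembly MVP spec).
enum Opcodes {
  get_local = 0x20,
  f32_add = 0x92,
  end = 0x0b,
}

// Unsigned LEB128: the standard variable-length integer encoding.
const unsignedLEB128 = (n: number): number[] => {
  const buffer: number[] = [];
  do {
    let byte = n & 0x7f;
    n >>>= 7;
    if (n !== 0) byte |= 0x80; // more bytes to follow
    buffer.push(byte);
  } while (n !== 0);
  return buffer;
};

// Prefix a byte vector with its length, as the talk's encodeVector does.
const encodeVector = (data: number[]): number[] => [
  ...unsignedLEB128(data.length),
  ...data,
];

// The body of the add function: push both params, add, end.
const code = [
  Opcodes.get_local, ...unsignedLEB128(0),
  Opcodes.get_local, ...unsignedLEB128(1),
  Opcodes.f32_add,
  Opcodes.end,
];
```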

Finally, I'm constructing my code section by encoding my vector of functions together. This is pretty much it. It doesn't matter if you don't understand every single line here. All you need to really understand is that it's relatively simple to handcraft these WebAssembly modules. Finally, I'm able to instantiate the wasm module and invoke the exported function. If I look at the output in binary format, again, I can see the actual code at the end there. If you recall, the get_local opcode had a hex code of 0x20, for example. All really simple.
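
A minimal sketch of that last step, assuming `wasmBinary` holds the module assembled from the sections just described:

```typescript
// Instantiate the assembled module and invoke its exported function.
const { instance } = await WebAssembly.instantiate(wasmBinary);
const add = instance.exports.add as (a: number, b: number) => number;
console.log(add(1.1, 2.2)); // ~3.3, within f32 precision
```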

Building a Compiler

Let's start looking at how we can turn this simple example into a compiler. Before delving into the detail, I just want to get a little bit of terminology out of the way, if you're not quite familiar with it. My language is comprised of statements. At the top level, I just have a collection of statements. Here's one example, a variable declaration statement. As you can imagine, this declares the variable b and assigns it the value zero. Here's the variable assignment statement: it assigns an existing variable a new value. Interestingly, statements can have other statements nested within them, as is the case with the while statement. Another important component of the language is a concept called expressions. Expressions return values, whereas statements are void – they do not return values. Finally, here we see an expression tree. Expressions can be composed using brackets and operations. These are the basic building blocks of my language.
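
A sketch of how those building blocks might look as TypeScript types – the talk defines its AST as TypeScript interfaces, though these exact names are assumptions:

```typescript
// Statements are void; expressions return values.
interface Program { statements: StatementNode[]; }

type StatementNode = PrintStatement | VariableDeclaration | WhileStatement;

interface PrintStatement { type: "printStatement"; expression: ExpressionNode; }
interface VariableDeclaration { type: "variableDeclaration"; name: string; initializer: ExpressionNode; }
interface WhileStatement { type: "whileStatement"; condition: ExpressionNode; statements: StatementNode[]; }

type ExpressionNode = NumberLiteral | BinaryExpression;

interface NumberLiteral { type: "numberLiteral"; value: number; }
interface BinaryExpression { type: "binaryExpression"; left: ExpressionNode; right: ExpressionNode; operator: string; }
```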

Then we'll look at the basic building blocks of the compiler itself. You might have heard of these terms before. I'd certainly heard of them, but I hadn't had the chance to explore them personally. The first thing that happens is my code is processed by my tokenizer into an array of tokens. It's then parsed into an abstract syntax tree, and then finally the emitter produces my wasm binary. We're going to visit each one of these in turn.

For the first version, the 0.1 version of my language, which is called chasm – I asked people on Twitter for some good programming language names; 99% of them were terrible, and some of them weren't even repeatable. This is version 0.1 of my programming language. My first iteration was a programming language which did nothing more than allow me to print numbers, and that was it.

Let's look at the tokenizer for this language. Rather than looking at the code of the tokenizer, I thought it's easier to actually look at what this code does. It's only about 15 lines of code, and it's comprised of a number of regular expressions which match the various building blocks of my language. The top regular expression we have here matches one or more digits or decimal points – bonus points for noticing that it's not a terribly robust regular expression; it allows multiple decimal points, but we'll gloss over that. My next regular expression matches a couple of my keywords, "print" and "var," and the final one matches whitespace.
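
A sketch of those matchers; the patterns are assumptions that fit the description:

```typescript
// Each matcher pairs an anchored regular expression with a token type.
const matchers = [
  { regex: /^[.0-9]+/, type: "number" },      // digits and decimal points
  { regex: /^(print|var)/, type: "keyword" }, // the keywords so far
  { regex: /^\s+/, type: "whitespace" },      // matched but discarded
];
```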

The tokenizer advances through the input, which is my program written in chasm, and matches these tokens one after another. At the first location here, the whitespace pattern matches, and that does nothing. As it advances to the next location, my keywords regular expression matches, and this causes it to push a token to the output. It then advances to the next whitespace again, which is ignored. Finally, it matches the number token, which is pushed to the output. It also retains the index at which each token matches, for future debug support, which I haven't implemented yet.

The output here is just a couple of tokens. What we see here is that it's removed the whitespace. In my language, whitespace is not semantic – it has no meaning, so it can be disposed of. The tokenizer also provides some basic validation of the syntax. The ability to tokenize a text string doesn't necessarily mean it's executable, but it will throw an error, for example, if it comes across a keyword which isn't valid in your language.
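
Putting it together, a sketch of the advancing loop just described, reusing the matchers above:

```typescript
interface Token { type: string; value: string; index: number; }

const tokenize = (input: string): Token[] => {
  const tokens: Token[] = [];
  let index = 0;
  while (index < input.length) {
    const rest = input.substring(index);
    // Find the first matcher whose pattern matches at the current location.
    const match = matchers
      .map(m => ({ type: m.type, result: m.regex.exec(rest) }))
      .find(m => m.result !== null);
    if (!match || !match.result) {
      throw new Error(`Unexpected token at position ${index}`);
    }
    const value = match.result[0];
    if (match.type !== "whitespace") {
      tokens.push({ type: match.type, value, index }); // index kept for debugging
    }
    index += value.length;
  }
  return tokens;
};
```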

The next step is a little bit more complicated. This is our parser. The parser takes the tokens as its input. There's a little bit more code going on here. I'm going to draw your attention to certain parts of it, so don't worry if you don't understand all of it. Just like the tokenizer, this advances step-by-step. We have a pointer to the current token, and we also have a function that eats the current token and advances to the next token in the input. We've got some code which I'll elaborate on shortly. This is the main body of my parser. As I mentioned previously, my language is comprised of a collection of statements, so my parser is set up just like that. It expects each token to be the start of the next statement in the language. If for whatever reason the tokens do not conform to that, an exception will occur, and it's caught elsewhere.

Let's look at the statement parser. At the moment, my language does nothing more than print numbers. The only token type I'm expecting here is a keyword, and in this case the keyword value is print. In future, there'll be more of them. It eats the token, and the next thing it does is advance to the expression. Each print statement is followed by an expression. Here is the expression parser. Again, the language does nothing more than print simple numeric values. The expression parser is very simple. It matches the type, which is always number, converts this number string into a real numeric type, and eats the token. The output of my parser is the abstract syntax tree. Here you can see these two tokens are converted into an abstract syntax tree, which is a single print statement and a single number literal expression. That's the transformation taking place here.
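
A sketch of this stage, using the token and AST types sketched earlier – an approximation of the talk's code, not the code itself:

```typescript
const parse = (tokens: Token[]): Program => {
  let index = 0;
  const currentToken = () => tokens[index];
  const eatToken = () => tokens[index++];

  // At this stage every expression is a simple numeric literal.
  const parseExpression = (): ExpressionNode => {
    const token = eatToken();
    return { type: "numberLiteral", value: Number(token.value) };
  };

  // Every statement starts with a keyword; only "print" exists so far.
  const parseStatement = (): StatementNode => {
    const token = eatToken();
    if (token.value !== "print") {
      throw new Error(`Unexpected keyword ${token.value}`);
    }
    return { type: "printStatement", expression: parseExpression() };
  };

  const statements: StatementNode[] = [];
  while (index < tokens.length) {
    statements.push(parseStatement());
  }
  return { statements };
};
```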

The final step is the emitter. Again, there's a little bit of code going on here. The emitter iterates over each of the statements and switches on the statement type. Here, the only statement type is print at the moment. The first thing it does is emit the expression that relates to the print statement. This is because WebAssembly is a stack machine: the print operation expects to have the value already present on the stack, so we emit the expression first. Here, the only expression type, again, is a numeric literal at the moment. All we do is take the node value and emit the f32_const opcode. That's a constant-value opcode using the IEEE 754 encoding, which I'm sure you all know. Finally, the print statement itself is implemented as a call. I'll get onto that in a little bit; we'll skip it for the time being.
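
A sketch of that emitter logic; the opcode values are from the MVP spec, and treating print as a call to imported function 0 follows the talk's description:

```typescript
// Little-endian IEEE 754 encoding of a 32-bit float.
const ieee754 = (n: number): number[] => {
  const buf = new Uint8Array(4);
  new DataView(buf.buffer).setFloat32(0, n, true);
  return Array.from(buf);
};

const emitExpression = (node: ExpressionNode, code: number[]) => {
  if (node.type === "numberLiteral") {
    code.push(0x43 /* f32.const */, ...ieee754(node.value));
  }
};

// Reuses unsignedLEB128 from the earlier encoding sketch.
const emitStatements = (statements: StatementNode[]): number[] => {
  const code: number[] = [];
  for (const statement of statements) {
    if (statement.type === "printStatement") {
      // Stack machine: emit the value first, then the call that consumes it.
      emitExpression(statement.expression, code);
      code.push(0x10 /* call */, ...unsignedLEB128(0 /* imported print */));
    }
  }
  return code;
};
```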

It’s time for a demo. If the demo gods are on my side, I should be able to show chasm working. Someone name a number.

Participant 1: Forty-two.

Eberhardt: If I run my compiler, the output is 42. Just to recap, what's happening here is my tokenizer, my parser, and my emitter are all written in TypeScript and compiled to JavaScript – because everything compiles to JavaScript. Within the browser, it's translating that simple application into a WebAssembly module and executing it. I should be able to print something else. If I run that, I can print multiple statements. This is what it looks like. I'm a bit of a magician, because I knew you were going to say 42, and that was not a setup. I've done this a couple of times, and the first time someone said 42 I thought, "I bet everyone says 42," and they do.

As you can see, here's the tokenized output, the abstract syntax tree, and the final WebAssembly module. You can see it all together. As I promised, I said I'd return to the print statement. As I mentioned, WebAssembly has no built-in I/O. In order to perform my print statement, I want to do effectively a console log, so I have to work with the JavaScript host in order to achieve that. WebAssembly modules can import and export functions. By importing a function, it's able to execute a JavaScript function, and by exporting one, it allows the JavaScript host to execute one of the WebAssembly functions. That's how WebAssembly performs I/O. For example, if you want to do something a little bit more meaningful, like interact with the DOM, you have to do it through function imports and exports.
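
A sketch of that wiring from the JavaScript side; the "env"/"print" import naming and the "run" export name are assumptions, and wasmBinary stands for the compiled chasm module:

```typescript
// Supply a JavaScript function the module can call as its print import.
const importObject = {
  env: { print: (value: number) => console.log(value) },
};
const { instance } = await WebAssembly.instantiate(wasmBinary, importObject);
// The module's exported entry point can now log via the host.
(instance.exports.run as () => void)();
```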

For the next version of my chasm language, I wanted to implement more complex expressions. I wanted to create expression trees to allow me to do some fairly simple maths. I'm not going to delve into each of the steps in quite so much detail; I'm going to accelerate a little bit here. To support this, my tokenizer only needed another couple of regular expressions, and that's about it. I had to add a regular expression to match brackets – and that only took me five minutes because I had to work out the escaping and all that lot. Then I have another regular expression which matches the various operators I support, and that's it. My tokenizer is good to go.

Looking at the parser side of things, the only thing I had to update was my expression parser. This is a little bit more interesting. Here's what happens if it encounters parentheses in the array of tokens. In the array of tokens, you expect to see the left-hand operand, the operator in the middle, and the right-hand operand, and the parser expects them in that order. To allow nesting, the left- and right-hand sides recursively call the expression parser once again. With a few additional lines of code, my expression parser is now able to construct an abstract syntax tree which is truly tree-like. Here, print ((42 + 10) / 2) is encoded as that abstract syntax tree.
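
A sketch of that recursive case, building on the parser helpers sketched earlier:

```typescript
const parseExpression = (): ExpressionNode => {
  if (currentToken().value === "(") {
    eatToken(); // consume "("
    const left = parseExpression();  // recurse for the left operand
    const operator = eatToken().value;
    const right = parseExpression(); // recurse for the right operand
    eatToken(); // consume ")"
    return { type: "binaryExpression", left, right, operator };
  }
  // Base case: a plain numeric literal.
  return { type: "numberLiteral", value: Number(eatToken().value) };
};
```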

Moving on to the emitter. Again, there are a few additional things going on here which I'm going to point out. My expression emitter now uses a visitor pattern. I'm sure you'll probably have heard of the visitor pattern before; it's a fairly classic software engineering pattern. In this case, I'm using a tree visitor. My abstract syntax tree is a tree, and my traverse function visits every node of that tree, executing a function – that's the visitor. This is a depth-first post-order traversal. What that means is that it visits the left-hand node, then the right-hand node, then the root. The reason it does that, again, is that WebAssembly is a stack machine; this sets up the operands in the correct order. Then, when it encounters a binary expression, all it has to do is convert the operation into the right opcode, and that's pretty much it.
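
A sketch of the traversal and the operator-to-opcode mapping (opcode values from the MVP spec):

```typescript
// Depth-first post-order traversal: operands first, then the operation.
const traverse = (
  node: ExpressionNode,
  visitor: (n: ExpressionNode) => void
) => {
  if (node.type === "binaryExpression") {
    traverse(node.left, visitor);  // left subtree first...
    traverse(node.right, visitor); // ...then the right subtree...
  }
  visitor(node); // ...and the node itself last
};

// The binary-expression visitor maps each operator to an opcode.
const binaryOpcode: Record<string, number> = {
  "+": 0x92, // f32.add
  "-": 0x93, // f32.sub
  "*": 0x94, // f32.mul
  "/": 0x95, // f32.div
};
```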

Demo time once again. If I'm lucky, if I run that, it does some basic maths for me. One thing that I found interesting here is that it took me quite a while to set up my original compiler architecture – the parser, tokenizer, emitter, and that sort of thing. Once I started to add extra features to my language, it became really quite easy. Just the small concept of having expression trees which can be executed: that was two or three lines of extra code in the tokenizer, maybe 10 or so extra lines of code in the parser, and maybe another 10 in the emitter. All of a sudden, my language is a lot more powerful. I'm going to again accelerate a little further. I'm not going to go into all of the details; I'm just going to touch on a few different things.

For the next version of chasm, I wanted to add variables and while loops. We'll look at how variables map between my language and WebAssembly. WebAssembly is composed of multiple functions, and functions have parameters and a return value, as in most languages. They also have the concept of locals: each function has zero, one, or more local variables. On the left-hand side here, you'll see my simple chasm application. It declares a variable f, assigns it the value 23, and then prints the value of that variable. On the right-hand side, this is roughly speaking how you do the same in WebAssembly. We define a function that has a single local, which is my variable f. We set up a constant and store it within that local, so set_local 0. Then we retrieve it using get_local 0 and then call my print function. I know an optimizing compiler would trash a few of those operations, but I hope you get the point.
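
The mapping just described, sketched in the text format; the import and export names are assumptions:

```typescript
// Roughly the wasm equivalent of "var f = 23, print f" (pre-1.0 mnemonics).
const printLocalWat = `
(module
  (import "env" "print" (func $print (param f32)))
  (func $main (local f32)
    f32.const 23
    set_local 0
    get_local 0
    call $print)
  (export "run" (func $main)))
`;
```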

Mapping variables from my language to WebAssembly is really quite easy. All I have to do is maintain a symbol table which maps the variable name to the index within the function, and that's it. It was really quite simple. While loops – again, surprisingly simple. An interesting thing about WebAssembly is that even though it is an assembly-like language, it has some surprisingly high-level concepts mixed in. You've already seen that it has functions, which is quite surprising for something that claims to be an assembly language. It also has loop constructs, and it has if and else. For example, when I wanted to implement while loops, I was able to use blocks and loops within WebAssembly. The way this works is that the loop condition is encoded, and then it uses the eqz opcode, which determines whether the current value on the stack is equal to zero. The next one is br_if, branch-if, to a stack depth of one. What that means is that if the stack value is equal to zero, it breaks to an execution stack depth of one, which breaks out of both the loop and the block. If that's not the case, it executes the nested statements and then branches to a stack depth of zero, which repeats the loop.
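
A sketch of both ideas – the symbol table and the loop encoding. Opcode values are from the MVP spec; emitExpression and emitStatement stand in for the earlier emitter sketches:

```typescript
// Symbol table: variable name -> local index, allocated on first use.
const symbols = new Map<string, number>();
const localIndexForSymbol = (name: string): number => {
  if (!symbols.has(name)) symbols.set(name, symbols.size);
  return symbols.get(name)!;
};

// while loop: block + loop, with br_if to escape and br to repeat.
const emitWhile = (node: WhileStatement, code: number[]) => {
  code.push(0x02, 0x40); // block (void)
  code.push(0x03, 0x40); // loop (void)
  emitExpression(node.condition, code); // comparisons leave an i32 0 or 1
  code.push(0x45);       // i32.eqz: is the condition false?
  code.push(0x0d, 0x01); // br_if 1: break out of both loop and block
  node.statements.forEach(s => emitStatement(s, code));
  code.push(0x0c, 0x00); // br 0: jump back to the top of the loop
  code.push(0x0b, 0x0b); // end loop, end block
};
```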

Let's give that a quick demo. This is going to be a little bit more complicated. Let's start with "var f = 0, while (f < 10), f = (f + 1), print f, endwhile." That all works; super chuffed. As with the previous upgrade to the chasm language, it wasn't that hard to add these relatively high-level concepts.

Finally, chasm version 1.0 – time for a major release: the setpixel function. Rendering a Mandelbrot is really quite simple; you don't need that many different language constructs to do the basic maths. The final piece of the puzzle I needed was setpixel. This is interesting because, as I've mentioned a few times, WebAssembly has no built-in I/O. How do you write to the canvas in order to render to the screen with WebAssembly? You could use function imports and exports. I could have a JavaScript function called setpixel and import that into my WebAssembly module, but that would be relatively inefficient. It would be quite chatty over the WebAssembly-JavaScript boundary. There's actually a slightly smarter way of doing it.

Previously I mentioned that the only way to do I/O with WebAssembly is through function imports and exports. There's actually an additional way of performing I/O as well. WebAssembly modules can optionally have a block of linear memory, and they are able to read and write to it using store and load operations. Interestingly, this linear memory can be shared with the hosting environment. In the case of a JavaScript host, this is an ArrayBuffer. Both your WebAssembly application and your JavaScript application can read and write to the same block of memory. What I did was basically set the memory up as video RAM, effectively. This is my kind of virtual canvas.
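
A sketch of that shared-memory setup from the JavaScript side; the import naming and the single-page size are assumptions, and wasmBinary again stands for the compiled module:

```typescript
// Create linear memory in JavaScript and pass it to the module as an import.
const memory = new WebAssembly.Memory({ initial: 1 }); // one 64KiB page
const { instance } = await WebAssembly.instantiate(wasmBinary, {
  env: { memory },
});
// JavaScript sees the same bytes the module writes with its store opcodes,
// so this view can be treated as "video RAM" and copied into a canvas.
const vram = new Uint8Array(memory.buffer);
```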

This is the final demo, and I’m not going to type that all out in front of you because I’d never get it right, but that renders the Mandelbrot set. My chasm language is complete. I must admit I worked on this quite a lot in my evenings.

Recap

Finally, to recap: WebAssembly is a relatively simple virtual machine. It has something like 60 opcodes. It's got a really quite simple runtime model. For me, I find that quite fascinating. I'm used to using web technologies that I don't understand – and by that I mean I don't understand them under the hood. I'd like to think I understand how to use them, but things like React, for example, I haven't got the foggiest how they work under the hood. Whereas with WebAssembly, it's quite enjoyable to find a new concept on the web where you can literally understand everything about it. As a result, I find it quite a fun playground for doing some of the things that I used to do back in the kind of eight-bit computing era. I spend a fair bit of time writing WebAssembly by hand – not because I'm crazy; it's fun.

As a bit of an aside, I don’t use TypeScript nearly as much as I should. This project was a really nice reminder for myself about how powerful TypeScript is. For example, the structure of my abstract syntax tree is defined as TypeScript interfaces. I get type checking support, which really ensures that my parser is quite robust.

I also found that creating a simple compiler isn't as hard as I initially thought. It's also a good way of exercising your programming skills. There are quite a few concepts in there – things like visitor patterns and tokenizers – all kinds of interesting software engineering concepts that you come across when you set yourself the goal of writing a compiler.

Also, WebAssembly is a very new technology, and there’s a lot of creative energy being poured into WebAssembly. If you take the time to understand it, there are quite a number of really interesting open-source projects that you can potentially contribute to once you’ve got that knowledge.

Hopefully, you have been inspired by this talk to find out a little bit more. Returning to my bucket list, I've ticked the final one off, I guess – or maybe not. This is one of those fun projects that spiralled out of control. Once I'd got that far, I thought, "I could spend another few days doing strings, or arrays, or functions." The interesting thing is, when you get to things like strings and arrays, you get to the really hard stuff. You get to, for example, memory allocation. WebAssembly doesn't have a garbage collector, so you need to manage memory yourself. You need to work out how to store these concepts within linear memory. It's a lot of fun.

That's how to build your own WebAssembly compiler. All the code is on GitHub if you want to play around with it. Also, it's arranged in such a way that each of the fictitious releases of chasm is a commit. You can roll back right to the beginning, which is that simple few lines of code that makes the first eight bytes, and go commit by commit, step by step, if you're interested in playing along at home.

Questions and Answers

Participant 2: There are a lot of great languages that we want to run on the web through WebAssembly. Is there any reason that the tokenizing or the parsing would have to be changed for WebAssembly, or is it just writing a new emitter for an existing compiler?

Eberhardt: You're talking about real languages now, aren't you? There are a few different ways of doing it. The first languages to compile to WebAssembly were C and C++, using the Emscripten compiler. Under the hood, that uses the LLVM toolchain, which is a modular infrastructure for building compilers. To your point about whether you have to reinvent the wheel – which is I guess what you're asking – compiler technology is already relatively modular. In order to create the first WebAssembly compiler, the C and C++ team were able to build on some preexisting LLVM concepts. It depends on the language, though. For example, a number of the early languages used Emscripten and then LLVM; some of them have used different compiler technologies. We're seeing quite a lot of divergence now in the technologies used.

Participant 3: Besides just for fun, have you been able to apply this to anything, like in your work?

Eberhardt: Yes – not much, admittedly. It's all pretty new technology. In practice, there are a few people using WebAssembly in production, and the ones that are most well-known are AutoCAD, which has a huge C++ codebase, and PSPDFKit, who've taken a half-million-line PDF rendering engine and moved it to the web. In my own line of work, I work within financial services, and we work with a company that had their own bespoke protocol for streaming pricing data. It's relatively old, and it was all written in C++, and they keep getting annoying people saying, "But I want to do it in Node," and they're, "But it's C++." We were able to help them: through Emscripten we compiled their client library for decoding their bespoke protocol, and wrapped a TypeScript layer on top of it to make it play nice. Yes, we have used it in production – not in many cases, but the technology is only two years old.

Participant 4: I haven't looked into WebAssembly enough to really figure this out, but one thing that I've always been curious about is, for those more complex runtimes, how do they interact with memory? I saw in the assembly that you had that it's certainly putting things on the stack, on those implicit registers, I suppose, but how does it work when it's, "It's going to be C++"?

Eberhardt: It's entirely down to the design of the language itself. You've basically got linear memory; you've got a big block of memory. At the moment, Rust ships with its own very lightweight allocator, whereas C#, through Blazor, actually ships a garbage collector, which is compiled to WebAssembly. It's a bit of a blank canvas. It's got that kind of low-level feel. It's up to the language designers and the implementers how they use that memory to best suit their language.

Participant 4: What I'm unclear about is, is there a particular opcode for saying, "I want this much memory and then [inaudible 00:36:30] space," or something like that?

Eberhardt: You ask for a certain number of pages of memory, and you can also grow memory dynamically, but that's how it works. You allocate a block of memory upfront.

Participant 5: When you are working with Rust, or C#, or C code, can you do any I/O, or do you need to have JavaScript in between the two of them?

Eberhardt: WebAssembly has no built-in I/O at the moment. However, there's a working group called WASI – which stands for WebAssembly System Interface – that is defining a core set of I/O operations, although they're not really designed for the browser; they're designed for out-of-browser WebAssembly. The WebAssembly runtime is being used for things like serverless functions and for writing smart contracts on the blockchain, and there is a real need for a standard set of system interfaces there. In the browser, it's making use of automatic generation of bindings. There's a certain amount of glue code required on each side to bridge the boundary. At the moment you can generate lots of that; there's a thing called wasm-bindgen for Rust, which does exactly that. There's additional work, but most of it is hidden by tooling.

Participant 6: One of the dreams, of course, is that you have your backend language used on the front end, like Java or Python. Is that anywhere on the horizon, or is that far away?

Eberhardt: That’s a good question. Taking the example of PSPDFKit, the company that took their PDF rendering engine, they originally had a web version of that product, and that was all running on the server. Through WebAssembly, they were able to shift the same code into the browser and offer that as a commercial product. One of the main reasons that JavaScript is so popular is not how good a language it is – it’s an awesome language – the reason it’s popular is because of the ubiquity of the web platform. It’s the biggest platform out there. I think WebAssembly is a great idea and that it allows other languages to be part of the most ubiquitous platform and runtime there is.

Participant 7: I believe there’s Doom 3 running in WebAssembly in the browser?

Eberhardt: The Unreal Engine was one of the early demos from asm.js, which was a precursor to WebAssembly.

Participant 8: One way to convince other people to pay attention to WebAssembly would be if we had some formal way to quantify the speed benefits we get if we port our application to WebAssembly. Do you have some good examples?

Eberhardt: I've got some bad examples. Bad examples are the ones that are the most revealing. People do ask time and time again, "What's the performance of WebAssembly like?" My response to that is, "What's the performance of JavaScript like?" It's actually pretty damn good. If you look at most algorithmic benchmarks, JavaScript is maybe 30% slower than native code. You ask, "What's the performance of WebAssembly like?" It's only got that 30% gap to span. It can't actually get that much faster, which is why at the beginning of the talk I was focusing on the time it takes for your JavaScript application to reach peak performance. That's what WebAssembly is improving. It's significantly reducing the amount of time to reach peak performance. It's not adding much to peak performance, because JavaScript is pretty fast as it is. It does provide better performance, but you've got to ask the right question.

Participant 9: Just piggybacking on what you were just saying. Would you say that perhaps the biggest value proposition for WebAssembly over JavaScript would be the ability to use other languages at near-native speed?

Eberhardt: Yes. I think that has to be one of the biggest propositions of WebAssembly. Yes, I’d say so. JavaScript is a highly capable language. There are times when it doesn’t give you the performance that you need, but there are very few people here who probably have a real performance issue with JavaScript. Typically, your performance issue is elsewhere. It’s in your use of the DOM APIs or something else. JavaScript is quite fast. Yes, the value proposition really is bringing other languages to the web, but also the value proposition is the WebAssembly runtime now being used on the edge, within clouds, and on the blockchain. It’s bringing that kind of universal runtime to a whole host of other areas as well.

See more presentations with transcripts



Presentation: Beyond Microservices: Streams, State and Scalability

MMS Founder
MMS Gwen Shapira

Article originally posted on InfoQ. Visit InfoQ

Transcript

Shapira: It’s late in the day. After listening to all the other talks, my brain is in overflow mode. Luckily, this is not a very deep talk; it’s going to be an overview. Starting five years back, we started doing microservices. We tried some things, we had some issues, we tried other things. We’ll go through the problems and the different solutions, look at how microservices architectures evolved and where this is all taking us, but I’m not going to go super deep into any of those topics. Some of them were covered in depth in this track already; if you didn’t listen to the other talks, you can catch up on video. Some of them will just be a pointer. If you want to know more about, say, service meshes, there will be a list of references at the end of my slide deck. Once you download it from the QCon site, you can go and explore more. These are highlights, so don’t write in your speaker notes at the end, “Gwen [Shapira] did not go into depth into anything.” I’m telling you upfront, I’m not going into depth into anything.

In the Beginning

Let’s go all the way back to the beginning. My recollection is that the hype cycle for microservices really started maybe five, six years ago. I know some people have been doing it for longer – I’ve seen some references going back to 2011 – but let’s go all the way to when we first decided to use microservices. We had those big monolithic applications, and every time we had to make a change, it was a forced negotiation between five different teams, essentially a reorganization, though back then it may not have gone by that name, and the application was the hostage. You had to have everyone agree in order to make a step forward.

This was slow, we were frustrated, and we said, “What if we break these things up into isolated contexts where a team can own an entire context of execution, start to finish – write it, test it, take it to production, deploy it on their own timeline, make changes on their own timeline? Surely, we can move a lot faster that way.” To a large extent, it worked quite well, and it has definitely become the architecture of choice in most fast-moving companies today, but there was a catch. We took what used to be one application and we broke it down into parts. It shouldn’t be that surprising that they still have to work together; they still have to communicate with each other.

What we started out doing was saying, “Every language in the world, everything that I use, has HTTP. I know how to do REST. I know how to send JSON. Everything has JSON support, and if it doesn’t, I can write it in a weekend. I mean, three months. I mean, maybe two years if it’s a library. It will surely be done.” It sounds fairly straightforward and easy, and that’s what pretty much universally we all did. As we discovered, this type of pattern, in a lot of contexts, basically caused us to fall back into tight coupling. This ended up getting the fun nickname of a distributed monolith, where you have all of the problems of microservices, yet somehow none of the benefits – debugging is pure hell, but you still have to negotiate every change.

I’m going to walk us through some of the ways this happened to us, and then I’ll talk about the patterns that evolved to solve the problem. The first thing that happened, if you weren’t careful, is that clients started knowing way too much about the internal organization that you had. Suddenly, they weren’t just talking to one application and sending everything to that one application, plus or minus a few DNS names. They had to know that in order to do returns, I’m talking over here; in order to do inventory, I’m talking over there. This is a serious problem because clients are not updated on your schedule. They’re updated on someone else’s schedule, which means that if you leak your internal architecture to a client, you’re basically stuck with it for five-plus years. You definitely don’t want to end up there. That was one problem.

The other problem is that along those lines of communication, you had weird shifts of responsibility. Imagine that one of the services is an Order Service – I’m going to talk about Order Services a lot more – and one of them is a Validation Service. The Order Service has to know that once an order is created, it has to call validation, and validation has to know what to call next. Suddenly, a lot of the business logic is really encapsulated in those microservices. One may wonder whether your Order Service has to know the validation rules and whom to talk to about validation. One may wonder whether the service that creates a new insurance policy has to know to call the service that checks where the person lives in order to figure out the correct insurance rate. Basically, your business logic is now leaked all over the place if you’re not careful.

To make things worse, a lot of the concerns are actually common to a lot of those services. If you have a validation service, maybe you have a bunch of services that depend on it, which means that all of them have to know what to do when the validation service is down – do I queue events? Do I retry? Do I fail some calls? There is a bunch of logic that is spread out all over the place, leading to redundancy and shifts in responsibility. Maybe your Order Service shouldn’t have to worry at all about whether the validation service is up or not. Maybe it’s none of its problem.

The other thing is, as this architecture becomes more complex, making changes becomes riskier and you have to test more and more of those services together – again, if you’re not careful. There are ways to get around that; that’s exactly what the rest of my talk will be about. If you’re very naive, you can get to a point where, “Oh, no, I cannot make this change without talking to five other services that may depend on me. Then, I’ll break service number six that I didn’t even know talks to me, because we don’t always know who is talking to whom.” We need to solve that.

The other thing, and I think that’s one of the biggest problems that really slows down innovation in a lot of companies, is that in order to add anything to this graph of microservices, I need to convince someone else to call me. Imagine that I have this amazing innovative service that really handles shipping errors way better than anything else that we had before. I cannot really add any value to the company before I convince someone else to call my service instead of maybe the old service. These other people are busy people. They have their own priorities, they have their own deliverables, they have stuff to do. If I’m not proving to them that I’m adding value to their life, why would they even call my service? I’m stuck in this horrible chicken-and-egg cycle where I cannot prove that my service is valuable until you call me, but you have no reason to call me until I’ve proven that my service is valuable. The motivation to innovate just goes way down because you know it’s not going to be very easy.

Then, the last thing. It’s minor, but it’s also annoying. JSON has to be serialized and deserialized on every hop. HTTP has to open and close connections on pretty much every call. This adds latency. You cannot really reduce latency by adding more microservices, yet it somehow happens that in microservices architectures, the only way we know how to deal with problems is to add more microservices on top. I don’t know exactly how it happened; I’ve seen it happen in my own company. We had a bunch of microservices, and making a change required actually changing three, four, or five different ones. It was really painful, we complained, we had a bunch of architects, we had a discussion. Somehow, at the end, we moved from five to something like 11. That’s how it goes, so you have to be mindful of latency. It’s not exactly a big concern, but it’s also not an entirely dismissible one. Clearly, with all those problems, if you do things naively, it looks like we can do things a bit better.

We Can Do Better

I want to talk about how we’ll do things better. The reason I’m here talking to you about this is that I’ve been solving these kinds of problems for a long time now, sometimes as a consultant, sometimes as an engineer. Recently, I became an engineering manager, which means that when I talk about developer productivity now and how to make engineering teams move faster, I have a lot more skin in the game. It used to be very abstract – “Let’s talk about architectures that are theoretically faster.” Now, I’m there with a stopwatch – “Is the feature done? Why are you wasting time?” Yes, I have a vested interest in all of us building things that make developers more efficient. Also, I tweet a lot, which clearly makes me an expert.

API Gateway

The first pattern I’d like to share is the API Gateway, which should be fairly familiar. How many of you have an API Gateway in your architecture? Yes, I expected every hand to go up. It’s well-known, and I don’t think anyone lives without it. I’m actually at the point where, when I interview engineers, we have those system design questions that we always ask. Usually, they draw the usual boxes – here’s the app server, here are the caches, here’s the database – and then they throw an API gateway on top. I ask, “What is it for?” They say, “We’re going to need it. Trust me.” Yes. That’s how it works.

The API Gateway was originally introduced to solve the problem of the client knowing too much, with the idea that we can put an API Gateway in place to hide all those complexities from the clients, but they turned out to be even more useful than that. They can also handle the fact that, if you think about it, once you expose a lot of microservices to the clients, all of them have to authenticate. Whether every service should be responsible for authentication, whether only the services that happen to sit closest to the clients should be responsible for authentication, whether this is something that you really want to implement multiple times – all of that is very questionable. API Gateways solve that too, and they ended up solving a lot of different things.

The main pattern is that we put an API Gateway in front of all the requests. Clients always talk to the API gateway and say, “I want to talk to the returns endpoint of the API Gateway,” and the API gateway routes it correctly. Because now all the routing happens through the API Gateway, it can also do things like, “If it’s V1 returns, we’ll send you over here, but if it’s V2 returns, we actually have a new V2 service that we are A/B testing, or that works faster if you have a newer client,” or whatever it is that happens. Routing can actually get quite smart.

The main benefit we get from API gateways is that they take responsibility for things that a lot of services need. They do authentication, they do the routing. They also do rate limiting, which is a huge deal. If you’re open to the entire internet, you don’t want every single request to hit you straight in the database; you want layers of control around that. I’ve heard that there are even services built on AWS Lambda or Azure Functions architectures that can basically scale into the thousands at the drop of a hat; someone may try to use that to rapidly order iPads before they run out, so rate limiting can be incredibly useful. Then, because it’s the first place where clients hit your back end, it’s also a really good place for things like tracing and spans if you’re trying to do observability. It’s a very good place to log the first access point and start timing your back end from there. A lot of logging and analytics goes there, too.
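
A toy sketch of those gateway responsibilities in order – authentication, rate limiting, versioned routing – might look like this; the route table, header names, and limits are all invented for illustration:

```typescript
// A minimal sketch of what an API gateway does on each request:
// authenticate, rate-limit, then route by path and version.
import http from "node:http";

const routes: Record<string, Record<string, string>> = {
  "/returns": { v1: "http://returns-v1.internal", v2: "http://returns-v2.internal" },
  "/orders":  { v1: "http://orders.internal" },
};

const requestCounts = new Map<string, number>(); // naive per-client counter

http.createServer((req, res) => {
  // 1. Authentication happens once, at the edge.
  if (!req.headers.authorization) {
    res.writeHead(401).end();
    return;
  }

  // 2. Rate limiting protects everything behind the gateway.
  const client = req.socket.remoteAddress ?? "unknown";
  const count = (requestCounts.get(client) ?? 0) + 1;
  requestCounts.set(client, count);
  if (count > 100) {
    res.writeHead(429).end();
    return;
  }

  // 3. Route by path, and by API version for A/B tests or rollouts.
  const version = (req.headers["x-api-version"] as string) ?? "v1";
  const upstream = routes[req.url ?? ""]?.[version];
  if (!upstream) {
    res.writeHead(404).end();
    return;
  }

  // 4. Proxy to the upstream service (elided); log and trace here.
  res.writeHead(200).end(`routed to ${upstream}`);
}).listen(8080);
```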

As you’ve noticed, the API gateway is super useful. Sometimes, I want to talk to another service and know that I will be routed to the correct one, be rate limited, and get all this logging. It just sounds like a really good thing. What sometimes happens is that internal services start talking to each other via the API Gateway because it makes life a lot easier. It’s this one point where you know you’ll get routed and everything will happen correctly. It definitely started happening to us once we had an API gateway. The people who own it set rules, and they’re willing to protect them with their lives: “No. If you exist in one of our smaller regions, you are not allowed to talk back to the gateway. That’s the worst anti-pattern.”

Service Mesh

We learned, but we still want this goodness – good things still come with API gateways – which brings us to the service mesh, which is, in a word, like API gateways, but internal. The way we normally talk about API Gateways is in terms of North-South traffic: traffic from the outside world into the depths of our data center. The traffic inside our data center is normally called East-West traffic. If you have very good eyes, you can see that I colored things in slightly different colors. You can see that East-West traffic is the traffic between the microservices, not the traffic coming from the client.

When we implement API Gateway functionality for East-West traffic, one of the things to notice is that we have a lot more connections going East-West than we do North-South. Pretty much by definition, the internal web is much denser, which means that having one API gateway – this monolith that everything goes through, a single point of failure, a single bottleneck – is not going to work and scale at all. The pattern we use is to take all these things that we want an API Gateway to do, and scale them out. The pattern used to scale it out is what is known as a sidecar. The idea of the sidecar is that you take shared functionality that every one of your applications is going to need and create its own very small application that can be deployed next to – in the same pod, in the same container, on the same host as – your actual applications.

Why an application and not a library? Because these days, companies often have multiple programming languages. Even us – we’re a small company, we have maybe 100-ish engineers – and we ended up with Python, Go, Java, Scala. Some people are trying to introduce Rust; we are not going to let them. Having a sidecar means, because it’s independent functionality and it talks via network calls, mostly HTTP network calls, that you don’t have to implement the library in multiple languages or try to do interop from Java to C to Go. That clearly would not have been much fun if we had tried it, so sidecars it is.

Basically, that’s the way it looks. The colored circles are my microservices, and I put a small proxy next to each one of them. Why a proxy? Because my application thinks that it’s talking to basically some kind of port on localhost, but our sidecar knows what it’s actually trying to talk to and routes the data correctly. On the way, it does a lot of the goodness that API gateways would also do, and sometimes actually a lot more. Especially if you run on Kubernetes, one of the things that happens a lot is that IPs change frequently, and you have a bunch of load balancers and things that need to know about it. One of the nice things the proxy can do is that you keep talking to the same port on localhost, and the proxy is the one that is aware that things moved around and routes data correctly. It’s the same thing if you want to upgrade from one version of the pink application to the other: you basically route to another version. If you want to do A/B testing, pink application versus green application, you can do the routing, and the purple application does not have to know anything about that. It still thinks it’s talking to the exact same thing. Routing happens magically in the background.
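
As a rough sketch of that idea, a sidecar is just a small proxy on localhost that owns service discovery, so the application never learns where anything actually lives; the discovery stub and addresses below are made up:

```typescript
// Sketch of the sidecar idea: the application always talks to a fixed
// port on localhost; the sidecar proxy knows where the target service
// actually lives right now and forwards the request.
import http from "node:http";

function discover(service: string): { host: string; port: number } {
  // Hypothetical lookup for `service`; real meshes watch the control
  // plane (e.g. Kubernetes endpoints) and update as IPs move around.
  return { host: "10.0.3.17", port: 8080 };
}

http.createServer((req, res) => {
  const target = discover("customer-service");
  const upstream = http.request(
    { host: target.host, port: target.port, path: req.url, method: req.method, headers: req.headers },
    (upstreamRes) => {
      res.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
      upstreamRes.pipe(res);
    }
  );
  req.pipe(upstream);
}).listen(15001, "127.0.0.1"); // the app only ever talks to localhost:15001
```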

The other really nice thing that can happen internally is rate limiting. It basically means that a lot of the intelligence about error recovery – I know there was an entire talk about error recovery – doesn’t have to be built into every single application, which is very powerful. As engineers, a lot of us don’t want to think about error handling at all if we can avoid it, and a lot of times we don’t. Then, we write bad applications that do bad things when errors happen. One of the really bad patterns that sometimes happens is that you send something to the server, and the server sends you back an error, or it has some delay. Instead of handling the error like a grown-up, you throw a tantrum and you just keep retrying until the server gives up and dies, mostly. It happens a lot with databases.

One of the first errors I ever troubleshot in my career was a MySQL database that got viciously attacked by a set of PHP applications. Funny enough, this keeps on happening almost 20 years later. Some things never go away. The idea is that the client can be as buggy as it wants and keep retrying; the proxy knows about rate limiting and it will drop the majority of the retries. The server will actually be protected – it can recover in peace from whatever error just happened – but maybe more importantly, because we’re dropping just the retries, the important traffic that would otherwise get stuck in a very long queue behind all those retries now has a chance of actually getting processed by the server. This architecture is significantly more resilient than a lot of alternatives.
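
One hedged sketch of that retry-shedding behavior is a token bucket that applies only to retries, so first-time requests always pass; the numbers are illustrative, not recommendations:

```typescript
// Sketch of "drop the retries, keep the real traffic" in a sidecar:
// a simple token bucket acting as a retry budget per upstream service.
class TokenBucket {
  private tokens: number;
  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
    setInterval(() => {
      this.tokens = Math.min(this.capacity, this.tokens + this.refillPerSec);
    }, 1000);
  }
  tryTake(): boolean {
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

const retryBudget = new TokenBucket(10, 2); // at most ~2 retries/sec sustained

function forward(request: { isRetry: boolean }): string {
  if (request.isRetry && !retryBudget.tryTake()) {
    return "dropped";   // shed the retry storm; the server recovers in peace
  }
  return "forwarded";   // first-time requests always get through
}
```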

Event Driven

We have the service mesh, which solved a lot of our internal communication problems, but maybe not quite all of them. I’m going to switch to talking about what is probably my favorite design pattern and the one that I’ve been talking about for five years or so. The thing that we really want to solve is the fact that, because of those point-to-point request-response communications, changes are riskier, any two applications are more aware of each other than they have to be, and adding new things is much harder than we would like it to be. The problem – or at least something that is overused, that should be used only in special cases and a lot less than it currently is – is what we call the request-driven pattern. In the request-response pattern, I, as a service, talk to other services; I initiate the communication, and I initiate it with maybe two ideas in mind. The first one is that sometimes I want to tell you what to do: “I’m the Order Service, I need you to validate my order. Please validate my order. Thank you.” The other scenario is that sometimes I have a question to ask. Say I am the Shipping Service and I need to ship something, but I don’t actually know where this customer lives. I talk to another service and say, “Do you know where customer 526 lives?” You get the address and now you can continue processing. This is fine, but it creates those coupled patterns that I didn’t really like.

Event-driven turns this pattern on its head and makes every service more autonomous, which is exactly what we were after in the first place. When we first came up with microservices – let’s take a step five years back – we really wanted to make services independent, so teams have more autonomy. In this model, every service is responsible for broadcasting events. An event is anything that changes the state of the world: an order was created, a shipment was sent, a shipment has arrived, something was validated, a person moved to a new address. You keep broadcasting those changes to the state of the world.

Then, you have other services. Every service is also responsible for listening to events about how the world changed. When you get an event that tells you the world changed, you can do a bunch of stuff. You can discard it – “Yes, something happened, but I don’t care that it happened.” It can be, “An order was created. I have to handle it. I know what to do. It’s my responsibility to pick up orders and handle them. Nobody’s really telling me what to do; I noticed a change in the world.” A service can also create a local cache out of those events, and store basically a small copy of the state of the world for its own later use. Because this is a lot, I’m going to work through an example.
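
As a sketch of both halves – broadcasting facts and reacting to them – assuming Kafka with the kafkajs client; the topic names and event shapes are invented:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "demo", brokers: ["localhost:9092"] });

// Broadcasting: publish a fact about the world, addressed to no one in particular.
export async function publishOrderCreated(order: { id: string; customerId: string }) {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: "orders",
    messages: [{ key: order.id, value: JSON.stringify({ type: "OrderCreated", ...order }) }],
  });
  await producer.disconnect();
}

// Listening: react to changes in the world whenever this service is able to.
export async function runShippingService() {
  const consumer = kafka.consumer({ groupId: "shipping-service" });
  await consumer.connect();
  await consumer.subscribe({ topics: ["orders"], fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value?.toString() ?? "{}");
      if (event.type === "OrderCreated") {
        // handle it, discard it, or fold it into a local cache
      }
    },
  });
}
```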

Events Are Both Facts and Triggers

The thing I want you to keep in mind, to clarify the whole thing, is that you have those events saying, “The world changed.” The fact that the world changed is a fact – you can store it in a database – and it’s a trigger: it can cause you to do something.

Let’s talk about buying iPads. I do maybe slightly too much of that. We have an Order Service, and in order to buy an iPad, the Order Service has to call the Shipping Service and tell it, “There’s a new order. Please ship something.” The Shipping Service says, “Ok, a new order. I have to ship something to customer 526, but I don’t know where they live.” It calls the Customer Service, gets the customer address, and ships the order. This is reasonable; we’ve all done that. Of course, things can go wrong. The Shipping Service can go down, and now Orders has to figure out what to do with all those extra orders that we temporarily cannot ship. The Customer Service can go down, and we can suddenly, accidentally, stop shipping stuff. That would be a problem. We want to improve this.

The first order of improvement is to start using events for notification. When an order is created, the Order Service doesn’t call the Shipping Service. It updates a stream of events and says, “An order was created, and here’s another one.” Note that those orders get logged to a persistent stream whether the Shipping Service is up and listening or not. This is fantastic because when the Shipping Service has time, has energy, is available, it can go start picking up orders and shipping them. And what if it doesn’t? No worries. We keep a log of all the orders, maybe a log of orders and cancellations. Even if the Shipping Service is up but overloaded, or it’s up but the customer database is down, all those things don’t matter: orders will keep getting created, customers will get acknowledgments. We’ll ship maybe five hours later once we’ve dealt with our outage, but nobody ever notices a five-hour delay in shipping, unless it’s like that two-hour stuff that Whole Foods does. That would be a problem. Other than that, nobody really cares. This is a much more resilient architecture than we had before.

To improve it a bit more, we can start using the events as facts, as data that is shared between multiple services. Whenever the Customer Service sees that a customer changed their address, changed their phone number, changed their gender, this gets written to the database, but an event is also published, because the state of the world has changed and everyone needs to know that. The Shipping Service maybe doesn’t care whether or not I’ve changed my gender, but it definitely cares that I changed my address. The Shipping Service listens to those events and creates a small copy of the database – really small, just customer ID and shipping address. That’s usually enough.

This small copy of the database means that whenever it gets a request to ship something – which, by the way, happens far more often than people change their home address – it can basically answer the question from its own local database, which is very fast and very powerful, and it makes it completely independent of whether the Customer Service is up or not. Now we’ve created a much more fully decoupled architecture where people really can deploy things independently and don’t have to say, “Customer Service needs to do maintenance. Is it ok if we stop shipping stuff for a few hours?” This is much better.
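
A minimal sketch of that local projection: the Shipping Service folds customer events into a tiny map of just the field it cares about, and answering the shipping question becomes a local lookup. Deleting the map and replaying the stream from the beginning rebuilds it, which is the recovery trick she describes next. The event shapes are invented:

```typescript
// The Shipping Service's own tiny projection of customer data –
// just the field it needs – built from the shared event stream.
const addressByCustomer = new Map<string, string>();

type CustomerEvent =
  | { type: "AddressChanged"; customerId: string; address: string }
  | { type: "PhoneChanged"; customerId: string; phone: string };

export function applyCustomerEvent(event: CustomerEvent) {
  switch (event.type) {
    case "AddressChanged":
      addressByCustomer.set(event.customerId, event.address); // the one field we care about
      break;
    default:
      break; // phone or gender changes: discard, shipping does not care
  }
}

// Answering "where does customer 526 live?" is now a local lookup –
// no call to the Customer Service, no coupling to its uptime.
export function shippingAddressFor(customerId: string): string | undefined {
  return addressByCustomer.get(customerId);
}
```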

The one objection that always comes up is whether it’s really safe to have a DB for each microservice. Just walking the hallway, I heard a bunch of people say, “These days we’re just duplicating data everywhere, and I’m not very comfortable duplicating data everywhere.” I get that, but I want to point out a few things. The first is that it’s actually safer than you think, because all these databases are created from a shared stream of events about the state of the world, and the stream of events is persistent. If you know about event sourcing, you know exactly what I’m talking about. It means that all those different databases have a common source of truth. They’re not going to diverge into La La Land.

The other thing to keep in mind is that if I suspect that a database went wrong and something doesn’t look right, I can completely delete the database and recreate it from the history of events. It’s going to take some time – this is not something you can do in seconds – but you can try to parallelize and speed it up; you have the ability to do that. It’s much safer than you imagine.

The other thing is that you get those custom projections. Every service will have the data it needs – just the data it needs – in the exact format that it needs it, because it creates its own small database without ever bothering a DBA. No more going to the DBA with, “I want to add this field,” and hearing, “This may take too much space and I have to approve it with the other department as well.” It’s your own database; you store data the way you want. If any of you attended the data mesh presentation – very similar idea. You own your destiny and your data in your context of execution. Then, the obvious benefit is that you get reduced dependencies and you get low latency. Everyone loves lower latency.

Event Driven Microservices Are Stateful

The thing to note here is that when I talk about event-driven microservices, by and large they are stateful, because they have their own databases that they maintain. This is important because you can see how much faster, more independent, and more powerful stateful microservices are. It matters because in about 10 slides, we’re going to lose the state again to a new advancement, and we’re going to miss it when it’s gone.

The last thing I want to point out about event-driven microservices is that they enable the innovation that I really want to see. I had the Order Service and I had the Inventory Service; suppose that I want to add a Pricing Service to the mix. We already had something that does pricing, but it was very fixed – the Inventory Service just had a fixed price for everything – and I think I can do better.

I watch airlines very carefully and I know that you can subtly shift prices in response to supply and demand. Maybe I used to work at Uber and I actually know how to do it, but I don’t know how to convince other people in the company that it’s also a good idea to do it for my iPad shipping service and not just for Uber. I can go and create my own Pricing Service based on everything that I learned at my previous job, and plug it into the shared universal source of events for my company. Now I’m going to know about every change to the inventory, every order that happened, every order that maybe cannot be fulfilled because we are out of stock, and I can fine-tune my pricing algorithm.

Then, I can start publishing pricing to the stream of events. Anyone who wants to check how our revenue would look if we used my pricing versus the existing pricing can basically look at all those events and compare them. I can actually go to different departments, like the iPad services, and say, “Do you want to use my pricing versus the old pricing? Look how much better it’s going to be.” I don’t have to ask them to trust me. I actually have proof, or at least some semblance of proof – shadow proof – of how much better life would be. It’s a much healthier process. You can make more data-driven decisions that way, which is something that I’m a big fan of.

Schema

We still have all those services talking to each other, whether through request-response or, for some of them now, in an event-driven way. They’re still talking to each other, and this still has some issues. It can be either way – either you send those commands or you write events; it doesn’t matter how you do it. Until now, we’ve only talked about the medium, how you send events. We haven’t talked about the events themselves and what’s in them. The medium may be the message of this presentation, but it’s not the messages that are sent in your architecture.

This is what your messages will probably look like. It’s JSON, it has a bunch of fields. It has Social ID, Property ID, a bunch of stuff in there – lots and lots of metadata about everything that’s going on in the world. This is pretty good, but it has some problems. One of the big ones is that, as we said in the beginning, HTTP and JSON can be incredibly slow. I can’t say it’s easily solvable, but a very popular solution these days is to use HTTP/2 and gRPC. We’ve started using gRPC. If you use Go, it’s fairly easy, it’s fairly built in. It doesn’t solve all of the problems, but it does make things significantly faster, so it’s definitely worth exploring.

The other thing is, it doesn’t matter whether you use gRPC or JSON – some types of changes are still quite risky. I want to talk a tiny bit about those. I’m going to use event-driven as an example because I lived in this world for a long time, but you can have the exact same problems in a slightly lighter-weight fashion if you are using request-response. The problem is that when you communicate with messages, the messages have a schema. It has fields, and the fields have data types. If you ever did any validation or testing, you have a bunch of things that depend on the exact data types and the exact field names, and you may not even know about them. You make what you think is a fairly innocent change, and it’s very likely that it will immediately break everything. Now, you’d think it would have to be an incredibly silly change. How would anyone not notice that you changed something from a long to a string? Clearly, it was not a compatible change; it will break everything. It turns out that basically, it’s been done over and over.

There’s this one customer where I thought, “These guys are new. Things happen. They didn’t know any better.” Then, you go to a talk by Uber and discover that it also happened to Uber. I thought, “They are maybe a chaotic company. They move fast and break things. Surely, it will never happen to me.” Guess what? March, earlier this year. This is incredibly embarrassing; I’ll talk about it a bit more later. The key is that if you don’t have a good way to test that your schema changes are compatible, you are likely to cause a huge amount of damage. It’s worse if you have this event stream, because remember, we want to store it forever. If you wrote things that are incompatible, at any point in time a service of any version can read a data point of any version. A new service can go all the way back to the past, or an old service can just stick around and keep on reading new messages. You have basically no control over who is reading what, which is incredibly scary.

The way to really look at it is, whether you’re using gRPC or REST or you’re writing events, you have those contracts for what the communication looks like, what is in the message. Those contracts are APIs, and they have to be tested and they have to be validated. The way we do it in the event-driven world, if you use Kafka as your big message queue, is that we normally use schema registries. Confluent has one; a lot of other companies have one. The idea is that when you produce events, you register the schema of the event in a registry, and it automatically gets validated against all the existing schemas in the registry. If it turns out to break a compatibility rule, the producer gets an incompatible-schema error. You physically cannot produce an incompatible event, which is fantastic, because you just prevented a massive breakdown for everything downstream.
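
A sketch of that produce-time check, assuming Confluent’s Schema Registry with the @kafkajs/confluent-schema-registry client; the Avro schema here is invented and exact APIs vary between clients:

```typescript
import { SchemaRegistry, SchemaType } from "@kafkajs/confluent-schema-registry";

const registry = new SchemaRegistry({ host: "http://localhost:8081" });

const orderSchema = JSON.stringify({
  type: "record",
  name: "Order",
  fields: [
    { name: "id", type: "string" },
    { name: "amount", type: "long" },
  ],
});

export async function encodeOrder(order: { id: string; amount: number }) {
  // register() submits the schema to the registry, where it is checked
  // against the subject's compatibility rules; an incompatible change is
  // rejected here, before a single message reaches the stream.
  const { id } = await registry.register({ type: SchemaType.AVRO, schema: orderSchema });
  return registry.encode(id, order); // the wire format carries the schema id
}
```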

Obviously, waiting until something hits production for your messages to get rejected is a crappy developer experience. What we really want is to catch it in the development phase. If you use Schema Registry, there’s a Maven plugin. You run the Maven plugin to validate a schema: you give it your schema definition, it goes to the Schema Registry of your choice, and it just checks compatibility for you. Then, you can do it before you merge, you can do it on your test system, and so on.

I don’t exactly know how to do it with gRPC, but I know how not to do it, because we ended up with a system that I’m not a huge fan of. In gRPC, you create all those structs. Obviously, those structs are used to communicate between microservices, so they have to be shared. We created a repository with all those structs that everyone has a dependency on, so they can share them. Imagine that you go to that repository – which, remember, does not really have code, it only has structs, which means that it doesn’t really have tests – and you make a change. You changed it and you need it in your service, so you go to your service and bump up the version. Now, you depend on the new version of that struct. Everything works fantastically. Now, I want to make a change. I go in, I make a change in my own struct, but when I bump the version, my service also gets your changes, so now I have new versions of two different structs. My tests could still fail with the new version, even though I haven’t touched it. It works for you, but we both have a dependency on that struct, and only your tests ran after you made the change, not my tests.

It looks like if we have this repository of structs and someone makes a change, we actually have to bump up the version across the entire universe and run tests across the entire universe, which makes me think that I’m all the way back in my distributed monolith again, which is not where I wanted to be. We are still trying to make sense of it. I am a bigger fan of event-driven schemas – those are easy to validate. You have a Schema Registry and can validate changes within the Schema Registry itself. I haven’t found anything that allows me to automate compatibility validation of gRPC changes, which is what I really want in life.

Serverless

We solved some problems, we created some problems, but then we discovered that the really big problem is that running services is not very easy. Deploying services, monitoring them, making sure that you have enough capacity, making sure you scale – all this takes a lot of time and a lot of effort, and we started thinking that maybe we could do better with that as well, so we ended up with serverless. If you’ve lived under a rock for the last year or two: serverless is the incredibly popular functions-as-a-service model, like AWS Lambda. There was a talk earlier about similar things from Microsoft.

The idea is that you just write your function. The function has one main method, which is “handle event”: you get an event and you spit out a response – everything else is up to you – and you give it to your cloud provider. If someone sends an event to an endpoint, the cloud provider is responsible for spinning up a VM, starting your function, and running it for you. That’s already quite useful. It gets better. If nobody sends an event, this is not running and – note this – you are not paying for it, versus the old way, where you had the microservice and, whether or not it handled events, you still had to pay for the fact that it’s running on a machine. With this one, if there are no events, you don’t pay, which is a really big improvement, and everyone really likes that.
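
The whole programming model fits in one function. This sketch follows the shape of an AWS Lambda Node.js handler; the event fields and the validation rule are invented:

```typescript
// You get an event, you return a response; provisioning, scaling, and
// idle cost are the platform's problem.
export const handler = async (event: { orderId: string; items: string[] }) => {
  const valid = event.items.length > 0; // toy validation rule
  return {
    statusCode: valid ? 200 : 400,
    body: JSON.stringify({ orderId: event.orderId, valid }),
  };
};
```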

The other big thing is that sometimes I have a lot of events, and I don’t have to know about it in advance, I don’t have to plan, and I don’t have to do capacity planning, which is really cool. I just have to open my wallet. That’s the only condition. A cloud provider makes money when you handle events. It does not make money if you are rate limited and cannot handle events. They have every interest in making sure you can scale immediately, far and wide, to handle every single event that comes your way, and they mostly do a fantastic job of that. I’ll give them that. This is a very nice and easy way to scale simple microservices. It also matches some of the very simple event-driven patterns where you get an event, you do something to it, and you spit back a response.

The validation service that I mentioned earlier – where you get orders and validate them – is like that. You get an order, you do some checks, and you produce an output. This is really cool.

Up Next: Stateful Serverless

There is only one thing missing in my picture, which is that we lost the state along the way. I mentioned earlier that having state in my microservice is incredibly powerful and that we’d miss it when it’s gone. It’s gone, and I miss it. Why do I miss state so much? I miss having state because sometimes my rules are dynamic. I cannot hard-code them; I have to look them up somewhere. Sometimes, my events contain some of the data that I need – an ID – but not the rest of the data that I need, so I have to look it up somewhere. Sometimes, I have to join multiple events; Netflix had a good talk about that earlier in the day. Sometimes, I just want to aggregate: number of events per second, number of orders per second, dollars per hour. All these things are important, so I need state.

The way I currently do state is something like this. Once my function is running, I call the database and do a select; I call the database and do an insert. This database is likely S3 or DynamoDB; both are quite popular. Note where I’m paying: I’m paying for running my function – for every minute that my function is up, for the memory it’s using, for the CPU it’s using – and I’m also paying for memory, CPU, and IOPS on the database side. This is significantly expensive when you try doing it at scale. Can we do better? A bit. We can do a bit better.

We know our logic, and we can say, “It’s very likely that if someone checks the status of an order, there will be other people who want to check the status of their orders.” I know that when the function is running, I have about a five-minute period where, if I get more events, this function instance will be reused, so I can actually create a small cache of all the hot orders: do one database call, get all of them, cache them, and maybe save on some database calls in the future. This is what some people will call an ugly hack and what some finance departments will think is genius, depending on exactly where you work, but you definitely don’t want to do too much of that. Ideally, this would all be hidden from you completely.
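
A sketch of that warm-instance cache: anything declared outside the handler survives between invocations while the instance is reused, so hot orders skip the paid database call. fetchOrderFromDb is a hypothetical stand-in for a DynamoDB or S3 read:

```typescript
// The "ugly hack / finance genius" trick: module-level state survives
// while the function instance stays warm.
const orderCache = new Map<string, { status: string; cachedAt: number }>();

async function fetchOrderFromDb(orderId: string): Promise<{ status: string }> {
  return { status: `status-of-${orderId}` }; // stand-in for the real read
}

export const handler = async (event: { orderId: string }) => {
  const cached = orderCache.get(event.orderId);
  if (cached && Date.now() - cached.cachedAt < 60_000) {
    return cached; // warm instance, hot order: no database call, no extra cost
  }
  const fresh = await fetchOrderFromDb(event.orderId);
  const entry = { ...fresh, cachedAt: Date.now() };
  orderCache.set(event.orderId, entry);
  return entry;
};
```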

The thing that I really want back is this shared source of events that I could always create my own personal data store from. I really don’t have it, and I miss it. If I had it, I could do things like: Create Order just writes events to the stream of events; Validate Orders picks them up and populates a local database; maybe it also reads rules and inventory from another database into the local database, validates stuff locally, and posts the status. The Status Service also has its own small database that it can query locally. You could actually do really cool things if you had local state in the cloud that is continuously updated by new events, and maybe even continuously synced with a longer-term database. This would be amazing.
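
No current platform offers exactly this, but the wished-for programming model might look something like the following hypothetical sketch: the platform keeps a per-key state object, feeds each event through the handler, and syncs the result to durable storage behind the scenes:

```typescript
// Hypothetical API shape for a stateful serverless function: the
// platform supplies the previous state alongside each event and
// persists whatever the handler returns.
type OrderState = { status: "created" | "validated" | "shipped" };
type OrderEvent = { type: "OrderCreated" | "OrderValidated" | "OrderShipped" };

export function handleWithState(
  state: OrderState | undefined,
  event: OrderEvent
): OrderState {
  switch (event.type) {
    case "OrderCreated":
      return { status: "created" };
    case "OrderValidated":
      // The previous state is right here – no database call needed to
      // check that the order exists before validating it.
      return state ? { status: "validated" } : { status: "created" };
    case "OrderShipped":
      return { status: "shipped" };
  }
}
```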

Rumor has it that some cloud services are doing it. Maybe some of you attended the talk about Durable Functions earlier in the day; that does some of what we’re really looking for. It doesn’t have great integration with databases, though, which brings us to the things that I would really like to see in a serverless world. One of them is the whole idea of durable functions. Microsoft is doing it, apparently, in Azure. People told me that AWS has something, but to be honest, I searched and I haven’t managed to find anything. As long as it’s not the exact same thing in every cloud, it will be really hard, I think, to get a lot of traction. People seem to be afraid to implement something that is so deeply coupled with architectures that are only runnable in Azure, even if you don’t have any immediate plan to run multi-cloud – which, by the way, after two years of experience: if you don’t have to, don’t. You still don’t want to be that tied to something that relatively few people are doing, when you don’t know if it will catch on or not.

I really want to see this idea of functions with state catch on. I want to see them much better integrated with databases, because as we said, data is a type of event and an event is a type of data. You can see flows going back and forth. I think that Amazon actually has something that delivers events from DynamoDB as function events: you can trigger functions based on things that happened in Dynamo, which is pretty cool. Another thing that we really want is a unified view of current state: what is happening in the local data store of every one of my functions and every one of my microservices? Being able to pull together unified reports becomes much more tractable if you have a shared source of events.

I hope I gave you a good overview and some ideas of maybe what one of you could implement in the future.

Questions and Answers

Participant 1: I have a question about the life cycle of schemas. Do you ever deprecate schemas once they’re registered?

Shapira: I can only deprecate schemas if I know for sure that, A, nobody is using them, and B, they are not stored anywhere. If all of my streams have a very short retention policy – say, 7 days or 30 days – and if I know which of my microservices is using which schema, which I could know because they’re all validating against the registry, so I can try to track who is talking to the registry and which questions it is asking, then I could deprecate schemas. Currently, I’m not doing it. Schemas are cheap in my mind. Again, consumers that want to stop supporting them can stop supporting them. Producers that want to stop producing old schemas can stop. If you maintain compatibility, all schemas are quite cheap. I’m not too worried, but if I really had to, there are ways I could do it.

Participant 2: You mentioned that we should have separate databases for microservices. In my workplace, the problem was that if you separate out the databases, you have to pay more, and they were a little bit hesitant to do that. We sort of compromised by having separate schemas instead of separate databases. What do you think about that?

Shapira: This works to an extent, I think. Even if you separate them into separate schemas, your database is still a monolith. Everyone is talking to the same database. Everyone is limited by how much your database can scale. If you chose a relational database, then everyone has to be relational. If you chose Cassandra, everyone has to do Cassandra. To an extent it’s possible, but I think we’re heading towards a world where engineering teams have good reasons for choosing the databases and the data stores that they choose, and the cost of running additional databases is usually lower than the cost of forcing every team to be tied to one big database. Also, scaling one big database is not as free as one may want to think.

Participant 3: In the event-driven architecture, how do you solve the duplicate-message problem? Because of retries, you will get the same message created; for a single order, two shipping event messages are created. How do we solve that?

Shapira: There are two ways to solve it. You either have someone detect duplicates and clean them up for you, which is expensive, or you make sure you design your entire event architecture so events are idempotent. Idempotent events mean that if you get the same event twice, nothing bad happens. This means that the event cannot be, “Ship an iPad.” It can be, “Ship a single iPad on this date and no more to this person.” It cannot be, “Add $50 to an account.” It can be, “The account balance in this account is now $350.” Then, if you get two notifications like that, it’s fine; they basically act exactly as if they were one. So either you figure out a good way to do it – and a lot of places figure out good ways to do it, because it’s so much more efficient – or you have a service that tries to figure out what’s a duplicate and cleans it up.
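
A sketch of both defenses: idempotent-by-design events that state an absolute fact, plus a dedup guard on event IDs; the event shape is invented:

```typescript
// Two defenses against duplicate delivery, in one small handler.
type BalanceSet = { eventId: string; accountId: string; newBalance: number };

const applied = new Set<string>();
const balances = new Map<string, number>();

export function apply(event: BalanceSet) {
  // Dedup guard: protects side effects that are not naturally idempotent,
  // like "ship an iPad".
  if (applied.has(event.eventId)) return;
  applied.add(event.eventId);

  // Idempotent by design: "the balance is now $350", not "add $50".
  // Applying the same fact twice leaves the world unchanged.
  balances.set(event.accountId, event.newBalance);
}
```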

See more presentations with transcripts



Presentation: Mistakes and Discoveries While Cultivating Ownership

MMS Founder
MMS Aaron Blohowiak

Article originally posted on InfoQ. Visit InfoQ

Transcript

Aaron Blohowiak: I’m Aaron Blohowiak. I am @aaronblohowiak on Twitter. There is actually another Aaron Blohowiak on Twitter; I do not tweet about sports stuff, so if you’re looking, mine is mostly hot takes on distributed systems things. I’m also aaronb@netflix.com; I share my email so you all will use it, hopefully, if you have anything you want to discuss later.

On your first day at Netflix, you can push code into prod, you can spend up to $25 million without manager approval, you can sign contracts, and all we ask is that you use good judgment. On your first day, are you ready to do any of that? No. So how do we get from here to there? How do we get from your first day, where you don’t really know what’s going on, to the point where you can flex the freedom that we grant you upfront? We do that through our culture. We avoid rules. Rules are something that most organizations tend to accrete over time. As bad things happen, they put in rules to never let those things happen again. As a result, they end up artificially limiting the space of options for people. People feel constrained, they feel like they’re in a straitjacket. They don’t have the freedom to do what they know is the right thing. Instead we say we don’t need rules. What we need are smart people who know what’s going on in the world, who have the right context, so they can use really good judgment. To that end, we generally prefer people over process.

The world is ever-changing. Your process is a lagging indicator: it reflects the way the world was five years ago when the process was put in place, not the way the world will be in five years. To that end, we try to have context, not control. Again, you can’t make really good decisions if you don’t understand the larger world, if you don’t understand how your decisions here impact somebody else. We have this thing we talk about a lot called freedom and responsibility. Part of the key to having good context is to avoid the case where my freedom becomes your responsibility. Gathering context, understanding the world, is super important. If you try to have control as a leader – be that as a manager, senior software engineer, team lead, or director – you will not have nearly as much information as the people on the ground actually doing the work. So a much better trade-off is to give the right information to the people who are actually there to make the decisions day-to-day.

Finally, I mentioned freedom and responsibility briefly. This isn’t just a slogan or a tagline. It runs deep; it’s how we structure everything. Freedom means we default to opening up and retaining optionality – having more potential things that you could do, or would consider doing. Responsibility means, if you’re going to have all of this freedom, then we’re going to hold you to account for the results of your actions, for the quality of your decision-making. It is important to separate the quality of the decision-making from the result; we are all operating in an uncertain world. Good decision-making with that responsibility over time is really what we’re looking for.

From the CEO down the hierarchy, that responsibility is delegated. Our Chairman/CEO is ultimately responsible for everything that every Netflix employee does, and that responsibility stems from the shareholders, who effectively own the company. That responsibility gets divvied up and delegated down the hierarchy, from the C-suite to the VPs, to the directors, to the managers, to the senior software engineers, team leads, senior ICs, and teams. That’s really how you scale. If our CEO tried to make all of the decisions, that certainly wouldn’t work. I spent the better part of a decade working at startups before working at Netflix, and I can tell you I’ve seen CEOs who try to have all of the decision-making fall on their head, and none of those startups are currently in business.

Similarly, that vision is refined. As an ongoing enterprise, we have a pretty simple business to describe: we offer a service for sale. People give us money, and we provide the service. From that large vision of streaming movies and videos online, there are lots of details to figure out. The potential space of things that we could do as a business is vast. The primary job the leadership hierarchy has is to narrow the scope of the problems that we’re solving and to narrow the scope of the solutions that we’re willing to consider. This happens at each layer. As you get closer and closer to the work itself, the vision should be more refined, but along the way, we still want to retain maximum flexibility so that the people who are closest to the work can make the best decisions they can.

We really pride ourselves on how few decisions management is involved in. I’ve heard a rumor from a secondary or tertiary source that our CEO has bragged about having made only a couple of decisions in the last year, and that all comes down to setting the right context and having the right people.

The Expectation

This is the expectation that we have for every employee at Netflix: responsibility and follow-through. If you say you are going to do something, I am not going to check up on you, ever. We hire fully formed adults. We expect that when you say it, it’s going to happen, and if it’s not going to happen, you’re going to proactively reach out and manage that graceful disappointment when it’s necessary. You shouldn’t be afraid to disappoint people; you should just do it very well.

We also want you to be not just responsible, but proactive. You need to be thinking two steps ahead. Netflix employees aren’t only the ones who will stop by to pick up the trash; they might also consider moving the trash can closer to where it needs to be. That proactivity and anticipation is really what lets you, as a leader, scale yourself – having more heads truly brought to bear on the problem.

Finally, in addition to thinking ahead in the short term, we want all of your decisions to be based on the long term. We hope to be a very long-lasting business, a going concern for a long time, so we want to set ourselves up for future success. This doesn’t mean giving in to YAGNI-type issues; if you were in Justin Ryan’s talk previously, he mentioned we plan for two to four x scale. Some places plan for 10x scale. We do want to plan for the future. We will be revising our decisions, but overall, we do want to take that long-term view.

Finally, we expect every employee to be defining how things should be. I worked at one place a long time ago where there was an employee responsible for putting a report together. We were sitting in the cafeteria, and she was complaining to me, “I spend about a third of my time putting this report together, and I know that the doctors I give it to barely look at it.” So why was she doing the report? She didn’t feel empowered to say, “Is this really the best use of my time? Would a two-page summary be good enough?” When we have that freedom and responsibility, it creates that sense of, “Yes, I should be helping shape how things should work, not just be responsible for executing my particular role.” We expect everyone to have that kind of owner mentality, that kind of architect mentality.

This expectation really means that every employee at Netflix is expected to have the highest level of ownership. This is pretty Netflix-y – we tend to hire very senior people, and we do give them an incredible amount of freedom, so we do require that very high level of ownership, that vision level of ownership, of understanding how things should be. Not every org expects that of every employee, and that’s totally fine, but through this talk, I hope to give you a framework for understanding the different levels of ownership that will help you figure out what the right level of ownership is for the right person at the right time.

What Is Ownership

What is ownership? It’s really a collection of beliefs, attitudes, and behaviors. Do people believe that they own the destiny of their particular responsibility? Do they have the mindset that they have the freedom to do what’s right? Finally, are they proactive in understanding what needs to happen next, not only immediately but in the long run?

It’s important to note that ownership is not a yes-or-no thing. It’s not a single bit that can flip – although Shane did talk about fractional bits; that’s really confusing. No, it’s not binary, it’s a spectrum. What are the different levels of ownership as we’ve defined them? Level zero of ownership – I’m not a Lua programmer, so I start counting at zero – is Demonstration, which is ultimately no ownership. I’m going to do the thing, or the senior person or the lead in this scenario is going to do the thing, and the person who is having the responsibility ultimately delegated to them is responsible for asking really great questions. That’s Demonstration.

The next level is Oversight. You’re going to do it, but I or somebody else will preapprove it, look at it first, and you should expect a lot of revisions. The next level is Observation. Go ahead, you do it, and I’ll look at it after it’s already done. I may give you some high-level guidance.

For these three levels, there’s a fundamental aspect of approval coming from the person delegating the responsibility, and we try to spend as little time as possible at these levels at Netflix, as we want people to ramp up quickly. On your first day, you probably will be at Demonstration, just learning how things are done here. While we don’t have rules, we do have norms that help things go smoothly. Moving beyond that approval realm of ownership, we start to get into independent Execution. We can establish, “This is the place you want to go, and I know you’ll do a great job.” There will be some random check-ins just so I can keep abreast of what’s happening and understand how you’re doing your job, but ultimately, you’re kind of heads down or out in the world, and we’ll check in during our one-on-ones.

Finally, there’s that last level, the Vision level, where you’re shaping the future, and you understand your area of responsibility and all of the stakeholder domains – how your area of responsibility interacts with them and with the larger business context. That’s the vision level of ownership that we expect every individual contributor at Netflix to embody. If your organization puts that expectation at a different level, that’s fine. This is just a framework for talking about it.

At the Demonstration level, the belief is that you are in a student role. You do not yet know what is going on, and it’s important to know when you don’t know what’s going on. Your behavior here should really be asking great questions – not just the whats but the whys. Why is it done that way? How did you come to this standard? What’s the purpose of using master as our integration branch? All of these sorts of things, you want to learn as fast as possible. With the ramp-up framework that we have on our team, you have an onboarding buddy who is basically there to answer all of your questions, and the quicker you can extract the information from their brain, the better you’re doing as you onboard.

The next level is Oversight. Again, we try to spend very little time here. The one thing we do have an ongoing expectation of oversight for is changes to the code. Pull requests are fantastic for sharing knowledge throughout the team, and in some ways that is an expression of ownership. Your role of overseer or overseen will change depending on whether you authored the PR or you're reviewing it. This also goes to show that these ownership dynamics change over time, within relationships and within particular products. The attitude here, if you're receiving the oversight, is to be grateful for the additional context being provided to you; if you're providing the oversight, the attitude needs to be one of empathy, understanding that if I'm going to put 50 comments in your pull request, that's tough, so you have to figure out how to message that in an empathetic way.

The Observation level: here, we're really in a reactive mode where you're operating mostly independently, but the person responsible for delegating to you observes and gives you high-level realignment, long-term adjustments, and course corrections rather than detailed feedback at all times. In many places, this is what they consider a mid-level engineer to be: low impact on the immediate manager or leader, whatever form that may take. As a leader, it can be really tempting to get stuck at this level, because you still know what's going on, and you can feel like people are able to pursue mastery, autonomy, and purpose. But the fact that you're ever-present and watching all of the work they're doing takes a whole bunch of your time and ultimately deprives them of understanding what it means to be truly independent.

Moving on to Execution: this is where the leader sets the direction and checks in infrequently to see how things are going. This is where a lot of people top out when they describe what it means to be a senior engineer on a lot of teams – very little management oversight – and at the Execution level you should be able to train somebody else to do the same tasks. Fundamentally though, the Vision level is the highest level you want to get to, where you're looking further out. I think I might have belabored that point, so I will just proceed.

Here we have the hierarchy of the levels of ownership, and there is an analogy to something you might be familiar with: Maslow's hierarchy of needs. Starting about midway through, at the Observation level, you have community interaction where someone is looking at the work you're doing. You have an understanding that you belong; you have that approval, which can feel good. Ultimately, you want to go further than that and get into the notion of esteem that Maslow had, where you earn the confidence and respect of the people around you. It's very similar to Execution: if I say, "Go there," and I know you can go there without my worrying about it or checking in, that means you really have my confidence. Maslow initially taught that the peak of the hierarchy of needs is self-actualization: determining who I am going to be and how I fit into the world, rather than taking on other people's roles and expectations. That's very similar to the Vision level of ownership, where you are imagining for yourself the way things should be.

Fun historical fact: the pyramid representation was never presented during Maslow's lifetime, and in his writing he talked about these different levels without ever getting rid of the first level. You always need to eat; you always need your basic needs met. It's just that as you develop personally, you spend a greater portion of your time working at the higher levels of personal development. It's very similar with ownership. Even when you're very tenured at the company, there will be some things you'll still want somebody else to preapprove before you send them out. It may be that you're reaching out to peers to take a second look at something. When we publish things on the tech blog, of course, there are people with expertise that we're going to rely on. We never totally abandon the lower levels of ownership throughout our careers, no matter how advanced we are.

Does this sound good? Hopefully you'll want to build a team of high-ownership individuals, because that's where we want to get to. This talk is about how to do that, how to cultivate ownership. But really, let's talk about how not to cultivate ownership, and I will be leading here by counterexample. Many of the names that follow have been changed to protect the innocent, and they really are the innocent. As my brother used to say, the response you get is the message you sent.

As a leader, if I’m not seeing the correct level of ownership from the people on my team, it’s my failure to set the correct context. I didn’t give them the right information they needed to make the best decisions possible.

Archie’s Communications

For the first story, we're going to talk about Archie's communications. Archie was very experienced when he joined Netflix, like many people are when they join. Unlike many people who join Netflix, he expected that the way Netflix did things was going to be very similar to the way he had done things previously. One of the things our team does is perform regional evacuations. Quick two-minute spiel: we run in three different AWS regions. We can run in any two of the three for high-availability reasons, and so we regularly evacuate regions. As a result of doing that, we send out a communication to tell people how it went. When Archie was onboarding, I said, "Go ahead. Here's the folder in Google Drive. Look at our best communications, use the template, please match the tone, format, and style, and run it by me before you send it out." Archie, being very experienced before joining, wanted to show his proactive spirit, so he went ahead and put an email together and sent it out without asking anyone to look it over first. That email had spelling, grammatical, and factual errors, and unbeknownst to him, when he copied the email address from the template, that distribution list included all of engineering.

I had tried to set us at that Oversight level, and he wanted to jump straight to the Execution level. We had a chat, and I told him, "There is a reason why I asked us to be at the Oversight level here," and I realized that I had failed to provide that context, so he didn't have all the information he needed. If it was just an email going out to our team, maybe up our management chain, it might not have been a big deal.

To visualize this: I thought we were somewhere between Demonstration and Oversight; Archie thought we were at Execution. That mismatch is where the pain crept in. The mistake here is that we had different ideas about which level we should be at, and the discovery is that you really need to be explicit about what level you are at and what level you should be at, and then explain why.

Lily’s Communication

Now we’ll talk about Lily. Lily is amazing. She joins after Archie, and same spiel: “Here’s the folder, please match the tone, format, and style. Let me look it over before you send it out.” Lily’s great, her email was perfect, no errors at all, and she said, “Does this look good?” I said, “Looks good to me.” Now the problem was, the next time she sent the email, she also asked me to look it over, and I was, “Hmm.” I wanted to encourage her to start taking that additional ownership, so I started replying with just slack emojis. I thought, this is going to encourage her to go ahead, “You got this,” but instead of encouraging her, it actually reaffirmed I was giving my rubber stamp and by giving that stamp, I was double underscoring that we were at the Oversight level. We had a little chat, and I said, “I do not need or want to look over your work. You’re great. Go for it.” She said, “Cool,” and never again has she asked me to look over her emails.

I thought that we’d moved over to Observation. I’m still on that engineering email list, but she thought that we were at Oversight, because that’s what I had told her. The mistake here was that we had different ideas about which level we were at, and the discovery was that we want to be really explicit when that level changes. Lily has a very ingrained sense of propriety, and since I had made an explicit request to look over this email, she was going to go ahead and follow through with that from now until the end of her employment.

The next thing is that if someone is stuck in approval-seeking behavior, they might just be waiting for permission. She didn't love getting my approval. She didn't need it or really care about it. She was just trying to follow the context I had set up for her. The final thing is that if I had established the goal that everyone should ultimately get to Vision with each of the tasks they own, then she would've had upfront permission to escalate that level of ownership on her own terms when she was ready.

Lily’s Region Squeeze

Lily then took over a relatively recent activity of our team's: we load test entire regions. With our traffic-steering mechanism, we can send everyone to the same region and see how the system does as traffic scales up. What's interesting is that she was taking over a relatively new thing that somebody else had created. The first time Lily ran it, she appropriately went around asking lots of questions. She created memos with all of her plans and ran them by her peers, who said, "Yes, looks good to me." That first squeeze – total success. Afterwards I said, "Lily, you did this at that Oversight/Observation level; now this is totally yours. Go nuts with it. I expect you to be at the Vision level." Since then, for the next squeeze, she created stakeholder listening groups where she gathered people's input. She created a rubric by which we will evaluate all the different feature requests and set up a roadmap for the entire project for the next year, all without me saying anything other than, "This is yours now." Much success.

Here I thought that we were at Vision, and she thought that we were at Vision. Sometimes you get it right. Again, that’s really the importance of being explicit when that level changes.

Fred’s Performance Tooling

Getting a bit more serious, we have Fred's performance tooling. Fred is amazing. Fred was a little less experienced when he joined our team but ramped up incredibly quickly. He got on-call faster than anyone had before. He really had that depth of curiosity and drive that inspired us and made me confident that I had made a great hire.

There was a new greenfield project to detect performance regressions in systems in prod by looking at the CPU-to-RPS ratio. The details don't matter, but they're interesting. Due to his track record of success and incredible trajectory, I said, "All right, you've got this. Let's just go ahead." We had agreed that we'd operate at the Execution level. He independently knew that he had to develop this performance regression tooling, and he knew the overall timeline for the larger project at play, so that seemed fine.

After a couple of check-ins, there were some yellow flags around decisions he'd made, some judgment calls about how he was prioritizing his work. As someone who prides himself on not being a micromanager, I waited to see if this was a pattern or a one-off, because I don't want to nitpick someone's every decision; I just want to understand the long-term trends in their judgment. After about a month, I started to get worried that there were no big milestones. I decided it was time to engage and dig a bit deeper. What I learned was that not only was he not close to having a prototype ready, he was still deep in the weeds investigating different algorithms for comparing previous performance data to current. That was wildly out of step with our expectation: get something done end-to-end with a not-so-good algorithm that can detect huge performance variations, then refine it over time. That was the implicit value our team had – get something working end-to-end and then refine it and refine it, as opposed to truly understanding the best algorithm first and then building the UI to support it. A minimal sketch of what that crude end-to-end version might look like follows.
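The talk doesn't spell out the actual tooling, so this is a hedged illustration only: a minimal Python sketch of the "not-so-good algorithm" approach, flagging a regression when the median CPU-to-RPS ratio drifts above a baseline. Every function name, threshold, and number here is an invented assumption, not Netflix's implementation.

```python
# A minimal sketch of the crude end-to-end approach, NOT Netflix's actual
# tooling: all names, numbers, and thresholds are illustrative assumptions.
# The idea: CPU spent per request (CPU:RPS) should stay roughly stable;
# a sustained jump suggests a performance regression.
from statistics import median

def cpu_per_rps(cpu_utilization: float, requests_per_second: float) -> float:
    """CPU consumed per unit of traffic – the 'cost' of one request."""
    if requests_per_second <= 0:
        raise ValueError("need positive traffic to compute the ratio")
    return cpu_utilization / requests_per_second

def is_regression(baseline, current, threshold: float = 1.2) -> bool:
    """Flag a regression when the median CPU:RPS ratio grows more than
    20% (threshold=1.2) over the baseline window. Deliberately crude:
    ship this end-to-end first, refine the statistics later."""
    return median(current) > threshold * median(baseline)

# Last week's samples vs. today's, as (cpu%, requests/sec) pairs.
baseline = [cpu_per_rps(c, r) for c, r in [(40.0, 1000), (42.0, 1100), (39.0, 980)]]
current = [cpu_per_rps(c, r) for c, r in [(55.0, 1000), (58.0, 1050), (54.0, 990)]]
print(is_regression(baseline, current))  # True – CPU per request jumped ~40%
```

The point of a sketch like this is exactly the team value described above: it works end-to-end on day one, and the comparison logic can be replaced with better statistics later.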

There are a few different ways we can look at this. We could very well say, "Bad performance – we expect something to be done, it's not done, that's on you." We could also look at it as poor judgment: "The work you were doing on algorithm evaluation was very good, but the judgment that this was the thing to do, versus getting something working end-to-end, was poor. You should know, as a mature member of our staff: get something done, make it work, then make it right." We could take that approach when evaluating what happened. We could also look at it as bad onboarding. Maybe his onboarding buddy was responsible for setting that context: "This is how we do things. Greenfield project – here you go." There are many places we could point the finger of blame – and my dad has this thing where, when you point your finger, there are three that point right back at you. We could also look at this as an incorrect level of ownership. That's the one I believe, because I don't believe in blaming the victim.

I thought that we should be at Execution. He thought that we should be at Execution. But we should have been at Observation. With a bit closer monitoring of what was going on, I could've intervened much earlier, we wouldn't have had to claw back, he wouldn't feel like he'd spent the past month failing at the wrong thing, and we would've quickly escalated up to Execution once we knew he was headed in the right direction.

The mistake here is that we agreed on what level we were at, but we were both wrong. Finally, the discovery is that even amazing people can't just jump into the deep end if they don't understand the current context and the implicit values – and worse, their onboarding buddy or the leadership present doesn't even know what they've forgotten to tell them. That's the curse of knowledge: "I don't know how much I know about Netflix, because I've been in it for so long." It's like the fish in water kind of thing.

Akwesi’s Meetings

Now I’ll talk about Akwesi. Akwesi is amazing, senior contributor to Netflix, been with us for a little while, had been executing fully, independently, had been operating at the Vision level, helping us understand what our team is and what it should be. He proposed a giant project that would affect how hundreds of teams operate in production. He created the documents. He led it through our decision-making process. Awesome, beautiful work. The giant project, through that process, it got approved. Everything was going great. As part of the first steps of the project’s execution, Akwesi was going to meet with a bunch of the different teams to understand more deeply our stakeholder’s concerns, brainstorming some UI decisions, those kinds of things, and I was super-excited. I was, “You know what, Akwesi? Can you add me as optional to those meetings?” This is what he said: Why? Don’t you trust me?

For those of you who may not be well versed in management theory, there is no better signal that you're a crap manager than having someone ask you this question. I wish I could say that I understood the ramifications of my actions. I really wish I could say I was my best Aaron that day. I wasn't. I was so excited about what was going on that I wanted to be part of the party. From Akwesi's perspective, he was executing at the Vision level, kicking butt, and here I was saying, "I want to be in the room while you do this, and I'm going to watch you as you work." How could he take that step down from Vision to Observation – or, in his mind, Oversight – as anything but a signal that I didn't trust him? I was taking something away from him. I was ratcheting down my expectations.

On my side, Akwesi had been kicking butt, performing at a very high level, influencing not only our team but all of production engineering beyond it. I wanted to see what he was doing so I could be justified and well versed when making the case that he should be compensated at a whole new level – and this is the question he asks me. It was super tough. I tried to explain the why, but if you've ever tried to put toothpaste back in the tube, you know emotions are hard. It was a tough lesson to learn.

I thought, for my reasons as a manager, as a leader, that we needed to go to the Observation level so I could justify the compensation increase commensurate with the new role I had every expectation he was going to fulfill. From his side, he thought he should be at Vision, because that's where he had been executing, that's where he had been kicking butt. Those mismatches are where the pain comes in.

The mistake here was that we had different ideas about which level we should be at, and the discovery was that ownership evolves over time – not just over the course of relationships, as with the other people I talked about, but even over the course of a project itself, and not only upwards but up and down as things come and go: sometimes for reasons related to the project, sometimes for reasons related to being a manager, sometimes because my boss's boss is trying to micromanage something and I need to provide you cover. If you don't make that clear to people, they don't have the right context. All they see is you ratcheting up the demand for approval. You really have to have empathy for the emotional implications of changes to the levels of ownership.

Bringing It All Together

Starting to bring it all together now: we had these five levels of ownership – watch me; I want to preapprove; I'll watch you; go ahead and do it; tell me what we need to do. This is not a strict hierarchy; we have all of these different things happening at different levels over time. If I had made this clearer, perhaps with Akwesi, the fact that we would have a little bit of green sneaking in would've been ok.

There are some broad classes of mistakes, mostly around level mismatches. We can have different ideas about which level we are at. We can have different ideas about which level we should be at. We can agree on the level we're at, but both be wrong. The big discovery is that most failures of ownership are failures to set the right context. Given the right information, people will make the right decisions. Along those lines, sometimes a piece of that context is what your expected level is, that the goal is to get to Vision, and that the level will change over time.

Don’t jump in the deep end, even for amazing people on your team. Sometimes we’ve heard that phrase, “They weren’t set up for success.” That can mean that we didn’t provide them adequate scaffolding to get from where they needed to go from a zone of incompetence into a zone of mastery. Helping provide that ladder, that escalation of ownership, sets people up for success over time. Finally, as leaders we all have to have empathy. I’ve found it to be particularly hard to have empathy two times: when I’m really excited and I just think, “Everyone’s going to think this great,” and I forget that the way that they’re going to take this might be the way I intended. The second time is if I ever find myself feeling righteous, but that’s a different talk for another day.

Just to drive this point home: given the right context and the freedom to do what you think is right, people will make great decisions. One of the things our team is responsible for is managing all of the reservations and capacity for all of Netflix on Amazon, which is quite a big responsibility. The way we do this is we let every engineer do what they think is best in spinning up instances, autoscaling, and choosing instance migrations. Instead of trying to prevent errors in those spaces, we pursue rapid recovery. If something goes a little hinky – if we see a dramatic increase in your scale with low utilization – we'll come have a chat and say, "What's up?" When we talk to some of our peers and they ask, "How do you achieve cloud costs that grow sublinearly with your business without having cost controls?" we say, "People want to do the right thing, so we let them." Sometimes they don't even know they've let something run too long and too cold. Given that context, they will prioritize it. If they've got a big push, then maybe they'll delay, and that's good for the business. Maybe they'll say, "Whoops, I'll actually turn on a TTL for my logs stored in S3 and save a bunch of money every month" – a sketch of what that looks like follows.
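The speaker doesn't show the mechanics, but that S3 log "TTL" is just a lifecycle expiration rule. As a hedged sketch using boto3's real lifecycle API – the bucket name, prefix, and 30-day window are all assumptions for illustration:

```python
# Hypothetical example of the S3 log "TTL": a lifecycle rule that expires
# (deletes) log objects after 30 days. Bucket and prefix are made up.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-service-logs",              # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-logs",
                "Filter": {"Prefix": "logs/"},  # only applies to log objects
                "Status": "Enabled",
                "Expiration": {"Days": 30},     # objects deleted after 30 days
            }
        ]
    },
)
```

Once a rule like this is in place, S3 deletes the matching objects automatically – exactly the kind of low-effort cleanup an engineer can turn on once they have the cost context.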

Having the right context lets people make great decisions. If you instead lean on control, people think, "Not my job," and they won't even be proactive. Having freedom actually creates in people a sense of responsibility, so if you take nothing else from this talk, this should be the main takeaway.

New Hire Onboarding

As a way to embody this on the team – to have a culture of ownership, not just a one-off in each relationship – we bring this in at the very beginning of the relationship with each person on the team. The first thing I do in our first one-on-one is say, "Thank you. How's it going?" The second thing I do is talk about the levels of ownership and lay out our expectations. For the first three months, you will primarily be in Demonstration, Oversight, and Observation mode. We expect you, by the end of those three months, to have asked a heck of a lot of questions, to have learned the reasons why we do the things we do, and to be part of our on-call rotation. That's the goal for three months. By six months, we expect you to be handling one of our quarterly goals – something the team beyond yourself has previously established as the right thing to do – but executing on it independently. So we expect you to be at the Execution level before the end of your sixth month on the team. Finally, by the end of your first year, you need to be responsible for helping us set what the team should be, helping shape the vision of the team. That Vision level is our expectation of every IC at Netflix in almost every role I'm aware of, and that may be unique to us, but setting a timeline for what kind of ownership you expect to see sets people up for success.

Flash Back

A quick flashback – remember Archie, the one who sent the email to all of engineering? That was totally on me. I hadn't set the context that the distribution list included all of engineering. Maybe if I had, Archie would've slowed down. But those factual errors were actually a sign of things to come. Archie was so experienced and so used to operating at a very high level that he just wanted to get back to doing that kind of work. That's not quite the way things work for us. We expect you to come in and work your way up from the brilliant basics, ratcheting up the level of ownership over time. He kept trying to jump forward and would rush through the work he viewed as pedestrian, so while there was a lot of work product, none of it was really usable. We had to ratchet down the level of ownership and make it really clear: "Look, we want to have oversight over all of the work that you do." Either he didn't hear it or he didn't want to hear it, but the work he produced was not commensurate with the work he had produced in the interview process. Ultimately, he failed to ratchet up those levels of ownership, and I had to make the call to fire him.

Firing someone is never easy, but if you have a value, that value is mostly going to be expressed in your hiring, promotion, and firing decisions. While it wasn't great to look someone in the eye and say, "Today's your last day," I knew what I had to do. I knew what the right thing was for us and for the team. Even though we say all failures of ownership are failures of context, if someone has received repeated, explicit feedback over a length of time and still fails to get to the level that's expected – and it might not be the top level, depending on your organization and team – or if they make the transition, operate fine for a while, and then sink back down, and you see this happen a couple of times, then you know what you need to do. They're not in the right role.

Bonus Track

A quick bonus track. Back to our good friend Maslow: this is what most people are familiar with as Maslow's hierarchy of needs. Later in his career, he came to revisit it, and he said, "Self-actualization, this idea that the highest expression of oneself is figuring out what I want to be – that's a bit egoic, isn't it?" What he came to understand is that there's a higher level: you find your fullest realization by giving yourself to something beyond yourself. When you live not just to understand what you should be, but to understand what your community should be, what the larger world and environment should be, that's really the pinnacle of self-development. There's a corollary here to our levels of ownership, and I would say it's cultivating other people's ownership.

Astute viewers will notice I left a little wiggle room on this slide. We do expect all ICs at Netflix to achieve the Vision level of ownership. That new meta level of Cultivation is an expectation we have for everybody in a leadership position – senior engineers on the team, senior technical contributors, effective team leads, and also managers, directors, and all of our VPs on up. We expect that you are cultivating other people's sense of ownership. Then at the next level up, our expectation for directors is that they make sure the managers reporting to them are doing that cultivation of their people – and so on, transitively, or recursively, or wheels within wheels, all the way up.

Questions and Answers

Participant 1: A little bit further on this cultivation: how do we recognize when people are ready for the next level of ownership, and make sure we're not preventing them from achieving it by providing too many training wheels?

Blohowiak: The question was basically around how we help ratchet people up through the different levels and know when they're ready. There I take a lot of inspiration from Vygotsky, who talked about the zone of proximal development. There are some things you have total mastery over and some things you are totally incompetent at, and in between there's a band of things you can do with different levels of assistance. You start by providing a lot of assistance, and you slowly take that assistance away. For instance, say you think someone is operating successfully at the Observation level and might be ready for the Execution level. That's not really a change in kind but more a change in degree – the frequency of check-ins, the frequency of the feedback you're giving them – and slowly you can back away. Similarly, from Execution to Vision, it starts with asking those questions: "This is how things are – should it be this way?" Once you encourage those kinds of conversations and you really start to enjoy what you're hearing from them, then you should set the context of: I shouldn't be asking you, "Should it be this way?" You should be telling me what the alternatives are, what you're considering, how those things are going. So there are progressive steps between the different levels.

Participant 2: You mentioned that Netflix is one of those major companies where experienced people are hired, and you prefer people over process. For other companies, that may not hold, because you need to take on people with less experience, and if you find they're not the right fit, you've got to have some process in place to skill them up and put them in the right place. Without process, how can you achieve that? If you just give them freedom, how do you make them successful?

Blohowiak: To summarize: if you aren't Netflix, if you don't just hire senior individuals, how can you have people over process when you have junior folks who are going to unknowingly do bad things? There I would say that you still don't need to lean into process. What you need to lean into is people having a higher degree of self-awareness. You need to understand that in every relationship you're in, you are in either the role of teacher or of student. Very rarely are you true peers. Understanding which role you're in, in which context, takes a fair degree of self-awareness, and many people fresh out of school – me certainly included at that time – have an over-inflated sense of self. The key context you can set there is, "You know, you're not that good yet. I believe in you. You're going to get there, and here's how you're going to do it." Then you can provide that level of understanding: in the beginning, you're going to be learning, learning, learning, with lots of oversight and lots of feedback, and how fast you progress through the levels is determined by you. But right now, for you to be successful at your level, you need to be great at being demonstrated to – I'm going to judge you on the quality of your questions. For Oversight, you'll be judged on how well you loop people in to oversee your work, and once you demonstrate true competency at this level, you'll be ready for the next. Ultimately, I suspect – though I have not practiced it – that people over process works even with more junior folks, if you set the appropriate context of expectations about what level they're at.

Participant 3: This was great, and I absolutely adore it. A lot of the examples were within a team. I'm curious what level of success you've had managing this across teams, when you have individuals thinking they're executing at Vision versus Execution.

Blohowiak: Managing within a team versus across teams, with individuals who think they're operating at Vision versus Execution – I think that can be really challenging. One of the things John Maxwell talks about in his Five Levels of Leadership is that you reset in each relationship. Each time you interact with somebody new, you go back to almost zero trust, and you have to build it up over time. In large organizations, we have proxy signals for how trustworthy someone is – reporting structure, sometimes title, that sort of thing – but ultimately it is a relationship-based business. Making that part of the conversation is something we've found very important for achieving big project success. To get a bit more concrete, one of the things we've made a first-class citizen in our decision-making process is who the stakeholders are. By calling that out, we say: these are the different levels of involvement – you're helping write the proposal, you're helping approve the proposal, or you're signaling your approval of the proposal – because we don't seek consensus.

If you’re watching and monitoring from the outside which level you’re at, that will help us understand how to relate to you. It is that sort of opt in, people over process, but with a bit of a comfortable norm that everyone should be comfortable saying, “I need to be much more involved in this than I currently am. That’s a totally fine thing to do, which also means they have to be ok with saying, “I totally forgot you.” That also has to be fine. Then through those relationships, we can ultimately navigate those interactions. It does get sticky, especially if you have someone who has a security concern and someone who has a product deadline, and they can be in tension, and then who’s really the one to set that vision? That would be a great talk I would attend.

Participant 4: You mentioned that in your organization you try to teach new people, "This is why we do what we do." How do you balance the inevitable tension of new people coming in with other visions and other experience, against this notion that "This is why we do what we do, we've always done it this way, and we're never going to change"?

Blohowiak: Balancing longstanding cultural norms against bringing in people with different ideas. I have intentionally hired people with different ideas about what we should be doing, and with different proclivities as engineers: some prefer immediacy, some prefer thoroughness, some prefer risk mitigation, some prefer risk avoidance. Bringing all those voices to the table, whether they're new or have been at the company a while, is important for creating the healthy tension that allows the team to respond to a changing world. How do we balance "this is the way we do things" with having everyone contribute a new vision of how we should do things? The reason we explain why we do things, not just what we do, is so that you have a basis for a sound argument. If we just tell you this is the way we do things, you have no idea how to even start the conversation of challenging our assumptions.

There are a couple of things engineers do that are really bad. One, they optimize systems that shouldn't even exist. Two, they accept constraints that are external to them. In many ways, your question is about people accepting these constraints about how we do things, and part of the way we talk about why we do things is how things have evolved over time. Our team has now existed for somewhere on the order of six to eight years, depending on how you count, and we've been through many revolutions and changes in how we've done it. So part of our story of why is the story of the evolution of our process, and we even try to include the things we are starting to think about for the future.

As part of our annual planning process, we ask, "Should the team still exist? What team needs to exist?" – because we can recharter ourselves whenever we want – and, "What are the problems we should be solving?" That's when we really open up the conversation, starting from first principles, like a zero-based budgeting approach applied to technical strategy, and every voice is incorporated into the room. People who are overeager to share their view of things before they've learned the history or the why are, in a way, misjudging where they are in that teacher-student relationship, and that can be inappropriate and distracting to the larger team. When you try to dial that behavior down a little, setting the context – "It's not that I don't care about your opinion; it's that you don't yet have good enough information to have formed a solid opinion" – is super critical. Then set the timeline: "In a few months is when we really expect you to be fully ramped up, and you're doing great as you ramp toward that level." So don't shut it out, even if you are trying to dial it down in the beginning.
