Mobile Monitoring Solutions

Microsoft Introduces App Service Static Web Apps in Preview at Build 2020

MMS Steef-Jan Wiggers

Article originally posted on InfoQ. Visit InfoQ

During this year’s digital Build event, Microsoft announced it had expanded Azure App Service with a new hosting offer explicitly tailored for static web apps. The hosting offering is called Azure Static Web Apps and is currently in preview.

With Azure Static Web Apps, developers can build modern, full-stack JavaScript web apps with static front-ends and optional dynamic back-ends powered by serverless APIs. More specifically, Daria Grigoriu, a program manager at Azure Functions, said in a Microsoft Build session explaining and demonstrating the new service:

Static Web App provides a unified workflow, which takes you from your source code to global availability. Everything is managed entirely for you. And all of the different configuration aspects are available for you at a simple click distance.

Furthermore, according to the announcement blog post by Grigoriu, Azure Static Web Apps provides developers with the following advantages:

  • Use frameworks such as Angular, React, Svelte, and Vue, or static site generators like Gatsby, with a simple interface for deploying the cloud resources.
  • Move dynamic logic to serverless APIs, unlocking dynamic scale that can adjust to demand in real-time.
  • Pre-render static content (including HTML, CSS, JavaScript, and image files) and leverage global content distribution to serve it – removing the need for traditional web servers to generate the content with every request.

The way Azure Static Web Apps works is that a developer can have an application built in, for instance, JavaScript, and push or create a pull request to a repository in GitHub. The push or pull request triggers a GitHub Action, which initiates a workflow – building the assets (content and APIs) with npm run build and deploying them to Azure as a static web app. Furthermore, the assets are deployed to at least five servers and regions around the globe.
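For illustration, the GitHub Actions workflow driving this looks roughly like the following sketch. It uses the preview-era Azure/static-web-apps-deploy action; the exact version tag and parameter names (such as app_location and output_location) vary by release, so treat them as assumptions:

name: Azure Static Web Apps CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  build_and_deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Builds the app and the API, then uploads the assets to Azure
      - uses: Azure/static-web-apps-deploy@v0.0.1-preview
        with:
          azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN }}
          repo_token: ${{ secrets.GITHUB_TOKEN }}
          action: "upload"
          app_location: "/"         # path to the app source
          api_location: "api"       # path to the Azure Functions API
          output_location: "dist"   # path of the build output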


Source: https://docs.microsoft.com/en-us/learn/modules/publish-app-service-static-web-app-api/1-introduction

Besides easy deployment and integration with GitHub, the service also features authentication and authorization capabilities, routes, previews in pre-production environments, and custom domains. Combined, Azure Static Web Apps provides developers with one package that works for static web apps – which Azure manages for them.

Rafael Rivera, Microsoft MVP, said in a tweet:

Don’t skip over Azure Static Web Apps. It’s like GitHub Pages on steroids. You push your content to GitHub, and it’ll handle SSL, wiring up web APIs (az functions!), serve up static content, and even handle auth (AAD, FB, G, TWIT, etc.) for you. Crazy!

And as John Papa, principal developer advocate lead at Microsoft, wrote in his blog post about Azure Static Web Apps:

Oh, there is so much more you can do! You can add a custom domain with an SSL certificate, authentication, and authorization. You can make a change in a new branch, make a pull request, and then have the GitHub Action build and deploy your changes to a staging/preview URL!

Before, developers were able to build and host static web apps using several Azure components, such as Azure Storage for static content. However, developers had to do a lot themselves, and managing it all was hard – as Mitch Webster, senior engineering manager for App Service, said in another Microsoft Build session on Azure Static Web Apps:

In the past, when we’ve used a variety of different services across Azure, it’s kind of hard to manage, and it can be hard to get right the first time. A big goal of this project is to streamline that and make things as easy as possible.

Currently, App Service Static Web Apps is available in the Central US, East US 2, West US 2, East Asia, and West Europe Azure regions, and the preview starts with a free plan. Furthermore, developers can visit the quickstart to try out and explore Static Web Apps – and a Visual Studio Code extension for Azure Static Web Apps is available in the marketplace.



Presentation: The Evolution of Distributed Systems on Kubernetes

MMS Bilgin Ibryam

Article originally posted on InfoQ. Visit InfoQ

Transcript

Ibryam: I’m Bilgin. Today, I’m going to share with you how I see distributed systems evolving on Kubernetes. Why me? Because I work for Red Hat. I have been a consultant and architect there, working with distributed systems using Apache Camel. Apache Camel is a very popular framework in the Java ecosystem for doing integrations. I’m a committer, and I have a book about Apache Camel. In recent years, I’ve used Kubernetes, and I also have a book about it. I’ve been at the intersection of distributed systems and Kubernetes.

First, I want to start with a question, what comes after microservices? I’m sure you all have an answer to that and I have mine too. You’ll find out at the end what I think that will be. To get there, I suggest we look at what are the needs of the distributed systems and how those needs have been evolving over the years starting with monolithic applications to Kubernetes, and with more recent projects such as Dapr, and Istio, Knative, how they are changing the way we do distributed systems. We will try to do some predictions about the future.

Modern Distributed Applications

To set a little more context for this talk: when I say distributed systems, what I have in mind is systems composed of multiple components – hundreds of those. These components can be stateful, stateless, or serverless. Components created in different languages, running on different environments. Components created using open-source technologies and open standards, interoperable. I’m sure you can create such systems using closed-source software, or on AWS and other places. For this talk specifically, I’m looking at the Kubernetes ecosystem and how you can create those systems on it.

Let’s start with the needs of distributed systems. What I have in mind is: we want to create an application or service, and we want to write some business logic. What else do we need from the platform, from our runtime, to create distributed systems? At the foundation, the very first thing is lifecycle capabilities. When you write your application in any language, we want the ability to package and deploy that application reliably, to do rollbacks and health checks. We want to be able to place the application on different nodes, and be able to do resource isolation, scaling, configuration management, all of these things. These are the very first things you would need to create a distributed application.

The second pillar is around networking. Once we have an application, we want it to reliably connect to other services, whether these are within the cluster or in the outside world. We want to have abilities such as service discovery, load balancing. We want to be able to do traffic shifting, whether that’s for different release strategies or for some other reasons. Then we want to have an ability to do resilient communication with other systems, whether that is through retries, timeouts, circuit breakers, of course. Have security in place, and get good monitoring, tracing, observability, and all that.

Once we have networking, the next thing is we want the ability to talk to different APIs and endpoints – resource binding. We want to be able to talk to different protocols and different data formats, maybe even transform from one data format to another. I would also include here things such as light filtering: when we subscribe to a topic, maybe we are interested only in certain events.

What do you think is the last category? It is state. When I say state and stateful abstractions, I’m not talking about the actual management of state, such as what a database or a file system does. I’m talking more about developer abstractions that behind the scenes rely on state. Probably, you need the ability to do workflow management. Maybe you want to manage long-running processes. You want to do temporal scheduling – some cron job to run your service periodically. Maybe you also want to do distributed caching, have idempotence, or be able to do rollbacks. All of these are developer-level primitives, but behind the scenes they rely on having some state. You want to have these abstractions at your disposal to create good distributed systems. We will use this framework of needs to evaluate how they have been changing on Kubernetes and with other projects.

Monolithic Architectures – Traditional Middleware Capabilities

If we start with monolithic architectures and how we get those capabilities: when I say monolith, in the context of distributed applications, I have in mind the ESB. ESBs are quite powerful. When we check our list of needs, we would say that ESBs had very good support for all the stateful abstractions. You could do orchestration of long-running processes, distributed transactions, rollbacks, idempotence. They also have very good resource-binding capabilities: an ESB will have hundreds of connectors, and they can do transformation, orchestration, and even networking capabilities. An ESB can do service discovery and load balancing, and it has everything around the resiliency of the networking connection, so it can do retries. Probably because an ESB is, by nature, not very distributed, it doesn’t need very advanced networking and release capabilities. Where an ESB lacks is primarily around lifecycle management. Because it’s a single runtime, the first limitation is that you are restricted to a single language – typically the language the runtime itself is created in, whether that’s Java, or .NET, or something else. Then, because it’s a single runtime, we cannot easily do declarative deployments or automatic placement. The deployments are quite big and heavy, so they usually involve human interaction. Another difficulty with such a monolithic architecture is around scaling: we cannot scale individual bits. Last but not least, around isolation – whether that’s resource isolation or fault isolation – none of this can be done with monolithic architectures. From our needs-framework point of view, ESBs and monolithic architectures don’t qualify.

Cloud-native Architectures – Microservices and Kubernetes

Next, I suggest we look at cloud-native architectures and how those needs have been changing. If we look at a very high level at how those architectures have evolved, cloud native probably started with the microservices movement. Microservices allow us to split a monolithic application by business domain. It turned out that containers and Kubernetes are actually a good platform for managing those microservices. Let’s see some of the concrete features and capabilities that make Kubernetes particularly attractive for microservices.

At the very beginning, there is the ability to do health probes. That’s something that Kubernetes made popular. In practice, it means that when you deploy your container in a pod, Kubernetes will check the health of your process. Typically, that process-level check is not good enough: you may have a process that’s up and running but not healthy. That’s why there is also the option of using readiness and liveness checks. Kubernetes will do a readiness check to decide when your application is ready to accept traffic during startup, and it will do a liveness check to continuously verify the health of your service. Before Kubernetes, this wasn’t very popular, but today almost all languages, frameworks, and runtimes have health-checking capabilities where you can easily expose an endpoint.
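As a minimal sketch, a pod manifest with both checks might look like this; the /ready and /live paths, the port, and the image name are illustrative assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: my-service
spec:
  containers:
    - name: app
      image: example.com/my-service:1.0    # hypothetical image
      ports:
        - containerPort: 8080
      readinessProbe:                      # gates traffic during startup
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
      livenessProbe:                       # restarts the container when unhealthy
        httpGet:
          path: /live
          port: 8080
        periodSeconds: 10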

The next thing that Kubernetes introduced is the managed lifecycle of your application. What I mean here is that you are no longer in control of when your service starts up or shuts down; you trust the platform to do that. Kubernetes can start up your application, shut it down, and move it around on different nodes. For that to work, you have to properly handle the events that the platform sends you during startup and shutdown.
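A sketch of what honoring that lifecycle can look like in a manifest – the image and commands are placeholders; the real work is handling SIGTERM in your process:

apiVersion: v1
kind: Pod
metadata:
  name: my-service
spec:
  terminationGracePeriodSeconds: 30        # time the process gets after SIGTERM
  containers:
    - name: app
      image: example.com/my-service:1.0    # hypothetical image
      lifecycle:
        postStart:
          exec:
            command: ["sh", "-c", "echo started"]   # e.g. warm a cache
        preStop:
          exec:
            command: ["sh", "-c", "sleep 5"]        # e.g. let in-flight requests drain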

Another thing that Kubernetes made popular is declarative deployments. That means you don’t have to start the service anymore and check the logs to see whether it has started. You don’t have to manually upgrade instances; Kubernetes, with declarative deployments, can do that for you. Depending on the strategy you choose, it can stop old instances and start new ones. If something goes wrong, it can roll back.
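For example, a Deployment that declares a rolling-update strategy might look like this sketch (names and numbers are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # keep at least two replicas serving during an upgrade
      maxSurge: 1          # allow one extra replica while rolling out
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: app
          image: example.com/my-service:1.1    # hypothetical new version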

Another thing is declaring your resource demands. When you create a service and containerize it, it is a good practice to tell the platform how much CPU and memory that service will require. Kubernetes uses that knowledge to find the best node for your workloads. Before Kubernetes, we had to manually place an instance on a node based on our own criteria. Now we can guide Kubernetes with our preferences, and it will make the best decision for us.
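In a pod spec, that declaration is a resources block, along these lines (a fragment; the numbers are arbitrary examples):

    # fragment of a pod spec
    spec:
      containers:
        - name: app
          image: example.com/my-service:1.0
          resources:
            requests:          # what the scheduler uses to place the pod
              cpu: "250m"
              memory: "128Mi"
            limits:            # hard caps enforced at runtime
              cpu: "500m"
              memory: "256Mi"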

Nowadays, on Kubernetes, you can do polyglot configuration management. You don’t need anything in your application runtime to do configuration lookup. Kubernetes will make sure that the configurations end up on the same node as your workload, mapped as a volume or environment variables, ready for your application to use.
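A minimal sketch of both consumption styles, with made-up names:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: debug
---
apiVersion: v1
kind: Pod
metadata:
  name: my-service
spec:
  containers:
    - name: app
      image: example.com/my-service:1.0
      env:
        - name: LOG_LEVEL               # exposed as an environment variable
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: LOG_LEVEL
      volumeMounts:
        - name: config                  # exposed as files under /etc/config
          mountPath: /etc/config
  volumes:
    - name: config
      configMap:
        name: app-config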

It turns out those specific capabilities I just spoke about are also related. For example, if you want automatic placement, you have to tell Kubernetes the resource requirements of your service. Then you have to tell it what deployment strategy to use. For the strategy to work properly, your application has to handle the events coming from the environment and implement health checks. Once you apply all of these best practices and use all of these capabilities, your application becomes a good cloud-native citizen, ready for automation on Kubernetes. This represents the foundational patterns for running workloads on Kubernetes. Then there are other patterns around structuring the containers in a pod, configuration management, and behavior.

The next topic I want to briefly cover is workloads. From the lifecycle point of view, we want to be able to run different workloads, and we can do that on Kubernetes too. Running Twelve-Factor Apps and stateless microservices is pretty easy; Kubernetes can do that. That’s not the only workload you will have. Probably you will also have stateful workloads, which you can run on Kubernetes using a stateful set. Another workload you may have is a singleton: maybe you want only one instance of your app throughout the whole cluster, and you want it to be a reliable singleton – when it fails, it should be started up again. You can choose between stateful sets and replica sets, depending on whether you want the singleton to have at-least-one or at-most-one semantics. Another workload you may have is jobs and cron jobs; with Kubernetes, you can do those as well – see the sketch below.
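A cron job, for instance, is a few lines of YAML (a sketch; the schedule and image are placeholders, and the apiVersion is batch/v1 on newer clusters):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"        # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: example.com/report-job:1.0    # hypothetical image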

If we map all of these Kubernetes features to our needs, Kubernetes satisfies the lifecycle needs really well. In fact, the list of needs I created is primarily driven by what Kubernetes provides us today. These are expected capabilities from any platform. Kubernetes can do deployment, placement, configuration management, resource isolation, and failure isolation for you. It supports different workloads – except serverless – on its own.

Then, if that’s all Kubernetes gives developers, how do we extend Kubernetes? How can we make it give us more features? I want to briefly talk about the two main ways used today.

Out-of-process Extension Mechanism

The first thing is the concept of a pod. A pod is an abstraction used to deploy containers on nodes. The pod gives us two sets of guarantees. The first set is deployment guarantees: all containers in a pod always end up on the same node. That means they can communicate with each other over localhost, asynchronously using the file system, or through some other IPC mechanism. The other set of guarantees a pod gives us is around lifecycle. Not all containers within a pod are equal; depending on whether you’re using init containers or application containers, you get different guarantees. For example, init containers run at the beginning, when a pod starts, sequentially, one after another – and each runs only if the previous container has completed successfully. They are good for implementing some workflow logic driven by containers. Application containers run in parallel, throughout the lifecycle of the pod.

This is where the sidecar pattern comes in. A sidecar is the ability to run multiple containers that cooperate with each other and jointly provide value to the user. That’s one of the primary mechanisms we see nowadays for extending Kubernetes with additional capabilities.

To explain the next capability, I have to briefly tell you how Kubernetes works internally. It is based on the reconciliation loop. The idea of the reconciliation loop is to drive the desired state toward the actual state. Within Kubernetes, many bits rely on that. For example, when you say, “I want two instances of a pod,” this is the desired state of your system. There is a control loop that constantly runs and checks whether there are two instances of your pod. If there is one, or more than two, it will calculate the difference and make sure there are two instances. There are many examples of this: replica sets, stateful sets. Each resource definition maps to a controller; there is a controller for each resource definition, and this controller makes sure that the real world matches the desired one. You can also write your own custom controller. Say you have an application running in a pod that cannot load configuration file changes at runtime. You can write a custom controller that detects every time a config map changes and restarts your pod, so that your application picks up the configuration changes at startup. That would be an example of a custom controller.

It turns out that even though Kubernetes has a good collection of resources, they are not enough for all the different needs you may have. That’s why Kubernetes introduced the concept of custom resource definitions. That means you can model your requirements and define an API that lives within Kubernetes, next to the other Kubernetes native resources. You can write your own controller, in any language, that understands your model. You could design a ConfigWatcher, implemented in Java, that does what we described earlier. That is what the operator pattern is: a controller that works with custom resource definitions. Today, we see lots of operators coming up, and that’s the second way of extending Kubernetes with additional capabilities.
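A sketch of what the CRD for that hypothetical ConfigWatcher could look like – the group, field names, and schema are all illustrative:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: configwatchers.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: ConfigWatcher
    plural: configwatchers
    singular: configwatcher
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                configMap:
                  type: string    # the ConfigMap to watch
                podSelector:
                  type: string    # which pods to restart on a change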

Next, I want to briefly go over a few platforms that are built on top of Kubernetes, and they are heavily using sidecars and operators to give additional capabilities to developers around distributed systems.

What is Service Mesh?

Why don’t we start with the service mesh? What is a service mesh? We have two services, service A that wants to call service B. They can be in any language; that’s basically our application workload. What a service mesh does is, using sidecar controllers, inject a proxy next to our service. You end up with two containers in the pod. The proxy is a transparent one: your application is completely unaware that there is a proxy intercepting all incoming and outgoing traffic. The proxy also acts as a data firewall. The collection of these service proxies represents your data plane. These proxies are small and stateless; in order to get all their state and configuration, they rely on the control plane. The control plane is the stateful part that keeps all the configurations, gathers metrics, takes decisions, and interacts with the data plane. There are different choices of control planes and data planes. It turns out we need one more component: an API gateway, in order to get data into our cluster. Some service meshes have their own API gateway and some use a third party. All of these components, if you look into them, provide the capabilities we need.

An API gateway is primarily focused on abstracting the implementation of our services. It hides the details and provides capabilities at the border of the cluster. A service mesh does the opposite: in a way, it enhances the visibility and reliability within the services. Jointly, we can say that the API gateway and the service mesh provide all the networking needs. Using just Kubernetes services is not enough to get these networking capabilities on top of Kubernetes; you need some service mesh.

What is Knative?

The next project I want to mention is Knative. It’s a project started by Google a few years ago, and it’s getting very close to GA. It is basically a layer on top of Kubernetes that gives you serverless capabilities. It has two main modules: serving and eventing. Serving is focused on request-reply interactions; eventing is more for event-driven interactions.

Just to give you a feel for what serving is: in serving, you define a service, but that is different from a Kubernetes service – this is a Knative service. Once you define a workload with a service, you basically get a deployment, but with serverless characteristics. You don’t need to have an instance up and running; it can be started from zero when a request arrives. You get serverless capabilities: it can scale up rapidly, and it can scale down to zero.
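A minimal Knative service looks like this sketch (the name and image are placeholders); Knative turns it into a deployment, a route, and an autoscaler that can go to zero:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    spec:
      containers:
        - image: example.com/hello:1.0    # hypothetical image
          env:
            - name: TARGET
              value: "World"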

Eventing gives us a fully declarative event-management system. Let’s assume we have some external systems we want to integrate with, some external event producers. At the bottom, we have our application in a container that has an HTTP endpoint. With Knative eventing, we can start a broker – one backed by Kafka, or in-memory, or some cloud service. We can start importers that connect to the external systems and import events into our broker. Those importers can be, for example, based on Apache Camel, which has hundreds of connectors. Once we have our events going to the broker, then declaratively, with a YAML file, we can subscribe our container to those events. In our container, we don’t need any messaging client – no Kafka client, for example. Our container gets events through HTTP POST using CloudEvents. This is a fully platform-managed messaging infrastructure. As a developer, all you have to do is write your business code in a container; you don’t deal with any messaging logic.
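A sketch of such a subscription, using a Knative Trigger; the CloudEvents type and service name are made up, and the exact apiVersion depends on the Knative release:

apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: order-created
spec:
  broker: default
  filter:
    attributes:
      type: com.example.order.created    # only deliver events of this type
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: hello                        # events arrive as HTTP POSTs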

From our needs’ point of view, Knative satisfies a few of them. From the lifecycle point of view, it gives our workloads serverless capabilities: the ability to scale to zero, and to activate from zero and scale up. From a networking point of view, while there is some overlap with the service mesh, Knative can also do traffic shifting. From a binding point of view, it has pretty good support for binding using Knative importers: it can give us Pub/Sub, point-to-point interaction, or even some sequencing. It satisfies the needs in a few categories.

What is Dapr?

The next project using sidecars and operators is Dapr. Dapr was started by Microsoft only a few months ago, but it is rapidly getting popular. It is basically a distributed systems toolkit as a sidecar. Everything in Dapr is provided as a sidecar. It has a set of what they call building blocks, or sets of capabilities.

What are those capabilities? The first set is around networking. Dapr can do service discovery and point-to-point integration between services. Similarly to a service mesh, it can also do tracing and reliable communication: retries, recovery. The second set of capabilities is around resource binding: it has lots of connectors to cloud APIs and different systems, and it can also do messaging – basically publish/subscribe and other logic. Interestingly, Dapr also introduces the notion of state management. In addition to what Knative and service mesh give you, Dapr has an abstraction on top of a state store. You can have key-value-based interaction with Dapr that is backed by an actual storage mechanism.
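As an illustration, the state store is configured as a Dapr component in YAML, and the application then saves and reads state over the sidecar’s local HTTP endpoint (for example, POST http://localhost:3500/v1.0/state/statestore). The Redis host below is a placeholder:

apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: statestore
spec:
  type: state.redis          # the actual storage mechanism behind the abstraction
  version: v1
  metadata:
    - name: redisHost
      value: redis-master:6379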

At a high level, the architecture is: you have your application at the very top, which can be in any language. You can use the client libraries provided by Dapr, but you don’t have to; you can use the language’s own features to call the sidecar over HTTP or gRPC. The difference from a service mesh is that the Dapr sidecar is not a transparent proxy – it is an explicit proxy that you have to call from your application and interact with over HTTP or gRPC. Depending on what capabilities you need, Dapr can talk to other systems, such as cloud services.

On Kubernetes, Dapr deploys as a sidecar, but Dapr also works outside of Kubernetes – it’s not Kubernetes-only. It also has an operator; sidecars and operators are the primary extension mechanisms. There are a few other components to manage certificates, to deal with actor-based modeling, and to inject the sidecars. Your workload interacts with the sidecar, and the sidecar does all the magic to talk to other services, which can give you some interoperability with different cloud providers. It gives you additional distributed-system capabilities.
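On Kubernetes, the injection is driven by annotations on the pod template, roughly like this sketch (the app id and port are placeholders, and annotation names have varied across Dapr versions):

  # fragment of a Deployment's pod template
  template:
    metadata:
      annotations:
        dapr.io/enabled: "true"      # ask the injector to add the Dapr sidecar
        dapr.io/app-id: "orders"     # hypothetical app id
        dapr.io/app-port: "8080"     # port the app listens on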

If I were to sum up what these projects give you, we can say that the ESB is the early incarnation of distributed systems: we had a centralized control plane and a centralized data plane, so it didn’t scale really well. With cloud native, there is still a centralized control plane, but the data plane is decentralized – highly scalable, with good isolation. You would always need Kubernetes to do good lifecycle management. Then, on top of that, you would probably need one or more add-ons. You may need Istio to do advanced networking. You may use Knative to do serverless workloads, or Dapr to do integration. Those frameworks play nicely with Istio and Envoy; between Dapr and Knative, you probably have to pick one. Jointly, they provide, in a cloud-native way, what we used to have on an ESB.

Future Cloud Native Trends – Lifecycle Trends

For the next part, I have put together an opinionated list of a few projects where I think interesting developments are happening in these areas. I want to start with lifecycle. With Kubernetes, we can do good lifecycle management of an application, but that might not be enough for more complex lifecycle needs. For example, you may have scenarios where the deployment primitive in Kubernetes is not enough for your application – say, a more complex stateful application. In those scenarios, you can use the operator pattern: an operator that does deployment and upgrade, and also, for instance, backs up the storage of your service to S3. Another thing is that you may find the health-checking mechanism in Kubernetes is not good enough. If the liveness and readiness checks don’t suffice, you can use an operator to do more intelligent liveness and readiness checking of your application and, based on that, do recovery.

A third area would be auto-scaling and tuning: you can have an operator that better understands your application and does auto-tuning on the platform. Today, there are primarily two frameworks for writing operators: Kubebuilder, from a Kubernetes special interest group, and the Operator SDK, which is part of the Operator Framework. The Operator Framework, created by Red Hat, has a few things: the Operator SDK, which lets you write an operator; the Operator Lifecycle Manager, which is about managing the lifecycle of the operator itself; and OperatorHub, where you can publish your operator. If you go there today, you will see there are over 100 operators that manage databases, message queues, and monitoring tools. In the lifecycle space, operators are probably the area where the most active development is happening right now in the Kubernetes ecosystem.

Networking Trends – Envoy

The next project I picked is Envoy. On the networking side, we’ve seen this morning what’s happening with service mesh and Istio. There is the introduction of the service mesh interface specification, which will make it easier for you to switch between different service mesh implementations. There has also been some consolidation of the Istio architecture in deployment: you don’t have to deploy seven pods for the control plane anymore; you can deploy just one. More interesting is what’s happening at the data plane, in the Envoy project. We see more and more Layer 7 protocols being added to Envoy. Service mesh is adding support for protocols such as MongoDB, ZooKeeper, MySQL, Redis, and, most recently, Kafka. I see that the Kafka community is now further improving their protocol to make it friendlier for service meshes. We can expect even tighter integration and more capabilities. Most likely there will be some bridging capability: in your service, you do an HTTP call locally from your application, and the proxy, behind the scenes, will use Kafka. You can do transformation and encryption outside of your application, in a sidecar, for the Kafka protocol.

Another interesting development has been the introduction of HTTP caching. Now Envoy can do HTTP caching, so you don’t have to use caching clients within your applications; all of that is done transparently in a sidecar. There are tap filters, so you can tap the traffic and get a copy of it. Most recently, there was the introduction of WebAssembly: if you want to write a custom filter for Envoy, you don’t have to write it in C++ and compile the whole Envoy runtime – you can write your filter in WebAssembly and deploy it at runtime. Most of these are still in progress; they are not there yet. This gives me an indication that the data plane and the service mesh have no intention of stopping at just supporting HTTP and gRPC. They are interested in supporting more application-layer protocols, to offer you more and enable more use cases. Especially with the introduction of WebAssembly, you can now write your custom logic in the sidecar – which is fine, as long as you’re not putting business logic there.

Binding Trends – Apache Camel

The next project I want to talk about is Apache Camel. That’s a project I love. It is a project for doing integrations, with hundreds of connectors to different systems, and it uses enterprise integration patterns. The way it relates to this talk is that its latest version, Camel 3, is getting deeply integrated into Kubernetes, using the same primitives we’ve spoken about so far, such as operators. In Camel, you can write your integration logic in languages such as Java or JavaScript; here, the example is using YAML (see the sketch below). In the latest version, they have introduced a Camel operator – something that runs in Kubernetes and understands your integration. When you write your Camel application and deploy it as a custom resource, the operator knows how to build the container and how to find dependencies. Depending on the capabilities of the platform – whether that’s Kubernetes only, or Kubernetes with Knative – it can decide what services to use and how to materialize your integration. There is quite a lot of intelligence going not into your runtime, but into the operator, and all of that happens pretty fast. Why would I say it’s a binding trend? Mainly because of the capabilities of Apache Camel, with all the connectors it provides. The interesting point here is how deeply it integrates with Kubernetes.
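To give a flavor, a Camel integration in YAML can look roughly like this sketch in the Camel K style; the endpoint URIs are illustrative, and the exact DSL keywords depend on the Camel version:

- from:
    uri: "timer:tick?period=5000"    # fire every five seconds
    steps:
      - setBody:
          constant: "Hello from Camel"
      - to: "log:info"               # write the body to the log

The operator takes a file like this, builds the container, and materializes the integration on the cluster.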

State Trends – Cloudstate

The next project I picked is Cloudstate, around state-related trends. Cloudstate is a project by Lightbend, primarily focused on serverless and function-driven development. With their latest releases, they are integrating deeply with Kubernetes using sidecars and operators. I think they also have integration with Dapr, Knative, and all of this. The idea is, when you write your function, all you have to do in your function is use gRPC to get state and interact with state. The whole state management happens in a sidecar that is clustered with other sidecars. It enables you to do event sourcing, CQRS, key-value lookups, and messaging. From your application’s point of view, you are not aware of all these complexities: all you do is call a local sidecar, and the sidecar handles the complexity. Behind the scenes, it can use different data sources. It has all the stateful abstractions you would need as a developer. A project definitely to follow, to see what’s happening.

We have seen what the current state of the art is in the cloud-native ecosystem, and some of the recent developments that are still in progress. How do we make sense of all that?

Multi-runtime Microservices Are Here

If you look at how a microservice looks on Kubernetes, you will need to use some functionality from the platform. You will need Kubernetes features, primarily for lifecycle management. Then, most likely, transparently, your service will use some service mesh – something like Envoy – to get enhanced networking capabilities, whether that’s traffic routing, resilience, enhanced security, or even monitoring. On top of that, depending on your use case and your workloads, you may need Dapr or Knative. All of these represent your out-of-process, additional capabilities. What’s left to you is to write your business logic – not on top, but as a separate runtime. Most likely, future microservices will be this multi-runtime, composed of multiple containers. Some of those are transparent, and some are very explicit, ones that you use directly.

Smart Sidecars and Dumb Pipes

If I look a little bit deeper at what that might look like: you write your business logic in some high-level language – it doesn’t matter which; it doesn’t have to be only Java – and you develop your custom logic in-house. Then all the interactions of your business logic with the external world happen through the sidecar. That sidecar integrates with the platform and does the lifecycle management. It does the networking abstractions for external systems, gives you advanced binding capabilities, and provides state abstraction. The sidecar is something you don’t develop: you get it off the shelf, configure it with a little bit of YAML or JSON, and use it. That means you can update sidecars easily, because they’re no longer embedded in your runtime. It makes patching and updating easier, and it enables a polyglot runtime for our business logic.

What Comes After Microservices?

That brings me to the original question: what comes after microservices? If we see how application architectures have been evolving, at a very high level – this is a simplification; hopefully, you get the idea – we started with monolithic applications. Microservices gave us the guiding principles on how to split a monolithic application into separate business domains. After that came serverless, and function as a service, where we said we can split those further, by operation. This gives us the ability to do extreme scaling, because we can scale each operation individually. I would argue that maybe FaaS is not the best model: functions are not the best model for implementing reasonably complex services, where you want multiple operations to reside together when they have to interact with the same dataset. Probably multi-runtime – I call it Mecha architecture – where you have your business logic in one container and all the infrastructure-related concerns as a separate container, and they jointly represent a multi-runtime microservice, is a more suitable model, because it has better properties. You get all the benefits of microservices. You still have all your domain, all the bounded contexts, in one place, and you have all the infrastructure and distributed-application needs in a separate container, and you combine them at runtime. Probably the closest thing to that right now is Dapr; they are following that model. If you’re only interested in the networking aspect, using Envoy is probably also getting close to this model.

See more presentations with transcripts



Q&A on the Compose Specification Community

MMS Christian Melendez

Article originally posted on InfoQ. Visit InfoQ

Docker launched a community for developing the Compose specification to help developers build cloud-native applications using Compose. There have been different implementations of Docker Compose to make it work on platforms like Kubernetes or AWS ECS, but Docker wants to work with the community to provide better support and define the future of Compose. The Compose spec aims to let people use Compose on different platforms and make any new features first-class features in Compose.

InfoQ recently talked to Justin Cormack, Security Lead at Docker, to learn more about the Compose specification community.

InfoQ: What is the Compose specification community? What type of problems is the spec looking to solve?

Justin Cormack: Originally, there was only one implementation: Docker Compose, in Python. But there have been more implementations, where people have looked at what Docker Compose does and tried to copy as much of it as possible – but sometimes they do different things. Even within Docker, we have three different implementations, like Docker Compose for Kubernetes or Swarm. So all these implementations had to copy what we did and then try to make it work, but they didn’t get any real influence on how to implement Docker Compose on different platforms.

The Compose spec is to let people use Compose on different platforms and make any new features first-class features in Compose, allowing the community to influence how the spec progresses. We knew there were problems with running Compose on Kubernetes, because it does networks and volumes differently. And other platforms like AWS ECS have their own implementations to interact with other AWS services. So we thought the best thing to do was to make an open spec, to help people use Compose on platforms other than Docker Swarm or the desktop.
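For context, a minimal Compose file of the kind the spec covers might look like this sketch (service names and images are arbitrary):

version: "3.8"
services:
  web:
    build: .              # build the app from the local Dockerfile
    ports:
      - "8080:8080"
    depends_on:
      - db
  db:
    image: postgres:12
    volumes:
      - db-data:/var/lib/postgresql/data
volumes:
  db-data: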

InfoQ: What’s the progress you’ve made so far? What’s the current state of the spec?

Cormack: It’s not a complete spec yet. It only says what set of parameters exists and what sections you can include in the YAML file; it doesn’t say exactly what each should do. It includes which bits of Compose are optional – because existing implementations don’t implement them all – and there’s also a reference for implementing tests.

We’ve also found many issues that we think need to be fixed, features we don’t like, or issues that the community has brought up. We have regular meetings with the community where we discuss the current issues and what to do about them. For instance, there isn’t a set of conformance tests to define a valid implementation yet, but we’d like to have that in the future.

Also, we’re looking in the short term for ways that people can do custom extensions if they have very platform-specific features, like in the case of Kubernetes or AWS ECS. Then, we’ll work out which of these features makes sense as general cross-platform extensions later.

InfoQ: Docker Compose has more of a developer focus. Do you think Docker Compose could evolve for production usage as well?

Cormack: There are two ways people use Compose at the moment. Some people use Compose for development, and it’s not what goes to production at all. They use Compose for testing, and it might not spin up the same services, but it’s similar. We also know people who have a workflow where the developers write Compose files, and then another team writes another YAML file that corresponds – but it’s not automatically derived and might have different things in it.

We’re not trying to force people how and where to use Compose. One of the things we looked at when defining the spec was how many people are using Compose. On GitHub alone, there are something like 700,000 Docker Compose files, and I’d say most of them are not currently used for production. Usually, these files only define an example of how to spin up the system, so it’s easy to get started – but it’s not necessarily how you want to run the system in production.

So my feeling is that we should keep Compose about the things that developers care about, and keep separate the things that operations cares about with regard to running systems in production. So your Compose file wouldn’t be the whole description of production; it would just be the developer-facing pieces. It’s a sort of separation-of-concerns piece.

InfoQ: What’s in the roadmap for the specification? What would you like to see?

Cormack: We have some ideas, but we don’t have a roadmap as such. Part of the aim is to let the community drive where they want to go and fix the problems people have got, early. There are a few things that I would like to do, like not having to specify a specific version number in the Compose file. Instead, I’d move toward a feature-driven approach, where we look at the features you’re using and ask: “Does that work on this platform?” And if not, return a helpful message saying: “This feature is not supported on this platform.”



Apple Releases iOS 13.5 With Exposure Notification Beta and Best Practices Sample App

MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

The latest release of iOS, iOS 13.5, includes beta support for the Exposure Notification API Apple defined jointly with Google to enable contact tracing apps. Apple also published a sample app to showcase best practices in contact-tracing apps.

Besides providing the actual framework implementation, iOS 13.5 includes the basic user-facing authorization mechanism that lets users opt in to or out of contact logging. No apps using the Exposure Notification API are available yet, which is why the Exposure Logging setting shows as disabled. When users install a third-party app developed by a public health authority, they will be given the option to opt in or out.

To make things easier for third-parties wanting to build a contact tracing app, Apple has published a sample app providing a reference design. In addition to showing best practices when using the framework, the sample also includes code to simulate central server responses to implement diagnosis key sharing and exposure criteria management.

Apple’s sample app checks on launch whether the user enabled exposure logging and asks them to enable it if they have not. Contrary to what usually happens with other privileges managed by iOS, the Exposure Notification API provides a mechanism for the app to explicitly trigger the authorization flow through the ENManager singleton:

static func enableExposureNotifications(from viewController: UIViewController) {
    // Asking ENManager to enable exposure notifications triggers the system authorization prompt
    ExposureManager.shared.manager.setExposureNotificationEnabled(true) { error in
        // Tell the rest of the app that the authorization status may have changed
        NotificationCenter.default.post(name: ExposureManager.authorizationStatusChangeNotification, object: nil)
        if let error = error as? ENError, error.code == .notAuthorized {
            // The user denied authorization: suggest enabling it from Settings
            viewController.show(RecommendExposureNotificationsSettingsViewController.make(), sender: nil)
        } else if let error = error {
//...
        }
    }
}

Apple’s sample goes as far as asking the user to enable the service twice if they deny permission on the first request. This is obviously not a requirement, but it hints at Apple considering this an accepted practice, which could mean such an aggressive strategy to gather user permission will not be considered cause for rejection in the eventual App Store review process.

The app stores all the data it logs locally, including a flag telling whether the user onboarded or not, data about any test they took, whether they shared it with the server, and so on. Not requiring all user data to be stored on a central server is a key feature of the Exposure Notification protocol, since storing it locally preserves user privacy.

Only when a user is positively diagnosed with COVID-19 may they decide to share that with the central server. This requires the app to retrieve a list of diagnosis keys – which in turn requires the user to provide explicit authorization each time – and send it to the server:

func getAndPostDiagnosisKeys(testResult: TestResult, completion: @escaping (Error?) -> Void) {
    manager.getDiagnosisKeys { temporaryExposureKeys, error in
        if let error = error {
            completion(error)
        } else {
            // In this sample app, transmissionRiskLevel isn't set for any of the diagnosis keys. However, it is at this point that an app could
            // use information accumulated in testResult to determine a transmissionRiskLevel for each diagnosis key.
            Server.shared.postDiagnosisKeys(temporaryExposureKeys!) { error in
                completion(error)
            }
        }
    }
}

The sample app also uses a background task to periodically check exposure for users with no COVID-19 diagnosis. The background task identifier must end in .exposure-notification, which ensures it automatically receives more background time to complete its operation. Additionally, apps that own such tasks are launched more frequently when they are not running. The background task calls the app’s detectExposures method to check if the user was exposed, and re-schedules itself:

BGTaskScheduler.shared.register(forTaskWithIdentifier: AppDelegate.backgroundTaskIdentifier, using: .main) { task in
    
    // Notify the user if bluetooth is off
    ExposureManager.shared.showBluetoothOffUserNotificationIfNeeded()
    
    // Perform the exposure detection
    let progress = ExposureManager.shared.detectExposures { success in
        task.setTaskCompleted(success: success)
    }
    
    // Handle running out of time
    task.expirationHandler = {
        progress.cancel()
        LocalStore.shared.exposureDetectionErrorLocalizedDescription = NSLocalizedString("BACKGROUND_TIMEOUT", comment: "Error")
    }
    
    // Schedule the next background task
    self.scheduleBackgroundTaskIfNeeded()
}

As a final remark, the Exposure Notification framework provides a way to estimate risk each time a contact is detected, taking into account when the interaction took place and how long it lasted, based on the detected device proximity. The app can alter how risk is estimated by providing an ENExposureConfiguration object, which will usually be sent by the server, and eventually calls finish to update the local store and complete the search. The ENExposureConfiguration object supports parameters such as a minimum risk, transmission risk, contact duration, days since last exposure, and a few more.

The ENExposureConfiguration object is passed to the ENManager singleton’s detectExposures(configuration:diagnosisKeyURLs:completionHandler:) method. For each detected exposure, the app can get additional information using getExposureInfo(summary:userExplanation:completionHandler:):

Server.shared.getExposureConfiguration { result in
    switch result {
    case let .success(configuration):
        ExposureManager.shared.manager.detectExposures(configuration: configuration, diagnosisKeyURLs: localURLs) { summary, error in
            if let error = error {
                finish(.failure(error))
                return
            }
            let userExplanation = NSLocalizedString("USER_NOTIFICATION_EXPLANATION", comment: "User notification")
            ExposureManager.shared.manager.getExposureInfo(summary: summary!, userExplanation: userExplanation) { exposures, error in
                    if let error = error {
                        finish(.failure(error))
                        return
                    }
                    let newExposures = exposures!.map { exposure in
                        Exposure(date: exposure.date,
                                 duration: exposure.duration,
                                 totalRiskScore: exposure.totalRiskScore,
                                 transmissionRiskLevel: exposure.transmissionRiskLevel)
                    }
                    finish(.success((newExposures, nextDiagnosisKeyFileIndex + localURLs.count)))
            }
        }
        
    case let .failure(error):
        finish(.failure(error))
    }
}

The Exposure Notification API requires iOS 13.5 and Xcode 11.5.



Tech Giants Shift to More Remote Working for the Long Term

MMS Shane Hastie

Article originally posted on InfoQ. Visit InfoQ

As the impacts of COVID-19 continue to be felt around the globe, and many tech industry employees get used to working from home, large tech companies are making long-term decisions about allowing and encouraging their people to work remotely.

Facebook announced a 10-year plan to move most of its workforce to remote working. Twitter is encouraging its employees to work from home forever. Shopify is going all-in on remote work, with no plans to bring people back to working in offices.

These changes will have a lasting impact on the companies, their employees, and the communities they currently have offices in. Facebook used to pay a $15,000 bonus for people to move to within 10 miles of its headquarters. In an interview with The Verge, Facebook CEO Mark Zuckerberg said they are immediately opening up remote hiring for all new roles, and that they will then allow existing people to request a shift to working remotely at some point.

A USA Today examination of the topic says that

Such a shift might also amount to a repudiation of the notion that creative work demands corporate campuses reminiscent of college, with free food, ping pong tables and open office plans designed to encourage unplanned interactions.

The result could re-imagine not just Silicon Valley but other cities as the companies expand hiring in places like Atlanta, Dallas and Denver, where Facebook plans to open new hubs for its new, mostly remote hires.

Microsoft has extended its current work-from-home directive until at least October 2020, and is exploring options for the longer term. In an interview with the New York Times, CEO Satya Nadella expressed his concerns about the potential loss of the social interaction and community elements of being in person:

What does burnout look like? What does mental health look like? What does that connectivity and the community building look like? One of the things I feel is, hey, maybe we are burning some of the social capital we built up in this phase where we are all working remote. What’s the measure for that?



Article: WebAssembly at Sentry – Q&A with Armin Ronacher

MMS Armin Ronacher Bruno Couriol

Article originally posted on InfoQ. Visit InfoQ

Sentry sees great potential in WebAssembly and uses it internally in the context of its ingestion system. However, further usage is hampered by WebAssembly’s limited capabilities when debugging in production.

While proposals exist to make the DWARF standard debugging format work with Wasm, more work and better tooling are necessary.

InfoQ interviews Sentry’s Armin Ronacher.

By Armin Ronacher, Bruno Couriol



Presentation: Should We Really Run It if We Build It?

MMS Paul Hammant

Article originally posted on InfoQ. Visit InfoQ

Transcript

Hammant: Nowadays, I like to think that I’m the guy that speaks most and loudest on trunk-based development. If anyone has a need to get me into their company to drag them into that – as well as CI, as well as CD, and all of that stuff; I think it sits on TBD and monorepos – then they can call me. In terms of InfoQ, Floyd contacted us at ThoughtWorks, and I was volunteered to maybe write an article. We did one based on a case study that I and Ian Cartwright had done in fulsome in the U.S. I just think it’s interesting that I’ve come back to the InfoQ family to present for the first time, when I was there at the beginning, writing an article which you can still find.

Premise

The idea is we’re in startups. There’s possibly a good chunk of the people present here today who are, or have recently been, in a startup. Then you move through a scale-up. Then you’re a legacy company, presumably with thousands or millions of clients, watching your rearview mirror as somebody nimbly overtakes you because they’re better. Your focus, for now, is on the startup phase. Your CEO has told you that he’s heard this war cry that if you build it, you can run it. You think, “Maybe we can.” We have an imagination of what that’s going to be like – “I’m going to go home and then there’s going to be an alert” – and you hope for the best, maybe.

We should think about the rationale. When the CEO says, “Why are we going to do it?”, it’s because we’re proud of what we make. It’s also maybe a focus area for us as developers: we want to eliminate all the defects, and the best way to be focused on that is to be exposed to the defects as they happen, rather than thinking of them as an over-the-fence thing. The fence can quite often be dev to QA, going downstream, but the fence is upstream from us too – BAs, classically, for feature roadmaps and backlogs. We also have a fence between us and the people using the software in production, and how they get something to each of us to fix. If we’re exposed to things as they happen, then maybe we can make it good. The rationale seems solid.

There’s some poison here, and this one, for me – I can’t tell you how much I hate Slack. Honestly, every company, every client I have is trying to drag me into Slack. There are maybe three, four, five Slacks I have to ping between, and then I’ve stupidly done it with different logins for different companies. The whole pinging around is hard, and it’s fraying my nerves – the alerts that come through Slack and all those channels. It’s like somebody’s hailing me here, and it’s like, “No, I’m confused. I went into the wrong channel. That was from a week ago, and I missed it a week ago.” There’s no doubt about it, you can make Slack ops work for your company, especially for supporting the production system. Slack certainly works for impromptu development communication, especially when you’re not all collocated, which would otherwise be the extreme programming ideal.

Slack isn’t even the only way. A startup I was involved with was using WhatsApp in exactly the same structure. They had a Slack, but they were only using that for planned work. Think of a stream going downhill, with planned work being dropped in at the top, flowing through as fast as we can manage – which is why the Lean-Agile community talks about flow. The stream is not water; it’s made of molasses. If somebody’s chucking defects in there – rocks – and they’re being thrown into the stream midway, they’re slowing down the flow. So we have a separation where we’re using Slack for planned work and WhatsApp for unplanned work. That’s two different systems that can ping my phone 24/7, and my nerves are going to be even more frayed – personally, and I don’t think I’m speaking just for myself. The poison can be just the interrupts. Then you have family, and then there’s a work-life balance problem, and pressure to not attend to your phone.

Burnout

We risk burnout. Of course, I think the clue was in the setup for this: we fear that through a “build it, support it” ethos, we might be pushed to the place where we just collapse, because there are too many incidents, too often – and there are five working days but actually seven days in a week. Depending on the nature of our business, we’re supporting the stack, the solution, and the users seven days a week, or not. We thought, in agile – for extreme programming at least – that we’d maybe pair six hours a day, then spend another two hours a day attending meetings and doing email and stuff like that, and then go home, exactly at the eight-hour mark. We’re in the U.K.; luckily, we have seven-hour working days, so we just adjust that down a little bit.

Startups, especially when there’s a buzz going on for the thing you’re building and a race to beat some known or unknown competitor to market, you might find yourself working up to 10 hours a day. In some cases, I think, back in 2001, just before the downturn, I was doing 16 hours a day. You can’t sustain that. We might observe that 10 and 12 are regular, and then if we’re doing support too, it goes north of that. At my age, I can’t do that anymore, and I’m not even sure I could have as a 20-year-old. If I thought I could, I was kidding myself, and not everyone can.

Quitting

Quitting is your bad outcome. You could plan to quit: work a month or two, sharpen up your CV, go to interviews on the down low, and then hand your notice in. I’m going to work my notice in the U.K.; not so much in the U.S., where you’d be black-bagged the same day. You’re going to find a new employer, and you’re going to have some relief in that. The quitting could also happen in a moment, because you’ve reached a breaking point. It could be because the only rational course of action at that particular moment seemed to be to walk out: a shouting match was happening over an incident, or immediately after the incident, and you hadn’t been home, or whatever it is. It can get that bad in startups. You disappear for a regular job, and you’re here at a new place. It is a reset. A change is as good as a break. Maybe one of the reasons you joined a startup was because you really super love their value proposition and even the technologies being used, and somehow, in the induction cycle, you were persuaded to take stock and not just salary. I call those two things together compensation.

There are eyes-open ways of going into that, running a spreadsheet to see, “I’m going to move my salary down from here to here, and I’m going to take that amount as options. Those will come in one year or four years later, and they’ll come in in tranches, but I must remember to get a checkbook out to actually buy them at that stage at the strike price.” If you’re going to quit, it could easily be ahead of that schedule. You might actually be disappearing and losing some of your delayed compensation. If we think that maybe 9 out of 10 startups fail anyway, what if that was one of the ones that was going to make it, and you’re not leaving a sinking ship? It’s a ship that was going to make it, and you were the one that bailed early. We have all left companies, hopefully for good reasons, but hopefully not too many of us have left companies that went on to become unicorns, leaving us struggling to make the rent because we bailed too early. Those are bad lessons that would sit with you for the rest of your life: if you’d just stuck it out another month or two, you’d have been fine.

As with all people that exit bad situations, whether it’s a bad support situation or a bad development situation because your team leader or your architect is yelling at you or telling you not to test really bad things, quite often the people that do leave are the most talented. They do so quietly and they do it the planned way. They leave the organization to the people that are willing to put up with more shit and more misery, who have maybe, throughout their life, been acclimatized to being yelled at. Those are the people that stick around, and they should also leave if it’s a really poisonous outfit. The quitters can quite often be the most talented.

Remedy

There’s a remedy, and the remedy is known to established organizations already. The people that have gone past scaleup and are now the incumbent legacy company have a proper three-line support. People argue about what is level 1, what is level 2, what is level 3, so let’s just say there can be some variation. We’ll say that, classically, the development team that would support a stack is actually level 3. The main life is that we’re developers in a dev team, we have multiple functions within the team, we have a nice sprint-centric way of working, and all the candy upside of living in a dev team, but we also do support.

Support is unplanned work that comes in. In the stream metaphor, it’s the rocks dropped into the stream. If you’re lucky, they come into your stream during a working day. Then you can work on them as they happen, maybe dropping something else to work on that rock that’s just fallen in the stream. Maybe that’s all good and you still go home on time. We’re really worrying about the ones that come out of hours. In a high-functioning organization that’s structured for long-term delivery success, there will be a bunch of support professionals. I think that’s maybe the historical industry rollup title for these people. There are plenty of modern ways of phrasing it, but they might have a career in that. They may never have done any of the other roles within a large IT organization; they may have focused on that one.

Level 1, depending on who you are, is bots or account representatives and things like that. That’s the first alerting system that something could be wrong, one that hasn’t involved a human. Triage would go through level 2. Then, if it’s midnight and we need to call somebody else, it’d be somebody from level 2 that does it, hopefully, in this equitable setup for how we do support, rather than the Slack-ops or WhatsApp-centric way of doing it.

If we dwell on level 2 a little bit more, we could say level 2 staff respond to the users, and we don’t always know who the users are. They can be internal or external. The users could also be companies rather than humans. Level 2 are awake and alert, hopefully, at any time an incident is happening, where the developer, who would be level 3, might be happily asleep. They draw on a body of knowledge, and they can resolve issues themselves. The best-case scenario is a team that’s been skilled up with, if need be, SQL server access on a temporary-password basis and some SQL skills to run known remedies for frequently or infrequently appearing incidents before they’ve been fixed properly by developers. They have their own systems, their own software, software that we don’t use. If we live in Git and IntelliJ or another JetBrains product and Jira, they don’t live in those tools. They might subsequently slot something into Jira, but they don’t live in those tools for their operation. One of the human aspects that maybe we have all institutionalized now, in whichever part of the IT world we’re in, is that we have respect for the people that feed us stuff from the support team, and they hopefully have respect for us as they feed us things. It’s not always the case, but we should have it if it’s working well.

Contrasting level 2 and level 3: level 2 is 24/7. Maybe if you’re a business selling to businesses, it’s five days a week. If Singapore is one of your marketplaces and you’re sitting in Hawaii, then maybe it’s six days a week, because Saturday or Sunday is debatable depending on which side of the International Date Line you are, and maybe your hours are not just regular New York hours, maybe it’s a long day. Depending on the nature of your company, even its age a bit, you might actually decide not to do 24/7 when you should, because you’re actually a startup and there are only eight developers. You just can’t support it, so you’re going to take a risk, during your startup phase, that the smaller number of users who are awake at midnight using your booking application and might encounter something are going to be ok, and that nothing is done about it until morning, including downtime, because you only have eight developers. If you’ve hockey-sticked, then you shouldn’t be in that situation. Everything about growth of a startup through scaleup is acquiring the personnel and their practices on a just-in-time basis for when they’re needed. Maybe part of what I’m trying to tell you today is that sometimes the support might have to come a little bit earlier if we want to do it right.

Level 2 has run books. That’s a very historical term for a body of knowledge that is curated over time: growing, trimmed, tweaked, fixed, with entries eliminated. It could be a Word document, and it talks about what we do when this particular incident happens. Part of the onboarding for level 2 staff would be to train through that. They have investigative powers. They take tracker issues, and then they might slot some of those into a backlog tool. They might have a historical tool, which will be a trouble-ticket system, and devs might prefer to work out of Jira or Trello. In some agile teams, there’ll be transcription: someone in a project manager role or scrum master role will be copy-and-pasting stuff from the tracker tool into Jira, and I personally don’t think that’s right. I think, as a dev team, we should be sophisticated enough to look at two queues and work out what we should do in a week. It seems, to me, that transcription from one system of record to another is just a total waste.

The level 2 team can actually have toggles created for them. If you’re an airline and you’re renting cars at the same time as a secondary funding stream, you can have partnerships with rent-a-car companies, with Hertz, Avis, and others, but one of those is going down. Their service has degraded for a little period of time, and it’s affecting the page that would render the aggregation of all of those hire-a-car results. You want to empower your ops team to flip a toggle without asking or calling anyone in executive management or anyone in level 3 support. They can flip a toggle and maybe just send an email saying, “We turned Hertz off.” Then maybe, before everyone gets out of bed and comes in to work, Hertz is turned on again because their own ops team has fixed their issue. That would be a classic case for a toggle that you’ve configured to work at run-time. As developers, maybe we understand that toggles have a lot of uses, but that was a toggle that we put into the application, not in the end-user application, but in the admin console for the running stack, to toggle something off. In the run book, we told the operations team when they could make a decision without asking anyone in leadership.
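
To make that concrete, here’s a minimal sketch of such a run-time partner toggle, in Java for illustration. All names here are hypothetical, not from any particular toggle library, and a real implementation would persist state rather than hold it in memory:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A minimal sketch of a run-time partner toggle; names are hypothetical.
public class PartnerToggles {
    private final Map<String, Boolean> enabled = new ConcurrentHashMap<>();

    // Checked by the page that aggregates the hire-a-car results.
    public boolean isEnabled(String partner) {
        return enabled.getOrDefault(partner, true); // partners are on by default
    }

    // Called from the admin console by the ops team; no redeploy needed.
    public void set(String partner, boolean on) {
        enabled.put(partner, on);
    }
}
```

With something like this behind the admin console, “We turned Hertz off” is just `toggles.set("Hertz", false)`, and flipping it back on in the morning is equally cheap.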

Level 3, hopefully, does working-hours stuff only. That would mean that your first priority, if you’re first in on any morning, would be to look at the queue of things needing immediate attention that we’ve gambled can wait four hours to be resolved. That’s a nicer situation than calls through the night. We can help the ops team with their body of work, their run books. We can participate in those activities, sometimes called path to production, long before we’re actually in production. We have interviews with the ops team that would support new applications that are going to slot into an array of applications: what would it be to support this, what would they need, and what can we do for them now, as development, to make its operability better. We can, as mentioned earlier, take work directly from the tracker if we need to work on it. If that’s been settled on as the mechanism for giving stuff to developers, then we should stick to it and hold on to it, because if we stop attending to the tracker, some project manager starts copy-and-pasting into Jira, and now we’re in that wasteful place, with double references, because you’ve got to point the Jira at the tracker ticket number, and you’ve got to point the ticket number at the Jira. Those two tools don’t necessarily have hyperlinks to each other and aren’t single sign-on, and all sorts of corporate problems like that. We end up with a more supportable system if we’ve demarcated correctly between level 2 and level 3.

B2B

Then there are the types of companies. Cassie passed me a note saying, “Make sure that you differentiate between the types of business.” In B2B, you maybe don’t have end users that you focus on most. You mostly focus on their account manager or their partner, who would call you and operate as maybe level 1 in some regard. You have your account manager who deals with their account manager, and either one of them could be in a persuasive moment with you that you should drop everything to work on an issue, especially if it’s a whale customer. The mechanics of funding for how a company acquires customers who are other businesses are curious.

There’s a division between op-ex, which is how we pay for all the regular developers making the base platform, and the onboarding effort, sometimes called pro services, where the funding streams are quite often chargebacks when it counts. If the company is big enough, they’re a whale, you’ll just comp all of the customizations they want for your stack and say, “We’re happy to have you as a customer.” If they’re small, you’ll make them pay-to-play. All of that, the nature of their onboarding, the size of their account within your larger corporate deliverables, drives your attention to them in a support cycle. Maybe the minnows who pay to play are going to be attended to at 9 a.m. when the first developers come in to work and have a look at something, presuming there’s nothing of higher priority. If it was a whale, you might be up at 3 a.m., you might not get to bed again until 7 a.m., and you might not come in that day. The criticality is sometimes driven by the financial importance of the client to the company.

B2C

B2C is not that, right? It’s people with phones. In this age, hockey sticks can happen whilst you’re asleep. You went to bed with 10,000 customers, you woke up with 100,000. It can be driven by news cycles or just word of mouth if your application is good enough. There are renames that happen within the support world for what the support engineers are called, even how engineering-focused they are versus how support-focused they are. I think one of the latest renames is to call this CX, customer experience. It’s confusing to us developers, because we keep thinking of UX for user experience, and then we’re all confused because our little reptile minds can’t hold too much at the same time. Either way, CX seems to be sticking, and it’s become a field of science in itself. There’s a whole bunch of training around that, a whole bunch of expertise that developers can’t match but can learn.

Platforms change too when you’re dealing with millions of customers. You might, as a startup, be speaking to the likes of Salesforce to deliver some of your customer support operation, and you might be dealing with having to customize Salesforce to your needs as you integrate it. There are 20 or 30 different companies you could bring into your startup, and they’re going to be quite cheap for you to consider at the outset. As a truly sophisticated company, I think Gartner wrote this, if you’re at the top of your vertical, you’re probably going to write all your own software. That may have been true 10 years ago, but now, in the era of Mongo and a whole bunch of super-advanced databases, maybe you’re not writing your own software anymore. You’d have to be Google to do that. Either way, support changes when there are multiple channels: the phone has a little rectangle or a button or a hamburger fly-out that embeds widgets that came from that vendor, from Salesforce or another; or it could be the classic “I’m calling a call center support line”; or, for us devs, we’re also going to look for an email way of reaching the support agency for the piece we’re complaining about, because we want to do it asynchronously and get on to something else. It does change whether you’re B2B or B2C.

Systems

Systems, other than Salesforce and the like, would classically be PagerDuty, and there are about 12 vendors that compete at this level. They can strike you as quite expensive, but they’re not when you’re a startup. They would worry you if you had 10,000 people that you’d engineered into PagerDuty and were paying a monthly fee with only modest breaks for volume; then maybe you’d think about your own or some other solution, or changing vendor if you’d gotten into bed with the wrong one and it was going to be costly later. You don’t care as a startup. You never try to save the CEO’s money by writing something yourself. You’re always trying to get to the goal line quicker and spend as much of the CEO’s money as you can, and have them tell you no rather than assume the no was going to come.

I foolishly, when I was a director of engineering, went and wrote my own extension to Confluence, and Confluence doesn’t have extensions. There I am, hacking this joyfully after hours, my wife complaining that I should have been paying more attention. I made this JavaScript thing over several pages, and it would load an adjacent page, read data from the HTML, and treat that as a table. I was dropping query functions in it, and it would go and render something on a just-in-time basis. It only lives on in screenshots, so I should have a blog entry. I went and rolled it out, and nobody used it at all. It’s like, “Yes, that was just a blog, wasn’t it?” The clue for me was that I was actually banned from developing on the first day of being hired as director. They hadn’t told me this in the interview, and I might not have taken the job if they had, but it was probably the right decision. I would do the same now if I was in the same position with somebody else.

Do Not Disturb

One of your problems in any support cycle, no matter whether it’s Slack ops, WhatsApp, PagerDuty, or Salesforce plugins, is that a developer is going to have a phone, Android or iPhone, and they all have the same feature, “do not disturb,” which allows you to sleep. It doesn’t seem unreasonable to me that we allow developers to have that, even on the night they’re on-call, if there’s a rotation; the previous screen had an implicit rotation. This applies whatever tool we’re using, even if, in your first month of operation, that’s a human running a spreadsheet of who’s on-call. If they’re in a pool, “Ok, what’s the pool’s number? Who has the pool’s number?” Then you call the pool. Be aware that they don’t want to pick up. If it’s critical, you call again within 30 seconds. At some moment, after one, two, or three calls, you’ll bust through the “do not disturb” feature of the iPhone or Android. Then, hopefully, [inaudible 00:24:14] and they can pick it up and say, “What’s up,” stifling a yawn, “Where’s my coffee?” Then you’re on it. Then you’re back into Slack, where the incident has been running for five minutes now, or in email, or there’s a bridge line, and you’re now going to be speaking to a bunch of people in your company, and you’re the sleepy developer. It seems fair that we allow people to utilize “do not disturb” to ease their work-life balance, even when they know they’re on-call.
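
As a sketch of what that repeat-call escalation might look like in code, here’s a hypothetical Java loop. The Notifier interface and the retry numbers are illustrative assumptions, not any real PagerDuty API:

```java
// A minimal sketch of an on-call escalation loop; all names are hypothetical.
interface Notifier {
    boolean call(String phoneNumber, String message); // true if answered
}

public class Escalation {
    private final Notifier notifier;

    public Escalation(Notifier notifier) {
        this.notifier = notifier;
    }

    // Phones typically let a repeat caller break through "do not disturb"
    // after two or three calls in quick succession, hence the retry loop.
    public void page(String onCall, String backup, String summary)
            throws InterruptedException {
        for (int attempt = 1; attempt <= 3; attempt++) {
            if (notifier.call(onCall, summary)) {
                return; // they picked up; the incident channel takes over
            }
            Thread.sleep(30_000); // call again within 30 seconds or so
        }
        notifier.call(backup, summary); // nobody answered: escalate past them
    }
}
```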

Preparedness

We need a rotation, and you need to set it up early, perhaps. If you hockey stick overnight and, suddenly, the incidents have gone up from one every three days to four a night, you’re in the position where you’re pushing people immediately to the burnout place. There are not many things you need to do in dev before your hockey stick; this is one of them. I think if you’re in a PHP stack and you’re about to hockey stick, you probably should have solved your PHP problem before you hockey stick, although Facebook didn’t, they just made the PHP solution work. There’s another idea of a rotation from dev into the support team, or the nascent support team. You second somebody for a month: you go and sit with them and be part of their team. “No, you won’t have to be awake through the night shift. You can do the day shift, because there are three shifts. You’re going to listen to every incident, you’re going to help work out where the systems are deficient and the bits that dev could do for them,” or they really are just going to learn the way support does stuff and be able to do it as one of them. XP does this already with QA automators and devs; when you pair one with the other, the distinction is somewhat blurred, and we’re trying to apply the same idea between dev and support, especially if they’re local. Can we slot into their team for a month to see how they work? Can we improve the way the system performs for them? There are times that I’ve seen this done, and it’s not me who’s done it, it’s me observing really successful companies operating this way, and it really works. It was called a SWAT team when I saw it, and it was super impressive, almost to the point where somebody could say, “I don’t want to rotate back to being a developer. I want to stay in this team.” In our quiet moments, we’re doing fire drills, all the stuff that Netflix was doing with yanking cables and sexy open-source product names for that same thing. This company was doing chaos engineering too and just calling it a fire drill. There’s plenty of fun to be had. Somebody that previously thought development was their career track could take a year off, be in the support world, and still do just fine.

Measuring

We all know about blameless postmortems, hence the picture on the right; it’s the best I could find. We want to do things blamelessly, because if we don’t, habits change. In any part of your life, if somebody in a position of authority yells at you, even once, you modify your behavior for every subsequent meeting with that person to avoid being yelled at. All of that defensive behavior is just padding. In the history of agile, we’d say, “How long is that going to take?” You’d say, “Eight weeks.” You go, “I think it’s only a five-minute job,” because everyone you asked padded it by 50%, and then it got all the way to the CTO and suddenly it’s a month’s work. We pad stuff in order to be defensive so that we don’t get yelled at again. If we don’t make it safe for people to fail, and that includes during postmortems and blameless autopsies, if we don’t keep it safe for people to fail or for systems to be broken and for humans to subsequently resolve it, we change the behaviors of the team to be subsequently malfunctioning or persistently malfunctioning. There’s some folklore about monkeys on a ladder, some bananas, and a firehose, which is a great story, but it turns out not to be true.

We should maybe think about the run books as living documents rather than Word files that are emailed around. That’s the worst. Everything that’s emailed as an attachment automatically has 100 versions, and nobody is sure which is the most current or most valuable, or the definition of value changes based on who you ask. Can we have a system of record that actually allows for updates with a bit of an audit trail? That could be Confluence; it seems to be one of the Atlassian products where I get what it does. We want to maybe audit the operations too, not just the curation of the run books. If the operations team has temporarily gained access to production to go and do a data fix with a SQL UPDATE statement, how do we audit that? Is that in a place where we can turn to auditors later, if we have them, and say, “That’s what happened here with the permissions around that”?

We might actually, as a software delivery team, be writing systems that are custom in-house, not from Salesforce land, that do parts of our auditing cycle, to allow us to stand in front of an auditor in the future, that could be Ernst & Young, or allow some member of our team to stand in front of a judge and jury in the case of stolen assets or something like that and say, “We did our best, and we had a provable trail.” Then we measure everything about it, because it could inform how we tweak everything to be better the next time. We stick with the positivity, avoid the negativity around incidents, keep it calm and rational, have somebody run the incident, and then close it off. Pat was mentioning in the keynote that plenty of things can just be forgotten out of the back of retrospectives. The same is true of blameless postmortems. We have to make sure that we can work forward with the stuff that we take out of the back of those and action it. Maybe that’s a feature in the planned backlog rather than something that’s slotted into, or remains in, the unplanned backlog, especially if it feels like work that saves future incidents rather than work to solve a specific incident.
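
As a sketch of what one entry in such a provable trail could hold, here’s a hypothetical Java record; the fields are illustrative assumptions, not a prescribed schema:

```java
import java.time.Instant;

// A minimal sketch of an audit record for temporary production access;
// field names are illustrative assumptions.
public record AccessAuditEntry(
        String operator,     // who was granted temporary access
        String approvedBy,   // who signed it off
        Instant grantedAt,
        Instant revokedAt,
        String statementRun  // e.g. the SQL UPDATE used as the data fix
) {}
// Entries would go to append-only storage so the trail itself can't be
// quietly edited after the fact.
```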

Staffing

You’re a startup or a scaleup, and you’re wondering, “How are we going to do this with seven developers who are covering front end and back end, where one of them is doubling as scrum master and there are no QAs?” Not to mention any names, in particular. The answer is you could borrow this as an elastic resource from a company that offers support on this basis, using your tools, while supporting other companies at the same time. They can have a follow-the-sun model, meaning whichever eight-hour segment of the day you’re particularly worried about is daylight for them. When they pick up the phone, they answer with your company’s name, even though on the next phone call they might say, “We’re Cisco,” and then, 10 minutes later, they’re Oracle. That could get you through a tricky period where you don’t want to take people out of the highly trained and aligned dev force.

You don’t want to diminish that, but you need somebody to pick this up so that we’re not all called through the night and there are people who can solve incidents. You may not have a multiyear relationship with that company, though maybe they hope you will; it’s ok to have some company fulfilling this function for six months until you are hockey-sticking and your funding levels are increasing because money is coming in through rounds of funding from venture capitalists or others. Then, suddenly, your buying power for additional staff has increased, and you could maybe think about hiring the first people to lead that group, or replace that group, and count them as in-house. If you’re sophisticated enough, you will run it in-house in time, according to that Gartner report. Then you don’t have to do 24/7 straightaway. You could do 18-hour days, 5 days a week, and count that as just peachy, even if your usage cycles are longer than that. It’s a gamble. A lot of what we do in development is a gamble if you’re chasing a delivery date with scant resource.

Success

Success is none of your staff quitting, really. That’s multi-axis. Not quitting because they felt too much pressure is one, but there’s another not-quitting, which is the people who were there at the beginning sailing with you all the way through to your acquisition or your flotation. That team photo is interesting: the one from the first day, with the founding team, and the one from flotation don’t always have the same people in them. Maybe only one or two, and one is the CEO. Most of the tech staff have been rotated out, hopefully cashing in their options. It’s interesting: if you make an energetic enough company where the software is so compelling, and we’ve heard speakers already in this track talk about how great their companies are, and it’s very believable, then if every aspect of it, development, testing, which is downstream from development, and, upstream, the support aspect, none of those are odious or cancerous, you’re more likely to stay the course. People actually look for reasons to leave a company. If none presents itself, then the easiest thing for you to do is just remain in that company, provided they are meeting all your compensation asks, your, “Hey, I’ve grown with you. I’m no longer the post boy. Can I now be director of engineering?” Maybe the answer can be yes in many companies.

Success is multifaceted. Your staff not quitting is one facet. Success is also maybe secondarily measured, and I’ve seen this in companies, again, not something I personally engineered, but something I’ve witnessed as a major triumph, by a support engineer sitting amongst the developers and being treated as an equal. The more separation you have, the further away they are, the easier it is for you to have some animosity for them. I started coding professionally in the ’80s, and you would hear conversations about somebody slamming down the phone and then, Brits at least, giving expletive-driven accounts of what happened and calling somebody’s parentage somewhat questionable. That’s normal. No, it wasn’t. That was nasty. If that person knew you were speaking about them like that, they probably wouldn’t pick up the phone to you again. Guess what, when they slam the phone down on the ops side, they’re probably yelling at you about your … In the early ’90s, we were moaning about only having 4 megs of memory in a 286 PC or something, and we wanted 8 megs of memory. Anyway, things change. Those problems were solved.

If we’ve improved the situation between the people doing development and the people doing support, there are friendships. You link in with these people, you look forward to working with them again. There’s rotation between both of those groups. These social functions aren’t just staged; they’re genuine, like, “Come along,” or “We’ll wait for you to fix Velma’s RAM.” We can be in a position where we’re friends with these people, and that’s what we should engineer within startups and scaleups, even if it’s a vendor. If it’s a software supplier vendor, we should hope to have the same relationship with them as we would have with our own staff. It shouldn’t change. We want it to be fair for them too. If we farm out all sorts of support and we’re just keeping dev in-house, we don’t want them hurting like our staff, as level 3, would hurt in the same situation. Success is multifaceted, and I think it’s certainly attainable. I’ve seen it done a few times, and I think it’s very possible for anyone to engineer that for the startup they’re at least influencing during its initial period.

That was my last slide, and Cassie wasn’t giving me any move-along signals, so I can’t be too far off schedule. I think the answer is yes, we should support it if we build it, as long as we do that fairly. Any questions?

Questions and Answers

Shum: Actually, one of my questions is around the blameless postmortems. What if you’re part of an organization that doesn’t have that trust? How do you introduce that?

Hammant: Yes. That’s changing culture, isn’t it?

Shum: Yes.

Hammant: I mean, it’s very difficult. If you can get enough pats, and we have plenty of friends who could do this, if we get somebody who can just watch the postmortem go through and then pass a few comments very constructively, then maybe you can move them a little bit to a place. I mean, if they’re already doing nasty postmortems, the likelihood is that there’s a problem in the culture. If you had a bad message to give to somebody in a position of power, they’re not going to take it. They’re going to reject you. One of your problems is that organizations that have gone bad are very difficult to drag back to good. If you start good, let’s say, I obsess about build times, the startup I was involved with, the first build time was 40 seconds, including tests. I made an announcement, “This is ours to lose now. We stick with this and the CEO never yells at us. If we get worse, then we get in a perilous place.” There are some things you can do from the larger agile people-friendly way to doing things that are easy if you’re one of the first people in the company, and the business is a good sponsor for this, because actually, their goodwill is required for you to make any changes to the way we work. There are times when it’s going to happen quite easily, just from the setup moment. If the corporate culture is already bad, it’s going to be very difficult for you. It’s going to be a collective effort or it might require a change of guard at the top, like a new CTO comes in and says, “Hey, I’m from Netflix. Let’s do this the Netflix way.” Then you’ll go, “Phew.” There’s a chance, at least within the 100-day rule, that that CTO might bring in some changes. Any other questions?

Participant 1: It’s a challenging topic and some of your ideas I can relate to. One of the things that I wonder is, by doing a level 2 or level 3, aren’t you transferring the problem to the level 2?

Hammant: That’s really good, and I should have worked it into the talk. There’s an unspoken problem within our industry around risk and responsibility. There’s the business, who has a need for you to do something very cheaply and quickly and perfectly. Then there’s you as a deliverer of that. Sometimes the business asks you to take responsibility for doing something, and maybe change the way of doing it, but they don’t necessarily take back an equal measure of risk. Quite often this should be an exchange of risk and responsibility, but the business is asking you to take both. It takes some maturity to realize what’s going on. Maybe the premise was there: “We build it, we run it. I’m taking risk and responsibility.” I think you have to make sure that a change within an organization to start considering level 2 support, and the need for it, is part of a larger exchange of risk and responsibility. You can do many exchanges of those, but you shouldn’t have one side of the arrangement take both. You should be able to call it out safely and say, “Hey, I seem to have both risk and responsibility here, which means I’m going to be the hero for a few minutes, and then I’m going to mess up one day and be yelled at.” You don’t want that.

Participant 2: A question around support and getting escalated issues when you are not out of hours but during development. You’re in a development team, you have maybe a sprint goal that you’re trying to hit, you have a standup thing, you said, “I’m going to get this done by the next standup,” and then something comes up, or you are watching the dashboard of “Here are the production issues,” at the same time as you’re trying to do development. You’re expected to do both at the same time.

Hammant: No, that’s risk and responsibility, again. If you’re in working hours, you have a scrum master, scrum mistress, project manager, or coach present whose job is to watch the queues. Your job is to keep your headphones on if you’re [inaudible 00:41:49], or your wide bench for you and your [inaudible 00:41:53] if you’ve got two screens, two mice, one CPU, and you just carry on with your duty until somebody comes up and says, “Hey, when you finish that, can you look at JIRA-123?” That was in the backlog tool. “Could you look at this trouble ticket?” You don’t, as a developer in working hours, keep half an eye on the incident channel.

Participant 2: That would be great, but sometimes you have to rotate that, don’t you? The teams will rotate.

Hammant: Your team might have designed for that, if the project manager said, “I can’t do this because I’ve got meetings.” You might take a rotating duty, yes. One of you, out of the eight of you, is looking at those queues. Somebody else can answer the question here.

Participant 3: Maybe I can help with this. Something we do is have what we call an interrupt role in the team. That’s the person that can be interrupted. Their usual work is not feature work; let’s say they’re looking at tech debt, incoming bugs, or build failures. If there’s an incident during hours, that’s their priority. They’re not part of the capacity planning for that sprint, and it’s a rotating role.

Hammant: XP had the same, didn’t it? It used to have a bug pair.

Participant 4: My question is about the level 3. I’m not completely clear about how you convince the team to do level 3 support, because that also involves being standby, right?

Hammant: On-call, yes.

Participant 4: That also affects your private life.

Hammant: Yes. If it was fair, you were told in the job interview that there would be elements of support. If it wasn’t fair, you were surprised six months in that we’ve just invented this role called level 3 support. It should be something that’s eyes-open, and in your partnership with your wife or husband, and your consideration of the role being offered, you should factor it in. If you’re in an interview cycle, you could ask to speak to somebody who currently does support, to ask them what it’s like. In every interview, you do. In every round of every interview, you do. You should turn the tables halfway through some portion and at least get your own questions answered: what does support look like, especially if you’ve never done it before. It is troublesome to be pinged. I was director of engineering and had to be pinged on everyone else’s support calls at one stage. The bigger my dev team gets, the more calls I get if I’m overseeing every incident. Ok, I was a younger man then.

Participant 4: In your view, that’s also a rotation.

Hammant: If it’s fair and it’s a rotation, then we work out what we need to do there. If we’re accumulating tons of defects and the support incidents around one particular piece of your application are going up and up and up, we have to have a conversation with the business about slowing down the rate of functional deliverables to attend to something that is clearly tech debt, in order to make the software more robust in that regard. If the business isn’t doing that, they’re in that same place of giving you risk and responsibility without a fair exchange. If they allow us to, the best teams are going to attend to that tech debt as they accumulate it, meaning by the finish of any one week we don’t have any tech debt. We’ve adjusted every story estimate, if we’re still estimating, to include the remediation of the tech debt as it accumulates, even if it’s a surprise. One of the things we do around estimating, for points or gummy bears or t-shirts, whatever your estimating model is, is try to think of the average time it takes to do a thing. Sometimes you come in quicker and sometimes you come in slower because there’s some tech debt. If we’re ok with the business and they’re not in that shouty place, they don’t mind if some stories take longer than others, as long as the number of story points done in a week, or whatever the iteration’s rational length is, roughly matches the expectation after a few weeks. We should be attending to tech debt, and that does include making the software more robust in production so we don’t get caught out. Speak to your boss about fairness on tech debt, assuming you’re not the boss. You’re the boss, ok. Bad news. You’re going to have to accept less functional delivery.

Participant 5: What advice would you give to people picking a level 2 support partner? What would you do to make sure they are set up for success?

Hammant: There are hundreds, honestly. Then you sit there and you just want to choose one, because you’ve got other things to get on with. You go, “Ok, which ones have eliminated themselves? Ok, who else is in the mix?” You find Salesforce is gone and these other five are here. Depending on the market, it changes too, because not everyone does the U.K. or South America. His problems must have been super crazy. You whittle the partners down to three, you have some selection criteria, you have an objective interview. You’ll ask them things like, “What is your technology?” You assess them as if they’re a long-term partner, and you’ll make critical judgments about their service too. They could have people that are available, but their software crashed, and you go, “Great, so their downtime is affecting our downtime.” Provided they pass the beauty contest, that process being a beauty contest, one thing remains. I have a blog entry on that one, not for support, for something else. Look it up, it’s called “Like a Used Sofa”.

I think the one remaining thing that goes wrong here is how we integrate them. They have to stand up software, and we have to stand up software. If we’re honest, we have a dev environment, and then, maybe not yet, but soon, we might have a QA environment, which is deployed to less frequently than the dev environment. I don’t mean my personal dev environment, I mean the shared dev environment, after I’ve committed and after CI said it was good. If I’m sufficiently mature, I might have a UAT environment, and I might have a [inaudible 00:48:21], but assume I don’t. I just have dev and QA and production. I want to stand up something with them that we both agree is dev. Separately, I want to stand up something that’s another supportable environment. For me, I’ll call it QA, and maybe it has hardcoded users in it [inaudible 00:48:38], but I want that provision to be separate from the dev provision that my devs are actively coding against. I don’t want the data to be mixed on their side, because it’s not mixed on my side.

When I go live, I want them to stand up one more, totally separate from the other two, nothing mixed, no shared configuration, no shared users, and that one I want to be live and untangled from any of the other issues that I might have been testing. As I bring up a support capacity that involves my own devs, I want environment separations. “Environment” is a canonical name. Most of the time, these partners are just appalling. They don’t know what CI is, let alone Jenkins. They only have one environment, they’ll call it Sandbox, so your QA and your dev have two feet in one shoe and are mingled there. Then somebody changes configuration for dev, which reconfigures QA, and you don’t know if you’re still playing with that. What have you done? You’ve fouled that up. They’re all appalling.

As it turns out, I’m in charge of a new technology called Servirtium, which is of the type of technology called service virtualization. One of the things we’re using it for in a startup is to record the interactions with the service stack and then play them back in a CI loop, so that the build is always green, always passes, and is isolated from the supplier’s sandbox environment, which is up and down like a ping-pong ball and nowhere near the capacity of production, nowhere near the response times of production. You maybe have to employ some dev tricks to make the supplier’s stack more reliable.
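
To illustrate the record/playback idea, here’s a minimal Java sketch in the spirit of service virtualization; the types are hypothetical simplifications, not the actual Servirtium API:

```java
import java.util.Map;

// A minimal sketch of record/playback service virtualization; the types are
// hypothetical simplifications, not the actual Servirtium API.
interface HttpCall {
    String invoke(String request); // stands in for a full HTTP exchange
}

class RecordingCall implements HttpCall {
    private final HttpCall real;
    private final Map<String, String> tape;

    RecordingCall(HttpCall real, Map<String, String> tape) {
        this.real = real;
        this.tape = tape;
    }

    public String invoke(String request) {
        String response = real.invoke(request); // hit the vendor's sandbox once
        tape.put(request, response);            // write it to the "tape"
        return response;
    }
}

class ReplayingCall implements HttpCall {
    private final Map<String, String> tape;

    ReplayingCall(Map<String, String> tape) {
        this.tape = tape;
    }

    public String invoke(String request) {
        // In CI, serve the recorded response: always green, no flaky sandbox.
        return tape.get(request);
    }
}
```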



Podcast: Sam Newman: Monolith to Microservices

MMS Founder
MMS Sam Newman

Article originally posted on InfoQ. Visit InfoQ

Correctly or not, microservices has become the de facto way to build apps today. However, the idea of microservices is a tough thing to put a clear beginning to. A commonly pointed-to birth of the architectural pattern starts around 2011 with James Lewis. James, a consultant at ThoughtWorks, was at the time interested in something he called micro apps. It was a pattern that was becoming increasingly used at places like Netflix. You can actually find a talk on InfoQ from Adrian Cockcroft at QCon SF in 2010, where he talked about cloud services at Netflix. In 2012, several of these players gathered at a conference and debated this architecture. At that conference, the term microservices became an architectural pattern with a name.

Today on the podcast, we’re talking to one of the thought leaders who was at that conference, and through his consulting, his books, his talks, and his passion has helped to shape what we think of microservice architectures today. Today, we’re talking to Sam Newman. Sam is a consultant in the cloud CI/CD space, perhaps most well known for his thought leadership around microservices. The big focus on this podcast is really to chat about his new book, “Monolith to Microservices: Evolutionary Patterns to Transform Your Monolith.” That story from James Lewis in part is taken from that book.

Outline

Today on the podcast, we’re going to be talking about understanding the problem that you’re really trying to solve when it comes to microservices, before you make that choice. We’re going to talk about technical and organizational challenges around microservices, and Sam’s thoughts on how to address them. We’ll spend some time discussing decomposing the database and some of the patterns to think about when you’re dealing with that monolithic database. Then, once you’re on that path, we’ll talk about some of Sam’s thoughts on how to deal with the growing pains that inevitably come with microservices. On today’s podcast, expect to hear stories, advice, patterns and anti-patterns, and some best practices on microservices from Sam Newman. As always, thank you so much for joining us on “The InfoQ Podcast.”

Sam, welcome back to the podcast.

Newman: Thanks so much for having me back again.

Reisz: Late last year, November time-frame, you published a new book, “Monolith to Microservices: Evolutionary Patterns to Transform Your Monolith.” What have you been doing with all the copious amounts of free time since then?

What I’ve Been Doing since Writing My Last Book

Newman: Writing a second book.

Reisz: A third book.

Newman: Yes. A third book, really. The reason I wrote that book was I started doing a second edition of my first book, “Building Microservices.” The chapter in “Building Microservices,” I think it’s Chapter 5, was where I looked at how you break systems apart. I started rewriting the second edition of my book by looking at that chapter. That chapter went from being 5,000 words to being 40,000 words in about a month. I thought, “This needs to be its own book.” I split it off as its own book. You can read it by itself, but it also works as a companion. It’s a deeper dive on those topics. Since I wrote that book, I’ve been dealing with the current situation, being locked down in my home. As an independent consultant, that’s getting your customers happy with doing things online. It’s actually been easier than I thought. People are much more open to doing online consulting now than they were in the past. In the past, they wanted to see you in front of them. Literally, just about half an hour ago, I was working on some stuff for the second edition of “Building Microservices,” doing a lot of exploration of computer science papers written in the early 1970s. I was descending down that particular rabbit hole this morning.

Reisz: It’s definitely an interesting time now. I can’t say the culture we’re in now is exactly normal remote working. It’s an interesting time, to say the least. One of the questions I always have when I’m talking to someone who writes books and consults is, what does your writing process look like? I’m curious.

What the Writing Process Looks Like

Newman: I’ve spoken to lots of authors, and they’re all different in how it works. It’s all quite personal. I’m what I consider to be a bit of a momentum writer. When I’m writing an initial draft of a chapter, I have to have at least a rough structure already mapped out. That’s the thing that will bounce around in my head for months beforehand. I’ll sketch that out. Then, literally, I like to sit down and, over a period of four or five days, just brain dump. For that to be most effective, I really need to write blocks at a time. I need to have three or four days in a row to really get up to speed on that. That then helps me get out that first initial draft. Then I can review it a few times.

Writing Days

For me, I can get my stream of consciousness down as prose fairly effectively. I’m not somebody that could write that initial bit of work an hour here, half an hour there. That means, really, from a work-life point of view, or work-book balance, I have days where I’m writing, and I put nothing else in that day apart from writing. If I’ve got calls, or I’m doing some online training, I’ll mix those days with other administration-type things I’ve got going on, or things that need doing around the house. I try and keep writing days as writing days. I can’t write for more than five or six hours a day, rare exceptions aside. Today was a writing day. I finished at 3:30. After that, anything else I get done is gravy. Once I’ve written that and I’m getting review feedback from people, at that stage I can go through and process that review feedback in little chunks here and there, around other bits of work.

Reisz: For me when I’m writing, I find I have to write code and then shape the text for the code. Do you find code first, or write first, and then shape the code to what you wrote?

Order of Writing Code

Newman: It’s interesting, because I think most of mine actually start as a story. I think most of the things I write start off life as a presentation or a workshop. In those, I’m taking somebody on a journey. Then I’m writing that journey down. The books I write, I don’t write code-centric books, because there are other people that do that really well. I want to make my books more broadly applicable to people. I don’t want to alienate the .NET’ers because I’m talking about Java, or whatever. For me, it’s more that if I want to share these ideas, how would I do that if I was chatting to somebody in front of me? I’m actually quite fortunate that I do training as part of how I make my living. I get to go and take those messages and stories out, and almost road test them by delivering training. Then, after iterating on that a while and getting the flow, the beginning, the middle, and the end of each of the topics right in that forum, I almost write that down. That’s the process that works for me. I do still try and think in terms of that narrative arc. That tends to be the way I work, more than just throwing in lots of topics I want to fit in. It’s more that I’ll do that narrative arc, then I’ll come back and say, “I’ve missed some stuff. How do I shoehorn it back in again?”

Reisz: Then you have a second and third book?

Focus on One Book

Newman: Yes. I’ve got lots of ideas about books that I want to write after this one is done, but you got to focus on what’s in front of you.

Reisz: Speaking of that, let’s dive in and start talking about, “Monolith to Microservices.” Tell me about this book. You already mentioned briefly the origin but tell me more. Tell me about why this book and why now?

Why Monolith to Microservices Book

Newman: The vast majority of the people I speak to that come to my workshops, I ask this question often: who here has got systems too big? When I do a conference talk, everyone puts their hands up. The vast majority of people don’t start with a blank sheet of paper. They start with an existing system. They think, “Microservices are attractive.” How do you make those two things work? Do you ditch the entire system and rebuild something from scratch? I think that’s extremely problematic in most situations. The reality for the vast majority of people is that if they’re interested in a microservice architecture, they’re going to have to find some way to take what they’ve already got and migrate it towards a microservice architecture.

The People Process Side

Even if you start microservices-first, you get to situations where you find further decomposition is needed. I really wanted to go a bit deeper into that area to give people some concrete tips and advice about how they could make that journey happen for themselves. As part of that, you’re looking at the people and process side of things, then looking at the code, how you pull the code apart, patterns around that, and then spending a lot of time on the data. The idea being that I can share some concrete patterns. I’ve got more case studies in this book than I had in “Building Microservices” to show what is possible. I actually think it’s healthier for people to take that migratory approach in an iterative fashion. I think it’s a much more sensible approach than trying to build a microservice architecture from scratch. Even if you’re rebuilding from scratch an entire system that you already know, I still think migrating the existing monolithic architecture, or whatever you want to call your current system, is a much healthier approach.

Reisz: Back in March at QCon London, I was doing “The Architectures You’ve Always Wondered About” track. Ian Thomas, from The Stars Group, was in that track. He started his talk by referring back to your talk from the previous day. Ian was part of a project that took the sports book for PokerStars and the sports book for Sky Bet and merged those things together. He had a greenfield app that they were building from these two things. In your talk, you said, look, microservices isn’t the point you start at. It might be the architecture you get to, but really, you’re starting at that MVP. You’re starting at the simplest thing that works. He put up this picture, I remember, and everyone cracked up, because he showed this picture of you looking straight at the camera. He said he felt like when you were saying don’t start with microservices, you were looking directly at him. One of the things that people that first come to one of your talks are, I think, initially surprised by is that you’re not ringing the bell that microservices is the jumping-off point. Can you talk a bit more about that?

Microservices Is Not the Jumping Off Point

Newman: I don’t think that anybody that’s spent a lot of time looking seriously at the challenges associated with building a microservice architecture thinks they should be used in all situations. Fundamentally, a microservice architecture is a distributed system, and distributed systems are complicated. They have baggage associated with them. There’s a lot of complexity that comes with these systems. The idea that everybody is aware of those issues and knows how to deal with those problems just isn’t true. We may have learned about some of those ideas at university, perhaps, but they’re not things that most people have had to deal with. For me, it’s like, “Microservices solve all these problems. Yes, that’s cool. They also come with these problems as well.” When I’m working with a client, my mindset typically is not why shouldn’t we use microservices; it’s normally, why should we? I think I got asked, what’s the one thing you’d say about when you would use microservices? The answer normally is: when we’ve tried everything else. Because for an awful lot of things that people want to do, there are often easier ways to achieve the same goal. A lot of people want improved scale through replication. Have you tried having multiple copies of your monolith? Have you tried getting a bigger machine? Those are often the quicker things that you can explore and experiment with beforehand.

The Extent of Microservices

Even then, if you do want to do microservices, a lot of this comes down to the extent to which you do them. The analogy I’ve always tried to use with adopting microservices is that it’s not like flicking a switch. It’s not an off or an on state. It’s much more like a dial. If you’re interested in trying microservices out, then just create one service. Create one service and integrate it with your existing monolithic system. That’s turning the dial a little bit. Just try one and see how one works. Because the vast majority of the problems associated with a microservice architecture are not really evident until you’re in production. It’s super important that even if you think microservices are right for you, you don’t jump in. You dip your toe in the water first and see how that goes before you turn that dial up.
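
As an illustration of turning the dial one notch, here’s a minimal Java sketch of a monolith delegating a single capability to one new service; the class, toggle, and URL are hypothetical, not from the book:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// A minimal sketch: one endpoint routed to the one extracted service,
// everything else stays in process. All names are illustrative.
public class InvoiceFacade {
    private final HttpClient http = HttpClient.newHttpClient();
    private final boolean useNewService; // the "dial", perhaps a toggle

    public InvoiceFacade(boolean useNewService) {
        this.useNewService = useNewService;
    }

    public String renderInvoice(String orderId) throws Exception {
        if (useNewService) {
            // The single new microservice, reachable over the network.
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://invoices.internal/render/" + orderId)).build();
            return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
        }
        return legacyRenderInvoice(orderId); // existing monolith code path
    }

    private String legacyRenderInvoice(String orderId) {
        return "invoice-for-" + orderId; // stands in for the monolith's logic
    }
}
```

If the one service misbehaves in production, the dial turns back without a rewrite.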

Reisz: One of the things that I liked in the book, and I’ve fallen in this trap before thinking of monolith is this one nebulous, just one classification of an app. You break it down and you talk about different types of monolith. That even these different types of monolith can move you towards this journey towards microservices. Can you talk a bit about that?

Different Types of Monolith towards Microservices

Newman: I talk primarily about a monolith as being a unit of deployment. I say that could be all of my code in one process. It could also be a distributed monolith, where I’ve got a distributed system that I deploy all together. One of those monolithic patterns that can often work well as a migratory step would be what we now call the modular monolith, where all of your code is packaged together in a single process, but that code is compartmentalized into separate modules. This is a cutting-edge idea from the late 1960s that we’ve just realized existed. If you look at all the work behind structured programming and information hiding, it’s all about how you organize code into these modules that can be worked on independently. Actually, for a lot of people, if you have all of your code running in a single process, you sidestep a lot of the challenges of distributed systems. If that code is organized around modules, and you’ve got your module boundaries right, then that also is a potential migratory path. You could say, “For my first microservice, I’m going to take one of those module boundaries and potentially use that now as a service.” There are different ways you can make that journey happen.
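
Here’s a minimal Java sketch of such a module boundary, with all names illustrative: one process, but other modules can only reach the ordering code through its public interface, which is exactly the seam a first microservice could later be cut along:

```java
package orders; // the module boundary, expressed as a package

// A minimal sketch of a modular-monolith boundary; names are illustrative.
public interface OrderModule {
    OrderSummary placeOrder(String customerId, String sku);

    record OrderSummary(String orderId, String status) {}
}

// Package-private implementation: invisible to other modules, so persistence
// and pricing details stay hidden behind the interface.
class DefaultOrderModule implements OrderModule {
    public OrderSummary placeOrder(String customerId, String sku) {
        return new OrderSummary("ord-1", "PLACED");
    }
}
```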

Reisz: Without necessarily having to deal with the network partition in the whole process.

Newman: Absolutely.
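As an illustration of the module boundary Newman describes above, here is a minimal Java sketch (all names are hypothetical): callers inside the monolith depend only on an interface, and the in-process implementation behind it can later be swapped for a call to a separate service.

```java
/** A sketch of a module boundary inside a modular monolith (names hypothetical). */
public class ModularMonolithSketch {

    /** Value object shared across the boundary. */
    record Invoice(String customerId, long amountInCents) {}

    /** The boundary: callers depend on this interface, not the implementation. */
    interface InvoicingModule {
        Invoice createInvoice(String customerId, long amountInCents);
    }

    /** Today: runs in-process, inside the monolith. */
    static class InProcessInvoicing implements InvoicingModule {
        public Invoice createInvoice(String customerId, long amountInCents) {
            // Plain method calls against the shared database would live here.
            return new Invoice(customerId, amountInCents);
        }
    }

    // Later, the same interface can be reimplemented to call a separate
    // invoicing service, turning this module into the first microservice
    // without changing any caller.
    public static void main(String[] args) {
        InvoicingModule invoicing = new InProcessInvoicing();
        System.out.println(invoicing.createInvoice("customer-42", 9_99));
    }
}
```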

Taking People on a Microservices Journey

Reisz: Another part of the book that I really enjoyed was about organizational challenges. There was a phrase you used about taking people on a journey. When you say take people on a journey when it comes to microservices, what did you mean?

Newman: It can happen at a lot of different levels. One of the common questions I get is, “How do I convince my boss that we should do microservices?” That’s one I get a lot, and it often comes from developers. I always say, why should they care? What’s in it for them? You’ve got to get that side of it across. Implementing a microservice architecture, or moving to one, won’t be cheap, and it won’t be quick. If you’re going to take your boss on a journey towards that microservice outcome, you’ve really got to explain why you’re doing it. I talk in the book about separating activity from outcome. Implementing microservices is an activity. The outcome shouldn’t be microservices. You don’t win by doing microservices. The only person who wins if you do microservices is me, because I sell books about microservices. You’re implementing microservices to achieve something. What is it you’re actually trying to achieve? It’s amazing how many people I chat to who can’t give me a clear articulation of why they’re doing microservices. It starts with having a really clear vision of what it is you’re trying to achieve. Then you can ask, is microservices the right approach for this? Are we using microservices in the right way?

Problems That Require a Microservices Approach

Reisz: Decompose that a bit. What are the smells? You’re running a monolith and you start to have a set of problems. What are some of those smells, some of those problems, that indicate maybe microservices is something you want to look at?

Newman: I would say the most common one that I see for larger enterprise organizations is that they want to have lots of developers working on a given problem, and those developers are getting in each other’s way. What they want is for those developers to be able to work more independently from each other, to reduce what I call delivery contention. How do I have different autonomous teams working more in isolation, but still have the things that they create come together into an overall system? That’s a big part of it: how do I get more people working more effectively and efficiently? Scaling comes up sometimes, but much less frequently than you’d think. There are some aspects around scaling that can be pretty beneficial with microservice architectures. Data partitioning is something I’m seeing a lot more of. If you isolate, for example, where your personally identifiable information is, you can sidestep GDPR concerns. You could say, “This part of my system has to have PCI or GDPR sign-off. This part of my system doesn’t.” Those are three quick examples of types of problems which can often lend themselves quite well to a microservice architecture.

Things to Have Right before Decomposing Your Monolith into Microservices

Reisz: You start to get these smells of team velocity, and you start to see some areas where privacy concerns or scaling different parts of your system may be different. There was famously a blog post years ago, Martin Thompson’s “You must be this tall for microservices.” Before you can really jump off that cliff and even incrementally decompose your monolith into microservices, what are some of the things that you really need to make sure you have right?

Newman: Because I see that process of dipping your toe in the water as something which is so gradual, and a very gentle first step, I don’t have a big shopping list of things that I say people have to do. Some people say, “You’ve got to do X, Y, and Z.” Actually, you probably should have automated deployment of your system, but if you don’t, and you’re only adding one more thing to deploy, it’s not the end of the world. There’s one big prerequisite I always say people should have first, and that’s to implement some form of log aggregation. By log aggregation, I mean some means by which a process can write log files locally, and those files are automatically aggregated into a central location where they can be queried. Traditionally, you think of things like the ELK Stack. I really like Humio for this. That’s the one prerequisite I really push, and I’m quite firm on it. The reasons for that are actually quite straightforward. The first is, it’s really useful, and it’s the thing that is going to help you a lot early on.

A Good Test of an Organization

The second is that it’s often a good test of an organization. If you, as an organization, with all its different parts, can’t work out how to choose a tool and implement that tool effectively, that tells you something. A log aggregation stack is a very simple thing in the grand scheme of things. If that’s something that you can’t do as an organization, the chances are that all the other problems you’re going to have to face in a microservice architecture are going to be much tougher for you, and probably not things you’re ready for. I always use it as a test of the organization. Rolling out log aggregation means you’ve got to find a tool, pick the tool, get the operations team on board, get it bought, get it installed, and get it configured with your existing applications. It’s not massive amounts of work, theoretically, but if it is for you, you might be thinking maybe you should sidestep this whole microservices thing for a little bit.
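To make the prerequisite concrete, here is a minimal sketch of the local half of log aggregation: a service writes one JSON object per line to a file, which a shipper (Filebeat, Fluentd, Humio’s agent, and so on) can forward to a central store. The field names and file layout are illustrative only; a real service would use an established logging library rather than hand-rolling this.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Instant;

/** Minimal sketch: one JSON object per line, so a log shipper can forward
 *  the file to a central, queryable store (ELK, Humio, etc.). */
public class JsonLineLogger {
    private final Path file;
    private final String service;

    JsonLineLogger(Path file, String service) {
        this.file = file;
        this.service = service;
    }

    void info(String message, String correlationId) throws IOException {
        String line = String.format(
            "{\"ts\":\"%s\",\"service\":\"%s\",\"level\":\"INFO\",\"correlationId\":\"%s\",\"msg\":\"%s\"}%n",
            Instant.now(), service, correlationId, message);
        Files.writeString(file, line, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        var log = new JsonLineLogger(Path.of("orders-service.log"), "orders");
        log.info("order accepted", "req-12345"); // correlation IDs make cross-service queries possible
    }
}
```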

How the Organization Changes to Support Microservices Architecture

Reisz: I want to definitely come back and hit on logs, particularly with microservices, because as you start to scale them out, those logs and metrics become more voluminous. Before we get there, though, I want to talk about organizational culture and the impact that has on your architecture. Conway’s Law talks about how your application models your organization’s communication structures. You’ve got this monolithic application, which probably presumes an organization that mirrors it. As you start moving towards microservices, how does the organization need to change its shape to be able to support that architecture?

Newman: It’s interesting, because I find that these things are often talked about in isolation from each other. Ideally, you get the most benefit where your organization and your architecture are aligned. I talk about ownership models. I’ve got a talk called “Rip It Up and Start Again” where I look at the different types of ownership models. The model which I find most effective for a microservice architecture is that a service is owned by exactly one team, though one team might own more than one service. That keeps your ownership lines nice and clear. This is what you would think of as a strong ownership model, and it’s the model I’ve seen work most effectively. Virtually every large-scale microservices organization I’ve looked at has adopted that model. My starting point is often, what is the existing organization? Often, the organization is already aligned around business concepts. That’s a more modern shift that’s been happening, away from siloed IT teams towards organizational structures that are aligned around the nature of the product, maybe around product verticals. If you’re in that world already, your life becomes much easier. Then you’d be looking to create services that sit cleanly within those organizational lines.

Three-tiered Architecture Approach

I think it becomes much more problematic where you’ve got things like the classic three-tiered architecture approach, where you still have the database team, the services team, and the front-end team, who just deal with tickets that roll across every different part of the system. In that environment, it becomes very difficult to bring microservices in, because a microservice is fundamentally a vertical slice, and hopefully a vertical slice through presentation, business, and data tiers. In those environments, it’s much more a case of saying, “This is what a microservice is. We need to get a team that is working in a more poly-skilled fashion to own this stuff.” You don’t do the whole organization. You convince people on the model, and then you pick one team and see how that model works.

The UI Tier

What I’ve been quite unhappy with since I wrote “Building Microservices” is the fact that, although I always felt this was evident from what I was saying, so many people stop that end-to-end slice of functionality at the UI tier. They don’t include the UI. I go to organization after organization, and they still talk about front-end and back-end developers. I have no problem with you being a front-end specialist or a back-end specialist. I think calling yourself a full-stack developer is daft because, as Charity Majors says, “You’re not a full stack developer, unless you design the chips.” Having poly-skilled teams makes sense to me, but many organizations have kept their UI as a silo. That’s been really disheartening. I think part of this is maybe because we didn’t talk about it enough early on; I didn’t highlight these challenges enough. I think it’s also partly been around the prevalence of single-page app technology, which does cause issues around UI decomposition.

Micro Frontend

Reisz: What are your thoughts on the term micro frontend? Is that just another name for microservices? Or is it a complementary technology?

Newman: Micro frontend is very specifically the concept whereby you take a single-page app and make it not a single-page app anymore. It’s talked about right now in the context of single-page apps. With micro frontends, I don’t have one single-page app representing my user interface; I now potentially have multiple single-page apps (we could debate the word “single” at this point) coexisting inside the same browser pane. This is really about how you work with single-page app frameworks in such a way that you can break that work apart into different applications, which can be worked on and deployed independently. You’re now dealing with issues like, how would you sandbox your whole NPM chain, and all that stuff. It absolutely, I think, is a supporting technology. The use of micro frontends would be how I would decompose my user interface if I was trying to do that with single-page apps. If I’ve got a website, I could use pages as my unit of decomposition, and I don’t even need to think about micro frontends; I’m just serving up different pages from different services. It’s definitely a supporting piece of technology, and also a supporting concept.

Release Train as a Point in Time towards CI/CD and Not a Destination

Reisz: One of the things that you talked about in the book was a release train being a point in time in the journey towards CI/CD, but not necessarily the destination itself. Can you talk a bit about that?

Newman: I used to do a lot of CI/CD work, and I still do. A release train is an idea where you set a cadence. You say every four weeks the release train leaves, and all functionality that’s ready goes out on the release train. It’s a really effective technique to help people get used to delivering software on a regular cadence. The idea then is that you increase how frequently the release train leaves, and eventually you get rid of the release train and move towards delivering on demand. I’ve always described a release train as a useful set of training wheels on your bike: you want to learn how to cycle properly, so eventually you get rid of them. I talk about it in the book because I see some organizations adopt a release train for a services-based system, and that can codify the idea that lots of services get released together. In those situations, if you’ve got 20 different services, each individual service should have its own release train. You don’t have a release train for the whole system.

Sarah Wells has talked about this as well, I think, having seen it happen: organizations whose teams all release every four weeks end up with a distributed monolith as a result. I say, “If you’ve got a release train, that’s fine, but understand why you’ve got it.” You want to move beyond it. If you’ve got one release train for all of your teams, allow each team to have its own release train. That’s a good first step. Then start to increase the cadence of those releases, and eventually move to releasing on demand. That’s just what’s in the “Continuous Delivery” book; it’s hardly cutting-edge information, really. I didn’t feel I’d have to keep having this conversation, but with something like SAFe, for example, which effectively codifies the release train as the way to release software, it does come up more now. I think it just needs to be talked about a bit more, to explain that this isn’t an aspirational technique. For many people, this is a stage you move through towards proper release-on-demand, or release-when-ready, continuous delivery flow.

Patterns in the Book Equivalent to the Strangler Pattern

Reisz: It made a lot of sense to me when you made the connection that if you can’t move beyond the release train, you may be building, and reinforcing the patterns of, a distributed monolith. One of the big aspects of this book, as you read through it, is the patterns. You’ll see familiar patterns in there, like the strangler pattern, used to break off units of a monolith and move them out to microservices. There are other patterns in there too: branch by abstraction, parallel runs, decorating collaborators. What are some of the other patterns in the book that might give an architect the same mileage they got out of the strangler pattern?

Newman: I think branch by abstraction is a big one. I think it’s overlooked because I think the branch by abstraction pattern is typically only talked about in the context of trunk-based development. Some people find trunk-based development controversial. We won’t have that argument again, because people who think trunk-based development is right are also correct. Because of that, that pattern is only looked at in the context of trunk-based development. People don’t even look at it unless they’re doing trunk-based development. I should probably explain it, shouldn’t I?

Branch by Abstraction

Reisz: I was just about to ask that. I’ll set it up for you. What is branch by abstraction?

Newman: You could distill it down to its smallest form and say that, at one level, this is really the Liskov Substitution Principle. The idea behind branch by abstraction is that you want to reimplement a piece of functionality, potentially to change that functionality or, in the context of a microservice architecture, to migrate that functionality to a new service. You’ve got the existing code, and you don’t want to break it, but you need to come up with a new version of that code. I could do that with source code branching, or I could do it in the same codebase. How do I do that? With branch by abstraction: effectively, I create an abstraction point that allows me to toggle between one implementation and the other. At the code level, it’s really straightforward.

I might have an order processor class that wants to send updates when the order is being dispatched. In an [inaudible 00:26:11] system, I would inject a notifications interface. That notifications interface could have more than one implementation. I could have a notifications implementation which actually does all the functionality inside my monolith, and I could have a microservice notification implementation that calls out to a separate notification service. At runtime, I could then change which of those implementations I’m using, which could be done with a feature flag or something else.
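Newman’s notifications example translates almost directly into code. Here is a minimal Java sketch of branch by abstraction (class and flag names are hypothetical): both implementations share one abstraction, and a simple flag toggles between them at runtime.

```java
/** Branch by abstraction, as described above (names hypothetical). */
public class BranchByAbstractionSketch {

    /** The abstraction point both implementations share. */
    interface Notifications {
        void orderDispatched(String orderId);
    }

    /** Existing behavior, still living inside the monolith. */
    static class InMonolithNotifications implements Notifications {
        public void orderDispatched(String orderId) {
            System.out.println("monolith: notifying for " + orderId);
        }
    }

    /** New behavior: calls out to a separate notification service. */
    static class MicroserviceNotifications implements Notifications {
        public void orderDispatched(String orderId) {
            // An HTTP call to the new notification service would go here.
            System.out.println("microservice: notifying for " + orderId);
        }
    }

    /** The order processor depends only on the abstraction. */
    static class OrderProcessor {
        private final Notifications notifications;
        OrderProcessor(Notifications notifications) { this.notifications = notifications; }
        void dispatch(String orderId) { notifications.orderDispatched(orderId); }
    }

    public static void main(String[] args) {
        // A feature flag (here just a system property) toggles the implementations.
        boolean useMicroservice = Boolean.getBoolean("notifications.microservice");
        Notifications impl = useMicroservice
                ? new MicroserviceNotifications()
                : new InMonolithNotifications();
        new OrderProcessor(impl).dispatch("order-123");
    }
}
```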

Refactoring

At one level, this is just abstractions and toggleable abstractions. When we do it in the context of a microservice migration, what we’re typically looking for is a refactoring. Refactoring is something that changes the structure of the code without changing its behavior. When you’re migrating functionality from an existing application to a microservice, you typically want to keep the functionality exactly the same, so that you can compare and make sure you’ve done it well. This is effectively creating two different implementations of the same abstraction with exactly the same behavior, which really is the Liskov Substitution Principle. Often when you do branch by abstraction you’re varying the behavior, but we typically don’t want to do that in a microservice migration, because if we can run both of those implementations, we can make sure they’re working in the same way, compare results, and so on.

If You Do Branch By Abstraction, Do You Release It as a Canary?

Reisz: Let’s keep going to the natural extension of this. If you follow along and do branch by abstraction, and you have the ability to inject, maybe via a feature flag, do you release that as a canary and then instrument it to compare the results? What’s the follow-on once you’ve done that, to know whether this is really the right approach?

Newman: A lot of it does depend on a feature-by-feature basis. It comes down to the risk of the change you’re making, among everything else. Probably the ultra-cautious end of this would be to do a parallel run. A parallel run is where you run all calls through both implementations of that piece of functionality and compare the results, and then say, “It’s working.” Typically, in that situation, you’d have the old implementation, which is the one you trust, and your new microservice, which is the one you don’t yet trust. Whenever a call comes in to use that abstraction, you pass the call on to both implementations. Then you’re getting a complete like-for-like comparison. That comparison can be done live, or it can be done offline. Then, when you get to a place where you’re confident, you can say, “The new version is working appropriately. I can remove the old version.”

Comparing the Results from a Parallel Run

The thing with a parallel run is, you’re getting to compare the results. If the microservice implementation misbehaves, that functionality has never actually been made visible to the customer. Because the results of the implementation you surface to the customer are from the old implementation. If I’m calling the monolith calculation, and I’m calling the microservice calculation, I want the answer to be the same. I’m never going to surface any answer to the customer other than the one that comes from the monolith implementation until I’m at a point where I trust it. That means that any problems with your microservice execution are completely hidden from the customer.
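A minimal sketch of that parallel-run idea in Java (names hypothetical): both implementations are invoked, divergences are recorded, and the caller only ever sees the trusted result.

```java
import java.util.function.Function;

/** Parallel run: invoke the trusted monolith path and the new microservice
 *  path, compare results, and surface only the trusted answer. */
public class ParallelRun<I, O> {
    private final Function<I, O> trusted;    // old monolith code path
    private final Function<I, O> candidate;  // new microservice code path

    public ParallelRun(Function<I, O> trusted, Function<I, O> candidate) {
        this.trusted = trusted;
        this.candidate = candidate;
    }

    public O apply(I input) {
        O trustedResult = trusted.apply(input);
        try {
            O candidateResult = candidate.apply(input);
            if (!trustedResult.equals(candidateResult)) {
                // In a real system this divergence would go to your log
                // aggregation or metrics for live or offline comparison.
                System.err.printf("divergence for %s: %s vs %s%n",
                        input, trustedResult, candidateResult);
            }
        } catch (Exception e) {
            // A failing candidate must never affect the customer.
            System.err.println("candidate failed: " + e);
        }
        return trustedResult; // the customer only ever sees the trusted answer
    }

    public static void main(String[] args) {
        var run = new ParallelRun<Integer, Integer>(n -> n * 2, n -> n + n);
        System.out.println(run.apply(21)); // 42, with the candidate checked silently
    }
}
```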

A Canary

A canary is a different tack on that. There, you would have a set portion of the traffic go to the microservice implementation. That means that if it had a problem, the people in that canary group would see the issue. If your canary group is actually your internal customers, or your internal beta test team, that might be acceptable. A lot of it depends on the business context. I talk about parallel runs. Branch by abstraction, as a pattern, makes it quite easy to implement parallel runs. It also makes it quite easy to implement internal canaries. I talk about those things as being under the umbrella of what we call progressive delivery: the different techniques for how you roll out functionality to end users.
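A canary can be sketched just as simply. Here is a hypothetical Java routing helper: internal users always get the new implementation, and everyone else is bucketed deterministically so that roughly 5% of users stick to the canary across requests.

```java
import java.util.Set;

/** Canary routing sketch (names and percentages hypothetical). */
public class CanaryRouter {
    private static final Set<String> INTERNAL_USERS = Set.of("alice@corp", "bob@corp");
    private static final int CANARY_PERCENT = 5;

    static boolean useMicroservice(String userId) {
        if (INTERNAL_USERS.contains(userId)) return true; // internal beta group
        // Hash-based bucketing keeps each user in the same group on every request.
        return Math.floorMod(userId.hashCode(), 100) < CANARY_PERCENT;
    }

    public static void main(String[] args) {
        System.out.println(useMicroservice("alice@corp"));  // always true
        System.out.println(useMicroservice("visitor-123")); // stable, ~5% of users
    }
}
```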

Dealing with COTS

Reisz: It totally makes sense. What happens when you have a monolith and there is COTS (commercial off-the-shelf) functionality embedded within it? A CRM, or a CMS, or something along those lines. How do you deal with that COTS functionality? Do you wrap it? What are some of the strategies there?

Newman: Wrapping it can work. Wrapping is a good technique. I can’t remember if it’s in this book or in “Building Microservices,” but I talk about our experiences of helping move away from a Salesforce-based system. A Salesforce-based application dealt with accounts, and revenue, and project information, and everything else. What we started doing was wrapping it with multiple services. I stopped people going straight to Salesforce for the account information, and said, you’re going to the account service. Behind the scenes, the account service was really shelling out to talk to Salesforce. Initially, it was providing almost an adapter layer on top. That got you away from going direct to the COTS system, and we could then look at the migration behind the scenes. That approach can work fairly effectively. At that point, you can start doing things like strangler figs inside those adapters to route functionality around.

Another option is just good old-fashioned getting hold of the database. Some COTS products don’t have a nice API. Salesforce, relatively speaking, has a pretty easy-to-use API, so there’s a lot we can do at that layer. Some software doesn’t have that; sometimes all you’ve got is the database the data is stored in. In those situations, things become more problematic, and you might have to do some weird Change Data Capture things to extract the data. With COTS-based migrations of functionality, you don’t have the ability to change the monolith itself, because the monolith is the COTS product. You can’t use patterns like branch by abstraction, because they require an internal change to the structure of the code. Instead, you’re much more limited in your options, and that can often result in you having to take bigger slices of functionality out of the system. A COTS product with a decent API gives you a lot more options for how to do that. If you’ve got a GUI-based, very much black-box type of COTS product, sometimes your best bet is to go into the database, or to take bigger slices of functionality out of that system.
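The wrapping approach described above can be sketched as a thin facade (all names hypothetical): callers move over to an account service interface whose first implementation simply delegates to the COTS system, after which functionality can be strangled out behind the facade one capability at a time.

```java
/** Sketch of wrapping a COTS system behind a service facade (names hypothetical). */
public class CotsWrapperSketch {

    record Account(String id, String name) {}

    /** What the rest of the organization now calls, instead of the COTS API. */
    interface AccountService {
        Account findAccount(String id);
    }

    /** Step one: a thin adapter that still delegates to the COTS system. */
    static class CotsBackedAccountService implements AccountService {
        public Account findAccount(String id) {
            // A real adapter would call the vendor API (or its database) here.
            return new Account(id, "fetched from the COTS system");
        }
    }

    // Once every caller goes through AccountService, capabilities can be
    // strangled out of the COTS system behind this facade, one at a time.
    public static void main(String[] args) {
        AccountService accounts = new CotsBackedAccountService();
        System.out.println(accounts.findAccount("acct-7"));
    }
}
```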

Decomposing the Database

Reisz: Speaking of the database, there’s quite a bit in this book that talks about decomposing the database. Why so much attention in the book for decomposing the database?

Newman: If you want to have services that can be worked on and deployed independently, then you need to avoid the quite nasty, often pathological coupling that comes from sharing the same database or reaching into somebody else’s database. With microservice architectures, typically, if a microservice needs to store or manage state, it does so in a database that it owns, that it controls, and that it hides from the outside world. If we want to move from a monolithic system, where our data is probably in one big database, to microservices, we’ve got to pull that database apart. This tends to be where a lot of these migrations falter. People can find ways to pull the code apart, but they don’t bother to pull the database apart, and they leave themselves without really getting the benefits of the microservice migration.

Breaking the database apart is difficult. These are often relational databases, and a lot of their power comes from everything being in one place. They are amazing things, databases. We’re going to have to do some horrible things to a relational database in order to get that data out: breaking foreign key relationships, messing up join queries, and potentially having to give up on transactional integrity in some areas. I really wanted to give people a whole lot of patterns and different techniques in that space to show what is possible, but also to show that this change doesn’t have to be daunting. You’re not saying, “We’re going to do it all today.” You’re saying, “We’re going to make one step today. That step will get us further down the line towards where we want to be.” I’m trying to show patterns that, like the application-level ones, are refactorings: small changes you can make to a database that allow you to edge yourself in the right direction.
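One example of the kind of small step Newman means, sketched in Java with hypothetical names: breaking a foreign-key relationship by replacing a cross-table SQL join with a call to the service that now owns the other table. This is a sketch of the general pattern, not the book’s exact code.

```java
/** Breaking a database join (names hypothetical): instead of joining the
 *  order lines to the catalog table in SQL, the order code asks the catalog
 *  service for the rows it used to join against. */
public class BreakTheJoinSketch {
    record OrderLine(String catalogItemId, int quantity) {}
    record CatalogItem(String id, String name) {}

    /** Backed by the catalog microservice's API once the data has moved. */
    interface CatalogClient {
        CatalogItem byId(String id);
    }

    static String describe(OrderLine line, CatalogClient catalog) {
        // Formerly: SELECT ... FROM order_lines JOIN catalog_items ON ...
        CatalogItem item = catalog.byId(line.catalogItemId());
        return line.quantity() + " x " + item.name();
    }

    public static void main(String[] args) {
        CatalogClient stub = id -> new CatalogItem(id, "widget");
        System.out.println(describe(new OrderLine("sku-1", 3), stub));
    }
}
```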

Leveraging Change Data Capture in Decomposing a Monolithic Database

Reisz: I love that phrase, pathological coupling. That’s a great phrase. One of the techniques you talk about in there is Change Data Capture. It’s one that I think is fantastic for incrementally moving towards more of an event-driven system. Can you talk a bit about Change Data Capture and how you might leverage that technique in decomposing a monolithic database?

Newman: A lot of the patterns in the book are patterns that predate the microservices world. I’ve just tried to take those patterns and show how they can be used in the context of microservices. The strangler pattern is one example of that. Change Data Capture is a really straightforward idea: once a change is made in one data source, you can capture that change and pass it on somewhere else. A lot of CDC systems basically work by looking at the transaction log of the database, and large numbers of people use them now. This is one of the biggest use cases for Kafka. You use Debezium: it looks at the transaction log of your database, and when a bit of data gets inserted, an event gets pumped out over Kafka, and things can subscribe to those events. This can be used for data replication; it’s a classic part of normal ETL processes. Change Data Capture can be really useful if you need to replicate state between two different systems, potentially because you’re in something like a transitional mode where you’ve got two sources of truth for a period of time. I share a couple of patterns around this, like the example from Square and the tracer-write type situation, although they didn’t use CDC there. It can also be very useful for attaching behavior to changes in a monolithic system.

Awarding Points When an Order is Placed

I think I gave the example of awarding points when an order is placed. The only way I know an order has been placed is when an order arrives in the database. When an order arrives in the database, I could have it fire an event via a Change Data Capture pipeline, maybe over Kafka. I could then receive the event that says the order has been placed, and my little loyalty microservice could start awarding points to you for your order. It can be a really nice system, actually. I think the one challenge with any Change Data Capture system is that it requires your CDC pipeline to know about the schema. At this point, you’re normally talking about a monolithic database, which is so difficult to change that no one is changing it, so it often is quite stable. There are some good modern toolchains in that space; I don’t think it’s just narrowly looked at in the world of ETL anymore. There was a really good case study recently from Airbnb talking about how they did CDC as part of their microservice transition. I think they built their own internal system for that.
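A rough sketch of that loyalty example, assuming Debezium is already publishing change events for the monolith’s orders table to Kafka. The topic name, the kafka-clients dependency, and the point-awarding logic are all assumptions for illustration, and the JSON handling is deliberately naive.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/** A small loyalty service that subscribes to the CDC topic for the
 *  monolith's orders table and awards points when new orders appear. */
public class LoyaltyPointsService {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "loyalty-service");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // One topic per captured table is the usual Debezium convention;
            // "shopdb.public.orders" is a hypothetical example.
            consumer.subscribe(List.of("shopdb.public.orders"));
            while (true) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofSeconds(1))) {
                    // record.value() is a JSON change event; a real service
                    // would parse it properly instead of printing it.
                    System.out.println("order change seen, awarding points: "
                            + record.value());
                }
            }
        }
    }
}
```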

Are Frameworks a Foregone Conclusion With Scale?

Reisz: Debezium with Kafka that you mentioned, those are two that I’m familiar with. They’re pretty powerful in that space. We’re nearing the end here. I wanted to touch on just a few growing pains. We’ve talked about a handful of services, but I want to talk about, as you continue to scale, some of the challenges that people face. I want to talk a bit about frameworks, things like service meshes, and maybe just touch briefly on logging before we wrap up. There’s a quote that you have in the book, more services, more pain. As you continue to scale out your services, are frameworks just a foregone conclusion, things like service meshes?

Newman: Not necessarily a foregone conclusion. I think service mesh is a great example. Service mesh is a conceptually simple idea but the quality of the implementations in that space is still variable. I have been giving people the same advice about service mesh for the last three-and-a-half years, which is, if you can wait six months to work out which service mesh you pick, wait six months. I’m still giving that advice. I think, conceptually, they are good. The whole point of a service mesh is that you get to push a bunch of common behavior into the platform. The difference between maybe that and a normal framework like Spring Boot, is that there you’re relying on the framework running inside the service itself. Maintaining consistent behavior across a whole bunch of services is difficult. You can’t say, “Everybody, upgrade to Spring Boot version 15,” or whatever because you don’t want to be doing those deployments. With service mesh, you can have some degree of common behavior running in the platform.

Service Mesh for Synchronous Communication

The other thing to note with service meshes is that they really are only for synchronous communication. They don’t really give you many solutions, any help at all really, in the world of async communication. Just look at the variety of different options you’ve got around right now. Istio just realized the other day that they needed to completely rearchitect and rebuild how Istio is built, run, and managed. I love the idea. I think something like that might make sense as part of your future. What’s more likely, though, is not that you’ve got to run on Kubernetes and a service mesh; it’s that most development in the next 5 to 10 years will be done on some FaaS-type platform. I think that’s much more likely to be the future for most developers, even if, under the hood, it’s a giant, hellish Jenga stack of Kubernetes, and Istio, and Knative, or something else.

Dealing With Volume of Logs and Metrics

Reisz: What advice do you give on the observability front, talking about logs, metrics, and tracing? As you deploy more services and get more scale, those logs continue to get bigger. How do you deal with the volume of logs and metrics?

Newman: I think, volume-wise, there are some balancing forces here, because on the one hand you don’t know what you need until you need to ask a question of the data, so at one level you want to lean towards logging as much as you can. To be honest with you, with logging and log volumes, if you’re logging a lot of data, you need to quantify what “a lot of data” is and look at the capabilities of the log aggregation platform you’re considering. With something like the ELK Stack, for example, the Elasticsearch, Logstash, Kibana stack, the thing that’s most concerning in terms of large volumes is Elasticsearch. I spoke to a big SaaS-based company that had a dedicated team just running Elasticsearch for the ELK cluster. They’re dealing with quite large volumes; a lot of people aren’t actually in that space. If you are in that massive-volume space, where you’re generating that many logs, then there are dedicated log aggregation tools that are great at handling it, if you pay some money. Get Humio and you’ll be happy.

Logs versus Other Data Types and Metrics

I think it’s a bit different with logs versus other types of data, like metrics. Often, when it comes to large volumes of data, if you can collect that information in a semi-asynchronous or batch-oriented fashion, which is how most log aggregation is done, that allows you to sidestep a bunch of the problems. There are other situations where you need the data to come through now. Distributed tracing is a great example of that: you need a single wall clock to do effective timing, to look at how long something takes. You look at systems like Jaeger, Honeycomb, or LightStep. Those are quite different bits of data you’re gathering, because the logs only need to get to me within the next couple of minutes, whereas when I’m sending traces as part of a distributed trace, they have to come through in a synchronous fashion. Because of the large volumes of data that you often deal with there, and because you don’t want to affect the running system, you do sampling.

Getting Data Off Of Machines

To be honest with you, it’s very hard to say, this is what you’re going to need. This comes back to the idea of the dial. I always come back to the key point: you need to get the data off of those machines and stored somewhere central, where you can actually go and ask questions of it. That’s the first thing. You need to get your logs out, you need to get your metrics out, and you stick them somewhere you can query them. Ideally, these are structured, repeatable queries. For metrics, and certainly for tracing, if I’ve got no money, I’m looking at maybe using a mix of Jaeger and Prometheus. If I’ve got money, I’m looking at LightStep and Honeycomb, just because they are tools that have been built with these kinds of systems in mind. I think the biggest challenge is that they look quite alien to some people.

Going from Microservices Back To a Monolith

Reisz: Recently, we’ve seen some projects, some companies, some different architectures that have been reverting, going from microservices back to a monolith, at least a process monolith. Istiod comes to mind. Why do you think that is?

Newman: It’s really straightforward: they didn’t read my books. One of the things I’ve talked about a number of times is the challenge of microservice-related architectures being operated by the end customer. When you create a piece of software that you’re giving to somebody else to own and manage, that’s an issue. With a microservice architecture, you push a lot of complexity into the operational space, and all of that complexity is visible to the customer. With the original Istio architecture, you effectively had to run a little mini microservice stack in order to run your microservice stack. By moving it all back into a single monolithic process, they make the operation of it much simpler.

If people can’t see the inherent irony in this, I don’t know what to do. That’s why they’ve done it. Actually, I think it’s good and it’s healthy. The fact it has taken so long for them to do this is a bit of a worry, especially given that Istio is effectively the underpinning of what is now going to be Knative. Although, now that it’s in Google’s hands only, who knows what’s going to happen there? It was totally understandable. I’m glad they did it. I think it was the right thing to do, and a brave decision. I don’t read anything into it about Istio not being fit for purpose. This is a fairly fundamental change that has come two years after they stabilized what Istio was supposed to be, which is why I say the service mesh space is still not stable.

Reisz: I think it was pretty brave, though. You’ve got to eat your own dog food, so you want to build on top of the patterns that you’re espousing. They also understood that customers were fighting with the complexity of what was there. I think it made a lot of sense.

Newman: That was very brave.

Reisz: What’s next?

What’s Next?

Newman: Writing, and lots of remote consulting and remote training for my clients, and hoping my internet holds up. That’s basically my life for the next six months.

Reisz: I hear you, Sam. Sam, thank you so much for joining us on “The InfoQ Podcast.” As always, it’s great to chat with you.

Newman: Thanks so much.




Java at 25

MMS Founder
MMS Ben Evans

Article originally posted on InfoQ. Visit InfoQ

May 23rd 2020 marked 25 years since the first public alpha release of the Java programming language and platform.

The world has changed a great deal since that initial release – which occasioned Network World to opine “Some analysts believe that the Java programming language has the potential to transform the web” (22nd May 1995). At that time, Microsoft were gearing up for the August release of Windows 95. That operating system would famously launch without any form of web browser. The Internet was not yet really a part of the mainstream of public consciousness.

Java would become a key player in the years following its initial release, as the Internet transitioned to become a mainstream phenomenon. This would show itself in some unexpected ways, such as the renaming of a mostly-unrelated scripting language to “JavaScript”. There was never anything other than the flimsiest of technical reasons for this decision – it was merely a way to cash in on the expanding public profile of the Java ecosystem.

Hindsight is 20-20, but it is also true that long-term bets in software are always difficult. Java has certainly benefited from some design decisions that could be seen as either prescient or very lucky.

In particular, Java has been a fortunate beneficiary of Moore’s Law. Several of Java’s most important features are only really feasible because of the incredible growth in the computational power of processors over the last 25 years. The early years of Java were plagued by poor performance – which led to a folk memory among programmers that “Java is slow”, one that occasionally persists even today, despite not having been true for over 15 years.

Java may also have benefited from a coherent design philosophy – as it has always had a number of design goals that favour the developer:

  • Backwards Compatibility
  • Language stability
  • Code should be easy to read
  • Features should be implemented as libraries if possible
  • Provide an extensive standard library out of the box

These principles, combined with a certain amount of luck, may have produced a language and platform that was “in the right place at the right time”. Very few programming languages become successful, and of those that do, most fall from favour after a few years.

In the modern world, only JavaScript, Python and C/C++ have had the same level of sustained, truly mainstream success that Java has enjoyed. Java takes its role as the stable underpinnings of production software very seriously, and it shows in the health and longevity of the platform.

So, as Java turns 25 and looks to the future, a range of events are being held across the community to commemorate the occasion.

It’s impossible to say whether Java will still be around in a recognizable form for its 30th (or 40th) birthday. However, on the current evidence and health of the community, it seems entirely possible that it might.



Secure Multiparty Computation May Enable Privacy-protecting Contact Tracing Solutions

MMS Founder
MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

The current COVID-19 pandemic has fueled several efforts to implement contact tracing apps, based on a number of different cryptographic approaches.

Smartphones are being used in many countries to track COVID-19 transmission, with the adopted solutions varying widely in how well they preserve citizens’ privacy. Apple and Google have recently introduced a Bluetooth-based exposure notification protocol for their mobile operating systems that guarantees a certain level of data privacy. The protocol is meant to be only one component of a larger system operated by health authorities, and those systems should also implement the cryptographic solutions required to ensure data privacy protection.

InfoQ has spoken with HashiCorp principal product manager for cryptography and security Andy Manoske to learn more about Secure Multiparty Computation and how it can enable privacy-protecting analysis on private data from different sources.

InfoQ: Due to COVID-19, there has been a lot of public discussion and growing interest for Contact Tracing apps, which are considered key for a safe way back to normal life. What are the privacy implications of contact tracing and why is cryptography key?

Manoske: Most contact tracing suites will likely require sensitive information such as an individual’s name, contact information, and their history of travel and social interactions. While this information is necessary for tracking the spread of a disease through a population, it can also be used by a malicious adversary to commit identity fraud.

Cryptography’s role is to ensure that only identified and permitted applications can access, i.e. decrypt, this sensitive information. Without verification of valid identity (and ideally intent) an adversary is forced to break the encryption protecting this sensitive information.

InfoQ: Could you explain what Secure Multiparty Computation can bring to this kind of solution and how it could be implemented?

Manoske: SMPC ensures that sensitive data is constantly encrypted, and that valid analysis of that data does not require it to be decrypted in order to be processed or analyzed.

A common technique for defeating encryption is to avoid it altogether with a “side channel attack.” Rather than using math to attempt to guess the key, side channel attacks allow an attacker to exploit vulnerabilities in a system, or use malware such as keyloggers or Remote Access Trojans (RATs), to steal encryption keys or valid credentials to decrypt data.

With SMPC, valid analysis can be done on constantly encrypted data. This minimizes the possibility for an adversary to launch a side channel attack to decrypt protected data.
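One way to see the idea is a toy additive secret-sharing example, a classic SMPC building block: several parties learn the sum of their inputs without any single party, or the aggregator, seeing an individual input. This Java sketch is purely illustrative; a production system would use a vetted protocol and library, not hand-rolled arithmetic.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

/** Toy additive secret sharing: three parties learn the sum of their inputs
 *  without revealing any individual input. Illustrative only. */
public class AdditiveSharingDemo {
    // All arithmetic is modulo a public prime (2^127 - 1, a Mersenne prime).
    static final BigInteger P = new BigInteger(
        "170141183460469231731687303715884105727");
    static final SecureRandom RNG = new SecureRandom();

    /** Split a secret into n random shares that sum to the secret mod P. */
    static BigInteger[] share(BigInteger secret, int n) {
        BigInteger[] shares = new BigInteger[n];
        BigInteger acc = BigInteger.ZERO;
        for (int i = 0; i < n - 1; i++) {
            shares[i] = new BigInteger(126, RNG); // uniformly random share
            acc = acc.add(shares[i]);
        }
        shares[n - 1] = secret.subtract(acc).mod(P);
        return shares;
    }

    public static void main(String[] args) {
        // Each party's private value, e.g. a hypothetical exposure count.
        BigInteger[] inputs = {
            BigInteger.valueOf(12), BigInteger.valueOf(7), BigInteger.valueOf(30) };
        int n = inputs.length;

        // Each party splits its input and sends one share to every party.
        BigInteger[][] shares = new BigInteger[n][];
        for (int i = 0; i < n; i++) shares[i] = share(inputs[i], n);

        // Each party sums the shares it received (one column) and publishes
        // only that partial sum, which reveals nothing on its own.
        BigInteger total = BigInteger.ZERO;
        for (int j = 0; j < n; j++) {
            BigInteger partial = BigInteger.ZERO;
            for (int i = 0; i < n; i++) partial = partial.add(shares[i][j]);
            total = total.add(partial);
        }
        System.out.println(total.mod(P)); // 49: the sum, with no input revealed
    }
}
```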

InfoQ: Could you provide additional examples of systems or applications where SMPC is essential to protect privacy?

Manoske: SMPC allows for groups of valid users – who may not completely trust each other – to operate collaboratively on protected data. When combined with strong rules on how those valid users operate on protected data, SMPC can greatly enhance the security of use cases such as:

  • Financial Information Disclosure: Allow companies to disclose sensitive information such as earning reports in a way such that nobody gains early access to that data (and can commit insider trading to profit off of non-public information). SMPC can protect data rooms for upcoming IPO/M&A transactions and other disclosure systems akin to the SEC’s EDGAR filing system, thereby minimizing the possibility of a side channel attack to commit market manipulation as seen in the 2016 EDGAR cyberattacks.

  • Electronic Medical Records (EMR): Allow organizations such as hospitals, pharmacies, and insurance providers to share EMRs about patients widely across systems without exposing sensitive HIPAA-protected personally identifiable information (PII). Using SMPC, valid organizations could be granted temporary access to PII in such a way that an adversary couldn’t steal their credentials or decrypted copies of the PII data.

  • Defense and National Security Information: Side channel attacks are not just used by criminal hackers. Intelligence organizations frequently use side channel attacks to compromise well-protected data. Major cyberattacks, such as the campaign of attacks on American aerospace manufacturers during the late 2000s and early 2010s, have employed side channel attacks to steal encryption keys protecting classified information. In the case of these attacks, spies were able to infiltrate classified environments and exfiltrate unencrypted system data using legitimate credentials. SMPC allows that data to remain constantly encrypted and minimizes a “walk away” side channel attack, where a temporarily validated user steals decrypted data.

InfoQ: What are the inherent complexities of using SMPC?

Manoske: SMPC requires software to employ one of the following: comprehensive, and often complicated, key management by trusted external systems, or new cryptographic ciphers that allow some analysis to be performed in-line on encrypted data.

Building systems hardened to protect encryption keys, while not complicating applications’ ability to validate their identity (and not making it harder to develop software), is a major challenge. This challenge is made even harder by having to manage keys over large, distributed infrastructures with parties who do not necessarily trust each other.

There are new cryptographic ciphers that use homomorphic encryption to allow applications to perform analysis on encrypted data. Some of these algorithms are being vetted by the cryptographic community. For example, the US National Institute of Standards and Technology (NIST) is reviewing homomorphic encryption ciphers in the Post-Quantum Cryptographic Standardization project, as they are resilient to attacks from quantum computers.

It will still take some time for the community to confirm homomorphic encryption is safe for use, and projects like NIST’s PQCSP won’t be completed until the mid-2020s.

InfoQ: How does HashiCorp Vault fit into this picture? What features does it provide, and how can it enable the creation of SMPC-based applications and systems?

Manoske: HashiCorp Vault manages the keys through which all applications and systems access secrets. Because no system has access to the keys used to encrypt Vault-protected secrets at rest, adversaries cannot steal those keys to extract secrets from Vault.

HashiCorp Vault also abstracts identity away from a single credential, allowing administrators to set policies for access to data based on a logical identity rather than a single credential.

For example, a person may have a number of ways of verifying their identity in their wallet: a credit card, a driver’s license, business cards, etc. Vault treats applications as people by allowing them to have their logical identity attributed by any of those elements in their “wallet.” This allows for identity-based security, ensuring administrators can write policies for the logical identity of a workflow while letting Vault handle the complexity of identifying and attributing users or applications from their various credentials.

When combined with secret engines such as Vault’s Transform Secret Engine, Vault can be used to implement SMPC by ensuring all data residing in a shared system is encrypted. That data can be encrypted using data type protection, to ensure that sensitive data preserves its type and formatting, and only approved parties can temporarily access sensitive data in a way that doesn’t give them access to the encryption keys. This allows parties who may not trust each other to collaboratively operate on encrypted data without accessing the keys used to protect that data.

InfoQ will keep reporting on efforts to build privacy-protecting systems to track COVID-19.
