Article: Key Takeaway Points and Lessons Learned from QCon London & Plus 2022

Uncategorized

Article: Key Takeaway Points and Lessons Learned from QCon London & Plus 2022

MMS • Abel Avram

Article originally posted on InfoQ. Visit InfoQ

QCon international software development conference returned to London this past April and online with QCon Plus in May.

The conference explored the major trends in modern software development across 15 individually curated tracks. The 2022 QCon London and QCon Plus tracks featured in-depth technical talks from senior software practitioners covering developer enablement, resilient architectures, modern Java, Machine Learning, WebAssembley, modern data pipelines, the emerging Staff-Plus engineer path, and more.

InfoQ Editors attended both conferences. You can read their coverage online.

Architectures You’ve Always Wondered About

Shopify’s Architecture to Handle the World’s Biggest Flash Sales by Bart de Water, Manager @Shopify Payments Team

Bart de Water talked about Shopify’s architecture to handle what he called the “world’s biggest flash sales.” His presentation was very technical, covering details about the backend built on Ruby on Rails with MySQL, Redis, and Memcached, with Go and Lua in several places for performance reasons. The frontend is done in React and GraphQL, with the mobile application written in React Native.

He focused mostly on the storefront and checkout because these have the most traffic, the two having different characteristics and requirements, with the storefront mostly about read traffic. The checkout covers most of the writing and interacts with external systems. Shopify supports various payment systems.

According to de Water, Shopify’s main challenge is dealing with flash sales and some weekend sales. To show what scalability issues they are dealing with, de Water said, “Here are last year’s numbers. $6.3 billion total sales with requests per minute peeking up to 32 million requests per minute.” To check that everything works, the team performs load testing in production using Genghis, a custom load generator that creates worker VMs that execute Lua scripts describing the behavior that they want and a resilience matrix simulating an outage to see what happens when one would take place, and what needs to be done to prevent one in production. They also use a circuit breaker to protect datastores and APIs, activating it when necessary.

At the end of the talk, de Water answered specific questions on handling big clients, moving to microservices, pods, scaling the storefront and checkout, using Packwerk to enforce boundaries between teams, and others.

Debug, Analyze, and Optimize… in Production!

An Observable Service with No Logs by Glen Mailer, Senior Software Engineer @Geckoboard

Mailer talked about a microservices project written in Go that they did at CircleCI. They needed a new service for a build agent, and decided to build it in Go (rather than Clojure, as many other services had been). They also wanted a new approach to logging, one that was not based on text logs but rather used what he called an “event-first approach,”—a data-first telemetry instead of a text-first telemetry as it is for regular logs. What takes place in a system is described through events with each having a unique name throughout the codebase identifying what takes place in the system. The properties of the respective event are structured data. The resulting log lines are made of structured data.

He then showed us some Go code on how this event-first observable service is written, mentioning the patterns that emerged while developing it and how they implemented them. Also, he showed the interface built on top of it, using it to visualize logged data and interpret it.

The project started as an experiment, a partial success according to Mailer. He learned a lot about what was important to log. Having a new approach to logging was, in his opinion, much better than logging some text lines when some code gets executed.

What Mailer didn’t like about the result was that it was like “reinventing a bunch of wheels.” He still considered the process valuable for their team because they discovered together what to include and exclude from the library they had built up.

Concluding, Mailer “tentatively recommends this concept of a unified telemetry abstraction, having this one place to produce logs, metrics, and events.” Still, he had some concerns because others did not find the new service as intuitive as hoped. He considered the new logging service good, but not good enough to rip the present system apart and start from scratch. He recommended looking at OpenTelemetry, an open-source project somewhat similar to theirs.

Impressions expressed on Twitter included:

@jessitron: This morning at #QConLondon, @glenathan tells a story of an experiment at CircleCI: for this greenfield project, let’s have NO LOGGING.

@jessitron: Instead of logs, events! that make traces! “Events are good logs.” They’re data-first instead of human-first; they wrap units of work with durations; they can be explored multiple ways @glenathan #QConLondon

@jessitron: They found best practices like: wrap each unit of work; report errors in a standard field; use span names that are specific enough to tell you what’s happening but general enough for useful grouping. ~ @glenathan #QConLondon

@jessitron: Then they tuned the span creations so that there wasn’t a bunch of duplicate-looking ones. Instead of repetitive spans, add more attributes. “Add every identifier you have in your system” to help find problems. @glenathan #QConLondon

@jessitron: If you look at the “three pillars” of observability from the top, it’s all the same events. @glenathan #QConLondon

Profiles, the Missing Pillar: Continuous Profiling in Practice by Michael Hausenblas, Solution Engineering Lead @AWS

Hausenblas discussed the need for observability in large, sometimes distributed systems. He defined observability as “the capability to continuously generate and discover actionable insights based on certain signals from the system under observation.” He mentioned several types of signals: text logs, usually consumed by humans, then metrics, labeled as numerical signals, and distributed traces used for “propagating an execution context along a request path in a distributed system.”

He then added another signal type, namely profiles which he saw in the context of the source code and the running process associated, exploring them to find out more about the system monitored. Unlike metrics, which have their benefits, profiles allow going to a specific line of code that is responsible for a certain metric that is out of normal values.

He also mentioned 3 open source continuous profiling tools that developers can install and use right away:

Parca—inspired by Prometheus.
Pyroscope—a rich tool with broad platform support that supports several language profiles and eBPF.
CNCF Pixie from the Cloud Native Computing Foundation donated by New Relic, currently a sandbox project. It uses eBPF for data collection.

Hausenblas predicted that eBPF will become the “unified collection method for all kinds of profiles.” All profiling tools recommended by Hausenblas use eBPF.

Hausenblas concluded his session by answering questions on Profiling and the Development Life-Cycle, differences between eBPF and language-specific tools, and the missing piece in Continuous Profiling. Highlighting the benefits of using eBPF, he said that eBPF does not care if the program was written in a compiled or interpreted language. Also, it captures both function calls and syscalls, all in one place.

Impressions expressed on Twitter included:

@PierreVincent: “Observability is the capability to continuously generate and discover actionable insights based on signals from the system under observation with the goal to influence that system” and that’s for both people (e.g. debugging) and automation (e.g. autoscaling) @mhausenblas #QConLondon

@PierreVincent: Observability can go beyond usual metrics, logs, and traces: @mhausenblas introducing profiles and eBPF #QConLondon

@PierreVincent: Profiles with sampling are low overhead, so great to apply in production. @mhausenblas #QConLondon

@PierreVincent: High-level architecture of Continuous Profiling, from agent and ingestion to frontend for visualization. @mhausenblas #QConLondon

@PierreVincent: Continuous Profiling with Parca, open-source project, with concepts and query language very close to Prometheus. @mhausenblas #QConLondon

@dberkholz: Three tools for continuous profiling in production: Parca, Pyroscope, and Pixie. All of them support eBPF, which is becoming the unified collection method. -@mhausenblas #qconlondon

@PierreVincent: Looking ahead, eBPF is becoming the unified collection method for profiles. Profiling standards seem to be converging towards pprof with its increasing support @mhausenblas #QConLondon

@estesp: Lots of eBPF today at #QConLondon! My colleague @mhausenblas making the case for continuous profiling as a key observability pillar for your applications.

@PierreVincent: Correlation remains a challenge with profiles, to link together all the different signal types, rather than dealing with multiple independent tools. @mhausenblas #QConLondon

@danielbryantuk: “It is vital to be able to correlate signals and metrics to provide insight into an application” @mhausenblas at #QConLondon

@PierreVincent: Is 2022 the year of Continuous Profiling? @mhausenblas #QConLondon

Effective Microservices: What It Takes to Get the Most Out of This Approach

Airbnb at Scale by Selina Liu, Senior Software Engineer @Airbnb

Liu shared lessons learned at Airbnb during their migration from a monolith to SOA. The first lesson was to invest early in the needed infrastructure because that will “help to turbocharge your migration from monolith to SOA.”

They approached the migration by decomposing the presentation layer into services that “render data in a user-friendly format for frontend clients to consume.” Below those services, they have mid-tier services with various business rules which use data services connected to databases to access data uniformly. After migration, their application was made up of 3 layers: presentation, business, and data running as services. The higher layer may call a lower layer but not the other way around.

Liu then explained using a concrete case how they migrated Host Reservation Details, a certain feature they used, to SOA, as an example to show how they tackled the migration process.

To help with the migration process, they used an API Framework used by all Airbnb services to talk to each other. Then they had a multi-threaded RPC client handling features like error propagation, circuit breaker requests, and retries. That helped engineers focus on the business logic to create new functionality rather than inter-service communication.

After that, she talked about Powergrid, a Java library used to run code in parallel. Powergrid “helps us organize our code as a directed acyclic graph, or a DAG, where each node is a function or a task.” This library enables them to perform low-latency network operations concurrently.

Then she mentioned OneTouch, a “framework and a toolkit built on top of Kubernetes that allows us to manage our services transparently and to deploy to different environments efficiently.” One last piece of useful infrastructure she mentioned was Spinnaker, defined as an “open source continuous delivery platform that we use to deploy our services. It provides safe and repeatable workflows for deploying changes to production.” These pieces of infrastructure helped them migrate their monolith to 400+ services with over 1000 deploys daily over 2-3 years.

She also mentioned that “as we continue to evolve our SOA, we also decided to unify our client-facing API. Our solution to the problem is UI Platform, a unified, opinionated, server-driven UI system.”

As another conclusion, she added that having the UI generated with the backend services allows them to have a “clear schema contract between frontend and backend, and maintain a repository of robust, reusable UI components that make it easier for us to prototype a feature.”

Liu’s session ended by answering questions, one of them regarding the use of SOA instead of microservices. She said “I think, tech leaders at that time, they really just used the word SOA to push the concept wider to every corner of the company. I think partially it might be because, at the start, we were wary of splitting our logic into tiny services. We wanted to avoid using microservices, just because the term might be a little bit misleading or biased in just how it’s understood by most people when they first hear it. It’s probably just like the process of how we communicate it, more than anything else.”

Liu also talked a bit about the technologies used and the benefits resulting from migration.

Impressions expressed on Twitter included:

@danielbryantuk: “building and deploying services at @AirbnbEng is easy due to all code and config being located in one place, and our use of tooling to provide key abstractions and simple command-line actions” @selinahliu #QConLondon

@danielbryantuk: Four pieces of advice from @selinahliu at #QConLondon in relation to the @AirbnbEng migration from a monolith to microservices – invest in common infra – simplify service dependencies – platformize data hydration – unify client-facing API

Pump It Up! Actually Gaining Benefit from Cloud Native Microservices by Sam Newman

In this session, Sam Newman discussed cloud and microservices, what to do about them to succeed, and shared some tips on how they should be done and certain pitfalls to avoid. He started by wondering why cloud-native and microservices have not delivered as expected.

Part of the unsuccessful pairing of cloud and microservices is, in his opinion, a certain level of confusion on the role of the cloud in an organization. He tells the story of visiting a company in Northern England where developers said they wanted access to the cloud. When he told their desire to the CIO, he discovered they already had access to the cloud. It turned out that if a developer wanted access, he had to create a JIRA ticket and someone from the system administration would intervene and provide the needed access. It was like using a resource on premises in the old way but now in the cloud. This made the process inconvenient for developers, slowing the entire development process.

Talking about the revolution introduced by AWS with cloud computing, Newman said “AWS’s killer feature wasn’t per-hour rented managed virtual machines, although they’re pretty good. It was this concept of self-service. A democratized access via an API. It wasn’t about rental, it was about empowerment.”

He continued inviting leaders to trust their people, adding “this is difficult because, for many of us, we’ve come from a more traditional IT structure, where you have very siloed organizations who have very small defined roles.” To address this issue he suggested leaders think in terms of long-lived product teams who own major parts of the project—think “holistically about the end-to-end delivery of functionality to the users of your software.”

That’s why Newman considered microservices to be of particular value because teams can own their microservice.

Besides this ownership issue, Newman pinpointed the lack of proper training in using the tools they have as a pitfall to a project’s success. He considered it not enough to throw some tools at them but follow up with the necessary training on using those tools.

Going further, he advised creating a good developer experience by treating the microservices platform as a product, talking to developers, testers, and security personnel to find out what they need to succeed, to create an “experience for your delivery teams that makes doing the right thing as easy as possible,” making developers, testers, sysadmins your own customers.

Impressions expressed on Twitter included:

@danielbryantuk: “There is a danger with platform abstractions that we artificially create constraints that stop people getting their work done” @samnewman discussing there may be no one-size-fits-all platforms at #QConLondon

@danielbryantuk: “The job of the platform team is not about building the platform, it’s about enabling developers. This includes outreach, training, and more. If you do build a platform, treat it as a product” @samnewman #QConLondon

@danielbryantuk: Beware of your platform becoming a requirement rather than an enabler… @samnewman #QConLondon

@danielbryantuk: “if you focus on durable, product-focussed teams, you can reduce the governance and centralized control required to make progress” @samnewman #QConLondon

@danielbryantuk: “Make your platform easy to use. This is essential if your platform is optional. A platform with no users should generate a lot of reflection” @samnewman #QConLondon

About the Author

Abel Avram

Show moreShow less

Mobile Monitoring Solutions

Uncategorized