Learning to Bend but Not Break at Netflix: Haley Tucker Discusses Chaos Engineering at QCon NY
MMS • RSS
At QCon New York, Haley Tucker presented “UNBREAKABLE: Learning to Bend but Not Break at Netflix” and discussed her experience with chaos engineering while working across a number of roles at Netflix. Key takeaways included: use functional sharding for fault isolation; continually tune RPC calls; run chaos experiments with small iterations; watch out for the environmental factors that may differ between test and production; utilise canary deployment; invest in observability; and apply the “principles of chaos” when implementing supporting tooling.
Tucker began the talk by discussing that a key performance indicator (KPI) for the Netflix engineering team is playback Starts per Second (SPS). Netflix has 185 million customers globally, and the ability to start streaming content at any given moment is vital to the success of the company. The Netflix system is famously implemented as a microservice architecture. Individual services are classified as critical or non-critical, depending on whether or not they are essential for the basic operation of enabling customers to stream content. High availability is implemented within the Netflix system by functional sharding, Remote Procedure Call (RPC) tuning, and bulkheads and fallbacks. The design and implementation of this is verified through the use of chaos engineering, also referred to as resilience engineering.
Tucker has worked across a number of engineering roles during her five years at Netflix, and accordingly divided the remainder of her presentation into a collection of lessons learned from three key functions: non-critical service owner, critical service owner, and chaos engineer. The primary challenge as a non-critical service owner is how to fail well; as a critical service owner it is how to stay up in spite of change and turmoil; and as a chaos engineering it is how to help every team build more resilient systems.
The first question to ask when owning a non-critical service is “how do we know the service is non-critical?”. An example was presented using the “movie badging” service, a seemingly non essential service that provides data to allow movie audio/visual metadata, such as “HDR” and “Dolby 5.1”, to be displayed as a badge on the UI. Although this non-critical service was called from a downstream service via the Hystrix fault tolerance library, and included a unit- and regression-tested fallback, an untested data set caused an exception to be thrown for a subset of user account, which bubbled up through the call stack and ultimately caused the critical API service to fail. This resulted in a series of playback start failures for the subset of customers affected.
Key takeaways for being a non-critical service owner include: environmental factors may differ between test and production e.g. configuration, data, etc; systems behave differently under load than they do in a single unit or integration test; and users react differently than you often expect. It was stressed that fallbacks must be verified to behave as expected (under real-world scenarios), and chaos engineering can be used to “close the gaps in traditional testing methods”.
The next section of the talk focused on critical service owners. Here the essential question to ask is “how do we decrease the blast radius of failures?”. Tucker explained that the Netflix engineering team experimented with sharding services into critical and non-critical cluster shards, and then failed (in a controlled manner) all of the non-critical shards. Although all of the critical services remained operational, a 25% increase in traffic was seen on the critical service shards. Upon investigation it was determined that this additional traffic was the result of the critical services having to perform additional tasks that were previously initiated by the non-critical services. For example, non-critical services typically pre-fetch or pre-cache certain content that is predicted to be required in the near future.
An additional question for critical service owners to ask is “how do I confirm that our [RPC] system is configured properly?”. For example, how do engineers know that they have optimally configured retries, timeouts, load balancing strategies, and concurrency limits. The answer is to run chaos experiments and inject latency and failures into service calls and observe what happens. Tucker cautioned that this is especially challenging when working within a fast-paced organisation like Netflix, which has a constantly changing code base and service call graph, and therefore experiments must be run continually.
The key takeaways for critical service owner included: use functional sharding for fault isolation; continually tune RPC calls; use chaos testing, but make few changes between experiments in order to make it easier to isolate any regressions; and fine-grained chaos experiments help to scope the investigation, as opposed to outages where there are potentially lots of “red herrings”.
For the final section of the talk Tucker discussed her role as a chaos engineer. The primary question she has been asking here is “how do we help teams build more resilient systems?”. The answer is to do “more of the heavy lifting” and provide tooling and guidance to service teams. After introducing The Principles of Chaos, the discussion moved on to the importance of running chaos experiments in both test and (critically) in production. Netflix engineers always attempt to minimise the blast radius during experiments, and have implemented a “stop all chaos” button that, when triggered, immediately terminates all chaos experiments globally.
Extensive use is made of canary deployments and testing, with a small percentage of production traffic split off to both a control and experiment version of a service. A maximum of 5% of total traffic is allowed to be exposed to chaos at any one time, and experiments are limited around holiday times (when customer demand is at its peak). It is essential to address any failures identified during an experiment, and the Netflix Chaos team’s web UI will not allow experiments that resulted in issues to be re-run unless verified and acknowledged manually. For safety Tucker recommended that chaos experiments should “fail open”, and this should be triggered when the control service errors are high, errors are detected in the service that are unrelated to the experiment, or platform components have crashed.
Chaos experiments should always start with a hypothesis, and an essential prerequisite of this is that the steady-state of the system must be known. The best way to achieve this is through comprehensive observability, such as effective monitoring, analysis and insights. The open source Netflix continuous deployment tool Spinnaker was discussed, alongside the “Kayenta” Automated Canary Analysis (ACA) integration. The Netflix Chaos Automation Platform (ChAP) was also briefly discussed, as were the verifications the platform undertakes when running chaos experiments, including: validating the fault injection occurred correctly; validating that KPIs are within expected values; checking for service failures, even if this did not impact KPIs; and checking the health of the service under test.
Real-world events can be validated in an automated fashion by carefully designing and prioritising experiments. The system under test must be well understood, and any potential ramifications of testing know. For example, the ChAP web UI highlights services that have no fallbacks to warn engineers accordingly. When prioritising which experiments to run, factors such as the percentage of traffic a service receives, the number of retries on a service call, and the experiment type (failure or latency) should be considered, as should the length of the period since a previous experiment was run. However, safety (preventing a customer impacting outage) is always the top priority.
This section of the talk was concluded with a brief case study on running ChAP experiments in production, where fourteen vulnerabilities were found (identifying key tooling gaps) and zero outages were observed. Key takeaways from being a chaos engineer included applying the principles of chaos to tooling, and providing self-serve tooling in order to prevent service teams from doing the “heavy lifting” of running chaos experiments themselves.
In conclusion, Tucker discussed that any organisation can potentially get value from starting a chaos practice. The company does not have to operate at the scale of Netflix in order to create hypotheses and design and run basic resilience experiments in test and production. Quoting one of her favourite Netflix shows, “Unbreakable Kimmy Schmidt“, she closed the presentation by stating that when dealing with the inevitable failure scenarios “You can either curl up in a ball and die… or you can stand up and say ‘We are different. We are the strong ones, and you cannot break us!'”
Additional information about Haley Tucker’s QCon New York talk can be found on the conference website, and the video of the talk will be released on InfoQ over the coming months.