Keystone has been operational since December 2015 and has grown significantly as Netflix's subscriber base grew from 65 million in Q2 2015 to 130 million at the time of writing. Keystone started out as an Apache Chukwa pipeline and evolved into a Kafka-fronted pipeline over time. Back in 2016, Netflix was running 36 Kafka clusters processing more than 700 billion messages per day, as explained in their blog post.
Netflix’s architecture consists of two distinct real-time stream processing platforms: Keystone focuses on data analytics, while Mantis focuses on operations. Keystone provides data pipeline functionality and “Stream Processing as a Service”. The data pipeline collects, processes and analyzes data from all the different microservices that Netflix operates, in near real time. Stream Processing as a Service allows internal users to focus on business application logic while developing and operating custom stream processing applications.
The main challenges Netflix had to resolve while building and extending the platform are similar to those engineers face when building any distributed system at scale. The routing service supports tuneable at-least-once delivery semantics, trading latency against delivery guarantees.
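To illustrate that tradeoff, here is a minimal Python sketch (not Keystone’s actual routing code) in which retrying until a flaky sink acknowledges a message yields at-least-once delivery, at the cost of higher worst-case latency and possible duplicates on a lost acknowledgement:

```python
import random

def deliver(message, flaky_sink, max_retries):
    """Send `message`, retrying until the sink acknowledges it.

    max_retries=0 approximates fire-and-forget (lower latency, messages
    may be lost); a high max_retries approximates at-least-once (higher
    worst-case latency, and the sink may see duplicates when an ack is
    lost after the message was actually received).
    """
    attempts = 0
    while True:
        attempts += 1
        acked = flaky_sink(message)
        if acked or attempts > max_retries:
            return acked, attempts

def make_flaky_sink(failure_rate, received, rng):
    """A sink that records every attempt but only acks some of them."""
    def sink(message):
        received.append(message)  # side effect happens even when the ack is "lost"
        return rng.random() > failure_rate
    return sink

rng = random.Random(42)
received = []
sink = make_flaky_sink(failure_rate=0.3, received=received, rng=rng)

delivered = sum(deliver(m, sink, max_retries=10)[0] for m in range(100))
print(delivered, len(received))  # all 100 delivered; retries inflate the receive count
```

With retries enabled, every message gets through, but the sink sees more deliveries than messages sent; deduplication, if needed, is pushed downstream.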
Keystone leverages Apache Flink and can support stateless and stateful jobs, bursty or constant traffic, window sizes ranging from seconds to hours, strict ordering if required, and configurable guarantees on message delivery. Resource contention can also become an issue for system design, as different jobs may contend for CPU, memory, I/O or network bandwidth. Users of the system range from software engineers to business analysts. All these challenges, combined with the goal of implementing a multi-tenant, cloud-based system easy enough for its users to declare and execute most jobs without relying on operational colleagues, make for an interesting set of design requirements.
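As a rough illustration of the windowed aggregation such jobs express, the following toy Python function (a simplification of what a Flink job would do, not Keystone code) groups timestamped events into fixed, non-overlapping tumbling windows and counts events per key:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size_s):
    """Assign each (timestamp_s, key) event to a fixed tumbling window
    and count events per key per window. Window size is configurable,
    mirroring the seconds-to-hours range mentioned above."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_size_s) * window_size_s
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical playback events: (timestamp in seconds, event type).
events = [(1, "play"), (3, "play"), (4, "pause"), (61, "play"), (65, "play")]
result = tumbling_window_counts(events, window_size_s=60)
print(result)
# {(0, 'play'): 2, (0, 'pause'): 1, (60, 'play'): 2}
```

A real Flink job would additionally handle event-time watermarks, late data and state checkpointing, which this sketch deliberately omits.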
Keystone’s platform mindset can be summed up as enabling users to get the job done. Its essential cornerstones are tuneable tradeoffs, separation of concerns, and treating sub-system failure as something that can and will happen (‘failure as a first-class citizen’).
The Netflix engineering team’s approach to Keystone’s design uses a declarative reconciliation protocol. Every user-declared goal state is stored in AWS RDS, which acts as the single source of truth. If, for example, a Kafka cluster disappears, it can be reconstructed solely from the AWS RDS data.
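The idea behind declarative reconciliation can be sketched in a few lines of Python; the job names and config shapes below are hypothetical, not Keystone’s actual schema:

```python
def reconcile(declared, running):
    """Compare the user-declared goal state (the source of truth) with
    what is actually running, and return the actions needed to converge.

    Both arguments map job name -> config dict. Jobs missing from
    `running` are started, jobs whose config drifted are redeployed,
    and jobs no longer declared are stopped.
    """
    actions = []
    for job, config in declared.items():
        if job not in running:
            actions.append(("start", job))
        elif running[job] != config:
            actions.append(("redeploy", job))
    for job in running:
        if job not in declared:
            actions.append(("stop", job))
    return actions

declared = {"route-to-hive": {"parallelism": 4}, "route-to-es": {"parallelism": 2}}
running = {"route-to-hive": {"parallelism": 2}, "stale-job": {"parallelism": 1}}
print(reconcile(declared, running))
# [('redeploy', 'route-to-hive'), ('start', 'route-to-es'), ('stop', 'stale-job')]
```

Because the loop works only from declared state, rerunning it after a total loss of running infrastructure rebuilds everything, which is exactly the recovery property described above.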
Deployment orchestration is implemented via the continuous delivery tool Spinnaker, with an independent Flink cluster for each job. The only components shared across jobs are ZooKeeper, for consensus coordination, and S3, for storing checkpoint state. Self-service tooling lets users declare jobs through a user interface for routing jobs and through a CLI for Stream Processing as a Service.
A set of internally developed managed connectors for Kafka, Elasticsearch, Hive and other systems helps developers who use Keystone move faster, without worrying about the internals of the platform or message parsing. Custom Domain Specific Language (DSL) libraries abstract away filtering, projection and other commonly used data transformation tasks. The platform provides self-healing through the AWS RDS reconciliation mechanism, and in the case of a failure it can backfill or rewind a job with the data that is needed, through a user interface. Finally, the platform comes with monitoring and alerting built in.
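A fluent filter/project DSL of the kind described might look roughly like the following Python sketch; the `Stream` class and the event fields are illustrative only, not Keystone’s actual library:

```python
class Stream:
    """A tiny fluent DSL over an iterable of event dicts, sketching the
    filter/project abstractions a stream-processing DSL might expose."""

    def __init__(self, events):
        self.events = list(events)

    def filter(self, predicate):
        # Keep only events matching the predicate.
        return Stream(e for e in self.events if predicate(e))

    def project(self, *fields):
        # Keep only the named fields of each event.
        return Stream({f: e[f] for f in fields if f in e} for e in self.events)

    def collect(self):
        return self.events

# Hypothetical playback events.
events = [
    {"type": "play", "title": "Stranger Things", "device": "tv"},
    {"type": "error", "title": "Ozark", "device": "mobile"},
]
plays = Stream(events).filter(lambda e: e["type"] == "play").project("title").collect()
print(plays)  # [{'title': 'Stranger Things'}]
```

The value of such a DSL is that users express *what* transformation they want, while the platform decides where and how the job runs.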
Future development of the Keystone platform includes, among other things, a service layer, streaming SQL support and machine learning capabilities, all to be detailed in future Netflix engineering blog posts.