Article: Scaling and Growing Developer Experience at Netflix

By Phillipa Avery

Article originally posted on InfoQ.

Key Takeaways

  • The balance between stability and velocity will change over time, with newer companies emphasizing velocity over stability, and valuing stability more as the product matures.
  • A unified, centrally supported technical stack can combine agile development with stable support processes, allowing a company to scale quickly without sacrificing either velocity or stability.
  • The decision to support a technology centrally should be an ongoing process, taking into account the cost effectiveness and business impact of the support needs.
  • Large business decisions like monorepo vs. multi-repo, or build vs. buy, will be key choices with long-lasting repercussions. It’s important to analyze the needs of the company and make the choice based on what the company values regarding agility, stability and ongoing development cost.
  • Being able to easily trial new centrally supported technology and fail fast if it doesn’t succeed is integral to growth.

An optimal Developer Experience depends a lot on the company the developer works for – taking into account the company’s values, culture and business drivers. As the company grows, what it values will shift to adapt to that growth, and in turn how a developer needs to create, deploy and maintain their code will change too. This article discusses why and when changes to developer needs occur, how to get ahead of them, and how to adapt when they are necessary. I talk through some of the experiences my peers and I have had at Netflix, identifying key lessons and examples we have gained over the years.

The impact of growth

The balance between velocity and stability changes over time as a company grows. When a company first starts its journey, it is iterating quickly and experimenting with different approaches. It often needs to produce something quickly to test, and then either scrap or extend the approach. This kind of “prototyping” phase often doesn’t need an investment in stability, as failures will have a limited scope of impact.

As the user base of the product grows, however, there is a higher expectation of stability and cost-effectiveness. This often results in a more cautious (and slower) approach to changes to the product, so the push for stability will affect the ability to maintain a high velocity. This can be mitigated by putting in place a well-founded and stable CI/CD system, which creates a level of trust that enables higher velocity. When developers trust that their changes will not break production systems, they can focus more on the velocity of innovation rather than spending excessive time on manual validation of the changes they make.

As the business grows, this level of trust in the release process is critical, and well-founded testing practices (such as the Testing Pyramid) will set the process up for success. Even with this in place, though, a well-rounded CI/CD process will incur a hit to velocity through the validation process, and this hit will grow as the complexity of the overall product system grows.

To illustrate this point, over the years Netflix has introduced a number of validation steps into our standard CI/CD process, including library version dependency locking (to isolate failures to a particular library), automated integration and functional tests, and canary testing. Each of these stages can take a variable amount of time depending on the complexity of the service. If a service has many dependencies, it is more likely that one of those dependencies will fail the build process and need debugging. A service that is more functionally complex will have a larger testing footprint that takes longer to run, especially if more integration and functional tests are needed. When running a canary on a service that handles multiple types of traffic patterns (e.g. different devices, request types, data needs), a longer canary run is needed to eliminate noise and ensure coverage of all these patterns.

To remain flexible with the above needs, we lean into our microservice approach to create services with a decomposed functional footprint, allowing smaller dependency graphs, shorter testing times, and less noisy canaries. Additionally, we avoid blocking the release process without providing an easy override. If a dependency version fails a build, it’s easy to roll back or lock to the previous version for the failing service. Test failures can be analyzed and fixed forward, ignored (and hopefully re-evaluated later), or modified depending on the changes made. Canary failures can be individually analyzed for their cause, and the developer can choose to move forward with the release (bypass) as needed. The balance of velocity vs. stability in CI/CD is ultimately decided by the service maintainers, depending on their own comfort levels and the business impact.
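
To make the dependency-locking stage concrete, below is a minimal sketch using stock Gradle’s built-in locking. Netflix’s own pipeline is built around Nebula tooling, so treat this as an illustration of the concept rather than the actual setup.

```kotlin
// build.gradle.kts – minimal sketch of dependency locking with stock Gradle.
// Netflix's pipeline uses Nebula-based tooling; this only illustrates the idea.
plugins {
    java
}

repositories {
    mavenCentral()
}

// Lock every configuration so resolved versions are written to gradle.lockfile;
// a new library release then shows up as an explicit, reviewable lockfile change
// rather than silently altering what the next build consumes.
dependencyLocking {
    lockAllConfigurations()
}

dependencies {
    implementation("com.google.guava:guava:33.0.0-jre")
}
```

Running `./gradlew dependencies --write-locks` records the current resolution; if a later upgrade fails the build, rolling back is just a matter of reverting the lockfile for the affected service rather than blocking the release.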

Centralized vs local tools

At some point, a company may need to decide whether it will let developers individually choose and maintain the technology for their business needs, or provide a recommended (or mandated) technology which is then centrally supported by the company. The way I think about the choice between centralized and local is that centralized offerings enable a company to provide consistency across the entire product ecosystem. This can be consistent provision of integrations (security, insights, resiliency, CI/CD, etc.) or of best practices (architectures, patterns, dependency management, etc.). This centralized consistency can be very powerful from an overall business perspective, but might actually be detrimental for a particular use case. If you define a single solution that provides a consistent approach, there will almost always be a use case that needs a different approach to succeed with its business driver.

As an example, we specify Java with Spring Boot as our supported service tech stack. However, there are a number of use cases where data engineering will need to use Python or Scala for their business needs. To further build on this example, we use Gradle extensively as our build tool, which works really well for our chosen tech stack, but for the developers using Scala, SBT might be a better fit. We then need to evaluate whether we want to enhance our Gradle offering for their use case, or allow (and support) the use of SBT for the Scala community.
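
As a rough sketch of what the paved path build setup might look like (plugin versions here are placeholders, not Netflix’s actual configuration, which manages them through shared plugins), a Java service on Spring Boot and Gradle amounts to a small Kotlin DSL build script:

```kotlin
// build.gradle.kts – illustrative Java + Spring Boot service build on Gradle.
// Versions are placeholders; a centrally supported stack would typically manage
// them through shared plugins rather than in each service's build script.
plugins {
    java
    id("org.springframework.boot") version "3.2.5"
    id("io.spring.dependency-management") version "1.1.4"
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("org.springframework.boot:spring-boot-starter-web")
    testImplementation("org.springframework.boot:spring-boot-starter-test")
}

tasks.test {
    useJUnitPlatform()
}
```

An equivalent Scala service would express the same build in an SBT build.sbt, which is exactly the divergence the central team has to weigh: extend the Gradle offering for Scala, or take on supporting a second build tool.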

Getting the right balance between the weight of the centralized benefits and the local business needs, and being able to evaluate the trade-offs, is an ongoing and continually evolving process. Whether a use case should be considered for centralized support should be evaluated by looking at the data – how many users are on the tech stack in question, what is the business impact (bottom dollar) of the workflow on the stack, and how many people would it take to support the stack centrally? All these factors should be considered, and if there is sufficient prioritization and room for growth, then the tech stack should be moved to centralized support.

With Netflix’s culture promoting freedom and responsibility, we will often see developers decide to choose their own solutions for their use cases, and take on the responsibility for that choice. This can be a great option for small use cases with low business impact. If the scale of impact is likely to grow, however (more people start using the technology, or the use case becomes more important to the business), then this choice can be detrimental to the business long term: a single person supporting the technology becomes a bottleneck to moving quickly or scaling, and if that person moves to a different project, the technology becomes tech debt with no one able to support it.

Given we can’t support all the use cases that would benefit from centralized support, we try to take a layered approach, where we provide decoupled components that can be used across different tech stacks for highly critical centralized needs – for example, security. These can be used (and managed) independently in a localized approach, but become more integrated and centrally managed the more you “buy in” to the entire supported ecosystem – what we refer to as the Paved Path. For the average developer, it is much easier to use the paved path offerings and have all the centralized needs managed for them, while the more unique business cases have the option to self-manage and choose their own path – with a clear expectation of the responsibilities that come with that decision, such as the extra developer time needed when something unsupported goes wrong, what the cost of migrating to the paved path in the future might be (if it becomes supported), and how easy it is to remove the technology from the ecosystem if it proves too costly.

To illustrate this decision process: getting onto the paved path will often require migrating a service to a new technology. In some cases, the upfront disruption and cost of migrating a legacy service is deemed a worse investment than spending developer hours on the service only when things go wrong. We saw this in practice with the recent Log4Shell vulnerability, when we needed to (repeatedly) upgrade the entire fleet’s log4j versions. For the services that were on the paved path, this was done entirely hands-free for developers, and was completed within hours. For services that were mostly on the paved path, there were minimal interactions needed and turnover happened within a day. For services that were not on the paved path, there were multiple days’ worth of developer crunch time needed, with intensive debugging and multiple push cycles to complete. In the grand scheme, however, this was still more cost-effective, with less business impact, than migrating them to the paved path upfront.
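
As a rough sketch of the kind of change an off-paved-path team had to make by hand (this is generic Gradle, not the automated remediation the paved path applied), a dependency constraint can pin the mandated log4j version:

```kotlin
// build.gradle.kts – illustrative manual pin of log4j for a self-managed service.
// Paved path services had this upgrade applied automatically; off the paved path,
// something like this had to be applied (and re-applied) as each follow-up CVE landed.
plugins {
    java
}

dependencies {
    constraints {
        implementation("org.apache.logging.log4j:log4j-core:2.17.1") {
            because("Mitigate Log4Shell (CVE-2021-44228) and follow-up CVEs")
        }
        implementation("org.apache.logging.log4j:log4j-api:2.17.1") {
            because("Keep log4j-api in lockstep with log4j-core")
        }
    }
}
```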

Monorepo or multi-repo strategy

Unfortunately there is no clear answer for how a company should decide between a monorepo or multi-repo strategy, as both approaches will have significant deficiencies as the product scales. The big difference that I can speak to is release velocity for a subset of the product base. With a monorepo it is more difficult to target a release at a subset of the product (by design). For example, if you want to release a code change or new version (e.g. a new JDK version), it can be difficult for application owners to opt in to the change before others. Additionally, a monorepo can be significantly slower to release a new change, as the change must pass validation for the entire product before it can be released.

The Netflix multi-repo approach, on the other hand, provides a highly versatile and fast approach to releases – a new library version is published and then picked up by consuming applications via automated dependency update CI/CD processes. This allows individual application owners to define the version of the code change that they wish to consume (for good and bad), and it is available for consumption immediately upon publication. This approach has a few critical downsides: dependency version management is incredibly complex, and the onus for debugging issues in version resolution is on the consuming application (if you want a deeper understanding of how Netflix solves for this complexity, the presentation on Dependency Management at Scale using Nebula is a great resource diving into the details). When a team releases a new library version, it may be perfectly viable for 99% of the population, but there is often a small percentage of applications with transitive dependency issues that must be identified and resolved.
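
To make that failure mode concrete, here is a hedged sketch of how a consuming application might diagnose and pin around a bad transitive version, using stock Gradle resolution controls rather than Netflix’s Nebula tooling; “com.example:team-library” is a hypothetical artifact used only for illustration.

```kotlin
// build.gradle.kts – illustrative handling of a transitive version conflict in a
// consuming application. Stock Gradle only; Netflix's tooling is Nebula-based,
// and "com.example:team-library" is a made-up artifact.
plugins {
    java
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("com.example:team-library:2.4.0")
}

configurations.all {
    resolutionStrategy {
        // Once the offending transitive version has been identified (for example via
        // `./gradlew dependencyInsight --dependency jackson-databind`), pin it back
        // until the library producer publishes a compatible release.
        force("com.fasterxml.jackson.core:jackson-databind:2.15.3")
    }
}
```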

Long term, we are moving towards a hybrid approach that combines multi-repo development with a mono-release process – individual repository library owners can release a new version, but it must go through a centralized testing pipeline that builds and runs the library against its consumers. If the pipeline fails, the onus is on the library producer to decide what steps to take next to resolve the issue.

Convergence in technical stacks

Whenever a conversation happens about how to move an entire company to a consistent technical stack, you will likely hear the adage of “the carrot or the stick” – do you provide new features and functionality (the carrot) that make adhering to the paved path appealing enough that people opt in themselves, or do you force developers to use paved path offerings (the stick)? At Netflix we try to lean towards the carrot approach, and keep the stick for a small set of high-leverage and/or high business impact needs.

Ideally, the carrot would always be the approach taken. Sometimes, however, a centralized approach has little direct benefit for a specific use case, but from the overall business perspective it has high leverage. These cases will often not have much of a carrot for the individual developers, and can even add extra hurdles and complexity to their existing development workflows. In cases like this, we emphasize the responsibility to act to the benefit of the company, and provide clear reasons for why it is important. We try to reduce any extra burden to the best of our ability and demonstrate the benefits of the consistent approach as much as possible.

On rare occasions, we will take a top-down approach to providing a consistent tech stack, where migrating to the new stack is dictated to take priority over an individual team’s other work. This usually happens for security reasons (such as the aforementioned Log4Shell case), or when the overall business benefits of a consistent tech stack outweigh the individual teams’ needs – for example, at the tail end of a migration, when supporting the remaining use cases has become too expensive to maintain.

Build vs buy

Let’s classify build vs. buy as building entirely in-house vs. using an external offering. At Netflix, we like to lean towards Open Source (OS) when possible, and we both produce and consume a number of OS products.

When possible, we lean towards “buying”, with a preference for OS offerings. If we can find an OS project with high alignment to the requirements and a thriving community, that becomes the most likely candidate. If, however, there are no OS offerings, or there are significant functional differences with existing projects, we will evaluate building in-house. Where the functional need is smaller, we will usually build and maintain entirely in-house. Where the project is larger or has high leverage externally, we will consider releasing it as an OS project.

If you choose to go open source – regardless of whether you publish your own OS project or use an external one – both options will have a developer cost. When publishing a project there is the cost of building the community around the product: code and feature reviews, meetups, alignment with internal usage. All these can add up fast, and popular OS offerings often need at least one developer working full time on OS management. When using external offerings it is important to maintain a working relationship with the community – to contribute back to the product, and to influence future directions and needs so they align with internal usage. It is a risk if the direction of the external product diverges significantly from the company’s use, or if the OS project is abandoned.

Developer experience over time

As the engineering organization grows with the scale of the company, consistency begins to matter more. At the smaller growth stages it’s likely that each developer will be working across multiple stacks and domains – they manage the entire stack. As that stack grows, the need to focus efforts on a specific part becomes clear, and multiple people now work on that stack. As more people become involved in a workflow, and become more specialized in their specific part of the stack, there are increased opportunities to take the things they don’t need to care about off their plate – through more centralized infrastructure, abstractions and tooling. Taking that over from a centralized perspective frees them up to focus on their specific business needs, and a smaller group of people supporting these centralized components can serve a large number of business-specific developers.

Additionally, as the company ages we need to accept that technology and requirements are constantly changing, and what might have failed in the past might now be the best viable solution. Part of this is establishing an attitude that failure is acceptable but contained – fail fast and try again. For example, we have long used a system of A/B testing to trial new features and requirements with the Netflix user base, and will often scrap features that are not deemed beneficial to the viewership. We will also come back later and re-trial a feature if the product has evolved or the viewership’s needs have changed.
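
As a loose illustration of gating a feature behind an A/B allocation so it can be scrapped or re-trialled cheaply (the interface and names below are invented for this sketch and are not Netflix’s actual A/B testing API):

```kotlin
// Illustrative only: a hypothetical feature gate driven by an A/B test allocation.
// AbTestClient, the test name and the cell names are invented for this sketch.
interface AbTestClient {
    fun isInCell(testName: String, cellName: String, profileId: String): Boolean
}

class RowLayoutSelector(private val abTests: AbTestClient) {
    fun layoutFor(profileId: String): String =
        if (abTests.isInCell("new-row-layout", "treatment", profileId)) {
            "new-row-layout"     // experimental experience being trialled
        } else {
            "classic-row-layout" // control; trivial to fall back to if the test is scrapped
        }
}
```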

Another internal technical example of this is our Publisher Feedback feature, which was used to verify candidate library releases before they entered our multi-repo ecosystem. For each candidate published, we would run all downstream consumers of the dependency against a configured acceptance test threshold and provide feedback to the library producer on failures caused by the candidate release, optionally automating the gating of the release as part of the library build. Unfortunately, providing a build environment out-of-band from the regular CI workflow proved difficult, and we could not give more than compilation-level feedback on the “what-if” dependencies; once we realized we weren’t going to pursue declarative CI using the same infrastructure as originally planned, we had to reevaluate. We instead invested in pull-request based features via Rocket CI, which provides APIs, abstractions and features over our existing Jenkins infrastructure. This allowed us to invest in these new features while avoiding being coupled to the specifics of the Jenkins build environment.

My advice for engineering managers working in fast-growing companies is: don’t be afraid to try something new, even if it has failed before. Technology and requirements are constantly changing, and what might have failed in the past might now be the best viable solution.
