MMS • Aditya Kulkarni
DoorDash recently leveraged Open Policy Agent to enhance the efficiency of their developers. The infrastructure team at DoorDash observed several advantages from this, including quicker reviews of changes to infrastructure policies, more comprehensive tagging of resources, and a notable decrease in the number of incidents resulting from policy violations.
A few years back, DoorDash encountered an incident that caused their order volume to drop. The infrastructure team resolved the incident within an hour and traced the root cause to the accidental removal of essential AWS resources. The removal was buried in a Terraform change that also touched around 90 other, seemingly innocuous resources. This realization prompted the team to adopt policy automation as a safeguard against such critical oversights in the future.
At DoorDash, the team uses Atlantis, an open-source orchestrator that manages the Terraform plan lifecycle. When users create infrastructure pull requests on GitHub, a webhook event is sent to an Atlantis worker, which retrieves the Open Policy Agent (OPA) policies from a designated S3 bucket.
DoorDash writes the policy rules in Rego, OPA's policy language, to identify deviations from the expected system state. The conftest tool then uses these OPA policies to validate data against policy assertions.
Atlantis then runs conftest against the Terraform plan, checking it against the OPA-defined policies. The results, alongside the Terraform plan details, are posted as comments on the GitHub pull request.
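To make the evaluation step concrete, here is a minimal Python sketch of the same idea: take a Terraform plan in its JSON form, run a set of rules over it, and render the result as text that could be posted on a pull request. This is not DoorDash's implementation; their rules are written in Rego and run by conftest. The rule itself, the required tag name `team`, and the sample plan are all assumptions made for illustration. Only the `resource_changes`, `address`, and `change.actions` / `change.after` fields follow Terraform's JSON plan representation.

```python
from typing import Callable, List

REQUIRED_TAG = "team"  # hypothetical tag name, not taken from the article

Policy = Callable[[dict], List[str]]


def require_team_tag(plan: dict) -> List[str]:
    """Illustrative rule: every resource the plan creates must carry REQUIRED_TAG."""
    failures = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if "create" not in change.get("actions", []):
            continue
        # "after" or its "tags" entry may be null in a plan, so fall back to {}.
        tags = (change.get("after") or {}).get("tags") or {}
        if REQUIRED_TAG not in tags:
            failures.append(f"{rc['address']}: missing required tag '{REQUIRED_TAG}'")
    return failures


def evaluate(plan: dict, policies: List[Policy]) -> List[str]:
    """Run every rule against the plan and collect all failure messages."""
    failures: List[str] = []
    for policy in policies:
        failures.extend(policy(plan))
    return failures


def format_comment(failures: List[str]) -> str:
    """Render the outcome as plain text suitable for a pull request comment."""
    if not failures:
        return "Policy check passed."
    lines = [f"Policy check failed with {len(failures)} violation(s):"]
    lines += [f"- {message}" for message in failures]
    return "\n".join(lines)


# Hand-written stand-in for the JSON form of a Terraform plan.
SAMPLE_PLAN = {
    "resource_changes": [
        {
            "address": "aws_s3_bucket.logs",
            "type": "aws_s3_bucket",
            "change": {"actions": ["create"], "after": {"tags": {"team": "infra"}}},
        },
        {
            "address": "aws_sqs_queue.orders",
            "type": "aws_sqs_queue",
            "change": {"actions": ["create"], "after": {"tags": None}},
        },
    ]
}

if __name__ == "__main__":
    print(format_comment(evaluate(SAMPLE_PLAN, [require_team_tag])))
```

Keeping each rule as a function that returns a list of messages mirrors how deny-style policy rules accumulate violations, so new rules can be added without changing the evaluation loop.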
DoorDash further streamlines the process with PullApprove, a GitHub integration handling code review, assignment, and policy. Once the required approvals are in place, Atlantis applies the changes to AWS resources as per the Terraform plan.
Du further illustrated the policies that can be written using this automation. He categorized the policies into four types – Reliability, Velocity, Efficiency, and Security.
For reliability, consider a scenario where critical resources must be safeguarded from deletion. Du illustrated this with an example in which a policy identifies these crucial resources and a verification step then requires an administrative review before any modification to them can take place.
To optimize the velocity of review, Du showcased an example where the policy checks the Terraform modules used in a given PR against a list of already-approved modules. If the team uses a module that is not on the list, the policy encourages switching to an approved Terraform module.
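The two examples above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Rego that DoorDash uses: the critical resource addresses, the approved module sources, and the sample plan are hypothetical. The field names `resource_changes`, `change.actions`, and `configuration.root_module.module_calls` follow Terraform's JSON plan representation, and the module check only inspects module calls made from the root module.

```python
from typing import List

# Hypothetical values chosen for illustration only.
CRITICAL_ADDRESSES = {"aws_dynamodb_table.orders"}
APPROVED_MODULE_SOURCES = {"git::https://example.com/terraform-modules/sqs.git"}

# Actions that leave a resource untouched and so need no extra review.
HARMLESS_ACTIONS = {"no-op", "read"}


def needs_admin_review(plan: dict) -> List[str]:
    """Reliability: list critical resources that the plan would modify or delete."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = set(rc.get("change", {}).get("actions", []))
        if rc.get("address") in CRITICAL_ADDRESSES and actions - HARMLESS_ACTIONS:
            flagged.append(rc["address"])
    return flagged


def unapproved_modules(plan: dict) -> List[str]:
    """Velocity: report root-level module calls whose source is not approved."""
    calls = (
        plan.get("configuration", {})
        .get("root_module", {})
        .get("module_calls", {})
    )
    messages = []
    for name in sorted(calls):
        source = calls[name].get("source")
        if source not in APPROVED_MODULE_SOURCES:
            messages.append(
                f"module.{name}: source {source!r} is not on the approved list; "
                "consider an approved module instead"
            )
    return messages


# Hand-written stand-in for the JSON form of a Terraform plan.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_dynamodb_table.orders", "change": {"actions": ["delete"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
    ],
    "configuration": {
        "root_module": {
            "module_calls": {
                "queue": {"source": "git::https://example.com/terraform-modules/sqs.git"},
                "cache": {"source": "github.com/someone/untested-redis"},
            }
        }
    },
}

if __name__ == "__main__":
    for address in needs_admin_review(SAMPLE_PLAN):
        print(f"Admin review required before changing {address}")
    for message in unapproved_modules(SAMPLE_PLAN):
        print(message)
```

In this sketch the reliability check blocks on any action other than `no-op` or `read`, matching the description that any modification needs administrative review, while the module check only produces advisory messages.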
As an outcome of the policy automation, the DoorDash infrastructure team spent less time reviewing pull requests and could put that time towards product improvements. The team could also prevent incidents caused by policy violations, because policy issues were identified early, in the pull request itself. Finally, the team increased its resource tagging coverage and standardization from 20% to 97.9%, supporting cost and team member optimization.