MMS • RSS
With DevOps, we learned that the biggest factor in streamlining deployments is to make things automated and minimize communication. However, we didn’t apply that principle to other areas. While security gets some traction with DevSecOps and machine learning has MLOps, we lack a solution for database management. In this article, we propose a new approach that unlocks the power of DevOps engineers to own their databases and lets them self-serve deployments, maintenance and troubleshooting.
Why Database Deployment and Maintenance is Slow
In the last few years, we moved from deployment being a big and challenging event to deploying many times daily. We moved from manual deployments to automated CI/CD pipelines. We built tools for deploying infrastructure and configuration changes. This led to a tremendous increase in the complexity of our systems. We somewhat learned how to handle that in deployments, infrastructure, configuration, security or even machine learning.
However, database complexity increased significantly as well. We now use many database engines with each application. We use SQL and NoSQL databases as well as databases tuned for specific scenarios like machine learning or vector search. Since we deploy changes often and asynchronously, modifications to database schema or background tasks may happen at any time and affect the performance. When this happens, the business is directly impacted as the database issues quickly spread throughout the ecosystem. However, developers and DevOps engineers can’t fix the issues on their own and need help from other teams, specifically operations and database administrators. This makes the process slow and error-prone as things are not automated but require manual human intervention. Recommended reading: Why DevOps should own their databases and how to do it.
The problem with database maintenance is the lack of solutions that would present the full and actionable story. We have many tools that can alert us about spiking CPU or increased latency, but we don’t know what’s the reason behind these issues. To find out, developers need to ask database administrators to check the logs and metrics to identify slow queries and whether the CPU spikes are caused by changes in the environment (like updating the operating system), slow background tasks that can be postponed (like vacuuming or index rebuilding) or due to inefficient application code that doesn’t use an index and performs a full table scan.
Since we don’t know the reason, we need to involve multiple teams to investigate. We need to first identify the points of contact in each team, assign tickets or work items, sometimes page people at night, and coordinate the troubleshooting process. To make things faster, we need to fix two aspects: reduce the amount of manual work and reduce the number of people involved in the investigation. Let’s see how.
Reducing the Manual Work with Observability
Many things can go wrong when dealing with databases. Let’s see some of the areas that can break and are not automatically fixed today.
Observability for the Code Changes
Every time developers change their codebase, there is a risk of breaking the correctness and performance. We can verify the correctness with automated tests (unit tests, integration tests, end-to-end tests, etc.) and do it regularly in our CI/CD pipelines. However, we can’t verify the performance impact. Our tests focus on what we extract from the database and not how we do it. We may use N+1 queries, we may scan tables instead of using indexes, and we may have slow queries with CTEs or not use searchable arguments (a query that is not sargable). All these things may result in CPU spikes or latency degradation and we can’t automatically explain what happens. We need to have observability that gives the story like “the query execution within a given REST endpoint doesn’t use an index and leads to slow performance.”
Performance of Schema Changes
When we modify a schema, we risk that tables will be locked for rewriting. This may take minutes or hours and effectively make our systems unavailable. However, our CI/CD pipelines do not test the performance of schema migrations. Tests run against the latest schema and migrations are fast because databases in pre-production environments are small. We don’t check how long these migrations will take in production and if they’ll lock tables. We need to have tools that can warn us if a slow migration is being deployed and will indicate how to make it faster.
Understanding of Execution Plans
Databases grow over time. Queries change their execution plans because statistics get outdated or because indexes are modified. However, automated tests won’t capture these issues as they focus on the correctness of the data and they don’t run on the production database. We don’t see how things evolve and that the query degrades gradually over time. We need to have observability systems that provide an understanding of all the moving pieces around. They need to give us a story explaining that “the query we deployed six months ago stopped using an index because the statistics are out of date as we injected a big batch of data last night”.
We need a story explaining what happened, like SQL stopped using an index or SQL query is not sargable. Other reasons can be schema changes, mismatched query fields, index fragmentation or significant changes in data volume affecting the query optimizer’s decision-making.
How to Reduce the Work
Taking all we saw above, we need to have systems that can understand how our environment evolves and how these changes affect performance. They need to alert us when some slow changes are being deployed to production, understand how the deployments affect the performance and automatically explain the issues. Troubleshooting should be as simple as just getting the notification with links to the places that we need to fix (like configuration, a particular line of code in our application, or the cloud console) with an explanation of how to do it. Fixing should be merely approving a pull request or confirming the configuration change.
Reducing the Number of People Involved With Self-Servicing
Having database observability is the first step. The second step is the ownership. Developers and DevOps engineers need to own their databases the same way they own their code and deployments. They need to be able to self-serve the issues and do that without any help from operations or database administrators. This will reduce communication and make things much smoother. There are three things that we need to introduce to let developers and DevOps engineers build their ownership.
First, we need the tools. Just like we explained in the previous section, we need tools that can explain the story. We can’t just put more responsibilities on developers and DevOps engineers without making the work simple and automated.
Second, we need new processes. We need to have playbooks and operational procedures explaining how to use the tools, how to self-serve the maintenance, and when to escalate. Developers must be familiarized with the procedures and must know how to follow them even outside of working hours. This way we can replace some work (communication, coordination of teams, manual investigation) with some other work (self-servicing) that we can do much faster.
Finally, we need to change the mindset. Developers and DevOps engineers need to take on new responsibilities and become true owners of their databases. Even though this sounds like more work, we can make it much less time-consuming by building good observability tools and replacing slow human communication with automated and well-defined self-servicing. It’s not about shifting the work from one place to another. It’s about making the work much smaller and much more automated so the developers and DevOps teams can do it without involving other teams. Since we reduce the cross-team collaboration, the increase in the scope of work of developers and DevOps engineers will lead to doing the work faster.
There are three aspects that we need to work on now. First, tools like Metis must be created. Second, Platform Engineers must use these tools to define processes. Third, developers and DevOps engineers need to take ownership. While it may seem challenging at first, we already saw with the DevOps movement that minimizing communication and owning the full scope of the system is beneficial for the business and for the engineers. We need to restructure our teams and reassign responsibilities to unlock the potential and make organizations work faster.