Noémi Ványi and Simona Pencea
Transcript
Pencea: I bet you’re wondering what this picture is doing at a tech conference. These are two German academics. They started out building a dictionary, but they actually became famous because, along the way, they collected a lot of folk stories. The reason they are here is partly because they were my idols when I was a child. I thought there was nothing better to do than to listen to stories and collect them. My family still makes fun of me because I ended up in tech after that. The way I see it, it’s not such a big difference. Basically, we still collect folk stories in tech, but we don’t call them folk stories, we call them best practices. Or we go to conferences to learn about them, basically to learn how other people screwed up, so that we don’t do the same. After we collect all these stories, we put them together, we call them developer experience, and we try to improve that. This brings us to the talk we’re giving: improving developer experience using automated data CI/CD pipelines. My name is Simona Pencea. I am a software engineer at Xata.
Ványi: I’m Noémi Ványi. I’m also a software engineer at Xata, on the backend team. Together, we will walk you through the data developer experience improvements we’ve been focused on recently.
Pencea: We have two topics on the agenda. The first one, testing with separate data branches, covers the idea that when you create a PR, you may want to test it against a separate data environment, potentially containing a separate database. The second one, zero downtime migrations, covers the idea that we want to improve the developer experience when merging changes that include schema changes, without having any downtime: zero downtime migrations. For that, we developed an open-source tool called pgroll. Going through the first one, I will cover several topics: the code development flow that we focused on, the testing improvements that we had in mind, how we ensured we have data available in those new testing environments, and how we ensured that data is safe to use.
Code Development Workflow
This is probably very familiar to all of you. It’s basically what I do every day. When I’m developing, I start with local dev. I have my local dev data. It’s fake data. I try to create a good local dev dataset when I’m testing my stuff. I try to think about corner cases, and cover them all. The moment I’m happy with what I have in my local dev, I use the dev environment. This is an environment that is shared between us developers. It’s located in the cloud, and it also has a dataset. This is the dev dataset. This is also fake data, but it’s crowdfunded from all the developers that use this environment. There is a chance that you find something that’s not in the local dev. Once everything is tested, my PR is approved, I merge it, and I reach staging.
In staging, there is another dataset which is closer to real life, because it comes from beta testing or from demos and so on. It’s not the real thing. The real thing is only in prod, and this is the production data. This is basically the final test. The moment my code reaches prod, it may fail, even though I did my best to test it with everything else. In my mind, I would like to get my hands on the production dataset somehow, without breaking anything, if possible, to test with it before I reach production, so that I minimize the chance of bugs.
Data Testing Improvements – Using Production Data
This is what led to this question: can we use production data for testing? We’ve all received those emails that say, test email, when we weren’t test users. Production data would bring a lot of value when used for testing. If we go through the pros, the main thing is, it’s real data. It’s what real users created. It’s basically the most valuable data we have. It’s also large. It’s probably the largest dataset you have, if we don’t count load-test generated data and so on. It’s fast, in the sense that you don’t have to write a script or populate a database. It’s already there, you can use it. There are cons to this. There are privacy issues. It’s production data: there’s private information, private health information. I probably don’t even have permission from my users to use the data for testing. Or, am I storing it in the right storage? Is that storage configured with the right settings so that I’m not breaking GDPR or some other privacy law?
Privacy issues are a big con. The second thing, as you can see, is that large is also a con, because a large dataset does not mean a complete dataset. Normally, most users will use your product in the most common way, and then you’ll have some outliers which give you the weird bugs and so on. Having a large dataset while testing may prevent you from seeing those corner cases, because they are buried under the common ones. Refreshing takes time because of the size. Basically, if somebody changes the data with another PR or something, you need to refresh everything, and that takes longer than if you have a small subset. Also, because of another PR, you can run into data incompatibility. Basically, you can get into a state where your test breaks, but not because of your PR; it’s because something changed, and now it’s not compatible anymore.
If we look at the cons, they basically fall into two categories. The first ones are related to data privacy. The second ones are related to the size of the dataset. That gives us our requirements. The first one would be, we would like to use production data, but in a safe way, and, if possible, fast. Since we want to do a CI/CD pipeline, let’s make it automated. I don’t want to run a script by hand or something. Let’s have the full experience. Let’s start with the automated part. It’s very hard to cover all the ways software developers work. What we did first was to target a simplification: considering GitHub as the standard workflow, because the majority of developers will use GitHub. One of the things GitHub gives you is a notification when a PR gets created. Our idea was, we can hook into that notification. Then we can create what we call a database branch, which is basically a separate database, but with the same schema as the source branch, when a GitHub PR gets created. Then, after creation, you can copy the data into it. Having this in place would give you the automation part of the workflow.
Let’s see how we could use the production data. We said we want to have a fast copy and also have it complete. I’ll explain what that means. Copying takes time. There is no way around it. You copy data, it takes time. You can hack it. Basically, you can have a preemptive copy. You copy the data before anyone needs it, so when they need it, it’s already there. Preemptive copying means I will just have a lot of datasets lying around, just in case somebody uses them, and then I have to keep everything in sync. That didn’t really fly with us. We can do Copy on Write, which basically means you copy at the last minute, right before the data is actually modified, so before that, all the pointers point to the old data. The problem with using Copy on Write for this specific case is that it did not leave us any way to change the data to make it safe. If I Copy on Write, it’s basically the same data. I will just change the records that I’m editing, but the rest of it stays the same.
For instance, if I want to anonymize an email or something, I will not be able to do it with Copy on Write. Then you have the boring option, which is, basically, you don’t copy all the data, you just copy a part of the data. This is what we went for, even though it was boring. Let’s look at the second thing. We wanted to have a complete dataset. Let me step back a bit and consider the case of a relational database where you have links as a data type. Having a complete dataset means all the links will be resolved inside this dataset. If I copy all the data, that’s obviously a complete dataset, but if I copy a subset, there is no guarantee it will be complete unless I make it so. The problem with building a complete dataset by following the links is it sounds like an NP-complete problem, and that’s because it is an NP-complete problem. If I want to copy a subset of a bigger dataset, and find the best one of a certain size, I would actually need to find all the subsets that respect that rule, and then select the best one. That would take a lot of time. In our case, we did not want the best dataset that has exactly the size we have in mind. We were happy with having something around that size. In that case, we can just go with the first subset we can construct that is complete, with all the links resolved, and roughly of the target size.
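To make the idea of a link-complete subset concrete, here is a minimal sketch under assumed names (a prod schema, a branch schema that mirrors it, and users rows linking to teams): the rows a sampled record links to are copied first, so every link resolves inside the subset. This is an illustration of the principle, not Xata’s actual implementation.

```sql
-- Hypothetical setup: schema "branch" has the same tables as schema "prod".
-- 1. Sample the records we want to start from.
CREATE TEMP TABLE sample_users AS
SELECT * FROM prod.users LIMIT 10000;

-- 2. Copy the rows they link to first (parents before children),
--    so every foreign key resolves inside the subset.
INSERT INTO branch.teams
SELECT t.*
FROM prod.teams AS t
WHERE t.id IN (SELECT team_id FROM sample_users);

-- 3. Copy the sampled rows themselves.
INSERT INTO branch.users
SELECT * FROM sample_users;
```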
Data Copy (Deep Dive)
The problem with constructing this complete subset is, where do we start? How do we advance? How do we know we got to the end? The where-do-we-start part is solvable if we think about the relationships between the tables as a graph, and then we apply a topological sort on it. We list the tables based on their degrees of independence. This is an example: t7 is the most independent, then we have t1, t2, t3, and we can see that if we remove these two, the degrees of independence for t2 and t3 immediately increase because the links are gone. Then we go further up. Here we have the special case of a cycle, because a table can point back, via links, to the table that pointed to it. In this case, we can break the cycle, because going back we see the only way to reach this cycle is through t5.
Basically, we need to first reach t5 and then t6. This is what I call the anatomy of the schema. This is the order in which we need to go through the tables when we collect records. To answer the other two questions, things get a bit more complicated, because the schema is not enough. The schema is not enough for these cases because, first of all, it tells you what’s possible, but a link doesn’t have to be populated unless you have a constraint. Usually, a link can also be empty. If you reach a point where you run into a link that points to nothing, that doesn’t mean you should stop. You need to go and exhaustively find the next potential record to add to the set. Basically, if you imagine it in 3D, you need to project this static analysis that we did onto individual rows. The thing that you cannot see through the static analysis from the beginning is that you can have several records from one table pointing to the same record in another table. The first one will take everything with it, and the second one will bring nothing.
Then you might be tempted to stop searching, because you think, I didn’t make any progress, so the set is complete, which is not true. You need to exhaustively look until the end of the set. These are just a few of the things that need to be kept in the back of your mind when building something like this. We need to always allow a full cycle before determining that no progress was made. When we select the next record, we should consider the fact that it might have already been brought into the set, and we shouldn’t necessarily stop at that point.
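As a side note on where the table graph comes from: the foreign-key edges that feed the topological sort can be read straight out of the Postgres catalog. The query below is a generic illustration of that input, not necessarily what Xata runs.

```sql
-- List foreign-key edges (referencing table -> referenced table),
-- the input for the topological sort described above.
SELECT DISTINCT
    tc.table_name  AS referencing_table,
    ccu.table_name AS referenced_table
FROM information_schema.table_constraints AS tc
JOIN information_schema.constraint_column_usage AS ccu
  ON ccu.constraint_name   = tc.constraint_name
 AND ccu.constraint_schema = tc.constraint_schema
WHERE tc.constraint_type = 'FOREIGN KEY'
  AND tc.table_schema    = 'public';
```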
We talked at the beginning about how we want to have this production data, but have it safe to use. This is the last step, which we are still working on. It is a bit more fluffy. The problem with masking the data is that, for some fields, you know exactly what they are. If it’s an email, then, sure, it’s private data. What if it’s free text, then what? If it’s free text, you don’t know what’s inside. The assumption is it could be private data. The approach here was to provide several possibilities for how to redact data and allow the user to choose, because the user has the context and they should be able to select based on the use case. The idea with, for instance, a full redaction or a partial redaction is that, sure, you can apply it, but it will break your aggregations.
For instance, if I have an aggregation by username, like my Gmail address, and I want to know how many items I have assigned to my email address, if I redact the username so it becomes something like **@gmail.com, then I get aggregations over every Gmail address that has items in my table. The most complete option would be a full transformation. The problem with full transformation is that it takes up a lot of memory, because you need to keep a map from the initial value to the changed value. Depending on the use case, you might not need this, because it’s more complex to maintain. Of course, if there is a field that has sensitive data and you don’t need it for your specific test case, you can just remove it. The problem with removing a field is that it basically means you’re changing the schema, so you’re doing a migration, and that normally causes issues. In our case, we have a solution for the migrations, so you can feel free to use it.
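As one hedged illustration of this trade-off (not Xata’s actual masking implementation), the two statements below show the difference: a partial redaction collapses distinct values, while a deterministic transformation (here a hash, which avoids storing a mapping but is not reversible) keeps distinct inputs distinct, so aggregations by email still group correctly. The branch.users table and email column are assumed names.

```sql
-- Option A: partial redaction. Every Gmail user collapses into the same value,
-- so aggregations by email no longer distinguish them.
UPDATE branch.users
SET email = '**@' || split_part(email, '@', 2);

-- Option B: deterministic transformation. Distinct inputs stay distinct,
-- so GROUP BY email still works, at the cost of losing the original value.
UPDATE branch.users
SET email = md5(email) || '@example.invalid';
```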
Zero Downtime Migrations
Ványi: In this section of the presentation, I would like to talk about what we mean by zero downtime, what challenges we face when we are migrating the data layer, the expand-contract pattern, and how we implemented it in PostgreSQL. What do I mean when I say zero downtime? It sounds so nice. Obviously, downtime cannot be zero because of physics, but the user can perceive it as zero. They can usually tolerate around 20 milliseconds of latency. Here I talk about planned maintenance, not service outages. Unfortunately, we rarely have any control over service outages, but we can always plan for our application updates.
Challenges of Data Migrations
Let’s look at the challenges we might face during these data migrations. Some migrations require locking, unfortunately. These can be table-level read/write locks, meaning no one can access the table. They cannot read. They cannot write. In the case of high availability applications, that is unacceptable. There are other migrations that rely on less restrictive locks. Those are a bit better, and we can live with that, but it’s still something we want to avoid. Also, when there is a data change, we obviously have to update the application as well, and the new instance has to start and run basically at the same time as the old application is running. This means that the database we are using has to be in two states at the same time. Because there are two application versions interacting with our database, we must make sure, for example, if we introduce a new constraint, that it is enforced both on the existing records and on the new data as well.
Based on these challenges, we can come up with a list of requirements. The database must serve both the old schema and the new schema to the application, because we are running the old application and the new application at the same time. Schema changes are not allowed to block database clients, meaning we cannot allow our applications to be blocked because someone is updating the schema. The data integrity must be preserved. For example, if we introduce a new data constraint, it must be enforced on the old records as well. When we have different schema versions live at the same time, they cannot interfere with each other. For example, when the old application is interacting with the database, we cannot yet enforce the new constraints, because it would break the old application. Finally, as we are interacting with two application versions at the same time, we must make sure that the data is still consistent.
Expand-Contract Pattern
The expand-contract pattern can help us with this. It can minimize downtime during these data migrations. It consists of three phases. The first phase is expand. This is the phase when we add new changes to our schema. We expand the schema. The next step is migrate. That is when we start our new application version. Maybe test it. Maybe we feel lucky, we don’t test it at all. At this point, we can also shut down the old application version. Finally, we contract. This is the third and last phase. We remove the unused and the old parts from the schema. This comes with several benefits.
In this case, the changes do not block the client applications, because we only add new things to the existing schema. The database has to be forward compatible, meaning it has to support the new application version, but at the same time it has to support the old application version, so the database is both forward and backward compatible with the application versions. Let’s look at a very simple example: renaming a column. Here we have to create a new column with the new name and copy over the contents of the old column. Then we migrate our application and delete the column with the old name. It’s very straightforward. We can deploy this change using, for example, blue-green deployments. Here, the old application is still live, interacting with our table through the old view. At the same time, we can deploy our new application version, which interacts with the same table through a new view. Then, once we see that everything is passing, we can shut down the old application and remove the old view, and everything works out fine.
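Sketched in SQL, the rename described above plays out as expand, migrate, contract, with one view per application version. The table, column, and schema names below are made up for illustration, and the trigger that would keep both columns in sync during the migrate phase is only hinted at in a comment.

```sql
-- Expand: add the column with the new name and backfill it.
ALTER TABLE public.users ADD COLUMN full_name text;
UPDATE public.users SET full_name = name;
-- (During the migrate phase a trigger would keep name and full_name in sync; omitted here.)

-- Expose one view per schema version; each application version reads "its" shape.
CREATE SCHEMA v_old;
CREATE SCHEMA v_new;
CREATE VIEW v_old.users AS SELECT id, name FROM public.users;
CREATE VIEW v_new.users AS SELECT id, full_name FROM public.users;

-- Contract: once the old application is gone, drop its view and the old column.
DROP VIEW v_old.users;
DROP SCHEMA v_old;
ALTER TABLE public.users DROP COLUMN name;
```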
Implementation
Let’s see how we implemented it in Postgres. First, I would like to explain why we chose PostgreSQL in the first place. Postgres is well known and open source. It’s been developed for 30 years now. The DDL statements are transactional, meaning if one of these statements fails, it can be rolled back easily. Locking is mostly row level. Unfortunately, there are a few operations that take read/write locks, but we can usually work around those. For example, if you are adding a non-volatile default value, the table is not rewritten. Instead, the value is added to the metadata of the table. The old records are updated when the whole record is updated. This doesn’t work in every case, though. Let’s look at the building blocks that Postgres provides. We are going to use three building blocks: DDL statements, obviously, to alter the schema.
Views, to expose the different schema versions to the different application versions. Triggers and functions, to migrate the old data and, on failure, to roll back the migrations. Let’s look at a bit more complex example. We have an existing column, and we want to add a NOT NULL constraint to it. It seems simple, but it can be tricky, because Postgres does a table scan, meaning it locks the table and no one can update or read it while it goes through all of the records and checks whether any of the existing records violate the NOT NULL constraint. If it finds a record that violates this constraint, the statement returns an error. We can work around it: if we add NOT VALID to this constraint, the table scan is skipped. Here we add the new column, add the NOT NULL constraint, and mark it NOT VALID, so we are not blocking the database clients.
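In raw SQL, that workaround looks roughly like the sketch below (column and constraint names are made up for illustration): a CHECK ... NOT VALID constraint is accepted without scanning the existing rows, so clients are not blocked while it is created.

```sql
-- Expand: a duplicate column that will carry the NOT NULL semantics.
ALTER TABLE users ADD COLUMN description_new text;

-- NOT VALID skips the full table scan, so existing readers and writers
-- are not blocked while the constraint is added.
ALTER TABLE users
  ADD CONSTRAINT users_description_new_not_null
  CHECK (description_new IS NOT NULL) NOT VALID;
```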
We also create triggers that move the old values from the old column to the new one. It is possible that some of the old records don’t have values yet, and in this case, we need to add some default value or whatever backfill value we want, and then we migrate our app. We need to complete the migration, obviously. We need to clean up the trigger, the view we added, and the old column, so the applications interact with the table through the new schema version. Also, we must remember to remove NOT VALID from the constraint. We can do that because the migration moved the old values over, and we know that all of the new values are there, so every record satisfies the constraint.
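Continuing the same illustrative sketch (this shows the mechanism, not pgroll’s generated code, and assumes Postgres 11+ for EXECUTE FUNCTION): a trigger keeps the new column filled while both application versions write, and the contract step validates the constraint, which scans the table but only takes a lock that does not block reads or writes, then cleans up.

```sql
-- Keep the new column in sync while both application versions are writing.
CREATE OR REPLACE FUNCTION users_backfill_description() RETURNS trigger AS $$
BEGIN
  NEW.description_new := COALESCE(NEW.description, 'no description provided');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_backfill_description_trg
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION users_backfill_description();

-- Contract: validate the constraint (no blocking lock), then clean up.
ALTER TABLE users VALIDATE CONSTRAINT users_description_new_not_null;
DROP TRIGGER users_backfill_description_trg ON users;
DROP FUNCTION users_backfill_description();
ALTER TABLE users DROP COLUMN description;
```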
It all seemed quite tedious to do this all the time, and that’s why we created pgroll. It’s an open-source command line tool written in Go, so you can also use it as a library. It is used to manage safe and reversible migrations using the expand-contract pattern. I would like to walk you through how to use it. pgroll runs against a Postgres instance, so you need one running somewhere. After you have installed and initialized it, you can start creating your migrations. You define migrations using JSON files. I will show you an example. Once you have your migration, you run the start command. pgroll then creates a new view, and you can interact with it through your new application. You can test it. Then you can also shut down your old application. You run the complete command, and pgroll removes all of the leftover views and triggers for you. This is the JSON example I was just talking about.
Let’s say that we have a users table that has an ID field, a name field, and a description, and we want to make sure that the description is always there, so we put a NOT NULL constraint on it. In this case, you have to define a name for the migration; it will become the name of the view’s schema in Postgres. We define a list of operations. We are altering a column. The table is named users. For the description field, we no longer allow null values in the column. This is the interesting part: the up migration. It contains what to do when we are migrating the old values. In this case, it means that if the description is missing, we add the text "description for" followed by the name. Or, if the data is there, we just move it to the new column. The down migration defines what to do when there is an error and we want to roll back. In this case, we keep the old value, meaning, if the value was missing, it’s a null, and if there was something, we keep it there.
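Based on that description, the migration file looks roughly like the following sketch. The exact field names should be checked against the pgroll documentation; the shape is a named migration with a list of operations, each carrying up and down expressions.

```json
{
  "name": "user_description_set_nullable",
  "operations": [
    {
      "alter_column": {
        "table": "users",
        "column": "description",
        "nullable": false,
        "up": "(SELECT CASE WHEN description IS NULL THEN 'description for ' || name ELSE description END)",
        "down": "description"
      }
    }
  ]
}
```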
Here is the start command. Let’s see in psql what just happened. We have a users table with these three columns, but you can see that pgroll added a new column. Remember, there is a migration ongoing right now. In the old description column, there are records that do not yet satisfy the constraint. In the new description column, the backfill value is already there for us to use. We can inspect what schemas are in the database. We notice that there is this create_users_table; that’s the old schema version. The new one is user_description_set_nullable, which is the name of the migration we just provided in our JSON. Let’s try to insert some values into this table. We are inserting two values. The first one is basically how the new application version behaves: the description is not empty. With the second record, we are mimicking what the old application does: here the description is NULL. Let’s say that we succeeded. We can try to query this table.
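Reconstructed from the walk-through that follows, the psql session might look roughly like this. The version-schema names mirror the migration names used in the talk; in practice pgroll may prefix them differently, so treat them as illustrative.

```sql
-- New application: writes through the new version schema, description provided.
SET search_path TO 'user_description_set_nullable';
INSERT INTO users (name, description) VALUES ('Alice', 'this is Alice');

-- Old application: writes through the old version schema, no description.
SET search_path TO 'create_users_table';
INSERT INTO users (name) VALUES ('Bob');

-- Old schema version: Bob's description is NULL.
SELECT name, description FROM users;
--  Alice | this is Alice
--  Bob   | NULL

-- New schema version: Bob gets the backfilled value from the up migration.
SET search_path TO 'user_description_set_nullable';
SELECT name, description FROM users;
--  Alice | this is Alice
--  Bob   | description for Bob
```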
From the old app’s point of view, we can set the search path to the old schema version and perform the following query, so we can inspect what happened after we inserted these values. This is what we get back. The description for Alice is "this is Alice", and for Bob it’s NULL, because the old application doesn’t enforce the constraint. Let’s change the search path to the new schema version and perform the same query. Here we can see that we have the description for Alice. Notice that Bob also has a description: it is the default description from the up migration we provided in the JSON file. Then we can complete the migration using the complete command, and we can see that the old schema is cleaned up. The intermediary column is also removed, and the triggers, functions, everything is removed. Check out pgroll. It’s open source. It takes care of mostly everything. There is no need to manually create new views, functions, new columns, nothing. After you complete your migrations, it cleans up after itself. It is still under development, so there are a few missing features; for example, a few migration types are missing. We do not yet support adding comments, unfortunately, or batched migrations.
Takeaways
Pencea: What we presented so far were bits and pieces of the puzzle that we want to build: the CI/CD data pipeline. What we imagined when we created this was: somebody creates a PR. Then a test environment, with a test database containing valid data that’s also safe to use, gets created for them. Then the tests are run, everything is validated, and the PR is merged. Then it goes through the pipeline, and nobody has to worry about migrations, because we can apply the changes and everything.
Ványi: The migrations are done without downtime. If your pull request is merged, it goes through the testing pipeline, and if everything passes, that’s nice: we can clean up after ourselves and remove the old schema. If there is a test failure or something is not working correctly, we can roll back anytime, because the old schema is still kept around just in case. As we just showed, there is still some work left for us, but we already have some building blocks that you can integrate into your CI/CD pipeline. You can create a test database on the fly using GitHub notifications, and fill it with safe and relevant data to test. You can create schema changes and merge them back and forth without worrying about data migrations. You can deploy and roll back your application without any downtime.
Questions and Answers
Participant 1: Does pgroll take care of keeping the metadata of every migration done: whether it’s started, ongoing, or finished?
Ványi: Yes, there is a migrations table. Also, you can obviously store your migration files in Git if you want to version-control them, but pgroll has its own bookkeeping for past migrations.
Participant 2: For the copying of the data from production, was that for local tests, local dev, or the dev? How did you control costs around constantly copying that data, standing up databases, and tearing them back down?
Pencea: It’s usually for something that sits in the cloud, so not for the local dev.
Participant 2: How did you control cost if you’re constantly standing up a near production size database?
Pencea: What we use internally is data branching. We don’t start a new instance every time. We have a separate schema inside a bigger database. Also, what we offer right now is copying 10k of data, it’s not much in terms of storage. We figured it should be enough for testing purposes.
Participant 3: I saw in your JSON file that you can do migrations that pgroll knows about, like setting nullable to false. Can you also do pure SQL migrations?
Ványi: Yes. We don’t yet support every migration type. If there is anything missing, you can always work around it by using raw SQL migrations. In that case, you can shoot yourself in the foot, because, for example, in the case of NOT NULL, we take care of skipping the table scan for you. When you are writing your own raw SQL migration, you have to be careful not to block your table and database access.
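For reference, a raw SQL operation in a pgroll migration looks roughly like the sketch below; the field names are reproduced from memory and should be checked against the pgroll docs. The example adds a table comment, which, as mentioned above, is not yet covered by a dedicated operation; as Ványi notes, pgroll does not rewrite such statements to avoid locking for you.

```json
{
  "name": "comment_users_table",
  "operations": [
    {
      "sql": {
        "up": "COMMENT ON TABLE users IS 'Application users'",
        "down": "COMMENT ON TABLE users IS NULL"
      }
    }
  ]
}
```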
Participant 4: It’s always amazed me that these databases don’t do safer actions for these very common use cases. Have you ever talked to the Postgres project on improving the actual experience of just adding a new column, or something? It should be pretty simple.
Ványi: We’ve been trying to have conversations about it, but it is a very mature project, and it is somewhat hard to change such a fundamental part of the database. Constraints are a basic building block of Postgres, and it’s not easy to just make them safer. There is always some story behind it.
Pencea: I think developer experience was not necessarily something that people were concerned about, up until recently. I feel like sometimes it was actually the opposite, if it was harder, you looked cooler, or you looked like a hacker. It wasn’t exactly something that people would optimize for. I think it’s something that everybody should work towards, because now everybody has an ergonomic chair or something, and nobody questions that, but we should work towards the same thing about developer experience, because it’s ergonomics in the end.
Participant 5: In a company adopting pgroll, all these scripts can grow in number, so at some point you have to apply all of them, I suppose, in order. Is there any sequence number, any indication of how to apply them? Because some of them might be serial, and some of them can be parallelized. Is there any plan to give direction on the execution? I’ve seen there is a number in the script file name: are you following that as a sequence number, or, when you develop your batching feature, will you add a sequence number inside?
Ványi: Do we follow some sequence number when we are running migrations?
Yes and no. pgroll maintains its own table for bookkeeping, where it knows what the last migration was and what is coming next. The number in the file name is not only for pgroll, but also for us.
Participant 6: When you have very breaking migrations using pgroll, let’s say you need to rename a column, or even change its type, you basically replicate into a new column and then copy over the data. How do you deal with very large tables, say, millions of rows? Because you could end up having performance issues copying these large amounts of data.
Ványi: How do we deal with tables that are basically big? How do we make sure that it doesn’t impact the performance of the database?
For example, in case of moving the values to the new column, we are creating triggers that move the data in batches. It’s not like everything is copied in one go, and you cannot really use your Postgres database because it is busy copying the old data. We try to minimize and distribute the load on the database.
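A minimal sketch of what a batched backfill can look like (an illustration of the general technique, reusing the made-up column names from earlier, not pgroll’s exact implementation): update a bounded number of rows per statement and repeat until no rows remain, so row locks stay short and the load on the database stays bounded.

```sql
-- Repeat this statement (e.g. from a loop in the migration tool)
-- until it reports 0 rows updated; each run touches at most 1000 rows.
UPDATE users
SET    description_new = COALESCE(description, 'no description provided')
WHERE  id IN (
    SELECT id
    FROM   users
    WHERE  description_new IS NULL
    ORDER  BY id
    LIMIT  1000
);
```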
Participant 7: I know you are using small batches to copy the records from the existing column to the new column. Only once you have copied all the records do you remove the old column. There is a cost to that.