Month: November 2024
DigitalOcean Introduces Scalable Storage for Managed MongoDB Offering
MMS • RSS
NEW YORK–(BUSINESS WIRE)–Nov 14, 2024–
DigitalOcean Holdings, Inc. (NYSE: DOCN), the simple scalable cloud, today introduced scalable storage for DigitalOcean’s Managed MongoDB offering, giving users the ability to scale their cloud storage capacity independently of other compute requirements. This new functionality will provide customers with greater flexibility to adapt to growing data demands and fluctuating workloads, without the unnecessary expense of added compute capacity.
MMS • RSS
Amazon DynamoDB is a serverless, NoSQL, fully managed database with single-digit millisecond performance at any scale. Starting today, we have made Amazon DynamoDB even more cost-effective by reducing prices for on-demand throughput by 50% and global tables by up to 67%.
DynamoDB on-demand mode offers a truly serverless experience with pay-per-request pricing and automatic scaling without the need for capacity planning. Many customers prefer the simplicity of on-demand mode to build modern, serverless applications that can start small and scale to millions of requests per second. While on-demand was previously cost effective for spiky workloads, with this pricing change, most provisioned capacity workloads on DynamoDB will achieve a lower price with on-demand mode. This pricing change is transformative as it makes on-demand the default and recommended mode for most DynamoDB workloads.
Global tables provide a fully managed, multi-active, multi-Region data replication solution that delivers increased resiliency, improved business continuity, and 99.999% availability for globally distributed applications at any scale. DynamoDB has reduced pricing for multi-Region replicated writes to match the pricing of single-Region writes, simplifying cost modeling for multi-Region applications. For on-demand tables, this price change lowers replicated write pricing by 67%, and for tables using provisioned capacity, replicated write pricing has been reduced by 33%.
These pricing changes took effect in all AWS Regions on November 1, 2024, and will be automatically reflected in your AWS bill. To learn more about the new price reductions, see the AWS Database Blog, or visit the Amazon DynamoDB Pricing page.
MMS • RSS
Over 1 million customers choose Amazon DynamoDB as their go-to NoSQL database for building high-performance, low-latency applications at any scale. The DynamoDB serverless architecture eliminates the overhead of operating and scaling databases, reducing costs and simplifying management, allowing you to focus on innovation, not infrastructure. DynamoDB provides seamless scalability as workloads grow from hundreds of users to hundreds of millions of users, or from a single AWS Region to spanning multiple Regions.
Our continued engineering investments in operating DynamoDB more efficiently allow us to identify cost savings and pass them on to you. Effective November 1, 2024, DynamoDB has reduced prices for on-demand throughput by 50% and global tables by up to 67%, making it more cost-effective than ever to build, scale, and optimize applications.
In this post, we discuss the benefits of these price reductions, on-demand mode, and global tables.
Price reductions
You can now get the same powerful functionality of DynamoDB on-demand throughput and global tables at significantly lower prices. Let’s dive into what this price drop means for you and how DynamoDB can power your applications at a new level of cost-efficiency:
- On-demand throughput pricing has been reduced by 50% – DynamoDB on-demand mode is now even more attractive, offering you a fully managed, serverless database experience that automatically scales in response to application traffic with no capacity planning required. On-demand mode’s capabilities like pay-per-request pricing, scale-to-zero, and no up-front costs help you save time and money while simplifying operations and improving performance at any scale. On-demand is a game changer for modern, serverless applications because it instantly accommodates workload requirements as they ramp up or down, eliminating the operational complexity of capacity management and database scaling. With this pricing change, most provisioned capacity workloads on DynamoDB today will achieve a lower price with on-demand mode.
- Global tables pricing has been reduced by up to 67% – Building globally distributed applications is now significantly more affordable. DynamoDB has reduced pricing for multi-Region replicated writes to match the pricing of single-Region writes, simplifying cost modeling and choosing the best architecture for your applications. For on-demand tables, this price change lowers replicated write pricing by 67%, and for tables using provisioned capacity, replicated write pricing has been reduced by 33%.
Whether you’re launching a new application or optimizing an existing one, these savings make DynamoDB an excellent choice for workloads of all sizes. You can now enjoy the power and flexibility of serverless, fully managed databases with global reach at an even lower cost—allowing you to focus more resources on driving innovation and growth.
DynamoDB on-demand
When we launched DynamoDB in 2012, provisioned capacity was the only throughput option available. Provisioned capacity requires you to predict and plan your throughput requirements. For provisioned tables, you must specify how much read and write throughput per second you require for your application, and you’re charged based on the hourly read and write capacity you have provisioned, not how much your application has consumed. In 2017, we added provisioned auto scaling to help improve scaling and utilization. Although it was effective, we learned that customers wanted a serverless experience where they don’t have to think about provisioned capacity utilization and how quickly auto scaling can respond to changes in traffic patterns. In 2018, we launched on-demand mode to provide a truly serverless database experience with pay-per-request billing and automatic scaling that doesn’t require capacity management and scaling configurations.
Both provisioned and on-demand billing modes use the same underlying infrastructure to achieve high availability, scale, reliability, and performance. The key differences are that on-demand is always 100% utilized due to pay-per-request billing and on-demand scales transparently, without needing to specify a scaling policy. As a result, many customers prefer the simplicity of on-demand mode to build modern, serverless applications that can start small and scale to millions of requests per second. Continually working backward from our customer feedback, in early 2024, we launched configurable maximum throughput for on-demand tables, an optional table-level setting that provides an additional layer of cost predictability and fine-grained control by allowing you to specify predefined maximum read or write (or both) throughput for on-demand tables. Recently, we introduced warm throughput to provide greater visibility on the number of read and write operations an on-demand table can instantaneously support, and also made it more straightforward to pre-warm DynamoDB tables for upcoming peak events, like new product launches or database migration, when throughput requirements can increase by 10 times, 100 times, or more.
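To make this concrete, the following is a minimal sketch (an illustration, not code from the announcement) of creating an on-demand table with the AWS SDK for Python; the optional OnDemandThroughput block corresponds to the configurable maximum throughput setting described above, and exact parameter shapes should be checked against your SDK version.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Create a table in on-demand (pay-per-request) mode: no read or write
# capacity is provisioned, and billing follows actual request volume.
dynamodb.create_table(
    TableName="orders",  # hypothetical table name
    AttributeDefinitions=[{"AttributeName": "order_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
    # Optional cap on on-demand throughput (the configurable maximum
    # throughput feature mentioned above); the numbers are illustrative.
    OnDemandThroughput={
        "MaxReadRequestUnits": 10000,
        "MaxWriteRequestUnits": 5000,
    },
)
```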
While on-demand was previously cost-effective for spiky workloads, with this pricing change, most provisioned capacity workloads on DynamoDB today will achieve a lower price with on-demand mode. This pricing change is transformative because it makes on-demand the default and recommended mode for most DynamoDB workloads. Whether you’re running a new application or a well-established one, on-demand mode simplifies the operational experience, while providing seamless scalability and responsiveness to handle changes to your traffic pattern, making it an ideal solution for startups, growing applications, and established businesses looking to streamline costs without sacrificing performance.
The following are the key benefits of on-demand mode:
- No capacity planning – On-demand mode eliminates the need to predict capacity usage and pre-provision resources. Capacity planning and monitoring can be time-consuming, especially as traffic patterns change over time. With on-demand, there is no need to monitor your utilization, adjust capacity, or worry about over-provisioning or under-provisioning resources. On-demand simplifies operations and allows you to focus on building features for your customers.
- Automatic scaling – One of the greatest advantages of on-demand mode is its ability to automatically scale to meet your application demand. On-demand mode can instantly accommodate up to double the previous peak traffic on your table. If your workload drives more than double your previous peak on the table, DynamoDB automatically scales, which reduces the risk of throttling, where requests can be delayed or rejected if the table is unable to keep up. Whether traffic is surging for a major launch or fluctuating due to low weekly or seasonal demand, on-demand can quickly adjust based on actual traffic patterns to serve your workload. On-demand mode can serve millions of requests per second without capacity management, and once scaled, you can instantly achieve the same throughput again in the future without throttling.
- Usage-based pricing – Unlike provisioned capacity mode, where you pay for a fixed amount of throughput regardless of usage, with on-demand mode’s simple, pay-per-request pricing model, you don’t have to worry about idle capacity because you only pay for the capacity you actually use. You are billed per read or write request, so your costs directly reflect your actual usage.
- Scale to zero throughput cost – With DynamoDB, the throughput a table is capable of serving at any given moment is decoupled from what you are billed. For example, an on-demand table may be capable of serving 20,000 reads and 20,000 writes per second (we call this warm throughput) based on your previous traffic pattern, but your application may only be consuming 1,000 reads and 1,000 writes per second (consumed throughput). In this scenario, you are only charged for the 1,000 reads and 1,000 writes that you actually consume, even though at any time, your application could scale up to the warm throughput of 20,000 reads and 20,000 writes per second without any scaling actions needed by the DynamoDB table. On the other hand, if you are driving zero traffic to your table, then with on-demand, you are not charged for any throughput; however, your application can readily consume the warm throughput that the table can serve. Therefore, your table maintains warm throughput for when your application needs it but can scale to zero throughput cost when you aren’t issuing any requests against the table.
- Serverless compatibility – DynamoDB on-demand coupled with other AWS services, such as AWS Lambda, Amazon API Gateway, and Amazon CloudWatch, allows you to build a fully serverless application stack that can scale seamlessly and handle variable workloads efficiently without needing to manage infrastructure.
Global tables: Bringing data closer to your customers
Global tables provide a fully managed, multi-active, multi-Region data replication solution that delivers increased resiliency, improved business continuity, and 99.999% availability for globally distributed applications at any scale. Global tables automatically replicate your data across Regions, making it accessible to users around the world with low latency, high availability, and built-in resilience.
DynamoDB global tables are ideal for applications with globally dispersed users, including financial technology, ecommerce applications, social platforms, gaming, Internet of Things (IoT) solutions, and use cases where users expect the highest levels of availability and resilience.
The following are the key benefits of global tables:
- High availability – Global tables are designed for 99.999% availability, providing multi-active, multi-Region capability without the need to perform a database failover. In the event that your application processing becomes interrupted in one Region, you can redirect your application to a replica table in another Region, delivering higher business continuity.
- Flexibility – Global tables eliminate the undifferentiated heavy lifting of replicating data across Regions. With a few clicks on the DynamoDB console or an API call, you can convert any single-Region table to a global table; a minimal API sketch follows this list. You can also add or delete replicas in your existing global tables at any time, providing you the flexibility to move or replicate your data as your business requires. Because global tables use the same APIs as single-Region DynamoDB tables, you don’t have to rewrite or make any application changes as you expand globally.
- Fully managed, multi-Region replication – For businesses with global customers, performance and availability matter more than ever. With global tables, your data is automatically replicated across your chosen Regions, providing low-latency local access and enhanced user experience.
- Global reach, local performance – Global tables enable you to read and write your data locally, providing single-digit millisecond latency for globally distributed applications at any scale. Updates made to any Region are replicated to all other replicas in the global table, locating your data closer to your users and improving performance for global applications.
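The API call mentioned in the Flexibility item above might look roughly like this with the AWS SDK for Python; this is an illustrative sketch, and prerequisites such as enabling DynamoDB Streams on the source table should be checked against the global tables documentation.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Convert a single-Region table into a global table by adding a replica
# Region; the same call with a "Delete" entry removes a replica later.
dynamodb.update_table(
    TableName="orders",  # hypothetical table name
    ReplicaUpdates=[
        {"Create": {"RegionName": "eu-west-1"}},
    ],
)
```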
Conclusion
We have made DynamoDB even more cost-effective by reducing prices for on-demand throughput by 50% and global tables by up to 67%. Whether you are developing a new application, expanding to a global audience, or optimizing your cloud costs, the new DynamoDB pricing offers increased flexibility and substantial savings.
These pricing changes took effect in all Regions on November 1, 2024, and will be automatically reflected in your monthly AWS bill. We’re excited about what these changes mean for customers and the value you can realize from DynamoDB. For more details, see Pricing for Amazon DynamoDB.
About the authors
Mazen Ali is a Principal Product Manager at Amazon Web Services. Mazen has an extensive background in product management and technology roles, an MBA from Kellogg School of Management, and is passionate about engaging with customers, shaping product strategy, and collaborating cross-functionally to build exceptional experiences. In his free time, Mazen enjoys traveling, reading, skiing, and hiking.
Joseph Idziorek is currently a Director of Product Management at Amazon Web Services. Joseph has over a decade of experience working in both relational and nonrelational database services and holds a PhD in Computer Engineering from Iowa State University. At AWS, Joseph leads product management for nonrelational database services including Amazon DocumentDB (with MongoDB compatibility), Amazon DynamoDB, Amazon ElastiCache, Amazon Keyspaces (for Apache Cassandra), Amazon MemoryDB, Amazon Neptune, and Amazon Timestream.
MMS • RSS
MongoDB MDB shares rallied 3.2% in the last trading session to close at $300.89. This move can be attributed to notable volume, with a higher number of shares being traded than in a typical session. This compares to the stock’s 2.5% gain over the past four weeks.
MongoDB shares are benefiting from strengthening GenAI and predictive AI applications to build scalable architecture and strategic partnerships. MDB’s document model and learning badges are also tailwinds for the stock.
This database platform is expected to post quarterly earnings of $0.69 per share in its upcoming report, which represents a year-over-year change of -28.1%. Revenues are expected to be $495.23 million, up 14.4% from the year-ago quarter.
While earnings and revenue growth expectations are important in evaluating the potential strength in a stock, empirical research shows a strong correlation between trends in earnings estimate revisions and near-term stock price movements.
For MongoDB, the consensus EPS estimate for the quarter has been revised marginally higher over the last 30 days to the current level. And a positive trend in earnings estimate revision usually translates into price appreciation. So, make sure to keep an eye on MDB going forward to see if this recent jump can turn into more strength down the road.
The stock currently carries a Zacks Rank #3 (Hold). You can see the complete list of today’s Zacks Rank #1 (Strong Buy) stocks here.
MongoDB is part of the Zacks Internet – Software industry. Box BOX, another stock in the same industry, closed the last trading session 1.3% higher at $34.67. BOX has returned 4.6% in the past month.
For Box, the consensus EPS estimate for the upcoming report has remained unchanged over the past month at $0.42. This represents a change of +16.7% from what the company reported a year ago. Box currently has a Zacks Rank of #3 (Hold).
MMS • RSS
NEW YORK – DigitalOcean Holdings, Inc. (NYSE: DOCN) has announced a new feature for its Managed MongoDB service that allows customers to scale their database storage independently from other compute resources. This update aims to provide users with more flexibility and cost-efficiency when managing their data storage needs.
Previously, DigitalOcean customers looking to expand their Managed MongoDB storage had to upgrade their entire database plan, which included additional compute capacity. The new scalable storage feature enables users to adjust their storage capacity separately, potentially reducing costs by avoiding unnecessary upgrades to processing power and memory.
According to Darpan Dinker, VP of AI/ML and PaaS at DigitalOcean, the ability to independently scale storage aligns with the practical and financial needs of their customers. The company asserts that the process of upgrading storage is automatic and incurs minimal downtime, simplifying the scaling process.
The collaboration with MongoDB ensures that DigitalOcean’s Managed MongoDB service remains fully managed and certified by MongoDB. Annick Bangos, Senior Director of Partner OEM at MongoDB, highlighted the importance of adaptable data infrastructure, particularly as organizations engage with data-heavy applications like generative AI.
DigitalOcean’s scalable storage offering for Managed MongoDB is designed to provide users with the following benefits:
The company’s focus on simplifying cloud computing reflects its commitment to helping developers and growing businesses to rapidly deploy and scale their applications. This service update is a part of DigitalOcean’s broader suite of managed offerings aimed at reducing infrastructure management overhead for its customers.
This announcement is based on a press release statement from DigitalOcean Holdings, Inc. The company’s stock is publicly traded on the New York Stock Exchange under the ticker DOCN.
In other recent news, DigitalOcean Holdings, Inc. (NYSE: DOCN) reported a 12% year-over-year revenue increase in its third quarter of 2024, largely due to the success of its AI/ML platform, which saw a nearly 200% rise in annual recurring revenue (ARR). The company has also raised its full-year revenue guidance and announced the launch of 42 new features, as well as strategic partnerships aimed at enhancing its cloud services. Despite challenges in its managed hosting service, Cloudways, the firm remains optimistic about future growth, particularly in AI capabilities.
The company’s revenue guidance for Q4 2024 is set between $199 million to $201 million, with an expected full-year non-GAAP diluted earnings per share of $1.70 to $1.75. Looking ahead, management anticipates baseline growth in the low to mid-teens for 2025, backed by a commitment to operational leverage and product innovation, especially in AI capabilities. An Investor Day is also planned for late March or early Q2 2025 to discuss long-term strategies and financial outlook.
However, it’s worth noting that the managed hosting service, Cloudways, has faced challenges since a price increase in April, impacting the net dollar retention rate. Additionally, a decrease in annual recurring revenue (ARR) from 30 to 17 was reported, attributed to the previous quarter’s surge in AI capacity as an anomaly. Despite these challenges, DigitalOcean’s AI strategy, including the launch of NVIDIA H100 Tensor Core GPU droplets and early access to a GenAI platform for select customers, has reportedly decreased troubleshooting time by 35% and is contributing significantly to revenue growth.
DigitalOcean’s recent feature update for its Managed MongoDB service aligns well with its financial performance and market position. According to InvestingPro data, the company’s revenue growth stands at 12.08% for the last twelve months as of Q3 2023, indicating steady expansion. This growth is further supported by a robust gross profit margin of 60.18%, suggesting efficient cost management in its service offerings.
An InvestingPro Tip highlights that DigitalOcean has a perfect Piotroski Score of 9, which is a positive indicator of the company’s financial strength and potential for future growth. This score aligns with the company’s efforts to enhance its services and potentially improve its market position.
Another relevant InvestingPro Tip notes that DigitalOcean’s liquid assets exceed short-term obligations, indicating a strong financial position to support ongoing innovations and service improvements like the new scalable storage feature.
It’s worth noting that DigitalOcean’s stock has shown a 42.71% price total return over the past year, suggesting investor confidence in the company’s strategy and growth potential. However, with a P/E ratio of 42.48, the stock is trading at a relatively high earnings multiple, which investors should consider in light of the company’s growth initiatives and market position.
For readers interested in a more comprehensive analysis, InvestingPro offers additional tips and insights on DigitalOcean. There are 11 more InvestingPro Tips available for DOCN, providing a deeper understanding of the company’s financial health and market prospects.
This article was generated with the support of AI and reviewed by an editor. For more information see our T&C.
MMS • Aditya Kulkarni
Recently, AWS CodeBuild introduced support for managed GitLab self-hosted runners, advancing its continuous integration and continuous delivery (CI/CD) capabilities. This new feature allows customers to configure their CodeBuild projects to receive and execute GitLab CI/CD job events directly on CodeBuild’s ephemeral hosts.
The integration offers several key benefits, including native AWS integration, compute flexibility, and global availability. GitLab jobs can now seamlessly integrate with AWS services, leveraging features such as IAM, AWS Secrets Manager, AWS CloudTrail, and Amazon VPC. This integration enhances security and convenience for users.
Furthermore, customers gain access to all compute platforms offered by CodeBuild, including Lambda functions, GPU-enhanced instances, and Arm-based instances. This flexibility allows for optimized resource allocation based on specific job requirements. The integration is available in all regions where CodeBuild is offered.
To implement this feature, users need to set up webhooks in their CodeBuild projects and update their GitLab CI YAML files to utilize self-managed runners hosted on CodeBuild machines.
The setup process involves connecting CodeBuild to GitLab using OAuth, which requires additional permissions such as create_runner and manage_runner.
It’s important to note that CodeBuild will only process GitLab CI/CD pipeline job events if a webhook has filter groups containing the WORKFLOW_JOB_QUEUED event filter. The buildspec in CodeBuild projects will be ignored unless buildspec-override:true is added as a label, as CodeBuild overrides it to set up the self-managed runner.
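As an illustration of the webhook setup described above (a sketch, not code from the article), a filter group containing the WORKFLOW_JOB_QUEUED event could be created with the AWS SDK for Python along these lines; the project name is hypothetical and the OAuth connection to GitLab must already be in place.

```python
import boto3

codebuild = boto3.client("codebuild", region_name="us-east-1")

# Register a webhook on an existing CodeBuild project so it receives
# GitLab CI/CD job events; only events matching the WORKFLOW_JOB_QUEUED
# filter will trigger an ephemeral runner build.
codebuild.create_webhook(
    projectName="gitlab-runner-project",  # hypothetical project name
    filterGroups=[
        [
            {"type": "EVENT", "pattern": "WORKFLOW_JOB_QUEUED"},
        ]
    ],
)
```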
When a GitLab CI/CD pipeline run occurs, CodeBuild receives the job events through the webhook and starts a build to run an ephemeral GitLab runner for each job in the pipeline. Once the job is completed, the runner and associated build process are immediately terminated.
As an aside, GitLab has been in the news since earlier this year as they planned to introduce CI Steps, which are reusable and composable pieces of a job that can be referenced in pipeline configurations. These steps will be integrated into the CI/CD Catalog, allowing users to publish, unpublish, search, and consume steps similarly to how they use components.
Moreover, GitLab is working on providing users with better visibility into component usage across various project pipelines. This will help users identify outdated versions and take prompt corrective actions, promoting better version control and project alignment.
AWS CodeBuild has been in the news as well, as it added support for Mac builds. Engineers can build artifacts on managed Apple M2 instances that run on macOS 14 Sonoma. A few weeks ago, AWS CodeBuild enabled customers to configure automatic retries for their builds, reducing manual intervention upon build failures. It has also added support for building Windows Docker images in reserved fleets.
Such developments demonstrate the ongoing evolution of CI/CD tools and practices, with a focus on improving integration, flexibility, and ease of use for DevOps teams.
MMS • Noémi Ványi and Simona Pencea
Transcript
Pencea: I bet you’re wondering what this picture is doing at a tech conference. These are two German academics. They started to build a dictionary, but they actually became famous, because along the way, they collected a lot of folk stories. The reason they are here is partly because they were my idols when I was a child. I thought there was nothing better to do than to listen to stories and collect them. My family still makes fun of me because I ended up in tech after that. The way I see it, it’s not such a big difference. Basically, we still collect folk stories in tech, but we don’t call them folk stories, we call them best practices. Or we go to conferences to learn about them, basically to learn how other people screwed up, so that we don’t do the same. After we collect all these stories, we put them all together. We call them developer experience, and we try to improve that. This brings us to the talk that we have, improving developer experience using automated data CI/CD pipelines. My name is Simona Pencea. I am a software engineer at Xata.
Ványi: I’m Noémi Ványi. I’m also a software engineer at the backend team of Xata. Together, we will be walking through the data developer experience improvements we’ve been focused on recently.
Pencea: We have two topics on the agenda. The first one, testing with separate data branches, covers the idea that when you create a PR, you maybe want to test your PR using a separate data environment that contains potentially a separate database. The second one, zero downtime migrations, covers the idea that we want to improve the developer experience when merging changes that include schema changes, without having any downtime. Basically, zero downtime migrations. For that, we developed an open-source tool called pgroll. Going through the first one, I will be covering several topics. Basically, I will start by going through the code development flow that we focused on. The testing improvements that we had in mind. How we ensured we have data available in those new testing environments. How we ensured that data is safe to use.
Code Development Workflow
This is probably very familiar to all of you. It’s basically what I do every day. When I’m developing, I’m starting with the local dev. I have my local dev data. It’s fake data. I’m trying to create a good local dev dataset when I’m testing my stuff. I’m trying to think about corner cases, and cover them all. The moment I’m happy with what I have in my local dev, I’m using the dev environment. This is an environment that is shared between us developers. It’s located in the cloud, and it also has a dataset. This is the dev dataset. This is also fake data, but it’s crowdfunded from all the developers that use this environment. There is a chance that you find something that it’s not in the local dev. Once everything is tested, my PR is approved. I’m merging it. I reach staging.
In staging, there is another dataset which is closer to the real life, basically, because it’s from beta testing or from demos and so on. It’s not the real thing. The real thing is only in prod, and this is the production data. This is basically the final test. The moment my code reaches prod, it may fail, even though I did my best to try with everything else. In my mind, I would like to get my hands on the production dataset somehow without breaking anything, if possible, to test it before I reach production, so that I minimize the chance of bugs.
Data Testing Improvements – Using Production Data
This is what led to this. Can we use production data to do testing with it? We’ve all received those emails sometimes, that say, test email, and I wasn’t a test user. Production data would bring a lot of value when used for testing. If we go through the pros, the main thing is, it’s real data. It’s what real users created. It’s basically the most valuable data we have. It’s also large. It’s probably the largest dataset you have, if we don’t count load test generated data and so on. It’s fast in the way that you don’t have to write a script or populate a database. It’s already there, you can use it. There are cons to this. There are privacy issues. It’s production data: there’s private information, private health information. I probably don’t even have permission from my users to use the data for testing. Or, am I storing it in the right storage? Is this storage with the right settings so that I’m not breaking GDPR or some other privacy laws.
Privacy issues are a big con. The second thing, as you can see, large is also a con, because a large dataset does not mean a complete dataset. Normally, all the users will use your product in the most common way, and then you’ll have some outliers which give you the weird bugs and so on. Having a large dataset while testing may prevent you from seeing those corner cases, because they are better covered. Refreshing takes time because of the size. Basically, if somebody changes the data with another PR or something, you need to refresh everything, and then it takes longer than if you have a small subset. Also, because of another PR, you can get into data incompatibility. Basically, you can get into a state where your test breaks, but it’s not because of your PR. It’s because something broke, or something changed, and now it’s not compatible anymore.
If we look at the cons, it’s basically two categories that we can take from those. The first ones are related to data privacy. Then the second ones are related to the size of the dataset. That gives us our requirements. The first one would be, we would like to use production data but in a safe way, and, if possible, fast. Since we want to do a CI/CD pipeline, let’s make it automated. I don’t want to run a script by hand or something. Let’s have the full experience. Let’s start with the automated part. It’s very hard to cover all the ways software developers work. What we did first was to target a simplification, like considering GitHub as a standard workflow, because the majority of developers will use GitHub. One of the things GitHub gives to you is a notification when a PR gets created. Our idea was, we can use that notification, we can hook up to it. Then we can create what we call a database branch, which is basically a separate database, but with the same schema as the source branch, when a GitHub PR gets created. Then after creation, you can copy the data after it. Having this in place would give you the automation part of the workflow.
Let’s see how we could use the production data. We said we want to have a fast copy and also have it complete. I’ll say what that means. Copying takes time. There is no way around it. You copy data, it takes time. You can hack it. Basically, you can have a preemptive copy. You copy the data before anyone needs it, so when they need it, it’s already there. Preemptive copying means I will just have a lot of datasets around, just in case somebody uses it, and then, I have to keep everything in sync. That didn’t really fly with us. We can do Copy on Write, which basically means you copy at the last minute before data is actually used, so before that, all the pointers point to the old data. The problem with using Copy on Write for this specific case is that it did not leave us any way into which we could potentially change the data to make it safe. If I Copy on Write, it’s basically the same data. I will just change the one that I’m editing, but the rest of it is the same one.
For instance, if I want to anonymize an email or something, I will not be able to do it with Copy on Write. Then, you have the boring option, which is, basically, you don’t copy all the data, you just copy a part of the data. This is what we went for, even though it was boring. Let’s see about the second thing. We wanted to have a complete dataset. I’ll go back a bit, and consider the case of a relational database where you have links as a data type. Having a complete dataset means all the links will be resolved inside of this dataset. If I copy all the data, that’s obviously a complete dataset, but if I copy a subset, there is no guarantee it will be complete unless I make it so. The problem with having a complete dataset by following the links is it sounds like an NP-complete problem, and that’s because it is an NP-complete problem. If I want to copy a subset of a bigger data, and to find it of a certain size, I would actually need to find all the subsets that respect that rule, and then select the best one. That would mean a lot of time. In our case, we did not want the best dataset that has exactly the size that we have in mind. We were happy with having something around that size. In that case, we can just go with the first dataset that we can construct that follows this completeness with all the links being resolved in size.
Data Copy (Deep Dive)
The problem with constructing this complete subset is, where do we start? How do we advance? How do we know we got to the end, basically? The where do we start part is solvable, if we think about the relationships between the tables as a graph, and then we apply a topological sort on it. We list the tables based on their degrees of independence. In this case, this is an example. t7 is the most independent, then we have t1, t2, t3, and we can see that if we remove these two things, the degrees of independence for t2 and t3 are immediately increased because the links are gone. We have something like that. Then we go further up. Here we have the special case of a cycle, because you can point back with links to the same table that pointed to you. In this case, we can break the cycle, because going back we see the only way to reach this cycle is through t5.
Basically, we need to first reach t5 and then t6. This is what I call the anatomy of the schema. We can see this is the order in which we need to go through the tables when we collect records. In order to answer the other two questions, things get a bit more complicated, because the schema is not enough. The problem with the schema not being enough for these cases is because, first of all, it will tell you what’s possible, but it doesn’t have to be mandatory, unless you have a constraint. Usually, a link can also be empty. If you reach a point where you run into a link that points to nothing, that doesn’t mean you should stop. You need to go and exhaustively find the next potential record to add to the set. Basically, if you imagine it in 3D, you need to project this static analysis that we did on individual rows. The thing that you cannot see through the static analysis from the beginning is that you can have several records from one table pointing to the same one in another table. The first one will take everything with it, and the second one will bring nothing.
Then you might be tempted to stop searching, because you think, I didn’t make any progress, so then the set is complete, which is not true. You need to exhaustively look until the end of the set. These are just a few of the things that, when building this thing, need to be kept on the back of the mind, basically. We need to always allow full cycle before determining that no progress was made. When we select the next record, we should consider the fact that it might have been already brought into the set, and we shouldn’t necessarily stop at that point.
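To illustrate the ordering step, here is a small sketch (an illustration, not Xata's code) that topologically sorts tables by their outgoing foreign-key links and breaks cycles by picking the table with the fewest unresolved links; the example graph is made up but loosely follows the t1..t7 example from the talk.

```python
def table_copy_order(links: dict[str, set[str]]) -> list[str]:
    """links maps each table to the set of tables it points to via
    foreign keys; referenced tables are ordered before referencing ones."""
    tables = set(links) | {t for deps in links.values() for t in deps}
    remaining = {t: set(links.get(t, set())) for t in tables}
    order = []
    while remaining:
        # Tables with no unresolved outgoing links can be copied next.
        ready = [t for t, deps in remaining.items() if not deps]
        if not ready:
            # Cycle: break it at the table with the fewest unresolved links
            # (a simplification of what the real tool would have to do).
            ready = [min(sorted(remaining), key=lambda t: len(remaining[t]))]
        for table in sorted(ready):
            order.append(table)
            del remaining[table]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


print(table_copy_order({
    "t1": {"t7"}, "t2": {"t1"}, "t3": {"t1"},
    "t5": {"t3", "t6"}, "t6": {"t5"}, "t7": set(),
}))
# -> ['t7', 't1', 't2', 't3', 't5', 't6']
```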
We talked about at the beginning how we want to have this production data, but have it safe to use. This is the last step that we are still working on. It is a bit more fluffy. The problem with masking the data is that, for some fields, you know exactly what they are. If it’s an email, then, sure, it’s private data. What if it’s free text, then what? If it’s free text, you don’t know what’s inside. The assumption is it could be private data. The approach here was to provide several possibilities on how to redact data and allow the user to choose, because the user has the context and they should be able to select based on the use case. The idea of having, for instance, a full redaction or a partial redaction, is that, sure you can apply that, but it will break your aggregations.
For instance, if I have an aggregation by username, like my Gmail address, and I want to know how many items I have assigned to my email address, if I redact the username so it becomes **@gmail.com, then I get aggregations on any Gmail address that has items in my table. The most complete would be a full transformation. The problem with full transformation is that it takes up a lot of memory, because you need to keep the map with the initial item and the changed item. Depending on the use case, you might not need this because it’s more complex to maintain. Of course, if there is a field that has sensitive data and you don’t need it for your specific test case, you can just remove it. The problem with removing a field is that that would basically mean you’re changing the schema, so you’re doing a migration, and that normally causes issues. In our case, we have a solution for the migrations, so you can feel free to use it.
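The options described here could look roughly like this (an illustrative sketch, not the actual product feature): partial redaction keeps some structure but breaks per-user aggregations, while a deterministic pseudonymization map preserves aggregations at the cost of keeping the mapping around.

```python
import hashlib


def redact_full(value: str) -> str:
    """Full redaction: nothing survives, so aggregations break."""
    return "********"


def redact_partial_email(email: str) -> str:
    """Partial redaction: keeps the domain, hides the username."""
    _, _, domain = email.partition("@")
    return f"**@{domain}" if domain else "**"


class Pseudonymizer:
    """Full transformation: consistent fake values preserve joins and
    aggregations, but the original-to-fake map has to be kept in memory."""

    def __init__(self) -> None:
        self._mapping: dict[str, str] = {}

    def __call__(self, value: str) -> str:
        if value not in self._mapping:
            digest = hashlib.sha256(value.encode()).hexdigest()[:12]
            self._mapping[value] = f"user_{digest}@example.com"
        return self._mapping[value]


mask = Pseudonymizer()
print(redact_partial_email("alice@gmail.com"))             # **@gmail.com
print(mask("alice@gmail.com") == mask("alice@gmail.com"))  # True: stable mapping
```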
Zero Downtime Migrations
Ványi: In this section of the presentation, I would like to talk about, what do we mean by zero downtime. What challenges do we face when we are migrating the data layer? I will talk about the expand-contract pattern and how we implemented it in PostgreSQL. What do I mean when I say zero downtime? It sounds so nice. Obviously, downtime cannot be zero because of physics, but the user can perceive it as zero. They can usually tolerate around 20 milliseconds of latency. Here I talk about planned maintenance, not service outages. Unfortunately, we rarely have any control over service outages, but we can always plan for our application updates.
Challenges of Data Migrations
Let’s look at the challenges we might face during these data migrations. Some migrations require locking, unfortunately. These can be table, read, write locks, meaning no one can access the table. They cannot read. They cannot write. In case of high availability applications, that is unacceptable. There are other migrations that might rely on read, write locks. Those are a bit better, and we can live with that. Also, it’s something we want to avoid. Also, when there is a data change, obviously we have to update the application as well, and the new instance, it has to start and run basically at the same time as the old application is running. This means that the database that we are using, it has to be in two states at the same time. Because there are two application versions interacting with our database, we must make sure, for example, if we introduce a new constraint, that it is enforced in both the existing records and on the new data as well.
Based on these challenges, we can come up with a list of requirements. The database must serve both the old schema and the new schema to the application, because we are running the old application and the new application at the same time. Schema changes are not allowed to block database clients, meaning we cannot allow our applications to be blocked because someone is updating the schema. The data integrity must be preserved. For example, if we introduce a new data constraint, it must be enforced on the old records as well. When we have different schema versions live at the same time, they cannot interfere with each other. For example, when the old application is interacting with the database, we cannot yet enforce the new constraints, because it would break the old application. Finally, as we are interacting with two application versions at the same time, we must make sure that the data is still consistent.
Expand-Contract Pattern
The expand-contract pattern can help us with this. It can minimize downtime during these data migrations. It consists of three phases. The first phase is expand. This is the phase when we add new changes to our schema. We expand the schema. The next step is migrate. That is when we start our new application version. Maybe test it. Maybe we feel lucky, we don’t test it at all. At this point, we can also shut down the old application version. Finally, we contract. This is the third and last phase. We remove the unused and the old parts from the schema. This comes with several benefits.
In this case, the changes do not block the client applications, because we constantly add new things to the existing schema. The database has to be forward compatible, meaning it has to support the new application version, but at the same time, it has to support the old application version, so the database is both forward and backwards compatible with the application versions. Let’s look at a very simple example, renaming a column. It means that here we have to create the new column, basically with a new name, and copy the contents of the old column. Then we migrate our application and delete our column with the old name. It’s very straightforward. We can deploy this change using, for example, the blue-green deployments. Here, the old application is still live, interacting with our table through the old view. At the same time, we can deploy our new application version which interacts through another new view with the same table. Then we realize that everything’s passing. We can shut down the old application and remove the view, and everything just works out fine.
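A hand-rolled version of this rename, assuming plain PostgreSQL and psycopg2 (an illustration of the pattern with hypothetical table and column names, not how pgroll implements it internally), could look like the following; in the real pattern a trigger would also keep the two columns in sync during the migrate phase.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection string
conn.autocommit = True
cur = conn.cursor()

# Expand: add the new column and backfill it from the old one.
cur.execute("ALTER TABLE users ADD COLUMN IF NOT EXISTS full_name text")
cur.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

# Expose one view per schema version; each application version only talks
# to its own view, so old and new can run against the same table.
cur.execute("CREATE SCHEMA IF NOT EXISTS v_old")
cur.execute("CREATE SCHEMA IF NOT EXISTS v_new")
cur.execute("CREATE OR REPLACE VIEW v_old.users AS SELECT id, name FROM users")
cur.execute("CREATE OR REPLACE VIEW v_new.users AS SELECT id, full_name FROM users")

# Contract: once the old application is retired, drop its view and column.
cur.execute("DROP VIEW IF EXISTS v_old.users")
cur.execute("ALTER TABLE users DROP COLUMN IF EXISTS name")
```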
Implementation
Let’s see how we implemented in Postgres. First, I would like to explain why we chose PostgreSQL in the first place. Postgres is well known, open source. It’s been developed for 30 years now. The DDL statements are transactional, meaning, if one of these statements fail, it can be rolled back easily. Row level locking. They mostly rely on row level locking. Unfortunately, there are a few read, write locks, but we can usually work around those. For example, if you are adding a nonvolatile default value, the table is not rewritten. Instead, the value is added to the metadata of the table. The old records are updated when the whole record is updated. It doesn’t really work all the time. Let’s look at the building blocks that Postgres provides. We are going to use three building blocks, DDL statements, obviously, to alter the schema.
Views, to expose the different schema versions to the different application versions. Triggers and functions to migrate the old data, and on failure, to roll back the migrations. Let’s look at a bit more complex example. We have an existing column, and we want to add the NOT NULL constraint to it. It seems simple, but it can be tricky because Postgres does a table scan, meaning it locks the table, and no one can update or read the table, because it goes through all of the records and checks if any of the existing records violate the NOT NULL constraint. If it finds a record that violates this constraint, then the statement returns an error, unfortunately. We can work around it. If we add NOT VALID to this constraint, the table scan is skipped. Here we add the new column and set the NOT NULL constraint and add NOT VALID to it, so we are not blocking the database clients.
We also create triggers that move the old values from the columns. It is possible that some of the old records don’t yet have values, and in this case, we need to add some default value or any backfill value we want, and then we migrate our app. We need to complete the migration, obviously. We need to clean up the trigger, the view we added, so the applications can interact with the table and the old column. Also, we must remember to remove NOT VALID from the original constraint. We can do it because the migration migrated the old values, and we know that all of the new records, or new values are there, and every record satisfies the constraint.
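Spelled out as raw SQL (again through psycopg2, as a simplified sketch of the trick described above rather than pgroll's internals, and applied directly to the existing column instead of a duplicated one), the NOT VALID approach looks roughly like this:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection string
conn.autocommit = True
cur = conn.cursor()

# Adding the constraint as NOT VALID skips the full table scan, so existing
# rows are not checked yet and clients are not blocked for long.
cur.execute("""
    ALTER TABLE users
    ADD CONSTRAINT users_description_not_null
    CHECK (description IS NOT NULL) NOT VALID
""")

# Backfill missing values in small batches to keep row locks short.
while True:
    cur.execute("""
        UPDATE users
        SET description = 'description for ' || name
        WHERE id IN (SELECT id FROM users WHERE description IS NULL LIMIT 1000)
    """)
    if cur.rowcount == 0:
        break

# VALIDATE scans the table but only takes a lighter lock, so reads and
# writes keep flowing while the existing rows are checked.
cur.execute("ALTER TABLE users VALIDATE CONSTRAINT users_description_not_null")
```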
It all seemed quite tedious to do this all the time, and that’s why we created pgroll. It’s an open-source command line tool, but it is written in Go, so you can also use it as a library. It is used to manage safe and reversible migrations using the expand-contract pattern. I would like to walk you through how to use it. Basically, pgroll is running a Postgres instance, so you need one running somewhere. After you installed and initialized it, you can start creating your migrations. You can define migrations using JSON files. I will show you an example. Once you have your migration, you run a start command. Then it creates a new view, and you can interact with it through your new application. You can test it. Then you can also shut down your old application. You run the complete command. pgroll removes all of these leftover views and triggers for you. This is the JSON example I was just talking about.
Let’s say that we have a users table that has an ID field, a name field, and a description, and we want to make sure that the description is always there, so we put a NOT NULL constraint on it. In this case, you have to define a name. For the migration, it will be the name of the view, or the schema in Postgres. We define a list of operations. We are altering a column. The table is obviously named users. For the description field, we no longer allow null values in the column. This is the interesting part. This is the up migration. It contains what to do when we are migrating the old values. In this case, it means that if the description is missing, we add the text “description for” followed by the name. Or if the data is there, we just move it to the new column. The down migration defines what to do when there is an error and we want to roll back. In this case, we keep the old value, meaning, if the value was missing, it’s a null, and if there was something, we keep it there.
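Reconstructed from the fields walked through above, the migration file might look roughly like this (written here as a Python dict dumped to JSON; the field names follow pgroll's documented format at the time, but should be checked against the version you run):

```python
import json

migration = {
    "name": "user_description_set_nullable",
    "operations": [
        {
            "alter_column": {
                "table": "users",
                "column": "description",
                "nullable": False,
                # up: how existing or NULL values are migrated to the new column
                "up": ("SELECT CASE WHEN description IS NULL "
                       "THEN 'description for ' || name "
                       "ELSE description END"),
                # down: what to keep if the migration is rolled back
                "down": "description",
            }
        }
    ],
}

with open("user_description_set_nullable.json", "w") as f:
    json.dump(migration, f, indent=2)
```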
Here is the start command. Let’s see in psql what just happened. We have a users table with these three columns, but you can see that pgroll added a new column. Remember, there is this migration ongoing right now. In the old description column, there are records that do not yet satisfy the constraint. In the new description column, the backfill value is already there for us to use. We can inspect what schemas are in the database. We can notice that there is this create_users_table, that’s the old schema version. The new one is the user_description_set_nullable, which is the name of the migration we just provided in our JSON. Let’s try to insert some values into this table. We are inserting two values. The first one is basically how the new application version is behaving. The description is not empty. In the second record, we are mimicking what the old application is doing. Here the description is NULL. Let’s say that we succeeded. We can try to query this table.
From the old app’s point of view, we can set the search path to the old schema version and perform the following query so we can inspect what happened after we inserted these values. This is what we get back. The description for Alice is, this is Alice, and for Bob it’s NULL because the old application doesn’t enforce the constraint. Let’s change the search path again to the new schema version and perform the same query. Here we can see that we have the description for Alice. Notice that Bob has a description. It is the default description or default migration we provided in the JSON file. Then we might complete the migration using the complete command, and we can see that the old schema is cleaned up. Also, the intermediary column is also removed, and the triggers, functions, everything is removed. Check out pgroll. It’s open source. It takes care of mostly everything. There is no need to manually create new views, functions, new columns, nothing. After you complete your migrations, it cleans up after itself. It is still under development, so there are a few missing features. For example, few missing migrations. We do not yet support adding comments, unfortunately, or batched migrations.
Takeaways
Pencea: Basically, what we presented so far were bits and pieces from this puzzle that we want to build the CI/CD data pipeline. What we imagined when we created this was, somebody creates a PR. Then, test environment with a test database with valid data that’s also safe to use, gets created for them. Then the tests are run. Everything is validated, PR is merged. Then it goes through the pipeline, and nobody has to take care or worry about migrations, because we can do the changes and everything.
Ványi: The migrations are done without downtime. If your pull request is merged, it goes through the testing pipeline, and if everything passes, that’s so nice. We can clean up after ourselves and remove the old schema. If maybe there is a test failure or something is not working correctly, we can roll back anytime, because the old schema is still kept around just in case. As we just demonstrated or told you about, there are still some work left for us, but we already have some building blocks that you can integrate into your CI/CD pipeline. You can create a test database on the fly using GitHub notifications, fill it with safe and relevant data to test. You can create schema changes and merge them back and forth without worrying about data migrations. You can deploy and roll back your application without any downtime.
Questions and Answers
Participant 1: Does pgroll take care of keeping the metadata of every migration, whether it is started, ongoing, or finished?
Ványi: Yes, there is a migrations table. Also, you can store, obviously, your migrations file in Git if you want to version control it, but pgroll has its own bookkeeping for past migrations.
Participant 2: For the copying of the data from production, was that for local tests, local dev, or the dev? How did you control costs around constantly copying that data, standing up databases, and tearing them back down?
Pencea: It’s usually for something that sits in the cloud, so not for the local dev.
Participant 2: How did you control cost if you’re constantly standing up a near production size database?
Pencea: What we use internally is data branching. We don’t start a new instance every time. We have a separate schema inside a bigger database. Also, what we offer right now is copying 10k of data, it’s not much in terms of storage. We figured it should be enough for testing purposes.
Participant 3: I saw in your JSON file that you can do migrations that pgroll knows about, like set nullable to false. Can you also do pure SQL migrations?
Ványi: Yes. We don’t yet support every migration. If there is anything missing, you can always work around it by using raw SQL migrations. In this case, you can shoot yourself in the foot, because, for example, in case of NOT NULL, we take care of the skipping of the table scan for you. When you are writing your own raw SQL migration, you have to be careful not to block your table and the database access.
Participant 4: It’s always amazed me that these databases don’t do safer actions for these very common use cases. Have you ever talked to the Postgres project on improving the actual experience of just adding a new column, or something? It should be pretty simple.
Ványi: We’ve been trying to have conversations about it, but it is a very mature project, and it is somewhat hard to change such a fundamental part of this database. Constraints are like the basic building block for Postgres, and it’s not as easy to just make it more safe. There is always some story behind it.
Pencea: I think developer experience was not necessarily something that people were concerned about, up until recently. I feel like sometimes it was actually the opposite, if it was harder, you looked cooler, or you looked like a hacker. It wasn’t exactly something that people would optimize for. I think it’s something that everybody should work towards, because now everybody has an ergonomic chair or something, and nobody questions that, but we should work towards the same thing about developer experience, because it’s ergonomics in the end.
Participant 5: In a company, assuming they are adopting pgroll, all these scripts can grow in number, so at some point you have to apply all of them, I suppose, in order. Is there any sequence number, any indication, like how to apply these. Because some of them might be serial, some of them can be parallelized. Is there any plan to give direction on the execution? I’ve seen there is a number in the script file name, are you following that as a sequence number, or when you’re then developing your batching feature, you can add a sequence number inside?
Ványi: Do we follow some sequence number when we are running migrations?
Yes and no. pgroll maintains its own table or bookkeeping, where it knows what was the last migration, what is coming next? The number in the file name is not only for pgroll, but also for us.
Participant 6: When you have very breaking migrations using pgroll, let’s say you need to rename a column or even change its type, where you basically replicate to a new column and then copy over the data. How do you deal with very large tables, say, millions of rows? Because you could end up having some performance issues with copying these large amounts of data.
Ványi: How do we deal with tables that are basically big? How do we make sure that it doesn’t impact the performance of the database?
For example, in case of moving the values to the new column, we are creating triggers that move the data in batches. It’s not like everything is copied in one go, and you cannot really use your Postgres database because it is busy copying the old data. We try to minimize and distribute the load on the database.
Participant 7: I know you were using the small batches to copy the records from the existing column to the new column. Once you copy all the records, only then you will remove the old column. There is a cost with that.
MMS • Ben Linders
DORA can help to drive sustainable change, depending on how it is used by teams and the way it is supported in a company. According to Carlo Beschi, getting good data for the DORA keys can be challenging. Teams can use DORA reports for continuous improvement by analysing the data and taking actions.
Carlo Beschi spoke about using DORA for sustainable improvement at Agile Cambridge.
Doing DORA surveys in your company can help you reflect on how you are doing software delivery and operation as Beschi explained in Experiences from Doing DORA Surveys Internally in Software Companies. The way you design and run the surveys, and how you analyze the results, largely impact the benefits that you can get out of them.
Treatwell’s first DORA implementation in 2020 focused on getting DORA metrics from the tools. They set up a team that sits between their Platform Engineering team and their “delivery teams” (aka product teams, aka stream aligned teams), called CDA – Continuous Delivery Acceleration team. Half of their time is invested in making other developers’ and teams’ lives better, and the other half is about getting DORA metrics from the tools:
We get halfway there, as we manage to get deployment frequency and lead time for changes for almost all of our services running in production, and when the team starts to dig into “change failure rate”, Covid kicks in and the company is sold.
DORA can help to drive sustainable change, but it depends on the people who lead and contribute to it, and how they approach it, as Beschi learned. DORA is just a tool, a framework, that you can use to:
- Run a lightweight assessment of your teams and organisation
- Play back the results, inspire reflection and action
- Check again a few months / one year later, maybe with the same assessment, to see if / how much “the needle has moved”
Beschi mentioned that teams use the DORA reports as part of their continuous improvement. The debrief about the report is not too different from a team retrospective, one that brings in this perspective and information, and from which the team defines a set of actions that are then listed, prioritised, and executed.
He has seen benefits from using DORA in terms of aligning people on “this is what building and running good software nowadays looks like”, and “this is the way the best in the industry work, and a standard we aim for”. Beschi suggested focusing the conversation on the capabilities, much more than on the DORA measures:
I’ve had some good conversations, in small groups and teams, starting from the DORA definition of a capability. The sense of “industry standard” helped move away from “I think this” and “you think that”.
Beschi mentioned the advice and recommendations from the DORA community on “let the teams decide, let the teams pick, let the teams define their own ambition and pace, in terms of improvement”. This helps in keeping the change sustainable, he stated.
When it comes to meeting the expectations of senior stakeholders, if your CTO is the sponsor of a DORA initiative there might be “pushback” on teams making their own decisions, and expectations regarding the “return on investment” of doing the survey, with pressure to have more things change more quickly, Beschi added.
A proper implementation of DORA is far from trivial, Beschi argued. The most effective ones rely on a combination of data gathered automatically from your system alongside qualitative data gathered by surveying (in a scientific way) your developers. Getting good data quickly from the systems is easier said than done.
When it comes to getting data from your systems for the four DORA keys, while there has been some good progress in the tooling available (both open source and commercial), it still requires effort to integrate any of them into your own ecosystem. The quality of your data is critical.
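As a tiny illustration of what deriving the keys from your systems can involve (a sketch, not something from the talk), deployment frequency, lead time for changes, and change failure rate can be computed from deployment records along these lines; the hard part in practice is making the underlying events complete and trustworthy.

```python
from datetime import datetime
from statistics import median

# One record per production deployment: when the change was committed,
# when it reached production, and whether it caused a failure.
deployments = [
    {"committed": datetime(2024, 11, 4, 9, 0), "deployed": datetime(2024, 11, 4, 15, 0), "failed": False},
    {"committed": datetime(2024, 11, 5, 10, 0), "deployed": datetime(2024, 11, 6, 11, 0), "failed": True},
    {"committed": datetime(2024, 11, 7, 8, 30), "deployed": datetime(2024, 11, 7, 9, 45), "failed": False},
]

window_days = 7
deployment_frequency = len(deployments) / window_days
lead_time = median(d["deployed"] - d["committed"] for d in deployments)
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)

print(f"deployment frequency: {deployment_frequency:.2f} per day")
print(f"median lead time for changes: {lead_time}")
print(f"change failure rate: {change_failure_rate:.0%}")
```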
Startups and scale-ups are not necessarily very disciplined when it comes to consistent usage of their incident management processes, and this significantly impacts the accuracy of your “change failure rate” and “response time” measures, Beschi mentioned.
Beschi mentioned several resources for companies that are interested in using DORA:
- The DORA website, where you can self-serve all DORA key assets and find the State of DevOps reports
- The DORA community has a mailing list and bi-weekly video calls
- The Accelerate book
In the community you will find a group of passionate and experienced practitioners, very open, sharing their stories “from the trenches” and very willing to onboard others, Beschi concluded.