Picture this: It’s Double 11 (Singles’ Day), China’s colossal online shopping frenzy, and millions of people are hitting ‘buy’ simultaneously. Behind the scenes, a distributed database called OceanBase is handling the massive traffic spike, ensuring every transaction proceeds without a hitch.
OceanBase is not just built to handle peak shopping traffic; it was specifically designed to overcome the limitations of traditional monolithic databases. With its distributed architecture, OceanBase ensures ultra-low-latency transactions, high throughput, and seamless scalability — essential for industries that need to handle large-scale data and real-time applications. Its ability to process millions of transactions per second and its robust fault tolerance and elastic scaling capabilities have made it a newly emerged innovative force for businesses that demand reliability even during peak loads.
From crisis to creation
OceanBase’s origin story is one of necessity and innovation. Rewind to 2013: Alipay, already a leader in the payments world, faced a scaling crisis as its monolithic Oracle database struggled to handle the rising volume of transactions.
However, the technical challenges extended beyond the database itself. “The entire technical infrastructure was a massive fabric under extreme stress,” the OceanBase team explains.
Recognizing that throwing money at the database and hardware was not sustainable, Alipay developed a fully distributed database, OceanBase, from the ground up to resolve performance bottlenecks and scale effortlessly.
In 2019, OceanBase made history as the first distributed database to top the TPC-C benchmark, scoring 60 million tpmC on standard x86 servers and outperforming Oracle. It then shattered its own record in 2020 with 707 million tpmC. This marked a turning point: the top spot in a benchmark traditionally dominated by monolithic systems now belonged to a distributed database.
Open source, enterprise-ready
Today, OceanBase serves customers from a wide range of industries, from banks and insurers to telecom operators and retailers. Its open-source approach has been a driving force behind its growing popularity, making it accessible to developers worldwide and accelerating collaboration within the community.
“Open source is essential for expanding our global reach,” OceanBase says. Now, the open-source version of OceanBase supports technology exploration and collaboration and accelerates further innovations, while the enterprise version focuses on advanced security features, catering to the rigorous demands of financial institutions and customers from other sectors.
What sets OceanBase apart is its ability to deliver seamless, unified capabilities for mission-critical workloads, ensuring consistency and efficiency across transactional, analytical, and AI-driven operations.
Its hybrid TP and AP architecture delivers high performance for both transactional and analytical workloads, eliminating the need for separate systems. With multi-model integration, OceanBase supports relational, JSON, key-value, and other data models, enabling businesses to handle diverse data types efficiently within a single database. Furthermore, its vector hybrid search capabilities power AI-driven workloads, bringing generative AI and recommendation system applications into a unified database environment.
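To make the multi-model claim concrete, here is a minimal sketch of what mixing relational and JSON data in a single table might look like, assuming OceanBase’s MySQL-compatible mode and a locally reachable instance; the connection details, port, and table are illustrative rather than taken from OceanBase documentation.

```python
# Hedged sketch: assumes an OceanBase instance in MySQL mode reachable on the
# host/port below; adjust credentials and names for a real deployment.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=2881, user="root",
                       password="", database="demo")
try:
    with conn.cursor() as cur:
        # One table mixing relational columns with a JSON document column.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS orders (
                id BIGINT PRIMARY KEY,
                buyer VARCHAR(64),
                payload JSON
            )
        """)
        cur.execute(
            "INSERT INTO orders (id, buyer, payload) VALUES (%s, %s, %s)",
            (1, "alice", '{"items": [{"sku": "A1", "qty": 2}], "channel": "app"}'),
        )
        # Query a JSON attribute alongside the relational columns.
        cur.execute("SELECT buyer, JSON_EXTRACT(payload, '$.channel') FROM orders")
        print(cur.fetchall())
    conn.commit()
finally:
    conn.close()
```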
Beyond its unified design and performance, OceanBase is also secure, cost-efficient, and flexible. Security is paramount for financial institutions, and OceanBase offers transparent data encryption, granular access controls, and comprehensive audit trails, backed by SOC 2 and PCI DSS certifications.
Additionally, OceanBase caters to a wide range of deployment scenarios, whether fully on-premises, cloud-native, or somewhere in between. This multi-cloud approach lets businesses tailor their technical stack to specific needs while avoiding vendor lock-in.
Database for AI and hybrid workloads
OceanBase’s roadmap is designed to meet the growing complexity of modern data-driven workloads. Key upcoming features include enhanced Hybrid Transactional/Analytical Processing (HTAP) for real-time insights, a high-performance NoSQL KV store offering a reliable alternative to HBase, and vector hybrid search to support demanding generative AI and recommendation workloads efficiently.
As organizations increasingly rely on AI and hybrid workloads, the role of the database is more critical than ever. OceanBase is not just adapting to this shift but actively driving innovation, ensuring that businesses can seamlessly manage transactional, analytical, and AI-driven operations within a unified platform.
Image credit: iStockphoto/jullasart somdok
MMS • Aditya Kulkarni
Article originally posted on InfoQ.
![](https://mobilemonitoringsolutions.com/wp-content/uploads/2024/11/generatedHeaderImage-1731687384256.jpg)
Thoughtworks recently published their Technology Radar Volume 31, providing an opinionated guide to the current technology landscape.
According to the Radar, generative AI and large language models (LLMs) dominate, with a focus on their responsible use in software development. AI-powered coding tools are evolving, necessitating a balance between AI assistance and human expertise.
Rust is gaining prominence in systems programming, with many new tools being written in it. WebAssembly (WASM) 1.0’s support by major browsers is opening new possibilities for cross-platform development. The report also notes rapid growth in the ecosystem of tools supporting language models, including guardrails, evaluation frameworks, and vector databases.
In the Techniques quadrant, notable items in the Adopt ring include 1% canary releases, component testing, continuous deployment, and retrieval-augmented generation (RAG). The Radar stresses the need to balance AI innovation with proven engineering practices, maintaining crucial software development techniques like unit testing and architectural fitness functions.
For Platforms, the Radar highlights tools like Databricks Unity Catalog, FastChat, and GCP Vertex AI Agent Builder in the Trial ring. It also assesses emerging platforms such as Azure AI Search, large vision model platforms such as V7, Nvidia Deepstream SDK and Roboflow, along with SpinKube. This quadrant highlights the rapid growth in tools supporting language models, including those for guardrails, evaluations, agent building, and vector databases, indicating a significant shift towards AI-centric platform development.
The Tools section underscores the importance of having a robust toolkit that combines AI capabilities with reliable software development utilities. The Radar recommends adopting Bruno, K9s, and visual regression testing tools like BackstopJS. It suggests trialing AWS Control Tower, ClickHouse, and pgvector, among others, reflecting a focus on cloud management, data processing, and AI-related database technologies.
For Languages and Frameworks, dbt and Testcontainers are recommended for adoption. The Trial ring includes CAP, CARLA, and LlamaIndex, reflecting the growing interest in AI and machine learning frameworks.
The Technology Radar also highlighted the growing interest in small language models (SLMs) as an alternative to large language models (LLMs) for certain applications, noting their potential for better performance in specific contexts and their ability to run on edge devices. This edition drew a parallel between the current rapid growth of AI technologies and the explosive expansion of the JavaScript ecosystem around 2015.
Overall, Technology Radar Vol. 31 reflects a technology landscape heavily influenced by AI and machine learning advancements, while also emphasizing the continued importance of solid software engineering practices. Created by Thoughtworks’ Technology Advisory Board, the Technology Radar provides twice-yearly guidance for developers, architects, and technology leaders navigating the rapidly evolving tech ecosystem, indicating which technologies to adopt, trial, assess, or approach with caution.
The Thoughtworks Technology Radar is available in two formats for readers: an interactive online version accessible through the website, and a downloadable PDF document.
MMS • Steef-Jan Wiggers
Article originally posted on InfoQ.
AWS recently announced a new integration between AWS Amplify Hosting and Amazon Simple Storage Service (S3), enabling users to deploy static websites from S3 quickly. According to the company, this integration streamlines the hosting process, allowing developers to deploy static sites stored in S3 and deliver content over AWS’s global content delivery network (CDN) with just a few clicks.
AWS Amplify Hosting, a fully managed hosting solution for static sites, now offers users an efficient method to publish websites using S3. The integration leverages Amazon CloudFront as the underlying CDN to provide fast, reliable access to website content worldwide. Amplify Hosting handles custom domain setup, SSL configuration, URL redirects, and deployment through a globally available CDN, ensuring optimal performance and security for hosted sites.
Setting up a static website using this new integration begins with an S3 bucket. Users can configure their S3 bucket to store website content, then link it with Amplify Hosting through the S3 console. From there, a new “Create Amplify app” option in the Static Website Hosting section guides users directly to Amplify, where they can configure app details like the application name and branch name. Once saved, Amplify instantly deploys the site, making it accessible on the web in seconds. Subsequent updates to the site content in S3 can be quickly published by selecting the “Deploy updates” button in the Amplify console, keeping the process seamless and efficient.
(Source: AWS News blog post)
This integration benefits developers by simplifying deployments, enabling rapid updates, and eliminating the need for complex configuration. For developers looking for programmatic deployment, the AWS Command Line Interface (CLI) offers an alternative way to deploy updates by specifying parameters like APP_ID and BRANCH_NAME.
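As a rough sketch of what such a programmatic deployment could look like using the AWS SDK for Python rather than the CLI, assuming credentials are already configured; the app ID, branch name, and bucket below are placeholders for an existing Amplify app and S3 location.

```python
# Hedged sketch: app ID, branch, and bucket path are placeholders; parameter
# support (e.g. sourceUrlType for bucket prefixes) may depend on SDK version.
import boto3

amplify = boto3.client("amplify", region_name="us-east-1")

response = amplify.start_deployment(
    appId="d1a2b3c4example",                 # placeholder Amplify app ID
    branchName="main",                        # placeholder branch name
    sourceUrl="s3://my-static-site-bucket/",  # S3 location of the site content
    sourceUrlType="BUCKET_PREFIX",            # assumption: deploy a whole prefix
)
print(response["jobSummary"]["status"])
```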
Alternatively, according to a respondent on a Reddit thread, users could opt for Cloudflare:
If your webpage is static, you might consider using Cloudflare – it would probably be cheaper than the AWS solution.
Or they could use S3 and GitLab CI, as suggested in a tweet by DrInTech:
Hello everyone! I just completed a project to host a static portfolio website, leveraging a highly accessible and secure architecture. And the best part? It costs only about $0.014 per month!
Lastly, the Amplify Hosting integration with Amazon S3 is available in the AWS Regions where Amplify Hosting is offered; pricing details for S3 and Amplify Hosting can be found on their respective pricing pages.
Podcast: Trends in Engineering Leadership: Observability, Agile Backlash, and Building Autonomous Teams
MMS • Chris Cooney
Article originally posted on InfoQ.
![](https://mobilemonitoringsolutions.com/wp-content/uploads/2024/11/engineering-culture-podcast-logo-1731494198545.jpeg)
Transcript
Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today I’m sitting down across many miles with Chris Cooney. Chris, welcome. Thanks for taking the time to talk to us today.
Introductions [01:03]
Chris Cooney: Thank you very much, Shane. I’m very excited to be here, and it is indeed many miles. I think it’s not quite the antipodes, right, but it’s very, very close to the antipodes; the antipode of the UK is somewhere off New Zealand, so we are about as far away as it gets. The wonders of the internet, I suppose.
Shane Hastie: Pretty much so, and I think the time offset is 13 hours today. My normal starting point is who is Chris?
Chris Cooney: That’s usually the question. So hello, I’m Chris. I am the Head of Developer Relations for a company called Coralogix. Coralogix is a full-stack observability platform that processes data in-stream, without indexing. We are based in several different countries. I am based in the UK, as you can probably tell from my accent. I have spent the past 11, almost 12 years now as a software engineer. I started out as a Java engineer straight out of university, then quickly got into front-end engineering, didn’t like that very much, and moved into SRE and DevOps, and that’s really where I started to enjoy myself. And over the past several years, I’ve moved into engineering leadership and got to see organizations grow and change, and how certain decisions affect people and teams.
And now more recently, as the Head of Developer Relations for Coralogix, I get to really enjoy going out to conferences and meeting people, but I also get a lot of research time to find out what happens to companies when they employ observability. And I get to understand the trends in the market in a way that I never would’ve been able to see before as a software engineer, because I get to go meet hundreds and hundreds of people every month, and they all give me their views and insights. And so, I get to collect all those together, and that’s what makes me very excited to talk on this podcast today about the various topics that are going on in the industry.
Shane Hastie: So let’s dig into what are some of those trends? What are some of the things that you are seeing in your conversations with engineering organizations?
The backlash against “Agile” [02:49]
Chris Cooney: Yes. When I started out, admittedly 11, 12 years ago is a while, but it’s not that long ago really. I remember when I started out in the first company I worked in, we had an Agile consultant come in. And they came in and they explained to me the principles of agility and so on and so forth, so gave me the rundown of how it all works and how it should work and how it shouldn’t work and so on. We were all very skeptical, and over the years I’ve got to see agility become this massive thing. And I sat in boardrooms with very senior executives in very large companies listening to Agile Manifesto ideas and things like that. And it’s been really interesting to see that gel in. And now we’re seeing this reverse trend of people almost emotionally pushing back against not necessarily the core tenets of Agile, but just the word. We’ve heard it so many times, there’s a certain amount of fatigue around it. That’s one trend.
The value of observability [03:40]
The other trend I’m seeing technically is this move around observability. Obviously, I spend most of my time talking about observability now. It used to be this thing that you have to have for when things have gone wrong, or to stop things from going wrong. And there is this big trend now of organizations moving towards questions that are less to do with what’s going wrong. It’s a broader question, like, “Where are we as a company? How many dev hours did we put into this thing? How does that factor into the reduction in mean time to recovery, that kind of thing?” They’re much broader questions now, blurring in business measures, technical measures, and lots more people measures.
I’ll give you a great example. Measuring rage clicks on an interface is a thing now, measuring the emotionality with which somebody clicks a button. It’s fascinating, and I think it’s a nice microcosm of what’s going on in the industry. Our measurements are getting much more abstract. And what that’s doing to people, what it’s doing to engineering teams, is fascinating. So there’s lots and lots and lots.
And then, obviously there’s the technical trends moving around AI and ML and things like that, and what that’s doing to people, and the uncertainty around that and also the excitement. It’s a pretty interesting time.
Shane Hastie: So let’s dig into one of those areas in terms of the people measurements. So what can we measure about people through building observability into our software?
The evolution of what can be observed [04:59]
Chris Cooney: That’s a really interesting topic. I think it’s better to contextualize it. To begin with, we started out with basically CPU, memory, disk, network, the big four. And then, we started to get a bit clever and looked at things like latency, response sizes, data exchanged over a server, and so forth. And then, as we built up, we started to look at things like marketing metrics, so bounce rates, how long somebody stays on a page, and that kind of thing.
Now we’re looking at the next tier, the next level of abstraction up, which is more like: did the user have a good experience on the website, and what does that mean? So you see web vitals are starting to break into this area, things like, when was the meaningful moment that a user saw the content they wanted to see? Not first ping, not first load, not the template loading. The user went to this page, they wanted to see a product page. How long was it, not just how long the page took to load, but until they saw all the meaningful information they needed? And that’s an amalgamation of lots and lots of different signals and metrics.
I’ve been talking recently about this distinction between a signal and an insight. In my taxonomy, the way I usually slice it, a signal is a very specific technical measurement of something: latency, page load time, bytes exchanged, that kind of thing. An insight is an amalgamation of lots of different signals to produce one useful thing, and my litmus test for an insight is that you can take it to your non-technical boss and they will understand it. They will understand what you’re talking about. When I say to my non-technical boss, “My insight is this user had a really bad experience loading the product page. It took five seconds for the product to appear, and they couldn’t buy the thing. They couldn’t work out where to do it”. That would be a combination of various different measures around where they clicked on the page, how long the HTML ping took, how long the actual network speed was to the machine, and so on.
So that’s what I’m talking about with the people experience metrics. It’s fascinating in that respect, and there’s this new level now, which is directly answering business questions. It’s almost like we’ve built scaffolding up over the years, deeply technical. When someone would say, “Did that person have a good experience?” we’d say, “Well, the page latency was this, and the HTTP response was 200, which is good, but then the page load time was really slow”. But now we just say yes or no, because of X, Y and Z. And so, that’s where we’re going, I think. And this is all about that trend of observability moving into the business space: taking much broader, encompassing measurements at a much higher level of abstraction. And that’s what I mean when I say more people metrics, as a general term.
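As a toy illustration of the signal-versus-insight distinction Cooney describes, the sketch below folds a few made-up signals into one plain-language statement; the field names and thresholds are invented for the example.

```python
# Illustrative only: signal names and thresholds are made up for this sketch.
def page_experience_insight(signals: dict) -> str:
    """Combine low-level signals into a statement a non-technical reader understands."""
    slow = signals["time_to_meaningful_content_ms"] > 2500
    confused = signals["rage_clicks"] >= 3
    failed = signals["http_status"] >= 400

    if failed:
        return "The user could not load the product page at all."
    if slow and confused:
        return "The user had a bad experience: the page was slow and they could not find the buy button."
    if slow:
        return "The page loaded, but too slowly for a good experience."
    return "The user had a good experience on this page."

print(page_experience_insight(
    {"time_to_meaningful_content_ms": 5000, "rage_clicks": 4, "http_status": 200}
))
```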
Shane Hastie: So what happens when an organization embraces this? When not just the technical team, but the product teams, when the whole organization is looking at this and using this to perhaps make decisions about what they should be building?
Making sense of observations [07:47]
Chris Cooney: Yes. There are two things here, in my opinion. One is there’s a technical barrier, which is making the information literally available in some way. So, putting a query engine in front of it. What’s an obvious one? Putting Kibana in front of OpenSearch is the most common example. It’s a way to query your data. Putting a SQL query engine in front of your database is another good example. So just doing that is the technical part. And that is not easy, by the way, at a certain level of scale. Technically, it is really hard to make high-performance queries work for hundreds, potentially thousands, of concurrent users. That’s not easy.
Let’s assume that’s out of the way and the organization has worked that out. The next challenge is, “Well, how do we make it so that users can get the questions they need answered, answered quickly, without specialist knowledge?” And we’re not there yet. Obviously AI makes a lot of very big promises about natural language query. It’s something that we’ve built into the platform at Coralogix ourselves. It works great. It works really, really well. And I think what we have to do now is work out how we make it as easy as possible to get access to that information.
Let’s assume all those barriers are out of the way, and an organization has achieved that. I saw something similar to this when I was a Principal Engineer at Sainsbury’s, when we started to surface data. It’s an adjacent example, but still relevant: the introduction of SLOs and SLIs into the teams. Before that, if I went to one team and said, “How has your operational success been this month?” they would say, “Well, we’ve had a million requests and we serviced them all in under 200 milliseconds”. Okay. I don’t know what that means. Is 200 milliseconds good? Is that terrible? What does that mean? We’d go to another team and they’d say, “Well, our error rate is down to 0.5%”. Well, brilliant. But last month it was 1%. The month before that it was 0.1% or something.
When we introduced SLOs and SLIs into teams, we could see across all of them, “Hey, you breached your error budget. You have not breached your error budget”. And suddenly, there was a universal language around operational performance. And the same thing happens when you surface the data. You create a universal language around cross-cutting insights across different people.
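As a rough illustration of the error-budget framing Cooney describes, here is a small sketch that turns raw request counts into a budget statement; the SLO target and numbers are invented for the example.

```python
# Illustrative only: the SLO target and request counts are made-up numbers.
def error_budget_report(total_requests: int, failed_requests: int,
                        slo_target: float = 0.999) -> str:
    allowed_failures = total_requests * (1 - slo_target)      # the error budget
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    status = "breached" if failed_requests > allowed_failures else "within"
    return (f"SLO {slo_target:.3%}: {failed_requests} of {total_requests} requests failed; "
            f"{consumed:.0%} of the error budget consumed ({status} budget).")

print(error_budget_report(total_requests=1_000_000, failed_requests=700))
```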
Now, what does that do to people? Well, one, it shines spotlights in places where some people may not want them shined, but it does do that. That’s what the universal language does. It’s not enough just to have the data. You have to have effective access to it. You have to have effective ownership of it. And doing that surfaces conversations that can initially be quite painful. There are lots of people, especially in sufficiently large organizations, who have been getting by by flying under the radar, and it does make that quite challenging.
The other thing that it does: some people, it makes them feel very vulnerable, because they feel like KPIs. They’re not. We’re not measuring their performance on whether they miss their error budget. When I was the business engineer, no one would get fired. We’d sit down and go, “Hey, you missed your error budget. What can we do here? What’s wrong? What are the barriers?” But it made some people feel very nervous and very uncomfortable, and they didn’t like it. Other people thrived and loved it. It became a target. “How much can we beat our budget by this month? How low can we get it?”
Metrics create behaviors [10:53]
So the two things I would say about the big sweeping changes in behavior: it’s that famous phrase, “Build me a metric and I’ll show you a behavior”. So if you measure somebody… human behavior is what they call a type two chaotic system.
By measuring it, you change it. And it’s crazy in the first place. So as soon as you introduce those metrics, you have to be extremely cognizant of what happens to dynamics between teams and within teams. Teams become competitive. Teams begin to look at other teams and wonder, “How the hell are they doing that? How is their error budget so low? What’s going on?” Other teams, maybe in an effort to artificially improve their metrics, will start to lower their deployment frequency and scrutinize every single thing. So while their operational metrics look amazing, their delivery is actually getting worse, and all these various different things go on. So that competitiveness, driven by uncertainty and vulnerability, is a big thing that happens across teams.
The other thing that I found is that the really great leaders, the really brilliant leaders love it. Oh, in fact, all leadership love it. All leadership love higher visibility. The great leaders see that higher visibility and go, “Amazing. Now I can help. Now I can actually get involved in some of these conversations that would’ve been challenging before”.
The slightly more, let’s say worrying leaders will see this as a rod with which to beat the engineers. And that is something that you have to be extremely careful of. Surfacing metrics and being very forthright about the truth and being kind of righteous about it all is great and it’s probably the best way to be. But the consequence is that a lot of people can be treated not very well if you have the wrong type of leadership in place, who see these measurements as a way of forcing different behaviors.
And so, it all has to be done in good faith. It all has to be done based on the premise that everybody is doing their best. And if you don’t start from that premise, it doesn’t matter how good your measurements are, you’re going to be in trouble. Those are the learnings that I took from when I rolled it out and some of the things that I saw across an organization. It was largely very positive though. It just took a bit of growing pains to get through.
Shane Hastie: So digging into the psychological safety that we’ve heard about and known about for a couple of decades now.
Chris Cooney: Yes. Yes.
Shane Hastie: We’re not getting it right.
Enabling psychological safety is still a challenge [12:59]
Chris Cooney: No, no. And I think my experience, when I first got into reading about it, was with things like Google’s Project Aristotle, maybe. My first attempt at educating an organization on psychological safety: they had this extremely long, extremely detailed incident management review, where if something went wrong, they would have, we’re talking a 200-person, sometimes several-day, on the low end maybe five, six hours, deep review of everything. Everyone bickers and argues and points fingers at each other. An enormous document gets produced, it’s filed away, and nobody ever looks at it ever again, because who wants to read those things? It’s just a historical text about bickering between teams.
And what I started to do is say, “Well, why don’t we trial more of a blameless post-mortem method? Let’s just give that a go and we’ll see what happens”. So the first time I did it, the meeting went from what they said was about six hours for the last one to about 45 minutes. I started the meeting by giving a five-minute briefing on why this post-mortem has to be blameless: the aviation industry and the learnings that came from it, that if you hide mistakes, they only get worse. We have to create an environment where you’re okay to surface mistakes. Just that five-minute primer and then about a 40-ish-minute conversation. And we had a document that was more thorough, more detailed, more fact-based, and more honest than any incident review I had ever read before that.
So rolling that out across organizations was really, really fun. But then, I saw it go the other way, where they’d start saying, “Well, it’s psychologically safe”. And it turned into this almost hippie, loving thing, where nobody’s done anything wrong, there is no such thing as a mistake. And no, that’s not the point. The point is that we all make mistakes, not that they don’t exist. And we don’t point blame in a malicious way, but we can attribute a mistake to somebody. You just can’t do it by… And the language in some of these post-mortem documents that I was reading was so indirect. “The system, post a software change, began to fail, blah, blah, blah, blah, blah”. Because they’re desperately trying not to name anybody or name any teams or say that an action occurred. It was almost like the system was just running along and then the vibrations from the universe knocked it out of whack.
And actually, when you got into it, one of the teams pushed a code change. It’s like, “No. Team A pushed a code change. Five minutes later there was a memory leak issue that caused this outage”. And that’s not blaming anybody, that’s just stating the fact in a causal way.
So the thing I learned with that is whenever you are teaching about blameless post-mortem psychological safety, it’s crucial that you don’t lose the relationship between cause and effect. You have to show cause A, effect B, cause B, effect C, and so on. Everything has to be linked in that way in my opinion. Because that forces them to say, “Well, yes. We did push this code change, and yes, it looks like it did cause this”.
That will be the thing, I think, where most organizations get tripped up: they really go all in on psychological safety. “Cool, we’re going to do everything psychologically safe. Everyone’s going to love it”. And they throw the baby out with the bath water, as it were. And they miss the point, which is to get to the bottom of an issue quickly, not to avoid hurting anybody’s feelings, which is a mistake that people sometimes make, I think, especially in large organizations.
Shane Hastie: Circling back around to one of the comments you made earlier on. The agile backlash, what’s going on there?
Exploring the agile backlash [16:25]
Chris Cooney: I often try to talk about larger trends rather than my own experience, purely because anecdotal experience is only useful as an anecdote. So this is an anecdote, but I think it’s a good indication of what’s going on more broadly. When I was starting out, I was a mid-level Java engineer, and this was when agility was really starting to get a hold in some of these larger companies and they started to understand the value of it. And what happened was we were all on the Agile principles. We were regularly reading the Agile Manifesto.
We had a coach called Simon Burchill who was and is absolutely fantastic, completely, deeply, deeply understands the methodology and the point of agility without getting lost in the miasma of various different frameworks and planning poker cards and all the rest of it. And he was wonderful at it, and I was very, very fortunate to study under him in that respect because it gave me a really good, almost pure perspective of agile before all of the other stuff started to come in.
So what happened to me was that we were delivering work, and if we went even a week over budget or a week overdue, the organization would say, “Well, isn’t agile supposed to speed things up?” And it’s like, “Well, no, not really. It’s more that we had a working product six weeks ago, eight weeks ago, and you chose not to go live with it”. Which is fine, but that’s what you get with the agile process. You get working software much earlier, which gives you the opportunity to go live if you get creative about how you productionize it or turn it into a product.
So that was the first thing, I think. One of the seeds of the backlash is a fundamental misunderstanding about what Agile is supposed to be doing for you. It’s not to get things done faster, it’s to just incrementally deliver working software so you have a feedback loop and a conversation that’s going on constantly. And an empirical learning cycle is occurring, so you’re constantly improving the software, not build everything, test everything, deploy it, and find out it’s wrong. That’s one.
The other thing I will say is what I see on Twitter a lot now, or X they call it these days, is the Agile Industrial Complex, which is a phrase that I’ve seen batted around a lot, which is essentially organizations just selling Scrum certifications or various different things that don’t hold that much value. That’s not to say all Scrum certifications are useless. I did one, it was years and years ago, I forget the name of the chap now. It was fantastic. He gave a really, really great insight into Scrum, for example, why it’s useful, why it’s great, times when it may be painful, times when some of its practices can be dropped, the freedom you’ve got within the Scrum guide.
One of the things he said to me that always stuck with me, and this is just an example of a good insight that came from an Agile certification, was: “It’s a Scrum guide, not the Scrum Bible. But it’s a guide. The whole point is to give you an idea. You’re on a journey, and the guide is there to help you along that journey. It is not there to be read like a holy text”. And I loved that insight. It really stuck with me and it definitely informed how I went out and applied those principles later on. So there is a bit of a backlash against those kinds of Agile certifications because, as is the case with almost any service, a lot of it’s good, a lot of it’s bad. And the bad ones are pretty bad.
And then, the third thing I will say is that an enormous amount of power was given to Agile coaches early on. They were almost like the high priests and they were sort of put into very, very senior positions in an organization. And like I said, there are some great Agile coaches. I’ve had the absolute privilege of working with some, and there were some really bad ones, as there are great software engineers and bad software engineers, great leaders and poor leaders and so on.
The problem is that those coaches were advising very powerful people in organizations. And if you’re giving bad advice to very powerful people, the impact of that advice is enormous. We know how to deal with a bad software engineering team. We know how to deal with somebody that doesn’t want to write tests. As a software function, we get that. We understand how to work around that and solve that problem. Sometimes it’s interpersonal, sometimes it’s technical, whatever it is, we know how to fix it.
We have not yet figured out this sort of grand vizier problem, where there is somebody giving advice to the king who doesn’t really understand what they’re talking about, and the king is just taking them at their word. And that’s what happened with Agile. And that, I think, is one of the worst things we could have done: starting to take people at their word as if they are these experts in Agile and blah, blah, blah. It’s ultimately software delivery. That’s what we’re trying to do. We’re trying to deliver working software. And if you’re going to give advice, you’d really better deeply understand delivery of working software before you go on about interpersonal things and that kind of stuff.
So those are the three things I think have driven the backlash. And now there’s just this fatigue around the word Agile. Like I say, I had the benefit of going to conferences, and I’ve seen the word Agile. When I first started talking, it was everywhere. You couldn’t go to a conference where the word Agile wasn’t there, and now it is less and less prevalent, and people talk more about things like continuous delivery, just to avoid saying the word Agile. Because the fatigue is more around the word than it is around the principles.
And the last thing I’ll say is there is no backlash against the principles. The principles are here to stay. It’s just software engineering now. What would’ve been called Agile 10 years ago is just how to build working software now. It’s so deeply ingrained in how we think that we believe we’re pushing back against Agile. We’re not. We’re pushing back against a few words. The core principles are part of software engineering now, and they’re here to stay for a very long time, I suspect.
Shane Hastie: How do we get teams aligned around a common goal and give them the autonomy that we know is necessary for motivation?
Make it easy to make good decisions [21:53]
Chris Cooney: Yes. I have just submitted a conference talk on this, and I won’t say too much, just at the risk of jeopardizing our submission, but the broad idea is this. Let’s say I was in a position where I had 20-something teams, and the wider organization was hundreds of teams. And we had a big problem, which was that every single team had been raised on this idea of, “You pick your tools, you run with it. You want to use AWS, you want to use GCP, you want to use Azure? Whatever you want to use”.
And then after a while, obviously, the bills started to roll in and we started to see that actually this is a rather expensive way of running an organization. And we started to think, “Well, can we consolidate?” So we said, “Yes, we can consolidate”. And a working group went off, picked a tool, bought it, and then went to the teams and said, “Thou shalt use this”, and nobody listened. And then, we went back to the drawing board and they said, “Well, how do we do this?” And I said, “This tool was never picked by them. They don’t understand it, they don’t get it. And they’re stacking up migrating to this tool against all of the deliverables they’re responsible for”. So how do you make it so that teams have the freedom and autonomy to make effective, meaningful decisions about their software, but in a way that there is a golden path in place such that they’re all roughly moving in the same direction?
What we started to build out within Sainsbury’s was a project to completely re-platform the entire organization. It’s still going on now. It’s still happening now. But hundreds and hundreds of developers have been migrating onto this platform. It was a team I was part of. It started in Manchester; I was from Manchester in the UK, and we originally called it the Manchester PaaS, Platform as a Service. I don’t know if you know this, but the bumblebee is one of the symbols of Manchester, so it had a little bumblebee in the UI. It was great. We loved it. And we built it using Kubernetes. We built it using Jenkins for CI/CD, purely because Jenkins was big in the office at the time. It isn’t anymore. Now it’s GitHub Actions.
And what we said was, “Every team in Manchester, every single resource has to be tagged so we know who owns what. Every single time there’s a deployment, we need some way of seeing what it was and what went into it”. And some periods of the year are extremely busy and extremely serious, and you have to do additional change notifications in different systems. For a grocer like Sainsbury’s, which does an enormous amount of trade between, let’s say, November and January, every single team has to raise additional change requests during the Christmas period. But they’re doing 30, 40 commits a day, so they can’t be expected to fill out those forms every single time. So we wondered if we could automate that for them.
And what I realized was, “Okay, this platform is going to make the horrible stuff easy and it’s going to make it almost invisible; not completely invisible because they still have to know what’s going on, but it has to make it almost invisible”. And by making the horrible stuff easy, we incentivize them to use the platform in the way that it’s intended. So we did that and we onboarded everybody in a couple of weeks, and it took no push whatsoever.
We had product owners coming to us about one team that had just started; they’d started their very first sprint. The goal of their first sprint was to have a working API and a working UI. The team produced that just by using our platform, because we made a lot of this stuff easy. So we had dashboard generation, we had alert generation, we had metric generation, because we were using Kubernetes and we were using Istio. We got a ton of HTTP service metrics off the bat. Tracing was built in there.
So in their sprint review at the end of the two weeks, they had built this feature. Cool. “Oh, by the way, we’ve done all of this”. And it was an enormous amount of dashboards and things like that. “Oh, by the way, the infrastructure is completely scalable. It’s multi-AZ failover. There’s no productionizing. It’s already production ready”. The plan was to go live in months. They went live in weeks after that. It changed the conversation, and that was when things really started to catalyze and have ended up in the new project now, which is across the entire organization.
The reason I told that story is because you have to have a give and take. If you try to do it like an edict, a top-down edict, your best people will leave and your worst people will try to work through it. Because the best people want to be able to make decisions and have autonomy. They want to have a sense of ownership of what they’re building. Skin in the game is the phrase that’s often bandied around.
And so, how do you give engineers the autonomy? You build a platform, you make it highly configurable, highly self-service. You automate all the painful bits of the organization, for example, compliance, change request notifications, data retention policies and all that. You automate that to the hilt so that all they have to do is declare some config in a repository and it just happens for them. And then, you make it so the golden path, the right path, is the easy path. And that’s it. That’s the end of the conversation. If you can do that, if you can deliver that, you are in a great space.
If you try to do it as a top-down edict, you will feel a lot of pain and your best people will probably leave you. If you do it as a collaborative effort so that everybody’s on the same golden path, every time they make a decision, the easy decision is the right one, it’s hard work to go against the right decision. Then you’ll incentivize the right behavior. And if you make some painful parts of their life easy, you’ve got the carrot, you’ve got the stick, you’re in a good place. That’s how I like to do it. I like to incentivize the behavior and let them choose.
Shane Hastie: Thank you so much. There’s some great stuff there, a lot of really insightful ideas. If people want to continue the conversation, where do they find you?
Chris Cooney: If you open up LinkedIn and type Chris Cooney, I’ve been reliably told that I am the second person in the list. I’m working hard for number one, but we’ll get there. If you look for Chris Cooney, if I don’t come up, Chris Cooney, Coralogix, Chris Cooney Observability, anything like that, and I will come up. And I’m more than happy to answer any questions. On LinkedIn is usually where I’m most active, especially for work-related topics.
Shane Hastie: Cool. Chris, thank you so much.
Chris Cooney: My pleasure. Thank you very much for having me.
MMS • Aditya Kulkarni
Article originally posted on InfoQ.
![](https://mobilemonitoringsolutions.com/wp-content/uploads/2024/11/generatedHeaderImage-1731563471483.jpg)
Recently, AWS CodeBuild introduced support for managed GitLab self-hosted runners, marking an advancement in its continuous integration and continuous delivery (CI/CD) capabilities. This new feature allows customers to configure their CodeBuild projects to receive and execute GitLab CI/CD job events directly on CodeBuild’s ephemeral hosts.
The integration offers several key benefits, including native AWS integration, compute flexibility, and global availability. GitLab jobs can now seamlessly integrate with AWS services, leveraging features such as IAM, AWS Secrets Manager, AWS CloudTrail, and Amazon VPC. This integration enhances security and convenience for users.
Furthermore, customers gain access to all compute platforms offered by CodeBuild, including Lambda functions, GPU-enhanced instances, and Arm-based instances. This flexibility allows for optimized resource allocation based on specific job requirements. The integration is available in all regions where CodeBuild is offered.
To implement this feature, users need to set up webhooks in their CodeBuild projects and update their GitLab CI YAML files to utilize self-managed runners hosted on CodeBuild machines.
The setup process involves connecting CodeBuild to GitLab using OAuth, which requires additional permissions such as `create_runner` and `manage_runner`. It’s important to note that CodeBuild will only process GitLab CI/CD pipeline job events if a webhook has filter groups containing the `WORKFLOW_JOB_QUEUED` event filter. The buildspec in CodeBuild projects will be ignored unless `buildspec-override:true` is added as a label, as CodeBuild overrides it to set up the self-managed runner.
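For readers who script their infrastructure, a webhook with that filter could be created roughly as follows with the AWS SDK for Python; this is a hedged sketch in which the project name is a placeholder and the project is assumed to already be connected to GitLab via OAuth.

```python
# Hedged sketch: "gitlab-runner-project" is a placeholder CodeBuild project name.
import boto3

codebuild = boto3.client("codebuild", region_name="us-east-1")

# Only jobs matching this filter group should trigger a runner build.
codebuild.create_webhook(
    projectName="gitlab-runner-project",
    filterGroups=[[
        {"type": "EVENT", "pattern": "WORKFLOW_JOB_QUEUED"},
    ]],
)
```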
When a GitLab CI/CD pipeline run occurs, CodeBuild receives the job events through the webhook and starts a build to run an ephemeral GitLab runner for each job in the pipeline. Once the job is completed, the runner and associated build process are immediately terminated.
As an aside, GitLab has been in the news since earlier this year with its plan to introduce CI Steps, which are reusable and composable pieces of a job that can be referenced in pipeline configurations. These steps will be integrated into the CI/CD Catalog, allowing users to publish, unpublish, search, and consume steps similarly to how they use components.
Moreover, GitLab is working on providing users with better visibility into component usage across various project pipelines. This will help users identify outdated versions and take prompt corrective actions, promoting better version control and project alignment.
AWS CodeBuild has been in the news as well, as it added support for Mac builds. Engineers can build artifacts on managed Apple M2 instances that run on macOS 14 Sonoma. A few weeks ago, AWS CodeBuild enabled customers to configure automatic retries for their builds, reducing manual intervention upon build failures. It has also added support for building Windows Docker images in reserved fleets.
Such developments demonstrate the ongoing evolution of CI/CD tools and practices, with a focus on improving integration, flexibility, and ease of use for DevOps teams.
MMS • Noémi Ványi, Simona Pencea
Article originally posted on InfoQ.
![](https://mobilemonitoringsolutions.com/wp-content/uploads/2024/11/Nomi-Vnyi-Simona-Pencea-medium-1728648864871.jpg)
Transcript
Pencea: I bet you’re wondering what this picture is doing at a tech conference. These are two German academics. They started to build a dictionary, but they actually became famous because, along the way, they collected a lot of folk stories. The reason they are here is partly because they were my idols when I was a child. I thought there was nothing better to do than to listen to stories and collect them. My family still makes fun of me because I ended up in tech after that. The way I see it, it’s not such a big difference. Basically, we still collect folk stories in tech, but we don’t call them folk stories, we call them best practices. Or we go to conferences to learn about them, basically to learn how other people screwed up, so that we don’t do the same. After we collect all these stories, we put them all together, we call it developer experience, and we try to improve that. This brings us to the talk that we have: improving developer experience using automated data CI/CD pipelines. My name is Simona Pencea. I am a software engineer at Xata.
Ványi: I’m Noémi Ványi. I’m also a software engineer on the backend team of Xata. Together, we will be walking through the data developer experience improvements we’ve been focused on recently.
Pencea: We have two topics on the agenda. The first one, testing with separate data branches covers the idea that when you create a PR, you maybe want to test your PR using a separate data environment that contains potentially a separate database. The second one, zero downtime migrations, covers the idea that we want to improve the developer experience when merging changes that include schema changes, without having any downtime. Basically, zero downtime migrations. For that, we developed an open-source tool called pgroll. Going through the first one, I will be covering several topics. Basically, I will start by going through the code development flow that we focused on. The testing improvements that we had in mind. How we ensured we have data available in those new testing environments. How we ensured that data is safe to use.
Code Development Workflow
This is probably very familiar to all of you. It’s basically what I do every day. When I’m developing, I’m starting with the local dev. I have my local dev data. It’s fake data. I’m trying to create a good local dev dataset when I’m testing my stuff. I’m trying to think about corner cases and cover them all. The moment I’m happy with what I have in my local dev, I’m using the dev environment. This is an environment that is shared between us developers. It’s located in the cloud, and it also has a dataset. This is the dev dataset. This is also fake data, but it’s crowdfunded from all the developers that use this environment. There is a chance that you find something that’s not in the local dev. Once everything is tested, my PR is approved. I’m merging it. I reach staging.
In staging, there is another dataset which is closer to real life, basically, because it’s from beta testing or from demos and so on. It’s not the real thing. The real thing is only in prod, and this is the production data. This is basically the final test. The moment my code reaches prod, it may fail, even though I did my best to test with everything else. In my mind, I would like to get my hands on the production dataset somehow, without breaking anything if possible, to test with it before I reach production, so that I minimize the chance of bugs.
Data Testing Improvements – Using Production Data
This is what led to the question: can we use production data for testing? We’ve all received those emails that say “test email” when we weren’t test users. Production data would bring a lot of value when used for testing. If we go through the pros, the main thing is, it’s real data. It’s what real users created. It’s basically the most valuable data we have. It’s also large. It’s probably the largest dataset you have, if we don’t count load-test-generated data and so on. It’s fast, in the way that you don’t have to write a script or populate a database. It’s already there, you can use it. There are cons to this. There are privacy issues. It’s production data: there’s private information, private health information. I probably don’t even have permission from my users to use the data for testing. Or, am I storing it in the right storage? Is this storage configured with the right settings, so that I’m not breaking GDPR or some other privacy laws?
Privacy issues are a big con. The second thing, as you can see, large is also a con, because a large dataset does not mean a complete dataset. Normally, all the users will use your product in the most common way, and then you’ll have some outliers which give you the weird bugs and so on. Having a large dataset while testing may prevent you from seeing those corner cases, because they are better covered. Refreshing takes time because of the size. Basically, if somebody changes the data with another PR or something, you need to refresh everything, and then it takes longer than if you have a small subset. Also, because of another PR, you can get into data incompatibility. Basically, you can get into a state where your test breaks, but it’s not because of your PR. It’s because something broke, or something changed, and now it’s not compatible anymore.
If we look at the cons, there are basically two categories we can take from them. The first is related to data privacy, and the second is related to the size of the dataset. That gives us our requirements. The first one would be: we would like to use production data, but in a safe way and, if possible, fast. Since we want to do a CI/CD pipeline, let’s make it automated. I don’t want to run a script by hand or something. Let’s have the full experience. Let’s start with the automated part. It’s very hard to cover all the ways software developers work. What we did first was to target a simplification, considering GitHub as a standard workflow, because the majority of developers will use GitHub. One of the things GitHub gives you is a notification when a PR gets created. Our idea was, we can use that notification, we can hook up to it. Then we can create what we call a database branch, which is basically a separate database, but with the same schema as the source branch, when a GitHub PR gets created. Then after creation, you can copy the data into it. Having this in place would give you the automation part of the workflow.
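The flow Pencea describes could be sketched, very roughly, as a small webhook service; the `create_database_branch` helper below is hypothetical and stands in for whatever branching API the database in question exposes, while the payload fields follow GitHub’s pull_request webhook event.

```python
# Sketch only: create_database_branch() is a hypothetical stand-in, not a real API.
from flask import Flask, request

app = Flask(__name__)

def create_database_branch(source_branch: str, new_branch: str) -> None:
    """Hypothetical helper: clone the schema of source_branch into new_branch,
    then copy (a subset of) its data, as described in the talk."""
    ...

@app.post("/github/webhook")
def on_pull_request():
    event = request.json
    # React only to newly opened pull requests.
    if event.get("action") == "opened" and "pull_request" in event:
        pr_branch = event["pull_request"]["head"]["ref"]
        create_database_branch(source_branch="main", new_branch=f"preview-{pr_branch}")
    return "", 204
```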
Let’s see how we could use the production data. We said we want to have a fast copy and also have it complete. I’ll explain what that means. Copying takes time. There is no way around it. You copy data, it takes time. You can hack it, though. You can have a preemptive copy: you copy the data before anyone needs it, so when they need it, it’s already there. Preemptive copying means I will just have a lot of datasets around, just in case somebody uses them, and then I have to keep everything in sync. That didn’t really fly with us. We can do Copy on Write, which basically means you copy at the last minute before data is actually used, so before that, all the pointers point to the old data. The problem with using Copy on Write for this specific case is that it did not leave us any way in which we could change the data to make it safe. If I Copy on Write, it’s basically the same data. I will just change the one that I’m editing, but the rest of it is the same one.
For instance, if I want to anonymize an email or something, I will not be able to do it with Copy on Write. Then, you have the boring option, which is, basically, you don’t copy all the data, you just copy a part of the data. This is what we went for, even though it was boring. Let’s see about the second thing. We wanted to have a complete dataset. I’ll go back a bit and consider the case of a relational database where you have links as a data type. Having a complete dataset means all the links will be resolved inside this dataset. If I copy all the data, that’s obviously a complete dataset, but if I copy a subset, there is no guarantee it will be complete unless I make it so. The problem with building a complete dataset by following the links is that it sounds like an NP-complete problem, and that’s because it is an NP-complete problem. If I want to copy a subset of a bigger dataset and find one of a certain size, I would actually need to find all the subsets that respect that rule, and then select the best one. That would mean a lot of time. In our case, we did not want the best dataset with exactly the size we have in mind. We were happy with something around that size. In that case, we can just go with the first dataset we can construct that satisfies this completeness, with all the links resolved, at roughly that size.
Data Copy (Deep Dive)
The problem with constructing this complete subset is: where do we start? How do we advance? How do we know we got to the end? The “where do we start” part is solvable if we think about the relationships between the tables as a graph, and then we apply a topological sort on it. We list the tables based on their degrees of independence. In this example, t7 is the most independent, then we have t1, t2, t3, and we can see that if we remove these two, the degrees of independence for t2 and t3 immediately increase because the links are gone. We have something like that. Then we go further up. Here we have the special case of a cycle, because you can point back with links to the same table that pointed to you. In this case, we can break the cycle, because going back we see the only way to reach this cycle is through t5.
Basically, we need to first reach t5 and then t6. This is what I call the anatomy of the schema. We can see this is the order in which we need to go through the tables when we collect records. In order to answer the other two questions, things get a bit more complicated, because the schema is not enough. The problem is that the schema tells you what’s possible, but it’s not mandatory unless you have a constraint. Usually, a link can also be empty. If you reach a point where you run into a link that points to nothing, that doesn’t mean you should stop. You need to go and exhaustively find the next potential record to add to the set. Basically, if you imagine it in 3D, you need to project this static analysis onto individual rows. The thing you cannot see through the static analysis from the beginning is that you can have several records from one table pointing to the same record in another table. The first one will take everything with it, and the second one will bring nothing.
Then you might be tempted to stop searching, because you think: I didn’t make any progress, so the set is complete. That is not true; you need to look exhaustively until the end of the set. These are just a few of the things that need to be kept in the back of your mind when building something like this. We need to always allow a full cycle before determining that no progress was made, and when we select the next record, we should consider that it might already have been brought into the set, and we shouldn’t necessarily stop at that point.
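As a sketch of where the graph for that topological sort can come from, assuming the links are modeled as foreign keys in PostgreSQL, the dependency edges can be read from the catalog; the sort itself, and the cycle breaking described above, would then run in application code over these edges.

```sql
-- List the link edges between tables: which table references which.
SELECT
    conrelid::regclass  AS referencing_table,  -- the table that holds the link
    confrelid::regclass AS referenced_table    -- the table it points to
FROM pg_constraint
WHERE contype = 'f';                           -- foreign-key constraints only
```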
We talked at the beginning about how we want to have this production data, but have it be safe to use. This is the last step, and we are still working on it; it is a bit more fluffy. The problem with masking the data is that for some fields you know exactly what they are. If it’s an email, then, sure, it’s private data. What if it’s free text, then what? If it’s free text, you don’t know what’s inside, so the assumption is it could be private data. The approach here was to provide several possibilities for how to redact data and allow the user to choose, because the user has the context and should be able to select based on the use case. The idea with having, for instance, a full redaction or a partial redaction is that, sure, you can apply it, but it will break your aggregations.
For instance, say I have an aggregation by username, like my Gmail address, and I want to know how many items are assigned to my email address. If I redact the username so it becomes something like **@gmail.com, then I get aggregations over any Gmail address that has items in my table. The most complete option would be a full transformation. The problem with a full transformation is that it takes up a lot of memory, because you need to keep the map between the initial item and the changed item. Depending on the use case, you might not need this, because it’s more complex to maintain. Of course, if there is a field that has sensitive data and you don’t need it for your specific test case, you can just remove it. The problem with removing a field is that it basically means you’re changing the schema, so you’re doing a migration, and that normally causes issues. In our case, we have a solution for migrations, so feel free to use it.
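As minimal sketches of the three options just described, with hypothetical table and column names (a test.users copy with an email column), partial redaction, full transformation with a kept mapping, and dropping the field could look roughly like this. The three blocks are alternatives, not steps to run in sequence.

```sql
-- Option 1: partial redaction. Keeps the domain, redacts the username.
-- Note that this breaks per-user aggregations, as described above.
UPDATE test.users
SET email = regexp_replace(email, '^[^@]+', '**');

-- Option 2: full transformation. Consistent pseudonyms, at the cost of
-- keeping a mapping table around.
CREATE TABLE test.email_map AS
SELECT email AS original,
       'user_' || row_number() OVER () || '@example.com' AS replacement
FROM (SELECT DISTINCT email FROM test.users) AS e;

UPDATE test.users u
SET email = m.replacement
FROM test.email_map m
WHERE u.email = m.original;

-- Option 3: remove the field entirely. Effectively a schema change, i.e. a migration.
ALTER TABLE test.users DROP COLUMN email;
```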
Zero Downtime Migrations
Ványi: In this section of the presentation, I would like to talk about, what do we mean by zero downtime. What challenges do we face when we are migrating the data layer? I will talk about the expand-contract pattern and how we implemented it in PostgreSQL. What do I mean when I say zero downtime? It sounds so nice. Obviously, downtime cannot be zero because of physics, but the user can perceive it as zero. They can usually tolerate around 20 milliseconds of latency. Here I talk about planned maintenance, not service outages. Unfortunately, we rarely have any control over service outages, but we can always plan for our application updates.
Challenges of Data Migrations
Let’s look at the challenges we might face during these data migrations. Some migrations unfortunately require locking. These can be table-level locks that block both reads and writes, meaning no one can access the table: they cannot read, they cannot write. For high-availability applications, that is unacceptable. Other migrations rely on less restrictive locks; those are a bit better and we can live with them, but it is still something we want to avoid. Also, when there is a data change, we obviously have to update the application as well, and the new instance has to run basically at the same time as the old application. This means that the database we are using has to be in two states at the same time. Because there are two application versions interacting with our database, we must make sure, for example, that if we introduce a new constraint, it is enforced both on the existing records and on the new data.
Based on these challenges, we can come up with a list of requirements. The database must serve both the old schema and the new schema to the application, because we are running the old application and the new application at the same time. Schema changes are not allowed to block database clients, meaning we cannot allow our applications to be blocked because someone is updating the schema. The data integrity must be preserved. For example, if we introduce a new data constraint, it must be enforced on the old records as well. When we have different schema versions live at the same time, they cannot interfere with each other. For example, when the old application is interacting with the database, we cannot yet enforce the new constraints, because it would break the old application. Finally, as we are interacting with two application versions at the same time, we must make sure that the data is still consistent.
Expand-Contract Pattern
The expand-contract pattern can help us with this. It can minimize downtime during these data migrations. It consists of three phases. The first phase is expand. This is the phase when we add new changes to our schema. We expand the schema. The next step is migrate. That is when we start our new application version. Maybe test it. Maybe we feel lucky, we don’t test it at all. At this point, we can also shut down the old application version. Finally, we contract. This is the third and last phase. We remove the unused and the old parts from the schema. This comes with several benefits.
In this case, the changes do not block the client applications, because we only ever add new things to the existing schema. The database has to be forward compatible, meaning it has to support the new application version, but at the same time it has to support the old application version, so the database is both forward and backward compatible with the application versions. Let’s look at a very simple example: renaming a column. We create the new column with the new name and copy over the contents of the old column. Then we migrate our application and delete the column with the old name. It’s very straightforward. We can deploy this change using, for example, blue-green deployments. Here, the old application is still live, interacting with our table through the old view. At the same time, we can deploy our new application version, which interacts with the same table through a new view. Then, once we see that everything is passing, we can shut down the old application and remove the old view, and everything works out fine.
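Here is a rough sketch of that rename, with hypothetical names (a users table whose name column becomes full_name); it is not any particular tool's exact mechanics, just the shape of the expand and contract steps around the two views.

```sql
-- Expand: add the new column and copy the existing data.
ALTER TABLE users ADD COLUMN full_name text;
UPDATE users SET full_name = name;

-- Expose one schema version per application version through views.
-- (In practice, triggers keep the two columns in sync while both apps are live.)
CREATE SCHEMA old_version;
CREATE VIEW old_version.users AS SELECT id, name FROM users;
CREATE SCHEMA new_version;
CREATE VIEW new_version.users AS SELECT id, full_name FROM users;

-- Contract: once the old application is shut down, drop the old pieces.
DROP SCHEMA old_version CASCADE;
ALTER TABLE users DROP COLUMN name;
```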
Implementation
Let’s see how we implemented it in Postgres. First, I would like to explain why we chose PostgreSQL in the first place. Postgres is well known and open source; it has been in development for 30 years now. Its DDL statements are transactional, meaning that if one of these statements fails, it can be rolled back easily. It mostly relies on row-level locking. Unfortunately, there are a few locks that block reads and writes, but we can usually work around those. For example, if you are adding a non-volatile default value, the table is not rewritten; instead, the value is added to the metadata of the table, and the old records are updated only when the whole record is updated. This doesn’t cover every case. Let’s look at the building blocks that Postgres provides. We are going to use three building blocks: DDL statements, obviously, to alter the schema.
Views, to expose the different schema versions to the different application versions. Triggers and functions, to migrate the old data and, on failure, to roll back the migrations. Let’s look at a slightly more complex example. We have an existing column, and we want to add a NOT NULL constraint to it. It seems simple, but it can be tricky, because Postgres does a table scan: it locks the table, so no one can read or update it while it goes through all of the records and checks whether any of the existing records violate the NOT NULL constraint. If it finds a record that violates this constraint, the statement returns an error. We can work around it: if we add NOT VALID to this constraint, the table scan is skipped. So we add the new column, set the NOT NULL constraint, and add NOT VALID to it, and we are not blocking the database clients.
We also create triggers that move the old values from the old column. It is possible that some of the old records don’t yet have values, and in that case we need to add a default value or whatever backfill value we want; then we migrate our app. We need to complete the migration, obviously: we clean up the trigger, the view we added so the applications could interact with the table, and the old column. We must also remember to remove NOT VALID from the original constraint. We can do that because the migration moved the old values over, all of the new values are there, and every record now satisfies the constraint.
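One common way to get this effect in plain SQL, sketched here with the same hypothetical users/description names, is to express the requirement as a CHECK constraint added with NOT VALID, backfill, and then validate; validation still scans the table, but under a lock that lets reads and writes continue.

```sql
-- Add the constraint without the blocking full-table scan.
ALTER TABLE users
    ADD CONSTRAINT users_description_not_null
    CHECK (description IS NOT NULL) NOT VALID;

-- Backfill old rows that would violate it (batched in practice).
UPDATE users
SET description = 'description for ' || name
WHERE description IS NULL;

-- Validate: scans the table, but does not block concurrent reads and writes.
ALTER TABLE users VALIDATE CONSTRAINT users_description_not_null;
```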
Doing all of this by hand every time seemed quite tedious, and that’s why we created pgroll. It’s an open-source command line tool written in Go, so you can also use it as a library. It manages safe and reversible migrations using the expand-contract pattern. I would like to walk you through how to use it. pgroll runs against a Postgres instance, so you need one running somewhere. After you have installed and initialized it, you can start creating your migrations, which you define in JSON files; I will show you an example. Once you have your migration, you run the start command. pgroll creates a new view, and you can interact with it through your new application and test it. Then you can shut down your old application and run the complete command, and pgroll removes all of the leftover views and triggers for you. This is the JSON example I was just talking about.
Let’s say we have a users table that has an ID field, a name field, and a description, and we want to make sure the description is always there, so we put a NOT NULL constraint on it. In this case, you have to define a name for the migration; it will become the name of the view, or rather the schema, in Postgres. Then we define a list of operations. We are altering a column. The table is named users, the field is description, and we no longer allow null values in the column. This is the interesting part: the up migration. It contains what to do when we are migrating the old values. In this case, it means that if the description is missing, we add the text “description for” and insert the name; or, if the data is there, we just move it to the new column. The down migration defines what to do when there is an error and we want to roll back. In this case, we keep the old value, meaning that if the value was missing, it stays null, and if there was something, we keep it there.
Here is the start command. Let’s see in psql what just happened. We have a users table with these three columns, but you can see that pgroll added a new column; remember, there is a migration ongoing right now. In the old description column, there are records that do not yet satisfy the constraint. In the new description column, the backfill value is already there for us to use. We can inspect what schemas are in the database. We notice there is create_users_table, which is the old schema version, and the new one is user_description_set_nullable, which is the name of the migration we just provided in our JSON. Let’s try to insert some values into this table. We are inserting two records. The first one shows how the new application version behaves: the description is not empty. With the second record, we are mimicking what the old application does: here the description is NULL. Let’s say both inserts succeed. Now we can try to query this table.
From the old app’s point of view, we can set the search path to the old schema version and run the following query, so we can inspect what happened after we inserted these values. This is what we get back: the description for Alice is “this is Alice”, and for Bob it is NULL, because the old application doesn’t enforce the constraint. Let’s change the search path to the new schema version and perform the same query. Here we can see the description for Alice, and notice that Bob has a description too: it is the default description from the up migration we provided in the JSON file. Then we can complete the migration using the complete command, and we can see that the old schema is cleaned up. The intermediary column is removed as well, along with the triggers and functions. Check out pgroll. It’s open source. It takes care of mostly everything: there is no need to manually create new views, functions, or columns. After you complete your migrations, it cleans up after itself. It is still under development, so there are a few missing features; for example, some migration types are missing. We do not yet support adding comments, unfortunately, or batched migrations.
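A minimal sketch of that version switch in psql, assuming the schema names mentioned above (the exact names generated for your migrations may differ):

```sql
-- The old application's view of the world.
SET search_path TO create_users_table;
SELECT id, name, description FROM users;  -- Bob's description is NULL here

-- The new application's view of the world.
SET search_path TO user_description_set_nullable;
SELECT id, name, description FROM users;  -- Bob gets the backfilled description
```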
Takeaways
Pencea: What we presented so far were bits and pieces of the puzzle we want to build: the CI/CD data pipeline. What we imagined when we created this was: somebody creates a PR; then a test environment, with a test database holding valid data that is also safe to use, gets created for them; then the tests are run, everything is validated, and the PR is merged. Then it goes through the pipeline, and nobody has to worry about migrations, because we can make the changes safely.
Ványi: The migrations are done without downtime. If your pull request is merged, it goes through the testing pipeline, and if everything passes, that’s great: we can clean up after ourselves and remove the old schema. If there is a test failure or something is not working correctly, we can roll back anytime, because the old schema is still kept around just in case. As we just demonstrated, there is still some work left for us, but we already have some building blocks that you can integrate into your CI/CD pipeline. You can create a test database on the fly using GitHub notifications, and fill it with safe and relevant data to test with. You can make schema changes and merge them back and forth without worrying about data migrations. You can deploy and roll back your application without any downtime.
Questions and Answers
Participant 1: Does pgroll take care of keeping the metadata of every migration done: is started, ongoing, finished?
Ványi: Yes, pgroll keeps its own migrations table. You can also store your migration files in Git if you want to version them, but pgroll has its own bookkeeping for past migrations.
Participant 2: For the copying of the data from production, was that for local tests, local dev, or the dev? How did you control costs around constantly copying that data, standing up databases, and tearing them back down?
Pencea: It’s usually for something that sits in the cloud, so not for the local dev.
Participant 2: How did you control cost if you’re constantly standing up a near production size database?
Pencea: What we use internally is data branching. We don’t start a new instance every time; we have a separate schema inside a bigger database. Also, what we offer right now is copying 10k records, which is not much in terms of storage. We figured it should be enough for testing purposes.
Participant 3: I saw in your JSON file that you can do migrations that pgroll knows about, like setting nullable to false. Can you also do pure SQL migrations?
Ványi: Yes. We don’t yet support every migration type. If there is anything missing, you can always work around it by using raw SQL migrations. In that case, you can shoot yourself in the foot, because, for example, in the case of NOT NULL, we take care of skipping the table scan for you. When you are writing your own raw SQL migration, you have to be careful not to block your table and the database access.
Participant 4: It’s always amazed me that these databases don’t do safer actions for these very common use cases. Have you ever talked to the Postgres project on improving the actual experience of just adding a new column, or something? It should be pretty simple.
Ványi: We’ve been trying to have conversations about it, but it is a very mature project, and it is somewhat hard to change such a fundamental part of this database. Constraints are like the basic building block for Postgres, and it’s not as easy to just make it more safe. There is always some story behind it.
Pencea: I think developer experience was not necessarily something that people were concerned about, up until recently. I feel like sometimes it was actually the opposite, if it was harder, you looked cooler, or you looked like a hacker. It wasn’t exactly something that people would optimize for. I think it’s something that everybody should work towards, because now everybody has an ergonomic chair or something, and nobody questions that, but we should work towards the same thing about developer experience, because it’s ergonomics in the end.
Participant 5: In a company adopting pgroll, all these scripts can grow in number, so at some point you have to apply all of them, I suppose, in order. Is there any sequence number or other indication of how to apply them? Some of them might be serial, some of them can be parallelized. Is there any plan to give direction on the execution? I’ve seen there is a number in the script file name: are you following that as a sequence number, or, when you develop your batching feature, will you add a sequence number inside?
Ványi: Do we follow some sequence number when we are running migrations?
Yes and no. pgroll maintains its own table for bookkeeping, where it knows what the last migration was and what comes next. The number in the file name is not only for pgroll, but also for us.
Participant 6: When you have very breaking migrations using pgroll, let’s say you need to rename a column or even change its type, you basically replicate into a new column and then copy over the data. How do you deal with very large tables, say, millions of rows? You could end up having performance issues copying these large amounts of data.
Ványi: How do we deal with tables that are basically big? How do we make sure that it doesn’t impact the performance of the database?
For example, in the case of moving the values to the new column, we create triggers and move the data in batches. It’s not like everything is copied in one go and you cannot really use your Postgres database because it is busy copying the old data. We try to minimize and distribute the load on the database.
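A rough sketch of what one such batched backfill step can look like, with a hypothetical users table and batch size; the statement is repeated until it updates zero rows, so the table is never tied up by one long copy.

```sql
UPDATE users
SET description = 'description for ' || name
WHERE id IN (
    SELECT id
    FROM users
    WHERE description IS NULL
    ORDER BY id
    LIMIT 1000  -- hypothetical batch size
);
```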
Participant 7: I know you were using the small batches to copy the records from the existing column to the new column. Once you copy all the records, only then you will remove the old column. There is a cost with that.
![MMS Founder](https://mobilemonitoringsolutions.com/wp-content/uploads/2019/04/by-RSS-Image@2x.png)
MMS • Ben Linders
Article originally posted on InfoQ. Visit InfoQ
![](https://mobilemonitoringsolutions.com/wp-content/uploads/2024/11/ben-1723612214361-1730969232188.jpg)
DORA can help to drive sustainable change, depending on how it is used by teams and the way it is supported in a company. According to Carlo Beschi, getting good data for the DORA keys can be challenging. Teams can use DORA reports for continuous improvement by analysing the data and taking actions.
Carlo Beschi spoke about using DORA for sustainable improvement at Agile Cambridge.
Doing DORA surveys in your company can help you reflect on how you are doing software delivery and operation as Beschi explained in Experiences from Doing DORA Surveys Internally in Software Companies. The way you design and run the surveys, and how you analyze the results, largely impact the benefits that you can get out of them.
Treatwell’s first DORA implementation in 2020 focused on getting DORA metrics from the tools. They set up a team that sits between their Platform Engineering team and their “delivery teams” (aka product teams, aka stream-aligned teams), called CDA – Continuous Delivery Acceleration team. Half of their time is invested in making other developers’ and teams’ lives better, and the other half is about getting DORA metrics from the tools:
We get halfway there, as we manage to get deployment frequency and lead time for changes for almost all of our services running in production, and when the team starts to dig into “change failure rate”, Covid kicks in and the company is sold.
DORA can help to drive sustainable change, but it depends on the people who lead and contribute to it, and how they approach it, as Beschi learned. DORA is just a tool, a framework, that you can use to:
- Assess your teams and organisation in a lightweight way
- Play back the results, inspire reflection and action
- Check again a few months / one year later, maybe with the same assessment, to see if / how much “the needle has moved”
Beschi mentioned that teams use the DORA reports as part of their continuous improvement. The debrief about the report is not too different from a team retrospective, one that brings in this perspective and information, and from which the team defines a set of actions, that are then listed, prioritised, and executed.
He has seen benefits from using DORA in terms of aligning people on “this is what building and running good software nowadays looks like”, and “this is the way the best in the industry work, and a standard we aim for”. Beschi suggested focusing the conversation on the capabilities, much more than on the DORA measures:
I’ve had some good conversations, in small groups and teams, starting from the DORA definition of a capability. The sense of “industry standard” helped move away from “I think this” and “you think that”.
Beschi mentioned the advice and recommendations from the DORA community on “let the teams decide, let the teams pick, let the teams define their own ambition and pace, in terms of improvement”. This helps in keeping the change sustainable, he stated.
When it comes to meeting the expectations of senior stakeholders: when your CTO is the sponsor of a DORA initiative, there might be “pushback” on teams making decisions, along with expectations regarding the “return on investment” of doing the survey, aiming to have more things change, quicker, Beschi added.
A proper implementation of DORA is far from trivial, Beschi argued. The most effective ones rely on a combination of data gathered automatically from your system alongside qualitative data gathered by surveying (in a scientific way) your developers. Getting good data quickly from the systems is easier said than done.
When it comes to getting data from your systems for the four DORA keys, while there has been some good progress in the tooling available (both open and commercial) it still requires effort to integrate any of them in your own ecosystem. The quality of your data is critical.
Start-ups and scale-ups are not necessarily very disciplined when it comes to consistent usage of their incident management processes, and this greatly impacts the accuracy of your “change failure rate” and “response time” measures, Beschi mentioned.
Beschi mentioned several resources for companies that are interested in using DORA:
- The DORA website, where you can self-serve all DORA key assets and find the State of DevOps reports
- The DORA community has a mailing list and bi-weekly video calls
- The Accelerate book
In the community you will find a group of passionate and experienced practitioners, very open, sharing their stories “from the trenches” and very willing to onboard others, Beschi concluded.
.NET Aspire 9.0 Now Generally Available: Enhanced AWS & Azure Integration and More Improvements
![MMS Founder](https://mobilemonitoringsolutions.com/wp-content/uploads/2019/04/by-RSS-Image@2x.png)
MMS • Robert Krzaczynski
Article originally posted on InfoQ. Visit InfoQ
![](https://mobilemonitoringsolutions.com/wp-content/uploads/2024/11/generatedHeaderImage-1731520003662.jpg)
.NET Aspire 9.0 is now generally available, following the earlier release of version 9.0 Release Candidate 1 (RC1). This release brings several features aimed at improving cloud-native application development on both AWS and Azure. It supports .NET 8 (LTS) and .NET 9 (STS).
A key update in Aspire 9.0 is the integration of AWS CDK, enabling developers to define and manage AWS resources such as DynamoDB tables, S3 buckets, and Cognito user pools directly within their Aspire projects. This integration simplifies the process of provisioning cloud resources by embedding infrastructure as code into the same environment used for developing the application itself. These resources are automatically deployed to an AWS account, and the references are included seamlessly within the application.
Azure integration has been upgraded in Aspire 9.0. It now offers preview support for Azure Functions, making it easier for developers to build serverless applications. Additionally, there are more configuration options for Azure Container Apps, giving developers better control over their cloud resources. Aspire 9.0 also introduces Microsoft Entra ID for authentication in Azure PostgreSQL and Azure Redis, boosting security and simplifying identity management.
In addition to cloud integrations, Aspire 9.0 introduces a self-contained SDK that eliminates the need for additional .NET workloads during project setup. This change addresses the issues faced by developers in previous versions, where managing different .NET versions could lead to conflicts or versioning problems.
Aspire Dashboard also receives several improvements in this release. It is now fully mobile-responsive, allowing users to manage their resources on various devices. Features like starting, stopping, and restarting individual resources are now available, giving developers finer control over their applications without restarting the entire environment. The dashboard provides better insights into the health of resources, including improved health check functionality that helps monitor application stability.
Furthermore, telemetry and monitoring have been enhanced with expanded filtering options and multi-instance tracking, enabling better debugging in complex application environments. The new support for OpenTelemetry Protocol also allows developers to collect both client-side and server-side telemetry data for more comprehensive performance monitoring.
Lastly, resource orchestration has been improved with new methods like WaitFor and WaitForCompletion, which help manage resource dependencies by ensuring that services are fully initialized before dependent services are started. This is useful for applications with intricate dependencies, ensuring smoother deployments and more reliable application performance.
Community feedback highlights how much Aspire’s development experience has been appreciated. One Reddit user noted:
It is super convenient, and I am a big fan of Aspire and how far it has come in such a short time.
Full release details and upgrade instructions are available in the .NET Aspire documentation.
![MMS Founder](https://mobilemonitoringsolutions.com/wp-content/uploads/2019/04/by-RSS-Image@2x.png)
MMS • Tiago Bento
Article originally posted on InfoQ. Visit InfoQ
![](https://mobilemonitoringsolutions.com/wp-content/uploads/2024/11/TiagoBento-medium-1727159163147.jpeg)
Transcript
Bento: I think the first thing most people think when talking about monorepos is the opposite, which is the poly-repo setting. I’m going to start comparing both approaches. Here you can see some setups that you may recognize from diagrams that you may have in your company, in your organization. The white boxes there are the representation of a repo, and the blue cubes are representing artifacts that these repos produce. It can be like Docker images, or binaries, JARs, anything that you get out of the code that you have can be understood as the blue cube.
The first example here, you have a bunch of repos that have a very clearly defined dependency relationship, and they all together work to produce a single artifact. This is, I think, very common, where people have multiple repos to build a monolith. No mystery there. Then we move on to the next one, where we have a single repo, but we also produce a single artifact. That already has a question mark here from my side, because, is that a monorepo or not? Then, moving on, we have a more complicated setting, where you have multiple repos that have more dependency relationship. You have code sharing between them.
In the end, the combination is three artifacts. That, to me, is very clearly a poly-repo setting. The last one is what to me is also very clearly a monorepo setting: you have a single repo, and after a build, you have multiple artifacts. Then there are other, complicated cases where it’s not so obvious whether it’s a monorepo or a poly-repo setting. For example, here you have repos that in a straight line produce a single artifact, and they are not connected with each other in any way. Is that a poly-repo, if you don’t have sharing of code between the multiple artifacts that, in the end, get produced? Conversely, if you have a single repo and the internal modules do not share any code, it’s the analogue of this situation. Can you say that this is a monorepo, or is it just a big repo?
Then, most of us work in a setup that is very similar to this, where you have everything mixed together and you cannot tell whether you’re using poly-repo or monorepo, or both, or neither. A simple way to understand them, in my view: what defines whether a set of repos that produce artifacts is a poly-repo or a monorepo setting is the way you share code between the pieces you need in order to produce the artifacts. When you add the arrows that represent the dependency relationships between the repos, making the artifacts share code, you can very clearly say that this is a poly-repo setting. The same is true for a monorepo setting where you now have code sharing.
Monorepos (Overview)
To give a definition of what a monorepo can be understood as, in this presentation and in my view: monolithic applications do not always come from monorepos. Also, putting code together in the same repo does not by itself characterize a monorepo, because if you don’t have the sharing relationship, it’s just a bunch of disjoint modules that produce artifacts from the same place. Of course, to have a monorepo, we have to have a very well-defined relationship between the modules, so you can very distinctly draw the diagram that represents your dependency relationships.
Also very important is that the modules in a monorepo are part of the same build system. Most of us, I think, come from a Java background or a Go background here, and you’re very familiar with Maven builds: we have a lot of modules that are part of the same reactor build. This produces a single artifact in the end, or multiple smaller artifacts that may or may not be important to you as far as publishing to a registry goes, or for third parties, or even first parties, to consume. Then, in the blue box are things that are not very well defined when talking about monorepos, but I wanted to give you my opinion and what I understand as a monorepo.
Monorepos are not necessarily big in the sense of, there’s a lot of code there, a mess everywhere. Nobody knows what depends on what. No, that’s not the case. Also, for me, monorepos are repos that contain code that produce more than one artifact that is interesting to you after the build. If you’re publishing, for example, two Docker images from one single repo, in my view, that’s a monorepo already. Of course, the code sharing. You need to have ability for a single module in your monorepo to be reused by final artifacts that you have. That’s the definition. We’ll build from there.
Right off the bat, you can make this claim. You can tell me, yes, but nobody has a monorepo, everyone has multiple repos. That is true. It’s very rare to see a company that has everything inside the same repository. That’s not the case at all. When we talk about monorepos, we actually are not talking about putting everything inside the same repo. I’m going to try to explain the way I understand that and how you can benefit from it. Conversely, using the definition that I just gave, you can also say that everyone has a monorepo, because if you think about a repository that has many libraries in it that get published, you have one single repository that publishes multiple artifacts that are interesting after the build. If you have a multi-module Go repo, or if you have a multi-module Maven repo, for me, you have a monorepo.
This session will try to answer this question, should you have a monorepo? The way I’ll try to do that is talking a little bit about code-based structure, and how teams operate together, how people do software, and how code reuse is understood and implemented in the industry. In the end, I hope that by looking at the examples and my personal experience that I’m going to share with you, you’ll be able to make a more informed decision of whether or not you can benefit from a monorepo or if you stick to a more traditional poly-repo setting.
Example 1 – Microservice E-commerce
Let’s see an example where our code base is composed of these repos. We have five microservices at the top, followed by two Backend for Frontends repos. Then we have two frontend applications. Then we have five repos that contain common code. In the end, we have the end-to-end tests repository. Everyone can relate in some way to this example, although it is oversimplified. The experience I had multiple times is very closely related to this, where you present something very simple, and all of a sudden everyone has an opinion of what should be done, of what’s a better way to do that, or even like personal taste: I like it, I don’t like it, whatever.
The first thing I wanted to say is: if you’re in this situation, chances are you’re not ready to have a conversation about whether to pursue a poly-repo or a monorepo setting, because you have very fragmented opinions, and you don’t have people going in the same direction, whether they agree or not. My advice would be to get everyone moving in the same way, so that everyone is exposed to the same problems and understands what comes after them, whether that’s library building or the ops team deploying heavy Docker images. That is the way that this conversation starts. This conversation is very slow, very lengthy, and very complex.
It’s not something you can decide in a one-hour meeting when you put all the stakeholders together, or the architects, or whatever, and then you say, ok, voting, monorepo or not? It’s not going to go like that. Very important, people, teams, and your uniqueness, the uniqueness of your operating dynamics, come before choosing what to do. Then, even if you do decide what to do, having everyone align and understanding why you’re doing something. Put that in writing. Document it. Make presentations. Make people watch them. Whatever works for you comes before actually pursuing that change, implementing monorepo or poly-repo.
Let’s continue with our little e-commerce setup with the repos I showed before. Let’s break it down in teams. We can look at the repos, and we can see some affinity between them and the way that I as the architect or CTO or whatever, understand how people should be working together. This is valid. It’s something that someone is doing somewhere. Also, you could have something like this, where we have different teams, we have different names, we have different operating dynamics. This is very unique to your organization. Nobody is doing software exactly the same way. We’re sure, borrowing from colleagues and people from other companies, tech presentations, but in reality, the operations is very distinct in wherever you look. This is also valid.
Interesting, we have this now, where three teams are collaborating on the same repo. Also, here, we have all teams collaborating on the end-to-end tests. That’s pretty normal. Everyone owns the tests, so nobody does. Also, this is a very valid possibility where you have a big code base, you have your repos there. You have two main teams doing stuff everywhere, helping everybody out. Then you have a very focused team here doing only security, so they only care about security. You have another very focused team that only cares about tests. This is also valid. Why am I saying all this? Because you as individuals, as decision makers, as multipliers, you know how your team operates much better than me or anyone else that gives advice on the internet, or whatever.
By looking at your particular case and evaluating both techniques, both monorepo and poly-repo, you can see where you could benefit from each one of them. You don’t have to choose. You can pick monorepo for a part of your org, or poly-repo for another part.
So far, list of repos. Everybody had opinions. We sorted that out. Now we have our little team separation here. We know that this is important. We didn’t talk about the dependency relationship between the repos. It’s somewhat obvious from the names, but even so, people can understand them differently. This is a possibility. I know there’s a lot of arrows, but yes. I have the microservices there depending on a lot of common library repos here. Then they get used by the end-to-end tests.
Then you have the frontends, which is like separate, and they have their own code sharing with the design system, or very nasty selects that you implemented. This can be something that someone draws in your company by looking at the repos. Or, you can see someone understand it like this, where the BFFs actually depend on the user service as a hard dependency, and not just like API dependency or something. This is a valid possibility too. More than that, these are only the internal arrows that you would have if you would draw this diagram for your company in this example. There’s a lot of other arrows, which are third-party dependencies that every repo is depending on. You cannot build software without depending on them.
To make things more complicated, this is also something that many of us are doing. We’re versioning our own libraries and publishing them so that they can be consumed by our services. You can see this becomes very complicated very quickly, because it’s hard to tell what libraries are being used and what are not, and what versions of services are using what versions of libraries. It can be that you see this and you solve it like this.
Every time a library publishes a release, we have to update all the services to have everyone on the same page using the same version, no dependency conflicts. Everything works. Or, you can go one step further and you say, every time we make a change on a repo, this gets automatically published to the services. They rebuild, redeploy, do everything. In this case, why do we have repos then? We could put everything inside the same repo and call them modules. It’s a possibility.
I wanted to go through this example to highlight that making software inherently has these problems. If you haven’t, go watch this talk, “Dependency Hell, Monorepos and beyond”. It’s 7 years old, but it’s still current. It’s very educational to understand how software gets built and published. Very great material. We can all identify with these problems here. At some point, we had multiple releases just to consume a tiny library, or we had one person depending on a version of a library, and then this version has a security vulnerability, so you have to run and update everything.
That’s one thing I wanted to say, like making software is hard, even if you were in complete isolation, just you and your text editor, pushing characters there is already difficult. Then, if you start pointing to third-party dependencies, it gets more difficult, because now we have to manage the complexity of upgrading them and tracing them and see if they are reliable or not. Then, if you want to make your software be reusable by others, then that’s even harder, because now you have to care about who’s using your software. If you’re doing all at the same time, which is everyone, then this is the hardest thing ever. It’s really complicated.
Example 2 – Upstream vs. Downstream
Taking a step back, back to the monorepo and poly-repo conversation. How can we define them, after all this? Poly-repo is usually understood as very small repos that contain software for a very specific purpose, and they produce a single artifact, or like very few. Every time you publish something from a repo on a poly-repo setting, you don’t care about who’s using it. It’s their responsibility to get your new version and upgrade their code.
On a monorepo, you have multiple modules that can or cannot be related. The way you reuse code internally is by just pointing to a local artifact. On purpose I put that line there saying that builds can be fast on a monorepo, because if you build it right you can always filter the monorepo and build exactly the part that you want without having to build everything else that’s in there. That’s how you scale a monorepo.
I think we can move to the second example where we have a better understanding of the way I’m presenting poly-repo versus monorepo. This next example has to do with the concepts of upstream and downstream. Many of you may be familiar with it already, especially if you’re in a poly-repo setting. That’s repos in purple, artifacts in blue. Let’s imagine you’re there all the way to the right. What you call upstream is whatever comes before you in the build pipeline. Libraries you use, services you consume, that’s upstream. Everything that depends on you, your libraries, your Docker images, your services, that’s downstream.
In this example, we have one repo that’s upstream and one repo that’s downstream. Pretty easy. If you’re someone making changes to this repo, now your downstream looks like this, which is a lot of stuff. In this cut of the code base structure, that’s the entirety of your publishable artifacts. Changing something in that particular repository has implications in all artifacts you produce. Let’s go through the most common setup to share code, which is publishing libraries using semantic versioning. You can have this setup and you make a change of the very first one there where the red arrow is. What happens is you make a change, then you publish a new version.
Then, people have to upgrade their dependencies to use that new version. If you want to continue that and release the new artifacts using that new version of that library, then you have to continue that chaining, and you have a new version of this repo, and you publish a new version of this one. Then they update the version that they’re consuming. Then you publish a new version of those repos that represent the artifacts, and then you push the artifacts somewhere. This is how many of us are doing things. Depending on who’s doing what, this may be very wasteful. If there’s the same team managing all these repos, why are they making all those version publications, if their only purpose was to update the artifacts in the first place?
We can argue then: let’s use a monorepo, put every component there as a module inside the same repo under the same build system, and then I make a change and I release. That’s not all good news, because in the last setup, the team making changes did their change, published their versions, and moved on to the next task; then the other teams started upgrading their code. Now you can have something like this, where you have a very big change to make, and you have to make this change all at once, which can take weeks, especially if you’re doing a major version bump or something. You make the changes and then you can release.
This is the part where I take the responsibility out of me a little bit and hand it over to you, because you know what you’re doing, what your code looks like, what your teams prefer, and everything related to the operating dynamics of your code base. If you look at all this and you say, who’s responsible for updating downstream code? Each one of us will have a different answer. That answer has impact on whether or not you should be using a monorepo. Another question would be, how do you prefer to introduce changes? Where on a poly-repo, you do it incrementally. A team updates a library. This library gets published, and slowly everyone adopts it. Whereas on a monorepo, the team that’s changing the library may have to ask for help updating downstream code because they may not have the knowledge, so it may take longer.
Once you do it, you do it for everyone, so you know that the new artifacts are deployed aligned. This is my take. I don’t think poly-repo and monorepo are either/or. I think they’re complementary. It depends on where you are and what you’re doing and who you’re doing it with. They’re both techniques that can coexist in the same code base, and the way you structure it has to be very tightly coupled with the way you operate it, and the skill set that people have, and everything related to your day-to-day. Some parts of your organization are better understood and operated as a detached unit.
For example, if you have a framework team, or you’re doing a CLI team, where you have a very distinct workflow, they need to be independent to innovate and publish new versions, and you don’t need to force everyone to use the latest version of this tool. That happens. That exists. It’s a valid possibility. The other part is true as well. Some parts of your organization are better understood and operated as a unified conglomerate, where alignment and synchronicity are mandatory, even for success. For me, monorepos or poly-repos is not a question of one or the other. It’s a question of where and when you should be using each one of those.
Apache KIE Tools
I wanted to talk a little bit more about my experience working on a monorepo. We have a fairly big one. That’s the project I work on every day. It’s called Apache KIE Tools. It’s specialized tools for business automation, authoring and running and monitoring and everything. Some stats, our build system is pnpm. We are almost 5 years old. We have 200 packages, give or take. Each package is understood like a little repo. Each package has a package.json file, borrowed from the JavaScript ecosystem, that contains a script that defines how to build it. This package.json file also defines the relationship between the modules themselves.
It’s really easy for me, with a simple command, to say I want to build this section of the monorepo, and to select exactly what part of the tree I want to build. We have almost 50 artifacts coming from that monorepo, ranging from Docker images to VS Code extensions and Maven modules and Maven applications, and examples that get published, and everything. The way we put everyone under the same build system was by using standardized script names. Each package that has a build step has two commands, called build:dev and build:prod, and packages that can be developed standalone have a start command.
Then, all the configuration is done through environment variables, borrowing from the Twelve-Factor App manifesto. We have built an internal tool that manages this very big set of environment variables, to configure things like logo paths, whether or not to turn on the optimizer or minifier, or whether to run tests. Everything that we do is through environment variables. Every time we need to reference another package, for example when I’m building a binary that puts together a bunch of libraries, we do that through the Node modules, also borrowed from the JavaScript ecosystem.
Through symbolic links and the definitions we have in package.json, we can safely reference only the things that we declare as a dependency. We avoid the problem where we just go up a few directories and reach into another package without declaring it as a dependency. This is something we did to prevent ourselves from making mistakes, and to avoid forgetting to build something during builds that select only a part of the monorepo.
Then, one thing that we have that’s very helpful is the ability to partially build the monorepo in PR checks. Depending on the files you changed, we have scripts that figure out what packages need to be rebuilt, what packages need to be retested, and things like that. The stats there are, for a run that changed only a few files in a particular module, the slowest partition built in 16 minutes, and all the other ones were very fast. For a full build, you can compare the times. Having the ability to split your builds into partitions and sections of the tree is very important if you want a monorepo that stays fast and scales well.
Also, to make things more complicated, we have many languages in there. We are a polyglot monorepo. We have Java libs and apps building with Maven. We have TypeScript, same thing. We have Golang. We have a lot of container images as well. Another thing, and that’s all optimization, is that we have sparse checkout ability, so you can clone the repo and select only a portion of it that you want. Even if it gets really big, you can say on the git clone that you only want these packages, and they will be downloaded, and everything else is going to be ignored. The build system will continue working normally for you.
Apache KIE Tools (Challenges)
Of course, not everything is good. We have some challenges and some things that we are doing right now. One of those things is that we’re missing a user manual. A lot of the knowledge we have is on people’s heads and private messages on Slack, or Zulip chat for open source, and that’s not very good. We’re writing a user manual with all the conventions that we have, the reasoning behind the architecture of the repo and everything. Then, we’re also improving the development experience for Maven based packages, especially with importing them in IDEs and making sure that all the references are picked up, and things become red when they’re wrong, and things like that.
Then we have a very annoying problem, which is, if you change the lockfile, the top-level one, our partitioning system doesn’t understand which modules are affected. We have a fix for it, and we’re researching how to roll it out. I’m glad that we found a solution there. If that works the way it should, we’re never going to have a full build whenever our code changes, unless it’s a very root-level file, like the top-level package.json or something. Then we’re also pending trying a merge queue. A merge queue is when you press the merge button and the code doesn’t go instantly to the target branch; it goes to a queue where merges are simulated, and when a check passes, it can merge automatically. You can take things out of the queue if they’re going to break your main branch.
That’s a very cool thing to prevent semantic conflicts from happening, especially when they break tests or something. We’re pending trying that. Also, we’re pending having multiple cores available for each package to build. We can do parallel builds, but we don’t have a way to say, during your build, you can use this many cores. We don’t have that. We’re probably going to use an environment variable for that. The next one is related to this. You saw we have two commands to build, build:dev and build:prod, and sometimes there’s duplication in these commands, and it’s very annoying to maintain. One thing we’re researching is how we can use the environment variables to configure the parameters that distinguish a prod build from a dev build.
For example, on webpack, you can set the mode, and it will optimize your build or not. The last one, which I think is the most exciting, is taking advantage of turborepo, a task runner that also understands package.json files. It has a very nice ability, which is caching, so you can, in theory, download our monorepo and start an app without building anything. You can see how powerful that is for development and for welcoming new people into a code base. Of course, if you look at a poly-repo, you don’t have the caching problem because you’re publishing everything, so you don’t have to build it again: tradeoffs.
Using a Monorepo (Yourself)
I wanted to close with some advice of, how can you build a monorepo yourself, or even improve existing monorepos that you might have? The first one is, if you’re starting a monorepo now, if you think this is for you, if you’re doing research, if you’re doing a POC or something, then don’t start big. Don’t try to hug every part of the code base. Start small. Pick a few languages or one. Choose one build tool, and go from there. You’re going to make mistakes. You’re going to learn from them. You’re going to incorporate the way your organization works. You’re going to have feedback. Start small. Don’t plan for the whole thing.
Then, the third bullet point there is, choose some defaults. Conventionalize from the beginning: this is the way we do it. It doesn’t matter if it’s good or bad, you don’t know; you’re just starting. The important thing is that everyone is doing the same thing, and everyone is exposed to the exact same environment so they can feel the same struggles, if they happen. Chances are, there will be some. Then, fourth is, make the relationships between the modules easy to visualize. It’s really easy to get lost when you have a monorepo, because you have very small modules, and if you don’t plan them accordingly, you’ll have everyone depending on everything. This is not good. That also happens with a poly-repo.
Number five, be prepared to write some custom tools. Your build necessities are very unique too. Maybe your company has a weird setup because of network issues, or maybe you’re doing code in a very old platform that needs a very special tool to build, and you need to fetch it from somewhere and use an API key. Be prepared to write custom scripts that look like they are already made for you, tailored for your needs. This is something that’s really valuable. Sixth, be prepared to talk about the monorepo a lot, because this is a controversial topic, and people will have a lot of opinions, and you will have to explain why you’re doing this all over again multiple times.
Then, number seven is, optimize for development. That comes from my personal experience. It is much nicer for people to clone and start working right away, rather than having a massive configuration step. Our monorepo has everything turned off by default, everything targeting development, localhost all the way, no production names, no production references, nothing. Everything is made to be run locally without dependencies on anything. Then some don’ts, which are equally important, in my opinion. Don’t group by technology.
Don’t look at your code base and think, yes, let’s get everything that is using Rust and put on the same monorepo. Because, yes, everything’s Rust. We have a build system built for it. No. Group by operating dynamics, by affinity of teams. Talk to people, see how they feel about interacting with other teams more often. Maybe they don’t like it, and maybe this is not for them, or maybe you’re putting two parts of your code base that are using the same technology but have nothing to do with each other. Don’t do that. Don’t group by technology, group by team affinity, by the things that you want to build and the way people are already operating.
Then, number nine, don’t compromise on quality. Be thorough about the decisions you make and why you’re making them, because otherwise you can become a big ball of mud very quickly. Say no. If some people want to put a bunch of code in your monorepo just to solve an immediate problem that they might be having, think about it. Structure it. Plan. Make POCs. Simulate what’s going to be like your day-to-day if the code was there. Have patience. Number 10 is, don’t do too much right away. I mentioned many things like partial build, sparse checkouts, caching, and unified configuration mechanism. You don’t have to have all these things to have a monorepo. Maybe your monorepo is small in the beginning, and it’s fine if you have to build everything every time, maybe.
Eleventh, I think, is the most important one, don’t be afraid if the monorepo doesn’t work out for you. Reevaluate, incorporate feedback. Learn from your mistakes, and hear people out. Because the goal of all this is to extract the most out of the people’s time. There’s no reason for you to put code in a monorepo if this is going to make people’s life harder. That’s true for the opposite direction too. There’s no reason to split just because it’s more beautiful.
Questions and Answers
Participant 1: Do you have any preference on the build tools? We have suffered a lot by choosing one build tool and then quickly getting lost from there.
Bento: You mean like choosing Maven over Bazel or Gradle?
Participant 1: You mentioned a lot of tools, so I was wondering which one is your preferred choice, like Gradle or Maven?
Bento: This is very closely related to your team’s preference and skill set. On my team, it was, and still is, very hard to get people to move from the Maven state of mind to the less structured approach we have with package.json and JavaScript all over the place. I don’t have a preference. They’re all equally good depending on what you’re doing. It will depend.
Participant 2: Do you have any first-hand opinion on coding styles between the various modules of a monorepo? I’m more interested in a polyglot monorepo, if possible, like how the code is organized, various service layers, and all that.
Bento: I’m a very big sponsor of flat structures in the monorepo. I don’t like too much nesting, because it’s easier to visualize the relationships between all the internal modules if you don’t have a folder that hides 20 or 100 modules. I always tell people, give the package name a nice prefix, and don’t be afraid of creating as many packages as you want. The monorepo we built is designed for thousands of packages; we are prepared to grow that far. IDE support might suffer a little bit, but that depends on what IDE you’re using, too. This is the thing I talk about the most: flat structures at the top, easy visualization. If you go into the specifics of each language, then you go to the user manual and follow the code style there.
Participant 3: I’m from one of those rare companies that pretty much uses a monorepo for its entire code base. There are lots of pros and cons there, of course. We’re actually in the process of trying to unwind that. I just wanted to add maybe one don’t to your list, which is, try to avoid polyglot repos. You can easily get into dependency hell if you have downstream and upstream dependencies across languages: a Java service that depends on a Python library being built, or something like that.
Bento: I don’t necessarily agree with that don’t, but it is a concern. Polyglot monorepos are really fragile and really hard to build. You can look at the stories people who use Bazel will tell you; it can become very messy very quickly. For our use case, we really do have the need for cross-language dependencies, because we have a VS Code extension written in TypeScript that depends on JARs, and those JARs usually depend on other TypeScript modules. The structure we created lets us navigate this crazy dependency tree in a way that keeps everything in sync.
Participant 3: I was talking about individual repos having multiple languages in them. Maybe it’s specific to our company. We have pretty coarse-grained dependencies, more at the code-base level.
Bento: This is a good example: in your organization it didn’t work out, so you’re moving away from it. That’s completely fine. Maybe a big part of your modules will stay inside a monorepo, which is fine too. You’re deciding where and when.
Participant 4: When you were talking about your own company’s problems, one of the bullets said that the package lockfile, the pnpm-lock file, would change, causing downstream things to build, and you fixed that. Why is it a bad thing for everything downstream to get rebuilt when dependencies change?
Bento: It’s not a bad thing. It’s actually what we’re aiming for: we want to build only downstream things when a dependency changes. Our problem was that when the lockfile, which is in the root folder, changed, the scripting system we use to decide what to build would see that a root file had changed, so everything got built. Now we have a solution where we leverage the turborepo diffing algorithm to understand which packages are affected by the dependencies that changed inside the lockfile. You’re right, building only downstream packages when a dependency changes is a good thing.
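To make that concrete, here is a minimal sketch, not the speaker’s actual tooling, of how a script might ask Turborepo which packages a change set affects instead of rebuilding everything when a root file such as the lockfile changes. The use of `--dry-run=json` and the exact shape of its output are assumptions that may vary between Turborepo versions.

```python
# Hypothetical sketch: query the turbo CLI for the packages affected since a Git ref,
# so CI can rebuild only those instead of the whole monorepo.
import json
import subprocess


def affected_packages(since_ref: str = "HEAD^1") -> list[str]:
    """Return the package names Turborepo plans to build for changes since `since_ref`."""
    result = subprocess.run(
        ["turbo", "run", "build", f"--filter=...[{since_ref}]", "--dry-run=json"],
        capture_output=True,
        text=True,
        check=True,
    )
    plan = json.loads(result.stdout)
    # Assumed output shape: the dry-run plan lists the tasks that would run,
    # each tagged with the package it belongs to.
    return sorted({task["package"] for task in plan.get("tasks", [])})


if __name__ == "__main__":
    print(affected_packages())
```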
![MMS Founder](https://mobilemonitoringsolutions.com/wp-content/uploads/2019/04/by-RSS-Image@2x.png)
MMS • Robert Krzaczynski
Article originally posted on InfoQ. Visit InfoQ
![](https://mobilemonitoringsolutions.com/wp-content/uploads/2024/11/generatedHeaderImage-1731438380230.jpg)
Hugging Face has introduced SmolTools, a set of applications built on the recently launched SmolLM2 model, a compact 1.7-billion-parameter language model. SmolTools includes specialized tools for summarization, rewriting, and task automation, bringing efficient AI functionality to a broader range of users.
The SmolTools suite includes several applications designed to streamline common tasks (a brief usage sketch follows the list):
- SmolSummarizer: Enables quick summarization for texts up to 20 pages, retaining key points and supporting follow-up questions for deeper understanding.
- SmolRewriter: Refines initial drafts to sound professional and approachable while preserving original intent, ideal for email and messaging needs.
- SmolAgent: Acts as a tool-integrated AI agent capable of executing tasks like random number generation or time checks. Its extensible tool system also allows users to add new capabilities as needed.
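As a rough illustration of the kind of on-device summarization SmolSummarizer provides, the sketch below calls the SmolLM2 instruct model directly through the transformers library. The model id and the prompt are assumptions for illustration; SmolTools wraps this behavior in its own interfaces.

```python
# Hypothetical sketch: summarize a short text with the SmolLM2 1.7B instruct model
# using the transformers text-generation pipeline.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",  # assumed Hub id of the 1.7B instruct variant
)

text = (
    "SmolLM2 is a compact 1.7-billion-parameter language model. "
    "Lighter 360M and 135M variants target devices with limited resources."
)
messages = [{"role": "user", "content": f"Summarize in one sentence:\n{text}"}]

result = generator(messages, max_new_tokens=64)
# With chat-style input, the pipeline returns the full conversation; the last
# message is the model's reply.
print(result[0]["generated_text"][-1]["content"])
```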
To install SmolTools, users can follow these setup steps:
1. Clone the repository:

```bash
git clone https://github.com/huggingface/smollm.git
cd smollm/smol_tools
```

2. Install dependencies:

```bash
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements.txt
```
These tools are powered by SmolLM2 and its variants, including lighter models (360M and 135M parameters) optimized for devices with limited resources. This development brings AI-powered functions to a wider range of platforms, with implications for small businesses, developers, and edge devices.
Drasko Draskovic noted the potential impact:
For small businesses, individual developers, and even edge devices like smartphones, this is game-changing. Imagine running sophisticated summarization or rewriting tasks directly on-device, empowering users everywhere with AI that’s accessible, efficient, and practical.
By pushing forward with innovations like SmolTools, Hugging Face is not just developing technology. They are helping democratize AI. They are proving that efficiency and accessibility are as important as power, opening doors to a future where AI is integrated into everyday workflows, making an impact on all levels of business and society.
SmolLM2’s on-device performance is enhanced with support for tool calling and structured outputs, features critical for building advanced workflows and agentic AI applications. Gaurav Dhiman highlighted the importance of these functions:
Without that, it is practically not possible to build useful AI apps other than general chatting summarization apps. For building something serious like Agentic workflows, both tool calling and structured outputs are crucial capabilities.
Andrés Marafioti, a machine learning researcher at Hugging Face, confirmed SmolTools’ support for these features, referencing a repository example that includes an agent for function calling and structured outputs.
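To give a sense of what tool calling with a model like SmolLM2 can look like, here is a minimal sketch using the transformers chat-template API. The model id, the example tool, and the assumption that the model’s chat template accepts a `tools` argument are all illustrative; the agent example in the smollm repository is the authoritative reference.

```python
# Hypothetical sketch: expose a Python function as a tool to SmolLM2 via the
# transformers chat template, then let the model decide whether to call it.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)


def get_current_time(timezone: str) -> str:
    """Return the current time in the given timezone.

    Args:
        timezone: An IANA timezone name such as "Europe/Paris".
    """
    raise NotImplementedError  # the schema, not the body, is what the model sees


messages = [{"role": "user", "content": "What time is it in Paris right now?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_time],  # the tool schema is derived from the signature and docstring
    add_generation_prompt=True,
    return_tensors="pt",
)
outputs = model.generate(inputs, max_new_tokens=128)
# The reply should contain either a normal answer or a structured tool call to parse.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```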
SmolTools offers accessible, practical tools that simplify text processing tasks on-device, with potential applications across various fields.