Logic Apps Standard Hybrid Deployment Model Public Preview: More Flexibility and Control On-Premises

MMS • Steef-Jan Wiggers
Article originally posted on InfoQ.
Microsoft recently announced the public preview of the Logic Apps Hybrid Deployment Model, which gives organizations additional flexibility and control over running their Logic Apps on-premises.
With the hybrid deployment model, users can build and deploy workflows that run on their own managed infrastructure, allowing them to run Logic Apps Standard workflows on-premises, in a private cloud, or even in a third-party public cloud. The workflows run in the Azure Logic Apps runtime hosted in an Azure Container Apps extension. During the preview, hybrid deployment for Standard logic apps is available and supported only in the same regions as Azure Container Apps on Azure Arc-enabled AKS; more regions will be supported when the offering reaches general availability (GA).
Kent Weare, principal PM for Logic Apps at Microsoft, writes:
The Hybrid Deployment Model supports a semi-connected architecture. This means that you get local processing of workflows, and the data processed by the workflows remains in your local SQL Server. It also provides you the ability to connect to local networks. Since the Hybrid Deployment Model is based upon Logic Apps Standard, the built-in connectors will execute on your local compute, giving you access to local data sources and higher throughput.
(Source: Tech Community Blog Post)
Use cases for the hybrid deployment model are threefold, according to the company:
- Local processing: BizTalk Migration, Regulatory and Compliance, and Edge computing
- Azure Hybrid: Azure First Deployments, Selective Workloads on-premises, and Unified Management
- Multi-Cloud: Multi-Cloud Strategies, ISVs, and Proximity of Line of Business systems
The Hybrid Deployment Model is supported by a new billing model in which customers manage their own Kubernetes infrastructure (e.g., AKS or AKS-HCI) and provide their own SQL Server license for data storage. There’s a $0.18 (USD) charge per vCPU/hour for Logic Apps workloads, allowing customers to pay only for what they need and scale resources dynamically.
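As a rough back-of-the-envelope illustration (assuming, hypothetically, a 4-vCPU allocation dedicated to Logic Apps workloads running around the clock for a 30-day month, and excluding the customer-managed infrastructure and SQL Server costs, which are billed separately), the Logic Apps charge works out as follows:

```typescript
// Sketch of the pay-per-use charge under the preview pricing cited above.
// Only the Logic Apps vCPU meter is modeled; AKS, storage, and SQL Server
// licensing are the customer's own costs and are not included here.
const RATE_USD_PER_VCPU_HOUR = 0.18;

function monthlyLogicAppsCharge(vCpus: number, hoursPerMonth = 24 * 30): number {
  return vCpus * hoursPerMonth * RATE_USD_PER_VCPU_HOUR;
}

console.log(monthlyLogicAppsCharge(4)); // 4 vCPUs * 720 h * $0.18 = $518.40
```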
InfoQ spoke with Kent Weare about the Logic Apps Hybrid Deployment Model.
InfoQ: Which industries benefit the most from the hybrid deployment model?
Kent Weare: Having completed our private preview, we have seen interest from various industries, including Government, Manufacturing, Retail, Energy, and Healthcare, to name a few. The motivation of these companies varies a bit from use case to use case. Some organizations have regulatory and compliance needs. Others may have workloads running on the edge, and they want more control over how that infrastructure is managed while reducing dependencies on external factors like connectivity to the internet.
We also have customers interested in deploying some workloads to the cloud and then deploying some workloads on a case-by-case basis to on-premises. The fundamental value proposition from our perspective is that we give you the same tooling and capabilities and can then choose a deployment model that works best for you.
InfoQ: What are the potential performance trade-offs when using the Hybrid Deployment Model compared to fully cloud-based Logic Apps?
Weare: Because we are leveraging Logic Apps Standard as the underlying runtime, there are many similar experiences. The most significant difference will be in the co-location of your integration resources near the systems they are servicing. Historically, if you had inventory and ERP applications on-premises and needed to integrate those systems, you had to route through the cloud to talk to the other system.
With the Hybrid Deployment Model, you can now host the Logic Apps workflows closer to these workloads and reduce the communication latency across these systems. The other opportunity introduced in the hybrid deployment model is taking advantage of more granular scaling, which may allow customers to scale only the parts of their solution that need it.

MMS • Lakshmi Uppala
Article originally posted on InfoQ.

Transcript
Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today, I’m sitting down with Lakshmi Uppala. Lakshmi, welcome. Thanks for taking the time to talk to us.
Lakshmi Uppala: Thank you, Shane. It’s my pleasure.
Shane Hastie: My normal starting point in these conversations is, who’s Lakshmi?
Introductions [01:03]
Lakshmi Uppala: Yes, that’s a great start. So, professionally, I am a product and program management professional with over 12 years of experience working with Fortune 500 companies across multiple industries. Currently, I’m working with Amazon as a senior technical program manager, and we build technical solutions for recruiters in Amazon. Now, as you know, Amazon hires at a very, very large scale. We are talking about recruiters processing hundreds of thousands of applications and candidates across job families and roles hiring into Amazon. Our goal is to basically simplify recruiter operations and improve recruiter efficiency so that they can spend their time on other quality work.
What sets me apart is my extensive background in very different types of roles across companies and driving impactful results. So, in my experience, I’ve been a developer for almost seven years. I’ve been a product head in a startup. I’ve been leading technical teams at companies in India, and now at Amazon. Over the last seven years or so, I’ve had the privilege of driving large-scale technical programs for Amazon, and these include building large-scale AI solutions and very technically complex projects around tax calculation for Amazon.
Outside of work, outside of my professional world, I serve as a board member for Women at Amazon, DMV Chapter, and Carolina Women+ in Technology, Raleigh Chapter. Both of these are very close to me because they work towards the mission of empowering women+ in technology and enabling them to own their professional growth. So that’s a little bit about me. I think on my personal front, we are a family of three, my husband, Bharat, and my 11-year-old daughter, Prerani.
Shane Hastie: We came across each other because I heard about a talk you gave on a product value curve. So tell me, when we talk about a product value curve, what do we mean?
Explaining the product value curve [03:09]
Lakshmi Uppala: Yes. I think before we get into what we mean by product value curve, the core of this talk was about defining product strategy. Now, product strategy at its start is basically about identifying what the customer needs in your product. So that’s what product strategy is about, and the product value curve is one very successful and proven framework for building an effective product strategy.
Shane Hastie: So figuring out what your customer really needs, this is what product management does, and then we produce a list of things, and we give them to a group of engineers and say, “Build that”, don’t we?
Lakshmi Uppala: No, it’s different from the traditional approach. Now, traditionally and conventionally, one way to do product strategy or define product strategy and product roadmap is to identify the list of a hundred features which the customer wants from your product and create a roadmap saying, “I will build this in this year. I will build this in the next year and so on”, give it to engineering teams, and ask them to maintain a backlog and build it. But the value curve is a little different. It is primarily focused on the value the product offers to customers. So it is against the traditional thinking of features, and it is more about what value the product gives as an end result to the user.
For example, let’s say we pick a collaboration tool as a product. Right? Now, features in this collaboration tool can be, “Visually, it should look like this. It should allow for sharing of files”. Those can be the features. But when you think of values, it is a slight change in mindset. It is about, “Okay. Now, the customer values the communication to be real time”. That is a value generated to the customer. Now, the customer wants to engage with anybody who is in the network and who is in their contact list. Now, that is a value.
So, first, within this value curve model, a product manager really identifies the core values which the product should offer to the customers, understands how these values fit in with their organizational strategy, and then draws this value curve model. This is built as a year-over-year value curve, and this can be given to the engineering teams. The subsequent step is to define features and then give them to the engineering teams. So it is purely driven by values and not really the features which are to be built in the product. That’s the change in mindset, and it is very much essential because, finally, customers are really looking for the value. Right? It’s not about this feature versus that feature in the product. So this is essential.
Shane Hastie: As an engineer, what do I do with this because I’m used to being told, “Build this?”
Engaging engineers in identifying value [06:02]
Lakshmi Uppala: Yes. I think when we look at value curves and its significance to engineering teams, there are multiple ways this can benefit the engineering teams. One is in terms of the right way of backlog grooming and prioritization. Now, features also. Yes, there is a specific way of doing backlog grooming which engineering team does, but I think when you translate this into value-based roadmap, the engineers and the engineering teams are better able to groom the backlog, and they’re able to take the prioritization decisions in a more effective way because values are, again, going to translate to the success of the product.
Second is the design and the architectural decisions. Now, if we think about the traditional model, right? Because they’re given as a list of features, now if an engineer is building for the features, the design or architectural choices are going to be based on the features which are at hand. But thinking in terms of value, it changes their thinking to build for longer-term value which the product should offer to the customers, and that kind of changes the design and architectural decisions.
Third is prototyping and experimentation: engineers can actually leverage the value curves to rapidly prototype and test design alternatives, which, again, does not work very well with a feature-based roadmap definition. I think one other thing is cross-functional alignment. This is one of the major problems which the engineering teams encounter when there are dependencies on other teams to build a specific feature in a product. It is very hard to get buy-in from the dependency teams to prioritize the work in their plan when it is based on features.
But when you think of value and value curves, it gives a very easy visual representation of where we are trending towards in terms of value generation to customers and how our incremental delivery is going to be useful in the longer run which is very helpful to drive these cross-functional alignments for prioritization, and I know… being an engineering manager in Amazon for a year, I know that this is one of the major problems which the leaders face to get cross-team alignment for prioritization. All of these can be solved by a good visual representation using value curves and defining value-based roadmaps.
Shane Hastie: So approaching the backlog refinement, approaching the prioritization conversation, but what does it look like? So here I am. We’re coming up to a planning activity. What am I going to do?
Prioritizing by value [08:54]
Lakshmi Uppala: Okay. So when the product manager defines the value curves, let’s say they define it in incremental phases, that’s how I would recommend approaching this. So as a product manager, I would define what my product is going to look like three years down the line in terms of the values, again, not just in terms of the features. I’m going to define, “Okay. If in three years I’m going to be there, then year one, this is the area or this is the value I’m going to focus on. Year two, this is the value I’m going to focus on”. So this is going to give a year-on-year product roadmap for the team to focus on.
Now, translating that to the tasks or deliverables which engineers can actually take on is a joint exercise between product managers, engineering leaders, and senior engineers. It means translating these values into the features for the product. Again, we are not driving this from features, but from values. Given this value, what are the 10 features which I need to build? Creating technical deliverables from those features, that is the refined approach to backlog grooming and planning when compared to old-fashioned backlog grooming.
Shane Hastie: Shifting slant a little bit, your work empowering women in tech, what does that entail? How are you making that real?
Empowering women in tech [10:21]
Lakshmi Uppala: Yes. It’s a great question, and as I said, it’s very close to me and my heart. So, as I said, I am part of two organizations, one within Amazon itself and one which is Carolina Women in Tech, Raleigh Chapter, but both of these are common in terms of the initiatives we run and the overall mission. So, within the Women at Amazon group, we do quarterly events which are more focused around specific areas where women need more training, or more learning, or more skill development. We welcome all the women+ allies in Amazon to attend those and gain from them.
It’s similar in Carolina Women in Tech, but the frequency is more regular. Once a month, we host and organize various events. Just to give you a flavor of these events, they can range from panel discussions by women leaders across industries and companies, and these could be on various types of topics. One of the things which we recently did in Amazon was a panel discussion with director-level women leaders around the topic of building a personal brand. These events could also be speed mentoring events. Last year when we did this in Amazon, we had 150-plus mentees participate with around 12 women and men leaders, and this is like speed mentoring. It’s one mentor to many mentees across various topics. So it’s an ongoing activity. We have multiple such initiatives going on around the year so that we can help women in Amazon and outside.
Shane Hastie: If we were to be giving advice to women in technology today, what would your advice be? Putting you on the spot.
Advice for women in technology [12:06]
Lakshmi Uppala: Okay. I was not prepared for this. There are two things which I constantly hear at these events. One of them is that women have this constant fear and imposter syndrome a lot more than men. When women in tech or any other domain are in conversations or meetings with men or other counterparts, they generally tend to take a step back, thinking that they’re less knowledgeable in the area, and they don’t voice their opinions. I would recommend that women believe in themselves, believe in their skills, and be vocal and speak up where they need to.
Second is about skill development. One of the other things which I noticed, which is even true for me, is that while juggling multiple commitments, personal, family, and professional, we give very little importance to skill development, like personal skill development. I think that is very, very essential to grow and to stay up-to-date in the market. So I think continuous learning and skill development is something which everybody, not just women, but more importantly women, should focus on and invest their time and energy in. So those are the two things.
Shane Hastie: If somebody wants to create that Women in Tech type space, what’s involved?
Establishing a Women in Tech community [13:29]
Lakshmi Uppala: I think it’s a combination of things. One is definitely if somebody is thinking of creating such a group within their organization, then definitely, the culture of the organization should support it. I think that’s a very essential factor. Otherwise, even though people might create such groups, they will not sustain or they’ll not have success in achieving their goals. So organizational culture and alignment from leadership is definitely one key aspect which I would get at the first step.
Second is getting interested people to join and contribute to the group because this is never a one-person activity. It’s almost always done by a group of people, and they should be able to willingly volunteer their time because this is not counted in promotions, this is not counted in your career growth, and this is not going to help you advance in your job. So this is purely a volunteer effort, and people should be willing to volunteer and invest their time. So if you’re able to find a group of such committed folks, then amazing.
Third, coming up with the initiatives. I think it’s very tricky, and it’s also, in some bit, organizational-specific. So creating a goal for the year. Like For Women at Amazon, DMV Chapter, we create a goal at the beginning of the year saying that, “This is the kind of participation I should target for this kind of events, and these are the number of events I want to run. These are the number of people I want to impact”. So coming up with goals, and then thinking through the initiatives which work well with the organizational strategy is the right way to define and execute them. If they’re not aligned with the organizational culture and strategy, then probably you might run them for some iterations. They’ll not create impact, and they’ll not even be successful. That’s my take.
Shane Hastie: Some really interesting content, some good advice in here. If people want to continue the conversation, where do they find you?
Lakshmi Uppala: They can find me on LinkedIn. So a small relevant fact is I’m an avid writer on LinkedIn, and I post primarily on product program and engineering management. So people can definitely reach out to me on LinkedIn, and I’m happy to continue the discussion further.
Shane Hastie: Lakshmi, thank you very much for taking the time to talk to us today.
Lakshmi Uppala: Thank you so much, Shane. It was a pleasure.

MMS • Ben Linders
Article originally posted on InfoQ.

High-performing teams expect their leader to enable them to make things better, Gillard-Moss said at QCon London. Independence in software teams can enable decision-making for faster delivery. Teams need empathy, understanding, and guidance from their managers.
Something most driven and motivated engineers have in common is that they will have a longer list of all the things they want to improve than the things they believe are going well. Gillard-Moss mentioned that for some managers this is intimidating and makes them believe the team is being negative, when in reality this is a good thing. They are motivated to get these things solved. In return, they need you to create the opportunity to solve things.
If the team feels that you are not enabling them to solve things fast enough, then sentiment turns negative, Gillard-Moss argued. This is because they’ve stopped believing in you as someone who can help. So rather than bringing you problems to solve and wanting your help, they become complaints, burdens, and excuses:
If you have a team that is unable to make things better, and is stuck complaining that things aren’t getting better, then you do not have a high-performing team.
We should strive for independence in teams for faster and better decision-making, which leads to faster delivery and faster impact, Gillard-Moss said. Waiting for decisions is the single biggest productivity killer, and making decisions on poor information is the most effective way to waste money, he explained:
In technology, we need to make thousands of decisions a day. It’s unrealistic for someone far from the information to make high-quality decisions without blocking teams. And it doesn’t scale.
With low-performing teams, Gillard-Moss suggested analysing their cycle time. The vast majority of it is waiting for someone to provide information or make a decision, he mentioned. And then, when they do, the team struggles to implement it or pursues a suboptimal solution because the decision-maker had an overly naive view of what needed to be done.
What teams need from managers is empathy, understanding, and guidance. Empathy comes from being able to think like an engineer, Gillard-Moss said, and understanding because you’ve been there and done that yourself. Guidance comes from a deep instinct for the universal fundamentals of engineering and how to apply them to get better results, he added, the evergreen wisdom and principles.
Gillard-Moss stated that a good engineering leader builds teams that can maximise impact by applying their expertise:
My experience as an engineer tells me that integrating early and often results in less delivery risk, and when to tell that’s not happening. It also tells me that’s sometimes easier said than done and the team might need help working through it. This gives me the patience and empathy to guide a team through the trade-offs in these difficult situations.
InfoQ interviewed Peter Gillard-Moss about managing high-performing teams.
InfoQ: How can managers help teams improve their cycle time?
Peter Gillard-Moss: There are so many factors that influence cycle time, from architecture to organisation design to culture to processes. The best thing any manager can do is observe the system and continuously ask, “Why did this take as long as it did? With hindsight how could we have improved the quality of this decision?” And then experiment with small changes. Little nudges. This is, after all, why we have retrospectives.
One example was a team who felt like they worked on a lot but nothing came out the other end. When we analysed the cards, we saw that they would keep moving back up the wall from Dev Complete back into Dev, or more cards would be created and the original card would be placed in Blocked or Ready to Deploy for weeks on end. What was happening was the stakeholder would specify the exact solution, literally down to fields in the database in the original card. The team would build it and then the QA would find edge cases. The edge cases would go back to the stakeholder who would then decide on the next steps, either adding new criteria to the original card or creating new ones.
Most of this was over email (because the stakeholder was too busy) and it was often missing context both ways. When you gathered the history and context around the cards, it looked absurd as weeks would go by for simple stories which have long email chains connected to them. Despite the obvious inefficiencies, this is a pattern I’ve seen in many teams.
InfoQ: How can engineering leaders keep their engineering expertise up-to-date?
Gillard-Moss: You can’t. I’m really sorry but you can’t. It’s impossible. Once you realise this, it will liberate you and you will become a better leader.
Plus it’s not what teams need from you. The engineering expertise needs to be in the team with the people doing the work. Knowledge and expertise is a group asset, not an individual asset.
How do you think a team would perform if all the knowledge and expertise was only in their manager’s head? Every time a team gets stuck having to go to their manager to get the answer. How slow is that? How expensive is that? And the manager will burn out and complain that they don’t have time to focus on the important things.
Presentation: Beyond Platform Thinking at RB Global – Build Things No One Expects, in a Place No One Expects It

MMS • Ranbir Chawla
Article originally posted on InfoQ.

Transcript
Chawla: I took a brief sojourn as a young kid into the finance business, and determined that that wasn’t where I wanted to be. I had met a couple of people in the finance business that had some money. Back in the early ’90s, when the internet looked like it might actually be more than just an academic exercise, I raised some money with a couple of friends and bought one of the first ISPs in the U.S. I then joined another couple of friends and built EarthLink, which ended up being the largest ISP in the United States. It was fun. It was a great journey.
It was amazing to work on a lot of things we worked on. I always tell people, try not to be successful in your first two startups, because it colors your vision of what’s to come next, because they don’t always work after that. I’ve been in a lot of diverse industries. I’ve done a lot of consulting. I was at Thoughtworks. I’ve had a lot of opportunity to see all kinds of architectures and all kinds of situations, from small companies to some of the Fortune 5. The Ritchie Brothers journey was a really interesting one.
Introducing RB Global
There’s a little subtitle to the talk, build things no one expects, in a place no one expects it. This phrase came up in a conversation about a year ago with a group of some of the folks at Thoughtworks that were still working on a project at Ritchie Brothers. One of them asked me how I had reached some of the success I might have reached in my career. I was trying to think about what that really meant, and trying to come up with a way to describe it. At least for me, the things that I’ve done is exactly that. Whatever the solution is, it’s not what the solution somebody expected, or it was something no one was looking forward to.
No one had on their bingo card to build this thing. No one was expecting the World Wide Web in 1994, 1995, or building an architecture like we did at Ritchie Brothers, in an industry where no one expects that architecture. That leads into, what is Ritchie Brothers. We’re the largest global auction marketplace in the world for heavy equipment, agricultural equipment, heavy trucking, things like that. Sounds boring, but it’s actually really very interesting and a lot of fun. We have deep relationships across all these industries, and deep relationships with construction companies, large agricultural firms, and a lot of small businesses. Our customers range from very small family farms all the way up to major construction firms, or firms that rent and lease construction equipment and then move that to our marketplace.
Things about Ritchie in the plus column, the good things about Ritchie Brothers. We care deeply about our customers. When you think about our sellers, if you go to one of our live auctions or even look online, we spend a lot of time on merchandising the equipment that we sell. It’s just a pickup truck, or it’s just a large dump truck, or it’s just an excavator. If you go and you see the care that we took in bringing that equipment in and cleaning that equipment and merchandising it and making it available to people. We deeply realize that a lot of these people that we work with, from a seller perspective, have built a firm as a small business or a medium-sized business that is their retirement, that is their assets. They don’t have a pension somewhere. They don’t have cash somewhere. They’ve got this equipment. They’ve reached the end point where they’re going to go relax and go to Florida or go somewhere and be done.
We’re there to get them the very best price on that equipment. That’s what they deserve. That’s what we want to care about them. When it comes to our buyers. We want to get them the best possible pricing. We want to give them accurate information. You’re making a very large investment in a piece of equipment that you may only have a few minutes to inspect. We’re going to give you a professional inspection. We’re going to take a lot of time to make sure you understand what you have. Again, if you also aren’t getting the best possible customer service, we’re going to make sure you have that. We look at our business. We’re a very white glove, very hands-on business on both sides of the transaction. That’s a real positive for us.
On the not so positive side, or the, what’s become difficult, or really, why are we here talking about this. Ritchie Brothers evolved into RB Global, and built that company through acquisition. We bought another company called Iron Planet, 7 years, 8 years ago, that helped us bring another way of being in the digital marketplace. They were very online at the point where we were very physically centric. Auctions were at an auction yard, an auctioneer calling out. Iron Planet was very digital. RB moved into the digital universe on their own. Actually, what happened is, you had those companies, now we have other companies within our sphere.
The integrations were never done. They were uncommitted. We would find ways to make them work. We now white glove our own internal operations. There’s a lot of manual steps. How do I bring in a bunch of trucks that I’m going to sell, or a bunch of heavy equipment I’m going to sell? Some of it needs to get sold through the Iron Planet sales formats, and some of them needs to get sold through a local auction yard. There’s a ton of white glove manual steps that go into moving everything from system to system. We call it doing arts and crafts.
How do I fake this process? How do I fake that construct? I asked one of our people what her dream was for a successful integration. She said, I want to be able to do it from a dining room chair. I said, I don’t understand what you mean. She said, I don’t want a swiveling office chair. I want a chair that doesn’t turn, because my office chair allows me to have three monitors, because IT gives me three monitors, because I’m in three different apps all the time. I want a screen. I want a boring, just simple engagement. That’s where we have that challenge of how to bring a better digital experience to our own customers, but also a better digital experience to all the people embedded in the company.
Architecture Perfect Storm
What did we find as we started this journey about 2 years ago? I came in, I was with Thoughtworks at the time. They came in and said, we want to do this digital transformation. We want to build a very large digital marketplace. We want to integrate all our companies. We want to do this amazing thing. Here’s what we found. We call it the architecture perfect storm. If you look at across all the systems we had, there were at least three of everything. Whatever domain you picked, there was three of it: three merchandising, three this, three asset management systems, three billing. All the different pieces all over the place. What was maybe worse, was that each one of these was believed to have to be in sync all the time. We’re going to talk about some of the models and things that we found. There was no system of record for anything in particular.
Even worse was all that interconnectivity. Everything’s in threes for some reason. If you think about the modes of intercommunication, we had a really interesting combination of literal database sharing, Kafka, and Boomi between all the different systems, any one of which could fail at any particular time. We had what everyone was calling an event-driven architecture, through Boomi and Kafka, which was really data synchronization, data propagation. It was not event-driven architecture. There is a long-standing Teams chat where a group of people came together three and a half, almost four years ago, because they noticed auction data wasn’t propagating online between these two systems. The idea was: we’ll get in a Teams chat, we’ll bring people together when we see this, and we’ll be able to go and fix it. That was supposed to be temporary. It’s been four years. Sometimes people ask me what my goal is for this endeavor, and how do I know when I’m done? I shut that chat down. To me, that’s a sign of success.
Technical Myth Understandings
Many technical myth understandings. Why does our electronic bidder number only have five digits? We need more digits. We run out of numbers. We had a reason. There was a reason. You talk to 7 different people, you get 11 different answers about why we have this. What happens if we change it? I was told if we change it, this thing breaks. No, that’s the thing that breaks. You’re just layering on layer after layer of puzzlement about how this even happened. My other favorite: we can’t schedule an inspection unless it’s tied to an event. In Ritchie Brothers’ domain language, an event means an auction. That’s what that means. I can’t inspect something if it doesn’t have an auction as its destination. It turns out, we run a good business of inspecting equipment for companies, because we’re good at it. They just want to get this item inspected or this fleet of excavators inspected, so they understand the value. They understand maybe what they have to do to fix them.
We have to literally create a fake auction in the system for that to happen. Events have IDs, which are all a small collection of numbers of sale ID. It’s possible now to run out of spaces. It feels like I’m doing COBOL. I’ve run out of literal spaces to have an ID. Now they say, we got to get rid of events. This is going to be a major rewrite. The team that builds the inspection service says, no, it’s not, we’ll just take the events out. Everyone’s like, what do you mean just take the events out? You told us to put them in. They’re actually not necessary. We’ve been doing that arts and crafts for 4 years, and it wasn’t necessary? Who said to put it in? I don’t know, it’s so long ago we forgot. I call this the cacophony of voices. There are many myths.
There are many people with a different version of the story. You’ve got a key system that actually constructs the online bids. It constructs online listings by listening to five Kafka topics and hoping to synchronize them. When they get out of sorts, that’s what we had the years-long chat for. We hadn’t gotten to the point of asking, what if we just had one? Timing. We have timing issues. We talked about the data synchronization. It’s a big mess. What we found when we got there is what I lovingly call the architecture of errors.
The Architect’s New Clothes
Here’s the thing, nobody wants to point these things out. We’re all going to make do. There was nobody before just raising their hand saying, this has got to stop, or this is the wrong thing. When you put people individually in a room, they would tell you, of course, this is nuts. I knew better than that. You couldn’t go out and say that, because it would be very unpopular to say that, or this team might get insulted, or this boss did that before, and I don’t even know who’s responsible. We had to go through and peel layers of the onion off.
We got to the point where we just said, you know what we’re going to do? We’re just going to start with what we think the right way forward is, from an architecture perspective. We’ll deal with these things when we get to them, because it was impossible. After six months of trying to understand everything, we realized we hadn’t moved forward at all. We were just trying to understand. At some point you just say, we’re going to approach this as if we were building it, and then we’re going to see what we have to do as we walk through the system.
Time to Dig In – Building the Plan
What do we have to do to dig in and get through all this? The important part here, we had to understand the business model. There was a lot of people in our organization that wanted to treat this as an e-commerce exercise. There were comments of, we could extend Shopify. Shopify would work. Why don’t we just do Shopify? Let’s just do this. Let’s do a simple e-commerce architecture. We’re not e-commerce. There is no single thing that we sell that is identically the same as the other thing. I might sell a bunch of Cat 720s, but each one of them is inherently different. It has different hours, different condition. It’s been used in different kinds of work. This one needs painting. That one is different. This doesn’t have the same value. They’re not the same thing. Our inventory is not acquired from a single source, like an e-commerce inventory would be. We have some inventory that’s in our auction yards.
One of the things that Iron Planet specializes in is selling things at your location. You’ve got a big construction site, we’ll sell it from there. You don’t have to transport it to an auction yard. My inventory is distributed across the world. It’s not identical. It doesn’t have a common merchandising function or a common pricing function. It’s not e-commerce. It’s an online marketplace, though. There’s domains that are similar to anything that somebody who’s been in the space would understand, but it’s not e-commerce. Again, part of that early discovery process was treating it like e-commerce and realizing we really had gone down a path where it just didn’t fit. The architectures were not coherent. Again, at some point, you have to match the business model and step back. Something I always say, the architecture serves the business, not reverse.
The first thing we had to do was start thinking about, what are the key domains in our system? What do we do? Effectively, at RB Global, we take something from person A and we facilitate a transaction to organization B. We collect money, we calculate our take, and we send the rest back to the other person. That’s what we do. Ironically, it’s a lot closer to a distribution model, or, as an ex-stockbroker, it’s very similar to the finance model. I’m going to help you get this thing, and I’m going to take my slice, and I’m going to deliver it to you over here. You have to think about what those domains are and understand them, and really deeply try not to model something that doesn’t represent your business. The other thing that we really discovered was, we were also in this transition to becoming a product-driven company. Really, what does that mean?
That was a very nascent journey that we’re still on together. Having that connectivity with the business and helping the business change and understand, again, remember I talked about, we’re very white glove, we’re very manual. That’s going to change. The business is moving. The business is undergoing change. The tech is undergoing change. Who is the translation unit between those? Our product people were the deep translation layer. One of the things I tell people is, when you’re going to be in a product-oriented environment, when you’re thinking about domains, when you’re thinking about platforms, one of the things you want to think about is, what is your communication path between the business and your teams? How do your teams communicate with each other? How do you communicate with the business?
We leveraged our product managers within these teams, and our technical product managers way down at the deep technical levels to be that interconnection point. I would often say that we wanted our teams to be isolated and focused in their domain and their context. The product people were that little integrated circuit pin. There was one pin to that whole API, and they would speak to the other team, whether it was a business team, whether it was a technical team. The product folks helped us, along with our staff engineers, and the architecture team, keep that all in alignment, and keep it all driving towards the same end goal.
This is the key, we fought for radical simplicity. Everything we did at Ritchie is massively complicated. I often make this joke that if you’re going from your main bedroom to your main bathroom, we would go through our neighbor’s house. Every process we do is intensely complicated. How do we break it down to its most simplified process is a struggle. One of the things that we realized was, especially at the technical level on this, don’t rush it. Incrementally think about it. I am very much an XP person. I always prefer code over documentation. This is one of those times when drawing it a couple of times was worth taking the time, before even coding it, sometimes. Just thinking about it. One of the things I like to talk about a lot, I think we need more thinking time.
I reference at the end of the presentation, there’s a great YouTube video of Toto Wolff being interviewed by Nico Rosberg. Toto Wolff is the principal of Mercedes Formula 1 team. One of the things he says in there is, we don’t take enough time, we have our phones. We don’t take nearly enough time to just stare out a window and think. We are all so busy trying to accomplish things. What I tell people when it comes to architecture, when it comes to trying to simplify or organize complex systems, walk more, think more, get away from the screen, because it’ll come to you in the shower. It’ll come to you walking the dog. It’ll come to you doing something else. The last thing is just question everything. You have the right as technical leaders to politely question everything. Are you sure I absolutely have to do that? Is that a requirement or is that a myth? Dig into it.
People always ask me, my CEO asks me all the time, how are we going to be done so fast? We’re going to write less code. They look at you like, what do you mean? The less work I do, the faster it gets done. Can I get it done with that little amount of code? That would be a victory. There’s also the, I spent this much money, I should get more. We’re not doing lines of code. It’s not Twitter. It’s not SpaceX. We don’t do lines of code. We do business outcomes. If we get the business outcome with the least amount of work, that’s the win.
The last thing too, is that we really tried to value fast feedback. We built a lot of our architecture and our platform for the sole purpose of shipping every day. We can’t always ship to production every day, but we ship internally every single day. Our product people get really tired of having to review code and review outputs every day, but that’s what we do. Get an idea. How do we get from ideation to somebody’s hands to take a look at, and get that fast feedback. Is this what you meant? Back to that old, simple process.
What Is Your Unfair Competitive Advantage?
One of the things we talk about a lot at Ritchie is, what’s our unfair competitive advantage? I’m a big car racing fan. Gordon Murray is a famous F1 designer from the ’70s and ’80s. He’s got his own company now, Gordon Murray Automotive. This is some slices of his new hypercar, the T.50. Professor Murray was the very first guy to just basically say, I’m going to put a fan on the back of an F1 car to get ground effects and just literally suck the air out of the car and have it go around the track even faster. The reason I bring this up is, as soon as he was winning everything doing that, they banned the fan.
That’s not fair. We didn’t all think of that. The rules don’t ban it yet, but they will next year. For that year, they won everything. Not everything, but most everything. They were ahead of all the other teams because of that. Even in the world of tech, if you think of something and you can do something, somebody will catch up to you. For the time that you have that, figure out what your unfair competitive advantage is and dig into it as your business. What can you instantiate in your architecture that matches your business? It’s your unfair advantage.
Organizing the Complexity
I’m going to talk about organizing the complexity. I’m going to dig in a little bit to some of the basic principles we used here to get through all this. Borrowed this slide from my team at Thoughtworks. This is one of our favorite slides. If you look at the left-hand side, you see all the systems and all the complexity. Every system talking to every system. All the connectivity is really complicated. All the things are very spider web. What we want to do here is get to the right-hand side. Think about our key domains at the bottom. Think about our key APIs, our key services, our key systems.
How do we build something composable? How do we make clean interfaces between all the systems? How do we stop the data sharing? Is there one place that does asset management? Is there one place that does consignment? Is there one place that does invoices? Period. How do I compose them to build different experiences, whether it’s mobile, whether it’s an online external API, all of that composition, with clean boundaries. That was the goal. You want to think about retaining your deep knowledge. Are you going to move things to this new platform? Are you going to rebuild them?
In the amount of time we have, we have a lot to prove to both our investors, our board, our own company. We’re not going to rebuild everything. There are systems that we can use and keep. Part of our first adventure in that multiple system universe was taking a lot of those multiple systems, and in their first year, getting it down to one. One thing that was responsible for this. In a monolithic world that we were in, sometimes that one system was actually responsible for eight things, but it was only one of them for any particular thing. Then we could decide, what of those pieces are we going to build an API around, so that we could keep them. Which pieces are we going to strangle out later? It’s important in your architecture to know that and to be able to explain why you chose what you chose.
Because, again, with the business elements of architecture, the unfortunate reality is, you’re always searching for funding. You’re always trying to get somebody to keep paying you to get this job done, or keep investing in it. A lot of these complex enterprise transformations or detanglements, they fail. People say all the time, why do they fail? My snarky answer usually is lack of executive courage and follow-through. That happens because you didn’t give the executives the inputs to see that it was worth continuing to fund, so people say, “It never happened. I gave up.” Which pieces do you move now? How do you think about your roadmap? There are things that we’re going to use in 2024 that will be gone in 2025, but we’re going to use them now. Again, think about the deep knowledge. Think about the things not worth changing. Make sure you mark them down and understand them.
I think about bounded context, and I wanted to find a nicer way to talk about that domain-driven design concept of a bounded context. Have clean interfaces around complexity. One of the best ways for me to explain this, relevant to Ritchie Brothers, is we have something we call our taxonomy. For us, what taxonomy means is how you organize all the different types of equipment we sell. We sell everything from small excavators to huge dump trucks, to Ford transit vans, to Ford F-150s. How do you organize all that data in a catalog? It turns out, it’s mostly tree based, if you really think about it. Knowing it’s a particular model of this particular Ford truck tells me something.
The things it tells me, for example, are its dimensions. It tells me how much its transport cost usually is. It tells me about its tax categories, which are very important, whether you’re in the EU or in UK or the U.S. There’s usually a lot of exemptions or different tax rates depending on if it’s agricultural or if it’s not agricultural. Is this piece of equipment a fixed agricultural unit, is it a mobile agricultural unit? At Ritchie Brothers, all that taxonomical information was actually distributed across the system in many places. As we were rebuilding this and thinking about, we have a modern tax interface now from Thomson Reuters to do the calculations, and it tracks tax exemptions. Where should the categories be? Obviously, the team building the tax system would have all the tax categories. That was everybody’s first answer.
We said, no, that’s actually not right, because what turns out happens in the taxonomy world is, the taxonomy is not fixed. It changes. We learn about equipment. Somebody sells a new piece of equipment. If you think of it as a tree, end leaves in the tree split into a node with two new leaves. If we’re changing that taxonomy tree, then the tax categories change. If I had changed this, now I have to go change another system to correspond to this. If I keep the tax categories inside the taxonomy when they change, I may go to the tax team, they may actually program the facts into the taxonomy. They’re not going to have to change data here and here, and hope that we both kept them organized and clean. That’s what we mean about containing that blast radius of change.
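As an illustration of that ownership decision, here is a minimal sketch of a taxonomy node that carries its own tax category, so a change to the tree is a change in one place; the type and field names are hypothetical, not RB Global’s actual data model.

```typescript
// Hypothetical taxonomy node: the tax category travels with the node itself,
// so when a leaf splits into new leaves, the tax data is updated in one place
// and downstream systems (e.g. the tax engine) simply read it from here.
interface TaxonomyNode {
  id: string;
  label: string;            // e.g. "Ford F-150" or "Mobile agricultural unit"
  taxCategoryCode: string;  // owned by the taxonomy, consumed by the tax service
  children: TaxonomyNode[]; // empty array for a leaf
}

// Walk a path of child ids and return the tax category at the end of it.
function taxCategoryFor(node: TaxonomyNode, path: string[]): string | undefined {
  if (path.length === 0) return node.taxCategoryCode;
  const next = node.children.find((c) => c.id === path[0]);
  return next ? taxCategoryFor(next, path.slice(1)) : undefined;
}
```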
We had one system that was responsible for one thing, as we go through this architecture. Think about which domain is managing for change, because, again, an exercise like this is a change management exercise. Nothing will be the same 2 years from now. As you start pinning responsibilities around things, you have to understand which system and which team is responsible for that ownership. Even if it’s going to change, you know over time.
We also talk about decreasing the coupling between all the different systems and having a clean interface from that previous Thoughtworks slide. I’m going to try to give you a slightly simplified version of something we ran into. We found out we had a limitation on the number of items you could put on a contract for the seller. We dug into, why can you only have 100 to 200? If you have more than 200, the system just completely bogs down, and it takes more than 2 hours for it to generate the contract exhibit A that the customer has to sign, and they don’t want to wait. We dig into that. That’s not a business limitation. That’s a system limitation. Why do we have that? Contracts and all the contract data is stored in salesforce.com. That’s your CRM system. We had our asset management, we had our system called Mars, which is a big asset management monolith. We had all the contract terms and all the asset data in Mars.
It turns out, we synchronized the two of them. We would take all of the detailed asset data, ship it through Boomi to Mars, from Mars to Salesforce. Salesforce would recompute all the things we already knew about those assets to regenerate that PDF. Then, of course, the payout system would have to go to Mars in order to figure out what the commission was and the holdback rates and all the other things. It was a mess back and forth. When you think about it, though, what did our CRM system actually need? The CRM system just needed a PDF. It didn’t need all that data.
All the customer needed to store in Salesforce was the PDF that listed all the things they were selling. That’s it. When we refactored it, we said, we’re going to build a contract service. It generates the PDF for Salesforce. It’s bidirectional. If you add more items to a contract, it goes to the contract service. The contract service will return you a new PDF. Nice and simple. Thousands of items if you want them. No limitation anymore.
It’s also where payouts now goes to get the commissions. Payouts no longer goes to that big monolith anymore to find out what the commission rates are. It goes to the contract service. One more thing out of the monolith and one less hassle for our people internally by sending stuff back and forth to Salesforce. Slightly oversimplified version of that, but by extracting that and understanding its domains and having the right responsibility just for contracts and item association, we broke through a major barrier for customer service, because now they can have many items. We have some sellers that will bring us 10,000 things, and we would have to open 1000 contracts. One contract now, much simpler for everyone.
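A minimal sketch of that narrowed boundary, with illustrative names rather than the actual service contract: Salesforce only ever receives the generated PDF, and the payout system reads commission terms from the same service instead of the monolith.

```typescript
// Hypothetical contract-service boundary. The CRM never sees raw asset data;
// it only stores the Exhibit A PDF the service returns.
interface ContractItem {
  assetId: string;
  description: string;
}

interface CommissionTerms {
  ratePercent: number;     // e.g. 10 means a 10% commission
  holdbackPercent: number;
}

interface ContractService {
  // Add items to an existing contract; returns a regenerated Exhibit A PDF.
  addItems(contractId: string, items: ContractItem[]): Promise<Uint8Array>;

  // Used by the payout system to compute what the seller is owed.
  getCommissionTerms(contractId: string): Promise<CommissionTerms>;
}
```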
Event-based architecture. At Ritchie, event-based architecture was an excuse to do data synchronization. There was, A, lack of consistency. B, massive payloads containing all the data. Yet mysteriously missing data, so not a complete payload. Now again, too, there’s many forms of event-based architecture. You can get very complex CQRS. You can just get simple. What I always say in this space is, first of all, understand what model you’re going to use, and then be as simple as possible. Be consistent. Think about the information you’re sending as an API payload. If you’re going to send a notification of change, “This change, go retrieve it.” That API that you’re retrieving, it should be complete. If you’re sending all the data in the event, have all the data.
Our bidding engine is famous for sending winning bids. It has the winning bid, has the number right there. Just anyone want to guess what’s missing from that payload? No, the number’s there. The price is there. It’s a global company, you know what’s missing? Currency. One hundred and eleven what’s? No, that’s fine. Just go to the event and go look up the currency for that particular event. No, put the currency in there. No, just MapReduce it and just go over. No, just send the currency.
Everything in our system, and we modified it to work in our new world, all the money is integer based, except our legacy, which is all float based. When you do money, that’s a whole other heart attack by itself. Then you have currency. It gets crazy. But again, inconsistent payloads make systems go talk to each other; we have an incredibly chatty system, because you have to assemble so much information from so many places, just because it isn’t well factored.
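To make the point concrete, here is a minimal sketch of a self-contained winning-bid event, with the amount as an integer in minor units and the currency carried in the payload itself; the field names are illustrative, not the actual bidding-engine schema.

```typescript
// Illustrative "winning bid" event: consumers never have to look up the
// auction to interpret the number, because amount and currency travel together.
interface WinningBidEvent {
  eventType: "bid.won";
  lotId: string;
  bidderId: string;
  amountMinorUnits: number;            // e.g. 11100 = 111.00 in the given currency
  currency: "USD" | "CAD" | "EUR" | "GBP";
  occurredAt: string;                  // ISO 8601 timestamp
}

const example: WinningBidEvent = {
  eventType: "bid.won",
  lotId: "lot-48213",
  bidderId: "bidder-00042",
  amountMinorUnits: 11100,
  currency: "CAD",
  occurredAt: "2024-05-14T15:04:05Z",
};
```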
We talk a lot about communicating choices. I’m not going to go through this whole Wardley diagram. This was a key tool, and this was a very early one we built. We started with some assumptions: what was going to be things we were just starting to think about, what deserved to be custom because it belonged to Ritchie Brothers and the way it did its business, what products we could go get, and what were commodities? Some interesting things came out of that discussion.
We use Stripe for their payment APIs. We also leverage Stripe for money movement. We have a lot of rules in the UK, the EU, and the U.S. around what we call trust accounts. In certain states in the United States, if we sell a truck, what the buyer paid has to go into an account that’s only in that state before the money gets paid out. There are all kinds of rules. There’s a lot of money that moves after the payments are made.
We use Stripe APIs for that versus manual entries in accounting right now. We could get that at a very low rate, negotiate a very low rate for a highly fiscally compliant and performant service. Why would we build our own? Same thing with taxes. Why build a tax calculation engine when you can get a global tax calculation engine off the shelf and rent it? We could help our board and help our leadership understand where we thought these things were. This diagram becomes a way of communicating. It also becomes a way of negotiating. Put that in genesis; we’ll deal with it later. Or no, let’s go ahead and make that custom. I’m willing to invest in this.
Think about how you as architects are communicating your choices to your leadership so they understand. This is another example of that. This talks about some of the main business capabilities or domains that we have in our system. On the left, you see the things that we care about that we built on our technology infra environment. This is one way that we can communicate to our leadership, “This is what we’re working on now. These are the different experiences. These are the different touchpoints.” It’s not an architecture diagram, but you still have to be able to take that. Remember back to that search for funding, that’s also an important part of this.
Starting in the Middle
We started in the middle. Talk about the journey. With all that in mind, where do we actually start this process? We started in the middle, because it turns out, on the Iron Planet side, some of the payment processing and post-payment processing was automated, but not all of it. On the Ritchie Brothers side, there were no online payments at all. You got an invoice from a 25-year-old VBA-based system, and then you called somebody to find out the wire instructions if you didn’t already know them, and then you wired the money, or you walked in with a check. Everything that happened after that was mostly manual. There is an AS/400 sitting in a corner. It only does certain manual flat-rate commissions. There was a group of people that would stay every night to make sure that what was in the AS/400 actually computed, and they would do that on spreadsheets. Why? Because we don’t trust computers to do math, so we use Excel. That is the statement of the century.
We started with, we’re going to automate this process of checkout. We’re going to build a way for our customers to complete a purchase online: win their bid, win their item, and actually pay for it. Leveraging Stripe, we could give them an option to do ACH, because a lot of our customers, even the big ones, still pay a fee to do a wire. If we can use bank transfer, it saves everybody money. It gets a lot faster. We can do online credit card payments up to certain amounts. We do all of that, that’d be a much better experience.
More importantly, the process that we were using to figure out what we had to pay out to sellers was mainly a manual process. We have a certain number of days to pay our sellers, and nowadays, with money earning even more money, that float, that number of days, was important. You can't monetize money and buy overnight repos and treasuries if you don't know how much you're supposed to pay out, because you don't know how much you can make money on. You lose the opportunity for float.
There were a lot of reasons we wanted to start in the middle. We did it wrong, and I think this is an important lesson in the process. I told you at the beginning, we do all that white glove service. With white glove service come the famous exceptions. We do this, unless the customer wants that, and then we do this instead. What if they don't pay? Then, if they don't pay, we do this thing. Or what if they go into the yard and say it's missing a key, there's no key for the excavator, so I want 50 bucks off?
Fine, now we have to give them $50 off. Or, this is a mess, can you paint it for me? Now we have to charge them. There were all these manual processes that we would do, and there were so many of them, and everyone was so responsive to the business, that we built a workflow that was basically organized around making sure all the exceptions worked. When we got to the end of that, and even started working on a pilot, the user experience for the customer suffered because we were thinking about exceptions instead of the customer.
More importantly, the accounting integrations weren't perfect, because we forgot to begin with the end in mind. The goal of automating checkout, and of all this calculation and settlement flow, was to get the correct entries in the finance system. For us, that's Oracle Financials; it doesn't matter which one it is. Getting the accounting entries correct was mission critical. If you're going to automate payouts to customers, then there needs to be an invoice in Oracle in order to understand that and automate that whole process. We realized we'd gone off the deep end. We actually came and sat down with our team in the EU, and we learned about all the anti-corruption rules in the EU about immutable invoices, when people are allowed to get invoiced, and when we trigger certain events in the workflow for VAT.
They were all remarkably straightforward and well defined. It actually gave us a way to say, if we meet this high bar, and we define all the accounting and the logic for this high bar, everything else becomes easy. We went back and refocused on the happy path: from "I won my bid" to everything being in the accounting. I've paid for my stuff. I've got my invoice. The seller got their money. What does that look like in Oracle? Done. Figured that out, built that. What's interesting is that now when you build all the exceptions, when you build the $50 refund, or you build this or that, you have already defined what the end goal is supposed to look like in this system.
You can check, did you do it correctly? It was a really important lesson for us to understand what the end, the happy-path end in mind, really was. Some of our integrations that are coming next are even more complicated. We have to go back and say, how do we know this worked? What's our actual acceptance criterion? Thinking about that as part of the core architecture was a real lesson for us. Again, back to radical simplicity. The point of this was to sell something from Tom to Fred, keep the money in between, and account for everything properly. We had to do that part first and then do all the other exceptions later.
Platform V1, V2, and V3
Just some quick, high-level versions of what the platform looked like. When we started, we had Stripe and Thomson Reuters on the finance side. They were our finance partners. When we started, we had a Ritchie Brothers legacy website. We had our Iron Planet website. We started by building a standalone checkout experience and a standalone account management experience, and adding mobile to our checkout app. That wasn't perfect, and it would have been better to have added it to the existing app, but the legacy app was also a 20-year-old, unstable platform to work in. I talked about the spaghetti data flow. It was also very much in the middle of how all the things worked. It was a little dangerous at the time to play with it. To get ideas to customers, we actually built some standalone pieces, and then we built some core capabilities around contract management, settlement, and invoicing, all the things we needed to make that checkout work.
We got that working, and it was successful. Our customers liked it. It's rolling out now. In the process of getting it ready to roll out, we actually did build a unified, new, modern website for RB Global. We got it off Liferay. Anybody remember Liferay? This was all 20-year-old Liferay with bad microservices sitting in Java, and all the old things. Now it's on a new, modern stack. It's a lot less in the middle of all the data transactions. That's where we are now. We're releasing all this with our new checkout system, which is awesome, and it's working. What's coming next is to integrate Iron Planet with Ritchie Brothers. This is our next task, the one we're in the middle of. Really what we're saying is, we're not going to bring the systems together. What we're going to do is this: Iron Planet sells things a certain way, and Ritchie Brothers sells things a certain way.
We’re going to combine those ways into the new platform, so lift them both up to the new world. That means we have to change how we do consignment. It means we have to change how we track our assets, because we track different things between Iron Planet and Ritchie Brothers. All the things in that deeper orange color that changed, all got modified or getting modified, as you bring that integration together. Because we understand those domain boundaries, and we understand where we can change things or where things belong, it’s going to be very much easier for us to move that stuff off the old monolith into here, because we have a known destination. We can have a test suite that tells us, run the characterization tests and understand that what we did in Iron Planet now works over here. Again, because the architecture allowed us to migrate that incrementally.
Beyond Architecture – The Secrets to On-Time Delivery
Let's think about what we did to actually get to delivery, because delivering all these concepts is just as important as organizing them. We built a lot of what we call delivery infrastructure and tooling around engineering enablement. We did think about how we wanted to make a productized delivery platform. We built it on Kubernetes. We wanted to apply the same concepts: how do we build a set of well-factored APIs, or decide which of the Kubernetes APIs we expose to our developers, so they can get through this process with the least amount of knowledge about how to use a complex thing like Kubernetes and containerization? Jeremy had talked about taking your deep knowledge, wrapping it up, and making it simpler for another person.
It’s a lot of what we tried to do here. We use a lot of the Kubernetes tools, such as mission controllers and operators to obfuscate even cloud provisioning. At Ritchie Brothers, if you want DynamoDB, it’s a couple of entries in your Helm chart. If your team is allowed to have that, you have the budget, it’ll show up for you. We don’t make you learn Cloud APIs for Azure or AWS. We put that in the delivery platform. That deep knowledge that we’ve had, we built it into some very simple APIs, so our engineers are really effective. We can go from no team, to putting a team of people together and have them all ship to production in less than an hour. It’s pretty cool.
The other thing is, I try to treat our engineers as our competitive advantage. It's been said a couple of times during the keynote: at the end of the day, this is still a people exercise. The folks in your organization who have that deep knowledge, and the people you bring in, are all part of the group that is going to actually get through this slog and deliver all this. We focused deeply on investing in our own organization. Yes, we brought Thoughtworks in as a partner, but we also deeply invested in the people that we had, who'd been building monoliths, who'd been building code a certain way.
They had to learn a new way of working. We took the time to do that. We focus on joy as a metric, in the organization, we don’t focus on productivity. I always tell people, my goal as an engineering leader is to create an opportunity for people to have joy every day from delivering work, because we all like to do that. We build the stuff, we want somebody to use it. We want them to have that joy of doing the job. We want them to be able to focus every day and not be disturbed. I was begging our engineering leaders not to wait for engineers to be dead before their art is worth money. Can we please get this out the door? Because people get really frustrated if they don’t see their stuff hit the marketplace, if nobody sees what they did.
Then, other thoughts on maybe how you leverage your development partners through a process like this. We partner with Thoughtworks. We have other partners as well. Part of the entry criterion is, how do you create or uplift the environment that you’re in? I’m not paying for more hands on keyboard. I’m paying for some specific skill, or I’m paying for an attitude, or I’m paying for a way of working. I’m paying for something. I want more ROI than just hands on keyboard, than just code. I want them to help create that environment, to help build that new culture that we’re looking for. I always say, partner with a quality partner. Partner with somebody whose goals and culture and other things match your own. Don’t just try to find somebody to type stuff.
Also, if you’re going to be partnering with a partner in a complex transformation like this, mentoring is important. If you’re going to bring people in who’ve done this, it’s not enough for them to go into a room and do it on their own, because when they leave, you have nothing left behind. You have a bunch of code and a bunch of people that were separated. We have a phrase at Ritchie about, One Team All In.
We bring our partners in, and they're part of our teams. We have a product engineering team, period, whether you're a partner or not. We don't exclude people from the process. When I got there, there were the cool kids who were going to work on the transformation, and the not-cool kids who were going to do the legacy stuff. How motivated was everybody? Not so much. The way we did it was very simple. We have a lot of pets, and none of them get to die right now.
Every day, somebody’s got to feed the pets, and somebody’s got to take care of the new pets. We’re all going to shift, and we’re all going to do some of it. It isn’t cool kids versus the old kids. We’re one family. We have to take care of everything we have. We consciously did that and even brought some Thoughtworkers in, they’re like, why am I maintaining legacy code? Because it’s your week. It’s your week to take care of the old pets. It’s just the way it is. There isn’t cool kids and not cool kids.
When you think about a transformation journey where you're building an API and capability platform like we are, there are some things I've learned over the years of doing this, and the things that get me really excited aren't always obvious. I watch across the organization to see if it's sticking, to see if the message is working beyond my own teams. We call them moments of serendipity. A product leader asks, can I use that settlement statement API to take the settlement statements and put them in our external fleet management software, so sellers using the fleet management software can see their payouts?
Yes, it’s an API, use it. When you finally get somebody to say, I could use that API here, and I can use it here, that’s fantastic moment. That’s what you’re looking for, because right now, you built them to do one thing, but you really want those to be multiple ROI type investments. You start to hear shared language. You start hearing people talking about the new provisioning domain or the new consignment domain, or they’re talking about invoicing, and they’re talking about it in a specific way that you’re talking about it, and it’s not just your team. That means you’re getting that transformational knowledge out to the universe. People in your organization are thinking that new way. When you hear your business owners speaking in that language, you won. This one’s the hardest one.
I’m not done with this journey here, and there’s been a lot of slog in a lot of other places in this last point. Martin Fowler has written about it many times, moving from projects to products. This one is hard. We call it RB 2.0, that’s the name of this endeavor. I always say, this is a transformational endeavor at Ritchie Brothers, at RB Global, to change our way of doing technology in the technology we use. It is not a project that will someday end. It is not a thing that will ship when it’s done. It’s not a thing you fund until it’s over.
If you have invoicing for the rest of your business's life, you need to fund invoicing for the rest of your business's life. If it's important for you to have a catalog of the things that you're selling and be able to tell where they are in the process, then you will always need that, and it will not be done, and you need to fund it. It is a hard slog. Don't ever back away from this fight, because if you do, you end up with legacy software again. You end up coming back here, or some other poor soul ends up coming back here, because you're off doing something else, and now the stuff is worn out, and they're the ones still dealing with it.
Key Takeaways
Set aside time to stare out the window. Every once in a while, it's ok to think. I saw that Toto Wolff video, and I actually went and bought a chair for my home office. My wife says, what are you doing? I said, this is an official stare-out-the-window chair. She's like, are you allowed to do that? I'm like, I'm going to do it anyway. Then, don't mistake action for results. Sometimes there's a lot of running around and activity, but activity doesn't mean outcomes. Be really careful about separating the two of them.
Then, everything in architecture is about tension. I just told you to slow down. Now I'm going to tell you to go faster. One of the things I tell my teams all the time is an old line from Yogi Berra, the great, nonsensical baseball coach from the States: when you see a fork in the road, take it. Make a decision. We have a lot of people, whether they're executives or engineers, who are standing in the road.
The truck is coming and they’re standing there like a deer in the headlights. Go left. Go right. It’s code. What could possibly go wrong? You can change it tomorrow. Make a decision. Try something new. We have a lot of digital disruptors coming for us in the industry right now. They can move at that digital native speed. We don’t have time for that. If you build a good architecture, you can experiment quickly. You can stay ahead. That’s the argument here. Then the other thing is, just be a demon about removing all the friction from your value stream. Deeply think about what blocks you from getting from ideation to delivery.
Think about that, not just in your tech business, but work with your business leaders to understand what that means in the business. What are people doing right now that could be eliminated so we can make more margin and put people on smarter things? We all learn about value stream management. We all learn about those processes as engineers: educate the business. The last one, everyone can be a servant leader. It doesn’t matter what your position is from an engineering perspective. You could formally be a leader. You might be an IC. These are things you can do. Everybody deserves to carry this process down the line.
Microsoft Unveils Azure Cobalt 100-Based Virtual Machines: Enhanced Performance and Sustainability

MMS • Steef-Jan Wiggers
Article originally posted on InfoQ. Visit InfoQ

Microsoft recently announced the general availability (GA) of its Azure Cobalt 100-based Virtual Machines (VMs), powered by Microsoft’s custom-designed Cobalt 100 Arm processors. According to the company, these VMs offer up to 50% improved price performance over previous Arm-based offerings.
The GA release follows up on the earlier preview this year. The Cobalt 100-based VMs are tailored for a broad range of computing tasks. With Arm architecture at the core, these VMs are built for energy-efficient, high-performance workloads and are available as general-purpose Dpsv6-series, Dplsv6-series, and memory-optimized Epsv6-series VMs.
Any VM in the series is available with and without local disks, so users can deploy the option that best suits their workload:
- The Dpsv6 and Dpdsv6 VM series offer up to 96 vCPUs and 384 GiB of RAM (4:1 ratio), suitable for scale-out workloads, cloud-native solutions, and small to medium databases.
- The Dplsv6 and Dpldsv6 VMs provide up to 96 vCPUs and 192 GiB of RAM (2:1 ratio), ideal for media encoding, small databases, gaming servers, and microservices.
- The Epsv6 and Epdsv6 memory-optimized VMs offer up to 96 vCPUs and 672 GiB of RAM (up to 8:1 ratio), designed for memory-intensive workloads like large databases and data analytics.
All these VMs support remote disk types such as Standard SSD, Standard HDD, Premium SSD, and Ultra Disk storage.
Sustainability is crucial to Microsoft’s strategy, and the new Azure Cobalt 100-based VMs contribute to this vision. Arm-based architecture is inherently more energy-efficient than traditional x86 processors, resulting in lower carbon footprints for businesses that adopt these machines. As enterprises globally prioritize green computing, this launch aligns with Microsoft’s broader goals of reducing emissions and offering more sustainable cloud solutions.
Regarding sustainability and energy-efficient computing power, AWS offers various Amazon EC2 instances powered by AWS Graviton processors, which are also based on the Arm architecture. Furthermore, Google Cloud also offers Ampere Altra Arm-based processors with Google Compute Engine Tau T2A instances.
One of Microsoft’s partners, Databricks, has integrated the Azure Cobalt 100 VMs into their Databricks Data Intelligence Platform on Azure, which unlocks new possibilities for handling data-heavy workloads with greater efficiency, scalability, and cost-effectiveness. The company writes in a blog post:
With up to 50% better price-performance than previous generations, Cobalt 100 VMs enable Databricks customers to benefit from superior performance and lower operating costs.
Lastly, more details on pricing and availability are available on the Azure VM pricing and pricing calculator pages.
Kotlin HTTP Toolkit Ktor 3.0 Improves Performance and Adds Support for Server-Sent Events

MMS • Sergio De Simone
Article originally posted on InfoQ. Visit InfoQ

Ktor, Kotlin’s native framework for creating asynchronous HTTP server and client applications, has reached version 3. It adopts kotlinx-io, which brings improved performance, albeit at the cost of breaking changes, and adds support for server-sent events (SSE), CSRF protection, serving static resources from ZIP files, and more.
kotlinx-io is a low-level I/O library built around the abstraction of Buffer, which is a mutable sequence of bytes. Buffers work like a queue: you write data to the tail and read it from the head. Ktor 3 now aliases kotlinx.io.Source to implement its Input type, deprecates Output, and reimplements ByteReadChannel and ByteWriteChannel. Developers that used these low-level classes directly will need to modify their apps to migrate to the new API.
The main benefit brought by kotlinx-io is improved performance:
we’ve cut down on the unnecessary copying of bytes between ByteReadChannel, ByteWriteChannel, and network interfaces. This allows for more efficient byte transformations and parsing, making room for future performance improvements.
Based on its own benchmark, JetBrains say the new Ktor shows a significant reduction in the time required for file and socket operations, which can be as high as 90% in certain cases.
Besides performance, the most significant change in Ktor 3.0 is support for server-sent events, a server push technology that enables the creation of server-to-client communication channels. Server-sent events are preferable to WebSockets in scenarios where data flows mostly in one direction, especially when it is necessary to circumvent firewall blocking or deal with connection drops. WebSockets, on the other hand, are more efficient and grant lower latency.
Other useful features in Ktor 3.0 are support for Cross-Site Request Forgery (CSRF) protection, which can be specified for any given route, and the ability to serve static resources directly from ZIP archives. ZIP archives are served from a base path and may include subdirectories, which will be reflected in the URL structure appropriately.
As a final note about Ktor 3, it is worth noting that the Ktor client now supports Wasm as a build target. However, Kotlin/Wasm is still in alpha stage, so Wasm support in Ktor 3 is not ready yet for production use.
To start a new project using Ktor, head to the Ktor website and choose the plugins that most suit your requirements. Available plugins provide authentication, routing, monitoring, serialization, and more. If you want to update your existing Ktor 2 project to use Ktor 3, make sure you read the migration guide provided by JetBrains.

MMS • Loiane Groner
Article originally posted on InfoQ. Visit InfoQ

Transcript
Groner: We’re going to talk a little bit about API security. Before we get started, we have to understand why we have to do this. The way many companies handle application security today is: you do your planning and development, developers push their PRs, and you do the build. There’s usually a QA environment, testing, UAT; many companies will call this different things. Then, that’s when you raise your security testing request. You ask the InfoSec team, please test my application, let me know if you find any security vulnerabilities. If they find something, it goes back to the dev team: “We found this security issue. This is a very high risk for our business, and you have to fix it.” Again, it goes through the PR, has to go through testing again, rinse, repeat, until you have a clean report or no high-risk vulnerabilities, and you finally can go to production. This has a few caveats.
First, it can cause production delays, because if you have to go through this testing and rinse-repeat cycle until you get a clean report so you can go to production, that can take a while. Or, even worse, companies are not doing security testing throughout the software development lifecycle at all; they’re doing it once a year, or not doing it at all. There’s very interesting research done by the Ponemon Institute, and it says that fixing software defects, or worse, fixing security risks once the product is in production, costs way more than handling that during development. That’s why in the industry we say there is a shift left happening: many years ago, we went through the cultural change of having unit testing done as part of our development cycle, and now we’re going through this again.
However, we’re talking about security this time. It is much cheaper and much more cost-effective for the team, and for the company as well, to handle all those security vulnerabilities and make sure that your software is secure while you’re doing development. Security has to be there from day one. It’s not technical debt. It’s not something that we’re going to add in the next sprint. It has to be part of your user story. It has to be part of your acceptance criteria. It has to be part of your deliverable. I would like to show you a few things that I’ve learned throughout the years.
My name is Loiane. This talk is from a developer to other developers and leads, so we can go through this cultural change and make sure that security is indeed part of our development phase.
What is API Security?
First of all, whenever we say API security, if you decide to Google this, search this, go to YouTube, and try to find a tutorial, you’re going to find a lot of tutorials talking about authentication and authorization, especially if you’re working with Spring Boot. All my examples here are going to be in Java, because this is the technical stack that I’m most familiar with, but you can easily translate all the examples into a different programming language, framework, or platform. Going back to my question, if you go to YouTube and you search for Java security or Spring Security, you’ll find a lot of tutorials about authentication and authorization. Security is not only about that. If we take a look at the OWASP Top 10 vulnerabilities that are found each year, and this list is going to change year after year, you find a lot of the same things happening over and over again. With the tips and best practices I’m going to show you here, we can at least make sure that half of this list is not going to happen in our software.
Better Authorization
Let’s go through it first. Let’s suppose that everybody is doing authentication, so at least user and password, or you’re using an OAuth service. You’re doing that in your software. We still need to handle authorization, which is where you have to make sure that the user who is trying to access your application, or trying to perform a certain action within your application, is indeed able to perform that action. How do we make authorization better within our applications? Let’s start with the first example, a bad practice. We’re checking here if we can update a course. This is a RESTful API, so we’re doing a POST here. We have the ID. We also have the object, the data that we’re trying to update. I get the user that’s authenticated. I’m checking if this user has the student role. If the user has the student role, they cannot update the course. If I am somebody who doesn’t know anything about this application, and I’m reviewing this code, I don’t know who exactly can actually update this record. It’s not clear just from reading the code.
A better way of doing that is deny by default. I’m going to write my code, I’m going to write my business logic, and by default, nobody is going to have access to it. What I’m going to do is list whoever can actually update it, and everybody else is simply not allowed. When I read this code now, at least I can see that only admins and only teachers can actually update this record, so it’s a little bit better. The other thing is, the majority of the frameworks that we work with do have some support for role-based access control. In Java, for example, we handle a lot of things through annotations. When you’re working with Spring Boot, you have an annotation where you can easily add all the roles that are actually allowed to do this. We are working with the deny-by-default approach. You’re free to write your own business logic and keep that part of the authorization, the security check, outside the main business logic. This is great. However, this only works perfectly for small systems, or systems where you don’t have a lot of roles.
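To make that concrete, here is a minimal sketch of the deny-by-default idea with Spring Security method security. It is not the speaker's actual code: the controller, the nested CourseDTO record, and the role names are illustrative, and it assumes method security is enabled (for example with @EnableMethodSecurity).

```java
import org.springframework.security.access.prepost.PreAuthorize;
import org.springframework.web.bind.annotation.*;

// Illustrative sketch: deny by default, then explicitly list who may update a course.
@RestController
@RequestMapping("/courses")
public class CourseController {

    public record CourseDTO(Long id, String name) { }   // minimal stand-in DTO

    // Only ADMIN and TEACHER get through; every other caller is rejected before
    // the business logic runs, and the rule is visible at a glance in the code.
    @PreAuthorize("hasAnyRole('ADMIN', 'TEACHER')")
    @PostMapping("/{id}")
    public CourseDTO update(@PathVariable Long id, @RequestBody CourseDTO course) {
        // ... business logic only; no role checks mixed into it ...
        return course;
    }
}
```

The security rule lives in the annotation, so a reviewer can see who is allowed without digging through the method body.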
I really wish that my applications were like the ones in those YouTube tutorials, where I have user or admin and that’s it. That would be a wonderful world. Unfortunately, it’s not like that. What can happen here is role explosion. I’m going to start with user and admin. Maybe I have a teacher as well, but now it would be really good to have teaching assistants too. I’m going to grant them access to my system so they can do a few things on behalf of the teacher. Or, maybe we’re working with an eLearning platform and I have an account manager. The account manager will also be able to do those things in my system. We start adding more roles to the system. Now my business logic is only one line of code, and I have more code just doing the pre-authorization part. When you’re reading this, it’s not so good, and we can do better. If you’re working with something like this, which actually looks like the projects that I work on, sometimes the authorization goes down to the level of the button that I see on the screen or the link that I can click on the screen. In the RESTful API, whether I’m able to perform a particular action or not really depends on the role I have and all the actions that role is able to perform.
When we handle situations like this, it is much easier if we have something that is a little bit more dynamic. There are many different ways that you can do this. If we’re using Spring security and Java, of course, you can use a custom security expression. You can design this according to your needs, according to the size of your project and your business. You can maybe have all the mapping, all the authorization within a database or another storage, and you load that, and you have a method or a function that’s going to calculate if the user really has access or not. Of course, annotations for the win. We can actually use the annotation and have our method here with the privilege. Now it’s a little bit more clear for me that the only users that are able to actually perform this action are the ones that have the course update privilege, that’s mapped somewhere else. When we go into those more complex cases, this can be a little bit easier for us. There is no more hardcoding with all those roles within the system.
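As a hedged sketch of what that privilege-style check can look like with Spring Security method security: the COURSE_UPDATE authority name and the idea of a role-to-privilege mapping loaded from a database are assumptions for illustration, not the speaker's code.

```java
import org.springframework.security.access.prepost.PreAuthorize;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/courses")
public class CourseAdminController {

    // The method states the required privilege rather than a hardcoded list of roles.
    // Which roles carry COURSE_UPDATE is resolved elsewhere, for example from a
    // role-to-privilege table loaded into the user's granted authorities at login.
    @PreAuthorize("hasAuthority('COURSE_UPDATE')")
    @PostMapping("/{id}")
    public void update(@PathVariable Long id, @RequestBody String body) {
        // business logic only
    }
}
```

Adding a new role then becomes a data change in the privilege mapping, not a code change in every controller.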
The other issue that we might face is: I’m logged in, I’m checking whether I am authorized or not, but should I have access to update that particular record? If I am a teacher, let’s say we are in a university and there are many classes, should any teacher be able to update it? If I know the ID, should I be able to update it just because I know the ID? I know that some of you are using the incremental identity that is generated automatically by the database. We have to be very careful with that. Again, authorization can still be bypassed even by someone who is authorized to use the system. Be very careful with that, and remember to always deny by default.
How exactly do we make that better? One thing that you can do is, once you have the information and you’ve gone through the authorization, you have to find a way to check whether that particular record can be updated by that particular user. Maybe there is some kind of ID, the course teacher ID, and you’re going to match that against the user ID that’s trying to update the record. That way you can make sure that only that specific user is able to actually perform that action. However, one thing that I see happening a lot is that we’re getting that object, the course object, directly from the request. I still have my ID from the path variable that I’m passing through my request, but the course that I’m actually checking my logic against came from the request. It can be something as simple as using Postman or any other similar tool.
You can manipulate it, or if you’re a little bit more smart, you can use another tool to intercept the request, change the JSON that is being sent to the request, and the ID here might be something. Again, you can bypass any authorization logic that you have and still update that record in the database, and something that should not happen. Never trust the request. When you have to do something like this, always go back to the database or to the true source of the data, the data source. Check for the true data to make sure that that is actually able. There is a tradeoff here. This is going to be a little bit more slower, because we have to go to the data source. There is a request, milliseconds, but again, it is a small tradeoff that we are willing to pay here just to have our APIs more secure.
Property Level Issues
When we are working with objects, there are still a few other issues that we can run into. This has a very fancy name. Just to give you an example, if I have a user and I’m trying to get the data from the user that is logged in, I have a user and password. I’ve done this multiple times myself, exposing the entity directly. Because why am I going to create another class or another object that’s just a copy of my entity, and then I’m going to expose it. This can lead to some issues. In this particular case, if I’m only trying to expose the username, and I have some common sense, and I know that I’m not going to expose the password in the JSON, so using annotations, I can simply annotate my Get method and have a JsonIgnore. What happens if tomorrow we receive another requirement and we have to capture another field, for example, sensitive data such as social security number or something else.
The developer that is working on this unintentionally forgets to annotate the getter for the social security number, and when we’re sending back that information in the response, we are exposing something that we’re not supposed to. This can go through pull request reviews, code reviews, and we’re not going to notice. That can happen. A way to avoid this is creating data transfer objects, or DTOs. You can use records if you’re using a more modern version of Java, or you can just create a class. You have to explicitly say what the properties are that you want to expose in this case. It’s a much better way of doing that. If tomorrow we get, again, the requirement to add sensitive data to our object, we’re not going to expose it, because the public contract doesn’t have that information, and that social security number, or whatever other sensitive data we have to capture, is going to stay internal to the system.
Then we can enter into another very good discussion here: should I create a DTO for the request and another DTO for the response? Again, this is contentious territory. Each one of us will have their own point of view on this. If you are reusing the same DTO for both requests and responses, just be careful. For example, for the request, do not accept the ID, or whatever primary key or unique property you’re using to identify that object, from your DTO when you’re handling requests. This can also slip through the cracks, and then, again, something might happen. It’s always best to have one for the request and another one for the response. In case you have a metric against duplicated lines of code within a project, be very careful with that.
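A small sketch of the expose-by-allow-list idea with records; the entity, the DTOs, and the field names are illustrative, not taken from the talk.

```java
// Illustrative sketch: the entity keeps sensitive fields internal, and the public
// contract is a record that simply does not have those fields.
public class UserDtos {

    // Assumed JPA-style entity with sensitive fields (accessors omitted for brevity).
    static class User {
        Long id;
        String username;
        String password;                 // must never leave the service
        String socialSecurityNumber;     // must never leave the service
    }

    // Response DTO: the only fields a caller can ever see.
    public record UserResponse(Long id, String username) {
        static UserResponse from(User user) {
            return new UserResponse(user.id, user.username);
        }
    }

    // Request DTO: deliberately has no id; the identifier comes from the path,
    // never from the payload.
    public record CreateUserRequest(String username, String password) { }
}
```

Adding a new sensitive column to the entity later changes nothing in the public contract, which is the whole point.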
Password/Key Exposure
Now we’re able to handle authorization a little bit better. The second part is password and key exposure. This seems like common sense: who here is going to commit the database password to their GitHub repo? Yet there are a few different ways that this can still happen. In many companies, you have your URL, and then you have your resource name, something to help identify the project, and then you create a developer database. Again, I really wish my project was the same as those tutorials, where I can simply have a MySQL database and a Docker image with two tables, and that’s it. That would be wonderful as well. Especially when you’re working with legacy systems, where you have that huge database with maybe hundreds of tables and lots of data, it’s a little bit more complicated.
Some companies will have their own database on a server or in a cloud, and everybody is going to access that database. I don’t know about you, but me, personally, I’m not so good with names. That’s the hardest thing to do. How am I going to name a class, a variable? What name do I give to my database? I’m just going to use the company name, or maybe the project name, plus dev to indicate that this is a development environment, and prod to indicate that this is production. This can be a little bit dangerous. Then for the password, again, I’m not going to remember all 30 passwords for all the services that we use, so I’m just going to use something like learningPlatform@Dev. Then for production, I just change that to production. If something like this gets committed into a repository and somebody sees that information, I wonder what happens if they change this from dev to prod or to another upper environment? Be really careful with that. Never leave passwords or any secrets in your properties or YAML file, or even hardcoded, even for lower environments.
Another issue here is this last line right here. If you’re working with JPA, with Hibernate, there is a setting where the framework is responsible for checking all the entities that you have in the source code and creating all the tables for you. It can create, drop, update; there are many different options. This is a big issue. Never use a user ID that is able to make schema changes in your database. Again, deny by default. You start with, I need read access to my database for my user ID, so you grant that read access. If your application is also writing to the database, then you grant write access. If you need access to execute any stored procedures, then you add that access as well, but never grant more access than is actually needed. Be very careful with that. This only works for tutorials. It does not work for real applications.
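One common way to keep the secret itself out of the repository is to resolve it from the environment at runtime. A minimal sketch, assuming a Spring application and an environment variable named DB_PASSWORD (both names are assumptions for the example):

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;

// Illustrative sketch: the placeholder is resolved from the Spring Environment,
// which includes OS environment variables, so the secret can be injected by the
// platform (Kubernetes secret, vault agent, CI variable) instead of being committed
// in application.yml or application.properties.
@Configuration
public class DataSourceCredentials {

    @Value("${DB_PASSWORD}")
    private String databasePassword;   // never hardcoded, never in the repo
}
```

The same idea applies to any key or token: the code only knows the name of the secret, never its value.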
Input Validation
The third part that I would like to bring to your attention is input validation. This also seems to be common sense, something very basic, yet we are failing at it in lots of the code that we review. We are just not adding any kind of validation, and we need to start changing that as well. We have our frontend. It’s beautiful, fully validated. I have all the error messages, the user experience, chef’s kiss. Then, if you take a look at the API that’s feeding that frontend, it’s just this: I have my create method, I have my DTO, and there’s nothing else. It’s just simple code. This is a big red flag. How can we improve that? Never trust the input. Again, if you have your frontend fully validated, the user enters all the data, hits the submit or save button, sends the request to the API, the data is passed along and saved perfectly. Then, again, if you try Postman, or any of the other approaches to actually invoke your API without a frontend, you start to run into issues. There are no validations.
Always remember that if you’re working with an API that is being used by a frontend, the API exists independently from the frontend. We really have to start validating the API. First step, the same validations that you are applying in your frontend, you have to apply in your API as well. That’s the minimum that we have to do. I know it’s a lot of work, because there’s a lot of validations that can go through, especially when we’re working with forms, and we do have a lot of forms in some of the applications, but again, always add the validations to your API, at least the same. Remember that your API has to have more validations than your frontend. It is the one that has to be bulletproof and has to hold the fort when we’re talking about security.
Make sure that you’re validating type, length, formatting, and range, and enforcing limits. Java is a beautiful language, because we have something that I like to call annotation-driven development. We just start adding all the annotations, and magically it’s going to do all the work behind the scenes for us. When you are annotating your entities, you have @Column, for example, just to map a particular property from your class to the column in the database, or to the property in the document. Make sure that you’re adding the length as well, whether it’s nullable or not, whether it is unique. Try to reflect your database constraints in your code as well, because, again, that’s at least one more layer of security that we can add.
In Java, we have a really nice project called Jakarta Bean Validation, or if you’re a little bit old school, Java EE Bean Validation. Hibernate also has one of the implementations, called Hibernate Validator, that you can use to enhance all your entities or all your documents as well. Do not forget to validate strings when you have a name. Even if you look at this code right here, I see you have some validations, but that’s not enough. I don’t have all the validations. There is too much damage that I can do if I only have validations for the size but I’m not validating the string itself. If I try to do a request, can I send !##$ and so on? I’m just going to look at my keyboard and add some special or weird characters. Is that a valid name? Should it be allowed? Validate strings.
One thing that we usually tend to do is just go to the keyboard: let me look at it, type, and create my regex from my keyboard. If you go to the ASCII table, or if you take a look at the Unicode tables, there are hundreds of characters, characters that I don’t even know exist, or whose names I don’t know. Be very careful with that. Always prefer to work with an allow list. What does that mean exactly? Take a name: if I’m only allowed to have alphanumeric characters, with maybe a space, parentheses, or underscore, then that is my name, and anything else is denied by default. One other thing that you can do is sanitize as well. It really depends on the project. You can use the approach that, if the user tries it, I’m not going to allow it, I’m just going to throw an error. Or you can try to automatically remove those characters, so you sanitize them. Different approaches for different projects. Just make sure that you are choosing the one that is a better fit for you.
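A minimal sketch of an allow-list validation with Jakarta Bean Validation; the record name, the field, and the exact pattern are assumptions you would adjust to what your business actually accepts.

```java
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Pattern;
import jakarta.validation.constraints.Size;

// Illustrative request DTO: the name is validated against an allow-list regex
// rather than by trying to enumerate every "bad" character.
public record CreateCourseRequest(

        @NotBlank
        @Size(max = 100)
        @Pattern(regexp = "^[\\p{Alnum} ()_-]+$",
                 message = "name contains characters that are not allowed")
        String name
) { }
```

With @Valid on the controller parameter (and a Bean Validation implementation such as Hibernate Validator on the classpath), anything outside the allow list is rejected with a 400 before the business logic runs.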
Always remember to secure all the layers. For example, we’re working here with three layers. We have our controller: validate all the parameters that your methods have. Do not be shy about using those annotations. It only takes seconds to actually add them. One thing that is very important, especially if you’re working with pagination: never forget to add an upper limit to your page size. My frontend only allows 100 records per page. That’s fine. But here, what if I pass a million, or 5 million? What if I try to do a DDoS attack and send multiple requests asking for 5 million? Is your server able to handle that many requests? You can bring down your service, and that can cause business loss, financial loss to the company as well.
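A short sketch of capping the page size at the API boundary; the controller name, the 100-record limit, and the defaults are assumptions for illustration.

```java
import java.util.List;
import jakarta.validation.constraints.Max;
import jakarta.validation.constraints.Min;
import org.springframework.validation.annotation.Validated;
import org.springframework.web.bind.annotation.*;

// Illustrative sketch: a caller can never ask for 5 million records in one request,
// because the size parameter is bounded before it ever reaches the service layer.
@Validated
@RestController
@RequestMapping("/courses")
public class CourseQueryController {

    @GetMapping
    public List<String> list(@RequestParam(defaultValue = "0") @Min(0) int page,
                             @RequestParam(defaultValue = "20") @Min(1) @Max(100) int size) {
        // ... delegate to the service with the already-bounded page/size ...
        return List.of();
    }
}
```

The @Validated annotation on the class is what makes Spring enforce the @Min/@Max constraints on request parameters.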
Always make sure that you’re adding validation to each and every parameter that your API is receiving. Again, in the service, you’re going to repeat that. The good thing is, you’ve done that in the controller, so Control C, Control V in the service, or maybe you’re doing the other way around, the service and then the controller. Make sure that you are propagating all those validations across all the layers. Because, what can happen, depending on the application that you are working with, you can have a service that is being consumed by only one controller, but again, maybe next week, next month, or next year, you have another controller also using that same service. What’s going to happen?
If the developer that is now coding that new controller does not do any validation in the controller, at least the service is going to handle the validation and reject bad requests. Again, on the entities or documents as well, don’t be shy about adding all those annotations. The beautiful thing about this is, if you have a column or a property that can only handle 10 characters, and let’s say that you are sending 50 characters through the request, you reject it up front instead of getting that truncation exception when the write to the database fails. The other beautiful thing is that if you are on the cloud and the service that you’re using charges you per request, when you have all these validations in place you are saving a failed request to the database, so that can actually bring some cost-saving benefits to the organization.
SQL injection. It’s 2024, and we still have to talk about SQL injection. It is still happening. Make sure that you are validating and sanitizing your inputs, and escaping those special characters that can be used for SQL injection. I know sometimes we don’t want to use some kind of Hibernate feature, and when you have something a little bit more complex, you want to write your own native queries. Make sure you’re not using concatenation. Please, at least use a prepared statement. Be a lazy developer, use what the framework has to offer you. Don’t try to do things on your own. Many developers have gone through the same issues before, and that’s why we have frameworks, to abstract a few of these things for us. I’m still seeing code during code reviews with concatenation in place. Sad, but that’s life.
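For completeness, a small sketch of the difference with plain JDBC; the table and column names are made up for the example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Illustrative sketch: the value is bound as a parameter, never interpreted as SQL.
public class CourseSearch {

    // Vulnerable version, for contrast (do not do this):
    // String sql = "SELECT COUNT(*) FROM course WHERE name = '" + name + "'";
    // A name like "x' OR '1'='1" changes the meaning of the query.

    public int countByName(Connection connection, String name) throws SQLException {
        String sql = "SELECT COUNT(*) FROM course WHERE name = ?";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, name);   // bound as data, never as SQL text
            try (ResultSet rs = statement.executeQuery()) {
                return rs.next() ? rs.getInt(1) : 0;
            }
        }
    }
}
```

The same principle applies with higher-level tools: named or positional parameters in the query, never string concatenation.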
File Upload
Still talking about input: so far we’ve only talked about validating the request, but what about files? I work in an industry where we handle a lot of files. I’m not talking about images. I’m talking about Excel files, Word files, PDFs, things like that, where you have to read those files, parse them, and then do something with the data that’s within the file before you go through the business logic. First rule of thumb: always make sure that you are adding limits to the file size. If the file is too big, push back to the user. Again, it really depends on the business use case. Try to find your limit, something that is acceptable, and make sure that you are setting that in your application. Again, if you’re using Java and Spring, it’s two lines of code. Easy. Five seconds and you’re done. Make sure to also validate the extension and the type. These can be very deceiving. If you remember a few slides back, never trust the input, because here you can go to the Content-Type header and manually change it, and deceive the code if it’s checking the extension or the Content-Type header. What do we do?
The issue that we can run into with the extension is, if your library is expecting one extension and it’s actually something else, you can run into all sorts of issues. Also with the file name: there is one very famous vulnerability called path traversal, where, again, we don’t know what the file name is. You can use those tools to intercept the request and change it to something malicious. You can completely wipe out directories of files. I don’t know if you’re using a NAS, an S3 bucket, or any other kind of storage, but there is a lot of damage you can do with only a malicious file name. Make sure to also validate that. Be a lazy developer. Use tools that are already available, if you are able to add these dependencies to your project. If you need something that is very simple, very quick, you can use Apache Commons IO. There is a FileUtils, because we love a Utils class, and there is a FilenameUtils that you can use to normalize the file name.
If you need something a little bit more robust, you can use Apache Tika, which can actually read the metadata of the file, get the real file type, and help you sanitize the name of the file. I cannot tell you how many times this library has helped me close vulnerability issues in the applications that I have worked on. Whenever I’m working with file upload, the first thing I check is: do I have Apache Tika in my pom.xml? If I do, then I’m covered, and I can just copy-paste the boilerplate code, or create a static method to run those validations for me and have some reusability as well. Again, if you are indeed saving the file somewhere, be sure that you are running the file through a virus scan. If you’re working with spreadsheets, CSVs, or documents, again, deny by default. Does my Excel file need to have macros or formulas? Does my Word document need to allow embedded objects? Does it make sense for my application? Do I have a valid business justification? Make sure that we have all those validations in place. Then you can safely store your file and live happily ever after.
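A rough sketch combining those ideas for a Spring multipart upload: cap the size, strip any path from the client-supplied name with Commons IO, and detect the real content type from the bytes with Tika instead of trusting the extension or the Content-Type header. The allowed types, the 10 MB limit, and the class name are assumptions, not the speaker's code.

```java
import java.io.IOException;
import java.util.Set;
import org.apache.commons.io.FilenameUtils;
import org.apache.tika.Tika;
import org.springframework.web.multipart.MultipartFile;

// Illustrative sketch of upload validation; adjust limits and types to your use case.
public class UploadValidator {

    private static final long MAX_BYTES = 10L * 1024 * 1024;
    private static final Set<String> ALLOWED_TYPES = Set.of(
            "application/pdf",
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");

    private final Tika tika = new Tika();

    public String validate(MultipartFile file) throws IOException {
        if (file.getSize() > MAX_BYTES) {
            throw new IllegalArgumentException("file too large");
        }

        // Defuses path traversal: "../../etc/passwd" becomes "passwd".
        String safeName = FilenameUtils.getName(file.getOriginalFilename());

        // Tika inspects the actual bytes, so a renamed executable is still caught.
        String detectedType = tika.detect(file.getInputStream());
        if (!ALLOWED_TYPES.contains(detectedType)) {
            throw new IllegalArgumentException("file type not allowed: " + detectedType);
        }
        return safeName;
    }
}
```

The framework-level size cap the talk alludes to would sit alongside this, in the multipart upload limits of your application configuration.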
Exception Handling and Logging
Exception handling and logging is where we have to be a little bit careful as well. I find this really funny: whenever I’m using a service on the internet and an error occurs, I see, oh, they’re using this tech stack. That’s really cool. For me, it is; for somebody who doesn’t have good intentions, it might not be. Never expose the stack trace. Log the stack trace, because we as developers are going to rely on logs to do some debugging and try to fix production issues. Log it, but do not expose it. Return a friendly and helpful message. Please do not return something like “an error occurred, please get in touch with the administrator.” What does that mean? Return something that is helpful to whoever is seeing the message, but without exposing anything.
You’re not exposing the technology stack that you are using. Because what happens is, if you expose the technology stack, a person who does not have good intentions might think, let me see if there are any vulnerabilities. You’re using Spring. Does Spring have any vulnerabilities that I can try to exploit? That is one of the reasons. Again, if you’re using Spring, there is one line of code that you can add to your properties or YAML file to not expose the stack trace. Also, be careful with what you are logging. We’ve watched some talks during this conference about how we as developers are responsible. We have to be accountable for the code that we are writing. The beautiful thing about being a developer is that you can work in any industry. With power comes responsibility. Different industries will have different regulations, so make sure that you’re not logging the password, even for debugging purposes.
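One common way to get both halves, a full stack trace for developers and a friendly, non-revealing message for callers, is a global exception handler. A hedged sketch (the class name, message text, and error-id scheme are assumptions, not the speaker's code):

```java
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

// Illustrative sketch: log everything, expose nothing but a reference the caller
// can quote back to support.
@RestControllerAdvice
public class ApiExceptionHandler {

    private static final Logger log = LoggerFactory.getLogger(ApiExceptionHandler.class);

    @ExceptionHandler(Exception.class)
    public ResponseEntity<String> handleUnexpected(Exception ex) {
        String errorId = UUID.randomUUID().toString();
        // Full details, including the stack trace, go to the log only.
        log.error("Unhandled error {}", errorId, ex);
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                .body("Something went wrong processing your request. Reference: " + errorId);
    }
}
```

The reference id also answers the later question about finding the exact error in the logs without returning sensitive details to the caller.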
If you work with personally identifiable information, like first name, last name, email, phone, address, anything that can help to identify a person, do not log it. We have several regulations: GDPR, California has the California Privacy Act, and other states are passing their own regulations. We have to study our programming language, and at the same time we have to keep ourselves up to date with all these regulations that can impact our jobs as well, to make sure that we’re being ethical and we are writing code that does not infringe any of those laws. The same goes for financial information, health care data, and any kind of confidential business information. Log something that is still helpful to you, that helps you debug those production issues, but do not log something that is sensitive.
One of the places to remove that sensitive data, especially if you’re using toString to log something, is the toString itself: remove any sensitive data from it. There are annotations that can do this. I personally prefer not to use annotations for this, because, again, you can forget to annotate a new property you’re adding. I like to explicitly define my toString here, so I can actually safely log that information if I have to. In case you do have to log user IDs or credit card numbers or any sensitive, confidential data, you can mask that data and still present it in a way that is helpful to you, or you can use vault tokens as well.
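A minimal sketch of that explicit, log-safe toString; the class and field names are made up for the example.

```java
// Illustrative sketch: an explicit toString that is safe to log, with sensitive
// values masked rather than printed.
public class PaymentInstruction {

    private String customerId;
    private String cardNumber;   // sensitive

    @Override
    public String toString() {
        return "PaymentInstruction{customerId=" + customerId
                + ", cardNumber=" + mask(cardNumber) + "}";
    }

    private static String mask(String value) {
        if (value == null || value.length() < 4) {
            return "****";
        }
        // Keep only the last four characters, e.g. "************4242".
        return "*".repeat(value.length() - 4) + value.substring(value.length() - 4);
    }
}
```

Because the masking is written by hand rather than generated, adding a new sensitive field forces you to decide explicitly how it appears in logs.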
There are many different ways that you can do this, in case you absolutely have to log it. Be very careful with that. Last but not least, apply rate limits to your APIs. There are many flavors in the industry; it all depends on the size of your application. If you need something that is very quick and easy, you can use Spring AOP. There’s also a great library, Bucket4j. If you need a more robust enterprise solution, Redis for the win, among other solutions out there. Do apply them, because in case your API does have any kind of vulnerability, at least here you’re going to prevent some data mining. If you have some rate limit, you can control the damage that’s done. If you cannot have it all, at least try to apply a few validations and a rate limit, so you can decrease the size of the damage.
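A rough sketch of an in-memory token-bucket limiter with Bucket4j (written against the 8.x-style builder API; check the exact API of the version you use). The one-bucket, 100-requests-per-minute policy is an assumption for illustration.

```java
import java.time.Duration;
import io.github.bucket4j.Bandwidth;
import io.github.bucket4j.Bucket;
import io.github.bucket4j.Refill;

// Illustrative sketch: a single bucket allowing 100 requests per minute. In a real
// API you would typically keep one bucket per client key or IP address.
public class SimpleRateLimiter {

    private final Bucket bucket = Bucket.builder()
            .addLimit(Bandwidth.classic(100, Refill.greedy(100, Duration.ofMinutes(1))))
            .build();

    public boolean allowRequest() {
        // Returns false once the minute's tokens are used up; the caller should
        // answer with HTTP 429 Too Many Requests in that case.
        return bucket.tryConsume(1);
    }
}
```

For distributed deployments, the same idea is usually backed by a shared store such as Redis instead of an in-memory bucket.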
Testing
Testing. After all we’ve talked about, of course, we have to test all of this. It’s not only our business logic. For testing, make sure that you are adding those exception edge cases as well to your testing. If you only care about percentage of code coverage, this is not going to add any code coverage to your reports, but at least you are testing if you have your validations in place. You know if your security checks are in place.
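A small sketch of what testing the validation rules themselves can look like, reusing the CreateCourseRequest record sketched earlier (an assumed name) and assuming a Bean Validation implementation such as Hibernate Validator is on the test classpath.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import jakarta.validation.Validation;
import jakarta.validation.Validator;
import org.junit.jupiter.api.Test;

// Illustrative sketch: exercise the allow-list rules directly, including the
// "bad input" edge cases that never show up in happy-path coverage.
class CreateCourseRequestTest {

    private final Validator validator =
            Validation.buildDefaultValidatorFactory().getValidator();

    @Test
    void rejectsNamesWithDisallowedCharacters() {
        var request = new CreateCourseRequest("Robert'); DROP TABLE courses;--");
        assertFalse(validator.validate(request).isEmpty(), "expected a violation");
    }

    @Test
    void acceptsAPlainName() {
        var request = new CreateCourseRequest("Intro to Security 101");
        assertTrue(validator.validate(request).isEmpty());
    }
}
```

These tests barely move the coverage number, as the talk notes, but they document and protect the security checks themselves.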
One of the things that really helps me, especially when I have to write this kind of test data: you can use other data sources for this. You can have your invalid data in some sort of file and load it. There are many ways of doing this. In case you’re writing the data yourself, use AI to help you with this. You write two or three cases, and then the AI is going to pick it up and generate the rest of the test data for you. This is a way to improve that as well.
The AI Era
Again, we are in our AI era here, so make sure that you are taking advantage of that. If you are starting to work on projects with AI, because, of course, now it’s AI, our companies are going to ask, can you just put an AI on that? Just make sure that we have an AI. In case you are working on one of those projects and you are handling prompt engineering, make sure to validate and sanitize that as well. This is a really cool comic. Make sure that you are validating and sanitizing your input. It doesn’t matter what the project is, always validate and sanitize. Use AI as an ally here. It’s a great IntelliSense tool. I really like to use it as my best friend coding with me.
If you’re not sure how to write a unit test for a validation, just ask Copilot, CodeWhisperer, or whatever tool you are using; it can help you with that. In case you’re using GitHub, they’re coming out with a lot of services now. I really think this is about adding the security within the pipeline itself. Keep your dependencies up to date; that also helps a lot. Add some code scanning for security vulnerabilities, and make sure that you’re not exposing those passwords; it can also help a lot with that if you do have access to services like this. Of course, there are a lot of other services in the industry as well. It really depends on what your company is using. There are great services out there with which you can achieve a very similar result.
Education and Training
Of course, you’re not going to go back tomorrow and say, team, I think we need to start incorporating a little bit more security in our code. This change does not happen overnight. It is a slow process. We need to mentor junior developers on this, and the rest of our team as well. This is a work in progress over many months. One of the things that I like to do with the folks that I work with is, whenever we’re having demos of the product, I’ll start asking questions. This is a really nice, cool feature; you’re handling a file upload? Are you checking the file name? Are you validating that? Or, if we have some RESTful API, what are you using for validation? Start asking questions.
Next time you’re having those sessions, ask the same questions again. Eventually the team starts thinking, next time she’s going to ask about that, so let’s just add it, and when she asks, we’ve already done it. That’s a different way of doing it. Provide feedback. Make sure that security is part of your user stories and part of the requirements, so we can start to incorporate it as part of the development. One thing that I like to use as well is a security checklist whenever I’m doing code reviews. This is only a suggestion; these are some of the things that I find most often in the code reviews that I do, some of the things that I usually check. Always be kind in the code reviews that you are doing. You can evolve from this and adapt it to something that works better for your team. Again, many flavors are available out there.
Questions and Answers
Participant 1: Do you have any recommendations for libraries for file content validation?
Groner: It really depends on what kind of validation you need. For example, for all the Word documents and spreadsheets that I handle, we usually do not allow macros, formulas, or embedded objects. For the content itself, it really depends on the use case that you have. It can be something manual, or you can use an OCR tool to help you with that as well. It’s really going to depend.
Participant 1: Since you mentioned Excel files. We do have a use case where users upload Excel files. I was just wondering if there are any off-the-shelf libraries that we can use, or do we have to write custom code?
Groner: Depending on what you need, we usually write our own. We only validate for things that we do not allow. If you have a data table and you’re only trying to extract that data table, we’re going to run all the validations on all the types all over again and validate all the business logic to make sure that the data is what we are expecting. For that level of detail, we usually end up writing something ourselves. Depending on the use case, Google has services for that, and there are a few other services out there that you can try to use to help you go through it.
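For that kind of pre-parse gatekeeping, a small hand-rolled check is often where teams start. Below is a minimal sketch, not a complete solution: the allowed extensions, the size limit, and the class name UploadPreChecks are assumptions for illustration, and deeper content checks (macros, embedded objects, business rules) would still come afterwards.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;

// Minimal pre-parse checks for an uploaded spreadsheet; deeper content
// validation (macros, embedded objects, business rules) happens later.
public final class UploadPreChecks {

    private static final Set<String> ALLOWED_EXTENSIONS = Set.of("xlsx", "csv"); // assumption
    private static final long MAX_SIZE_BYTES = 10L * 1024 * 1024;                // assumption: 10 MB

    public static void validate(String fileName, long sizeBytes, InputStream content) throws IOException {
        if (fileName == null || fileName.contains("..") || fileName.contains("/")) {
            throw new IllegalArgumentException("Suspicious file name");
        }
        int dot = fileName.lastIndexOf('.');
        String extension = dot >= 0 ? fileName.substring(dot + 1).toLowerCase() : "";
        if (!ALLOWED_EXTENSIONS.contains(extension)) {
            throw new IllegalArgumentException("File type not allowed: " + extension);
        }
        if (sizeBytes <= 0 || sizeBytes > MAX_SIZE_BYTES) {
            throw new IllegalArgumentException("File size out of bounds");
        }
        // .xlsx files are ZIP archives, so the content must start with the bytes "PK".
        if (extension.equals("xlsx")) {
            byte[] header = content.readNBytes(2);
            if (header.length < 2 || header[0] != 'P' || header[1] != 'K') {
                throw new IllegalArgumentException("Content does not match declared type");
            }
        }
    }

    private UploadPreChecks() { }
}
```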
Participant 2: You talked about logging, what we should expose and what we should return. In our team this is a double-edged sword, in the sense that we don’t want to return sensitive information or expose our business logic, such as how we do our profile management. But when issues are escalated to our helpdesk or service centers, we can’t find the exact errors by looking at Splunk, because our APIs don’t return those important crumbs for us. How can we approach this better? Our architects suggested, for instance, that maybe we should use error codes, like, this is error code 2 or 3. Have you encountered this issue before? What should we do?
Groner: There are a few different ways that you can approach this. One is that you can definitely have your own dictionary of error codes, as you mentioned, just to help you a little with the debugging process. The other way around it is that you can try to mask the data. It will still be something that is meaningful and easy for you to consume, but not something that’s going to expose any sensitive data. Because often, when we’re running into production issues, it can be something like a software defect that we have to fix, but it can also be a data consistency issue.
Those cases are a little more difficult to debug. If you have masking that still preserves the nature of the data itself, you can work through them without actually having access to the database, or anything like that. That would be one of the approaches I would try. This is very specific, and it really depends on the business case, but it helps a little. The other thing that you can do is some kind of vault: for that data you have a token, and you can log the token, which can help you retrieve the data later. That would be another approach as well.
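One hedged sketch of combining those ideas, an error-code dictionary plus masked identifiers in log lines, could look like the following; the codes, the class name SafeLogging, and the keep-last-four masking rule are illustrative assumptions, not a prescription.

```java
// Illustrative only: a small error-code dictionary plus a masking helper,
// so logs stay searchable without exposing the raw identifier.
public final class SafeLogging {

    public enum ErrorCode {
        VALIDATION_FAILED("E001"),
        PROFILE_RULE_VIOLATION("E002"),   // internal business-rule failure, details stay server-side
        UPSTREAM_TIMEOUT("E003");

        private final String code;
        ErrorCode(String code) { this.code = code; }
        public String code() { return code; }
    }

    // Keep only enough of the value to correlate a support ticket with a log line.
    public static String mask(String value) {
        if (value == null || value.length() < 4) {
            return "****";
        }
        return "****" + value.substring(value.length() - 4);
    }

    public static String logLine(ErrorCode error, String accountId) {
        // e.g. "code=E002 account=****4821"
        return "code=" + error.code() + " account=" + mask(accountId);
    }

    private SafeLogging() { }
}
```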
Participant 3: Do you have any suggestions for any tool in the CI/CD pipeline to scan the code quality and check for security inside the code?
Groner: There are a few, like Snyk. There’s Sonar; depending on how you configure Sonar, you can try to catch those as well. Personally, we use Checkmarx a lot for code scanning. There is still an InfoSec team reviewing what Checkmarx flags to decide whether each finding is a real issue or not. There is Black Duck for any CVEs in dependencies. There are other tools on the market, but these are some that we use internally, globally across the organization. If you’re using GitHub, they’re rolling out a lot of features now, and they have code scanning. A lot of them are free to use if you’re actually on GitHub, but for a few of them you still need a license for the product in order to use them.
Participant 4: You mentioned having validation at all levels. One of the things we’ve done is pull that up: we don’t have authentication at every level, not even in the service layer, we handle it at the top level. For something like validation, we also have that at the top level, in a controller, rather than having every underlying service handle it, so that we don’t have to keep adding it in. Is there a difference, in your opinion, between authorization and authentication versus validation? Why do you do validation at every single level, and why is that different?
Groner: I think it really depends on the team itself. You can definitely do validation only at the controller level if you want to keep your service layer a little cleaner. I would still add it to the entity as well, because sometimes we make mistakes: you’re going to forget something at the controller level, so at least you have another layer protecting you. If your team has the discipline to always add those validations in the controller, and that’s working for you, that’s great; you can continue doing that. It also really depends on the nature of the project. If you have your controller calling your service, and maybe you’re using a microservices architecture where you don’t have many controllers, that works really well. But if you’re working in a monolithic application where you have thousands of controllers and thousands of services, and one controller references 10 different services, it becomes a little more complex, and you can actually make a mistake when you try to reuse that service from a different file. If you forget something there, that is one of the reasons I would say to add validation to all layers. It depends on the project; if what you’re doing is working for you, that is great. Authentication and authorization themselves are usually only handled at the highest layer, usually in the controller, if we’re talking about Java, Spring, or something like that. You don’t necessarily need to handle them in the service layer, unless you have a service calling another service; then you need some kind of authentication and authorization mechanism there as well, in case you are interfacing with a different service, like connecting to a different web service, or what have you.
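To make the layering concrete for the Java/Spring case mentioned above, here is a rough sketch using Jakarta Bean Validation, assuming Spring Boot 3 with the spring-boot-starter-validation dependency; the CustomerRequest and CustomerController names are made up. The constraints live on the model, so the controller boundary check via @Valid and any later re-validation deeper in the stack apply the same rules.

```java
import jakarta.validation.Valid;
import jakarta.validation.constraints.Email;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Size;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// Constraints live on the request model itself, so every layer that validates
// this object applies the same rules, even if the controller check is missed.
record CustomerRequest(
        @NotBlank @Size(max = 100) String name,
        @NotBlank @Email String email) {
}

@RestController
class CustomerController {

    // @Valid triggers the constraints at the controller boundary.
    @PostMapping("/customers")
    ResponseEntity<Void> create(@Valid @RequestBody CustomerRequest request) {
        // delegate to the service layer here; it can re-validate the same object
        return ResponseEntity.accepted().build();
    }
}
```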
Microsoft Launches Azure Confidential VMs with NVIDIA Tensor Core GPUs for Enhanced Secure Workloads

MMS • Steef-Jan Wiggers
Article originally posted on InfoQ. Visit InfoQ
Microsoft has announced the general availability of Azure confidential virtual machines in the NCC H100 v5 SKU, featuring NVIDIA H100 Tensor Core GPUs. These VMs combine the hardware-based data protection of 4th-generation AMD EPYC processors with the high performance of the GPUs.
The GA release follows the preview of the VMs last year. By enabling confidential computing on GPUs, Azure offers customers increased options and flexibility to run their workloads securely and efficiently in the cloud. These virtual machines are ideally suited for tasks such as inferencing, fine-tuning, and training small to medium-sized models. This includes models like Whisper, Stable Diffusion, its variants (SDXL, SSD), and language models such as Zephyr, Falcon, GPT-2, MPT, Llama2, Wizard, and Xwin.
The NCC H100 v5 VM SKUs offer a hardware-based Trusted Execution Environment (TEE) that improves the security of guest virtual machines (VMs). This environment protects against potential access to VM memory and state by the hypervisor and other host management code, thereby safeguarding against unauthorized operator access. Customers can initiate attestation requests within these VMs to verify that they are running on a properly configured TEE. This verification is essential before releasing keys and launching sensitive applications.
(Source: Tech Community Blog Post)
Responding to a LinkedIn post by Vikas Bhatia, head of product for Azure confidential computing, Drasko Draskovic, founder and CEO of Abstract Machines, commented:
Congrats for this, but attestation is still the weakest point of TEEs in CSP VMs. Current attestation mechanisms from Azure and GCP – if I am not mistaken – demand trust with the cloud provider, which in many ways beats the purpose of Confidential Computing. Currently – looks that baremetal approach is the only viable option, but this again in many ways removes the need for TEEs (except for providing the service of multi-party computation).
Several companies have leveraged the Azure NCC H100 v5 GPU virtual machine for workloads like confidential audio-to-text inference using Whisper models, video analysis for incident prevention, data privacy with confidential computing, and stable diffusion projects with sensitive design data in the automotive sector.
Besides Microsoft, the two other big hyperscalers, AWS and Google, also offer NVIDIA H100 Tensor Core GPUs. For instance, AWS offers H100 GPUs through its EC2 P5 instances, which are optimized for high-performance computing and AI applications.
In a recent whitepaper about the architecture behind NVIDIA’s H100 Tensor Core GPU (based on the Hopper architecture), the NVIDIA authors write:
H100 is NVIDIA’s 9th-generation data center GPU designed to deliver an order-of-magnitude performance leap for large-scale AI and HPC over our prior-generation NVIDIA A100 Tensor Core GPU. H100 carries over the major design focus of A100 to improve strong scaling for AI and HPC workloads, with substantial improvements in architectural efficiency.
Lastly, Azure NCC H100 v5 virtual machines are currently available only in the East US 2 and West Europe regions.
Podcast: Generally AI – Season 2 – Episode 4: Coordinate Systems in AI and the Physical World

MMS • Anthony Alford Roland Meertens
Article originally posted on InfoQ. Visit InfoQ

Transcript
Roland Meertens: Anthony, did you ever have to program a turtle robot when you were learning to program?
Anthony Alford: I’ve never programmed a turtle robot, no.
Roland Meertens: Okay, so I had to do this when I was learning Java and in robotics, the concept of a TurtleBot is often that you have some kind of robot you can move across the screen and it has some kind of pen, so it has some trace. So you can start programming, go upward or go forward by one meter, then turn right by 90 degrees, go forward by one meter, turn right by 90 degrees, so that way you trace a pen over a virtual canvas.
Anthony Alford: The Logo language was based on that, right?
Roland Meertens: Yes, indeed. So the history is that the computer scientist Seymour Papert created the programming language Logo in 1967, and apparently, these are things I didn’t know, they used this programming language to direct a big robot with a pen in the middle, which would let you make drawings on actual paper.
Anthony Alford: Okay.
Roland Meertens: It’s pretty cool, right?
Anthony Alford: It’s a bit of a plotter, a printer.
Roland Meertens: So apparently in 1967 people learned to program with a physical moving plotter. They immediately start with a robot-
Anthony Alford: That’s pretty cool.
Roland Meertens: Yes, instead of using a virtual canvas. It was round and crawled like a turtle. But the other thing I found is that turtle robots were first mentioned in the 1940s: they were invented by Grey Walter, who used analog circuits as brains, and his more advanced model could already go back to a docking station when the battery became empty.
Anthony Alford: That’s pretty cool. In the ’40s.
Roland Meertens: In the 1940s. Yes. I will put the video in the show notes and I will also put two articles in the show notes. One is the History of Turtle Robots. Someone wrote an article about it for Weekly Robotics and another article on the history of turtle robots as programming paradigms.
Anthony Alford: Very cool.
Roland Meertens: Yes.
Anthony Alford: Slow and steady.
Roland Meertens: Yes, slow and steady and is a great way to get started with programming.
At the Library [02:15]
Roland Meertens: All right, welcome to Generally AI, Season 2, Episode 4, and in this InfoQ Podcast, I, Roland Meertens, will be discussing coordinate systems with Anthony Alford.
Anthony Alford: How’s it going, Roland?
Roland Meertens: Doing well. Do you want to get started with your coordinate system research?
Anthony Alford: Let’s go for it. So I decided to go with an AI theme of coordinates and perhaps you can guess where I’m going. We’ll see.
Roland Meertens: Tell me more.
Anthony Alford: Well, in the olden days, a teacher, for example, in a history class would often ask me and other students to write a paper about some topic. So let’s say the topic is the Great Pyramid of Egypt. Now probably most students don’t know everything about the Great Pyramid and the teacher says anyway, “You have to cite sources”, so you can’t just write anything you want.
Roland Meertens: I always hate this part. Yes, I always say, “I found this on the internet. These people can’t lie”.
Anthony Alford: Well, I’m talking about the days before the internet. But in the 20th century, let’s say, we would go to the library, an actual physical building, and there would be a big drawer full of small cards, the card catalog. These are in alphabetical order, and so we’d scroll through till we get to the Ps and then P-Y, pyramid: Great Pyramid of Egypt, right?
Roland Meertens: Yes.
Anthony Alford: This card has a number on it. This is the call number for books that are about the Great Pyramid of Egypt. So in the US a lot of libraries use a catalog system called the Dewey Decimal system for nonfiction books. It’s a hierarchic classification system.
Books about history and geography in general, they have a call number in the range from 900 to 999. Within that, books about ancient history are in the range of 930 to 939. Books about Ancient Egypt specifically have call numbers that begin with the number 932. And then depending on what ancient Egyptian topic, there will be further numbers after the decimal point.
Roland Meertens: And maybe a weird question, but were you allowed to go through these cards yourself or did you ask someone else like, where can I find information about Ancient Egypt?
Anthony Alford: Both methods do work. If you’re young and adventurous, perhaps you’ll go to the card catalog and start rifling through. But yes, in fact, a lot of libraries had a person whose job was to answer questions like that: the reference librarian.
Roland Meertens: Yes, because I’m too young, I never saw these cards. My librarians would already have a computer they would use to search.
Anthony Alford: Right. But the point is that the card catalog is pretty familiar to us, and, speaking of search, the card catalog is an index. It maps keywords like Great Pyramid of Egypt to a call number, or maybe to multiple call numbers.
Actually, university libraries, in my experience in the US, they don’t use Dewey Decimal, they use a different classification, but the idea is the same. Anyway, it’s a hierarchy and it assigns a call number to each book.
So to go actually get the physical book, it’s hopefully on a shelf that’s in a cabinet. We call these stacks. That’s the lingo. So the classification hierarchy is itself mapped physically to these stacks. There will be a row of cabinets for the books that are in the 900 to 999 range and maybe one cabinet for the 930 to 939, and then maybe one shelf for 932 and so on. Now that I think of it, this structure is itself somewhat like a pyramid.
Roland Meertens: Perfect example.
Anthony Alford: Hopefully, if nobody’s messed with them, the physical order of the books matches the numeric order. So you’re doing an index scan or index search, if we’re thinking about it in terms of a database or information retrieval. Because that’s what this is, it’s literally information retrieval.
Roland Meertens: Yes. And it is good that it’s indexed by topic because otherwise you don’t know if you’re searching for P for pyramids or G for great pyramids or E for Egyptian great pyramids.
Anthony Alford: Right. If you’re not talking to the reference librarian, you might try all those keyword searches in the index. So now that I’ve got a couple of books, I can use that content in those books to help me produce my essay about the Great Pyramid.
Now that was the bad old days of the 20th century. Here in the 21st century, it’s like you said: you do an internet search or maybe you read Wikipedia. That’s just the first quarter of the 21st century. Now we’re into the second quarter of the 21st century, and we’re in a golden age of AI. We don’t have to even do that. We just go to ChatGPT and copy and paste the assignment from the syllabus web page as a prompt and ChatGPT writes the essay.
Roland Meertens: Quite nice, quite neat.
RAG Time [07:35]
Anthony Alford: Well, in theory. So there are a couple of problems. First, the teacher said, “Cite your sources”, and you have to do that: in the content, where you quote something, you need to put in a reference. Another thing is that ChatGPT is good, but maybe it’s not always a hundred percent historically accurate.
Roland Meertens: Yes, it sometimes makes up things.
Anthony Alford: And it really only knows things that are in its training data, which is large, but maybe there’s some really good books or papers that are not on the internet that might not be in that training data. So I think you know what is the answer.
Roland Meertens: Are we going to retrieve some data before we’re processing it?
Anthony Alford: Yes, it is RAG-time. So the key technology now is retrieval augmented generation, also known as RAG, R-A-G. So the general idea, we know that if we give an LLM some text, like the content of a history book, LLMs are quite good at answering questions or creating summaries of that content that you include with your prompt.
Now ignore the problem of limited context length, which is a problem. The other problem is: how do you know what content to provide it?
Roland Meertens: Yes, you can’t give it the entire stack of books.
Anthony Alford: Exactly. And even if you had the content electronically, and you had picked it out, you want to automate this, right? You don’t want to have to go hunt down the content to give to the LLM.
So finding the right history book, the right content in an electronic database of content, well, we already said it. This is information retrieval. And again, in the old days we’d use natural intelligence: we would use the reference librarian or go look up some keywords in the card catalog.
Roland Meertens: It is too bad that the librarian is not very scalable.
Anthony Alford: Exactly right. We want to automate this and scale it. So let’s take an analogy. The key idea of RAG is: take your LLM prompt and automatically assign it a call number. So now you can go directly from your prompt—your instructions for writing the essay—now we have automatically assigned it a call number, and now you just go get those books automatically and add that with your prompt.
Roland Meertens: Sounds pretty good.
Anthony Alford: Yes, more precisely: we have an encoder that can map an arbitrary blob of text into a point in some vector space with the requirement that points that are near each other in this space represent texts that have similar meanings. So typically we call this an embedding.
So we take an encoder, we apply the encoder to the prompt that turns that into a vector. Then we have all of our books in the universe, we have encoders applied to them, and we get vectors for them. We find the vectors that are close to the vector for our prompt. So easy-peasy, right?
Roland Meertens: Easy-peasy.
Anthony Alford: Right. Well, here’s the problem. So the encoder-
Roland Meertens: Encoding your data.
Anthony Alford: Well, there’s that. Well, I’m just going to assume somebody encoded all the books in the library. That’s a one-time job. The problem is that people usually use BERT as the encoder. Well, the embedding vector that you get is 768 dimensions. And so the question is: what does it mean to be nearby something in a 768-dimensional space?
Roland Meertens: Yes, that depends on what distance function you want to use.
Anthony Alford: That’s exactly right. With call numbers, it was easy because they’re scalars. So the distance function is: subtract.
Roland Meertens: Oh, it’s quite interesting. I never even realized that call numbers could be subtracted.
Anthony Alford: Well, that’s how you do it, right? If you go to find your book 932.35, you probably don’t do a scan. You probably do some kind of bisecting search, or you know you need to go over to the 900s, and then you jump to the middle of the 900s and scan back and forth depending on the number that you’re at.
Roland Meertens: And also for the library, it of course makes sense that they put books which are similar close together.
Anthony Alford: Yes, well, you physically store them in order of their call number.
Roland Meertens: Yes.
Cosine Similarity [12:04]
Anthony Alford: More or less. Anyway, like you said, the closer this distance is to zero, that is, the closer the two call numbers are together, the closer together the books are physically.
So anyway, we need a distance function, or the opposite of a distance function, which is a similarity, right? The smaller the distance, the more similar. In the case of these embeddings, people typically use a similarity measure called cosine similarity. Now, if you’ve ever worked with vectors, you probably remember the inner product or sometimes called the dot product.
To explain this without a whiteboard, let’s say we’re in 3D space. So each vector has X, Y, and Z. The dot product of two vectors is you take the X from the first one, multiply by the X from the second one. Then you do that for the two Y components, and then the two Z components, you add those all up. That’s the dot product. And that’s a single number, a scalar.
Roland Meertens: Yes.
Anthony Alford: The geometric interpretation of the dot product is: it’s the length of the first one times the length of the second, and then times the cosine of the angle between them. So you could divide the dot product by the length of the two vectors, and what you’re left with is the cosine of the angle. And if they’re pointing in the same direction, that means the angle is zero and the cosine is 1. If they point in the opposite directions, the cosine is -1. And in between there, if it’s zero, they’re at right angles.
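A tiny sketch of the arithmetic being described here, with made-up 3D vectors (the numbers are arbitrary):

```java
// Dot product and cosine similarity for the 3D example discussed above.
public final class CosineSimilarity {

    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];  // x*x' + y*y' + z*z' in the 3D case
        }
        return sum;
    }

    static double magnitude(double[] v) {
        return Math.sqrt(dot(v, v));
    }

    static double cosineSimilarity(double[] a, double[] b) {
        // Dot product divided by the two lengths leaves cos(angle): +1 same
        // direction, -1 opposite, 0 at right angles.
        return dot(a, b) / (magnitude(a) * magnitude(b));
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0};
        double[] b = {2.0, 4.0, 6.0};   // same direction as a, so similarity is 1.0 (within rounding)
        double[] c = {-1.0, 0.0, 0.5};
        System.out.println(cosineSimilarity(a, b));  // ~1.0
        System.out.println(cosineSimilarity(a, c));  // a value between -1 and 1, here slightly positive
    }
}
```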
Roland Meertens: Yes, intuitively, you always think that it doesn’t really matter what the magnitude of the interest is, as long as the interests are at least in the same direction, it is probably fine in your library.
Anthony Alford: Yes, and I’m going to explain why. Anyway, the cosine similarity is a number between -1 and +1. And the closer that is to +1, the nearer the two embeddings are for our purposes. And you may wonder why cosine similarity. So again, with 3D space, X, Y, and Z, there’s a distance called the Euclidean distance, which is our normal “The distance between two points is a straight line”, right?
Roland Meertens: Yes.
Anthony Alford: So you basically take the square of the Xs, the square of the Ys, and the square of the Zs, add them up, and take the square root.
Roland Meertens: As long as we are in a Euclidean space, that’s the case.
Anthony Alford: And in vector terms, that’s just the magnitude of the vector drawn between those two points. Well, if you wonder why you don’t use that, why instead you use cosine similarity, if you look on Wikipedia, it’s something called the curse of dimensionality.
Basically, when you have these really high-dimensional spaces, you might expect the points to be uniformly spread around, but they actually aren’t. The middle of the space and the corners of the space are empty-ish, and most of the points are actually concentrated near the surface of a sphere in the space.
So when all the points are on a sphere, their magnitudes are more or less all the same. And so you don’t care about them. And so the thing that makes them different points is there are different angles. They are at different angles relative to some reference. So that means we don’t care about the magnitude of vectors in the space, we care about the direction, and that’s why cosine similarity.
Roland Meertens: Is there any reason that the magnitude of the vectors tends to be the same?
Anthony Alford: It’s just the way that these sparse high-dimensional spaces…it’s just the math of how they work out. And in fact, because the magnitudes are all more or less the same, or at least very close, you can take a shortcut: you can just use the dot product. You don’t have to compute the cosine similarity, you can just do the dot product. That’s a nice shortcut because GPUs are very good at calculating dot products.
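That claim is easy to sanity-check with a throwaway simulation: draw random Gaussian vectors in a BERT-sized space and look at the spread of their magnitudes. The dimension, sample count, and seed below are arbitrary assumptions.

```java
import java.util.Random;

// Quick check of the claim that random high-dimensional vectors all end up
// with roughly the same magnitude (so only their direction matters).
public final class NormConcentration {

    public static void main(String[] args) {
        int dimensions = 768;      // e.g. a BERT-sized embedding
        int samples = 10_000;
        Random random = new Random(42);

        double min = Double.MAX_VALUE, max = 0, sum = 0;
        for (int i = 0; i < samples; i++) {
            double squared = 0;
            for (int d = 0; d < dimensions; d++) {
                double component = random.nextGaussian();
                squared += component * component;
            }
            double norm = Math.sqrt(squared);
            min = Math.min(min, norm);
            max = Math.max(max, norm);
            sum += norm;
        }
        // For 768 dimensions the norms cluster tightly around sqrt(768) ≈ 27.7.
        System.out.printf("min=%.2f mean=%.2f max=%.2f%n", min, sum / samples, max);
    }
}
```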
And so let’s back up, right? We take our prompt, we encode it, and we’ve already encoded the content of the whole library. We just find the vectors in the library that have the largest dot product with our prompt vector. And in the original RAG paper they did exactly that. It’s called maximum inner product search. So basically you take your query’s vector, you do the dot product with the vectors of all the documents, and you take the ones that have the biggest values.
What’s the problem now? I bet you know.
Roland Meertens: What is the problem?
Anthony Alford: Well, the problem is you have to—basically every time you have a new prompt, you have to go and calculate the dot product against every other document.
Roland Meertens: If only there was a better way to store your data.
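A brute-force version of the maximum inner product search just described is only a few lines, which also makes its linear cost obvious; the tiny three-document "library" below is entirely made up.

```java
import java.util.Arrays;
import java.util.Comparator;

// Brute-force maximum inner product search: score every document vector
// against the query and keep the best ones. Linear in the number of
// documents, which is exactly the scaling problem discussed above.
public final class BruteForceMips {

    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    static int[] topK(double[] query, double[][] documents, int k) {
        Integer[] indices = new Integer[documents.length];
        for (int i = 0; i < documents.length; i++) {
            indices[i] = i;
        }
        // Sort document indices by descending dot product with the query.
        Arrays.sort(indices,
                Comparator.comparingDouble((Integer i) -> dot(query, documents[i])).reversed());
        int[] best = new int[Math.min(k, indices.length)];
        for (int i = 0; i < best.length; i++) {
            best[i] = indices[i];
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up "library": three documents in a 3-dimensional embedding space.
        double[][] documents = {
            {0.9, 0.1, 0.0},
            {0.0, 1.0, 0.0},
            {0.7, 0.7, 0.1}
        };
        double[] query = {1.0, 0.2, 0.0};
        System.out.println(Arrays.toString(topK(query, documents, 2)));  // prints [0, 2]
    }
}
```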
Who Is My Neighbor? [16:50]
Anthony Alford: Well, there’s a better way to search, it turns out. The default way is linear complexity. So for a small library, it may be no big deal, but if we’re talking about every book ever written, well, if you compare it with an index search in a database, that’s complexity around log(n). So linear is way worse. It’s terrible. So again, it turns out this is a well-studied problem, and it’s called nearest neighbor search.
Roland Meertens: Approximate nearest neighbors or exact nearest neighbors?
Anthony Alford: Well, one is a subset of the other. So if you go back to the database search, that’s log(n), and you can actually use a tree structure for nearest neighbor search. You can use something called a space partitioning tree and use a branch and bound algorithm. And with this strategy, you’re not guaranteed log(n), but the average complexity is log(n). But this usually is better in a lower dimensional space.
Roland Meertens: Okay, so why is it on average? Do you keep searching or-
Anthony Alford: Well, I think it is just like you’re not guaranteed, but based on the statistics, you can mathematically show that on average you get a log(n) complexity. But remember your favorite algorithm-
Roland Meertens: My favorite algorithm.
Anthony Alford: What was your favorite algorithm?
Roland Meertens: HyperLogLog.
Anthony Alford: Right. So, you already said it, approximate nearest neighbor. When you want to do things at scale, you approximate. So it turns out that a lot of RAG applications use an approximate nearest neighbor search or ANN, which also stands for “artificial neural network”. But just a coincidence.
So there are several algorithms for ANN and they have different trade-offs between speed and quality and memory usage. Now, quality here is some kind of metric like recall. So with information retrieval, you want to get a high recall, which means that of all the relevant results that exist, your query gives you a high percentage of those.
One of the popular algorithms lately for ANN is called hierarchical navigable small world, or HNSW. HNSW is a graph-based approach that’s used in a lot of vector databases. I actually wrote an InfoQ news piece about Spotify’s ANN library, which uses HNSW.
Roland Meertens: Oh, is it Voyager?
Anthony Alford: That’s correct, yes. You must have read it.
Roland Meertens: Oh, I tried it. It’s pretty cool.
Anthony Alford: Oh, okay. Well, you know all about this stuff.
Roland Meertens: Oh, I love vector searching.
Anthony Alford: So I found a nice tutorial about HNSW, which we’ll put in the show notes, and it expressed a very nice, concise definition:
Small world, referring to a unique graph with low average shortest path length and a high clustering coefficient; navigable, referring to the search complexity of the subgraphs, which achieve logarithmic scaling using a decentralized greedy search algorithm; and hierarchical, referring to stacked subgraphs of exponentially decaying density.
So all of this to find out: who is my neighbor?
Roland Meertens: Who is your neighbor, and in which space are they your neighbor?
Anthony Alford: Yes. So I think I’ve filled up my context window for today. And for homework, I will let our listeners work out analogies between this topic and library stacks and pyramids.
Roland Meertens: For library stacks, I’m just hearing that they could have multiple boxes with the stacks and you just move from box to box, from room to room.
Anthony Alford: So here’s a very interesting thing. Here in my hometown, there’s a university, North Carolina State University, their engineering library has a robot that will go and get books out of the stacks. It’s basically an XYZ robot, and it’ll move around and get books out of the stacks for you.
Roland Meertens: Oh, nice. That’s pretty cool.
Anthony Alford: Yes, it looks really cool.
Roland Meertens: Always adding an extra dimension, then you can represent way more knowledge.
Anthony Alford: So that’s my fun fact.
Roland Meertens: That’s a pretty good fun fact.
Real World Coordinate Systems [22:02]
Roland Meertens: All right. For my fun fact for today, as the topic is coordinate systems: there are many ways to represent a map location in software. This can be important for your user data, maybe for helping people find where they are or for finding interesting locations close by, and the most popular format here is WGS84.
But what I wanted to dive into is the history of coordinate systems, especially how different countries chose them, and some of the legacy systems which are still in place. The history of coordinate systems is of course much longer than the history of computers: people have wanted to know who owned what land for quite a long time, and people have wanted to know how to get somewhere for quite a long time.
And, first of all, there’s different ways to project a map. So you want to have a map in 2D, and our Earth is a sphere. In that way, you can project a sphere onto a cylinder, a cone, or just a flat disc on top of the sphere, and you always get some kind of compromise.
So you can choose to keep the angles of the map accurate. That’s, for example, the Mercator projection used by Google Maps. So if you’re going on Google Maps, you’re zooming out, then all the angles are preserved, but the sizes are not very true.
One fun question, by the way, maybe you can help me out with this, Anthony, is that I always ask what is bigger, Greenland or Australia, and by how much?
Anthony Alford: Oh, Australia is quite large. And again, I think the Mercator projection distorts our view of Greenland for those of us who are familiar with it. Australia is much larger, but I couldn’t tell you like by a factor or whatever.
Roland Meertens: Yes, I like to ask this to people because first I ask them what they would estimate and then I show them the Google Maps projection and I ask them if they want to change their guess. And sometimes people change it in the wrong direction, even though they know that Mercator doesn’t preserve size, even though they know that the map is lying. They just can’t get around the fact that Greenland looks really big on the map.
So if you want to fix this, you can use the Mollweide equal-area projection to ensure that all the map areas have the same proportional relationships to areas on the Earth. And the other thing you can do is if you want, for example, to keep the distance constant, there are equidistant projections that have a correct distance from the center of the map.
So this is useful for navigation, for example, if you want something centered around the UK, so that you at least know that going to one place is as far as going to another. And here is another fun fact for you: the azimuthal equidistant projection is the one they use for the emblem of the United Nations, the emblem where you see the map from the North Pole; that is an azimuthal equidistant projection where the distance from the center is correct.
Anthony Alford: Okay, nice.
UK Ordnance Survey Maps [25:27]
Roland Meertens: But as I said, I wanted to talk a bit about other systems in the world, which projections they picked, and perhaps some of the technical debt and the incredibly smart choices they made when doing so.
And, first off, in the UK they have the Ordnance Survey maps; Ordnance Survey is basically the national mapping agency for Great Britain. And in a previous episode of Generally AI, I already told you about the multiple telescopes in the observatory in Greenwich, right?
Anthony Alford: Right. Yes.
Roland Meertens: And I think I also told you that they have multiple telescopes which all have a different prime meridian line, which indicates zero or used to indicate zero. I discovered that the Ordnance Survey meridian was picked in 1801, which is 50 years before this newer prime meridian was released. And nowadays with GPS, the prime meridian moved again. But the Ordnance Survey Maps are basically two prime meridian switches away from what it used to be.
Anthony Alford: I don’t know, but I’m guessing from the name that they would, in the worst case scenario, use these maps to choose targets for artillery. So hopefully they don’t miss.
Roland Meertens: No, actually what I think is probably a good reason to keep the Ordnance Survey Maps the same is that they probably use it to determine whose land belongs to whom.
Anthony Alford: Sure.
Roland Meertens: So you want to be able to keep measuring in the old way as you already determined who owns what land.
Anthony Alford: Makes sense.
Roland Meertens: Otherwise, as we will see later in this episode, you start publishing error maps like the Netherlands is doing. But it’s interesting that since 1801, when they picked this survey meridian, they were for a long time simply six meters to the east of what people had started to call zero.
I can also imagine that this is still confusing nowadays if people use their own GPS device, compare it to some older document from the 1800s, and discover that their place is farther away from where they thought it should be. But I’ll post an article about this Ordnance Survey zero meridian in the show notes.
Netherlands Triangle Coordinate System [27:49]
Roland Meertens: Anyway, moving to a different country: in the Netherlands, the geographic coordinate system, the one used in GIS systems, is called Rijksdriehoekscoördinaten. So it’s a “national triangle coordinate system”. And as you can already guess, this mapping is accurate in angles, and Wikipedia says it approaches being accurate in distances, so it’s not quite accurate in distances.
Anthony Alford: Oh, I see. And so I guess it’s basically you need to orient in the right direction, but the distance is approximate? Is that-
Roland Meertens: Well, the thing is that if you have these coordinates, the angles between your coordinates are the same as the angles in the real world.
Anthony Alford: Cosine distance!
Roland Meertens: Yes. So the coordinates are in kilometers and meters, right? It’s just that one kilometer in coordinates isn’t necessarily one kilometer in the real world. The center of the map is a church in Amersfoort, basically in the center of the Netherlands. Around there, the scale is 10 centimeters per kilometer too small.
Anthony Alford: Interesting.
Roland Meertens: Yes, I mean, it’s not a big error, it’s just only 10 centimeters.
Anthony Alford: This reminds me again of the last season where the king found out that his land was smaller than the map said it was.
Roland Meertens: Yes. So if you would take the Dutch triangle coordinate system and then determine that you’re going to walk 10 kilometers in the center of the Netherlands, you would have walked one meter too little after walking 10 kilometers.
Anthony Alford: Would you even notice though, right?
Roland Meertens: Indeed, you probably wouldn’t. On the edges, so if you go towards the coast areas into Germany, it’s 18 centimeters per kilometer too large.
Anthony Alford: So you could wind up in Germany and not know it…or would you know it? You might know.
Roland Meertens: You will find out that you’re crossing the border because it says you’re crossing a border.
Anthony Alford: Well, wait, Schengen, you guys are all…you just walk, right?
Roland Meertens: Yes, from where my parents live, you can very easily cycle to Germany. But it’s interesting that because you have such a small country, you can project things in a flat way and-
Anthony Alford: And the country is rather flat as well, I believe.
Roland Meertens: The country is rather flat as well. Yes, indeed. I will get to the height of the Netherlands actually, because that’s also interesting because they use different landmarks than the landmarks used for the triangle coordinate system.
Anthony Alford: Okay.
Roland Meertens: So as I said, for the triangle coordinate system, the center of the coordinate system, let me tell you a fun fact about that first. That center is a church in Amersfoort. And if you look at the coordinates, there’s an X and a Y component, where X goes from west to east and Y goes from south to north. That’s relatively simple.
But the X coordinates are between zero and 280 kilometers. The Y coordinates in the Netherlands are between 300 and 625. So (0,0) is basically somewhere to the north of Paris. And the nice trick here, which I think is just genius, is that all the coordinates in the Netherlands are positive and the Y coordinates in the Netherlands are always larger than the X coordinates-
Anthony Alford: Interesting.
Roland Meertens: … at least within the continental Netherlands. So this removes all the possible confusion around which coordinate is which. If I give you two coordinates, I don’t even have to tell you this is X, this is Y.
Anthony Alford: Got it.
Roland Meertens: I can turn them around, I can flip them around. Because as a software engineer, whenever it says coordinates, you get two numbers. I always plot latitude, longitude, trying out combinations to make sure that everything is correct. And here in the Netherlands, if only people would use the national triangle coordinate system, there would be no confusion in your software.
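That convention is easy to exploit in code: given two numbers known to be RD values in kilometers, the larger one must be Y. A small sketch, with the range checks based only on the bounds mentioned above and an example pair roughly at the Amersfoort reference point:

```java
// Orders a pair of Dutch RD (Rijksdriehoek) values as (x, y) using the
// convention that y is always the larger of the two within the Netherlands.
public final class RdCoordinates {

    public static double[] orderAsXy(double a, double b) {
        double x = Math.min(a, b);
        double y = Math.max(a, b);
        // Rough sanity check against the ranges mentioned above (in kilometers).
        if (x < 0 || x > 280 || y < 300 || y > 625) {
            throw new IllegalArgumentException("Values do not look like RD coordinates in km");
        }
        return new double[] {x, y};
    }

    public static void main(String[] args) {
        // The same pair in either order resolves to the same (x, y);
        // 155/463 is roughly the Amersfoort reference point mentioned above.
        System.out.println(java.util.Arrays.toString(orderAsXy(463, 155)));  // [155.0, 463.0]
        System.out.println(java.util.Arrays.toString(orderAsXy(155, 463)));  // [155.0, 463.0]
    }
}
```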
Anthony Alford: Is that a thing that most Netherlanders are aware of?
Roland Meertens: Probably not. I must also say that this coordinate system is not used a lot. Probably mostly for people who are doing navigation challenges or scouting or something.
Although I must say that it is quite nice to take one of those maps, because they are divided in a very nice way. It’s very clear how far everything is, because with latitude and longitude, the distance covered by one degree of latitude or longitude is different depending on where you are on Earth, right?
Anthony Alford: Yes. But there’s a conversion to nautical miles, but I can’t remember it off the top of my head.
Roland Meertens: That’s a good point. I wanted to say in the Netherlands it’s fixed, but we just learned that it’s 10 centimeters per kilometer too small in the center and 18 centimeters per kilometer too large in the edges.
Anthony Alford: But part of the original development of the metric system was to take the circumference of the Earth and define the meter as a fraction of it. I don’t think it quite worked out.
Roland Meertens: I think there’s also a map system where they try to keep the patches the same area, but then you get problems when you want to move from patch to patch. So if you have coordinates or if you have a route which crosses multiple patches, one point on one patch doesn’t necessarily map to the same place on another patch.
Anthony Alford: It’s a tough problem.
Roland Meertens: Yes, and that’s why I like to talk about it. There’s a lot of technical debt, and it becomes more difficult once you start doing things with software or self-driving cars or things like that.
In terms of technical debt, the original map of the Netherlands was made between 1896 and 1926. As you can imagine, we now have far more accurate mapping tools, but I already alluded to the fact that once you’ve mapped out a place and said, this is your property, you can’t really say, oh, there’s a new coordinate system, let’s go measure everything again and reassign it all.
So what they do in the Netherlands, I think on three different occasions they published a correction grid with corrections up to 25 centimeters. So you can take an original coordinate and then apply the correction grids to get the coordinates in what is actually measured.
Anthony Alford: Gotcha. Well, not to derail your talk, but here, again, in North Carolina we have a border with another state, South Carolina, and about 10 years ago they had to adjust it. Basically the border had become ambiguous. It was unclear where it actually was. And so they fixed it and agreed on where the border is. And there were some people who woke up one morning in a different state without having to move.
Roland Meertens: I can tell you one other fun fact about borders in the Netherlands and between Germany and that is that in the Netherlands after World War II, there were some proposals around like, can we maybe have some part of Germany to make up for the Second World War?
So they got a few parts of Germany, but those are super small regions like a village or something. And this wasn’t really working out, taking a long time to move people, make sure everything was working well, build schools, et cetera.
So at some point they gave it back, but in the weeks before they gave this territory back, big trucks would already start moving in with loads of goods in them. They would find places in the village to park, and hours before the transition happened, more trucks would show up with loads of butter inside. So basically at 12 o’clock at night the territory swapped countries, and these goods never crossed a border, so they didn’t have to pay taxes.
Anthony Alford: Loophole.
Roland Meertens: Yes. So they found a loophole which you could only do one night because some parts changed country overnight.
Anthony Alford: Interesting.
Roland Meertens: One last fun fact here about coordinate systems. You already said the Netherlands is quite flat; good point. But this grid only gives you XY coordinates, and it’s mostly based on the locations of church towers, which were used to measure angles between. That’s quite neat: those are relatively permanent places, and you can see them from one another.
There’s a separate mapping for height above sea level, the Amsterdam Ordnance Datum (Normaal Amsterdams Peil), and this is actually used in a lot of Western European countries. These reference points are indicated by screws on specific buildings. And I know this because once in high school we had to make an accurate map of a field close to the school, and I was tasked with propagating the height from one of these screws to the rest of the field.
Anthony Alford: Wow.
Roland Meertens: We actually had the systems they use in professional surveying setups.
Anthony Alford: The surveying tools…a transit.
Roland Meertens: There was an instrument that was set up perfectly level, and then we would stand somewhere with a height meter, measure the difference in height, place the measuring device somewhere else, and have the person with the height meter stand somewhere else.
We also had to do it twice because the first time we made a mistake, I don’t know anymore what we did, but it’s just teenagers trying to come up with a way to measure a field.
Anthony Alford: Very cool.
Words of Wisdom [37:40]
Roland Meertens: All right. Words of wisdom. Did you learn anything in this podcast or did you learn anything yourself recently?
Anthony Alford: The fact that all the points in a high dimensional space are on a sphere was new to me. Maybe not all, but the fact that they all more or less have similar magnitude. That was an interesting fact that I was not aware of.
Roland Meertens: You would say that that means that there is space in the high dimensional space left over. The place in the middle and the corners could be utilized to store more information.
Anthony Alford: One would think, but then that would mess up the assumption of the cosine distance.
Roland Meertens: Yes, but more space to store. It’s free. It’s free storage.
Anthony Alford: Just add another dimension.
Roland Meertens: Yes, that’s why I always throw all my stuff on the floor in my room. I pay for it, I can store it wherever I want, everywhere in the space.
Anthony Alford: Definitely.
Roland Meertens: One thing from my side in terms of learning things, one recommendation I want to give you is, have you heard of the post office scandal in the UK?
Anthony Alford: No. Tell me.
Roland Meertens: It’s quite interesting. So the post office in the UK adopted a bookkeeping system by Fujitsu called Horizon, and it was basically plagued with bugs. Sometimes the system would duplicate transactions, sometimes it would change some balance when users would press enter at some frozen screen multiple times. So you’re like, oh, it’s frozen…let’s press enter.
Every time something would happen with your balance.
And it was possible to remotely log into the system. So Fujitsu or Horizon could remotely change the balances on these systems without the postmasters knowing. And I learned last week that rather than acknowledging these bugs, these postmasters were sued for the shortfalls in the system because the system would say, you owe us £30,000.
Anthony Alford: Oh, wow.
Roland Meertens: Yes. And so these postmasters were prosecuted, got criminal convictions, and this is still going on and still not fully resolved today.
Anthony Alford: That’s terrible.
Roland Meertens: It is absolutely insane. So I watched this drama series called Mr. Bates versus The Post Office, and I can definitely recommend watching it, because it tells you a lot about the impact your software can have on individuals and about the great lengths companies are willing to go to in order to hide the impact of bugs in systems like this.
Anthony Alford: Goodness gracious.
Roland Meertens: Yes, it’s insane. We can do a whole episode about the post office scandal I think.
Anthony Alford: That would be depressing.
Roland Meertens: Yes, but I must say it’s very interesting. Every time you read about this and think, surely by now they will acknowledge that there can be problems like this, the Post Office just doubled down, hired more lawyers, filed bigger lawsuits, and absolutely ruined the lives of people who have been postmasters over the last 20 years.
Anthony Alford: Wow.
Roland Meertens: As I said, can recommend this as a thing to watch.
Anthony Alford: Sounds good.
Roland Meertens: Anyway, talking about recommendations: if you enjoyed this podcast, please like it and please tell your friends about it. If you want to learn more about technology, go to InfoQ.com. Take a look at our other podcasts, our articles, and the conference talks we recorded. Thank you very much for listening, and thank you very much, Anthony, for joining me again.
Anthony Alford: Fun time as always.
Roland Meertens: Fun time as always. Thank you very much.
Anthony Alford: So long.
Roland Meertens: Any last fun facts you want to share?
Anthony Alford: Well, I don’t know if we want to put this one on the air, but I was looking at how property is described here in the US in a legal document. So you may know, you may not, that we have a system called Township and Range, and I think it was invented by our President Thomas Jefferson.
After our Revolution, we had all this land that legally speaking was not owned by anyone. So they divided it up into a grid. They laid a grid out over it. So here’s a description of a piece of property:
Township four north, range 12 west. The south half of the north half of the west half of the northeast quarter of the northeast quarter of the north half of the south half of section six.
Roland Meertens: Okay. Yes. So they made a grid and then they went really, really, really, really deep.
Anthony Alford: Subdividing the grid. Yep.
Roland Meertens: Yes, I do like that. When people started mapping this, they were probably like, ah, there’s so much land, it doesn’t really matter how accurate this is; north US, south US is probably enough.
Anthony Alford: Well, what’s interesting, a surveyor was sort of a high status job in the colonial days. George Washington was a surveyor, and Thomas Jefferson amused himself by designing buildings. So these guys took it pretty seriously. That was the age of the Enlightenment and Renaissance men and all that.
Roland Meertens: But if you are not good at mapping, you don’t come home on your ship.
Anthony Alford: Yes, exactly.
Roland Meertens: And if there’s no maps of roads or you don’t know where you are, you don’t reach the village you wanted to get to.
Anthony Alford: Exactly.
Roland Meertens: Yes. Interesting.
Article: Adaptive Responses to Resiliently Handle Hard Problems in Software Operations

MMS • Laura Maguire
Article originally posted on InfoQ. Visit InfoQ

Key Takeaways
- Resilience – adapting to changing conditions in real time – is a hallmark of expert performance.
- Findings from Resilience Engineering studies have revealed generalizable patterns of human cognition when handling complex, changing environments.
- These studies guide how software engineers and their organizations can effectively organize teams and tasks.
- Five characteristics of resilient, adaptive expertise include early recognition of changing conditions, rapidly revising one’s mental model, accurately replanning, reconfiguring available resources, and reviewing to learn from past performance.
- These characteristics can be supported through various design techniques for software interfaces, changing work practices, and conducting training.
As software developers progress in their careers, they develop deep technical systems knowledge and become highly proficient in specific software services, components, or languages. However, as engineers move into more senior positions such as Staff Engineer, Architect, or Sr Tech Lead roles, the scope of how their knowledge is applied changes. At the senior level, knowledge and experience are often applied across the system. This expertise is increasingly called upon for handling novel or unstructured problems or designing innovative solutions to complex problems. This means considering software and team interdependencies, recognizing cascading effects and their implications, and utilizing one’s network to bring appropriate attention and resources to new initiatives or developing situations. In this article, I will discuss several strategies for approaching your role as a senior member of your organization.
Resilience in cognitively demanding software development work
Modern software engineering requires many core capabilities to cope with the complexity of building and running systems at speed and scale and to adapt to continuously changing circumstances. Resilience Engineering offers several concepts that apply to adapting to inevitable pressures, constraints, and surprises.
Resilience has been described in many ways by different fields. It has been used to describe psychological, economic, and societal attributes but comes primarily from ecology. It is used to describe adaptive characteristics of biological and ecological systems, and over the years, our understanding of resilience has changed. In software, perhaps the most impactful description of resilience is from safety researcher David Woods and the Theory of Graceful Extensibility. He defines it as “the ability of a system to extend its capacity to adapt when surprise events challenge its boundaries”.
This means an organization does not just “bounce back” or successfully defend itself from disruptions. Instead, it can respond in such a way that new capabilities emerge. Consider how, as Forbes notes in their article on business transformations, commercial airlines responded to decreased travel during the pandemic by turning passenger routes into cargo flights, or how hotels that had lost travelers began offering daily room rates so that employees working from home could stay productive safely.
Similarly, this resilience perspective is helpful for software engineering since “surprises” are a core characteristic of everyday operations in large-scale, continuous deployment environments. A core aspect of system design that allows for more resilient and reliable service delivery comes from designing, planning, and training for surprise handling.
Resilience Engineering techniques for everyday performance improvement
Researchers studying performance in high-demand work – like flying a fighter jet at 1800 mph close to terrain, rapidly shutting down a nuclear power plant after an earthquake, or performing open heart surgery on an infant in distress – have identified important human perceptual, reasoning, and response capabilities that allow someone to respond quickly and appropriately to a given event.
Even with extensive preparations and training, unexpected events can make following a playbook or work process difficult or impossible. Add time pressure and uncertainty about what is happening and how quickly things might fail, and the situation becomes overwhelmingly hard to manage.
People are forced to adapt in these kinds of surprising situations. They must rapidly identify a new way to handle the situation as it deteriorates to prevent the failure’s impacts from spreading. Successful adaptation is often praised as “quick thinking”, and in this article, we’ll explore the basis for quick thinking – or resilient performance – during software incidents.
The theoretical basis for quick thinking is drawn from research into high-consequence work settings. When applied to software operations, it can enhance resilient performance and minimize outages and downtime. Here, these findings are adapted into strategies for action for individual engineers and their organizations. These are:
- Recognizing subtly changing events to provide early assessment and action
- Revising, in real time, your mental model of what is happening and how to adjust your actions
- Replanning in real time as conditions change
- Reconfiguring your available technical and human resources
- Reviewing performance for continuous learning
Together, these five capabilities enable quick and accurate responses to unexpected events, and they help teams move quickly without breaking things.
Recognizing: The importance of early detection
Early recognition of a problem, changing circumstances, or needing to revise our understanding of a situation is crucial to resilience. Early detection is beneficial in that it allows:
- more possibilities for action because the situation has not progressed very far
- the opportunity to gather more information before needing to act
- the ability to recruit additional resources to help cope with the situation
Early detection is not always possible due to a lack of data or poor representation of the available data. However, engineers can better recognize problems earlier by continually calibrating their understanding of how the system operates under varying conditions and noticing even subtle changes quickly. Here are three practical ways for software engineers to achieve this in day-to-day work:
Calibrating to variance: One approach is to become more familiar with expected vs. unexpected system behavior by regularly monitoring different operating conditions, not just when there is a problem. An active monitoring practice helps calibrate variance, such as when a spike in volume indicates a problem versus when a certain time zone or customer heavily utilizes the service.
Expanding knowledge about changes: Another strategy is to develop a practice of reading incident reports and reviewing what the dashboards looked like at the earliest indication of trouble to get better at noticing what an anomalous event looks like.
Encouraging knowledge transfer: Lastly, a lightweight calibration technique that supports early detection is asking, “What did you notice that caused you to think there was a problem?” whenever a coworker describes a near miss or a time they proactively averted an outage. Their explanations and your interpretations of these vicarious experiences reinforce a more elaborate mental model of nominal and off-nominal behavior.
Revising: The role of mental models in solving hard problems
A mental model is an internal representation of a system’s behavior. All software engineers construct mental models of how the system runs and fails. Mental models typically include information about relationships, interdependencies, and interactivity that allow for inferences. They can also help predict how a system will likely respond to different interventions.
For software engineers, this means mentally sifting through possible solutions, issues, and interactions to determine the most reasonable action. What is reasonable depends on assessing the action against the current and expected future conditions, goals, priorities, and available resources. In other words, engineers mentally simulate how different choices will impact desired outcomes. A well-calibrated mental model helps engineers run these simulations effectively and be better prepared to assess the pros and cons of each option and the risks involved.
But mental models can be – and often are – wrong. As noted in Behind Human Error, mental models are partial, incomplete, and flawed. This is not a criticism of the engineer. Instead, it acknowledges the complex and changing nature of large-scale software systems. No one person will have complete and current knowledge of the system. No one has a perfect understanding of the dependencies and interactions of a modern software system. Most software systems are simply too big and change too much, too quickly, for anyone’s knowledge to be consistently accurate.
Having poorly calibrated knowledge is not the problem. The problem is not knowing that your knowledge is poorly calibrated. This means engineers must continually focus on model updating. A strategic resilience approach is cultivating a continual awareness of how current or stale your understanding of the situation may be. As a researcher studying how engineers respond to incidents, I constantly look for clues indicating how accurate the responders’ mental models are. In other words, is what they know or believe about a situation or a system correct? When it is not, that mismatch is a signal that model updating is needed. A high-performing team can quickly identify when they’ve got it wrong and rapidly find data to update their understanding of the situation. Some approaches to continual revising include:
Call out the uncertainties and ambiguities: A technique that helps teams notice when their mental models are incorrect or differ is to ask clarifying questions like “What do you mean when you say this query syntax is wrong?” It’s a simple and direct question that my research has shown is not commonly asked. Explicitly asking creates opportunities for others to reveal what they are thinking and allows all involved to make sure they have the same understanding. This is especially crucial as situations are rapidly changing. Teams can develop shorthand ways of ensuring model alignment to avoid disrupting the incident response.
State assumptions and beliefs explicitly: Developing a practice of explicitly stating assumptions and beliefs allows those around you to track the line of reasoning and quickly identify an incorrect assumption or faulty belief. This seems simple, but once you start doing it, you realize how much you “let slide” about inaccurate or faulty mental models in yourself or others, either because an error seems too small to be worth revising or because time pressure prevents revising it. A more junior engineer may be apprehensive about asking clarifying questions about a proposed deployment, or hesitate to talk through their understanding of the risks of rolling back a change for fear of being wrong. A more senior engineer may not realize there is a gap in their mental model, or may not want to publicly call out faulty knowledge.
Learn to be okay with being wrong: Software engineers must accept that their mental models will be wrong. Organizations need to normalize the practice of “being wrong”. This shift means that the processes around model updating – like asking seemingly obvious questions – become an accepted and common part of working together. Post-incident learning reviews or pair programming are excellent opportunities to dig into each party’s mental models and assumptions about how the technology works, interacts, and behaves under different conditions.
Replanning: It’s not the plan that counts, it’s the ability to revise
Software engineers responding to a service outage are, for the most part, hard-wired to generate solutions and take action. Researchers Gary Klein, Roberta Calderwood, and Anne Clinton-Cirocco studied expert practitioners in various domains and showed that anomaly recognition, information synthesis, and taking action are tightly coupled processes in human cognition. The cycle of perception and action is a continuous feedback loop, which means constant replanning based on changing available information. Replanning gets increasingly tricky as time pressure increases, partly due to the coordination requirements.
Consider, for example, replanning in an everyday work situation such as a sprint planning meeting where the team is deciding how to prioritize one feature over another. In this scenario, there is time to consider the implications of changing the work sequencing or priorities, and it is possible to reach out to any parties affected by the decision and account for their input on how the plan may impact them. It is relatively easy to reorganize the workflow with little disruption for everyone.
Contrast that with a high-severity incident where there may be data loss in a critical, widely used internal project management tool. The incident response team thinks the data loss may be limited to only part of the organization. While there is a slight possibility they could recover the data, doing so would mean keeping the service down for another day, impacting more users. One team has a critical meeting with an important client and needs the service restored within the next hour. This means responders must determine the blast radius of impacted users, the extent of their data loss, and the implications for those teams while the clock is ticking. Time pressure makes any kind of mental or coordinative effort more challenging, and replanning with limited information can have significant consequences: needed perspectives may be unavailable to weigh in, adding stress for everyone involved and forcing unexpected shifts in priorities or undesirable tradeoffs.
In a recent study looking at tradeoff decisions during high-severity incidents, my colleague Courtney Nash and I found that successful replanning decisions were inevitably “cross-boundary”. A major outage often requires many different roles and levels of the organization to get involved. This meant that understanding the differing goals and priorities of each role was essential to replanning quickly without sacrificing anyone’s goals. And when goals and work did need to change, the implications of doing so were clearer to those doing the replanning. These findings and others from the resilience literature suggest an important strategy for resilient replanning:
Create opportunities to broaden perspectives: Formal or informal discussions highlighting implicit perceptions and beliefs can influence how and when participants take action during an incident or during work planning. Teams can use this information to revise inaccurate mental models, adjust policies and practices, and help organizations identify better approaches to team structure, escalation patterns, and other supporting workflows. A greater understanding of goals and priorities, and how they may shift in response to different demands, aids prioritization during replanning. A crucial part of coping with degraded conditions is assessing which goals are achievable given the current situation and figuring out which ones may need to be sacrificed or relaxed to sustain operations.
Reconfiguring: Adjusting to changing conditions
Surprises seldom occur when it is convenient to deal with them. Instead, organizations scramble to respond with whoever is available and whatever expertise, authority, or budget is at hand. Organizations that flexibly use the resources they have can support effective problem-solving and coordination even in challenging conditions. This can mean simple things, like having a widely accessible communication platform that doesn’t require special permissions, codes, or downloaded apps, allowing anyone who could help to join the effort seamlessly. It may be more complex – such as an organization that promotes cross-training for adjacent roles.
Or it could be holding company-wide game days so that engineers from multiple teams can be brought together efficiently on a significant outage because they have common ground: they know each other, have some familiarity with parts of the system other than the ones they usually work on, and can rely on their shared experiences to accurately predict who may have the appropriate skills to perform complex tasks. Just as you might add, delete, or move resources within your network configuration, a strategy of dynamically reconfiguring people and software helps resilience by moving expertise and capabilities to where they are needed while minimizing any impacts of degraded performance in other areas. A resilient strategy for reconfiguration in software organizations includes:
Cultivating cross-boundary awareness: Reconfiguring allows an organization to share resources more efficiently when there is accurate knowledge about the current goals, priorities, and work underway on adjacent teams or major initiatives within the organization. Research into complex coordination requirements has shown better outcomes for real-time reconfiguring when the parties hold a reasonably calibrated shared mental model of the situation and the context for the decision. This enables each participant to bring their knowledge to bear quickly and effectively, supports collaborative cross-checking (essentially vetting new ideas against different perspectives), and allows for reciprocity (being able to lend help or relax constraints) across teams or organizations.
Maintaining some degree of slack in the system: Modern organizations are fixated on eliminating waste and running lean. But what is considered inefficient or redundant before an incident is often recognized as critical, or at least necessary, in hindsight. In many incidents I’ve studied, Mean Time To Repair (MTTR) was reduced by engineers proactively joining a response even when they were not on-call. This additional capacity is not typically acknowledged or accounted for when assessing the actual requirements for maintaining the system, yet it is critical, and it exists because of engineers’ professional courtesy to one another. It is highly stressful to be responsible for a challenging incident or to deal with the pressure of a public outage, and I’ve seen engineers jump in to assist even while putting babies to bed or on vacation. Burnout, turnover, and changing roles are inevitable. Maintaining a slightly larger team than is optimally efficient can make the team more resilient by increasing communication, opportunities to build and maintain common ground, and cross-training for new skills.
Reviewing performance: Continuous learning supports continued performance
There is a difference between how we think work gets done and how work actually gets done. Learning review techniques that focus on what happened, not on what was clear only after the fact, help to show how the system behaves under varying conditions and how organizations and teams function in practice. Discussing the contributing factors to the failure, hidden or surprising interdependencies, and other technical details should also include details about organizational pressures, constraints, or practices that helped or hindered the response. This is true even for the “small stuff”, like how an engineer noticed a spike in CPU usage on their day off, or why a marketing intern was the only one who knew an engineering team was planning a major update the day before a critical feature launched. When the post-incident review broadens to include both the social and technical aspects of the event, the team can address both software and organizational factors, creating a more resilient future response.
Some strategies for enabling continuous learning to support resilience include:
Practice humility: As mentioned before, inaccurate or incomplete mental models are a fact of life in large-scale distributed software operations. Listening to others and asking clarifying questions creates micro-learning opportunities to update faulty mental models (including your own!).
Don’t assume everyone is on the same page: Where possible, always start with the least experienced person’s understanding of a particular problem, interaction, or situation and work up from there, adding technical detail and context as the review progresses. This gives everyone a common basis of understanding and helps highlight any pervasive but faulty assumptions or beliefs.
Make the learnings widely accessible: Importantly, organizations can extend learning by creating readable and accessible artifacts (including documents, recordings, and visualizations) that are easily shared, that allow for publicly asking and answering questions to promote a culture of knowledge sharing, and that are available to multiple parties across the organization, including non-engineering roles. A narrative approach that “tells the story” of the incident is engaging and helps the reader understand why certain actions or decisions made sense at the time. It’s a subtle but non-trivial framing. It encourages readers to be curious and promotes empathy, not judgment.
Resilience takeaways
Like any other key driver of performance, resilience requires investment. Organizations unwilling or unable to invest in a full systems approach can still reallocate existing resources toward resilient performance in small but repeatable ways by maximizing the types of activities, interactions, and practices that allow for REcognizing, REvising, REplanning, REconfiguring, and REviewing. In doing so, we can enable software teams to coordinate and collaborate more effectively under conditions of uncertainty, time pressure, and stress to improve operational outcomes.