Month: October 2024
AI and ML Tracks at QCon San Francisco 2024 – a Deep Dive into GenAI & Practical Applications
MMS • Artenisa Chatziou
Article originally posted on InfoQ. Visit InfoQ
At QCon San Francisco 2024, the international software development conference by InfoQ, there are two tracks dedicated to the rapid advancements in AI and ML, reflecting how these technologies have become central to modern software development.
The conference spotlights senior software developers from enterprise organizations sharing their approaches to adopting emerging trends. Each session is led by seasoned practitioners and provides actionable insights and strategies that attendees can immediately apply to their own projects: no marketing pitches, just real-world solutions.
The first track, Generative AI in Production and Advancements, curated by Hien Luu, Sr. engineering manager @Zoox & author of MLOps with Ray, provides a deep dive into practical AI/ML applications and the latest industry innovations. The track explores real-world implementations of GenAI, focusing on how companies leverage it to enhance products and customer experiences.
Talks will include insights from companies like Pinterest, Meta, Microsoft, Voiceflow, and more, sharing best practices on deploying large language models (LLMs) for search and recommendations and exploring AI agents’ potential for the future of software:
- Scaling Large Language Model Serving Infrastructure at Meta: Charlotte Qi, senior staff engineer @Meta, will share insights on balancing model quality, latency, reliability, and cost, supported by real-world case studies from their production environments.
- GenAI for Productivity: Mandy Gu, senior software development manager @Wealthsimple, will share how they use Generative AI to boost operational efficiency and streamline daily tasks, blending in-house tools with third-party solutions.
- Navigating LLM Deployment: Tips, Tricks, and Techniques: Meryem Arik, co-founder @TitanML, recognized as a technology leader in Forbes 30 Under 30, will cover best practices for optimizing, deploying, and monitoring LLMs, with practical tips and real case studies on overcoming the challenges of self-hosting versus using API-based models.
- LLM Powered Search Recommendations and Growth Strategy: Faye Zhang, staff software engineer @Pinterest, will cover model architecture, data collection strategies, and techniques for ensuring relevance and accuracy in recommendations, with a case study from Pinterest showcasing how LLMs can drive user activation, conversion, and retention across the marketing funnel.
- 10 Reasons Your Multi-Agent Workflows Fail and What You Can Do about It: Victor Dibia, principal research software engineer @Microsoft Research, core contributor to AutoGen, author of “Multi-Agent Systems with AutoGen” book, will share key challenges in transitioning from experimentation to production-ready systems and highlight 10 common reasons why multi-agent systems often fail.
- A Framework for Building Micro Metrics for LLM System Evaluation: Denys Linkov, head of ML @Voiceflow, LinkedIn Learning instructor, and ML advisor, will explore how to measure and improve LLM accuracy using multidimensional metrics beyond a simple accuracy score.
The second track, “AI and ML for Software Engineers: Foundational Insights”, curated by Susan Shu Chang, principal data scientist @Elastic, author of “Machine Learning Interviews”, is tailored to those looking to integrate AI/ML into their work.
The talks will cover the deployment and evaluation of ML models, providing hands-on lessons from companies that have faced similar challenges:
- Recommender and Search Ranking Systems in Large Scale Real World Applications: Moumita Bhattacharya, senior research scientist @Netflix, will dive into the evolution of search and recommendation systems, from traditional models to advanced deep learning techniques.
- Why Most Machine Learning Projects Fail to Reach Production and How to Beat the Odds: Wenjie Zi, senior machine learning engineer and tech lead @Grammarly, specializing in natural language processing with 10+ years of industrial experience in artificial intelligence applications, will explore why many machine learning projects fail despite the growing hype around AI, and how to avoid common pitfalls such as misaligned objectives, skill gaps, and the inherent uncertainty of ML.
- Verifiable and Navigable LLMs with Knowledge Graphs: Leann Chen, AI developer advocate @Diffbot, creator of AI and knowledge graph content on YouTube, will demonstrate how knowledge graphs enhance factual accuracy in responses and how their relationship-driven features enable LLM-based systems to generate more contextually-aware outputs.
- Reinforcement Learning for User Retention in Large-Scale Recommendation Systems: Saurabh Gupta, senior engineering leader @Meta, and Gaurav Chakravorty, Uber TL @Meta, will share insights into leveraging reinforcement learning (RL) for personalized content delivery, exploring reward shaping, optimal time horizons, and necessary infrastructure investments for success at scale.
- No More Spray and Pray— Let’s Talk about LLM Evaluations: Apoorva Joshi, senior AI developer @MongoDB, with six years of experience as a data scientist in cybersecurity and an active member of Girls Who Code, Women in Cybersecurity (WiCyS), and AnitaB.org, will share practical frameworks for LLM evaluation that attendees can apply directly to their projects.
These sessions aim to bridge the gap between AI hype and practical, scalable applications, helping engineers of all levels navigate the evolving AI landscape.
Explore all 12 tracks taking place at QCon San Francisco this November 18-22, and take advantage of the last early bird tickets ending on October 29. With only a few weeks left, join over 1,000 of your peers in shaping the future of software development!
MMS • Gunnar Morling
Article originally posted on InfoQ. Visit InfoQ
Transcript
Morling: I would like to talk about a viral coding challenge which I did in January of this year, called The One Billion Row Challenge. I would like to tell a little bit of the story behind it, how all this went, what I experienced during January. Then, of course, also some of the things I learned, some of the things the community learned from doing that. This is how it started. On January 1st, I put out this tweet, I put out a blog post introducing this challenge. It had been on my mind to do something like that for quite a while. Between Christmas and New Year’s Eve, I finally found the time to do it. I thought, let me go and put out this challenge. I will explain exactly what it was, and how it worked. That’s how it started. Then it got picked up quite quickly. People started to work on that. They implemented this in Java, which was the initial idea, but then they also did it in other languages, other ecosystems, like .NET, Python, COBOL, and with databases, Postgres, Snowflake, Apache Pinot, all those good things.
There was an article on InfoQ, which was the most read article for the first six months of the year, about this. There also was this guy, I didn’t know him, Prime, a popular YouTuber. He also covered this on his stream. What did I do then? I learned how to print 162 PDF files and send them out to people with their individual certificate of achievement. I learned how to send coffee mugs and T-shirts to people all around the world, because I wanted to give out some prizes. I sent those things to Taiwan, South America, the Republic of South Korea, and so on. That was what happened during January.
Who am I? I work as a software engineer at a company called Decodable. We built a managed platform for stream processing based on Apache Flink. This thing is completely a side effort for me. It was a private thing, but then Decodable helped to sponsor it and supported me with doing that.
The Goals
What was the idea behind this? Why did I do this? I thought I would like to learn something new. There’s all those new Java versions every six months. They come with new APIs, new capabilities. It’s really hard to keep track of all those developments. I would like to know what’s new in those new Java versions, and what can I do with those things? I wanted to learn about it, but also, I wanted to give the community an avenue to do that so that everybody can learn something new. Then, of course, you always want to have some fun. It should be a cool thing to do. You don’t want to just go and read some blogs. You would like to get some hands-on experience.
Finally, also, the idea was to inspire others to do the same. This was the thing, which I think was a bit specific about this challenge, you could actually go and take inspiration from other people’s implementations. Nothing was secret. You could go and see what they did, and be inspired by them. Obviously, you shouldn’t just take somebody else’s code and submit it as your own implementation. That wouldn’t make much sense. You could take inspiration, and people actually did that, and they teamed up in some cases. The other motivation for that was, I wanted to debunk this idea, which sometimes people still have, that Java is slow, and nothing could be further from the truth, if you look at modern versions and their capabilities. Still, this is on people’s minds, and I wanted to help and debunk this. As we will see, I think that definitely was the case.
How Did it Work?
Let’s get a bit into it. How did it work? You can see it all here in this picture. The idea was, we have a file with temperature measurements, and it’s essentially like a CSV file, only that the separator wasn’t a comma but a semicolon, with two columns: a station name, like Hamburg or Munich, and so on, and a temperature measurement value associated with that, randomized values. The task was, process that file, aggregate the values from that file, and for each of those stations, determine the minimum value, the maximum value, and the average temperature value. Easy. The only caveat was, this file has 1 billion rows, as the name of the challenge gives away.
This file, if you generate this on your machine, it has a size of around 13 gigabytes, so quite sizable. Then you had to print out the results, as you can see here. This already goes to show a little bit that I didn’t spend super much time to prepare this, because this is just the toString output of the Java HashMap implementation. It was a bit random to have this as the expected output. Then as people were implementing this, for instance, with relational databases, they actually went to great lengths to emulate that output format. I should choose a more rational output format next time.
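To make that concrete, here is a hedged illustration of what a few input lines and the expected output look like; the station names and values below are made up, but the semicolon-separated input and the min/mean/max output per station follow the description above.

```
Hamburg;12.0
Munich;8.9
Hamburg;-3.4
...
{Hamburg=-3.4/4.3/12.0, Munich=8.9/8.9/8.9, ...}
```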
The Rules
A little bit more about the rules. First of all, this was focused on Java. Why? It’s the platform I know best and would like to support. This is also what I wanted to spread the knowledge on. Then you could choose any version. Any new versions, preview versions, all the kinds of distributions, like GraalVM, or all the different JDK providers which are out there. You managed the versions using this tool called SDKMAN. Who has heard about SDKMAN? You should go and check out SDKMAN, and you should use it to manage Java versions. It’s a very nice tool for doing that, and switching back and forth between different versions. That’s it. Java only. No dependencies.
It wouldn’t make much sense to pull in some library that does the task for you. You should program this by yourself. No caching between runs. I did five runs of each implementation. Then I discarded the fastest and the slowest one, and I took the average value from the remaining three runs. It wouldn’t make sense to have any caching there. Otherwise, you could just do the task once, persist the result in a file, read it back, and it would be very fast. That doesn’t make much sense. You were allowed to take inspiration from others. Of course, again, you couldn’t just resubmit somebody else’s implementation. You could take inspiration.
Evaluation Environment
In terms of how I ran this. My company spent €100 on this machine which I got on the Hetzner cloud. Very beefy server with 32 cores, out of which I mostly used only 8 cores, I will explain later on. Quite a bit of RAM. Really, the file was always coming from a RAM disk. I wanted to make sure that disk I/O is not part of the equation, just because it’s much less predictable, would have made the life much harder for me. Only a purely CPU bound problem here. How would we go and do this? This is my baseline implementation. I’m an average Java developer, so that’s what I came up with. I use this Java Streams API. I use this files.lines method, which gives me a stream with the lines of a file. I read that file from disk, then I map each of my lines there using the split method. I want to separate the station name from the value. Then I collect the results, the lines into this grouping collector. I group it by the station name.
Then for each of my stations, I need to aggregate those values, which happens here in my aggregator implementation. Whenever a new value gets added to an existing aggregator object, I keep track of the min, the max, and in order to calculate the average I keep track of the sum and the count of the values. Pretty much straightforward. That’s adding a line. Then, if I run this in parallel, I would need to merge two aggregators. That’s what’s happening here. Again, pretty much straightforward. Finally, if I’m done, I need to reduce my processed results, and I emit such a result object with the min and the max value. Then, for the average, I just divide sum by count, and I print it out. On this machine, this ran in about five minutes. Not super-fast, but also not terribly bad. Writing this code, it took me half an hour or maybe less. It’s decent. Maybe, if you were to solve this problem in your job, you might call it a day and just go home, have a coffee and be done with it. Of course, for the purpose of this challenge, we want to go quite a bit faster and see how much we can move the needle here.
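For readers who want to see roughly what such a baseline looks like in code, here is a minimal sketch along the lines described above; the class names, file name, and exact collector wiring are illustrative assumptions, not the original source.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collector;
import java.util.stream.Collectors;

public class BaselineAverages {

    record Measurement(String station, double value) {
        static Measurement parse(String line) {
            int idx = line.indexOf(';');                      // split station name from value
            return new Measurement(line.substring(0, idx),
                    Double.parseDouble(line.substring(idx + 1)));
        }
    }

    static class Aggregator {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum;
        long count;

        void add(double value) {                              // called once per line
            min = Math.min(min, value);
            max = Math.max(max, value);
            sum += value;
            count++;
        }

        Aggregator merge(Aggregator other) {                  // needed when running in parallel
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
            sum += other.sum;
            count += other.count;
            return this;
        }

        @Override
        public String toString() {                            // min/mean/max, as in the expected output
            return String.format("%.1f/%.1f/%.1f", min, sum / count, max);
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Aggregator> stats = Files.lines(Path.of("measurements.txt"))
                .map(Measurement::parse)
                .collect(Collectors.groupingBy(Measurement::station, TreeMap::new,
                        Collector.of(Aggregator::new, (agg, m) -> agg.add(m.value()), Aggregator::merge)));
        System.out.println(stats);
    }
}
```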
The First Submission
With this challenge, the important thing is somebody has to come and participate. That was a bit my concern like, what happens if nobody does it? It would be a bit embarrassing. Roy Van Rijn, another Java champion from the Netherlands, he was instantly interested in this, and an hour later or so, after I had put out the post, he actually created his own first implementation, and it wasn’t very fancy or very elaborate. His idea just was, I want to be part of it. I want to put out a submission so other people can see, this is something we also could do. This was really great to see, because as soon as the first person comes along and takes part, then also other people will come along and take part. Of course, he kept working on that. He was one of the most active people who iterated on his implementation, but he was the very first one to submit something.
Parallelization
Let’s dive a little bit into the means of what you can do to make this fast. People spent the entire month of January working on this, and they went down to a really deep level, essentially counting CPU instructions. My idea is, I want to give you some ideas of what exists, what you can do. Then maybe later on, if you find yourself in that situation where you would like to optimize certain things, then you might remember, I’ve heard about it. Then you could go and learn really deep. That’s the idea.
Let’s talk about parallelization, first of all, because we have many CPU cores. On my server, which I use to evaluate it, I have 32 cores, 64 with hyperthreading. We would like to make use of that. Would be a bit wasteful to just use a single core. How can we go about this? Going back to my simple baseline implementation, the first thing I could do is I could just say, let me add this parallel call, so this part of the Java Streams API.
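Reusing the names from the baseline sketch shown earlier, the change is literally this one call (again a hedged sketch, not the original code):

```java
Map<String, Aggregator> stats = Files.lines(Path.of("measurements.txt"))
        .parallel()   // fan the mapping and grouping work out over the common fork-join pool
        .map(Measurement::parse)
        .collect(Collectors.groupingBy(Measurement::station, TreeMap::new,
                Collector.of(Aggregator::new, (agg, m) -> agg.add(m.value()), Aggregator::merge)));
```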
Now this will process this pipeline, or I should say, part of this streaming pipeline in parallel. Just doing this, just adding this single method call, gets us down to 71 seconds. From 4 minutes 50 seconds, to 71 seconds by just adding a few characters for one method call. I think that’s a pretty good thing. Very easy win. Of course, that’s not all we can do. In particular, if you think about it, yes, it gets us down by quite a bit, but it’s not eight times faster than what we had initially, but we have 8 CPU cores which I’m using here. Why is it not eight times faster? This parallel operator, this applies to the processing logic. All this aggregating and grouping logic, this happens in parallel, but this reading of the file from memory, this still happens sequentially.
The entire file, the reading part, that’s sequential, and we still have all our other CPU cores sitting idle, so we would like to also parallelize that. This comes back then to this notion that I would like to go out and learn something new, because all those new Java versions, they come with new APIs, the JEPs, the Java Enhancement Proposals. One of them, which was added recently, is the foreign function and memory API. You can see it here, so that’s taken straight from the JEP, but essentially, it’s a Java API which allows you to make use of native methods.
It’s a replacement, much easier to use than the old JNI API. It also allows you to make use of native memory. Instead of the heap, which is managed by the JVM, you get the ability to manage your own memory section, like an off-heap memory, and you will be in charge of maintaining that, and making sure you free it, and so on. That’s what we would like to use here, because we could memory map this file and then process it there in parallel. Let’s see how we can go about that.
I mentioned there’s a bit of code, but I will run you through it. That’s a bit of a recurring theme. The code you will see, it gets more dense as we progress. Again, you don’t really have to understand everything. I would like to give you a high-level intuition. What do we do here? First of all, we determine the degree of parallelism. We just say, how many CPU cores do we have? Eight in our case, so that’s our degree of parallelism. Next, we want to memory map this file. You could memory map files also in earlier Java versions, but for instance, you had size limits. You couldn’t memory map an entire 13-gig file all at once, whereas now here with the new foreign memory API, that’s possible. We can do this. You map the file. We have this Arena object there. This is essentially our representation of this memory. There are different kinds of Arenas. In this case, I’m just using this global Arena which just exists and is accessible from everywhere within my application. That’s where I have that file, and now I can access that entire section of memory in parallel using multiple threads.
In order to do so, we need to split up that file and the memory representation. That’s what happens here. First of all, roughly speaking, we divide into eight equal chunks. We take our entire size divided by eight. That’s our estimate of chunk sizes. Now, of course, what would happen is, in all likelihood, we will end up in the middle of a line. This is not really desirable, where, ideally, we would like to have our worker processes, they should work on entire lines. What’s happening here is, we went to roughly one-eighth of the file, we just keep going to the next line ending character. Then we say, that’s the end of this chunk, and the starting point of the next chunk. Then we process those chunks, essentially, just using threads. We will see later on how to do this. We start our threads, we join them. In the end, we’ve got to wait. Now this parallelizes the entire thing.
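A hedged sketch of that mapping-and-chunking step might look as follows (Java 21+ with the foreign function and memory API; the method name and error handling are illustrative, and a preview flag may be needed depending on the JDK version):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

static List<MemorySegment> mapAndChunk(Path file) throws Exception {
    int parallelism = Runtime.getRuntime().availableProcessors();
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
        long size = channel.size();
        // map the whole 13 GB file at once into the global arena
        MemorySegment whole = channel.map(FileChannel.MapMode.READ_ONLY, 0, size, Arena.global());

        List<MemorySegment> chunks = new ArrayList<>();
        long chunkSize = size / parallelism;
        long start = 0;
        for (int i = 0; i < parallelism && start < size; i++) {
            long end = (i == parallelism - 1) ? size : Math.min(size, start + chunkSize);
            // push the boundary forward to the next newline so no line is split across chunks
            while (end < size && whole.get(ValueLayout.JAVA_BYTE, end) != '\n') {
                end++;
            }
            if (end < size) {
                end++;                                       // include the newline itself in this chunk
            }
            chunks.add(whole.asSlice(start, end - start));
            start = end;
        }
        return chunks;                                       // one slice per worker thread
    }
}
```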
Now we really make use of all our 8 cores for the entire time, also while we do the I/O. There’s one caveat. Just by the nature of things, one of those CPU cores will always be the slowest. At some point, all the other seven, they will just wait for the last one to finish, because it’s a little bit unequally distributed. What people, in the end, did, instead of using 8 chunks, they split up this file in much smaller chunks. Essentially, they had a backlog of those chunks. Whenever one of those worker threads was done with the current chunk, it would go and pick up the next one.
By that, you make sure that all your 8 threads are utilized equally all the time. The ideal chunk size, as it turned out, was 2 megabytes. Why 2 megabytes? This specific CPU, which is in this machine which I used, it has a second level cache size of 16 megabytes, 8 threads processing 2 megabytes at a time. It’s just the best in terms of predictive I/O and so on. This is what people found out. This already goes to show, we really get down to the level of a specific CPU and the specific architecture to really optimize for that problem by doing those kinds of things. That’s parallel I/O.
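A hedged sketch of that scheme, with a shared cursor over a list of small chunks; the chunk list would come from a splitter like the one above, and processChunk is a placeholder standing in for the per-chunk parsing and aggregation:

```java
import java.lang.foreign.MemorySegment;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

static void processAllChunks(List<MemorySegment> chunks, int threadCount) throws InterruptedException {
    AtomicInteger nextChunk = new AtomicInteger();
    Thread[] workers = new Thread[threadCount];
    for (int t = 0; t < threadCount; t++) {
        workers[t] = new Thread(() -> {
            int index;
            // each worker keeps pulling the next unprocessed chunk until none are left
            while ((index = nextChunk.getAndIncrement()) < chunks.size()) {
                processChunk(chunks.get(index));
            }
        });
        workers[t].start();
    }
    for (Thread worker : workers) {
        worker.join();
    }
}

static void processChunk(MemorySegment chunk) {
    // placeholder: parse the lines in this chunk and update a per-thread aggregation table
}
```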
1BRC – Mythbusters, and Trivial Task?
This challenge, it was going, people were participating. They had a good time. Of course, whenever something happens, there are also conspiracy theories. That’s what I feared. People said, is this actually an engineering problem? At Decodable, you had this problem and you didn’t know how to do it, so you farmed it out to the community. I can tell you, this would have been the silliest thing I could have done, because I created so much work for myself by running this challenge. I didn’t do much besides it during the entire month of January. It was not that. It was just a genuine thing, which I felt would be interesting to me and the community. Was it an ad for GraalVM? Because many people actually used GraalVM, and we will see later on more about it. Also, no. It was just that GraalVM lends itself really well towards that problem. Finally, was it an ad for this AMD EPYC processor? Also, no.
Really, no conspiracies going on here. Who is on Hacker News? I read Hacker News way too often. Of course, you always have the Hacker News guy who says, that’s a trivial problem. Everybody who knows how to program just a little bit, they will have solved this in two hours, and it’s like a boring thing. Then, on the other hand, you have all the people from the Java community, and also, big names like Thomas Würthinger, who is the engineering lead at GraalVM, or Cliff Click, who was one of the original people behind the JVM, or Aleksey Shipilev, and all those people, they spend the entire month of January doing this. Of course, the Hacker News dude, he does it in two hours. Always interesting to see.
Parsing
Let’s dive a little more into parsing that. We have seen how to make use of our CPU cores, but what actually happens there to process a line? Let’s take a closer look at that. If we want to get away from what we had initially with just splitting up the file using regex and so on, that’s not very efficient. Let’s see what we can do here. That’s, again, something I would be able to come up with, just processing those input lines character by character. What’s happening here is, we have a little bit of a state machine. We read our characters. We keep reading the line until it has no more characters. Then we use the semicolon character which separates our station name from the temperature value to switch these states. Depending on which state we are in, we either read the bytes which make up the station name, or we read the bytes which make up the measurement value. We need to add them into some builder or buffer which aggregates those particular values.
Then, if we are at the end of a line, so we have found the line ending character, then we need to consume those two buffers for the station and for the measurement, which we have built up. For the measurement, we will need to see how we convert that into an integer value, because that’s also what people figured out. The problem was described in terms of double or floating-point arithmetic, so with values like 21.7 degrees, but the randomly generated values always only had a single fractional digit. People realized, this data actually, it always only has a single fractional digit. Let’s take advantage of that and just consider that as an integer problem by just multiplying the number by 100, for the means of calculation. Then at the end, of course, divide it by 100, or by 10. That’s something which people did a lot, and I underestimated how much they would take advantage of the particular characteristics of that dataset.
For that conversion, we can see it here, and it makes sense, so we process or we consume those values. If we see the minus character, we negate the value. If we see the first one of our two digits, we multiply it by 100 or by 10. That’s how we get our value there. Doing that, it gets us down to 20 seconds. This is already an order of magnitude faster than my initial baseline implementation. So far, nothing really magical has happened. One takeaway also for you should be, how much does it make sense to keep working on such a thing? Again, if this is a problem you’re faced with in your everyday job, maybe stop here. It’s well readable, well maintainable. It’s an order of magnitude faster than the naive baseline implementation, so that’s pretty good.
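As a hedged sketch, manual splitting plus the scaled-integer conversion might look like this; the method and buffer handling are illustrative, and here the value is kept as tenths of a degree, so 21.7 becomes 217:

```java
import java.nio.charset.StandardCharsets;
import java.util.function.ObjIntConsumer;

// Parses one line of the form "Station;-12.3" without String.split or Double.parseDouble.
static void parseLine(byte[] line, int length, ObjIntConsumer<String> sink) {
    int semicolon = 0;
    while (line[semicolon] != ';') {          // find the separator by hand
        semicolon++;
    }
    String station = new String(line, 0, semicolon, StandardCharsets.UTF_8);

    int i = semicolon + 1;
    boolean negative = line[i] == '-';
    if (negative) {
        i++;
    }
    int value = 0;
    for (; i < length; i++) {
        if (line[i] != '.') {                 // skip the decimal point, keep accumulating digits
            value = value * 10 + (line[i] - '0');
        }
    }
    sink.accept(station, negative ? -value : value);
}
```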
Of course, for the purposes of this challenge, we probably need to go a bit further. What else can we do? We can, again, come back to the notion of parallelism and try to process multiple values at once, and now we have different means of parallelization. We already saw how to make the most out of all our CPU cores. That’s one degree of parallelism. We could think about scaling out to multiple compute nodes, which is what we typically would do with our datastores. For that problem, it’s not that relevant, we would have to split up that file and distribute it in a network. Maybe not that desirable, but that would be the other end of the spectrum. Whereas we also can go into the other direction and parallelize within specific CPU instructions. This is what happens here with SIMD, Single Instruction, Multiple Data.
Essentially all these CPUs, they have extensions which allow you to apply the same kind of operation onto multiple values at once. For instance, here, we would like to find the line ending character. Now, instead of comparing byte by byte, we can use such a SIMD instruction to apply this to 8 or maybe 16, or maybe even more bytes at once, and it will, of course, speed up things quite a bit. The problem is, in Java, you didn’t really have a good means to make use of those SIMD instructions because it’s a portable, abstract language, it just wouldn’t allow you to get down to this level of CPU specificity. There’s good news.
There’s this vector API, which is still incubating, I think, in the eighth incubating version or so, but this API allows you now to make use of those vectorized instruction set extensions. You would have calls like this compare call with this equal operator, and then this will be translated to the right SIMD instruction of the underlying architecture. This would translate to the Intel or AMD64 extensions. Also, for Arm, it would do that. Or it would fall back to a scalar execution if your specific machine doesn’t have any vector extensions. That’s parallelization on the instruction level. I did another talk about it, https://speakerdeck.com/gunnarmorling/to-the-moon-and-beyond-with-java-17-apis, which shows you how to use SIMD for solving FizzBuzz.
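A hedged sketch of such a comparison with the incubating vector API (it requires --add-modules jdk.incubator.vector; the method name and byte-array framing are illustrative assumptions):

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

// Returns the index of the first ';' in the buffer, comparing a whole vector's worth of bytes at once.
static int indexOfSemicolon(byte[] buffer) {
    int i = 0;
    int upperBound = SPECIES.loopBound(buffer.length);
    for (; i < upperBound; i += SPECIES.length()) {
        ByteVector chunk = ByteVector.fromArray(SPECIES, buffer, i);
        var matches = chunk.compare(VectorOperators.EQ, (byte) ';');
        if (matches.anyTrue()) {
            return i + matches.firstTrue();
        }
    }
    for (; i < buffer.length; i++) {          // scalar tail for the remaining bytes
        if (buffer[i] == ';') {
            return i;
        }
    }
    return -1;
}
```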
Sticking to that pattern, applying the same operation to multiple values at once, we also can do what’s called SWAR, SIMD Within A Register. Again, I realize, the code gets more dense. I probably wouldn’t even be able right now to explain each and every single line, or it would take me a while. The idea here is, this idea of doing the same thing, like an equals comparison, to multiple values all at once, we also can do this within a single variable. Because if you have 8 bytes, we could also see them as one long: that’s 64 bits, which is also 8 bytes. We can apply the right level of bit level magic to a long value, and then actually apply this operation to all the 8 bytes. It’s like bit level masking and shifting, and so on. That’s what’s happening here. There’s a very nice blog post by Richard Startin, which shows you, step by step, how to do this, or how to use this to find the first zero byte in a string.
I have put the math up here on the right-hand side, so you actually can go and follow along, and you will see, this actually gives you the first zero byte in a long like that. That’s SIMD Within A Register, SWAR. Now the interesting thing is, if you look at this code, something is missing here. Is somebody realizing what we don’t have here? There’s no ifs, there’s no conditionals, no branching in that code. This is actually very relevant, because we need to remember how our CPUs actually work. If you look at how a CPU would take and go and execute our code, it always has this pipelined approach. Each instruction has this phase of, it’s going to be fetched from memory, it’s decoded, it’s executed, and finally the result is written back. Now actually multiple of those things happen in parallel. While we decode our one instruction, the CPU will already go and fetch the next one. It’s a pipelined parallelized approach.
Of course, in order for this to work, the CPU actually needs to know what is this next instruction, because otherwise we wouldn’t know what to fetch. In order for it to know, we can’t really have any ifs, because then we wouldn’t know, which way will we go? Will we go left or right? If you have a way for expressing this problem in this branchless way, as we have seen it before, then this is very good, very beneficial for what’s called the branch predictor in the CPU, so it always knows which are the next instructions. We never have this situation that we actually need to flush this pipeline because we took a wrong path in this predictive execution. Very relevant for that. I didn’t really know much about those things, but people challenged it. One of the resources they employed a lot is this book, “Hacker’s Delight”. I recommend everybody to get this if this is interesting to you. Like this problem, like finding the first zero byte in a string, you can see it here. All those algorithms, routines are described in this book. If this is the thing which gets you excited, definitely check out, and get this book.
Then, Disaster Struck
Again, people were working on the challenge. It was going really well. They would send me pull requests every day, and I didn’t expect that many people to participate. That’s why I always went to the server and executed them manually. At some point, someday I woke up, I saw, that’s the backlog of implementations which people had sent over the night, so let’s say 10 PRs to review and to evaluate, and suddenly all those results were off. It was like twice as fast as before. I ran one of the implementations which I had run on the day before, and suddenly it was much faster. I was wondering, what’s going on? What happened is, this workload, I had it initially on a virtual server. I thought, I’m smart. I try to be smart, so I get dedicated virtual CPU cores, so I won’t have any noisy neighbors on that machine, this kind of thing.
What I didn’t expect is that they would just go and move this workload to another machine. I don’t know why. Maybe it was random. Maybe they saw there was lots of load in that machine. In any case, it just got moved to another host, which was faster than before. This, of course, was a big problem for me, because all the measurements which I had done so far, they were off and not comparable anymore. That was a problem. I was a bit desperate at that point in time. This is where the wonders of the community really were very apparent. Good things happened. I realized, I need to get a dedicated server so that this cannot happen again. I need to have a box, which I can use exclusively. As I was initially paying out of my own pocket for that, I thought, I don’t want to go there. I don’t want to spend €100. As I mentioned, Decodable, my employer, they stepped up to sponsor it.
Then, of course, I also needed help with maintaining that, because I’m not the big operations guy. I know a little bit about it. Then for that thing, you would, for instance, like to turn off the hyperthreading, or you would like to turn off turbo boost to have stable results. I wasn’t really well-versed in doing those things, but the community came to help. In particular, René came to help. He offered his help to set up the thing. We had a call. I spoke to him. I had not known him. It was the first time I ever spoke to him, but we had a great phone conversation. In the end, he just sent me his SSH key. I uploaded his key to the machine, gave him the key to the kingdom, and then he was going and configuring everything the right way. There were multiple people, many people like that, who came and helped, because otherwise I just could not have done it.
The 1BRC Community
All this was a bit of a spontaneous thing. Of course, I put out some rules on how this should work. Then, I wasn’t very prescriptive. Like, what is the value range? How long could station names be? What sort of UTF character planes and whatnot? I didn’t really specify it. Of course, people asked, how long can a station name be? What kinds of characters can it contain, and so on? We had to nail down the rules and the boundaries of the challenge. Then people actually built a TCK, a test kit. It was actually a test suite which you then had to pass. Not only do you want to be fast, you also want to be correct. People built this test suite, and it grew, actually, over time. Then whenever a new submission, a new entry came in, it, first of all, had to pass those tests. Then if it was valid, then I would go and evaluate it and take the runtime. This is what this looked like. You can see it here.
It had example files with measurements, and an expected file: what should be the result for that file? Then the test runner would go process the implementation against that set of files, and ensure the result is right. That’s the test kit. The other thing, which was also very important: I had to run all those things on that machine. There are quite a few things which were related to that, like just making sure the machine is configured correctly. Then, I had five runs, and I want to discard the fastest and slowest, all those things. Jason, here, he came to help and scripted all that. It was actually very interesting to see how he did it. I would really recommend going to the repo and just checking out the shell scripts which exist, which are used for running those evaluations. It’s a bit like a master class in terms of writing shell scripts, with very good error handling, colored output, all this good stuff to make it really easy and also safe to run those things. If you have to do shell scripting, definitely check out those scripts.
Bookkeeping
Then, let’s talk about one more thing, which is also very relevant, and this is what I would call bookkeeping. If you remember the initial code I showed, I had this Java Streams implementation, and I used this collector for grouping the values into different buckets, per weather station name. People realized, that’s another thing which we can optimize a lot ourselves. By intuition, you would use a HashMap for that. You would use the weather station name as the key in that HashMap. Java HashMap is a generic structure. It works well for a range of use cases. Then, if we want to get the most performance for one particular use case, then we may be better off implementing a bespoke, specific data structure ourselves. This is what we can see here. I think it might sound maybe scary, but actually it is not scary. It’s relatively simple. What happens here? We say, we would like to keep track of the measurements per our station name. It’s like a map, but it is backed by an array, so those buckets.
The idea now is, we take the hash code of our station name and we use this as the index within that array, and at that particular slot in the array, we will manage the aggregator object for a particular station name. We take the hash code, and we want to make sure we don’t have an overflow. That’s why we take it with a logical AND with the size of the array. That way we always have a valid index into the array. Then we need to check, at that particular position in the array, is something there already? If nothing is there, that means we have the first instance of a particular station in our hands, so the first value for Hamburg or the first value for Munich. We just go create this aggregator object there and store it at that particular offset in the array. That makes sense. The other situation, of course, is we go to the particular index in the array, and in all likelihood, something will be there already. If you have another value for the same station, something will be there already.
The problem is we don’t know yet, is this actually the aggregator object for that particular key we have in our hands, or is it something else? Because multiple station names could hash to the same slot. Which means, in that case, if something exists already at this particular array slot, you need to fall back and compare the actual name. Only if the incoming name is also the name of the aggregator object in that slot, then we can go and add the value to that. Otherwise, we will just keep iterating in that array until we either have found a free slot, so then we can go install it there, or we have found the slot for the key which we have in our hands. That’s why it’s called linear probing. I think it’s relatively simple. Now for this particular case, this performs much better, actually, than what we could get with just using the Java HashMap.
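A hedged sketch of such an open-addressing table with linear probing; the table size, hash function, and names are illustrative, and the fastest entries used cheaper custom hashes and byte-level comparisons rather than String:

```java
static final int TABLE_SIZE = 1 << 17;                        // power of two, so masking works as modulo
static final StationStats[] TABLE = new StationStats[TABLE_SIZE];

static class StationStats {
    final String name;
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;
    long sum;
    int count;

    StationStats(String name) {
        this.name = name;
    }

    void add(int value) {                                     // value scaled to tenths of a degree
        min = Math.min(min, value);
        max = Math.max(max, value);
        sum += value;
        count++;
    }
}

static StationStats lookup(String station) {
    int index = station.hashCode() & (TABLE_SIZE - 1);        // keep the index inside the array
    while (true) {
        StationStats slot = TABLE[index];
        if (slot == null) {                                   // first value for this station: install it
            slot = new StationStats(station);
            TABLE[index] = slot;
            return slot;
        }
        if (slot.name.equals(station)) {                      // same name: this is our aggregator
            return slot;
        }
        index = (index + 1) & (TABLE_SIZE - 1);               // collision: probe the next slot
    }
}
```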
Of course, it depends a lot on the particular hash function here which we use to find that index. This is where it goes back to people really optimized a lot for the particular dataset, so they used hash functions which would be collision free for the particular dataset. This was a bit against what I had in mind, because the problem was this file, as I mentioned, it has a size of 13 gigabytes, and I just didn’t have a good way for distributing 13 gigabytes to people out there. That’s why, instead, they would have to generate it themselves. I had the data generator, and everybody could use this generator to create the file for themselves and then use it for their own testing purposes. The problem was, in this data generator, I had a specific key set. I had around 400 different station names with the idea being, that’s just an example, but people took it very literally, and they optimized then a lot for those 400 station names. They used hash functions, which would not have any collisions, for those 400 names. Again, people will take advantage of everything they can.
The problem with all that is it also creates lots of work for me, because you cannot really prove the absence of hash collisions. Actually, whenever somebody sent in their implementation, I had to go and check out, do they actually handle this case, the two stations which would create the same key, and do they handle those collisions accordingly? Because otherwise, if you don’t do this fall back to the slow case, you would be very fast, but you would be incorrect because you don’t deal correctly with all possible names. This was a bit of a trap, which I set up for myself, and it meant I always had to check for that and actually ask people in the pull request template, if you have a custom map implementation, where do you deal with collisions? Then we would have conversations like we see here. How do you deal with hash collisions? I don’t, that’s why it’s so fast. Then he would go and rework it. A big foot trap for myself.
GraalVM: JIT and AOT Compiler
Those are three big things, parallelization, then all this parsing with SIMD and SWAR, and custom hashmapping for bookkeeping. Those were recurring themes I saw again. Then there were more specific tricks, and I just wanted to mention a few of them. I just want to give you some inspiration of what exists. One thing which exists is the Epsilon garbage collector, which is a very interesting garbage collector because it doesn’t collect any garbage. It’s a no-op implementation. If you have your regular Java application, that would be not a good idea. Because you keep allocating objects, and if you don’t do any GC, you will run out of heap space at some point. Here, people realized, we can actually implement this in a way that we don’t do any allocations on our processing loop. We’ll do a few allocations initially when bootstrapping the program, but then later on, no more objects get created. We just have arrays which we can reuse, like mutable structures, which we can just update.
Then we don’t need any garbage collection, and we don’t need any CPU cycles to be spent on garbage collection, which means we just can be a bit faster. Again, I think that’s an interesting thing. Maybe, for instance, if you work on some CLI tool, a short-lived thing, it could be an interesting option to just disable the garbage collector and see how that goes. The other thing, which you can see here, is people used GraalVM a lot. GraalVM, it’s two things, really. It’s an ahead-of-time compiler, so it will take your Java program and emit a native binary out of it. This has two essential advantages. First of all, it uses less memory. Secondly, it’s very fast to start because it doesn’t have to do class loading and the compilation and everything, this all happens at build time. This is fast to start if you have this native binary. Now at the level of results we got here, this actually mattered.
Initially, I thought saving a few hundred milliseconds on startup wouldn’t make a difference for processing a 13-gigabyte file, but actually it does make a difference. Most of the fastest implementations actually used the AOT compiler with GraalVM. There’s also the possibility to use this as a replacement for the just-in-time compiler in your JVM. You just can use it as a replacement for the C2 compiler. I’m not saying you should always do this. It depends a little bit on your particular workload and what you do, whether it’s advantageous or not. In this case, this problem, it lends itself very well to that. Just by using GraalVM as the JIT compiler in the JVM, people got a nice improvement of like 5% or so. It’s something I can recommend for you to try out, because it’s essentially free. You just need to make sure you use a JVM or a Java distribution which has GraalVM available as the C2 compiler replacement. Then it’s just a means of saying, that’s the JIT compiler I want to use, and off you go. Either it does something for you or it does not.
Further Tricks and Techniques
A few other things, like Unsafe. What I found interesting is the construct here on the right-hand side, because if you look at that, this is our inner processing loop. We have a scanner object. We try to take the next values. We try to process them, and so on. What we have here is we have the same loop three times in a program which is written up in a sequential way. If you look at it, you would say, those three loops, they run one after another. What actually happens is, as the CPUs have multiple execution units, the compiler will figure out, this can actually be parallelized, because there are no data dependencies between those loops. This is what happens, we can take those loops and run them concurrently. I found it very interesting. Why is it three times? Empirically determined.
Thomas, who came up with this, tried having the loop two times and four times, and three times was just the fastest on that particular machine. It could be different on other machines. Of course, you see already here with all those things, this creates questions around maintainability. Because I already can see the junior guys joining the team, and they’re like, “That’s duplication. It’s like the same code three times. Let me go and refactor it. Let me clean it up”, and your optimization would be out of the window. You would want to put a comment there, don’t go and refactor it into one loop. That’s the consideration. Are those things worth it? Should you do it for your particular context? That’s what you need to ask. I found this super interesting, that this is a thing.
The Results
You are really curious then, how fast were we in the end? This is the leaderboard with those 8 CPU cores I initially had. I had 8 CPU cores because that was what I had with this virtual server initially. When I moved to the dedicated box, I tried to be in the same ballpark. With those 8 cores, we went down to 1.5 seconds. I would not have expected that you could go that fast with Java, processing 13 gigabytes of input in 1.5 seconds. I found that pretty impressive. It gets better because I had this beefy server with 32 cores and 64 threads with hyperthreading. Of course, I would like to see, how fast can we go there? Then we go down to 300 milliseconds. To me, it’s like doubly mind blowing. Super impressive. Also, people did this, as I mentioned, in other languages, other platforms, and Java really is very much at the top, so you wouldn’t be substantially better with other platforms.
The other thing: there was another evaluation, because I mentioned I had this data generator with those 400-something station names, and people optimized a lot for that by choosing specific hashing functions and so on. Some people realized that actually, this was not my intention. I wanted to see, how fast can this be in general? Some people agreed with that view of the world. For those, we had another leaderboard where we actually had 10k different station names. As you can see here now, it’s actually a bit slower, because you really cannot optimize that much for that dataset. Also, it’s different people at the top here. If I go back, here we have Thomas Würthinger, and people who teamed up with him for the regular key set, and then for the 10k key set, it’s other people. It’s different tradeoffs, and you see how much this gets specific for that particular dataset.
It’s a Long Journey
People worked on that for a long time. Again, the challenge went through the entire month of January. I didn’t do much besides running it really. People like Thomas, who was the fastest in the end, he sent 10 PRs. There were other people who sent even more. The nice thing was, it was a community effort. People teamed up. As I mentioned before, like the fastest one, it was actually an implementation by three people who joined forces and they took inspiration. When people came up with particular tricks, then very quickly, the others would also go and adopt them into their own implementation. It was a long journey with many steps, and I would very much recommend to check this out.
This is, again, the implementation from Roy Van Rijn, who was the first one, because he kept this very nice log of all the things he did. You see how he progressed over time. If you go down at the very bottom, you will see, he started to struggle a bit because he did changes, and actually they were slower than what he had before. The problem was he was running on his Arm MacBook, which obviously has a different CPU with different characteristics than the machine I was running this on. He saw improvements locally, but it was actually slower on the evaluation machine. You can see it at the bottom, he went and tried to get an Intel MacBook, to have better odds of doing something locally which then also performs better on that machine. I found it really surprising to see this happening with Java, that we get down to this level where the particular CPU and even its generation would make a difference here.
Should You Do Any of This?
Should you do any of this? I touched on this already. It depends. If you work on an enterprise application, I know you deal with database I/O most of the time. Going to that level and trying to avoid CPU cycles in your business code probably isn’t the best use of your time. Whereas if you were to work on such a challenge, then it might be an interesting thing. What I would recommend is, for instance, check out this implementation, because this is one order of magnitude faster than my baseline. This would run in 20 seconds or so. It’s still very well readable, and that’s what I observed: improving by one order of magnitude, we still have very well readable code. It’s maintainable.
You don’t have any pitfalls in this. It just makes sense. You are very much faster than before. Going down to the next order of magnitude, so going down to 1.5 seconds, this is where you do all the crazy bit-level magic, and you should be very conscious of whether you want to do it or not. Maybe not in your regular enterprise application. If you participate in a challenge and you want to win a coffee mug, then it might be a good idea. Or if you want to be hired into the GraalVM team, I just learned this the other day, actually, some person who goes by the name of Mary Kitty in the competition, he actually got hired into the GraalVM compiler team at Oracle.
Wrap-Up, and Lessons Learned
This impacted the Java community, but then also people in other ecosystems and databases; at Snowflake they had a One Trillion Row Challenge. This really blew up and kept people busy for quite a while. There was this show and tell in the GitHub repo. You can go there and take a look at all those implementations in Rust, and OCaml, and all the good things I’ve never heard about, to see what they did in a very friendly, competitive way. Some stats: you can go to my blog post there, and you will see how many PRs there were, 1900 workflow runs, so quite a bit of work, and 187 lines of comments in Aleksey’s implementation. Really interesting to see. In terms of lessons learned there, if I ever want to do this again, I would have to be really prescriptive in terms of rules, automate more, and work with the community as happened already this time. Is Java slow? I think we have debunked that. I wouldn’t really say so. You can go very fast. Will I do it again next year? We will see. So far, I don’t really have a good problem which would lend itself to doing that.
See more presentations with transcripts
Logic App Standard Hybrid Deployment Model Public Preview: More Flexibility and Control On-Premise
MMS • Steef-Jan Wiggers
Article originally posted on InfoQ. Visit InfoQ
Microsoft recently announced the public preview of the Logic Apps Hybrid Deployment Model, which allows organizations to have additional flexibility and control over running their Logic Apps on-premises.
With the hybrid deployment model, users can build and deploy workflows that run on their own managed infrastructure, allowing them to run Logic Apps Standard workflows on-premises, in a private cloud, or even in a third-party public cloud. The workflows run in the Azure Logic App runtime hosted in an Azure Container Apps extension. Moreover, the Hybrid deployment for Standard logic apps is available and supported only in the same regions as Azure Container Apps on Azure Arc-enabled AKS. However, when the offering reaches GA, more regions will be supported.
Kent Weare, principal PM for Logic Apps at Microsoft, writes:
The Hybrid Deployment Model supports a semi-connected architecture. This means that you get local processing of workflows, and the data processed by the workflows remains in your local SQL Server. It also provides you the ability to connect to local networks. Since the Hybrid Deployment Model is based upon Logic Apps Standard, the built-in connectors will execute on your local compute, giving you access to local data sources and higher throughput.
(Source: Tech Community Blog Post)
Use cases for the hybrid deployment model are threefold, according to the company:
- Local processing: BizTalk Migration, Regulatory and Compliance, and Edge computing
- Azure Hybrid: Azure First Deployments, Selective Workloads on-premises, and Unified Management
- Multi-Cloud: Multi-Cloud Strategies, ISVs, and Proximity of Line of Business systems
The company’s new billing model supports the Hybrid Deployment Model, where customers manage their Kubernetes infrastructure (e.g., AKS or AKS-HCI) and provide their own SQL Server license for data storage. There’s a $0.18 (USD) charge per vCPU/hour for Logic Apps workloads, allowing customers to pay only for what they need and scale resources dynamically.
InfoQ spoke with Kent Weare about the Logic App Hybrid Deployment Model.
InfoQ: Which industries benefit the most from the hybrid deployment model?
Kent Weare: Having completed our private preview, we have seen interest from various industries, including Government, Manufacturing, Retail, Energy, and Healthcare, to name a few. The motivation of these companies varies a bit from use case to use case. Some organizations have regulatory and compliance needs. Others may have workloads running on the edge, and they want more control over how that infrastructure is managed while reducing dependencies on external factors like connectivity to the internet.
We also have customers interested in deploying some workloads to the cloud and then deploying some workloads on a case-by-case basis to on-premises. The fundamental value proposition from our perspective is that we give you the same tooling and capabilities and can then choose a deployment model that works best for you.
InfoQ: What are the potential performance trade-offs when using the Hybrid Deployment Model compared to fully cloud-based Logic Apps?
Weare: Because we are leveraging Logic Apps Standard as the underlying runtime, there are many similar experiences. The most significant difference will be in the co-location of your integration resources near the systems they are servicing. Historically, if you had inventory and ERP applications on-premises and needed to integrate those systems, you had to route through the cloud to talk to the other system.
With the Hybrid Deployment Model, you can now host the Logic Apps workflows closer to these workloads and reduce the communication latency across these systems. The other opportunity introduced in the hybrid deployment model is taking advantage of more granular scaling, which may allow customers to scale only the parts of their solution that need it.
MMS • Lakshmi Uppala
Article originally posted on InfoQ. Visit InfoQ
Transcript
Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today, I’m sitting down with Lakshmi Uppala. Lakshmi, welcome. Thanks for taking the time to talk to us.
Lakshmi Uppala: Thank you, Shane. It’s my pleasure.
Shane Hastie: My normal starting point in these conversations is, who’s Lakshmi?
Introductions [01:03]
Lakshmi Uppala: Yes, that’s a great start. So, professionally, I am a product and program management professional with over 12 years of experience working with Fortune 500 companies across multiple industries. Currently, I’m working with Amazon as a senior technical program manager, and we build technical solutions for recruiters in Amazon. Now, as you know, Amazon hires at a very, very large scale. We are talking about recruiters processing hundreds of thousands of applications and candidates across job families and roles hiring into Amazon. Our goal is to basically simplify the recruiter operations and improve the recruiter efficiency so that they can spend their time on other quality work.
What sets me apart is my extensive background in very different types of roles across companies and driving impactful results. So, in my experience, I’ve been a developer for almost seven years. I’ve been a product head in a startup. I’ve been leading technical teams at companies in India, and now at Amazon. Over the last seven years or so, I’ve had the privilege of driving large-scale technical programs for Amazon, and these include building large-scale AI solutions and very technically complex projects around tax calculation for Amazon.
Outside of work, outside of my professional world, I serve as a board member for Women at Amazon, DMV Chapter, and Carolina Women+ in Technology, Raleigh Chapter. Both of these are very close to me because they work towards the mission of empowering women+ in technology and enabling them to own their professional growth. So that’s a little bit about me. I think on my personal front, we are a family of three, my husband, Bharat, and my 11-year-old daughter, Prerani.
Shane Hastie: We came across each other because I heard about a talk you gave on a product value curve. So tell me, when we talk about a product value curve, what do we mean?
Explaining the product value curve [03:09]
Lakshmi Uppala: Yes. I think before we get into what we mean by product value curve, the core of this talk was about defining product strategy. Now, product strategy at its core is basically about identifying what the customer needs in your product. So that’s what product strategy is about, and the product value curve is a very successful and proven framework for building an effective product strategy.
Shane Hastie: So figuring out what your customer really needs, this is what product management does, and then we produce a list of things, and we give them to a group of engineers and say, “Build that”, don’t we?
Lakshmi Uppala: No, it’s different from the traditional approach. Now, traditionally and conventionally, one way to do product strategy or define product strategy and product roadmap is to identify the list of hundred features which the customer wants from your product and create a roadmap saying like, “I will build this in this year. I will build this in the next year and so on”, give it to engineering teams, and ask them to maintain a backlog, and build it. But value curve is a little different. It is primarily focused on the value, the product offers to customers. So it is against the traditional thinking of features, and it is more about what value does it give as end result to the user.
For example, let’s say we pick a collaboration tool as a product. Right? Now, features in this collaboration tool can be, “Visually, it should look like this. It should allow for sharing of files”. Those can be the features. But when you think of values, it is a slight change in mindset. It is about, “Okay. Now, the customer values the communication to be real time”. That is a value generated to the customer. Now, the customer wants to engage with anybody who is in the network and who is in their contact list. Now, that is a value.
So, first, within this value curve model, a product manager really identifies the core values which the product should offer to the customers, understands how these values fit in with their organizational strategy, and then draw this value curve model. This is built as a year-over-year value curve, and this can be given to the engineering teams as a… The subsequent thing is to define features and then give it to engineering teams. So it is purely driven by values and not really the features which are to be built on the product. That’s the change in the mindset, and it is very much essential because finally, customers are really looking for the value. Right? It’s not about this feature versus that feature in the products. So this is essential.
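As a rough editorial illustration of that mindset shift (not a tool Uppala describes), a value curve can be represented as value dimensions scored year over year rather than as a feature list; the dimension names and scores below are invented.

```typescript
// Hypothetical sketch: a value curve as scored value dimensions, year over year.
// Dimension names and scores are invented for illustration only.

type ValueCurve = {
  year: number;
  scores: Record<string, number>; // value dimension -> relative emphasis (0-10)
};

const roadmap: ValueCurve[] = [
  { year: 1, scores: { "real-time communication": 8, "network reach": 4, "security & trust": 5 } },
  { year: 2, scores: { "real-time communication": 9, "network reach": 7, "security & trust": 6 } },
  { year: 3, scores: { "real-time communication": 9, "network reach": 9, "security & trust": 8 } },
];

// Where is the biggest planned increase in value next year?
function biggestShift(from: ValueCurve, to: ValueCurve): string {
  let best = "";
  let bestDelta = -Infinity;
  for (const dim of Object.keys(to.scores)) {
    const delta = to.scores[dim] - (from.scores[dim] ?? 0);
    if (delta > bestDelta) {
      bestDelta = delta;
      best = dim;
    }
  }
  return `${best} (+${bestDelta})`;
}

console.log(biggestShift(roadmap[0], roadmap[1])); // "network reach (+3)"
```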
Shane Hastie: As an engineer, what do I do with this because I’m used to being told, “Build this?”
Engaging engineers in identifying value [06:02]
Lakshmi Uppala: Yes. When we look at value curves and their significance for engineering teams, there are multiple ways this can benefit them. One is the right way of backlog grooming and prioritization. Yes, there is a specific way of doing backlog grooming with features too, but when you translate this into a value-based roadmap, the engineers and engineering teams are better able to groom the backlog, and they're able to make prioritization decisions in a more effective way, because values are, again, going to translate into the success of the product.
Second is the design and architectural decisions. In the traditional model, because work arrives as a list of features, if an engineer is building for the features, the design and architectural choices are going to be based on the features at hand. But thinking in terms of value changes their thinking to build for the longer-term value the product should offer to customers, and that changes the design and architectural decisions.
Third is prototyping and experimentation: engineers can leverage the value curves to rapidly prototype and test design alternatives, which is much harder with a feature-based roadmap definition. One other thing is cross-functional alignment. This is one of the major problems engineering teams encounter when there are dependencies on other teams to build a specific feature in a product. It is very hard to get buy-in from the dependency teams to prioritize the work in their plan when it is based on features.
But when you think in terms of values and value curves, you get a very easy visual representation of where we are trending in terms of value generation to customers and how our incremental delivery is going to be useful in the longer run, which is very helpful for driving cross-functional alignment on prioritization. Having been an engineering manager in Amazon for a year, I know that this is one of the major problems leaders face in getting cross-team alignment for prioritization. All of these can be solved by a good visual representation using value curves and defining value-based roadmaps.
Shane Hastie: So approaching the backlog refinement, approaching the prioritization conversation, but what does it look like? So here I am. We’re coming up to a planning activity. What am I going to do?
Prioritizing by value [08:54]
Lakshmi Uppala: Okay. So when the product manager defines the value curves, let's say they define them in incremental phases; that's how I would recommend approaching this. As a product manager, I would define what my product is going to look like three years down the line, again in terms of the values, not just the features. I'm going to define, "Okay, if I'm going to be there in three years, then in year one this is the value I'm going to focus on, and in year two this is the value I'm going to focus on". That is going to give a year-on-year product roadmap for the team to focus on.
Now, translating that into tasks or deliverables which engineers can actually take on is a joint exercise between product managers, engineering leaders, and senior engineers: translating these values into features for the product. Again, we are not driving this from features, we are driving it from values. Given this value, what are the 10 features I need to build? Creating technical deliverables from those features is the refined approach to backlog grooming and planning, compared with old-fashioned backlog grooming.
Shane Hastie: Shifting slant a little bit, your work empowering women in tech, what does that entail? How are you making that real?
Empowering women in tech [10:21]
Lakshmi Uppala: Yes. It’s a great question, and as I said, it’s very close to me and my heart. So, as I said, I am part of two organizations, one within Amazon itself and one which is Carolina Women in Tech, Raleigh Chapter, but both of these are common in terms of the initiatives we run and the overall mission. So, within the Women at Amazon group, we do quarterly events which are more focused around specific areas where women need more training, or more learning, or more skill development. We welcome all the women+ allies in Amazon to attend those and gain from them.
Similar in Carolina Women in Tech, but the frequency is more regular. Once in a month, we host and organize various events. Just to give you a flavor of these events, they can range from panel discussions by women leaders across industries and across companies, and these could be across various types of topics. One of the things which we recently did in Amazon was a panel discussion with director level women leaders around the topic of building personal brand. These events could also be speed mentoring events. Like last year when we did this in Amazon, we had 150-plus mentees participate with around 12 women and men leaders, and this is like speed mentoring. It’s like one mentor to many mentees across various topics. So it’s like an ongoing activity. We have multiple such initiatives going on around the year so that we can help women in Amazon and outside.
Shane Hastie: If we were to be giving advice to women in technology today, what would your advice be? Putting you on the spot.
Advice for women in technology [12:06]
Lakshmi Uppala: Okay. I was not prepared for this. There are two things which I constantly hear at these events. One of them is that women have this constant fear and imposter syndrome a lot more than men. When women in tech, or any other domain, are in conversations or meetings with men or other counterparts, they generally tend to take a step back, thinking that they're less knowledgeable in the area, and they don't voice their opinions. I would encourage women to believe in themselves, to believe in their skills, and to be vocal and speak up where they need to.
Second is about skill development. One of the other things I've noticed, which is even true for me, is that while juggling multiple commitments, personal, family, and professional, we give very little importance to personal skill development. I think that is very essential to grow and to stay up to date in the market. So continuous learning and skill development is something which everybody, not just women, but more importantly women, should focus on and invest their time and energy in. So those are the two things.
Shane Hastie: If somebody wants to create that Women in Tech type space, what’s involved?
Establishing a Women in Tech community [13:29]
Lakshmi Uppala: I think it’s a combination of things. One is definitely if somebody is thinking of creating such a group within their organization, then definitely, the culture of the organization should support it. I think that’s a very essential factor. Otherwise, even though people might create such groups, they will not sustain or they’ll not have success in achieving their goals. So organizational culture and alignment from leadership is definitely one key aspect which I would get at the first step.
Second is getting interested people to join and contribute to the group because this is never a one-person activity. It’s almost always done by a group of people, and they should be able to willingly volunteer their time because this is not counted in promotions, this is not counted in your career growth, and this is not going to help you advance in your job. So this is purely a volunteer effort, and people should be willing to volunteer and invest their time. So if you’re able to find a group of such committed folks, then amazing.
Third, coming up with the initiatives. I think it's very tricky, and it's also, to some extent, organization-specific. So create a goal for the year. For Women at Amazon, DMV Chapter, we create a goal at the beginning of the year saying, "This is the kind of participation I should target for these kinds of events, these are the number of events I want to run, and these are the number of people I want to impact". Coming up with goals, and then thinking through initiatives which work well with the organizational strategy, is the right way to define and execute them. If they're not aligned with the organizational culture and strategy, then you might run them for some iterations, but they'll not create impact, and they'll not be successful. That's my take.
Shane Hastie: Some really interesting content, some good advice in here. If people want to continue the conversation, where do they find you?
Lakshmi Uppala: They can find me on LinkedIn. So a small relevant fact is I’m an avid writer on LinkedIn, and I post primarily on product program and engineering management. So people can definitely reach out to me on LinkedIn, and I’m happy to continue the discussion further.
Shane Hastie: Lakshmi, thank you very much for taking the time to talk to us today.
Lakshmi Uppala: Thank you so much, Shane. It was a pleasure.
MMS • Ben Linders
Article originally posted on InfoQ. Visit InfoQ
High-performing teams expect their leader to enable them to make things better, Gillard-Moss said at QCon London. Independence in software teams can enable decision-making for faster delivery. Teams need empathy, understanding, and guidance from their managers.
Something most driven and motivated engineers have in common is that they will have a longer list of all the things they want to improve than the things they believe are going well. Gillard-Moss mentioned that for some managers this is intimidating and makes them believe the team is being negative, when in reality this is a good thing. They are motivated to get these things solved. In return, they need you to create the opportunity to solve things.
If the team feels that you are not enabling them to solve things fast enough, then sentiment turns negative, Gillard-Moss argued. This is because they’ve stopped believing in you as someone who can help. So rather than bringing you problems to solve and wanting your help, they become complaints, burdens, and excuses:
If you have a team that is unable to make things better, and is stuck complaining that things aren’t getting better, then you do not have a high-performing team.
We should strive for independence in teams for faster and better decision-making, which leads to faster delivery and faster impact, Gillard-Moss said. Waiting for decisions is the single biggest productivity killer, and making decisions on poor information is the most effective way to waste money, he explained:
In technology, we need to make thousands of decisions a day. It’s unrealistic for someone far from the information to make high-quality decisions without blocking teams. And it doesn’t scale.
With low-performing teams, Gillard-Moss suggested analysing their cycle time. The vast majority of it is waiting for someone to provide information or make a decision, he mentioned. And then, when they do, the team either struggles to implement it or pursues a suboptimal solution, because the decision maker had an overly naive view of what needed to be done.
What teams need from managers is empathy, understanding, and guidance. Empathy comes from being able to think like an engineer, Gillard-Moss said, and understanding because you’ve been there and done that yourself. Guidance comes from a deep instinct for the universal fundamentals of engineering and how to apply them to get better results, he added, the evergreen wisdom and principles.
Gillard-Moss stated that a good engineering leader builds teams that can maximise impact by applying their expertise:
My experience as an engineer tells me that integrating early and often results in less delivery risk, and when to tell that’s not happening. It also tells me that’s sometimes easier said than done and the team might need help working through it. This gives me the patience and empathy to guide a team through the trade-offs in these difficult situations.
InfoQ interviewed Peter Gillard-Moss about managing high-performing teams.
InfoQ: How can managers help teams improve their cycle time?
Peter Gillard-Moss: There are so many factors that influence cycle time, from architecture to organisation design to culture to processes. The best thing any manager can do is observe the system and continuously ask, “Why did this take as long as it did? With hindsight how could we have improved the quality of this decision?” And then experiment with small changes. Little nudges. This is, after all, why we have retrospectives.
One example was a team who felt like they worked on a lot but nothing came out the other end. When we analysed the cards, we saw that they would keep moving back up the wall from Dev Complete back into Dev, or more cards would be created and the original card would be placed in Blocked or Ready to Deploy for weeks on end. What was happening was the stakeholder would specify the exact solution, literally down to fields in the database in the original card. The team would build it and then the QA would find edge cases. The edge cases would go back to the stakeholder who would then decide on the next steps, either adding new criteria to the original card or creating new ones.
Most of this was over email (because the stakeholder was too busy) and it was often missing context in both directions. When you gathered the history and context around the cards, it looked absurd: weeks would go by for simple stories that had long email chains connected to them. Despite the obvious inefficiencies, this is a pattern I've seen in many teams.
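To make "analyse the cycle time" concrete, here is a minimal editorial sketch (not from the interview) that breaks a card's elapsed time into time spent per column from its transition history, so waiting states like Blocked stand out; the column names and dates are invented.

```typescript
// Minimal sketch: break a card's cycle time into time spent per column.
// Transition data is invented for illustration.

interface Transition {
  column: string;   // column the card entered
  enteredAt: Date;  // when it entered that column
}

function timePerColumn(history: Transition[], doneAt: Date): Record<string, number> {
  const days: Record<string, number> = {};
  for (let i = 0; i < history.length; i++) {
    const end = i + 1 < history.length ? history[i + 1].enteredAt : doneAt;
    const spent = (end.getTime() - history[i].enteredAt.getTime()) / 86_400_000;
    days[history[i].column] = (days[history[i].column] ?? 0) + spent;
  }
  return days;
}

const card: Transition[] = [
  { column: "Dev", enteredAt: new Date("2024-03-01") },
  { column: "Dev Complete", enteredAt: new Date("2024-03-04") },
  { column: "Dev", enteredAt: new Date("2024-03-05") },     // bounced back
  { column: "Blocked", enteredAt: new Date("2024-03-07") }, // waiting on the stakeholder
  { column: "Ready to Deploy", enteredAt: new Date("2024-03-21") },
];

console.log(timePerColumn(card, new Date("2024-03-28")));
// { Dev: 5, "Dev Complete": 1, Blocked: 14, "Ready to Deploy": 7 }
```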
InfoQ: How can engineering leaders keep their engineering expertise up-to-date?
Gillard-Moss: You can’t. I’m really sorry but you can’t. It’s impossible. Once you realise this, it will liberate you and you will become a better leader.
Plus it’s not what teams need from you. The engineering expertise needs to be in the team with the people doing the work. Knowledge and expertise is a group asset, not an individual asset.
How do you think a team would perform if all the knowledge and expertise was only in their manager’s head? Every time a team gets stuck having to go to their manager to get the answer. How slow is that? How expensive is that? And the manager will burn out and complain that they don’t have time to focus on the important things.
Presentation: Beyond Platform Thinking at RB Global – Build Things No One Expects, in a Place No One Expects It
MMS • Ranbir Chawla
Article originally posted on InfoQ. Visit InfoQ
Transcript
Chawla: I took a brief sojourn as a young kid into the finance business, and determined that that wasn’t where I wanted to be. I had met a couple of people in the finance business that had some money. Back in the early ’90s, when the internet looked like it might actually not just be an academic exercise, raised some money with a couple of friends, bought one of the first ISPs in the U.S. Joined another couple of friends, built Earthlink, which ended up being the largest ISP in the United States. It was fun. It was a great journey.
It was amazing to work on a lot of things we worked on. I always tell people, try not to be successful in your first two startups, because it colors your vision of what’s to come next, because they don’t always work after that. I’ve been in a lot of diverse industries. I’ve done a lot of consulting. I was at Thoughtworks. I’ve had a lot of opportunity to see all kinds of architectures and all kinds of situations, from small companies to some of the Fortune 5. The Ritchie Brothers journey was a really interesting one.
Introducing RB Global
There’s a little subtitle to the talk, build things no one expects, in a place no one expects it. This phrase came up in a conversation about a year ago with a group of some of the folks at Thoughtworks that were still working on a project at Ritchie Brothers. One of them asked me how I had reached some of the success I might have reached in my career. I was trying to think about what that really meant, and trying to come up with a way to describe it. At least for me, the things that I’ve done is exactly that. Whatever the solution is, it’s not what the solution somebody expected, or it was something no one was looking forward to.
No one had on their bingo card to build this thing. No one was expecting the World Wide Web in 1994, 1995, or building an architecture like we did at Ritchie Brothers, in an industry where no one expects that architecture. That leads into, what is Ritchie Brothers. We’re the largest global auction marketplace in the world for heavy equipment, agricultural equipment, heavy trucking, things like that. Sounds boring, but it’s actually really very interesting and a lot of fun. We have deep relationships across all these industries, and deep relationships with construction companies, large agricultural firms, and a lot of small businesses. Our customers range from very small family farms all the way up to major construction firms, or firms that rent and lease construction equipment and then move that to our marketplace.
Things about Ritchie in the plus column, the good things about Ritchie Brothers. We care deeply about our customers. When you think about our sellers, if you go to one of our live auctions or even look online, we spend a lot of time on merchandising the equipment that we sell. It’s just a pickup truck, or it’s just a large dump truck, or it’s just an excavator. If you go and you see the care that we took in bringing that equipment in and cleaning that equipment and merchandising it and making it available to people. We deeply realize that a lot of these people that we work with, from a seller perspective, have built a firm as a small business or a medium-sized business that is their retirement, that is their assets. They don’t have a pension somewhere. They don’t have cash somewhere. They’ve got this equipment. They’ve reached the end point where they’re going to go relax and go to Florida or go somewhere and be done.
We’re there to get them the very best price on that equipment. That’s what they deserve. That’s what we want to care about them. When it comes to our buyers. We want to get them the best possible pricing. We want to give them accurate information. You’re making a very large investment in a piece of equipment that you may only have a few minutes to inspect. We’re going to give you a professional inspection. We’re going to take a lot of time to make sure you understand what you have. Again, if you also aren’t getting the best possible customer service, we’re going to make sure you have that. We look at our business. We’re a very white glove, very hands-on business on both sides of the transaction. That’s a real positive for us.
On the not so positive side, or the, what’s become difficult, or really, why are we here talking about this. Ritchie Brothers evolved into RB Global, and built that company through acquisition. We bought another company called Iron Planet, 7 years, 8 years ago, that helped us bring another way of being in the digital marketplace. They were very online at the point where we were very physically centric. Auctions were at an auction yard, an auctioneer calling out. Iron Planet was very digital. RB moved into the digital universe on their own. Actually, what happened is, you had those companies, now we have other companies within our sphere.
The integrations were never done. They were uncommitted. We would find ways to make them work. We now white glove our own internal operations. There’s a lot of manual steps. How do I bring in a bunch of trucks that I’m going to sell, or a bunch of heavy equipment I’m going to sell? Some of it needs to get sold through the Iron Planet sales formats, and some of them needs to get sold through a local auction yard. There’s a ton of white glove manual steps that go into moving everything from system to system. We call it doing arts and crafts.
How do I fake this process? How do I fake that construct? I asked one of our people what her dream for a successful integration was. She said, "I want to be able to do it from a dining room chair". I said, I don't understand what you mean. She said, "I don't want a swiveling office chair. I want a chair that doesn't turn, because my office chair lets me see three monitors, because IT gives me three monitors, because I'm in three different apps all the time. I want one screen. I want a boring, simple engagement". That's where we have that challenge of how to bring a better digital experience for our own customers, but also a better digital experience for all the people embedded in the company.
Architecture Perfect Storm
What did we find as we started this journey about 2 years ago? I came in, I was with Thoughtworks at the time. They came in and said, we want to do this digital transformation. We want to build a very large digital marketplace. We want to integrate all our companies. We want to do this amazing thing. Here’s what we found. We call it the architecture perfect storm. If you look at across all the systems we had, there were at least three of everything. Whatever domain you picked, there was three of it: three merchandising, three this, three asset management systems, three billing. All the different pieces all over the place. What was maybe worse, was that each one of these was believed to have to be in sync all the time. We’re going to talk about some of the models and things that we found. There was no system of record for anything in particular.
Even worse was all that interconnectivity. Everything's in threes for some reason. If you think about the modes of intercommunication, we had a really interesting combination of literal database sharing, Kafka, and Boomi between all the different systems, any one of which could fail at any particular time. We had what everyone was calling an event-driven architecture, through Boomi and Kafka, which was really data synchronization, data propagation. It was not event-driven architecture. There is a long-standing Teams chat where a group of people came together three and a half, almost four years ago, because they noticed auction data wasn't propagating online between these two systems: we'll get in a Teams chat, we'll bring people together when we see this, and we'll be able to go and fix it. That was supposed to be temporary. It's been four years. Sometimes people ask me what my goal is for this endeavor, and how I know when I'm done. I shut that chat down. To me, that's a sign of success.
Technical Myth Understandings
Many technical myth understandings. Why does our electronic bidder number only have five digits? We need more digits; we run out of numbers. We had a reason. There was a reason. You talk to 7 different people, you get 11 different answers for why we have this. What happens if we change it? I was told if we change it, this thing breaks. No, that's the thing that breaks. It's layer upon layer of puzzlement about how this even happened. My other favorite: we can't schedule an inspection unless it's tied to an event. In Ritchie Brothers' domain language, an event means an auction. That's what that means. I can't inspect something if it doesn't have an auction as its destination. It turns out, we run a good business of inspecting equipment for companies, because we're good at it. They just want to get this item inspected, or this fleet of excavators inspected, so they understand the value, and maybe what they have to do to fix them.
We have to literally create a fake auction in the system for that to happen. Events have IDs, which are a small collection of numbers, a sale ID. It's possible now to run out of space. It feels like I'm doing COBOL; I've literally run out of space for an ID. Now they say, we've got to get rid of events. This is going to be a major rewrite. The team that builds the inspection service says, no, it's not, we'll just take the events out. Everyone's like, what do you mean, just take the events out? You told us to put them in. They're actually not necessary. We've been doing that arts and crafts for 4 years, and it wasn't necessary? Who said to put it in? I don't know, it's so long ago we forgot. I call this the cacophony of voices. There are many myths.
There are many people with a different version of the story. You've got a key system that constructs the online bids. It constructs online listings by listening to five Kafka topics and hoping to synchronize them. When they get out of sorts, that's what we had the years-long chat for. We hadn't even gotten to the point of asking, what if we had just one? Timing. We have timing issues. We talked about the data synchronization. It's a big mess. When we got there, it was what I lovingly call the architecture of errors.
The Architect’s New Clothes
Here’s the thing, nobody wants to point these things out. We’re all going to make do. There was nobody before just raising their hand saying, this has got to stop, or this is the wrong thing. When you put people individually in a room, they would tell you, of course, this is nuts. I knew better than that. You couldn’t go out and say that, because it would be very unpopular to say that, or this team might get insulted, or this boss did that before, and I don’t even know who’s responsible. We had to go through and peel layers of the onion off.
We got to the point where we just said, you know what we're going to do? We're just going to start with what we think the right way forward is, from an architecture perspective. We'll deal with these things when we get to them, because it was impossible. After six months of trying to understand everything, we realized we hadn't moved forward at all. We were just trying to understand. At some point you just say, we're going to approach this as if we were building it, and then we're going to see what we have to do as we walk through the system.
Time to Dig In – Building the Plan
What do we have to do to dig in and get through all this? The important part here: we had to understand the business model. There were a lot of people in our organization who wanted to treat this as an e-commerce exercise. There were comments of, we could extend Shopify. Shopify would work. Why don't we just do Shopify? Let's do a simple e-commerce architecture. We're not e-commerce. There is no single thing that we sell that is identical to another thing. I might sell a bunch of Cat 720s, but each one of them is inherently different. It has different hours, different condition. It's been used in different kinds of work. This one needs painting. That one is different. This one doesn't have the same value. They're not the same thing. Our inventory is not acquired from a single source, like an e-commerce inventory would be. We have some inventory that's in our auction yards.
One of the things that Iron Planet specializes in is selling things at your location. You've got a big construction site, we'll sell it from there. You don't have to transport it to an auction yard. My inventory is distributed across the world. It's not identical. It doesn't have a common merchandising function or a common pricing function. It's not e-commerce. It is an online marketplace, though. There are domains that are similar to anything somebody who's been in the space would understand, but it's not e-commerce. Again, part of that early discovery process was treating it like e-commerce and realizing we had gone down a path where it just didn't fit. The architectures were not coherent. At some point, you have to step back and match the business model. Something I always say: the architecture serves the business, not the reverse.
The first thing we had to do is start thinking about, what are the key domains in our system? What do we do? Effectively, at RB Global, we take something from person A and we facilitate a transaction to organization B. We collect money, we calculate our take, and we send the rest back to the other person. That's what we do. Ironically, it's a lot closer to a distribution model, or, as an ex-stockbroker, I'd say it's very similar to the finance model. I'm going to help you get this thing, I'm going to take my slice, and I'm going to deliver it to you over here. You have to think about what those domains are, understand them, and really try not to model something that doesn't represent your business. The other thing we discovered was that we were also in transition to becoming a product-driven company. Really, what does that mean?
That was a very nascent journey that we’re still on together. Having that connectivity with the business and helping the business change and understand, again, remember I talked about, we’re very white glove, we’re very manual. That’s going to change. The business is moving. The business is undergoing change. The tech is undergoing change. Who is the translation unit between those? Our product people were the deep translation layer. One of the things I tell people is, when you’re going to be in a product-oriented environment, when you’re thinking about domains, when you’re thinking about platforms, one of the things you want to think about is, what is your communication path between the business and your teams? How do your teams communicate with each other? How do you communicate with the business?
We leveraged our product managers within these teams, and our technical product managers way down at the deep technical levels to be that interconnection point. I would often say that we wanted our teams to be isolated and focused in their domain and their context. The product people were that little integrated circuit pin. There was one pin to that whole API, and they would speak to the other team, whether it was a business team, whether it was a technical team. The product folks helped us, along with our staff engineers, and the architecture team, keep that all in alignment, and keep it all driving towards the same end goal.
This is the key, we fought for radical simplicity. Everything we did at Ritchie is massively complicated. I often make this joke that if you’re going from your main bedroom to your main bathroom, we would go through our neighbor’s house. Every process we do is intensely complicated. How do we break it down to its most simplified process is a struggle. One of the things that we realized was, especially at the technical level on this, don’t rush it. Incrementally think about it. I am very much an XP person. I always prefer code over documentation. This is one of those times when drawing it a couple of times was worth taking the time, before even coding it, sometimes. Just thinking about it. One of the things I like to talk about a lot, I think we need more thinking time.
I reference at the end of the presentation, there’s a great YouTube video of Toto Wolff being interviewed by Nico Rosberg. Toto Wolff is the principal of Mercedes Formula 1 team. One of the things he says in there is, we don’t take enough time, we have our phones. We don’t take nearly enough time to just stare out a window and think. We are all so busy trying to accomplish things. What I tell people when it comes to architecture, when it comes to trying to simplify or organize complex systems, walk more, think more, get away from the screen, because it’ll come to you in the shower. It’ll come to you walking the dog. It’ll come to you doing something else. The last thing is just question everything. You have the right as technical leaders to politely question everything. Are you sure I absolutely have to do that? Is that a requirement or is that a myth? Dig into it.
People always ask me, my CEO asks me all the time, how are we going to be done so fast? We’re going to write less code. They look at you like, what do you mean? The less work I do, the faster it gets done. Can I get it done with that little amount of code? That would be a victory. There’s also the, I spent this much money, I should get more. We’re not doing lines of code. It’s not Twitter. It’s not SpaceX. We don’t do lines of code. We do business outcomes. If we get the business outcome with the least amount of work, that’s the win.
The last thing too, is that we really tried to value fast feedback. We built a lot of our architecture and our platform for the sole purpose of shipping every day. We can’t always ship to production every day, but we ship internally every single day. Our product people get really tired of having to review code and review outputs every day, but that’s what we do. Get an idea. How do we get from ideation to somebody’s hands to take a look at, and get that fast feedback. Is this what you meant? Back to that old, simple process.
What Is Your Unfair Competitive Advantage?
One of the things we talk about a lot at Ritchie is, what's our unfair competitive advantage? I'm a big car racing fan. Gordon Murray is a famous F1 designer from the '70s and '80s. He's got his own company now, Gordon Murray Automotive. This is some slices of his new hypercar, the T.50. Professor Murray was the very first guy to basically say, I'm going to put a fan on the back of an F1 car to get ground effects, literally suck the air out from under the car, and have it go around the track even faster. The reason I bring this up is, as soon as he was winning everything doing that, they banned the fan.
That's not fair. We didn't all think of that. The rules don't ban it now, but they will next year. For that year, they won everything. Not everything, but most everything. They were ahead of all the other teams because of that. Even in the world of tech, if you think of something and you can do something, somebody will catch up to you. For the time that you have it, figure out what your unfair competitive advantage is and dig into it as your business. What can you instantiate in your architecture that matches your business? That's your unfair advantage.
Organizing the Complexity
I’m going to talk about organizing the complexity. I’m going to dig in a little bit to some of the basic principles we used here to get through all this. Borrowed this slide from my team at Thoughtworks. This is one of our favorite slides. If you look at the left-hand side, you see all the systems and all the complexity. Every system talking to every system. All the connectivity is really complicated. All the things are very spider web. What we want to do here is get to the right-hand side. Think about our key domains at the bottom. Think about our key APIs, our key services, our key systems.
How do we build something composable? How do we make clean interfaces between all the systems? How do we stop the data sharing? Is there one place that does asset management? Is there one place that does consignment? Is there one place that does invoices? Period. How do I compose them to build different experiences, whether it’s mobile, whether it’s an online external API, all of that composition, with clean boundaries. That was the goal. You want to think about retaining your deep knowledge. Are you going to move things to this new platform? Are you going to rebuild them?
In the amount of time we have, we have a lot to prove to our investors, our board, and our own company. We're not going to rebuild everything. There are systems that we can use and keep. Part of our first adventure in that multiple-system universe was taking a lot of those multiple systems and, in the first year, getting each capability down to one system responsible for it. In the monolithic world we were in, sometimes that one system was actually responsible for eight things, but it was the only one responsible for any particular thing. Then we could decide which of those pieces we are going to build an API around, so that we can keep them, and which pieces we are going to strangle out later. It's important in your architecture to know that and to be able to explain why you chose what you chose.
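As a hedged sketch of that "right-hand side" idea (one owner per capability, composed behind clean interfaces), the interfaces and names below are hypothetical, not RB Global's actual services.

```typescript
// Hypothetical sketch of composable domain boundaries: each capability has exactly
// one owning service, and experiences are composed from those interfaces.

interface AssetService {
  getAsset(assetId: string): Promise<{ id: string; description: string }>;
}

interface ConsignmentService {
  getConsignment(contractId: string): Promise<{ contractId: string; assetIds: string[] }>;
}

interface InvoiceService {
  getInvoice(invoiceId: string): Promise<{ invoiceId: string; totalMinorUnits: number; currency: string }>;
}

// An experience (web, mobile, external API) composes the domains; it never
// reaches into another system's database or re-implements its logic.
class CheckoutExperience {
  constructor(
    private assets: AssetService,
    private consignments: ConsignmentService,
    private invoices: InvoiceService,
  ) {}

  async summary(contractId: string, invoiceId: string) {
    const consignment = await this.consignments.getConsignment(contractId);
    const items = await Promise.all(consignment.assetIds.map((id) => this.assets.getAsset(id)));
    const invoice = await this.invoices.getInvoice(invoiceId);
    return { items, invoice };
  }
}
```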
Because, again, on the business side of architecture, the unfortunate reality is that you're always searching for funding. You're always trying to get somebody to keep paying you to get this job done, or to keep investing in it. A lot of these complex enterprise transformations or detanglements fail. People ask all the time, why do they fail? My snarky answer usually is lack of executive courage and follow-through. That happens because you didn't give the executives the inputs to see that it was worth continuing to fund, so people say, "It never happened. I gave up". Which pieces do you move now? How do you think about your roadmap? There are things that we're going to use in 2024 that will be gone in 2025, but we're going to use them now. Again, think about the deep knowledge. Think about the things not worth changing. Make sure you mark them down and understand them.
I think about bounded context, and I wanted to find a nicer way to talk about that domain-driven design concept of a bounded context. Have clean interfaces around complexity. One of the best ways for me to explain this, relevant to Ritchie Brothers, is we have something we call our taxonomy. For us, what taxonomy means is how you organize all the different types of equipment we sell. We sell everything from small excavators to huge dump trucks, to Ford transit vans, to Ford F-150s. How do you organize all that data in a catalog? It turns out, it’s mostly tree based, if you really think about it. Knowing it’s a particular model of this particular Ford truck tells me something.
The things it tells me, for example, are its dimensions. It tells me how much its transport cost usually is. It tells me about its tax categories, which are very important, whether you’re in the EU or in UK or the U.S. There’s usually a lot of exemptions or different tax rates depending on if it’s agricultural or if it’s not agricultural. Is this piece of equipment a fixed agricultural unit, is it a mobile agricultural unit? At Ritchie Brothers, all that taxonomical information was actually distributed across the system in many places. As we were rebuilding this and thinking about, we have a modern tax interface now from Thomson Reuters to do the calculations, and it tracks tax exemptions. Where should the categories be? Obviously, the team building the tax system would have all the tax categories. That was everybody’s first answer.
We said, no, that’s actually not right, because what turns out happens in the taxonomy world is, the taxonomy is not fixed. It changes. We learn about equipment. Somebody sells a new piece of equipment. If you think of it as a tree, end leaves in the tree split into a node with two new leaves. If we’re changing that taxonomy tree, then the tax categories change. If I had changed this, now I have to go change another system to correspond to this. If I keep the tax categories inside the taxonomy when they change, I may go to the tax team, they may actually program the facts into the taxonomy. They’re not going to have to change data here and here, and hope that we both kept them organized and clean. That’s what we mean about containing that blast radius of change.
We had one system that was responsible for one thing as we went through this architecture. Think about which domain is managing the change, because, again, an exercise like this is a change management exercise. Nothing will be the same 2 years from now. As you start pinning responsibilities on things, you have to understand which system and which team is responsible for that ownership, even if you know it's going to change over time.
We also talk about decreasing the coupling between all the different systems and having a clean interface, from that previous Thoughtworks slide. I'm going to give you a slightly simplified version of something we ran into. We found out we had a limitation on the number of items you could put on a contract for the seller. We dug into, why can you only have 100 to 200? If you have more than 200, the system completely bogs down, and it takes more than 2 hours to generate the contract exhibit A that the customer has to sign, and they don't want to wait. We dug into that. That's not a business limitation. That's a system limitation. Why do we have that? Contracts and all the contract data are stored in salesforce.com. That's your CRM system. Our asset management lived in a system called Mars, a big asset management monolith. We had all the contract terms and all the asset data in Mars.
It turns out, we synchronized the two of them. We would take all of the detailed asset data, ship it through Boomi to Mars, from Mars to Salesforce. Salesforce would recompute all the things we already knew about those assets to regenerate that PDF. Then, of course, the payout system would have to go to Mars in order to figure out what the commission was and the holdback rates and all the other things. It was a mess back and forth. When you think about it, though, what did our CRM system actually need? The CRM system just needed a PDF. It didn’t need all that data.
All the customer needed to store in Salesforce was the PDF that listed all the things they were selling. That's it. When we refactored it, we said, we're going to build a contract service. It generates the PDF for Salesforce. It's bidirectional: if you add more items to a contract, it goes to the contract service, and the contract service returns a new PDF. Nice and simple. Thousands of items if you want them. No limitation anymore.
It’s also where payouts now goes to get the commissions. Payouts no longer goes to that big monolith anymore to find out what the commission rates are. It goes to the contract service. One more thing out of the monolith and one less hassle for our people internally by sending stuff back and forth to Salesforce. Slightly oversimplified version of that, but by extracting that and understanding its domains and having the right responsibility just for contracts and item association, we broke through a major barrier for customer service, because now they can have many items. We have some sellers that will bring us 10,000 things, and we would have to open 1000 contracts. One contract now, much simpler for everyone.
Event-based architecture. At Ritchie, event-based architecture was an excuse to do data synchronization. There was, A, a lack of consistency, and B, massive payloads containing all the data, yet mysteriously missing data, so not a complete payload. Now, there are many forms of event-based architecture. You can get very complex, CQRS, or you can keep it simple. What I always say in this space is, first of all, understand which model you're going to use, and then be as simple as possible. Be consistent. Think about the information you're sending as an API payload. If you're going to send a notification of change, "This changed, go retrieve it", then the API you retrieve from should be complete. If you're sending all the data in the event, have all the data.
Our bidding engine is famous for sending winning bids. It has the winning bid, it has the number right there. Anyone want to guess what's missing from that payload? No, the number's there. The price is there. It's a global company, you know what's missing? Currency. One hundred and eleven of what? No, that's fine, just go to the event and look up the currency for that particular event. No, put the currency in there. No, just MapReduce it and go over it. No, just send the currency.
Everything in our system, which we modified to work in our new world, handles money as integers, except our legacy, which is all float based. When you do money, that's a whole other heart attack by itself. Then you have currency. It gets crazy, but again, inconsistent payloads force systems to go talk to each other; we have an incredibly chatty system, because you have to assemble so much information from so many places, just because it isn't well factored.
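As a hedged illustration of "send a complete payload", a winning-bid event can carry the currency and express money in integer minor units; the field names below are assumptions, not the actual Kafka schema.

```typescript
// Illustrative event payload: complete in itself, money as integer minor units,
// currency always present. Field names are assumptions, not the real schema.

interface WinningBidEvent {
  eventId: string;
  auctionId: string;
  lotId: string;
  winningBidderId: string;
  amountMinorUnits: number; // e.g. 11100 = 111.00
  currency: "USD" | "EUR" | "GBP" | "CAD";
  occurredAt: string;       // ISO-8601 timestamp
}

function formatAmount(e: WinningBidEvent): string {
  return `${(e.amountMinorUnits / 100).toFixed(2)} ${e.currency}`;
}

const event: WinningBidEvent = {
  eventId: "evt-001",
  auctionId: "auc-42",
  lotId: "lot-7",
  winningBidderId: "bidder-123",
  amountMinorUnits: 11100,
  currency: "CAD",
  occurredAt: "2024-10-01T17:03:00Z",
};

console.log(formatAmount(event)); // "111.00 CAD" -- no second lookup needed
```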
We talk a lot about communicating choices. I'm not going to go through this whole Wardley diagram, but it was a key tool, and this was a very early one we built. We started with some assumptions: what were the things we were just starting to think about, what deserved to be custom and belonged to Ritchie Brothers and the way it does business, what products could we go get, and what are commodities? Some interesting things came out of that discussion.
We use Stripe for their payment APIs. We also leverage Stripe for money movement. We have a lot of rules in the UK, the EU, and the U.S. around what we call trust accounts. In certain states in the United States, if we sell a truck, what the buyer paid has to go into an account that exists only in that state before the money gets paid out. There are all kinds of rules. There's a lot of money that moves after the payments are made.
We use Stripe APIs for that versus manual entries in accounting right now. We could negotiate a very low rate for a highly fiscally compliant and performant service. Why would we build our own? Same thing with taxes. Why build a tax calculation engine when you can get a global tax calculation engine off the shelf and rent it? We could help our board and our leadership understand where we thought these things were. This diagram becomes a way of communicating. It also becomes a way of negotiating. Put that in genesis, we'll deal with it later. Or no, let's go ahead and make that custom; I'm willing to invest in this.
Think about how you as architects are communicating your choices to your leadership so they understand. This is another example of that. This talks about some of the main business capabilities or domains that we have in our system. On the left, you see the things that we care about that we built on our technology infra environment. This is one way that we can communicate to our leadership, “This is what we’re working on now. These are the different experiences. These are the different touchpoints.” It’s not an architecture diagram, but you still have to be able to take that. Remember back to that search for funding, that’s also an important part of this.
Starting in the Middle
We started in the middle. Let's talk about the journey. With all that in mind, where did we actually start this process? We started in the middle because, it turns out, between the Ritchie Brothers side and the Iron Planet side, some of the payment processing and post-payment processing was automated, but not all of it. On the Ritchie Brothers side, there were no online payments at all. You got an invoice from a 25-year-old VBA-based system, then you called somebody to find out the wire instructions if you didn't already know them, and then you wired the money, or you walked in with a check. Everything that happened after that was mostly manual. There is an AS/400 sitting in a corner that only handles certain flat-rate commissions. There was a group of people who would stay every night to make sure that what was in the AS/400 actually computed, and they would do that on spreadsheets. Why? Because we don't trust computers to do math, so we use Excel. That is the statement of the century.
We started with, we’re going to automate this process of checkout. We’re going to build a way for our customers to complete a purchase online: win their bid, win their item, and actually pay for it. Leveraging Stripe, we could give them an option to do ACH, because a lot of our customers, even the big ones, still pay a fee to do a wire. If we can use bank transfer, it saves everybody money. It gets a lot faster. We can do online credit card payments up to certain amounts. We do all of that, that’d be a much better experience.
More importantly, the process we were using to figure out what we had to pay out to sellers was mostly manual. We have a certain number of days to pay our sellers, and nowadays, with money earning even more money, that float, that number of days we had, was important. You can't monetize money and buy overnight repos and treasuries if you don't know how much you're supposed to pay out, because you don't know how much you can make money on. You lose the opportunity for float.
There were a lot of reasons we wanted to start in the middle. We did it wrong, and I think this is an important lesson in the process. I told you at the beginning, we do all that white glove service. With white glove service come the famous exceptions. We do this, unless the customer wants that, and then we do this. What if they don't pay? Then, if they don't pay, we do this thing. Or what if they go into the yard and say it's missing a key, there's no key for the excavator, so I want 50 bucks off?
Fine, now we have to give them $50 off. Or, this is a mess, can you paint it for me? Now we've got to charge them. There were all these manual processes we would do, and there were so many of them, and everyone was so responsive to the business, that we built a workflow basically organized around making sure all the exceptions worked. When we got to the end of that, and even started working on a pilot, the user experience for the customer suffered because we were thinking about exceptions instead of the customer.
More importantly, the accounting integrations weren’t perfect, because we forgot the end in mind. The goal of automating a checkout, and of all this calculation and settlement flow was to get the correct entries in the finance system. For us, that’s Oracle Financials. It doesn’t matter which one it is. Getting the accounting entries correct was mission critical. If you’re going to automate payout to customers, then Oracle needs an invoice in Oracle, in order to understand that and automate that whole process. We realized we’d gone off the deep end. We actually came and sat down with our team in the EU, and we learned about all the anti-corruption rules in the EU about immutable invoices, when people are allowed to get invoiced, when we trigger certain events in the workflow for VAT.
They were all remarkably straightforward and well defined. It actually gave us a way to say, if we meet this high bar, and we define all the accounting and the logic for this high bar, everything else becomes easy. We went back and refocused on the happy path: from "I won my bid" to everything being in the accounting. I've paid for my stuff. I've got my invoice. The seller got their money. What does that look like in Oracle? Done. Figured that out, built that. What's interesting is that now, when you build all the exceptions, when you build the $50 refund, or you build this or that, you have already defined what the end goal is supposed to look like in the system.
You can value check, did you do it correctly? It was a really important learning lesson for us to understand, really, what was the end, what was the happy path end in mind? Some of our integrations that are coming next are even more complicated. We have to go back and say, how do we know this worked? What’s our actual acceptance criterion? Thinking about that as part of the core architecture was a real lesson for us. Again, back to radical simplicity. The point of this was to sell something from Tom to Fred, keep the money in between, and account for everything properly. We had to do that part and then do all the other exceptions later.
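One way to read "define the happy path and the end state first" is as an executable acceptance check: for any sale, the expected ledger entries must balance before exception flows are layered on. The structure below is a hypothetical sketch, not the actual Oracle integration.

```typescript
// Hypothetical sketch: the happy path from winning bid to accounting, checked
// against the end state (balanced ledger entries), before adding exception flows.

interface LedgerEntry {
  account: "buyer_receivable" | "seller_payable" | "commission_revenue";
  debitMinorUnits: number;
  creditMinorUnits: number;
}

function happyPathEntries(hammerPriceMinorUnits: number, commissionMinorUnits: number): LedgerEntry[] {
  return [
    { account: "buyer_receivable", debitMinorUnits: hammerPriceMinorUnits, creditMinorUnits: 0 },
    { account: "seller_payable", debitMinorUnits: 0, creditMinorUnits: hammerPriceMinorUnits - commissionMinorUnits },
    { account: "commission_revenue", debitMinorUnits: 0, creditMinorUnits: commissionMinorUnits },
  ];
}

// The acceptance criterion: debits equal credits for every sale, exceptions included.
function isBalanced(entries: LedgerEntry[]): boolean {
  const debits = entries.reduce((sum, e) => sum + e.debitMinorUnits, 0);
  const credits = entries.reduce((sum, e) => sum + e.creditMinorUnits, 0);
  return debits === credits;
}

console.log(isBalanced(happyPathEntries(1_000_000, 100_000))); // true
```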
Platform V1, V2, and V3
Just some quick, high-level versions of what the platform looked like. When we started, we had Stripe and Thomson Reuters on the finance side. They were our finance partners. We had a Ritchie Brothers legacy website, and we had our Iron Planet website. We started by building a standalone checkout experience and a standalone account management experience, and adding mobile to our checkout app. That wasn't perfect, and it would have been better to add it to the existing app, but the legacy app was a 20-year-old, unstable platform to work in. I talked about the spaghetti data flow. It was also very much in the middle of how all the things worked, so it was a little dangerous at the time to play with it. To get ideas to customers, we built some standalone pieces, and then we built some core capabilities around contract management, settlement, and invoicing, all the things we needed to make that checkout work.
We got that working, and it was successful. Our customers liked it. It's rolling out now. In the process of getting it ready to roll out, we actually did build a unified, modern website for RB Global. We got it off Liferay. Anybody remember Liferay? This was all 20-year-old Liferay with bad microservices sitting in Java, and all the old things. Now it's a new, modern stack, and it's a lot less in the middle of all the data transactions. That's where we are now. We're releasing all this with our new checkout system, which is awesome, and it's working. What's coming next is to integrate Iron Planet with Ritchie Brothers. This is the next task that we're in the middle of. Really what we're doing is saying, we're not going to bring the systems together; Iron Planet sells things a certain way, and Ritchie Brothers sells things a certain way.
We’re going to combine those ways into the new platform, so lift them both up to the new world. That means we have to change how we do consignment. It means we have to change how we track our assets, because we track different things between Iron Planet and Ritchie Brothers. All the things in that deeper orange color that changed, all got modified or getting modified, as you bring that integration together. Because we understand those domain boundaries, and we understand where we can change things or where things belong, it’s going to be very much easier for us to move that stuff off the old monolith into here, because we have a known destination. We can have a test suite that tells us, run the characterization tests and understand that what we did in Iron Planet now works over here. Again, because the architecture allowed us to migrate that incrementally.
Beyond Architecture – The Secrets to On-Time Delivery
Thinking about what we did to actually get to delivery. Because delivery of all these concepts is just as important as organizing the concepts. We built a lot of what we call delivery infrastructure and tooling around engineering enablement. We did think about how we want to make a productized delivery platform. We built it on Kubernetes. We wanted to think about the same concepts that exist. How do we build a set of well factored APIs, or which of the Kubernetes APIs do we use for our developers so they can get through this process and have the least amount of knowledge about how to use a complex thing like Kubernetes and containerization? Jeremy had talked about taking your deep knowledge, and taking that deep knowledge and wrapping it up and making it simpler for another person.
It’s a lot of what we tried to do here. We use a lot of the Kubernetes tools, such as admission controllers and operators, to abstract away even cloud provisioning. At Ritchie Brothers, if you want DynamoDB, it’s a couple of entries in your Helm chart. If your team is allowed to have that, and you have the budget, it’ll show up for you. We don’t make you learn cloud APIs for Azure or AWS. We put that in the delivery platform. That deep knowledge that we’ve had, we built it into some very simple APIs, so our engineers are really effective. We can go from no team, to putting a team of people together and having them all ship to production in less than an hour. It’s pretty cool.
The other thing is, I try to treat our engineers as our competitive advantage. It’s been said a couple of times during the keynote, at the end of the day, this is still a people exercise. The folks that you have in your organization who have that deep knowledge of the people you bring in, they’re all part of the people that are going to actually get through this slog and deliver all this. We focused deeply on investing in our own organization. Yes, we brought Thoughtworks in as a partner, but we also deeply invested in the people that we had, who’d been building monoliths, who’d been building code a certain way.
They had to learn a new way of working. We took the time to do that. We focus on joy as a metric, in the organization, we don’t focus on productivity. I always tell people, my goal as an engineering leader is to create an opportunity for people to have joy every day from delivering work, because we all like to do that. We build the stuff, we want somebody to use it. We want them to have that joy of doing the job. We want them to be able to focus every day and not be disturbed. I was begging our engineering leaders not to wait for engineers to be dead before their art is worth money. Can we please get this out the door? Because people get really frustrated if they don’t see their stuff hit the marketplace, if nobody sees what they did.
Then, other thoughts on maybe how you leverage your development partners through a process like this. We partner with Thoughtworks. We have other partners as well. Part of the entry criterion is, how do you create or uplift the environment that you’re in? I’m not paying for more hands on keyboard. I’m paying for some specific skill, or I’m paying for an attitude, or I’m paying for a way of working. I’m paying for something. I want more ROI than just hands on keyboard, than just code. I want them to help create that environment, to help build that new culture that we’re looking for. I always say, partner with a quality partner. Partner with somebody whose goals and culture and other things match your own. Don’t just try to find somebody to type stuff.
Also, if you’re going to be partnering with a partner in a complex transformation like this, mentoring is important. If you’re going to bring people in who’ve done this, it’s not enough for them to go into a room and do it on their own, because when they leave, you have nothing left behind. You have a bunch of code and a bunch of people that were separated. We have a phrase at Ritchie about, One Team All In.
We bring our partners in, they’re part of our teams. We have a product engineering team, period, whether you’re a partner or not. We don’t exclude people from the process. When I got there, there were a lot of people that were the cool kids that were going to work in the transformation, and there were not the cool kids that were going to do the legacy stuff. How motivated was everybody? Not so much. The way we did it was very simple. We have a lot of pets, and none of them get to die right now.
Every day, somebody’s got to feed the pets, and somebody’s got to take care of the new pets. We’re all going to shift, and we’re all going to do some of it. It isn’t cool kids versus the old kids. We’re one family. We have to take care of everything we have. We consciously did that and even brought some Thoughtworkers in, they’re like, why am I maintaining legacy code? Because it’s your week. It’s your week to take care of the old pets. It’s just the way it is. There isn’t cool kids and not cool kids.
When you think about a transformation journey where you’re building out a capability platform like we are, there are some things I’ve learned over the years of doing this, and the things that get me really excited aren’t always obvious. I watch across the organization to see if it’s sticking, to see if the message is working beyond my own teams. We call them moments of serendipity. My product leaders ask, can I use that settlement statement API to take the settlement statements and put them in our external fleet management software, so sellers using the fleet management software can see their payouts?
Yes, it’s an API, use it. When you finally get somebody to say, I could use that API here, and I can use it here, that’s a fantastic moment. That’s what you’re looking for, because right now, you built them to do one thing, but you really want those to be multiple-ROI type investments. You start to hear shared language. You start hearing people talking about the new provisioning domain or the new consignment domain, or they’re talking about invoicing, and they’re talking about it in the specific way that you’re talking about it, and it’s not just your team. That means you’re getting that transformational knowledge out to the universe. People in your organization are thinking in that new way. When you hear your business owners speaking in that language, you won. This one’s the hardest one.
I’m not done with this journey here, and there’s been a lot of slog in a lot of other places in this last point. Martin Fowler has written about it many times, moving from projects to products. This one is hard. We call it RB 2.0, that’s the name of this endeavor. I always say, this is a transformational endeavor at Ritchie Brothers, at RB Global, to change our way of doing technology in the technology we use. It is not a project that will someday end. It is not a thing that will ship when it’s done. It’s not a thing you fund until it’s over.
If you have invoicing for the rest of your business, you need to fund invoicing for the rest of your business. If it’s important for you to have a catalog of the things that you’re selling and be able to tell where they are in the process, then you will always need that, and it will not be done, and you need to fund it. It is a hard slog. Don’t ever back away from this fight, because if you do, you end up with legacy software again. You end up coming back here, or some other poor soul ends up coming back here, because you’re off doing something else, and now the stuff is worn out, and somebody’s still stuck maintaining it.
Key Takeaways
Set time to stare out the window. Just every once in a while, it’s ok to think. I saw that Toto Wolff video, and I actually went and bought a chair for my home office. My wife says, what are you doing? I said, this is an official stare out the window chair. She’s like, are you allowed to do that? I’m like, I’m going to do it anyway. Then, don’t mistake action for results. Sometimes there’s a lot of running around and activity, but activity doesn’t mean outcomes. Be really careful about separating the two of them.
Then, everything in architecture is about tension. I just told you to slow down. Now I’m going to tell you to go faster. One of the things I tell my teams all the time is an old line from Yogi Berra, the great, nonsensical baseball coach from the States: when you see a fork in the road, take it. Make a decision. We have a lot of people, whether they’re executives or engineers, who are standing in the road.
The truck is coming and they’re standing there like a deer in the headlights. Go left. Go right. It’s code. What could possibly go wrong? You can change it tomorrow. Make a decision. Try something new. We have a lot of digital disruptors coming for us in the industry right now. They can move at that digital native speed. We don’t have time for that. If you build a good architecture, you can experiment quickly. You can stay ahead. That’s the argument here. Then the other thing is, just be a demon about removing all the friction from your value stream. Deeply think about what blocks you from getting from ideation to delivery.
Think about that, not just in your tech business, but work with your business leaders to understand what that means in the business. What are people doing right now that could be eliminated so we can make more margin and put people on smarter things? We all learn about value stream management. We all learn about those processes as engineers: educate the business. The last one, everyone can be a servant leader. It doesn’t matter what your position is from an engineering perspective. You could formally be a leader. You might be an IC. These are things you can do. Everybody deserves to carry this process down the line.
See more presentations with transcripts
Microsoft Unveils Azure Cobalt 100-Based Virtual Machines: Enhanced Performance and Sustainability
MMS • Steef-Jan Wiggers
Article originally posted on InfoQ. Visit InfoQ
Microsoft recently announced the general availability (GA) of its Azure Cobalt 100-based Virtual Machines (VMs), powered by Microsoft’s custom-designed Cobalt 100 Arm processors. According to the company, these VMs offer up to 50% improved price performance over previous Arm-based offerings.
The GA release follows up on the earlier preview this year. The Cobalt 100-based VMs are tailored for a broad range of computing tasks. With Arm architecture at the core, these VMs are built for energy-efficient, high-performance workloads and are available as general-purpose Dpsv6-series, Dplsv6-series, and memory-optimized Epsv6-series VMs.
Any VM in the series is available with and without local disks, so users can deploy the option that best suits their workload:
- The Dpsv6 and Dpdsv6 VM series offer up to 96 vCPUs and 384 GiB of RAM (4:1 ratio), suitable for scale-out workloads, cloud-native solutions, and small to medium databases.
- The Dplsv6 and Dpldsv6 VMs provide up to 96 vCPUs and 192 GiB of RAM (2:1 ratio), ideal for media encoding, small databases, gaming servers, and microservices.
- The Epsv6 and Epdsv6 memory-optimized VMs offer up to 96 vCPUs and 672 GiB of RAM (up to 8:1 ratio), designed for memory-intensive workloads like large databases and data analytics.
All these VMs support remote disk types such as Standard SSD, Standard HDD, Premium SSD, and Ultra Disk storage.
Sustainability is crucial to Microsoft’s strategy, and the new Azure Cobalt 100-based VMs contribute to this vision. Arm-based architecture is inherently more energy-efficient than traditional x86 processors, resulting in lower carbon footprints for businesses that adopt these machines. As enterprises globally prioritize green computing, this launch aligns with Microsoft’s broader goals of reducing emissions and offering more sustainable cloud solutions.
Regarding sustainability and energy-efficient computing power, AWS offers various Amazon EC2 Instances powered by AWS Graviton processors, which are also based on Arm architecture. Furthermore, Google Cloud also offers Ampere Altra Arm-based processors with its Google Compute Engine Tau T2A instances.
One of Microsoft’s partners, Databricks, has integrated the Azure Cobalt 100 VMs into their Databricks Data Intelligence Platform on Azure, which unlocks new possibilities for handling data-heavy workloads with greater efficiency, scalability, and cost-effectiveness. The company writes in a blog post:
With up to 50% better price-performance than previous generations, Cobalt 100 VMs enable Databricks customers to benefit from superior performance and lower operating costs.
Lastly, more details on pricing and availability are available on the Azure VM pricing and pricing calculator pages.
Kotlin HTTP Toolkit Ktor 3.0 Improves Performance and Adds Support for Server-Sent Events
MMS • Sergio De Simone
Article originally posted on InfoQ. Visit InfoQ
Ktor, Kotlin’s native framework to create asynchronous HTTP server and client applications, has reached version 3. It adopts kotlinx-io, which brings improved performance albeit at the cost of breaking changes, and adds support for server-sent events, CSRF protection, serving static resources from ZIP files, and more.
kotlinx-io is a low-level I/O library built around the abstraction of Buffer, which is a mutable sequence of bytes. Buffers work like a queue: you write data to the tail and read it from the head. Ktor 3 now aliases kotlinx.io.Source to implement its Input type, deprecates Output, and reimplements ByteReadChannel and ByteWriteChannel. Developers who used these low-level classes directly will need to modify their apps to migrate to the new API.
The main benefit brought by kotlinx-io is improved performance:
we’ve cut down on the unnecessary copying of bytes between ByteReadChannel, ByteWriteChannel, and network interfaces. This allows for more efficient byte transformations and parsing, making room for future performance improvements.
Based on its own benchmark, JetBrains say the new Ktor shows a significant reduction in the time required for file and socket operations, which can be as high as 90% in certain cases.
Besides performance, the most significant change in Ktor 3.0 is support for server-sent events, a server push technology that enables the creation of server-to-client communication channels. Server-sent events are preferable to WebSockets in scenarios where data flows mostly in one direction, especially when it is necessary to circumvent firewall blocking or deal with connection drops. WebSockets, on the other hand, are more efficient and grant lower latency.
Other useful features in Ktor 3.0 are Cross-Site Request Forgery (CSRF) protection, which can be specified for any given route, and the ability to serve static resources directly from ZIP archives. ZIP archives are served from a base path and may include subdirectories, which will be reflected in the URL structure appropriately.
As a final note about Ktor 3, it is worth noting that the Ktor client now supports Wasm as a build target. However, Kotlin/Wasm is still in alpha stage, so Wasm support in Ktor 3 is not ready yet for production use.
To start a new project using Ktor, head to the Ktor website and choose the plugins that most suit your requirements. Available plugins provide authentication, routing, monitoring, serialization, and more. If you want to update your existing Ktor 2 project to use Ktor 3, make sure you read the migration guide provided by JetBrains.
MMS • Loiane Groner
Article originally posted on InfoQ. Visit InfoQ
Transcript
Groner: We’re going to talk a little bit about API security. Before we get started, we have to understand why we have to do this. For application security, the way many companies are handling this today is: you do your planning and development. Developers will push the PRs and do the build. There’s usually a QA environment, testing, UAT, many companies will call this different things. Then, that’s when you’re going to raise your security testing request. You’re going to ask the InfoSec team, please test my application. Let me know if you find any security vulnerabilities. If they find any, it goes back to the dev team, “We found this security issue. This is a very high risk for our business, and you have to fix it”. Again, it goes through the PR, has to go through testing again, rinse, repeat, until you have a clean report or no high-risk vulnerabilities, and you can finally go to production. This has a few caveats.
First, it can cause production delays, because if you have to go through this testing and rinse, repeat, all this cycle until you get a clean report so you can go to production, that can take a while. Or, even worse, companies are not doing security testing through the software development lifecycle at all, and they’re doing this once a year, or not doing it at all. There’s very interesting research that was done by the Ponemon Institute, and it says that fixing software defects, or worse, fixing security risks once the product is in production, costs way more than if you are handling that during development. That’s why in the industry, we say there is a shift left happening, because many years ago, we went through all that cultural change of having unit testing done as part of our development cycle, and now we’re going through this again.
However, we’re talking about security this time. It is much cheaper and much more cost-effective for the team, and for the company as well, to handle all those security vulnerabilities and make sure that your software is secure while you’re doing development. Security has to be there from day one. It’s not a technical debt. It’s not something that we’re going to add in the next sprint. It has to be part of your user story. It has to be part of your acceptance criteria. It has to be part of your deliverable. I would like to show you a few things that I’ve learned throughout the years.
My name is Loiane. This talk is from a developer to other developers and leads, so we can go through this cultural change and make sure that security is indeed part of our development phase.
What is API Security?
First of all, whenever we say API security, if you decide to Google this, search this, go into YouTube, try to find a tutorial, you’re going to find a lot of tutorials talking about authentication and authorization, especially if you’re working with Spring Boot. All my examples here are going to be with Java, because this is the technical stack that I’m most familiar with. All the examples you can easily translate into a different programming language, framework, or platform. Going back to my question, if you go to YouTube and you search for Java security or Spring security, you’ll find a lot of tutorials about authentication and authorization. Security is not only about that. If we take a look at the OWASP Top 10 vulnerabilities that are found each year, and this list is going to change year after year, you find a lot of the same things happening over and over again. With the tips and best practices I’m going to show you here, we can at least make sure that half of this list is not going to happen within our software.
Better Authorization
Let’s go through it first. Let’s suppose that everybody is doing authentication, so at least user password, or you’re using an OAuth service. You’re doing that in your software. We still need to handle authorization, which is where you have to make sure that the user who is trying to access your application, or trying to perform a certain action within your application, is indeed allowed to perform that action. How do we make authorization better within our applications? Let’s start with the first example, with a bad practice. We’re checking here if we can update a course. This is a RESTful API, so we’re doing a Post here. We have the ID. We also have the object, the data that we’re trying to update. I get the user that’s authenticated. I’m checking if this user has a role student. If the user has a role student, they cannot update the course. If I am somebody who doesn’t know anything about this application, and I’m reviewing this code, I don’t know who exactly can actually update this record. It’s not clear just reading the code.
A better way of doing that is deny by default. I’m going to write my code, I’m going to write my business logic, and by default, nobody is going to have access to it. What I’m going to do is list whoever can actually update it, and everybody else is simply not allowed. When I read this code now, at least I can see that only admins and only teachers can actually update this record, so it’s a little bit better. The other thing is, the majority of the frameworks that we work with do have some support for role-based access control. In Java, for example, we handle a lot of things through annotations. When you’re working with Spring Boot, you do have an annotation where you can easily add all the roles that are actually allowed to do this. We are working with the deny by default approach. You’re free to write your own business logic and leave that part of the authorization, the security check, outside the main business logic. This is great. However, this works perfectly for small systems or systems where you don’t have a lot of roles.
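To make that concrete, here is a minimal sketch of what the deny-by-default, annotation-based check could look like with Spring Security method security; the class, method, and role names are illustrative rather than taken from the talk’s slides, and it assumes method security is enabled (for example with @EnableMethodSecurity):

```java
import org.springframework.security.access.prepost.PreAuthorize;
import org.springframework.stereotype.Service;

@Service
public class CourseService {

    // Hypothetical request payload used by the examples in this section.
    public record CourseUpdateRequest(String name, String description) { }

    // Deny by default: only the roles listed here may update a course;
    // everyone else is rejected before the business logic runs.
    @PreAuthorize("hasAnyRole('ADMIN', 'TEACHER')")
    public void updateCourse(Long courseId, CourseUpdateRequest request) {
        // business logic only; the security check lives in the annotation
    }
}
```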
I really wish that my applications were the same, that I watch those YouTube tutorials and I have user or admin. That’s it. That would be a wonderful world. Unfortunately, it’s not like that. What can happen here is role explosion. I’m going to start with the user and admin. Maybe I have a teacher as well, but now it would be really good to have teaching assistants as well. I’m going to grant access to them to my system so they can do a few things on behalf of the teacher as well. Or, maybe we’re working with an eLearning platform, and I have an account manager. The account manager will also be able to do those things in my system. We start adding more roles to the system. Now my business logic is only one line of code, and I have more code just doing the pre-authorization part. When you’re reading this, it’s not so good, and we can do better. If you’re working with something like this, which actually looks like the projects that I work with, sometimes the authorization level goes down to the button that I see on the screen or the link that I can click on the screen. For the RESTful API here, whether I’m able to perform that particular action or not really depends on the role that I have and all the actions that I’m able to perform.
When we handle situations like this, it is much easier if we have something that is a little bit more dynamic. There are many different ways that you can do this. If we’re using Spring Security and Java, of course, you can use a custom security expression. You can design this according to your needs, according to the size of your project and your business. You can maybe have all the mapping, all the authorization, within a database or another storage, and you load that, and you have a method or a function that’s going to calculate if the user really has access or not. Of course, annotations for the win. We can actually use the annotation and have our method here with the privilege. Now it’s a little bit clearer to me that the only users who are able to actually perform this action are the ones that have the course update privilege, which is mapped somewhere else. When we go into those more complex cases, this can be a little bit easier for us. There is no more hardcoding of all those roles within the system.
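A rough sketch of the more dynamic, privilege-based variant, assuming the role-to-privilege mapping lives in a database and is resolved into the user’s granted authorities at login; the 'course:update' authority name is made up for illustration:

```java
import org.springframework.security.access.prepost.PreAuthorize;
import org.springframework.stereotype.Service;

@Service
public class CourseCommandService {

    // The check names a fine-grained privilege instead of hardcoding roles.
    // Which roles carry 'course:update' is mapped somewhere else (for example
    // a table) and loaded into the user's authorities when they log in.
    @PreAuthorize("hasAuthority('course:update')")
    public void updateCourse(Long courseId, String newName) {
        // business logic; no role names appear in the code
    }
}
```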
The other issue that we might face is, I’m logged in. I’m checking if I am authorized or not. Should I have access to update that particular record? If I am a teacher, let’s say we are in a university, and there are many classes, should anyone update it? If I know the ID, should I be able to update it, just because I know the ID? I know that some of you are using the incremental identity that is generated automatically by the database. We have to be very careful with that. Again, someone can still bypass the check even if they are authorized to use the system. Be very careful with that, and remember to always deny by default.
How exactly do we make that better? One thing that you can do is, once you have the information, again, you go through the authorization, you have to find a way to check if that particular record can be updated by that particular user. Maybe there is some kind of ID, the course teacher ID, and you’re going to match that against the user ID that’s trying to update that record. That way you can make sure that only that certain user is able to actually perform that action. However, one thing that I see happening a lot is we’re getting that object, the course object, directly from the request. I still have my ID from the path variable that I’m passing through my request, but the course that I’m actually checking my logic against came from the request. It can be something as simple as using Postman or any other similar tool.
You can manipulate it, or if you’re a little bit smarter, you can use another tool to intercept the request and change the JSON that is being sent in the request, and the ID here might be something else entirely. Again, you can bypass any authorization logic that you have and still update that record in the database, which is something that should not happen. Never trust the request. When you have to do something like this, always go back to the database or to the true source of the data, the data source. Check the true data to make sure that the action is actually allowed. There is a tradeoff here. This is going to be a little bit slower, because we have to go to the data source. It adds milliseconds to the request, but again, it is a small tradeoff that we are willing to pay here just to have our APIs more secure.
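As a sketch of that “never trust the request” ownership check, assuming a hypothetical Course entity with a teacherId field and a Spring Data style CourseRepository:

```java
import org.springframework.security.access.AccessDeniedException;
import org.springframework.stereotype.Service;

@Service
public class CourseOwnershipService {

    private final CourseRepository courseRepository; // hypothetical repository

    public CourseOwnershipService(CourseRepository courseRepository) {
        this.courseRepository = courseRepository;
    }

    public void updateCourseName(Long courseId, String newName, String authenticatedUserId) {
        // Load the record from the data source; never trust the body of the request.
        Course course = courseRepository.findById(courseId)
                .orElseThrow(() -> new IllegalArgumentException("Course not found"));

        // Ownership check against the true data: only the owning teacher may update it.
        if (!course.getTeacherId().equals(authenticatedUserId)) {
            throw new AccessDeniedException("Not allowed to update this course");
        }

        course.setName(newName); // copy only the fields the caller is allowed to change
        courseRepository.save(course);
    }
}
```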
Property Level Issues
When we are working with objects, there are still a few other issues that we can run into. This has a very fancy name. Just to give you an example, if I have a user and I’m trying to get the data from the user that is logged in, I have a user and password. I’ve done this multiple times myself, exposing the entity directly, because why would I create another class or another object that’s just a copy of my entity, and then expose that instead? This can lead to some issues. In this particular case, I’m only trying to expose the username, and I have some common sense, so I know that I’m not going to expose the password in the JSON; using annotations, I can simply annotate my getter method with a JsonIgnore. What happens if tomorrow we receive another requirement and we have to capture another field, for example, sensitive data such as a social security number or something else?
The developer that is working on this unintentionally forgets to annotate the method to get the social security number, and when we’re sending back that information in the response, you are exposing something that you’re not supposed to. This can go through pull request reviews, code reviews, and we’re not going to notice. That can happen. A way to avoid this is creating data transfer objects, or DTOs. You can use records if you’re using a more modern version of Java, or you can just create a class. You have to explicitly state the properties that you want to expose in this case. It’s a much better way of doing that. If tomorrow we get, again, the requirement to add sensitive data to our object, we’re not going to expose it, because the public contract doesn’t have that information, and that social security number or whatever other sensitive data we have to capture is going to stay internal to the system.
Then we can enter into another very good discussion here. Should I create a DTO for a request and have another DTO for a response? Again, this is contentious territory. Each one of us will have their own point of view on this. If you are reusing the same DTO for both requests and responses, just be careful. For example, for the request, do not accept the ID, or whatever primary key or unique property you’re using to identify that object, from your DTO when you’re handling requests. This can also slip through the cracks, and then, again, something might happen. It’s always best to have one for the request and another one for the response. In case you have a metric against duplicated lines of code within a project, be very careful with that.
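A small sketch of the request/response split using Java records; the field names are hypothetical, and in a real project each record would live in its own file:

```java
// The request record carries no id and no sensitive fields.
record CreateUserRequest(String username, String email) { }

// The response record exposes only what the public contract needs.
record UserResponse(Long id, String username) { }

final class UserMapper {

    private UserMapper() { }

    // Explicit mapping: a new entity field (say, a social security number)
    // never leaks into the response unless someone adds it here on purpose.
    static UserResponse toResponse(Long id, String username) {
        return new UserResponse(id, username);
    }
}
```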
Password/Key Exposure
Now we’re able to handle authorization a little bit better. The second part is password and key exposure. This seems like a little bit of common sense. Who here is going to commit the database password to your GitHub repo? There are a few different ways that this can still happen. At many companies, you have your URL, and then you have your resource name, something to help identify the project. Then you are creating a developer database. Again, I really wish my project was the same thing as those tutorials, where I can simply have a MySQL database and a Docker image with two tables, and that’s it. That would be wonderful as well. Especially when you’re working with legacy systems and you have that huge database with maybe hundreds of tables, it’s a little bit more complicated, with lots of data as well.
Some companies will have their own database on a server or in a cloud that everybody is going to access. I don’t know about you, but me, personally, I’m not so good with names. That’s the hardest thing to do. How am I going to name a class, a variable? What name do I give to my database? I’m just going to give it the company name, or maybe the project name, and dev, to indicate that this is a development environment. I’m going to use prod to indicate that this is production. This can be a little bit dangerous. Then for the password, again, I’m not going to remember all 30 passwords that I have for all the services that we use. I’m just going to use something like my learningPlatform@Dev. Then for production, I just change that to production. If something like this gets committed into a repository and somebody sees that information, I wonder what happens if I change this from dev to prod or to another upper environment? Be really careful with that. Never leave passwords or any secrets within your properties or YAML file, or even hardcoded, even for lower environments.
Another issue here is this last line right here. If you’re working with JPA or Hibernate, there is a setting where the framework is responsible for checking all the entities that you have in the source code, and it’s going to create all the tables for you. It can create, drop, update. There are many different options. This is a big issue. Never use a user ID that is able to make schema changes in your database. Again, deny by default. You start with, I need read access to my database, because I have my user ID, so you grant that read access. If your application is also writing to the database, then you grant the write access. If you need access to execute any stored procedures, then you add that access as well, but never grant more access than is actually needed. Be very careful with that. This only works for tutorials. This does not work for real applications.
Input Validation
The third part that I would like to bring to your attention is input validation. This also seems to be common sense. This seems to be something that is very basic as well. Yet we are failing at this in lots of the code that we review. We are just not even adding any kind of validation, and we need to start changing that as well. We have our frontend. It’s beautiful, fully validated. I have all the error messages, user experience, chef’s kiss. Then if you take a look at the API that’s feeding that frontend, it’s just this. I have my create method. I have my DTO. There’s nothing. It’s just simple code. This is a big red flag. How can we improve that? Never trust the input. Again, you have your frontend fully validated, the user enters all the data, hits the submit or save button, the request goes to the API, and the data is saved perfectly. Then, again, if you try Postman, or if you try any of the other approaches to actually invoke your API without a frontend, you start to run into issues. There are no validations.
Always remember that if you’re working with an API that is being used by a frontend, the API exists independently from the frontend. We really have to start validating the API. First step, the same validations that you are applying in your frontend, you have to apply in your API as well. That’s the minimum that we have to do. I know it’s a lot of work, because there’s a lot of validations that can go through, especially when we’re working with forms, and we do have a lot of forms in some of the applications, but again, always add the validations to your API, at least the same. Remember that your API has to have more validations than your frontend. It is the one that has to be bulletproof and has to hold the fort when we’re talking about security.
Make sure that you’re validating type, length, formatting, range, and enforcing limits. Java is a beautiful language, because we have something that I like to call annotation driven development. We just start adding all the annotations, and magically, it’s going to do all the work behind the scenes for us. When you are annotating your entities, you have the @Column, for example, just to map this particular property from your class to the column in the database, or to the property in the document. Make sure that you’re adding the length as well, if it’s nullable or not, if it is unique. Try to map your database mapping into your code as well, because, again, that’s going to be at least one layer that we can add a security.
In Java, we have a really nice project that’s called Jakarta Bean Validation, or, if you’re a little bit old school, Java EE Bean Validation. Hibernate also has one of the implementations, called Hibernate Validator, that you can use to enhance all your entities or all your documents as well. Do not forget to validate strings, for example when we have a name. Even if you look at this code right here, I see you have some validations, but that’s not enough. I don’t have all the validations. There is too much damage that I can do if I only have validations for the size, but I’m not validating the string itself. If I try a request with !##$ and whatever other special or weird characters I find on my keyboard, is that a valid name? Should it be allowed? Validate strings.
One thing that we usually tend to do is build the regex from what we see on our keyboard. If you go to the ASCII table, or if you take a look at the Unicode table, you have hundreds of characters, characters that I don’t even know exist, or whose names I don’t know. Be very careful with that. Always prefer to work with an allowed list. What does that mean exactly? For a name, if I’m only allowed to have alphanumeric characters, with maybe a space, parentheses, or an underscore, then that is my name; anything else is denied by default. One other thing that you can do is maybe sanitize as well. It really depends on the project. You can take the approach that if the user tries it, I’m not going to allow it, I’m just going to throw an error. Or you can try to automatically remove those characters, or sanitize those characters as well. Different approaches for different projects. Just make sure that you are choosing the one that is the better fit for you.
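Here is what that could look like with Jakarta Bean Validation on a hypothetical DTO; the field names, limits, and the allow-list pattern are illustrative and would need to match your own business rules:

```java
import jakarta.validation.constraints.Max;
import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Pattern;
import jakarta.validation.constraints.Size;

// Type, length, and format are enforced before any business logic runs, and
// the name uses an allow-list pattern instead of trying to enumerate every
// character we want to block.
public record CourseCreateRequest(

        @NotBlank
        @Size(max = 100)
        @Pattern(regexp = "^[A-Za-z0-9 ()_-]+$",
                 message = "name contains characters that are not allowed")
        String name,

        @Min(1)
        @Max(500)
        int capacity) {
}
```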
Always remember to secure all the layers. For example, we’re working here with three layers. We have our web controller: validate all the parameters that your methods have. Do not be shy to use those annotations. It only takes seconds to actually add those annotations over there. One thing that is very important, especially if you’re working with pagination: never forget to add an upper limit to your page size. My frontend only allows 100 records per page. That’s fine. Here, what if I pass a million, or 5 million? What if I try to do a DDoS attack and send multiple requests asking for 5 million? Is your server able to handle that many requests? You can bring down your service, and that can cause business loss, financial loss to the company as well.
Always make sure that you’re adding validation to each and every parameter that your API is receiving. Again, in the service, you’re going to repeat that. The good thing is, you’ve done that in the controller, so Control C, Control V in the service, or maybe you’re doing the other way around, the service and then the controller. Make sure that you are propagating all those validations across all the layers. Because, what can happen, depending on the application that you are working with, you can have a service that is being consumed by only one controller, but again, maybe next week, next month, or next year, you have another controller also using that same service. What’s going to happen?
If the developer who is now coding that controller does not do any validation in the controller, at least the service is going to be able to handle the validation and reject bad requests. Again, for the entity or documents as well, don’t be shy to use and add all those annotations. The beautiful thing about this is, if you have a column or a property that is only able to handle 10 characters, and let’s say that you are sending 50 characters through the request, you reject the request up front instead of getting that truncation exception when the write to the database fails. The other beautiful thing about this is, if you are on the cloud and the service that you’re using is charging you per request, then with all these validations in place you are saving a failed request to the database, so that can actually bring some cost saving benefits to the organization.
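A sketch of those controller-level guards, including the pagination upper limit mentioned earlier; it assumes Spring’s @Validated parameter validation, and the endpoint, service, and limits are illustrative:

```java
import java.util.List;

import jakarta.validation.constraints.Max;
import jakarta.validation.constraints.Min;
import org.springframework.validation.annotation.Validated;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@Validated // enables constraint validation on simple request parameters
public class CourseController {

    // Hypothetical read model and query service for the example.
    public record CourseSummary(Long id, String name) { }

    public interface CourseQueryService {
        List<CourseSummary> findCourses(int page, int pageSize);
    }

    private final CourseQueryService service;

    public CourseController(CourseQueryService service) {
        this.service = service;
    }

    @GetMapping("/courses")
    public List<CourseSummary> listCourses(
            @RequestParam(defaultValue = "0") @Min(0) int page,
            // Upper limit: nobody gets to ask for 5 million rows in one call.
            @RequestParam(defaultValue = "20") @Min(1) @Max(100) int pageSize) {
        return service.findCourses(page, pageSize);
    }
}
```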
SQL injection. It’s 2024, and we still have to talk about SQL injection. It is still happening. Make sure that you are validating and sanitizing your inputs, escaping those special characters that can be used for SQL injection. I know sometimes we don’t want to use some kind of Hibernate thing. When you have something a little bit more complex, you want to write your own native queries. Make sure you’re not using concatenation. Please, at least use a prepared statement. Be a lazy developer, use what the framework has to offer you. Don’t try to do things on your own. Many developers have gone through the same issues before, and that’s why we have frameworks to try to abstract a few of these things for us. I’m still seeing code during code reviews with concatenations in place. Sad, but it’s life.
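For illustration, here is a parameterized native query with Spring’s JdbcTemplate; the table and column names are made up, and the same idea applies to a plain JDBC PreparedStatement or to JPA named parameters:

```java
import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Repository;

@Repository
public class CourseSearchRepository {

    private final JdbcTemplate jdbcTemplate;

    public CourseSearchRepository(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public List<String> findTitlesByTeacher(String teacherId) {
        // Parameter binding instead of string concatenation: the value is sent
        // separately from the SQL text, so a crafted teacherId cannot change
        // the structure of the query.
        return jdbcTemplate.queryForList(
                "SELECT title FROM course WHERE teacher_id = ?",
                String.class,
                teacherId);
    }
}
```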
File Upload
Still talking about input. We’re only talking about validating the request. What about files? I work for an industry where we handle a lot of files. I’m not talking about images. I’m talking about Excel files, Word files, PDFs, things like that, where you have to read those files, parse them, and then do something with the data that’s within the file. Then you go through with the business logic. First rule of thumb, always make sure that you are adding limits to the file size. If the file is too big, reject it and tell the user. Again, it really depends on what the business use case is here. Try to find what your limit is, something that is acceptable. Make sure that you are setting that in your application. Again, if you’re using Java Spring, it’s two lines of code. Easy. Five seconds and you’re done. Make sure to also check for extension and type validations. These can be very deceiving. If you remember a few slides back, never trust the input, because here you can go to the content-type header and manually change it, and deceive the code if you’re checking for the extension in the content-type header. What do we do?
The issue that we can run into with the extension is, if your library is expecting one extension and it’s actually something else, you can run into all sorts of issues. Also with the file name, there is one very famous vulnerability called the path traversal vulnerability, where, again, we don’t know what the file name is. You can use those tools to intercept the request and change it, and have something that is malicious. I don’t know if you’re using a NAS, if you’re using an S3 bucket, or any kind of storage, but there is a lot of damage that you can do only with a malicious file name. Make sure to also validate that. Be a lazy developer. Use tools that are already available, if you are able to add these dependencies to your project. If you need something that is very simple, very quick, you can use Apache Commons IO. There is a FilenameUtils class, because we love a Utils class, and you can use it to normalize the file name.
If you need something a little bit more robust, you can use Apache Tika to actually read the metadata of the file, get the real file type, and sanitize the name of the file. I cannot tell you how many times this library has helped me to close a few vulnerability issues for the applications that I have worked with. Whenever I’m working with file upload, the first thing that I do is check, do I have Apache Tika in my pom.xml? If I have it, I can just copy and paste the boilerplate code, or you can create a static method just to run those validations for you and have some reusability as well. Again, if you are indeed saving the file somewhere, be sure that you are running the file through a virus scan. If you’re working with spreadsheets or CSVs or documents, again, deny by default. Does my Excel file need to have macros or formulas? Does my Word document need to allow embedded objects? Does it make sense for my application? Do I have a valid business justification? Make sure that we have all those validations in place. Then you can safely store your file and live happily ever after.
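A rough sketch of that reusable validation helper, assuming Apache Commons IO and Apache Tika are on the classpath; the allowed content types and the exception used here are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;

import org.apache.commons.io.FilenameUtils;
import org.apache.tika.Tika;

public final class UploadValidator {

    private static final Set<String> ALLOWED_TYPES = Set.of(
            "application/pdf",
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");

    private UploadValidator() { }

    // Returns a safe file name, or throws if the detected content type is not allowed.
    public static String validate(String originalName, InputStream content) throws IOException {
        // Strip any path segments a malicious client may have put in the name.
        String safeName = FilenameUtils.getName(originalName);

        // Detect the real type from the bytes, not from the extension or the header.
        String detectedType = new Tika().detect(content, safeName);
        if (!ALLOWED_TYPES.contains(detectedType)) {
            throw new IllegalArgumentException("File type not allowed: " + detectedType);
        }
        return safeName;
    }
}
```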
Exception Handling and Logging
Exception handling and logging, this is where we have to be a little bit careful as well. I find this really funny as a developer: whenever I’m using a service on the internet and an error occurs, I can see they’re using this tech stack. That’s really cool. For me, it is, but for somebody that doesn’t have good intentions, it might not be. Never expose the stack trace. Log the stack trace, because we as developers are going to rely on logs to do some debugging and try to fix some of the production issues. Log it, but do not expose it. Return a friendly and helpful message. Please do not return something like “an error occurred, please get in touch with the administrator”. What does that even mean? Return something that is helpful to whoever is seeing the message, but don’t expose anything.
You’re not exposing the technology stack that you are using. Because what happens is, if you expose the technology stack, the person that does not have good intentions might think, let me see if there are any vulnerabilities. You’re using Spring. Does Spring have any vulnerabilities that I can try to exploit? That is one of the reasons. Again, if you’re using Spring, there is one line of code that you can add to your properties file or YAML file to not expose the stack trace. Also, be careful with what you are logging. We’ve watched some talks during this conference about how we as developers are responsible. We have to be accountable for the code that we are writing. The beautiful thing about being a developer is that you can work within any industry. With power comes responsibility. Different industries will have different regulations, so make sure that you’re not logging the password, even for debugging purposes.
If you work with personally identifiable information, like first name, last name, email, phone, address, something that can help to identify a person, do not log those. We have several regulations: GDPR, and California has the California Privacy Act. Other states are passing their own regulations. We have to study our programming language, and at the same time, we have to keep ourselves up to date with all these regulations that can impact our jobs as well, to make sure that we’re being ethical and we are writing code that is not infringing any of those laws. The same goes for financial information, health care data, and any kind of confidential business information. Log something that is still helpful to you, to help you to debug those production issues, but do not log something that is sensitive.
One thing you can do to remove that sensitive data, especially if you’re using toString to log something, is to remove any sensitive data from your toString. There are annotations that can do this. I personally prefer not to use annotations for this, because, again, you can forget to annotate a new property in case you’re adding one. I like to explicitly define what my toString is, so I can actually safely log that information if I have to. In case you do have to log user IDs or credit card numbers or any sensitive confidential data, you can mask that data and still present it in a way that is helpful to you, or you can use vault tokens as well.
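A small sketch of that explicit, masked toString on a hypothetical class:

```java
public class PaymentRecord {

    private final String userId;
    private final String cardNumber;

    public PaymentRecord(String userId, String cardNumber) {
        this.userId = userId;
        this.cardNumber = cardNumber;
    }

    @Override
    public String toString() {
        // Explicitly list only what is safe to log, with the card number masked.
        return "PaymentRecord{userId=" + userId
                + ", cardNumber=" + mask(cardNumber) + "}";
    }

    private static String mask(String value) {
        if (value == null || value.length() < 4) {
            return "****";
        }
        return "**** **** **** " + value.substring(value.length() - 4);
    }
}
```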
There are many different ways that you can do this masking, in case you absolutely have to log it. Be very careful with that. Last but not least here, apply rate limits to your APIs. There are many flavors in the industry. It all depends on the size of your application. If you need something that is very quick and easy, you can use Spring AOP. There’s also a great library, Bucket4j. If you need a more robust enterprise solution, Redis for the win, among other solutions out there as well. Do apply them, because in case your API does have any kind of vulnerability, at least you’re going to prevent some data mining. If you have some rate limit, you can control the damage that’s done. If you cannot have it all, at least try to apply a few validations and a rate limit so you can decrease the size of the damage.
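As a minimal illustration of the rate-limiting idea with Bucket4j; the limit of 100 requests per minute and the single shared bucket are illustrative, and a real API would typically keep one bucket per client or API key, possibly backed by Redis:

```java
import java.time.Duration;

import io.github.bucket4j.Bandwidth;
import io.github.bucket4j.Bucket;
import io.github.bucket4j.Refill;

public class ApiRateLimiter {

    // 100 requests per minute; tokens are refilled greedily as time passes.
    private final Bucket bucket = Bucket.builder()
            .addLimit(Bandwidth.classic(100, Refill.greedy(100, Duration.ofMinutes(1))))
            .build();

    // Returns false once the quota is exhausted, at which point the API
    // would typically answer with HTTP 429 Too Many Requests.
    public boolean tryHandle() {
        return bucket.tryConsume(1);
    }
}
```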
Testing
Testing. After all we’ve talked about, of course, we have to test all of this. It’s not only our business logic. For testing, make sure that you are adding those exception edge cases as well to your testing. If you only care about percentage of code coverage, this is not going to add any code coverage to your reports, but at least you are testing if you have your validations in place. You know if your security checks are in place.
One of the things that really helps me, especially when I have to write this kind of data, is that you can use other data sources for this. You can have your invalid data in some sort of file, and load it. There are many ways of doing this. In case you’re writing the data yourself, use AI to help you with this. You write two or three cases, and then the AI is going to pick it up and generate the rest of the test data for you. This is a way to also improve that.
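A sketch of that kind of test with JUnit 5 and the Jakarta Validator, reusing the hypothetical CourseCreateRequest record from the input validation sketch above; the invalid sample values are just examples:

```java
import static org.junit.jupiter.api.Assertions.assertFalse;

import java.util.Set;

import jakarta.validation.ConstraintViolation;
import jakarta.validation.Validation;
import jakarta.validation.Validator;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class CourseCreateRequestValidationTest {

    private final Validator validator =
            Validation.buildDefaultValidatorFactory().getValidator();

    @ParameterizedTest
    @ValueSource(strings = {"", "   ", "!#$%&*", "<script>alert(1)</script>"})
    void rejectsInvalidNames(String badName) {
        Set<ConstraintViolation<CourseCreateRequest>> violations =
                validator.validate(new CourseCreateRequest(badName, 10));

        // Every one of these inputs must produce at least one violation.
        assertFalse(violations.isEmpty());
    }
}
```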
The AI Era
Again, we are in our AI era here, so make sure that you are taking advantage of that. If you are starting to work with projects with AI, because, of course, now it’s AI, our companies are going to ask, can you just put an AI on that? Just make sure that we have an AI. In case you are working on one of those projects and you are handling prompt engineering, make sure to validate and sanitize that as well. This is a really cool comic. Make sure that you are validating and sanitizing your input. It doesn’t matter the project, always validate and sanitize. Use AI as an ally here. It’s a great IntelliSense tool. I really like to use it as my best friend coding with me.
If you’re not sure how to write a unit test for a validation, just ask Copilot, CodeWhisperer, or whatever tool you are using; it can help you with that. If you’re using GitHub, they’re coming out with a lot of services. I really think this is about adding security within the pipeline itself. Keep your dependencies up to date, that also helps a lot. Add some code scanning for security vulnerabilities, and make sure that you’re not exposing those passwords. It can also help a lot with that if you do have access to services like this. Of course, there are a lot of other services within the industry as well. It really depends on what your company is using. There are great services out there with which you can achieve a very similar result.
Education and Training
Of course, you’re not going to go back tomorrow and say, team, I think we need to start incorporating a little bit more security within our code. This change does not happen from night to day or from day to night. It is a slow process. We need to mentor junior developers on this and the rest of our team as well. This is a work in progress, through many months. One of the things that I like to do with the folks that I work with is, whenever we’re having demos of the product, I’ll start asking questions. This is a really nice, cool feature, you’re handling a file upload? Are you checking for the file name? Are you validating that? Or, if we have some RESTful API, what are you using for validation? Start asking questions.
Next time you’re having those sessions, ask the same questions again. The team starts to think, next time she’s going to ask about that, so let’s just add it so that when she asks, we’ve already done it. That’s a different way of doing that. Provide feedback. Make sure that security is part of your user stories and part of the requirements, so we can start to incorporate it as part of the development. One thing that I like to use as well is a security checklist whenever I’m doing code reviews. This is only a suggestion. These are some of the things that I mostly find in the code reviews that I do. Always be kind in the code reviews that you are doing. These are some of the things that I usually check whenever I’m doing code reviews. You can evolve from this. Adapt it to something that works better for your team. Again, many flavors available out there.
Questions and Answers
Participant 1: Do you have any recommendations for libraries for file content validation?
Groner: It really depends on what kind of validation you are using. For example, for all the Word documents that I handle, all the spreadsheets that I handle, we usually do not allow macros, formulas, embedded objects. For the content itself, it really depends on the use case that you have. It can be something manual. You can use some OCR tool to help you to do that as well. It’s really going to depend.
Participant 1: Since you mentioned Excel files. We do have a use case where users upload Excel files. I was just wondering if there are any off-the-shelf libraries that we can use, or do we have to write custom code?
Groner: Depending on what you need, we usually write our own. We only validate for things that we do not allow. If you have a data table and you’re only trying to extract that data table, we’re going to run all the validations on all the types that we have all over again, and validate all the business logic to make sure that the data is what we are expecting. At that level of detail, we usually write something ourselves. Depending on the use case, Google has services for that, and there are a few services out there that you can try to use to help you go through that.
Participant 2: You were talking about logging, what we should expose, what we should return. In our team, we are facing this double-edged sword in the sense that we don’t want to return sensitive information, like exposing our business logic, how we do our profile management. When we have issues escalated to our helpdesk or service centers, we can’t find the exact errors by looking at Splunk, because our APIs don’t return those important crumbs for us. How can we approach this better? Our architects suggested, for instance, maybe we should use error codes. Like, this is error code 2 or 3. Have you encountered this issue before? What should we do?
Groner: There are a few different ways that you can approach this. One is you can definitely have your own dictionary of the error codes, as you mentioned, just to help you a little bit with the debugging process. The other way around it is, you can try to mask the data. It’ll still be something that is meaningful and it’s easier for you to consume, but not something that’s going to be exposing any sensitive data. Because often when we’re running into production issues, it can be something like a software defect where we have to fix, but it can also be data consistency issues as well.
Those cases are a little bit more difficult to debug. If you have some masking that still preserves the nature of the data itself, you can still go through that without actually having access to the database, or something like that. That would be one of the approaches that I would try to use. This is very specific. It really depends on the business case, but it helps a little bit. The other thing that you can do as well is some kind of vault. If you have that data, you have some token, and you can log the token, which can help you to retrieve the data. That would be another approach as well.
Participant 3: Do you have any suggestions for any tool in the CI/CD pipeline to scan the code quality and check for security inside the code?
Groner: There are a few, like Snyk. There’s Sonar. Depending on how you configure Sonar, you can try to catch those as well. Personally, we use Checkmarx a lot to do that, Checkmarx for code. There is still an InfoSec team that reviews what Checkmarx is flagging, to see if it’s a real issue or not. There is Black Duck for any kind of CVEs that are out there for dependencies. There are other tools on the market, but these are some that we use internally, that are global to the organization. If you’re using GitHub, they’re rolling out a lot of features now, and they have code scanning. A lot of them are free to use if you’re actually using GitHub, but for a few of them, you still have to have the license of the product in order to be able to use them.
Participant 4: You mentioned having validation at all levels. One of the things we’ve done is pull that out: we don’t have authentication at every level, not even authorization in the service; we just handle that at the top level. For something like validation, we also have that at the top level, and we don’t have it in underlying services; we handle it just at the base level, like in a controller, so that we don’t have to keep adding that in. Is there a difference, in your opinion, between authorization and authentication versus validation, and why you do validation at every single level, like why that’s different?
Groner: I think it’s really going to depend on the team itself. You definitely can do a validation only on the controller level, if you want to keep your service layer a little bit more clean. I would definitely add that to the entity as well, because sometimes, we make mistakes. You’re going to forget something in the controller level, so at least you have another layer protecting you. If your team has the discipline to always add those validations into the controller, and if that’s working for you, that’s great. You can continue doing that. It also really depends on the nature of the project. If you have your controller, and then you’re calling your service, and maybe you’re using microservices architecture, and you don’t have multiple controllers, that works really well. If you’re working in a monolithic application where you have thousands of controllers and then you have thousands of services.
Then, in one controller, you’re making reference to 10 different other services, that becomes a little bit more complex, and you can actually make a mistake when you are trying to reuse that service in a different file. Then if you forget something, that is one of the reasons that I would say, to add in to all layers. It depends on the project. If that’s working for you, that is great. For the validation and authorization itself, usually, this is only done on the highest layer, usually in the controller, if we’re talking about Java, Spring, or something like that. That’s usually where we handle. You don’t necessarily need to handle the services in the service layer, unless you have a service that’s calling another service. Then you need to have some kind of authorization and authentication, some mechanism in there as well, in case you are interfacing with a different service, like connectivity to a different web service, or what have you.
See more presentations with transcripts
Microsoft Launches Azure Confidential VMs with NVIDIA Tensor Core GPUs for Enhanced Secure Workloads
MMS • Steef-Jan Wiggers
Article originally posted on InfoQ. Visit InfoQ
Microsoft has announced the general availability of Azure confidential virtual machines (VMs) in the NCC H100 v5 SKU, featuring NVIDIA H100 Tensor Core GPUs. These VMs combine hardware-based data protection from 4th-generation AMD EPYC processors with high performance.
The GA release follows the preview of the VMs last year. By enabling confidential computing on GPUs, Azure offers customers increased options and flexibility to run their workloads securely and efficiently in the cloud. These virtual machines are ideally suited for tasks such as inferencing, fine-tuning, and training small to medium-sized models. This includes models like Whisper, Stable Diffusion, its variants (SDXL, SSD), and language models such as Zephyr, Falcon, GPT-2, MPT, Llama2, Wizard, and Xwin.
The NCC H100 v5 VM SKUs offer a hardware-based Trusted Execution Environment (TEE) that improves the security of guest virtual machines (VMs). This environment protects against potential access to VM memory and state by the hypervisor and other host management code, thereby safeguarding against unauthorized operator access. Customers can initiate attestation requests within these VMs to verify that they are running on a properly configured TEE. This verification is essential before releasing keys and launching sensitive applications.
(Source: Tech Community Blog Post)
Commenting on a LinkedIn post by Vikas Bhatia, head of product, Azure confidential computing, Drasko Draskovic, founder & CEO of Abstract Machines, wrote:
Congrats for this, but attestation is still the weakest point of TEEs in CSP VMs. Current attestation mechanisms from Azure and GCP – if I am not mistaken – demand trust with the cloud provider, which in many ways beats the purpose of Confidential Computing. Currently – looks that baremetal approach is the only viable option, but this again in many ways removes the need for TEEs (except for providing the service of multi-party computation).
Several companies have leveraged the Azure NCC H100 v5 GPU virtual machine for workloads like confidential audio-to-text inference using Whisper models, video analysis for incident prevention, data privacy with confidential computing, and stable diffusion projects with sensitive design data in the automotive sector.
Besides Microsoft, the two other big hyperscalers, AWS and Google, also offer NVIDIA H100 Tensor Core GPUs. For instance, AWS offers H100 GPUs through its EC2 P5 instances, which are optimized for high-performance computing and AI applications.
In a recent whitepaper about the architecture behind NVIDIA’s H100 Tensor Core GPU (based on Hopper architecture), the NVIDIA company authors write:
H100 is NVIDIA’s 9th-generation data center GPU designed to deliver an order-of-magnitude performance leap for large-scale AI and HPC over our prior-generation NVIDIA A100 Tensor Core GPU. H100 carries over the major design focus of A100 to improve strong scaling for AI and HPC workloads, with substantial improvements in architectural efficiency.
Lastly, Azure NCC H100 v5 virtual machines are currently only available in East US2 and West Europe regions.