
MMS • Robert Krzaczynski
Article originally posted on InfoQ.
Rhymes AI has introduced Aria, an open-source multimodal native Mixture-of-Experts (MoE) model capable of processing text, images, video, and code effectively. In benchmarking tests, Aria has outperformed other open models and demonstrated competitive performance against proprietary models such as GPT-4o and Gemini-1.5. Additionally, Rhymes AI has released a codebase that includes model weights and guidance for fine-tuning and development.
Aria has several features, including multimodal native understanding and competitive performance against existing proprietary models. Rhymes AI has shared that Aria’s architecture, built from scratch using multimodal and language data, achieves state-of-the-art results across various tasks. This architecture includes a fine-grained mixture-of-experts model with 3.9 billion activated parameters per token, offering efficient processing with improved parameter utilization.
Rashid Iqbal, a machine learning engineer, raised considerations regarding Aria’s architecture:
Impressive release! Aria’s Mixture-of-Experts architecture and novel multimodal training approach certainly set it apart. However, I am curious about the practical implications of using 25.3B parameters with only 3.9B active—does this lead to increased latency or inefficiency in certain applications?
Also, while beating giants like GPT-4o and Gemini-1.5 on benchmarks is fantastic, it is crucial to consider how it performs in real-world scenarios beyond controlled tests.
In benchmarking tests, Aria has outperformed other open models, such as Pixtral-12B and Llama3.2-11B, and performs competitively against proprietary models like GPT-4o and Gemini-1.5. The model excels in areas like document understanding, scene text recognition, chart reading, and video comprehension, underscoring its suitability for complex, multimodal tasks.
Source: https://huggingface.co/rhymes-ai/Aria
In order to support development, Rhymes AI has released a codebase for Aria, including model weights, a technical report, and guidance for using and fine-tuning the model with various datasets. The codebase also includes best practices to streamline adoption for different applications, with support for frameworks like vLLM. All resources are made available under the Apache 2.0 license.
Aria’s efficiency extends to its hardware requirements. In response to a community question about the necessary GPU for inference, Leonardo Furia explained:
ARIA’s MoE architecture activates only 3.5B parameters during inference, allowing it to potentially run on a consumer-grade GPU like the NVIDIA RTX 4090. This makes it highly efficient and accessible for a wide range of applications.
Addressing a question from the community about whether there are plans to offer Aria via API, Rhymes AI confirmed that API support is on the roadmap for future models.
With Aria’s release, Rhymes AI encourages participation from researchers, developers, and organizations in exploring and developing practical applications for the model. This collaborative approach aims to further enhance Aria’s capabilities and explore new potential for multimodal AI integration across different fields.
For those interested in trying or training the model, it is available for free at Hugging Face.
OpenJDK News Roundup: Stream Gatherers, Scoped Values, Generational Shenandoah, ZGC Non-Gen Mode

MMS • Michael Redlich
Article originally posted on InfoQ.

There was a flurry of activity in the OpenJDK ecosystem during the week of October 21st, 2024, highlighting: JEPs that have been Targeted and Proposed to Target for JDK 24; and drafts that have been promoted to Candidate status. JEP 485, Stream Gatherers, is the fifth JEP confirmed for JDK 24. Four JEPs have been Proposed to Target and will be under review during the week of October 28, 2024.
Targeted
After its review had concluded, JEP 485, Stream Gatherers, has been promoted from Proposed to Target to Targeted for JDK 24. This JEP proposes to finalize this feature after two rounds of preview, namely: JEP 473: Stream Gatherers (Second Preview), delivered in JDK 23; and JEP 461, Stream Gatherers (Preview), delivered in JDK 22. This feature was designed to enhance the Stream API to support custom intermediate operations that will “allow stream pipelines to transform data in ways that are not easily achievable with the existing built-in intermediate operations.” More details on this JEP may be found in the original design document and this InfoQ news story.
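As a quick illustration (not taken from the JEP itself), the feature adds a gather operation on Stream together with a set of built-in gatherers in java.util.stream.Gatherers; the sketch below groups a stream into fixed-size windows:

```java
import java.util.List;
import java.util.stream.Gatherers;
import java.util.stream.Stream;

// Small illustration of Stream::gather with a built-in gatherer
// (preview in JDK 22/23, finalized with JEP 485 in JDK 24).
public class GatherersDemo {
    public static void main(String[] args) {
        // Group the elements into fixed-size windows of three.
        List<List<Integer>> windows = Stream.of(1, 2, 3, 4, 5, 6, 7)
                .gather(Gatherers.windowFixed(3))
                .toList();
        System.out.println(windows); // [[1, 2, 3], [4, 5, 6], [7]]
    }
}
```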
Proposed to Target
JEP 490, ZGC: Remove the Non-Generational Mode, has been promoted from Candidate to Proposed to Target for JDK 24. This JEP proposes to remove the non-generational mode of the Z Garbage Collector (ZGC). The generational mode is now the default as per JEP 474, ZGC: Generational Mode by Default, delivered in JDK 23. Removing the non-generational mode eliminates the need to maintain two modes and reduces the development time of new features for the generational mode. The review is expected to conclude on October 29, 2024.
JEP 487, Scoped Values (Fourth Preview), has been promoted from Candidate to Proposed to Target for JDK 24. Formerly known as Extent-Local Variables (Incubator), this JEP proposes a fourth preview, with one change, in order to gain additional experience and feedback from one round of incubation and three rounds of preview, namely: JEP 481, Scoped Values (Third Preview), delivered in JDK 23; JEP 464, Scoped Values (Second Preview), delivered in JDK 22; JEP 446, Scoped Values (Preview), delivered in JDK 21; and JEP 429, Scoped Values (Incubator), delivered in JDK 20. This feature enables the sharing of immutable data within and across threads. It is preferred to thread-local variables, especially when using large numbers of virtual threads. The only change from the previous preview is the removal of the callWhere() and runWhere() methods from the ScopedValue class, which leaves the API fluent. The ability to use one or more bound scoped values is accomplished via the call() and run() methods defined in the ScopedValue.Carrier class. The review is expected to conclude on October 30, 2024.
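For illustration, a minimal sketch of the API as it stands after this change; the scoped value and method names below are made up for the example, and running it on JDK 24 requires --enable-preview:

```java
// Illustrative sketch of the scoped values API (JEP 487); requires --enable-preview.
public class ScopedValueDemo {

    // An immutable value shared with all code called within the bound scope.
    private static final ScopedValue<String> REQUEST_USER = ScopedValue.newInstance();

    public static void main(String[] args) {
        // where(...) returns a ScopedValue.Carrier; run(...) executes the code with the binding in place.
        ScopedValue.where(REQUEST_USER, "alice").run(ScopedValueDemo::handleRequest);
    }

    static void handleRequest() {
        // The value is readable anywhere down the call chain, but cannot be mutated.
        System.out.println("Handling request for " + REQUEST_USER.get());
    }
}
```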
JEP 478, Key Derivation Function API (Preview), has been promoted from Candidate to Proposed to Target for JDK 24. This JEP proposes to introduce an API for Key Derivation Functions (KDFs), cryptographic algorithms for deriving additional keys from a secret key and other data, with goals to: allow security providers to implement KDF algorithms in either Java or native code; and enable the use of KDFs in implementations of JEP 452, Key Encapsulation Mechanism. The review is expected to conclude on October 31, 2024.
JEP 404, Generational Shenandoah (Experimental), has been promoted from Candidate to Proposed to Target for JDK 24. Originally targeted for JDK 21, JEP 404 was officially removed from the final feature set due to the “risks identified during the review process and the lack of time available to perform the thorough review that such a large contribution of code requires.” The Shenandoah team decided to “deliver the best Generational Shenandoah that they can” and target a future release. The review is expected to conclude on October 30, 2024.
New JEP Candidates
JEP 495, Simple Source Files and Instance Main Methods (Fourth Preview), has been promoted from its JEP Draft 8335984 to Candidate status. This JEP proposes a fourth preview without change (except for a second name change), after three previous rounds of preview, namely: JEP 477, Implicitly Declared Classes and Instance Main Methods (Third Preview), delivered in JDK 23; JEP 463, Implicitly Declared Classes and Instance Main Methods (Second Preview), delivered in JDK 22; and JEP 445, Unnamed Classes and Instance Main Methods (Preview), delivered in JDK 21. This feature aims to “evolve the Java language so that students can write their first programs without needing to understand language features designed for large programs.” This JEP builds on the September 2022 blog post, Paving the on-ramp, by Brian Goetz, Java language architect at Oracle. Gavin Bierman, consulting member of technical staff at Oracle, has published the first draft of the specification document for review by the Java community. More details on JEP 445 may be found in this InfoQ news story.
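As a brief, illustrative sketch, this is all that a complete program looks like under the preview (the file name is arbitrary; run with java --enable-preview Hello.java):

```java
// Hello.java -- an implicitly declared class with an instance main method (JEP 495).
void main() {
    // println is available through the implicitly imported java.io.IO class.
    println("Hello, world!");
}
```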
JEP 494, Module Import Declarations (Second Preview), has been promoted from its JEP Draft 8335987 to Candidate status. This JEP proposes a second preview after the first round of preview, namely: JEP 476, Module Import Declarations (Preview), delivered in JDK 23. This feature will enhance the Java programming language with the ability to succinctly import all of the packages exported by a module, with a goal to simplify the reuse of modular libraries without requiring the importing code to be in a module itself. Changes from the first preview include: remove the restriction that a module is not allowed to declare transitive dependencies on the java.base module; revise the declaration of the java.se module to transitively require the java.base module; and allow type-import-on-demand declarations to shadow module import declarations. These changes mean that importing the java.se module will import the entire Java SE API on demand.
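For illustration, a minimal sketch of a module import declaration (requires --enable-preview; the class is made up for the example, and it imports java.base rather than java.se to avoid on-demand ambiguities such as java.util.List versus java.awt.List):

```java
import module java.base; // imports, on demand, all packages exported by java.base

// Illustrative use of a module import declaration (JEP 494); requires --enable-preview.
public class ModuleImportDemo {
    public static void main(String[] args) {
        // List and Stream resolve without individual package imports.
        List<String> names = Stream.of("JDK", "24").toList();
        System.out.println(names);
    }
}
```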
JEP 493, Linking Run-Time Images without JMODs, has been promoted from its JEP Draft 8333799 to Candidate status. This JEP proposes to “reduce the size of the JDK by approximately 25% by enabling the jlink tool to create custom run-time images without using the JDK’s JMOD files.” This feature must explicitly be enabled when building the JDK. Developers are now allowed to link a run-time image from their local modules, regardless of where those modules exist.
JDK 24 Release Schedule
The JDK 24 release schedule, as approved by Mark Reinhold, Chief Architect, Java Platform Group at Oracle, is as follows:
- Rampdown Phase One (fork from main line): December 5, 2024
- Rampdown Phase Two: January 16, 2025
- Initial Release Candidate: February 6, 2025
- Final Release Candidate: February 20, 2025
- General Availability: March 18, 2025
For JDK 24, developers are encouraged to report bugs via the Java Bug Database.

MMS • Sergio De Simone
Article originally posted on InfoQ.

In a Software Engineering Daily podcast hosted by Kevin Ball, Steve Klabnik and Herb Sutter discuss several topics related to Rust and C++, including what the languages have in common and what is unique to them, where they differ, how they evolve, and more.
The perfect language does not exist
For Klabnik, besides the common idea about Rust being a memory-safe language, what really makes it different is its taking inspiration from ideas that have not been popular in more recent programming languages but that are well-known in the programming languages space.
Sutter, on the other hand, stresses the idea of zero-overhead abstraction, i.e., the idea that you can express things in a higher-level way without paying an excessive cost for that abstraction. Here Klabnik echoes Sutter, noting that the primary mechanism through which Rust achieves “zero-cost” abstractions is its type system:
[Klabnik] Basically, the more that you could express in the type system, the less you needed to check at runtime.
But not all abstractions are zero-overhead, says Sutter, bringing the example of two C++ features that compilers invariably offer the option to disable: exception handling and runtime typing.
[Sutter] If you look at why, it’s because those two can be written better by hand. Or you pay for them even if you don’t use them.
That brings both to the idea that language design is just as much an art or a craft, heavily influenced by taste and a philosophy of what a language ought to be, and that we are not yet at the point where any language can be the perfect language.
Areas of applicability
Coming to how the languages are used, Sutter says they have a very similar target, with ecosystem maturity and thread-safety guarantees being what may lead to choosing one over the other. Klabnik stresses the fact that Rust is often adopted to create networked services, even at large companies:
[Klabnik] Cloudflare, for example, is using Rust code as their backend. And they power 10% of internet traffic. Other companies […] had some stuff in Python and they rewrote it in Rust. And then they were able to drop thousands of servers off of their cloud bill.
Sutter highlights the ability of C++ to give you control of time and space, and the richness of the tool ecosystem that has been built around C++ for 30+ years, while Rust is comparatively still a young language.
[Sutter] Right now, there are things where I can’t ship code out of the company unless it goes through this tool. And it understands C++ but it doesn’t understand Rust. I can’t ship that code. If I write in Rust, I’m not allowed to ship it without a VP exception.
How languages evolve
When it comes to language evolution, the key question is whether it is worth adding complexity to the language for a new feature, which is not always an easy decision. In this regard, one of the things Klabnik praises about C++ is its commitment to backwards compatibility “within reason”.
[Klabnik] I’m not saying that the Rust team necessarily doesn’t have a similar view in the large. But there’s been some recent proposals about some things that kind of add a degree of complexity that I am not necessarily personally comfortable with.
Sutter mentions how the C# team at some point decided to add nullability to the language and, although this is surely a great thing to have, it also brought hidden costs:
[Sutter] Because there’s complexity for users and the complexity in the language now but also for the next 20 years. Because you’re going to be supporting this thing forever, right? And it’s going to constrain evolution forever. […] It turns out that once you add nullability […] the whole language needs to know about it more than you first thought.
To further stress how hard it is to make these kinds of decisions, Klabnik recalls a case where he fought against Rust’s postfix syntax for await, which differs from most other languages where it is prefix:
[Klabnik] Now that I’m writing a lot of code with async and await, gosh, I’m glad they went with postfix await. Because it’s like so, so much nicer and in so many ways.
The power of language editions
One way for the Rust Core team to manage change is defining language editions to preserve source compatibility when new features are added.
Sutter shows himself to be interested in understanding whether this mechanism could have made it possible to introduce C++ async/await support without needing to adopt the co_async and co_await keywords to prevent clashes with existing codebases. Klabnik explains that editions are a powerful mechanism to add new syntax, but they have limitations:
[Klabnik] Changing mut to unique as a purely syntactic change would absolutely work. Because what ends up happening is when it compiles one package for 2015 and one package for 2024, and 2024 adds unique or whatever, it would desugar to the same internal [representation … but] you can’t like add a GC in an edition, because that would really change the way that things work on a more deep level.
Similarly, the Rust standard library is a single binary, and thus you have to ensure it includes all the functions that are used across all editions.
Klabnik and Sutter cover more ground and detail in the podcast than we can summarize here. Those include the importance for C++ of its standardization process; whether having an ISO standard language would be of any interest to the Rust community; the reasons for and implications of having multiple compilers vs. just one in Rust’s case; how to deal with conformance; and the value of teaching languages like C++ or Rust in universities. Do not miss the full podcast if you are interested in learning more.

MMS • Edin Kapic
Article originally posted on InfoQ.

Microsoft released version 8.1 of the Windows Community Toolkit in August 2024. The new release has updated dependencies, .NET 8 support, two new controls and several changes to existing controls and helpers.
The Windows Community Toolkit (WCT) is a collection of controls and libraries that help Windows developers by providing additional features that the underlying platform doesn’t yet offer. Historically, the features provided by the toolkit were gradually incorporated into the Windows development platform itself.
The Windows Community Toolkit is not to be confused with the .NET Community Toolkit (NCT), which contains the common features of WCT that aren’t tied to any underlying UI platform.
There are no major changes in version 8.1. The most important ones are the updated dependencies and the redirects for the older NuGet packages.
The dependencies of the toolkit are now updated to the latest versions, Windows App SDK 1.5 and Uno Platform 5.2. The minimum Windows target version is bumped to version 22621. This move allows WCT consumers to target .NET 8 features in their apps.
When WCT 8.0 was released last year, namespaces were rationalised to remove redundancies. In version 8.1, there are NuGet package redirects to point the code targeting version 7.x of the toolkit to the equivalent namespaces in version 8.x. For example, Microsoft.Toolkit.Uwp.UI.Controls.Primitives will now redirect to CommunityToolkit.Uwp.Controls.Primitives.
Beyond the new dependencies and package redirects, there are several new features in the updated version. The most prominent ones are the ColorPicker and the ColorPickerButton controls. They were previously available in WCT 7.x and now are ported back again with Fluent WinUI look-and-feel and several bug fixes.
Another control ported from version 7.x is the TabbedCommandBar. It has also been updated with new WinUI styles, and a bug regarding accent colour changes was fixed in this new version.
Minor fixes in version 8.1 include making camera preview helpers work with Windows Apps SDK, support for custom brushes to act as overlays for ImageCropper control, and new spacing options for DockPanel control.
The migration process for the users of the older 7.x version of the WCT involves updating the TargetFramework property in the .csproj file and the publishing profile to point to the new version of the Windows SDK.
Microsoft recommends developers check and contribute to the Windows Community Toolkit Labs, a repository for pre-release and experimental features that aren’t stable enough for the main WCT repository. Still, the issue of several controls dropped in the move from version 7 to version 8 of the toolkit is a source of frustration for WCT developers.
Version 8.1.240821 was pre-released on June 6th 2024 and released for general availability on August 22nd 2024. The source code for the toolkit is available on GitHub.

MMS • Robert Krzaczynski
Article originally posted on InfoQ.

JetBrains announced a significant change to its licensing model for Rider, making it free for non-commercial use. Users can now access Rider at no cost for non-commercial activities, including learning, open-source project development, content creation, or personal projects. This update aims to enhance accessibility for developers while commercial clients will still need to obtain a paid license.
In order to take advantage of the new non-commercial subscription, installing Rider is necessary. Upon opening the program for the first time, a license dialog box will appear. At that point, the option for non-commercial use should be selected, followed by logging in to an existing JetBrains account or creating a new one. Finally, acceptance of the terms of the Toolbox Subscription Agreement for Non-Commercial Use is required.
Developers asked whether the non-commercial license for Rider would be “free forever or for a limited time.” JetBrains clarified that the non-commercial license is available indefinitely. Users will receive a one-year license that automatically renews, allowing them to use Rider for non-commercial projects without interruption.
Furthermore, several questions have arisen regarding the licensing details. One concern came from user MrX, who asked how the licensing applies to game development:
How does that work if someone is making a game with it? Can they technically use it for years and only need to start a subscription if they’ve actually released something to sell?
In response, JetBrains explained that any user planning to release a product with commercial benefits, whether immediately or in the future, should use a commercial license.
Another community member, Pravin Chaudhary, shared doubts about changing project intentions:
Plans can change, no? I may have a side project that I will make commercial only if it gets enough traction. How to handle such a case?
JetBrains addressed this by stating that users would need to reevaluate their licensing status if their project goals shift from non-commercial to commercial.
Following the announcement, community feedback has been positive. Many developers have expressed excitement about the implications of this licensing change. Matt Eland, a Microsoft MVP in AI and .NET, remarked:
This changes the paradigm for me as someone supporting cross-platform .NET, particularly as someone writing .NET books primarily from a Linux operating system. I now have a lot more comfort in including Rider screenshots in my content now that I can easily point people to this as a free non-commercial tool.
For more information about this licensing change, users can visit the official JetBrains website.

MMS • Sergio De Simone
Article originally posted on InfoQ.

Stability AI has released Stable Diffusion 3.5 Large, its most powerful text-to-image generation model to date, and Stable Diffusion 3.5 Large Turbo, with special emphasis on customizability, efficiency, and flexibility. Both models come with a free licensing model for non-commercial and limited commercial use.
Stable Diffusion 3.5 Large is an 8-billion-parameter model which can generate professional images at 1 megapixel resolution, says Stability AI. Stable Diffusion 3.5 Large Turbo is a distilled version of Stable Diffusion 3.5 Large that focuses on being faster by reducing the number of required steps to just four. Both models, says Stability AI, provide top-tier performance in prompt adherence and image quality.
One of the goals behind the Stable Diffusion 3.5 models is customizability, meaning the possibility for users to fine-tune the model or build customized workflows. To train LoRAs with Stable Diffusion 3.5, you can use the existing SD3 training script with some additional caveats if you want to have it work with quantization. Stable Diffusion 3.5 is also optimized to run on standard consumer hardware, according to Stability AI, and to provide a diverse output, including skin tones, 3D images, photography, painting, and so on.
Stable Diffusion 3.5 follows Stable Diffusion 3 Medium, released last June, which garnered criticism in several areas, including its ability to accurately depict human anatomy, and specifically hands. In the 3.5 release announcement, Stability AI acknowledged the community’s dissatisfaction and made clear that 3.5 is not a quick fix but a step forward in Stable Diffusion’s evolution. Still, while Stable Diffusion 3.5 fixes the known issues with “girls lying in the grass”, it may still fail with apparently basic prompts.
As Stability AI explains, Stable Diffusion 3.5 has a similar architecture to SD3’s, with two major changes: the use of QK normalization and of double attention layers.
As mentioned, Stable Diffusion 3.5 is released under a permissive license allowing free use for non-commercial projects and for commercial purposes by creators whose total annual revenue is less than $1M. The free “community” license expressly forbids creating competing foundational models. While this might sound restrictive, custom models trained using common customization techniques such as LoRAs, hypernetworks, fine-tunes, or retraining from scratch are not considered “foundational”.
Later this month, Stability AI is going to release Stable Diffusion 3.5 Medium, using 2.5 billion parameters and designed to run on consumer hardware. This will further enable the creation of custom models on a variety of hardware, albeit with a slight loss of output quality.
You can download the Stable Diffusion 3.5 inference code from GitHub, while the model itself is available on Hugging Face. You can also use the model on platforms like Replicate, ComfyUI, and DeepInfra, or directly through the Stability AI API.
AI and ML Tracks at QCon San Francisco 2024 – a Deep Dive into GenAI & Practical Applications

MMS • Artenisa Chatziou
Article originally posted on InfoQ.

At QCon San Francisco 2024, the international software development conference by InfoQ, there are two tracks dedicated to the rapid advancements in AI and ML, reflecting how these technologies have become central to modern software development.
The conference spotlights senior software developers from enterprise organizations sharing their approaches to adopting emerging trends. Each session is led by seasoned practitioners and provides actionable insights and strategies attendees can immediately apply to their own projects: no marketing pitches, just real-world solutions.
The first track, Generative AI in Production and Advancements, curated by Hien Luu, Sr. engineering manager @Zoox & author of MLOps with Ray, provides a deep dive into practical AI/ML applications and the latest industry innovations. The track explores real-world implementations of GenAI, focusing on how companies leverage it to enhance products and customer experiences.
Talks will include insights from companies like Pinterest, Meta, Microsoft, Voiceflow, and more, sharing best practices on deploying large language models (LLMs) for search and recommendations and exploring AI agents’ potential for the future of software:
- Scaling Large Language Model Serving Infrastructure at Meta: Charlotte Qi, senior staff engineer @Meta, will share insights on balancing model quality, latency, reliability, and cost, supported by real-world case studies from their production environments.
- GenAI for Productivity: Mandy Gu, senior software development manager @Wealthsimple, will share how they use Generative AI to boost operational efficiency and streamline daily tasks, blending in-house tools with third-party solutions.
- Navigating LLM Deployment: Tips, Tricks, and Techniques: Meryem Arik, co-founder @TitanML, recognized as a technology leader in Forbes 30 Under 30, will cover best practices for optimizing, deploying, and monitoring LLMs, with practical tips and real case studies on overcoming the challenges of self-hosting versus using API-based models.
- LLM Powered Search Recommendations and Growth Strategy: Faye Zhang, staff software engineer @Pinterest, covers model architecture, data collection strategies, and techniques for ensuring relevance and accuracy in recommendations, with a case study from Pinterest showcasing how LLMs can drive user activation, conversion, and retention across the marketing funnel.
- 10 Reasons Your Multi-Agent Workflows Fail and What You Can Do about It: Victor Dibia, principal research software engineer @Microsoft Research, core contributor to AutoGen, author of “Multi-Agent Systems with AutoGen” book, will share key challenges in transitioning from experimentation to production-ready systems and highlight 10 common reasons why multi-agent systems often fail.
- A Framework for Building Micro Metrics for LLM System Evaluation: Denys Linkov, head of ML @Voiceflow, LinkedIn Learning Instructor, ML advisor, and instructor, explores how to measure and improve LLM accuracy using multidimensional metrics beyond a simple accuracy score.
The second track, “AI and ML for Software Engineers: Foundational Insights“, curated by Susan Shu Chang, principal data scientist @Elastic, author of “Machine Learning Interviews”, is tailored to those looking to integrate AI/ML into their work.
The talks will cover the deployment and evaluation of ML models, providing hands-on lessons from companies that have faced similar challenges:
- Recommender and Search Ranking Systems in Large Scale Real World Applications: Moumita Bhattacharya, senior research scientist @Netflix, dives into the evolution of search and recommendation systems, from traditional models to advanced deep learning techniques.
- Why Most Machine Learning Projects Fail to Reach Production and How to Beat the Odds: Wenjie Zi, senior machine learning engineer and tech lead @Grammarly, specializing in natural language processing, 10+ years of industrial experience in artificial intelligence applications, will explore why many machine learning projects fail, despite the growing hype around AI, and how to avoid common pitfalls such as misaligned objectives, skill gaps, and the inherent uncertainty of ML.
- Verifiable and Navigable LLMs with Knowledge Graphs: Leann Chen, AI developer advocate @Diffbot, creator of AI and knowledge graph content on YouTube, will demonstrate how knowledge graphs enhance factual accuracy in responses and how their relationship-driven features enable LLM-based systems to generate more contextually-aware outputs.
- Reinforcement Learning for User Retention in Large-Scale Recommendation Systems: Saurabh Gupta, senior engineering leader @Meta, and Gaurav Chakravorty, Uber TL @Meta, will share insights into leveraging reinforcement learning (RL) for personalized content delivery, exploring reward shaping, optimal time horizons, and necessary infrastructure investments for success at scale.
- No More Spray and Pray— Let’s Talk about LLM Evaluations: Apoorva Joshi, senior AI developer @MongoDB, six years of experience as a data scientist in cybersecurity, active member of Girls Who Code, Women in Cybersecurity (WiCyS) and AnitaB.org, will share practical frameworks for LLM evaluation that someone can apply directly to their projects.
These sessions aim to bridge the gap between AI hype and practical, scalable applications, helping engineers of all levels navigate the evolving AI landscape.
Explore all 12 tracks taking place at QCon San Francisco this November 18-22, and take advantage of the last early bird tickets ending on October 29. With only a few weeks left, join over 1,000 of your peers in shaping the future of software development!

MMS • Gunnar Morling
Article originally posted on InfoQ.

Transcript
Morling: I would like to talk about a viral coding challenge which I did in January of this year, called, The One Billion Row Challenge. I would like to tell a little bit the story behind it, how all this went, what I experienced during January. Then, of course, also some of the things I learned, some of the things the community learned from doing that. This is how it started. On January 1st, I put out this tweet, I put out a blog post introducing this challenge. Was a little bit on my mind to do something like that for quite a while. Between Christmas and New Year’s Eve, I finally found the time to do it. I thought, let me go and put out this challenge. I will explain exactly what it was, and how it worked. That’s how it started. Then it got picked up quite quickly. People started to work on that. They implemented this in Java, which was the initial idea, but then they also did in other languages, other ecosystems, like .NET, Python, COBOL, with databases, Postgres, Snowflake, Apache Pinot, all those good things.
There was an article on InfoQ, which was the most read article for the first six months of the year, about this. There also was this guy, I didn’t know him, Prime, a popular YouTuber. He also covered this on his stream. What did I do then? I learned how to print 162 PDF files and send them out to people with their individual certificate of achievement. I learned how to send coffee mugs and T-shirts to people all around the world, because I wanted to give out some prices. I sent those things to Taiwan, South America, the Republic of South Korea, and so on. That was what happened during January.
Who am I? I work as a software engineer at a company called Decodable. We built a managed platform for stream processing based on Apache Flink. This thing is completely a side effort for me. It was a private thing, but then Decodable helped to sponsor it and supported me with doing that.
The Goals
What was the idea behind this? Why did I do this? I thought I would like to learn something new. There’s all those new Java versions every six months. They come with new APIs, new capabilities. It’s really hard to keep track of all those developments. I would like to know what’s new in those new Java versions, and what can I do with those things? I wanted to learn about it, but also, I wanted to give the community an avenue to do that so that everybody can learn something new. Then, of course, you always want to have some fun. It should be a cool thing to do. You don’t want to just go and read some blogs. You would like to get some hands-on experience.
Finally, also, the idea was to inspire others to do the same. This was the thing, which I think was a bit specific about this challenge, you could actually go and take inspiration from other people’s implementations. Nothing was secret. You could go and see what they did, and be inspired by them. Obviously, you shouldn’t just take somebody else’s code and submit it as your own implementation. That would make much sense. You could take inspiration, and people actually did that, and they teamed up in some cases. The other motivation for that was, I wanted to debunk this idea, which sometimes people still have, Java is slow, and nothing could be further from the truth, if you look at modern versions and their capabilities. Still, this is on people’s minds, and I wanted to help and debunk this. As we will see, I think that definitely was the case.
How Did it Work?
Let’s get a bit into it. How did it work? You can see it all here in this picture. The idea was, we have a file with temperature measurements, and it’s essentially like a CSV file, only that it wasn’t a comma as a separator, but a semicolon, with two columns: a station name, like Hamburg or Munich, and so on, and then a temperature measurement value associated to that, randomized values. The task was, process that file, aggregate the values from that file, and for each of those stations, determine the minimum value, the maximum value, and the average temperature value. Easy. Then the only caveat was, this has 1 billion rows, as the name of the challenge gives away.
This file, if you generate this on your machine, it has a size of around 13 gigabytes, so quite sizable. Then you had to print out the results, as you can see it here. This already goes to show a little bit that I didn’t spend super much time to prepare this, because this is just the toString output of the Java HashMap implementation. A bit random to have this as the expected output. Then as people were implementing this, for instance, with relational databases, they actually went to great lengths to emulate that output format. I should have chosen a relational output format, next time.
The Rules
A little bit more about the rules. First of all, this was focused on Java. Why? It’s the platform I know best, I would like to support. This is also what I wanted to spread the knowledge on. Then you could choose any version. Any new versions, preview versions, all the kinds of distributions, like GraalVM, or all the different JDK providers which are out there. You managed using this tool called SDKMAN. Who has heard about SDKMAN? You should go and check out SDKMAN, and you should use it to manage Java versions. It’s a very nice tool for doing that, and switching back and forth between different versions. That’s it. Java only. No dependencies.
It wouldn’t make much sense to pull in some library and do the task for you. You should program this by yourself. No caching between runs. I did five runs of each implementation. Then I discarded the fastest and the slowest one, and I took the average value from the remaining three runs. It wouldn’t make sense to have any caching there. Otherwise, you could just do the task once, persist the result in a file, read it back, and it would be very fast. That doesn’t make much sense. You were allowed to take inspiration from others. Of course, not again, just resubmit somebody else’s implementation. You could take inspiration.
Evaluation Environment
In terms of how I ran this. My company spent €100 on this machine which I got on the Hetzner cloud. Very beefy server with 32 cores, out of which I mostly used only 8 cores, I will explain later on. Quite a bit of RAM. Really, the file was always coming from a RAM disk. I wanted to make sure that disk I/O is not part of the equation, just because it’s much less predictable, would have made the life much harder for me. Only a purely CPU bound problem here. How would we go and do this? This is my baseline implementation. I’m an average Java developer, so that’s what I came up with. I use this Java Streams API. I use this files.lines method, which gives me a stream with the lines of a file. I read that file from disk, then I map each of my lines there using the split method. I want to separate the station name from the value. Then I collect the results, the lines into this grouping collector. I group it by the station name.
Then for each of my stations, I need to aggregate those values, which happens here in my aggregator implementation. Whenever a new value gets added to an existing aggregator object, I keep track of the min, the max, and in order to calculate average I keep track of the sum and the count of the values. Pretty much straight forward. That’s adding a line. Then, if I run this in parallel, I would need to merge two aggregators. That’s what’s happening here. Again, pretty much straight forward. Finally, if I’m done, I need to reduce my processed results, and I emit such a result through object with the min, the max value. Then, for the average, I just divide sum by count, and I print it out. On this machine, this ran in about five minutes. Not super-fast, but also not terribly bad. Writing this code, it took me half an hour or maybe less. It’s decent. Maybe, if you were to solve this problem in your job, you might call it a day and just go home, have a coffee and be done with it. Of course, for the purpose of this challenge, we want to go quite a bit faster and see how much we can move the needle here.
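A minimal sketch of a baseline along those lines is shown below; the class and file names are illustrative rather than the exact code from the challenge repository:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collector;
import java.util.stream.Collectors;

// Illustrative baseline: read the file line by line, split on ';', group by station,
// and track min/max/sum/count per station.
public class BaselineAverages {

    static class Aggregator {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum;
        long count;

        void add(double value) {
            min = Math.min(min, value);
            max = Math.max(max, value);
            sum += value;
            count++;
        }

        Aggregator merge(Aggregator other) {
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
            sum += other.sum;
            count += other.count;
            return this;
        }

        @Override
        public String toString() {
            // min/mean/max, as in the expected output format
            return "%.1f/%.1f/%.1f".formatted(min, sum / count, max);
        }
    }

    public static void main(String[] args) throws IOException {
        try (var lines = Files.lines(Path.of("measurements.txt"))) {
            Map<String, Aggregator> results = lines
                    // .parallel()  // the single method call discussed below under "Parallelization"
                    .map(line -> line.split(";"))
                    .collect(Collectors.groupingBy(
                            parts -> parts[0],
                            TreeMap::new,
                            Collector.of(
                                    Aggregator::new,
                                    (agg, parts) -> agg.add(Double.parseDouble(parts[1])),
                                    Aggregator::merge)));
            System.out.println(results);
        }
    }
}
```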
The First Submission
With this challenge, the important thing is somebody has to come and participate. That was a bit my concern like, what happens if nobody does it? It would be a bit embarrassing. Roy Van Rijn, another Java champion from the Netherlands, he was instantly interested in this, and an hour later or so, after I had put out the post, he actually created his own first implementation, and it wasn’t very fancy or very elaborate. His idea just was, I want to be part of it. I want to put out a submission so other people can see, this is something we also could do. This was really great to see, because as soon as the first person comes along and takes part, then also other people will come along and take part. Of course, he kept working on that. He was one of the most active people who iterated on his implementation, but he was the very first one to submit something.
Parallelization
Let’s dive a little bit into the means of what you can do to make this fast. People spent the entire month of January working on this, and they went down to a really deep level, essentially counting CPU instructions. My idea is, I want to give you some ideas of what exists, what you can do. Then maybe later on, if you find yourself in that situation where you would like to optimize certain things, then you might remember, I’ve heard about it. Then you could go and learn really deep. That’s the idea.
Let’s talk about parallelization, first of all, because we have many CPU cores. On my server, which I use to evaluate it, I have 32 cores, 64 with hyperthreading. We would like to make use of that. Would be a bit wasteful to just use a single core. How can we go about this? Going back to my simple baseline implementation, the first thing I could do is I could just say, let me add this parallel call, so this part of the Java Streams API.
Now this will process this pipeline, or I should say, part of this streaming pipeline in parallel. Just doing this, just adding this single method call, gets us down to 71 seconds. From 4 minutes 50 seconds, to 71 seconds by just adding a few characters for one method call. I think that’s a pretty good thing. Very easy win. Of course, that’s not all we can do. In particular, if you think about it, yes, it gets us down by quite a bit, but it’s not eight times faster than what we had initially, but we have 8 CPU cores which I’m using here. Why is it not eight times faster? This parallel operator, this applies to the processing logic. All this aggregating and grouping logic, this happens in parallel, but this reading of the file from memory, this still happens sequentially.
The entire file, the reading part, that’s sequentially, and we have still all our other CPU cores sitting idle, so we would like to also parallelize that. This comes back then to this notion that I would like to go out and learn something new, because all those new Java versions, they come with new APIs, the JEPs, the Java Enhancement Proposals. One of them, which was added recently, is the foreign function and memory API. You can see it here, so that’s taken straight from the JEP, but essentially, it’s a Java API which allows you to make use of native methods.
It’s a replacement, much easier to use than the old JNI API. It also allows you to make use of native memory. Instead of the heap, which is managed by the JVM, you get the ability to manage your own memory section, like an off-heap memory, and you will be in charge of maintaining that, and making sure you free it, and so on. That’s what we would like to use here, because we could memory map this file and then process it there in parallel. Let’s see how we can go about that.
I mentioned there’s a bit of code, but I will run you through it. That’s a bit of a recurring theme. The code you will see, it gets more dense as we progress. Again, you don’t really have to understand everything. I would like to give you a high-level intuition. What do we do here? First of all, we determine the degree of parallelism. We just say, how many CPU cores do we have? Eight in our case, so that’s our degree of parallelism. Next, we want to memory map this file. You could have memory map files also in earlier Java versions, but for instance, you had size limits. You couldn’t memory map an entire 13-gig file all at once, whereas now here with the new foreign memory API, that’s possible. We can do this. You map the file. We have this Arena object there. This is essentially our representation of this memory. There’s different kinds of Arenas. In this case, I’m just using this global Arena which just exists and is accessible from everywhere within my application. That’s where I have that file, and now I can access that entire section of memory in parallel using multiple threads.
In order to do so, we need to split up that file and the memory representation. That’s what happens here. First of all, roughly speaking, we divide into eight equal chunks. We take our entire size divided by eight. That’s our estimate of chunk sizes. Now, of course, what would happen is, in all likelihood, we will end up in the middle of a line. This is not really desirable, where, ideally, we would like to have our worker processes, they should work on entire lines. What’s happening here is, we went to roughly one-eighth of the file, we just keep going to the next line ending character. Then we say, that’s the end of this chunk, and the starting point of the next chunk. Then we process those chunks, essentially, just using threads. We will see later on how to do this. We start our threads, we join them. In the end, we’ve got to wait. Now this parallelizes the entire thing.
Now we really make use of all our 8 cores for the entire time, also while we do the I/O. There’s one caveat. Just by the nature of things, one of those CPU cores will always be the slowest. At some point, all the other seven, they will just wait for the last one to finish, because it’s a little bit unequally distributed. What people, in the end, did, instead of using 8 chunks, they split up this file in much smaller chunks. Essentially, they had a backlog of those chunks. Whenever one of those worker threads was done with the current chunk, it would go and pick up the next one.
By that, you make sure that all your 8 threads are utilized equally all the time. The ideal chunk size, as it turned out, was 2 megabytes. Why 2 megabytes? This specific CPU, which is in this machine which I used, it has a second level cache size of 16 megabytes, 8 threads processing 2 megabytes at a time. It’s just the best in terms of predictive I/O and so on. This is what people found out. This already goes to show, we really get down to the level of a specific CPU and the specific architecture to really optimize for that problem by doing those kinds of things. That’s parallel I/O.
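A rough sketch of the memory-mapping and chunk-splitting logic described here could look as follows; the names are illustrative, and it is simplified to eight roughly equal chunks rather than the 2-megabyte work queue:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: map the whole file with the FFM API and cut it into per-thread
// chunks that always end on a line break, so no worker starts in the middle of a line.
public class ChunkedMapping {

    record Chunk(long start, long end) {}

    public static void main(String[] args) throws Exception {
        int parallelism = Runtime.getRuntime().availableProcessors();
        try (FileChannel channel = FileChannel.open(Path.of("measurements.txt"), StandardOpenOption.READ)) {
            long size = channel.size();
            // Map the entire file into native memory; the global arena stays alive for the JVM's lifetime.
            MemorySegment file = channel.map(MapMode.READ_ONLY, 0, size, Arena.global());

            List<Chunk> chunks = new ArrayList<>();
            long chunkSize = size / parallelism;
            long start = 0;
            for (int i = 0; i < parallelism; i++) {
                long end = (i == parallelism - 1) ? size : Math.min(size, (i + 1) * chunkSize);
                // Move the chunk boundary forward to the next line ending.
                while (end < size && file.get(ValueLayout.JAVA_BYTE, end) != '\n') {
                    end++;
                }
                if (end < size) {
                    end++; // include the newline itself
                }
                chunks.add(new Chunk(start, end));
                start = end;
            }

            // One platform thread per chunk; workers parse and aggregate their range independently.
            List<Thread> workers = new ArrayList<>();
            for (Chunk chunk : chunks) {
                workers.add(Thread.ofPlatform().start(() -> process(file, chunk.start(), chunk.end())));
            }
            for (Thread worker : workers) {
                worker.join();
            }
        }
    }

    static void process(MemorySegment file, long start, long end) {
        // Parsing and aggregation of the lines in [start, end) is omitted here.
    }
}
```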
1BRC – Mythbusters, and Trivial Task?
This challenge, it was going, people were participating. They had a good time. Of course, whenever something happens, there’s also conspiracy theories. That’s what I fear. People said, is this actually an engineering problem? At Decodable, you had this problem and you didn’t know how to do it, so you farmed it out to the community. I can tell you, this would have been the most silly thing I could have done, because I created so much work by running this challenge for myself. I didn’t do much besides it during the entire month of January. It was not that. It was just a genuine thing, which I felt would be interesting to me and the community. Was it an ad for GraalVM? Because many people actually used GraalVM, and we will see later on more about it. Also, no. It was just like GraalVM lends itself really well towards that problem. Finally, is it an ad for this AMD EPYC processor? Also, no.
Really, no conspiracies going on here. Who is on Hacker News? I read Hacker News way too often. Of course, you always have the Hacker News guy who says, that’s a trivial problem. Everybody who knows how to program just a little bit, they will have solved this in two hours, and it’s like a boring thing. Then, on the other hand, you have all the people from the Java community, and also, big names like Thomas Würthinger, who is the engineering lead at GraalVM, or Cliff Click, who was one of the original people behind the JVM, or Aleksey Shipilev, and all those people, they spend the entire month of January doing this. Of course, the Hacker News dude, he does it in two hours. Always interesting to see.
Parsing
Let’s dive a little more into parsing that. We have seen how to make use of our CPU cores, but what actually happens there to process a line? Let’s take a closer look at that. If we want to get away from what we had initially with just splitting up the file using regex and so on, that’s not very efficient. Let’s see what we can do here. That’s, again, something I would be able to come up with just processing those input lines, character by character. What’s happening here is, we have a little bit of a state machine. We read our characters. We keep reading the line until it has no more characters. Then we use the semicolon character which separates our station name from the temperature value to switch these states. Depending on which state we are, so do we either read the bytes which make up the station name, or do we read up the bytes which make up the measurement value? We need to add them into some builder or buffer which aggregates those particular values.
Then, if we are at the end of a line, so we have found the line ending character, then we need to consume those two buffers for the station and for the measurement, which we have built up. For the measurement, we will need to see how we convert that into an integer value, because that’s also what people figured out. The problem was described as a double or a floating-point arithmetic, so with values like 21.7 degrees, but then again, randomly, I always only had a single fractional digit. People realized, this data actually, it always only has a single fractional digit. Let’s take advantage of that and just consider that as an integer problem by just multiplying the number by 100, for the means of calculation. Then at the end, of course, divide it by 100, or by 10. That’s something which people did a lot, and which I underestimated how much they would take advantage of the particular characteristics of that dataset.
For that conversion, we can see it here, and it makes sense, so we process or we consume those values. If we see the minus character, we negate the value. If we see the first one of our two digits, we multiply it by 100 or by 10. That’s how we get our value there. Doing that, it gets us down to 20 seconds. This is already an order of magnitude faster than my initial baseline implementation. So far, nothing really magic has happened. One takeaway also for you should be, how much does it make sense to keep working on such a thing? Again, if this is a problem you’re faced with in your everyday job, maybe stop here. It’s well readable, well maintainable. It’s an order of magnitude faster than the native baseline implementation, so that’s pretty good.
Of course, for the purposes of this challenge, we probably need to go a bit further. What else can we do? We can, again, come back to the notion of parallelism and try to process multiple values at once, and now we have different means of parallelization. We already saw how to make the most out of all our CPU cores. That’s one degree of parallelism. We could think about scaling out to multiple compute nodes, which is what we typically would do with our datastores. For that problem, it’s not that relevant, we would have to split up that file and distribute it in a network. Maybe not that desirable, but that would be the other end of the spectrum. Whereas we also can go into the other direction and parallelize within specific CPU instructions. This is what happens here with SIMD, Single Instruction, Multiple Data.
Essentially all these CPUs, they have extensions which allow you to apply the same kind of operation onto multiple values at once. For instance, here, we would like to find the line ending character. Now, instead of comparing byte by byte, we can use such a SIMD instruction to apply this to 8 or maybe 16, or maybe even more bytes at once, and it will, of course, speed up things quite a bit. The problem is, in Java, you didn’t really have a good means to make use of those SIMD instructions because it’s a portable, abstract language, it just wouldn’t allow you to get down to this level of CPU specificity. There’s good news.
There’s this vector API, which is still incubating, I think, in the eighth incubating version or so, but this API allows you now to make use of those vectorized instruction set extensions. You would have calls like this compare call with this equal operator, and then this will be translated to the right SIMD instruction of the underlying architecture. This would translate to the Intel or AMD64 extensions. Also, for Arm, it would do that. Or it would fall back to a scalar execution if your specific machine doesn’t have any vector extensions. That’s parallelization on the instruction level. I did another talk about it, https://speakerdeck.com/gunnarmorling/to-the-moon-and-beyond-with-java-17-apis, which shows you how to use SIMD for solving FizzBuzz.
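For illustration, here is a small sketch of how the incubating Vector API can be used to look for the semicolon separator in a byte array (run with --add-modules jdk.incubator.vector; the names are illustrative):

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Illustrative use of the Vector API to find a separator byte in vector-sized steps.
public class FindSeparator {

    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Returns the index of the first ';' in the buffer starting at 'from', or -1 if none.
    static int indexOfSemicolon(byte[] buffer, int from) {
        int i = from;
        int upper = from + SPECIES.loopBound(buffer.length - from);
        for (; i < upper; i += SPECIES.length()) {
            ByteVector chunk = ByteVector.fromArray(SPECIES, buffer, i);
            VectorMask<Byte> matches = chunk.compare(VectorOperators.EQ, (byte) ';');
            if (matches.anyTrue()) {
                return i + matches.firstTrue();
            }
        }
        // Scalar tail for the remaining bytes.
        for (; i < buffer.length; i++) {
            if (buffer[i] == ';') {
                return i;
            }
        }
        return -1;
    }
}
```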
Sticking to that pattern, applying the same operation to multiple values at once, we also can do what’s called SWAR, SIMD Within A Register. Again, I realize, the code gets more dense. I probably wouldn’t even be able right now to explain each and every single line, or it would take me a while. The idea here is, this idea of doing the same thing, like equals to multiple values all at once, we also can do this within a single variable. Because if you have 8 bytes, we also could see one long, that’s 64 bits, that’s also 8 bytes. We can apply the right level of bit level magic to a long value, and then actually apply this operation to all the 8 bytes. It’s like bit level masking and shifting, and so on. That’s what’s happening here. There’s a very nice blog post by Richard Startin, which shows you, step by step, how to do this, or how to use this to find the first zero byte in a string.
I have put the math up here on the right-hand side, so you actually can go and follow along, and you will see, this actually gives you the first zero byte in a long like that. That’s SIMD Within A Register, SWAR. Now the interesting thing is, if you look at this code, something is missing here. Is somebody realizing what we don’t have here? There’s no ifs, there’s no conditionals, no branching in that code. This is actually very relevant, because we need to remember how our CPUs actually work. If you look at how a CPU would take and go and execute our code, it always has this pipelined approach. Each instruction has this phase of, it’s going to be fetched from memory, it’s decoded, it’s executed, and finally the result is written back. Now actually multiple of those things happen in parallel. While we decode our one instruction, the CPU will already go and fetch the next one. It’s a pipelined parallelized approach.
Of course, in order for this to work, the CPU actually needs to know what is this next instruction, because otherwise we wouldn’t know what to fetch. In order for it to know, we can’t really have any ifs, because then we wouldn’t know, which way will we go? Will we go left or right? If you have a way for expressing this problem in this branchless way, as we have seen it before, then this is very good, very beneficial for what’s called the branch predictor in the CPU, so it always knows which are the next instructions. We never have this situation that we actually need to flush this pipeline because we took a wrong path in this predictive execution. Very relevant for that. I didn’t really know much about those things, but people challenged it. One of the resources they employed a lot is this book, “Hacker’s Delight”. I recommend everybody to get this if this is interesting to you. Like this problem, like finding the first zero byte in a string, you can see it here. All those algorithms, routines are described in this book. If this is the thing which gets you excited, definitely check out, and get this book.
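As a small illustration of the SWAR idea, here is the zero-byte trick applied to finding a semicolon within the eight bytes of a long (illustrative code, assuming little-endian byte order):

```java
// SWAR sketch: locate the first ';' inside a long using the zero-byte trick
// from "Hacker's Delight", without any branches per byte.
public class SwarSemicolon {

    private static final long SEMICOLON_PATTERN = 0x3B3B3B3B3B3B3B3BL; // ';' repeated in every byte

    // Returns the byte offset (0-7) of the first ';' in 'word', or 8 if there is none.
    static int firstSemicolon(long word) {
        long xor = word ^ SEMICOLON_PATTERN;                             // bytes equal to ';' become 0x00
        long match = (xor - 0x0101010101010101L) & ~xor & 0x8080808080808080L;
        return Long.numberOfTrailingZeros(match) >>> 3;                  // 64 trailing zeros -> 8 when no match
    }

    public static void main(String[] args) {
        // "Hamburg;" packed little-endian: the ';' ends up in the most significant byte.
        long word = 0x3B677275626D6148L;
        System.out.println(firstSemicolon(word)); // prints 7
    }
}
```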
Then, Disaster Struck
Again, people were working on the challenge. It was going really well. They would send me pull requests every day, and I didn’t expect that many people to participate. That’s why I always went to the server and executed them manually. At some point, someday I woke up, I saw, that’s the backlog of implementations which people had sent over the night, so let’s say 10 PRs to review and to evaluate, and suddenly all those results were off. It was like twice as fast as before. I ran one of the implementations which I had run on the day before, and suddenly it was much faster. I was wondering, what’s going on? What happened is, this workload, I had it initially on a virtual server. I thought, I’m smart. I try to be smart, so I get dedicated virtual CPU cores, so I won’t have any noisy neighbors on that machine, this kind of thing.
What I didn’t expect is that they would just go and move this workload to another machine. I don’t know why. Maybe it was random. Maybe they saw there was lots of load in that machine. In any case, it just got moved to another host, which was faster than before. This, of course, was a big problem for me, because all the measurements which I had done so far, they were off and not comparable anymore. That was a problem. I was a bit desperate at that point in time. This is where the wonders of the community really were very apparent. Good things happened. I realized, I need to get a dedicated server so that this cannot happen again. I need to have a box, which I can use exclusively. As I was initially paying out of my own pocket for that, I thought, I don’t want to go there. I don’t want to spend €100. As I mentioned, Decodable, my employer, they stepped up to sponsor it.
Then, of course, I also needed help with maintaining that, because I’m not the big operations guy. I know a little bit about it. Then for that thing, you would, for instance, like to turn off the hyperthreading, or you would like to turn off turbo boost to have stable results. I wasn’t really well-versed in doing those things, but the community came to help. In particular, René came to help. He offered his help to set up the thing. We had a call. I spoke to him. I had not known him. It was the first time I ever spoke to him, but we had a great phone conversation. In the end, he just sent me his SSH key. I uploaded his key to the machine, gave him the key to the kingdom, and then he was going and configuring everything the right way. There were multiple people, many people like that, who came and helped, because otherwise I just could not have done it.
The 1BRC Community
All this was a bit of a spontaneous thing. Of course, I put out some rules for how this should work, but I wasn't very prescriptive. What is the value range? How long can station names be? What sort of UTF character planes can they use, and so on? I didn't really specify it. Of course, people asked: how long can a station name be, what kinds of characters can it contain? We had to nail down the rules and the boundaries of the challenge. Then people actually built a TCK, a test kit. It was a test suite which you then had to pass: not only do you want to be fast, you also want to be correct. People built this test suite, and it grew over time. Whenever a new submission, a new entry, came in, it first of all had to pass those tests. Only if it was valid would I go and evaluate it and take the runtime. This is how it looked; you can see it here.
It had example files with measurements and an expected file describing what the result for that file should be. The test runner would run the implementation against that set of files and ensure the result is right. That's the test kit. The other thing, which was also very important, is that I had to run all those things on that machine. There were quite a few things related to that, like making sure the machine is configured correctly, and then I did five runs and wanted to discard the fastest and slowest, all those things. Jason, here, came to help and scripted all that. It was actually very interesting to see how he did it. I would really recommend going to the repo and checking out the shell scripts which are used for running those evaluations. It's a bit of a master class in writing shell scripts, with very good error handling, colored output, all this good stuff to make it really easy and also safe to run those things. If you have to do shell scripting, definitely check out those scripts.
Bookkeeping
Then, let's talk about one more thing, which is also very relevant, and this is what I would call bookkeeping. If you remember the initial code I showed, I had this Java Streams implementation, and I used a collector for grouping the values into different buckets, per weather station name. People realized that's another thing we can optimize a lot ourselves. By intuition, you would use a HashMap for that, with the weather station name as the key. Java's HashMap is a generic structure; it works well for a range of use cases. But if we want to get the most performance for one particular use case, we may be better off implementing a bespoke, specific data structure ourselves. This is what we can see here. It might sound scary, but actually it is not; it's relatively simple. What happens here? We would like to keep track of the measurements per station name. It's like a map, but it is backed by an array of buckets.
The idea now is that we take the hash of our station name and use it as the index into that array, and at that particular slot in the array we will manage the aggregator object for that station name. We take the hash code, and to make sure we don't run past the end of the array, we AND it with the array size minus one (the array length is a power of two), so the index always stays within bounds. Then we need to check: is something already at that particular position in the array? If nothing is there, that means we have the first instance of a particular station in our hands, the first value for Hamburg or the first value for Munich. We just create the aggregator object and store it at that particular offset in the array. That makes sense. The other situation, of course, is that we go to the particular index in the array and something is there already, for instance because we have another value for the same station.
The problem is that we don't yet know whether this is actually the aggregator object for the particular name we have in our hands, or something else, because multiple station names can hash to the same slot. That means, if something already exists at this particular array slot, we need to fall back and compare the actual name. Only if the incoming name matches the name of the aggregator object in that slot can we go and add the value to it. Otherwise, we just keep stepping through the array until we either find a free slot, where we can install a new aggregator, or we find the slot for the name we have in our hands. That's why it's called linear probing. I think it's relatively simple, and for this particular case it performs much better than what we could get with just using Java's HashMap.
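To give a feel for what such a bespoke structure looks like, here is a minimal sketch, assuming a power-of-two table size and station names held as UTF-8 byte arrays; the names StationTable, Aggregator, and SIZE are hypothetical, and the code is not taken from any specific submission.

```java
import java.util.Arrays;

// A minimal sketch of the idea: an open-addressing "map" backed by a plain array,
// using the station name's hash as the index and linear probing to resolve collisions.
final class StationTable {

    static final class Aggregator {
        final byte[] name;
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE, count;
        long sum;

        Aggregator(byte[] name) { this.name = name; }

        void add(int value) {
            min = Math.min(min, value);
            max = Math.max(max, value);
            sum += value;
            count++;
        }
    }

    private static final int SIZE = 1 << 17;          // power of two, larger than the key set
    private final Aggregator[] table = new Aggregator[SIZE];

    void add(byte[] name, int hash, int value) {
        int index = hash & (SIZE - 1);                // stay within bounds without a modulo
        while (true) {
            Aggregator slot = table[index];
            if (slot == null) {                       // first value for this station: install it
                table[index] = slot = new Aggregator(name.clone());
            } else if (!Arrays.equals(slot.name, name)) {
                index = (index + 1) & (SIZE - 1);     // collision: probe the next slot
                continue;
            }
            slot.add(value);
            return;
        }
    }
}
```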
Of course, it depends a lot on the particular hash function we use to find that index. This is where it comes back to people optimizing heavily for the particular dataset: they used hash functions which would be collision-free for that dataset. This was a bit against what I had in mind. The problem was that this file, as I mentioned, has a size of 13 gigabytes, and I just didn't have a good way of distributing 13 gigabytes to people out there. That's why they would have to generate it themselves: I had a data generator, and everybody could use this generator to create the file for themselves and then use it for their own testing purposes. The problem was that this data generator had a specific key set, around 400 different station names, with the idea being that it was just an example. But people took it very literally and optimized a lot for those 400 station names, using hash functions which would not have any collisions for those 400 names. Again, people will take advantage of everything they can.
The problem with all that is that it also created lots of work for me, because you cannot really prove the absence of hash collisions. Whenever somebody sent in their implementation, I had to go and check: do they actually handle the case of two stations which hash to the same slot, do they handle those collisions accordingly? Because otherwise, if you don't fall back to the slow path, you will be very fast, but you will be incorrect, because you don't deal correctly with all possible names. This was a bit of a trap I had set up for myself, and it meant I always had to check for that and ask people in the pull request template: if you have a custom map implementation, where do you deal with collisions? Then we would have conversations like the one we see here: "How do you deal with hash collisions?" "I don't, that's why it's so fast." Then he would go and rework it. A big trap I had set for myself.
GraalVM: JIT and AOT Compiler
Those are the three big things: parallelization, then all this parsing with SIMD and SWAR, and custom hash maps for bookkeeping. Those were recurring themes I saw again and again. Then there were more specific tricks, and I just want to mention a few of them to give you some inspiration of what exists. One thing which exists is the Epsilon garbage collector, which is a very interesting garbage collector because it doesn't collect any garbage; it's a no-op implementation. For your regular Java application, that would not be a good idea, because you keep allocating objects, and if you don't do any GC, you will run out of heap space at some point. Here, people realized we can actually implement this in a way where we don't do any allocations in our processing loop. We do a few allocations initially when bootstrapping the program, but later on no more objects get created; we just have arrays and mutable structures which we can reuse and update in place.
Then we don't need any garbage collection, and we don't need any CPU cycles to be spent on garbage collection, which means we can be a bit faster. Again, I think that's an interesting thing. If you work on some CLI tool, a short-lived thing, it could be an interesting option to just disable the garbage collector and see how that goes. The other thing you can see here is that people used GraalVM a lot. GraalVM is really two things. It's an ahead-of-time compiler, so it will take your Java program and emit a native binary out of it. This has two essential advantages: first, it uses less memory; second, it's very fast to start, because it doesn't have to do class loading and JIT compilation and everything at runtime; all of that happens at build time. So it's fast to start if you have this native binary. At the level of results we got here, this actually mattered.
Initially, I thought saving a few hundred milliseconds on startup wouldn't make a difference for processing a 13-gigabyte file, but actually it does. Most of the fastest implementations actually used the AOT compiler from GraalVM. There's also the possibility to use GraalVM as a replacement for the just-in-time compiler in your JVM; you can use it as a replacement for the C2 compiler. I'm not saying you should always do this; it depends a little bit on your particular workload whether it's advantageous or not. This problem lends itself very well to that. Just by using GraalVM as the JIT compiler in the JVM, people got a nice improvement of 5% or so. It's something I can recommend you try out, because it's essentially free. You just need to make sure you use a JVM or a Java distribution which has GraalVM available as the C2 compiler replacement. Then it's just a matter of saying that's the JIT compiler you want to use, and off you go. Either it does something for you or it does not.
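As a rough sketch of what this looks like in practice (the flag spellings are my assumption based on standard JDK documentation and may differ by distribution and version, and CalculateAverage stands in for whatever main class you run), both techniques boil down to command-line switches:

```
# No-op Epsilon GC: only viable if the program fits in the heap without any collections
java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC CalculateAverage

# Use the Graal compiler as the JIT instead of C2 (on a JVMCI-enabled JDK;
# on GraalVM distributions the Graal JIT is typically already the default)
java -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler CalculateAverage
```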
Further Tricks and Techniques
A few other things, like using Unsafe. What I found interesting is the construct here on the right-hand side. If you look at it, this is our inner processing loop: we have a scanner object, we take the next values, we process them, and so on. What we have here is the same loop three times, in a program which is written in a sequential way. Looking at it, you would say those three loops run one after another. What actually happens is that, as CPUs have multiple execution units, the compiler will figure out that this can be parallelized, because there are no data dependencies between those loops. So those loops can effectively run concurrently. I found it very interesting. Why is it three times? Empirically determined.
Thomas, who came up with this, tried having the loop there two times and four times, and three times was just the fastest on that particular machine; it could be different on other machines. Of course, you can already see that all these things raise questions around maintainability. I can already see the junior developers joining the team and saying, "That's duplication. It's the same code three times. Let me go and refactor it, let me clean it up", and your optimization would be out the window. You would want to put a comment there: don't go and refactor this into one loop. That's the consideration: are those things worth it, and should you do them in your particular context? That's what you need to ask. I found it super interesting that this is a thing.
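Here is a hypothetical sketch of that pattern, not the actual entry: the chunk is split into three independent sub-ranges and the same processing code is repeated three times, so the CPU's execution units can overlap the independent work; ChunkScanner is an illustrative stand-in for whatever cursor type an implementation uses.

```java
// A hypothetical illustration of instruction-level parallelism through duplicated,
// independent processing code. The three "lanes" share no data, so the CPU's
// out-of-order execution units can overlap their work.
final class TripleLoop {

    interface ChunkScanner {
        boolean hasNext();
        void processNext();   // parse one line and update the aggregates for this lane
    }

    static void process(ChunkScanner s1, ChunkScanner s2, ChunkScanner s3) {
        // Lock-step over the three independent sub-ranges.
        while (s1.hasNext() && s2.hasNext() && s3.hasNext()) {
            s1.processNext();  // lane 1
            s2.processNext();  // lane 2 -- no dependency on lane 1
            s3.processNext();  // lane 3 -- no dependency on lanes 1 and 2
        }
        // Drain whatever is left in each lane.
        while (s1.hasNext()) s1.processNext();
        while (s2.hasNext()) s2.processNext();
        while (s3.hasNext()) s3.processNext();
    }
}
```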
The Results
You are probably curious now: how fast were we in the end? This is the leaderboard with those 8 CPU cores I initially had. I had 8 CPU cores because that was what I had with the virtual server initially, and when I moved to the dedicated box, I tried to stay in the same ballpark. With those 8 cores, we went down to 1.5 seconds. I would not have expected that you could go that fast with Java, processing 13 gigabytes of input in 1.5 seconds. I found that pretty impressive. It gets better, because I had this beefy server with 32 cores and 64 threads with hyperthreading, and of course I wanted to see how fast we could go there. Then we go down to 300 milliseconds. To me, that's doubly mind-blowing. Super impressive. Also, people did this, as I mentioned, in other languages and on other platforms, and Java really is very much at the top, so you wouldn't be substantially faster on other platforms.
Then there was another evaluation, because, as I mentioned, I had this data generator with those 400-something station names, and people optimized a lot for that by choosing specific hash functions and so on. Some people realized that this was not my intention; I wanted to see how fast this can be in general, and they agreed with that view of the world. For those, we had another leaderboard where we had 10,000 different station names. As you can see here, it's actually a bit slower, because you really cannot optimize that much for that dataset. It's also different people at the top here. If I go back, here we have Thomas Würthinger and the people who teamed up with him for the regular key set, and then for the 10k key set it's other people. It's different tradeoffs, and you see how specific this gets for a particular dataset.
It’s a Long Journey
People worked on this for a long time; the challenge ran through the entire month of January. I didn't do much besides running it, really. People like Thomas, who was the fastest in the end, sent 10 PRs, and there were other people who sent even more. The nice thing was, it was a community effort. People teamed up. As I mentioned before, the fastest entry was actually an implementation by three people who joined forces, and they took inspiration from each other. When people came up with particular tricks, the others would very quickly adopt them into their own implementations. It was a long journey with many steps, and I would very much recommend checking it out.
This is, again, the implementation from Roy Van Rijn, who was the first one to send in an entry, and he kept this very nice log of all the things he did. You can see how he progressed over time. If you go down to the very bottom, you will see he started to struggle a bit because he made changes that were actually slower than what he had before. The problem was he was running on his Arm MacBook, which obviously has a different CPU with different characteristics than the machine I was running this on. He saw improvements locally, but they were actually slower on the evaluation machine. You can see at the bottom that he went and tried to get an Intel MacBook, to have better odds that something which performed well locally would also perform well on that machine. I found it really surprising to see this happening with Java, that we get down to a level where the particular CPU and even its generation makes a difference.
Should You Do Any of This?
Should you do any of this? I touched on this already: it depends. If you work on an enterprise application, you deal with database I/O most of the time, and going down to that level and trying to shave CPU cycles in your business code probably isn't the best use of your time. Whereas if you work on such a challenge, then it might be an interesting thing. What I would recommend is, for instance, to check out this implementation, because it is one order of magnitude faster than my baseline; it runs in 20 seconds or so. It's still very readable, and that's what I observed: you can improve by one order of magnitude and still have very readable, maintainable code.
You don't have any pitfalls in this. It just makes sense, and you are very much faster than before. Going down to the next order of magnitude, down to 1.5 seconds, is where you do all the crazy low-level magic, and you should be very conscious about whether you want to do that or not. Maybe not in your regular enterprise application. If you participate in a challenge and want to win a coffee mug, then it might be a good idea. Or if you want to be hired into the GraalVM team: I just learned the other day that a person who goes by the name of Mary Kitty in the competition actually got hired into the GraalVM compiler team at Oracle.
Wrap-Up, and Lessons Learned
This had an impact on the Java community, but also on people in other ecosystems and databases; at Snowflake they ran a One Trillion Row Challenge. This really blew up and kept people busy for quite a while. There was a show-and-tell in the GitHub repo, and you can go there and take a look at all those implementations in Rust, and OCaml, and all the good things I had never heard about, to see what they did in a very friendly, competitive way. For some stats, you can go to my blog post: you will see how many PRs there were, 1900 workflow runs, so quite a bit of work, and 187 lines of comments in Aleksey's implementation. Really interesting to see. In terms of lessons learned, if I ever want to do this again, I would have to be really prescriptive in terms of rules, automate more, and work with the community, as already happened this time. Is Java slow? I think we have debunked that; I wouldn't really say so. You can go very fast. Will I do it again next year? We will see. So far, I don't really have a good problem which would lend itself to that.
Logic App Standard Hybrid Deployment Model Public Preview: More Flexibility and Control On-Premise

MMS • Steef-Jan Wiggers
Article originally posted on InfoQ. Visit InfoQ
Microsoft recently announced the public preview of the Logic Apps Hybrid Deployment Model, which allows organizations to have additional flexibility and control over running their Logic Apps on-premises.
With the hybrid deployment model, users can build and deploy workflows that run on their own managed infrastructure, allowing them to run Logic Apps Standard workflows on-premises, in a private cloud, or even in a third-party public cloud. The workflows run in the Azure Logic Apps runtime hosted in an Azure Container Apps extension. Hybrid deployment for Standard logic apps is currently available and supported only in the same regions as Azure Container Apps on Azure Arc-enabled AKS; more regions will be supported when the offering reaches general availability.
Kent Weare, principal PM for Logic Apps at Microsoft, writes:
The Hybrid Deployment Model supports a semi-connected architecture. This means that you get local processing of workflows, and the data processed by the workflows remains in your local SQL Server. It also provides you the ability to connect to local networks. Since the Hybrid Deployment Model is based upon Logic Apps Standard, the built-in connectors will execute on your local compute, giving you access to local data sources and higher throughput.
(Source: Tech Community Blog Post)
Use cases for the hybrid deployment model are threefold, according to the company:
- Local processing: BizTalk Migration, Regulatory and Compliance, and Edge computing
- Azure Hybrid: Azure First Deployments, Selective Workloads on-premises, and Unified Management
- Multi-Cloud: Multi-Cloud Strategies, ISVs, and Proximity of Line of Business systems
The company’s new billing model supports the Hybrid Deployment Model, where customers manage their Kubernetes infrastructure (e.g., AKS or AKS-HCI) and provide their own SQL Server license for data storage. There’s a $0.18 (USD) charge per vCPU/hour for Logic Apps workloads, allowing customers to pay only for what they need and scale resources dynamically.
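To put the pricing in perspective with a purely hypothetical example: a node pool contributing four vCPUs to Logic Apps workloads around the clock would be billed at 4 × 24 × $0.18 ≈ $17.28 per day for the Logic Apps workload itself, while the underlying Kubernetes infrastructure and the SQL Server license remain costs the customer carries separately.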
InfoQ spoke with Kent Weare about the Logic App Hybrid Deployment Model.
InfoQ: Which industries benefit the most from the hybrid deployment model?
Kent Weare: Having completed our private preview, we have seen interest from various industries, including Government, Manufacturing, Retail, Energy, and Healthcare, to name a few. The motivation of these companies varies a bit from use case to use case. Some organizations have regulatory and compliance needs. Others may have workloads running on the edge, and they want more control over how that infrastructure is managed while reducing dependencies on external factors like connectivity to the internet.
We also have customers interested in deploying some workloads to the cloud and then deploying some workloads on a case-by-case basis to on-premises. The fundamental value proposition from our perspective is that we give you the same tooling and capabilities and can then choose a deployment model that works best for you.
InfoQ: What are the potential performance trade-offs when using the Hybrid Deployment Model compared to fully cloud-based Logic Apps?
Weare: Because we are leveraging Logic Apps Standard as the underlying runtime, there are many similar experiences. The most significant difference will be in the co-location of your integration resources near the systems they are servicing. Historically, if you had inventory and ERP applications on-premises and needed to integrate those systems, you had to route through the cloud to talk to the other system.
With the Hybrid Deployment Model, you can now host the Logic Apps workflows closer to these workloads and reduce the communication latency across these systems. The other opportunity introduced in the hybrid deployment model is taking advantage of more granular scaling, which may allow customers to scale only the parts of their solution that need it.

MMS • Lakshmi Uppala
Article originally posted on InfoQ. Visit InfoQ

Transcript
Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today, I’m sitting down with Lakshmi Uppala. Lakshmi, welcome. Thanks for taking the time to talk to us.
Lakshmi Uppala: Thank you, Shane. It’s my pleasure.
Shane Hastie: My normal starting point in these conversations is, who’s Lakshmi?
Introductions [01:03]
Lakshmi Uppala: Yes, that's a great start. So, professionally, I am a product and program management professional with over 12 years of experience working with Fortune 500 companies across multiple industries. Currently, I'm working with Amazon as a senior technical program manager, and we build technical solutions for recruiters in Amazon. Now, as you know, Amazon hires at a very, very large scale. We are talking about recruiters processing hundreds of thousands of applications and candidates across job families and roles hiring into Amazon. Our goal is basically to simplify recruiter operations and improve recruiter efficiency so that they can spend their time on other quality work.
What sets me apart is my extensive background in very different types of roles across companies and driving impactful results. In my experience, I've been a developer for almost seven years. I've been a product head at a startup. I've led technical teams at companies in India, and now at Amazon. Over the last seven years or so, I've had the privilege of driving large-scale technical programs for Amazon, and these include building large-scale AI solutions and very technically complex projects around tax calculation for Amazon.
Outside of work, outside of my professional world, I serve as a board member for Women at Amazon, DMV Chapter, and Carolina Women+ in Technology, Raleigh Chapter. Both of these are very close to me because they work towards the mission of empowering women+ in technology and enabling them to own their professional growth. So that’s a little bit about me. I think on my personal front, we are a family of three, my husband, Bharat, and my 11-year-old daughter, Prerani.
Shane Hastie: We came across each other because I heard about a talk you gave on a product value curve. So tell me, when we talk about a product value curve, what do we mean?
Explaining the product value curve [03:09]
Lakshmi Uppala: Yes. I think before we get into what we mean by product value curve, the core of this talk was about defining product strategy. Now, product strategy, at its core, is basically about identifying what the customer needs in your product. That's what product strategy is about, and the product value curve is one very successful and proven framework for building an effective product strategy.
Shane Hastie: So figuring out what your customer really needs, this is what product management does, and then we produce a list of things, and we give them to a group of engineers and say, “Build that”, don’t we?
Lakshmi Uppala: No, it's different from the traditional approach. Traditionally and conventionally, one way to define product strategy and a product roadmap is to identify the list of a hundred features which the customer wants from your product and create a roadmap saying, "I will build this in this year, I will build this in the next year", and so on, give it to the engineering teams, and ask them to maintain a backlog and build it. But the value curve is a little different. It is primarily focused on the value the product offers to customers. So it goes against the traditional thinking in features, and it is more about what value the product delivers as an end result to the user.
For example, let’s say we pick a collaboration tool as a product. Right? Now, features in this collaboration tool can be, “Visually, it should look like this. It should allow for sharing of files”. Those can be the features. But when you think of values, it is a slight change in mindset. It is about, “Okay. Now, the customer values the communication to be real time”. That is a value generated to the customer. Now, the customer wants to engage with anybody who is in the network and who is in their contact list. Now, that is a value.
So, first, within this value curve model, a product manager really identifies the core values which the product should offer to the customers, understands how these values fit in with their organizational strategy, and then draws the value curve. This is built as a year-over-year value curve. The subsequent step is to define features from it and then give those to the engineering teams. So it is purely driven by values, and not by the features which are to be built into the product. That's the change in mindset, and it is very much essential, because ultimately customers are really looking for the value; it's not about this feature versus that feature in the product.
Shane Hastie: As an engineer, what do I do with this because I’m used to being told, “Build this?”
Engaging engineers in identifying value [06:02]
Lakshmi Uppala: Yes. I think when we look at value curves and their significance to engineering teams, there are multiple ways this can benefit them. One is in terms of the right way of backlog grooming and prioritization. Yes, there is a specific way of doing backlog grooming which engineering teams follow, but when you translate this into a value-based roadmap, the engineers and the engineering teams are better able to groom the backlog, and they are able to make prioritization decisions more effectively, because values are, again, going to translate to the success of the product.
Second is design and architectural decisions. In the traditional model, because work is given as a list of features, an engineer building for those features will make design or architectural choices based on the features at hand. Thinking in terms of value changes that: you build for the longer-term value the product should offer to customers, and that changes the design and architectural decisions.
Third, prototyping and experimentation: engineers can actually leverage the value curves to rapidly prototype and test design alternatives, which, again, is not very easy with a feature-based roadmap definition. One other thing is cross-functional alignment. This is one of the major problems which engineering teams encounter when there are dependencies on other teams to build a specific feature in a product. It is very hard to get buy-in from the dependency teams for them to prioritize the work in their plan when it is based on features.
But when you think in terms of value and value curves, it gives a very easy visual representation of where we are trending in terms of value generation for customers and how our incremental delivery is going to be useful in the longer run, which is very helpful for driving these cross-functional alignments for prioritization. Having been an engineering manager at Amazon for a year, I know that this is one of the major problems leaders face in getting cross-team alignment for prioritization. All of these can be solved by a good visual representation using value curves and defining value-based roadmaps.
Shane Hastie: So approaching the backlog refinement, approaching the prioritization conversation, but what does it look like? So here I am. We’re coming up to a planning activity. What am I going to do?
Prioritizing by value [08:54]
Lakshmi Uppala: Okay. So when the product manager defines the value curves, let's say they define them in incremental phases; that's how I would recommend approaching this. As a product manager, I would define what my product is going to look like three years down the line in terms of the values, again, not just in terms of the features. I'm going to define, "Okay. If I'm going to be there in three years, in year one, this is the area or the value I'm going to focus on. In year two, this is the value I'm going to focus on". So this gives a year-on-year product roadmap for the team to focus on.
Now, translating that into the tasks or deliverables which engineers can actually take on is a joint exercise between product managers, engineering leaders, and senior engineers. Again, we are not driving this from features, but from values: given this value, what are the 10 features which I need to build? Creating technical deliverables from those features is the refined approach to backlog grooming and planning, compared to old-fashioned backlog grooming.
Shane Hastie: Shifting slant a little bit, your work empowering women in tech, what does that entail? How are you making that real?
Empowering women in tech [10:21]
Lakshmi Uppala: Yes. It's a great question, and as I said, it's very close to my heart. I am part of two organizations, one within Amazon itself and one which is Carolina Women in Tech, Raleigh Chapter, but both of these are similar in terms of the initiatives we run and the overall mission. Within the Women at Amazon group, we do quarterly events which are focused around specific areas where women need more training, more learning, or more skill development. We welcome all the women+ allies in Amazon to attend those and gain from them.
It is similar in Carolina Women in Tech, but the frequency is higher: once a month, we host and organize various events. Just to give you a flavor of these events, they can range from panel discussions by women leaders across industries and companies, on various types of topics. One of the things which we recently did at Amazon was a panel discussion with director-level women leaders around the topic of building a personal brand. These events can also be speed mentoring events. Last year when we did this at Amazon, we had 150-plus mentees participate with around 12 women and men leaders; it's like speed mentoring, one mentor to many mentees, across various topics. So it's an ongoing activity. We have multiple such initiatives going on around the year so that we can help women in Amazon and outside.
Shane Hastie: If we were to be giving advice to women in technology today, what would your advice be? Putting you on the spot.
Advice for women in technology [12:06]
Lakshmi Uppala: Okay. I was not prepared for this. There are two things which I constantly hear at these events. One of them is that women have this constant fear and imposter syndrome a lot more than men. When women in tech, or any other domain, are in conversations or meetings with men or other counterparts, they generally tend to take a step back, thinking that they are less knowledgeable in the area, and they don't voice their opinions. I would recommend women believe in themselves and in their skills, and be vocal and speak up where they need to.
Second is about skill development. One of the other things I have noticed, which is even true for me, is that while juggling multiple commitments, personal, family, and professional, we give very little importance to personal skill development. I think that is very, very essential to grow and to stay up to date in the market. So continuous learning and skill development is something which everybody, not just women, but more importantly women, should focus on and invest their time and energy in. So those are the two things.
Shane Hastie: If somebody wants to create that Women in Tech type space, what’s involved?
Establishing a Women in Tech community [13:29]
Lakshmi Uppala: I think it's a combination of things. One is that if somebody is thinking of creating such a group within their organization, then the culture of the organization should definitely support it. I think that's a very essential factor. Otherwise, even though people might create such groups, they will not sustain them or have success in achieving their goals. So organizational culture and alignment from leadership is definitely one key aspect which I would secure as the first step.
Second is getting interested people to join and contribute to the group because this is never a one-person activity. It’s almost always done by a group of people, and they should be able to willingly volunteer their time because this is not counted in promotions, this is not counted in your career growth, and this is not going to help you advance in your job. So this is purely a volunteer effort, and people should be willing to volunteer and invest their time. So if you’re able to find a group of such committed folks, then amazing.
Third, coming up with the initiatives. I think it's very tricky, and it's also, to some degree, organization-specific. So create a goal for the year. For Women at Amazon, DMV Chapter, we create a goal at the beginning of the year saying, "This is the kind of participation I should target for these kinds of events, these are the number of events I want to run, and these are the number of people I want to impact". Coming up with goals, and then thinking through the initiatives which work well with the organizational strategy, is the right way to define and execute them. If they are not aligned with the organizational culture and strategy, then you might run them for some iterations, but they will not create impact, and they will not be successful. That's my take.
Shane Hastie: Some really interesting content, some good advice in here. If people want to continue the conversation, where do they find you?
Lakshmi Uppala: They can find me on LinkedIn. A small relevant fact is that I'm an avid writer on LinkedIn, and I post primarily on product, program, and engineering management. So people can definitely reach out to me on LinkedIn, and I'm happy to continue the discussion further.
Shane Hastie: Lakshmi, thank you very much for taking the time to talk to us today.
Lakshmi Uppala: Thank you so much, Shane. It was a pleasure.