GPULlama3.java Brings GPU-Accelerated LLM Inference to Pure Java

By A N M Bazlur Rahman

Article originally posted on InfoQ.

The University of Manchester’s Beehive Lab has released GPULlama3.java, marking the first Java-native implementation of Llama3 with automatic GPU acceleration. This project leverages TornadoVM to enable GPU-accelerated large language model inference without requiring developers to write CUDA or native code, potentially transforming how Java developers approach AI applications in enterprise environments.

At the heart of GPULlama3.java lies TornadoVM, an innovative heterogeneous programming framework that extends OpenJDK and GraalVM to automatically accelerate Java programs on GPUs, FPGAs, and multi-core CPUs. Unlike traditional GPU programming approaches that require developers to manually rewrite code in low-level languages such as CUDA or OpenCL, TornadoVM enables GPU acceleration while keeping all code in pure Java.

According to the TornadoVM documentation, the system works by extending the Graal JIT compiler with specialized backends that translate Java bytecode to GPU-compatible code at runtime. When a method is marked for acceleration using annotations like @Parallel, TornadoVM’s compilation pipeline converts standard Java bytecode through Graal’s Intermediate Representation, applies GPU-specific optimizations, and generates target-specific code—whether that’s OpenCL C for cross-platform compatibility, PTX assembly for NVIDIA GPUs, or SPIR-V binary for Intel graphics.

// TornadoVM Task-Graph API example from documentation
TaskGraph taskGraph = new TaskGraph("computation")
    .transferToDevice(DataTransferMode.FIRST_EXECUTION, data)   // copy input to the device only on the first run
    .task("process", MyClass::compute, input, output)           // the Java method to accelerate
    .transferToHost(DataTransferMode.EVERY_EXECUTION, output);  // copy results back after every execution

TornadoExecutionPlan executor = new TornadoExecutionPlan(taskGraph.snapshot());
executor.execute();
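
As a minimal sketch of what such an accelerated method might look like (the class name MyClass and the squaring kernel are illustrative, not taken from the project; the @Parallel annotation is assumed to live in TornadoVM's api.annotations package as shown in its documentation), the loop below is plain Java that TornadoVM can map to GPU threads:

// Hypothetical kernel for the task graph above; names are illustrative
import uk.ac.manchester.tornado.api.annotations.Parallel;

public class MyClass {
    // Each loop iteration is independent, so TornadoVM can map it to a GPU work-item
    public static void compute(float[] input, float[] output) {
        for (@Parallel int i = 0; i < input.length; i++) {
            output[i] = input[i] * input[i];
        }
    }
}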

The TornadoVM programming guide demonstrates how developers can utilize hardware-agnostic APIs, enabling the same Java source code to run identically on various hardware accelerators. The TornadoVM runtime handles all device-specific optimizations, memory management, and data transfers automatically.

According to the GPULlama3.java repository, the project supports three primary backends, enabling execution across diverse hardware:

  • NVIDIA GPUs: Full support through both OpenCL and PTX backends
  • Intel GPUs: Including Arc discrete graphics and integrated HD Graphics through the OpenCL backend
  • Apple Silicon: M1/M2/M3 support through OpenCL (though Apple has deprecated OpenCL in favour of Metal)

The repository indicates that configuration is handled through command-line flags:

# Run with GPU acceleration (from project README)
./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "Explain the benefits of GPU acceleration."

The GPULlama3.java implementation leverages modern Java features as documented in the repository:

  • Java 21+ requirement for Vector API and Foreign Memory API support
  • GGUF format support for single-file model deployment
  • Quantization support for Q4_0 and Q8_0 formats to reduce memory requirements
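
To give a rough sense of what Q8_0-style quantization support involves, the sketch below dequantizes one block in the general GGUF Q8_0 layout (groups of 32 int8 values sharing a single scale). This is an illustrative reimplementation, not code from the repository, and the class and method names are hypothetical:

// Illustrative sketch of Q8_0-style block dequantization (not the project's actual code)
public final class Q8Block {
    static final int BLOCK_SIZE = 32;

    // Reconstruct BLOCK_SIZE floats from one scale and 32 signed bytes: x[i] = scale * q[i]
    static void dequantize(float scale, byte[] quantized, float[] out, int outOffset) {
        for (int i = 0; i < BLOCK_SIZE; i++) {
            out[outOffset + i] = scale * quantized[i];
        }
    }
}

Q4_0 follows the same block idea but packs two 4-bit values per byte, roughly halving the memory footprint again.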

The project builds upon Mukel’s original Llama3.java, adding GPU acceleration capabilities through TornadoVM integration.

GPULlama3.java joins other Java LLM projects, including:

  • JLama: A modern LLM inference engine for Java with distributed capabilities
  • Llama3.java: The original pure Java implementation focusing on CPU optimization

As noted in Quarkus’s blog on Java LLMs, the Java ecosystem is expanding its AI/ML capabilities, enabling developers to build LLM-powered applications without leaving the Java platform.

TornadoVM originated from research at the University of Manchester, aiming to make heterogeneous computing accessible to Java developers. The framework has been in development since 2013 and continues to progress with new backend support and optimizations.

GPULlama3.java is currently in beta, with ongoing performance optimization and benchmark collection. The performance on Apple Silicon remains suboptimal due to the deprecation of OpenCL. The TornadoVM team is developing a Metal backend to enhance support for Apple Silicon, optimizing transformer operations and broadening model architecture compatibility.

GPULlama3.java represents a significant advancement in bringing GPU-accelerated large language model (LLM) inference to the Java ecosystem. By leveraging TornadoVM, the project demonstrates that Java developers can utilize GPU acceleration without leaving their familiar programming environment. While performance optimization continues and the project remains in active development, it opens up new possibilities for Java-based AI applications in enterprise settings, where Java’s strengths in security, scalability, and maintainability are highly valued.

For developers interested in exploring GPU-accelerated LLM inference in Java, the project is open source and accessible on GitHub, complete with documentation and examples to help get started.



Article: The State Space Solution to Hallucinations: How State Space Models are Slicing the Competition

By Albert Lie

Article originally posted on InfoQ.

Key Takeaways

  • Transformers often hallucinate because they prioritize generating statistically likely text rather than factual accuracy.
  • State Space Models (SSMs) offer a more reliable alternative for maintaining factual accuracy and context.
  • SSMs process information sequentially, making them more efficient and less prone to hallucinations.
  • Case studies on Perplexity and RoboMamba demonstrate the practical impact of SSMs in real-world scenarios.
  • Practical guidelines are provided for implementing SSMs, including architecture selection, memory optimization, and real-time data integration.

AI-powered search tools like Perplexity and Arc are quickly becoming the go-to platforms for millions of users seeking instant answers. These tools promise quick, conversational responses with cited sources, making them feel more like talking to a smart assistant than using a traditional search engine. However, there is a growing problem: these systems often hallucinate.

In other words, they confidently make up facts, misquote sources, and recycle outdated information. For users, this means you might get an answer that sounds right, but is actually wrong. For example, Air Canada’s chatbot once confidently gave a fake refund policy to a grieving customer, leading the airline to be legally ordered to compensate him.

While many people blame bad data or unclear prompts, the real issue is deeper and tied to the architecture of most AI models: the transformer. In this article, I’ll explain why transformers struggle with hallucinations, how SSMs offer a promising solution, and what this shift could mean for the future of AI search.

Why Transformers Hallucinate

Transformers are the backbone of popular AI models like GPT-4. They predict the next word in a sentence by analyzing relationships between all words in a text at once. This attention mechanism is powerful for generating fluent and coherent text, but it comes with trade-offs:

Token Prediction, Not Truth Seeking

Transformers are designed to generate text that is statistically likely, not necessarily factually correct. When the training data contains gaps, noise, or ambiguity, the model fills those gaps with plausible guesses that may sound reasonable but are not always grounded in the context of the prompt or prior information.

Computational Overload

Transformers analyze every word relationship, which becomes expensive and inefficient for long texts. As a result, they sometimes take shortcuts, losing important context and increasing the risk of errors.

Source Blindness

When given multiple sources, transformers can’t always tell which ones are reliable. This unreliability can lead to citing AI-generated or outdated information, as seen when Perplexity cited an AI-generated LinkedIn post about Kyoto festivals.

The end result is that AI search tools can act like persuasive storytellers. They are confidently wrong with answers that sound good but aren’t always accurate.

State Space Models: A Step Toward Context-Aware Accuracy

SSMs are emerging as a promising alternative to transformers for many sequence-based tasks. Unlike transformers, SSMs process information step by step, updating a memory bank as they go. Arguably, this approach resembles how humans read and retain information.

How SSMs Work

Using step-by-step analysis, SSMs read information one piece at a time, building understanding incrementally. This reduces the risk of context overload and helps the model keep track of important details.

SSMs are more efficient with computation. The memory and computational needs of SSMs grow linearly with the length of the input, rather than quadratically as with transformer attention. As a result, SSMs can handle much longer texts or sequences without running into the performance issues found in transformers.
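
As a rough illustration of why the cost stays linear, the toy sketch below advances a diagonal linear state space layer one token at a time; the hidden state has a fixed size, so per-token work and memory do not grow with sequence length. This is a simplified illustration of the general recurrence, not the actual Mamba implementation:

// Toy diagonal linear SSM: h_t = a * h_{t-1} + b * x_t, y_t = c . h_t (illustrative only)
public final class ToySsm {
    private final double[] a, b, c; // per-dimension parameters
    private final double[] h;       // fixed-size hidden state

    ToySsm(double[] a, double[] b, double[] c) {
        this.a = a; this.b = b; this.c = c;
        this.h = new double[a.length];
    }

    // Consume one input value and produce one output; called once per token,
    // so total work is O(sequence length * state size)
    double step(double x) {
        double y = 0.0;
        for (int i = 0; i < h.length; i++) {
            h[i] = a[i] * h[i] + b[i] * x;
            y += c[i] * h[i];
        }
        return y;
    }
}

Compare this with attention, where each new token must be compared against every previous token, giving quadratic growth in compute and memory.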

SSMs store key facts in a controlled state, which helps minimize errors from conflicting information. In AI dialogue systems, maintaining a consistent internal state is crucial for ensuring coherent and contextually relevant interactions. For instance, Microsoft’s research into goal-oriented dialogue systems highlights the necessity of memory in complex tasks like vacation planning. These tasks involve multiple parameters and require the system to remember user preferences and constraints across the conversation. Without memory, the system would struggle to provide consistent responses, leading to user frustration.

A technical example of this is the MemoryBank mechanism, which enhances large language models (LLMs) by incorporating long-term memory. MemoryBank allows the model to recall relevant memories, update them over time, and adapt to a user’s personality. This is achieved through a memory updating mechanism inspired by the Ebbinghaus Forgetting Curve, enabling the AI to forget and reinforce memories based on their significance and the time elapsed.
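
One simplified way to picture such a forgetting mechanism is an exponential-decay retention score, roughly R = exp(-t / S), where t is the time since the memory was last recalled and S is a strength value that grows each time the memory is reinforced. The sketch below is a toy illustration of that idea, not MemoryBank's actual implementation; all names are hypothetical:

// Toy memory entry with Ebbinghaus-style decay: retention = exp(-elapsed / strength)
public final class MemoryEntry {
    private double strength = 1.0;  // grows when the memory is reinforced
    private long lastRecallMillis = System.currentTimeMillis();

    double retention(long nowMillis) {
        double elapsedHours = (nowMillis - lastRecallMillis) / 3_600_000.0;
        return Math.exp(-elapsedHours / strength);
    }

    // Recalling a memory reinforces it, so it decays more slowly next time
    void reinforce(long nowMillis) {
        strength += 1.0;
        lastRecallMillis = nowMillis;
    }

    // A memory could be dropped once its retention falls below some threshold
    boolean shouldForget(long nowMillis, double threshold) {
        return retention(nowMillis) < threshold;
    }
}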

Recent research shows that SSMs, especially models like Mamba, can perform competitively with transformers on many language tasks, particularly those involving long sequences or the need to track information over time. While transformers still have an edge in some areas, SSMs are closing the gap and offer unique advantages for certain applications.

Case Study 1: Perplexity’s Hallucination Trap

Perplexity, one of the leading AI-powered search engines, provides a clear example of why architecture matters. Despite using retrieval-augmented generation (RAG) to pull in real-time data, Perplexity has cited non-existent Vietnamese markets and AI-generated travel guides. This happens for several reasons.

By trusting unreliable sources, transformers treat all retrieved data equally, even if it’s AI-generated misinformation. Transformer models, such as BERT and GPT, are designed to process and generate text based on patterns learned from large datasets. Unfortunately, they lack inherent mechanisms to assess the veracity of the information they process. This limitation means that when these models retrieve information, especially through techniques like RAG, they may treat all sources with equal weight, regardless of their reliability.

For instance, if a model retrieves data from both a reputable academic paper and a fabricated AI-generated article, it might integrate both sources without distinguishing between them. This indiscriminate treatment can lead to the propagation of misinformation, as the model cannot inherently verify the authenticity of the retrieved content.

Context Collapse

When comparing multiple sources, transformers often overweight repeated phrases or patterns rather than truly verifying facts. Transformer models, particularly those utilizing self-attention mechanisms, excel at identifying and leveraging patterns within text. However, this strength can become a weakness when the model encounters repeated phrases or patterns across multiple sources.

Instead of critically evaluating the factual accuracy of the information, the model may assign higher importance to the frequency of certain phrases or structures. This phenomenon, known as context collapse, occurs when the model overemphasizes repetitive elements, potentially leading to the reinforcement of inaccuracies. For example, if several sources erroneously state that a specific event occurred on a particular date, the model might prioritize this repeated information, even if it is incorrect, due to its pattern recognition capabilities.

If Perplexity were built on an SSM-based architecture, it could be improved by leveraging structured memory and long-term contextual awareness to reduce inconsistencies and hallucinations.

Currently, Perplexity operates primarily on transformer-based architectures combined with RAG, which enables it to fetch and integrate real-time information from external sources such as web pages or proprietary documents. While this setup offers access to up-to-date data, it lacks persistent memory and treats each query independently. As a result, the system often struggles with maintaining factual consistency over time, especially in multi-turn interactions or complex queries requiring reasoning across multiple sources.

By contrast, an SSM-based architecture, such as those used in models like Mamba or RWKV, offers the advantage of continuous memory over long sequences. These models are designed to simulate how signals evolve over time, allowing them to retain critical facts and suppress irrelevant or contradictory data.

For instance, in medical imaging, a Mamba-based model called AM‑UNet was used to accurately segment cervical cancer tumors from CT scans, demonstrating how continuous memory helps retain important patterns across long sequences of data. Similarly, if Perplexity integrated SSMs into its architecture, it could maintain a structured internal representation of facts and user preferences across sessions. This integration would prevent repeating misinformation retrieved from unreliable sources and provide more coherent and personalized responses over time.

Further improvements would also be seen in sequential verification and efficient cross-referencing. Checking sources one by one while maintaining a fact-checking memory makes it harder for false information to slip through. For example, during the 2025 LA protests, the AI tool Grok verified viral claims step by step, debunking a fake video through metadata and news sources and then using that memory to flag similar false content later. As for efficient cross-referencing, SSMs are designed for exactly this kind of task, so they can handle long documents or many sources without losing track of important details.

Case Study 2: RoboMamba’s Precision in Robotics

RoboMamba, a robotics-focused SSM, demonstrates the practical benefits of this architecture outside of search. In laboratory tests, RoboMamba significantly reduced failed actions caused by hallucinations. This success was accomplished through real-time error correction: RoboMamba could adjust its grip on objects mid-task when sensors detected slippage, something transformers struggled with due to context overload. By making context-aware decisions, the model prioritized safety protocols over speed in unpredictable environments, reducing the risk of dangerous mistakes.

This kind of precision is critical for tasks like surgical robotics and automated manufacturing, where a single hallucination could have serious consequences.

How SSMs Compare to Other Solutions

Researchers have tried several approaches to reduce hallucinations in AI models, including reinforcement learning from human feedback (RLHF), which involves humans rating AI outputs to help the model learn what is acceptable. While RLHF is helpful, it can’t fix the underlying tendency of transformers to guess when unsure.

Another approach, knowledge-augmented LLMs, integrates structured databases but still relies on transformer architectures at its core. For example, in enhanced Text-to-SQL systems, the model first retrieves relevant schema information or example queries from a structured database, then uses a transformer (like GPT-3.5 or Codex) to generate the appropriate SQL query. This approach allows the LLM to ground its output in real data while still leveraging its generative capabilities.
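
For a rough picture of that retrieve-then-generate flow, the sketch below selects the schema entries most relevant to a user question and assembles them into a grounded prompt that would then be handed to the generative model. The schema map, the keyword-overlap scoring, and the method names are all hypothetical, and the call to the LLM itself is deliberately left out:

import java.util.*;
import java.util.stream.Collectors;

// Illustrative retrieve-then-generate step for knowledge-augmented Text-to-SQL
public final class SchemaGroundedPrompt {
    // Pick the schema descriptions that share the most words with the question (toy scoring)
    static List<String> retrieveRelevant(String question, Map<String, String> tableSchemas, int topK) {
        Set<String> queryWords = new HashSet<>(Arrays.asList(question.toLowerCase().split("\\W+")));
        return tableSchemas.entrySet().stream()
                .sorted(Comparator.comparingLong((Map.Entry<String, String> e) ->
                        Arrays.stream(e.getValue().toLowerCase().split("\\W+"))
                              .filter(queryWords::contains).count()).reversed())
                .limit(topK)
                .map(e -> e.getKey() + ": " + e.getValue())
                .collect(Collectors.toList());
    }

    // The grounded prompt the transformer would complete with a SQL query
    static String buildPrompt(String question, List<String> relevantSchemas) {
        return "Given these tables:\n" + String.join("\n", relevantSchemas)
                + "\nWrite a SQL query that answers: " + question;
    }
}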

SSMs offer a fundamentally different approach by changing how information is processed and remembered. They are especially strong in tasks where accuracy and long-term consistency matter, such as legal document review, medical research, and robotics.

The following table illustrates the differences between how the above approaches work.

Method | Strengths | Weaknesses
RLHF | Aligns with human values | Doesn't fix guessing
SSMs | Accuracy and efficiency with long texts | Less flexible for images/video

Table 1: Strengths and Weaknesses of RLHF vs. SSMs

What This Means for Everyday Users

For most people, the shift to SSMs could mean fewer fake citations, better answers to complex questions, and even offline functionality. By maintaining structured memory and long-term contextual awareness, an advantage over traditional transformer-based architectures, SSM-based search tools could verify sources before citing them, reducing the risk of being misled by fabricated citations or propagated misinformation.

For instance, a study evaluating generative search engines found that existing systems often contain unsupported statements and inaccurate citations. On average, only 51.5% of generated sentences were fully supported by citations, and only 74.5% of citations supported their associated sentence. These findings highlight the need for AI systems to improve source verification processes to enhance reliability.

Additionally, the use of RAG has been shown to reduce AI hallucinations by grounding responses in actual documents. By pulling information from custom databases, RAG narrows the AI’s focus and helps ensure that factual claims can be attributed to sources. However, experts emphasize that the quality of RAG implementation is crucial, and human oversight remains essential to verify AI-generated content.

SSMs can also provide better answers to complex or rare questions: they handle long or complicated queries without breaking down, making them well suited to specialized searches (such as rare diseases or technical topics).

Because SSMs are efficient, they could run locally on your phone or laptop, reducing the need for cloud-based processing and improving privacy.

Imagine asking, “What’s the best treatment for XYZ syndrome?” An SSM-based tool would check medical journals one by one, flag conflicting studies, and highlight the consensus, all without inventing answers or making dangerous mistakes.

Where SSMs Excel and Where They Don’t

While SSMs are promising, they aren’t perfect. Research shows that transformers are still better at tasks that require copying long chunks of text or remembering exact details from far back in the input. This is because transformers can “look at” the entire context at once, while SSMs compress information into a fixed-size memory.

However, SSMs shine in a number of tasks. When the input is very long (like legal contracts or scientific research), SSMs excel due to their linear time complexity, allowing them to handle extensive documents efficiently. For instance, models like Mamba and S4 have demonstrated superior performance in long-range sequence modeling tasks, such as the Long-Range Arena (LRA) benchmark, which evaluates reasoning ability and handling of diverse data types. These models can capture hierarchical dependencies over long contexts, making them suitable for tasks involving lengthy inputs.

SSMs also shine when consistency and accuracy over time matter more than copying exact details. In applications requiring sustained accuracy and contextual understanding, SSMs maintain structured memory and long-term contextual awareness, reducing inconsistencies and hallucinations. In dialogue systems, SSMs can track user preferences and conversation history to ensure consistent and accurate responses over time. This capability is crucial for applications where maintaining context and coherence is more important than exact replication of details.

SSMs also meet the efficiency and cost requirements of running models on small devices or in real-time applications. They are designed to be computationally efficient, making them suitable for deployment on devices with limited resources. For instance, a study demonstrated efficient token-by-token inference of the SSM S4D on Intel's Loihi 2 neuromorphic processor. This implementation outperformed traditional recurrent and convolutional models in terms of energy consumption, latency, and throughput, highlighting the potential of SSMs for real-time applications.

Researchers are now exploring hybrid models that combine the strengths of both architectures, such as adding attention-like mechanisms to SSMs for better context retrieval.

The Future of AI Search and SSMs

The transition to SSMs is already underway in some industries. Hybrid models like Mamba-2 blend SSM efficiency with some Transformer-like flexibility, making them suitable for tasks that require both long-term memory and attention to detail.

A notable example of this hybrid architecture is the Mamba-2-Hybrid model, which combines 43% Mamba-2, 7% attention, and 50% MLP layers. In a comprehensive empirical study, this hybrid model outperformed an 8B-parameter transformer model across twelve standard tasks, achieving an average improvement of +2.65 points. Additionally, it demonstrated up to eight times faster token generation during inference, highlighting its efficiency and scalability.

When extended to support long-context sequences of 16K, 32K, and 128K tokens, the Mamba-2-Hybrid continued to closely match or exceed the Transformer model’s performance on average across twenty-three long-context tasks. These results underscore the effectiveness of integrating Mamba-2’s structured state space modeling with selective attention mechanisms to balance efficiency, scalability, and performance in complex tasks.

In terms of enterprise adoption, banks, hospitals, and law firms are testing SSMs for tasks where accuracy is critical and hallucinations are unacceptable. Similarly, SSMs are being applied to research in a wide range of fields, from genomics and drug design to time series forecasting and recommendation systems.

As researchers continue to improve SSMs and address their current limitations, we can expect to see more AI tools built on these architectures, especially in areas where trust and accuracy are non-negotiable.

Conclusion: Building Trust Through Better AI Architecture

The race to build the best AI search engine is no longer just about speed or flashy features – it’s about trust. While transformers powered the first wave of chatbots and AI search tools, their tendency to hallucinate makes them unreliable for truth-critical tasks. SSMs, with their step-by-step analysis and structured memory, offer a new path toward AI that doesn’t just answer questions but actually understands them.

As tools like Perplexity and RoboMamba evolve, the winners will be those that prioritize architectural integrity over quick fixes. The next generation of AI search won’t just retrieve answers, it will build them, one verified fact at a time.

References:

  1. Transformer Models Are Better at Copying than SSMs
  2. Repeat After Me: Transformers Are Better Than State Space Models at Copying
  3. State Space Models Strike Back: Stronger Than Ever with Diagonal Plus Low-Rank
  4. LLM Hallucinations: What They Are and How to Fix Them
  5. Evaluating Hallucination in Large Language Models



Midjourney Debuts V1 AI Video Model

By Daniel Dominguez

Article originally posted on InfoQ.

Midjourney has launched its first video generation V1 model, a web-based tool that allows users to animate still images into 5-second video clips. This new model marks a significant step toward the company’s broader vision of real-time open-world simulations, which will require the integration of image, video, and 3D models to create dynamic, interactive environments.

V1 works by enabling users to animate images through two options: an automatic animation setting, which generates a motion prompt for basic movement, and a manual animation feature, where users can describe specific actions and camera movements. The system is designed to work with images generated by Midjourney as well as those uploaded from external sources, offering flexibility in video creation.

The model also introduces a unique workflow for animating images. Users can drag images into the prompt bar and mark them as the starting frame, then apply a motion prompt to animate them. V1 includes two settings for motion: low motion, which is suitable for ambient scenes with slow or minimal movement, and high motion, which is better for fast-paced scenes with active camera and subject movement. However, high motion can sometimes result in unintended glitches or errors.

When compared to other AI video generation tools currently on the market, V1 offers a distinct approach. Unlike more established platforms like Runway or DeepBrain, which focus on highly polished, pre-built video assets with complex editing features and audio integration, V1 prioritizes the animation of static images within a specific aesthetic that aligns with Midjourney’s popular image models. While competitors like Veo 3 are known for their real-time video creation with full audio integration and high-quality motion capture, V1 sticks to simpler video outputs with limited motion capabilities, focusing primarily on image-to-video transformations.

Midjourney’s V1 Video Model launch has sparked excitement across creative communities, with users praising its stunning visual consistency and artistic flair, often comparing it favorably to competitors.

AI Artist Koldo Huici commented on X:

Creating animations used to take 3 hours in After Effects. Now with Midjourney, I do it in 3 minutes! I’ll tell you how ridiculously easy it is.

While Gen AI expert Everett World posted:

It’s fantastic to have a new video model, especially since it’s made by Midjourney – it opens up new, unexpected possibilities. Some generations look incredibly natural (anime looks great!). Even though it’s only 480p, I think we’re seeing interesting developments in the AI video space, and I’m so glad we can have fun with this model!

Midjourney plans to continue evolving its video capabilities, with an eye on making real-time, open-world simulations a reality in the near future. For now, the V1 model is available for web use only, and the company is monitoring usage closely to ensure that it can scale its infrastructure to meet demand.

This launch comes in the wake of ongoing legal challenges for the company, including a recent lawsuit from Disney and Universal over alleged copyright infringement. Despite these challenges, Midjourney is focusing on expanding its technology, with V1 seen as a significant step toward achieving the company’s vision for immersive, interactive digital environments.



Presentation: Shifting Left for Better Engineering Efficiency

By Ying Dai

Article originally posted on InfoQ.

Transcript

Dai: I have been at Roblox for almost 4 years now. When Jennifer first reached out to me, I didn't have a thing to talk about; we basically just did some casual chatting. I told her what I've done at Roblox, the projects that I've done. I shared two stories. She immediately caught that there was a thing there: there's a trend that I'm actually shifting left from production to testing. Throughout this shift, I can see how my work contributes to increasing the productivity of engineers at Roblox. I also want to call out that there are two things I have paid particular attention to over my past 4 years, two metrics: reliability and productivity. Those two things have been driving me to make all those decisions, and they have guided our team when we are facing choices.

Background

I’ll start with some background. Then I’ll share two stories: first migration, and then second migration. Then, some of my learnings throughout the journey.

When I joined Roblox back almost 4 years ago, I started at the telemetry team. I was really excited back then because Roblox was actually the smallest company I’ve ever worked for. Before Roblox, I was at Google and LinkedIn. That was much bigger than Roblox. It was a small company, and I got to know a lot of people. I was so excited, like I can work with everybody. Everybody can know each other. Back then, the company was undergoing a lot of fast user growth. We’re still on a very fast trajectory. Everything was new to me. It was a very exciting experience to me to begin with. There are a lot of things going on. As the telemetry team, I got to experience all the setup, all the size and scale of our architecture infrastructure. We run everything on our own DC. We have almost 2,000 microservices. As the telemetry team, we were managing billions of active time series.

Then, that’s not the fun part. There were a lot of on-calls. The first week that I joined Roblox, the very first weekend, I was actually on hacking with a couple of friends. Somehow that route had signal, and I got paged. I got dragged into a Zoom call with the VP and CTO being there together, debugging some Player Count job issue. Then, the question that got to the telemetry team is that, is this number really trustworthy? Can you tell me where this number was from? Can you tell me if it’s accurate, if it’s really telling the true story? It was a really stressful first weekend. I almost thought about quitting the first weekend. I’m glad I didn’t.

Then, I just hung in there. The worst part was that for every SEV back then, people would start by questioning telemetry first. Every time there was a company-wide incident, we got paged as the telemetry team. I would never have thought about that back at Google. The metrics are just there; you're trying to read the numbers, trying to figure out what's going on yourself. Who would actually question telemetry being wrong?

That was actually very common at Roblox back then, especially in those SEV0s and SEV1s: people would actually spend the first 10, sometimes even 20 minutes ruling out the possibility of telemetry being wrong. I don't really blame them, because the telemetry system was not so reliable. We did produce wrong numbers from time to time. That made me realize that telemetry, essentially, is reliability. Reliability should be the first metric for telemetry. Bad production reliability ends up causing very low engineering productivity, because all those engineers spend their time working with us to figure out whether it's a telemetry issue and how to fix it. That was really time consuming, and it ended up in very long mean time to detection, as well as mean time to mitigation.

We were like, we have to stop. We need to fix this. Let's take a closer look at the telemetry problem. The problems were actually everywhere, basically. We took a look at the in-house telemetry tool. The engineers prior to us built a very nice in-house telemetry tool that actually lasted for years and served the company's needs really well. This in-house tool is responsible for everything from metric collection to processing to storage to visualization. Everything was in-house. There are some clear disadvantages of this pipeline. The bottom chart is a high-level overview of how this pipeline works. If you pay close attention, at the very end there is a key-value store, a single key-value store. Which means if you want to get a metric, for example QPS, and then latency per data center across all your services, and then you want to GROUP BY a different dimension, for example per container group or something.

Then, you need to go through this whole process and create a processing component to generate the new key-value. That's a very long process for generating a single chart. We had to go through this whole thing to be able to draw a chart in our in-house visualization tool. It's very inflexible and very slow. There are some other problems with this old pipeline. We had everything built in-house. Our quantile calculation was also built in-house. We made a very common mistake in how quantiles should be calculated: you compute quantiles on each local machine, then you aggregate those quantiles across all machines. That's a very typical mistake in calculating quantiles, and it produced aggregation results that were inconsistent with other standard tools.

If every team made the same mistake in all of your systems, you probably could still see the trend of telemetry going wrong by looking at the wrong number; maybe the trend still tells you something. The worst part here is that some teams were using this way to calculate things like quantiles, while other teams were using the open-source standard way to calculate quantiles. When we did side-by-side comparisons across services owned by different teams, we got very inconsistent results. Then, there are availability issues. As a telemetry team back then, our availability was less than 99.8%, which means we had at least four hours of outages every quarter. I don't blame people who questioned telemetry at the beginning, because we had so many outages.

The First Migration

With all those problems identified, it was clear that we needed a better telemetry solution. We came up with a very detailed plan. The plan contained three steps. Step one was to design, implement, and productionize a new solution. We evaluated buy versus build options. We ended up with Grafana Enterprise and VictoriaMetrics as our telemetry solution. We productionized our infrastructure on top of that. Step two was transition. It's clear: when you do a migration, you dual-write for some time period, then you kill the old one and move to the new one. Throughout the transition process, you make sure your data is consistent. Because it's telemetry, you also need to make sure the alerts and dashboards are taken care of. The very last step was basically to remove the old pipeline, and then we could claim victory. That was the plan. That was basically my first project after joining Roblox. We did an estimation and we thought one quarter should be enough.

One month for developing, one month for transition, one month for the kill, and celebration. That's just me doing engineering estimations. Then, the reality. That was also very eye-opening for me because, from the telemetry side, we could see all the limitations, all the shortcomings of the old tool. The reality is that migration of basic tools is very hard. Like I said, that was an in-house solution that had existed for almost 10 years. There were a lot of customizations made to the in-house tool to make it really easy to use and really pleasant during incident calls.

For example, there were very customized annotations to label every chart, like the changes that were made to the service, deployments, configuration changes. Then there were very small things like a very cool Player Globe View. There were latency differences, and like I mentioned earlier, quantile calculation was a big pain point, very inconsistent. The technical issues just take time. If we spend more time, we can bring those two tools closer to each other. Those are the easy ones; we can just focus on our day-to-day job and get those problems solved.

Then, I think the more difficult side, from my experience looking back, was actually our engineers: they were very used to and attached to the old tool. It had been there for almost 10 years. I can share some stories. We tried to include a link on every old chart which, if they clicked it, would redirect them to the equivalent chart in the new system. Everything would be very similar to the old one. Engineers chose to stay with the old tool.

There was one time, I remember, when if they wanted to go back to the old tool, they had to attach redirect equals false to their URL. We tried to create those frictions for using the old tool. But people just worked around the redirection: they would add redirect equals false manually every time, or even bookmark the old tool with redirect equals false. It's just that stickiness and attachment to the old tool. That was really eye-opening for me. We also realized that because people just love the old tool, if we forced them to move, it would harm their productivity. The people whose productivity would actually be harmed are people who have been there for years. Those people are valuable. They have valuable experience during, for example, incident calls and debugging complicated issues. We needed to take care of those people's experience carefully.

What happened then? Instead of one quarter, it took roughly three quarters. There were three key things that we actually invested in to make this migration smooth. First one is reliable. The reliable meaning our new pipeline needs to be super reliable. Remember I said there was 99.8% availability with the old one? We managed to achieve 100% availability with the new tool for multiple quarters in a row. Then people started to trust this new tool more than the old one. Then, delightful. That’s my learning, and now I also apply this to future migration projects that I worked on, that the transition experience really needs to be delightful.

Try to ask for the minimum amount of work from the customers, even for internal tools: ask for as little as you can, and do as much as you can for them. Then, overall, if we can make it reliable and make the experience delightful, you will see a productivity improvement. Also, we usually bake into the new tool our thinking about how to improve their productivity, so when they get used to the new tool, we can also see a productivity boost.

I'll give you a high-level overview of the architecture. This is not an architecture deep dive, so I'll stay very high-level here. Remember I said there were problems with our storage, that there were limitations: it wasn't as scalable as we hoped it to be. We chose to use VictoriaMetrics. We shard our clusters by service. We have a query layer on top to help us when we need to re-shard. When a service gets too big, it probably needs its own shard. Then we move that service to its own shard and dual-write for a while, with the query layer covering it up, so people won't tell the difference when we move their data around. On the right side is our Grafana setup. We actually set up our Grafana in AWS with multiple regions and multiple availability zones. A single region failure wouldn't actually cause any impact to our telemetry system. It worked pretty well, at least for us. That was the architecture.

Now I'm going to share more about the transition. We made the transition internally in three major steps. First, we sent a lot of announcements to get a lot of awareness. Those announcements were usually top-down. We sent them over email and Slack, and also put up a banner on the old tool, just to get awareness. You wouldn't believe how hard people try to avoid reading all the messages that reach them. Yes, announcements, getting the awareness; that's the first step. Then we took the soft redirect approach, like the redirect equals false trick in the URL that I mentioned. That was one of them. We put up a link on every chart and every dashboard in the old tool.

Basically, we prompt them to try the new tool and take them to the new Grafana chart, which is basically an identical chart. People don't really click that. I'll share more. The soft redirect period actually lasted several months. When we reached a certain adoption rate with the new tool, that's when we started to enforce the hard redirect. If people really needed to go back to the old tool, we would enable that for them.

Otherwise, it would be a hard redirect. By the time we actually enabled the hard redirect, the daily active usage of the old tool was already less than 10. What did we do along the way? We did a lot of training. You wouldn't believe that for internal tools you need to organize company-level sessions, plus callouts and ad hoc trainings. If a team requested a training with us, we were very happy to just host a training session with them. We also have recordings, but people just prefer in-person training. Also, we kept an eye on our customer support channel and usage dashboard. We proactively connected with heavy users and their teams. We tried to understand: what is blocking you from using this new tool? For a period of time, every day we would pull the top 10 most active users of the old tool and reach out to them one by one.

In the end, we actually made it. We delivered a reliable telemetry system. We deprecated the old tool. The on-call is still part of life, but it's just much more peaceful. One more fun story to share: at Roblox, we have many incident calls. During those incident calls, a truly rewarding moment was when people started to send the new tool links, like Grafana chart links instead of the old tool links, in the Slack channels to communicate with each other. Because the old tool was still unreliable; sometimes it had issues.

If people reported that the old tool had issues, we didn't even need to jump into the conversation. Our customers who were already used to the new tool would actually reply to those customers saying, try the new one, the new one is super cool. That was a really rewarding moment, thinking back now. The on-calls are still part of life. The good thing is that people don't really question telemetry anymore when they get paged. There are still other improvements we can make to our telemetry system, for example, how to make our root cause analysis easier. Those investments are still ongoing.

The Second Migration

For me, personally, the first migration was done. What's next? I was on the telemetry team, and I also spent some time trying to build an automated root cause analysis system with the team. Soon I realized there were still incidents, and it didn't take that much time for us to analyze their root causes. When we analyzed the root causes of those incidents, we soon realized that 20% of them were caused by change rollouts. That was a lot of financial loss and also user engagement loss for us.

Also, just imagine how much time our engineers needed to spend on debugging, triaging, and also doing postmortems for those incidents. Just a lot of engineering productivity loss, in my opinion. I talked to my boss, and instead of improving telemetry, I told him I probably wanted to get a better understanding of the incidents and see how I could help with actually reducing the number of incidents. It was very obvious that validation of the rollouts needed improvement. There are still a lot of manual steps involved today, but back then, change rollout at Roblox was 100% manual. When a change got rolled out, it would first be manually triggered by an engineer. This engineer would be responsible for the validation of this change, by looking at the charts and the alerts and paying attention to their secret runbook.

Then, if there was any issue, the manager would be responsible for the rollback, if needed. We really trusted our engineers throughout this process, but no matter how good engineers are, we're all still human. There are still misses sometimes. Depending on the level of the engineer, junior engineers sometimes are not really experienced; they don't know which chart to look at. There was no way of blocking production that protected them from making mistakes. It happens, basically, and it contributed to 20% of incidents. It was very obvious that improvements had to be made. We needed to add automated tests to our rollout pipeline and add automation of the rollout trigger, rollback, and all those validations. Everything should be automated. With so many manual steps involved in a change rollout, there was a lot of room for improvement.

Then, our question was where to begin. I didn't really know. I felt like there were just so many things that we could do. Just get one thing done and let's see the results. Our PM told me to do some customer interviews. I'm glad that I listened to him. I did a lot of customer interviews together with our PM in our group, customer interviews for internal tools. We talked to a lot of groups. They were telling us the same story: tests are, in general, good, but our test environments are not stable. Roblox's infrastructure is too complicated, and it's hard to make the test environment stable. Teams today have their own validation steps. It's really hard to generalize and automate all those validations, because we have the client apps, we have the platform side, we have the edge side, the game server side. It's really hard to generalize those validations for all the teams. We decided our first step would be to automate the canary analysis part. Why?

First, canary analysis happens in production, and production is stable, so we don't need to deal with the test environment problem, at least not for this step. Second, it would actually bring immediate value by improving our reliability. Our reliability was just not good: twenty percent of incidents were caused by changes. There is a lot of low-hanging fruit there, so let's start with canary analysis and get the low-hanging fruit done. Third, thanks to my experience working on the telemetry team, we had created common alerts back then, because we knew the common metrics. We created common alerts for services, and those common alerts could actually be reused for defining the default canary analysis rules. That sounded like a good start.

Again, we came up with a plan. I'm the type of person who likes to make plans a lot. Borrowing from my previous experience, essentially, this is still a migration project: migrating from the old deployment tool to the new one, with some automation. Step one, we designed our new solution. Our new solution involved a new internal engineering portal for deploying services. Back then at Roblox, before this tool, we didn't really have a central place where you could view a catalog of services. Everything was just thrown onto one single page. You couldn't really tell how many services there were, which services were having changes, or who owned what. It was really a mystery back then. We also defined canary analysis by default. Sometimes teams don't really have canary instances; when they roll out to production, it goes to 100% with one single click.

Then we also defined a set of default rules that compare canary metrics with non-canary metrics. Those rules were basically based on the default alerts that I mentioned previously. Finally, we also designed a new automated canary evaluation engine. Step two, our plan was, let's develop the new solution and then move to transition. We were going to enable canary analysis for every service rollout. We were so friendly, we invested so much into customizations; we thought people would like us. We also allowed them to choose whether they wanted auto rollback and roll forward. In our design, those were all considered, because we thought that this would be a very easy and friendly project to roll out. We had all those things considered in our plan. Then step three was basically deprecation: disable the old way of service rollout. That was the plan.

As a spoiler, we ended up sticking to this plan, but there were some hiccups and pushback. Again, we got a lot of pushback. This time, the pushback was even stronger than in the telemetry migration project. There were just productivity concerns. The screenshot here is a snippet of what the old tool looks like: just two text boxes and a button at the bottom, and that's it. Basically, that's all you get for deploying your service. After you deploy, there will probably be a table showing you the progress. That's it. Everybody goes to this place; everyone's deployment is visible to everyone else. Just to recall, we have 2,000 services. Also, that was quite fascinating: people really stuck to this old tool and thought that it gave them a productivity boost because it's a single click.

With the new tool, we introduced a default 30-minute canary analysis validation phase. When we were debating how long the default duration should be, I told them I was used to a 45-minute canary validation phase. People were like, no, we can only do 10 minutes. We ended up with 30 minutes; that was a very random number we came up with. We got a lot of pushback, and in the end we actually reduced it a little bit to begin with. Then the third pushback, again, was just the new tool and new deployment workflow. People actually told us that it takes time to learn. Even though we had a UX designer for this internal tool and we thought the whole operation, the workflow, was very streamlined, no, engineers didn't like it.

Just a high level of what happened. This is a magic three quarters; yes, it's just a magic number. Again, we did it together in three quarters. In summary, again, those are the three pillars that I think played a key role in the success of this project. Reliable: because canary analysis is such an obvious low-hanging fruit, it ended up actually catching a lot of bad service rollouts to production. We had the metrics to track that. We tracked the services that failed at the canary phase and actually ended up getting rolled back, where that same image version never got rolled out again. We have a way to automatically track that number, and that number was really high. Delightful: the new tools are just so obviously better than the old one. Every service gets its own page. You can clearly see who owns what, which alerts are configured for those services, which alerts are firing, and what the recent changes and deployments are. It is so easy to tell what's going on with a service.

After a while, after the initial pushback, people started to like it. We paid very close attention to their feedback. We just invested heavily to make sure that this was a delightful experience for them. After everyone transitioned to this new tool, we clearly saw a boost in our productivity. In our user surveys, people reported their productivity gains themselves, or showed their appreciation for our new tools. We also measured how many bad service rollouts actually caused a production incident, and from the drop in incidents, that number was much lower than before.

This is just showing you the canary analysis, the UI. I didn’t manage to get approval for getting a full screenshot of what the new tool looks like. This is just the canary analysis part of the new tool. You can see on the top is the deployment workflow. There is a canary deployment phase that happens automatically for every production rollout. Then there is a canary analysis phase in between. If people click that, they can see a clear analysis in the Grafana charts, set of rules, and how long, and all the configs that were configured for this run.

At the end, there is a production rollout with a clear progress report and easy navigation to their logs and charts, everything. We also allow customized rules, configs. Our engineers really like UIs. Somehow our engineers, they don’t really like command line tools. We made everything in the UI. They can just click and edit their custom rules and configs. When they submit, we’ll automatically create a PR for them. Everything is checked in on GitHub. Every change will need to be reviewed by their teammates.

This is the adoption trend for this project, canary analysis. There are some fun things that you can tell from this trend. You can see that at the beginning, the slope was very steady. We were having a really hard time getting early adopters. We talked to people; we tried top-down, and also bottom-up approaches. Talking to people: “Try our new tool. It's really cool. Get your team on board”. People would say yes, but then their whole team would still be on the old tool. We also tried the top-down approach. Talk to managers: “Your team owns those services. Those service rollouts are crucial to the success of the company. You have to use this new tool to ensure your service rollout is reliable”. It didn't work well either. What ended up working well was that we collaborated with our reliability team and with SREs.

Then, when there was another incident caused by a service rollout, we would jump on that incident call and tell them how canary analysis could have prevented the issue. Of course, that was after the mitigation was done. We jumped into those calls. Also, we joined the postmortem process and basically made onboarding to automated canary analysis part of the action items. With that traction with the critical services and also the incidents, and the good thing about incidents is that they have a lot of awareness, a lot of people pay attention to them, we got those services on board. Then there are people who simply have that spirit of awareness. We also got our own team and our peer teams to try our tool.

From all those directions, we started to see slowly increasing adoption of our new tool. I think this big jump was actually because there was a reliability initiative email, an announcement sent out basically saying, you should all be using automated canary analysis, with examples of where we could have prevented incidents with automated canary analysis. Both bottom-up and top-down, that's how we got our adoption. Over time, we started to see a lot of self-adoption.

Basically, this is just organic growth. We also changed our onboarding tool for new hires; basically, we changed that page to use this new tool. We paid a lot of attention to details. New hires just like this new tool; they will never go back to the old one. As of today, it's already being used for 100% of all service rollouts. In this case, I don't think anyone is missing the old tool. In the telemetry use case, there are still people missing the old one. This one, no.

What's next? Canary analysis was actually a true low-hanging fruit, I think, because next we do need to deal with the environment problem. I do think that creating a stable test environment, being able to deal with all those version conflicts, and running integration tests against real dependencies is, in my opinion, a more complicated problem to solve. Canary analysis was a good place to begin. What's next? Next year, we are investing heavily in integration tests: helping our engineers write good tests, improving our integration testing environment with all those micro-environments, allowing them to run integration tests with proper dependencies, and also adding automated continuous delivery (CD) support.

Summary

Just a reflection on the two migration stories. These are just my personal key takeaways from my experiences. Make production tools highly reliable and available; that's the basic need. Internal tools really need to be reliable and available. Some of them, like telemetry and the deployment tool, are probably the ones that require the highest availability numbers. Understand what is needed most and where to invest. Thinking back, I feel very glad that I listened to a lot of people's opinions, like our PM's advice, and I listened to our customers.

All those voices really helped us make the decision. Third, have a North Star with a clear path for how to get there. Remember those two plan slides? I feel like they were crucial, even though we didn’t really stick to the timelines; we were really off on the timeline side. But we stuck to the plan. It was our North Star to guide us, even when there were pushbacks and moments of slow adoption. We needed to believe that what we were doing was really beneficial for the company and for the engineers, and stick to the plan.

The fourth one is, be considerate and don’t be interruptive. Even though these are just internal tools, we need to be really considerate of our internal engineers’ productivity. Roblox today has over 1,500 engineers; that’s a lot of engineers and a lot of productive time. Don’t be interruptive, and don’t force them to change, is my personal takeaway. Finally, a delightful experience is extremely important for internal tools. Internal tools are just like external products now. In my opinion, you need to make them really delightful and easy to use, almost like an external-facing product.

Questions and Answers

Participant 1: How are your rules tuned to handle different cardinalities of metrics for high-traffic versus non-high-traffic applications?

Dai: How we tune that is still an area we’re investing in. One current approach is setting a limit at every service level. We have monitoring for which services are emitting an unusually high number of unique time series in a very short time period, and we get alerted on that. Unless they fix their metrics, we’ll drop that particular metric from that service. That’s one thing. We also have other layers of throttling on the cluster side. We use VictoriaMetrics, which also provides tools we can use for that.
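The talk does not include code, but the per-service limit described above can be sketched roughly as follows. This is a minimal, hypothetical illustration in Java, not Roblox's implementation; the threshold value and the alerting mechanism are assumptions.

// Illustrative sketch of a per-service cardinality limit (not Roblox's code);
// the threshold and the alerting below are placeholder assumptions.
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CardinalityGuard {
    private final int maxSeriesPerService; // e.g. 50_000 unique series per service
    private final Map<String, Set<String>> seriesByService = new ConcurrentHashMap<>();

    public CardinalityGuard(int maxSeriesPerService) {
        this.maxSeriesPerService = maxSeriesPerService;
    }

    // Returns true if the sample is accepted, false if the new series is dropped.
    public boolean accept(String service, String seriesKey) {
        Set<String> seen = seriesByService.computeIfAbsent(service, s -> ConcurrentHashMap.newKeySet());
        if (seen.contains(seriesKey)) {
            return true; // an already-known series is always accepted
        }
        if (seen.size() >= maxSeriesPerService) {
            // Over the limit: alert the owning team and drop the new series
            System.err.printf("Cardinality limit hit for %s, dropping %s%n", service, seriesKey);
            return false;
        }
        seen.add(seriesKey);
        return true;
    }
}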

Participant 1: Have you had false positives where you have a canary run that fails and you investigate and it’s actually a problem with the run itself and not a bug?

Dai: Yes, that happens all the time. We do track false positives and false negatives: did the canary succeed or fail, and did that version end up getting rolled out or rolled back? We check those numbers, and we’re currently at about 89% to 90% accuracy, so it’s not 100% all the time. To address that, we have a very small team supporting this tool at Roblox, so we cannot afford to look into every single issue for every service. What we do instead, and I think it has worked pretty well, is train our users.

Instead of giving them the correct rule or taking a deeper look together, we basically send them a guide on how to tune their rules. They know their metrics better; they were the ones doing the manual validation, so they know how to set up the rules. Our general philosophy is: teach them how to use the tool and how to tune their rules, and after that, it’s up to them. Also, sometimes it’s better to be more aggressive with those rules.
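As a rough illustration of the kind of user-tunable rule being discussed, here is a minimal sketch in Java. It is not Roblox's rule engine; the metric name and tolerance are placeholder values.

// Minimal sketch of a tunable canary rule (not Roblox's rule engine);
// the metric name and tolerance below are placeholder values.
public class CanaryRule {
    private final String metricName;          // e.g. "http_5xx_rate"
    private final double maxRelativeIncrease; // e.g. 0.10 allows a 10% regression

    public CanaryRule(String metricName, double maxRelativeIncrease) {
        this.metricName = metricName;
        this.maxRelativeIncrease = maxRelativeIncrease;
    }

    // Fails the canary if its metric regresses beyond the allowed tolerance.
    public boolean passes(double baselineValue, double canaryValue) {
        if (baselineValue == 0.0) {
            return canaryValue == 0.0; // avoid dividing by zero on a clean baseline
        }
        double relativeIncrease = (canaryValue - baselineValue) / baselineValue;
        return relativeIncrease <= maxRelativeIncrease;
    }

    public static void main(String[] args) {
        CanaryRule errorRateRule = new CanaryRule("http_5xx_rate", 0.10);
        System.out.println(errorRateRule.passes(0.020, 0.021)); // true: within the 10% tolerance
        System.out.println(errorRateRule.passes(0.020, 0.030)); // false: 50% regression, roll back
    }
}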

Participant 2: You talked a lot about pushback from the engineers and the users of the tool, but you didn’t talk at all about pushback from the executives or higher-ups. When a project takes three times longer than expected, usually someone’s going to ask, what’s going on here? How did you manage that?

Dai: There was a lot of pressure to manage. I got a lot of direct pings from very high-level people asking me, why are we doing this project? Very high-level people also talked to my boss’s boss’s boss, several layers above me, asking, can we stop this project? How I handled that was, basically, I did a lot of company-wide sessions, including office hours with executives and other very high-level people, to explain to them using numbers rather than perception: numbers showing why we needed to make this transition, from the reliability gains, the cost gains, and the engineering efficiency gains. I just tried to convince them. I also listened carefully to their feedback and tried to invest however we could to make their experience better. If they said this annotation on Grafana looked different from the old tool, we spent a lot of time fixing those annotations and aligning the timelines down to the second, just to bring the experience closer. At the end of the day, we’re all humans. We spent so much time trying to make them happy, and I think people did show appreciation and understood the importance of our work and the respect we gave to everybody during this migration.

Participant 3: Could you talk a bit about how you made sure that your new telemetry tool was actually giving you the right numbers? Did you have some process to compare the results with the old one?

Dai: Telemetry is basically counters, gauges, and quantiles. For the counters and gauges, I don’t think our old tool was doing anything wrong, and we had a good understanding of how the open-source world measures those. The only thing that differed between the old tool and the open-source world was the quantiles. It’s not magic: you have the raw numbers, so you can do the calculation and get the real quantile distribution. Then you have the histogram-based calculation, and the old tool’s quantile-over-quantile approach. You basically compare all three numbers, and it wasn’t that hard to see which one is closer to the real quantile distribution.
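To make the comparison concrete, here is a small, self-contained Java example (with made-up latency samples, not Roblox data) showing how a quantile-over-quantile approach, simplified here to averaging per-host p99s, can diverge from the true quantile computed over the pooled raw samples.

// Made-up example showing why averaging per-host quantiles ("quantile over quantile")
// can diverge from the true quantile computed over the pooled raw samples.
import java.util.Arrays;

public class QuantileComparison {

    // Simple nearest-rank percentile estimate over a sorted array.
    static double percentile(double[] sorted, double p) {
        int index = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(index, sorted.length - 1))];
    }

    public static void main(String[] args) {
        double[] hostA = {10, 11, 12, 13, 14, 15, 16, 17, 18, 500}; // one slow outlier
        double[] hostB = {10, 10, 11, 11, 12, 12, 13, 13, 14, 14};  // uniformly fast
        Arrays.sort(hostA);
        Arrays.sort(hostB);

        // Quantile-over-quantile: average the per-host p99s
        double avgOfP99s = (percentile(hostA, 0.99) + percentile(hostB, 0.99)) / 2.0;

        // True quantile: pool the raw samples and take the p99 of the whole population
        double[] all = new double[hostA.length + hostB.length];
        System.arraycopy(hostA, 0, all, 0, hostA.length);
        System.arraycopy(hostB, 0, all, hostA.length, hostB.length);
        Arrays.sort(all);
        double trueP99 = percentile(all, 0.99);

        System.out.println("average of per-host p99s: " + avgOfP99s); // 257.0
        System.out.println("true p99 over raw samples: " + trueP99);  // 500.0
    }
}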

Participant 3: Was it a manual validation or did you automate it?

Dai: We had automated validation and also a manual one. The manual one was to prove that the old way of computing quantiles was wrong. The automated one was to prove that the new tool and the old tool produced similar metric output, and that where they differed, which was basically the quantiles, the new tool was correct.

Participant 4: You talked about build versus buy in the first migration, but I was interested if you ever revisited that as you were building, particularly the canary capability.

Dai: Yes, I revisited it. I don’t think there is a perfect answer to that question; it’s really case by case, and sometimes company by company. When you want something quick, buying seems very feasible: it can give you something really quickly and solve your current problem. But as you get bigger, the costs get higher, and when they reach a certain number, you take a step back and rethink the solution.




ACI Worldwide Expands Technology Partnership Ecosystem to Power ACI Connetic

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news

ACI Worldwide has expanded its global technology partnership ecosystem to help financial institutions across the globe increase operational resiliency and address evolving regulatory requirements designed to safeguard the stability of the financial system. Building on strategic partnerships with Microsoft, Red Hat and IBM, ACI is collaborating with MongoDB, the document-oriented NoSQL database, and the open source technology NATS from Synadia Communications for the reference architecture of ACI Connetic, ACI’s unified, cloud-native payments platform. These partnerships help extend ACI Connetic far beyond a traditional payments hub, delivering robust, highly functional payment engines that support financial institutions in meeting increasingly stringent non-functional requirements and increasing resilience against potential disruptions.

Banks have come under increasing pressure to future-proof their payments infrastructure as the industry shifts toward real-time, API-driven, globally distributed architectures. With traditional architectures creaking under the weight of new demands and digital-native players setting new benchmarks for performance, availability and innovation, regulators across the globe have called on banks to shore up their defences and increase operational resilience. New laws, including Europe’s Digital Operational Resilience Act (DORA), the UK’s operational resilience regime, and Australia’s CPS 230 Operational Risk Management standard, all require banks to follow stringent guidelines for safeguarding against information and communication technology-related incidents.

These regulations aim to improve banks’ operational resilience, with the ultimate goal of protecting economies and consumers from the impact of operational disruptions. ACI Connetic brings together card and account-to-account processing on a single, unified platform, delivering a unique combination of proven payments capabilities, integrated fraud prevention and cutting-edge cloud architecture. Designed to meet the demands of modern banking, it gives financial institutions the flexibility, scalability and resilience they need to compete in an increasingly complex payments landscape.

Article originally posted on mongodb google news. Visit mongodb google news



Private equity types to snap up NoSQL biz Couchbase – The Register

MMS Founder
MMS RSS

Posted on nosqlgooglealerts. Visit nosqlgooglealerts

Document database company Couchbase is set to be bought by a private equity biz in an all-cash transaction valued at approximately $1.5 billion.

After building a user base that includes Comcast, GE, and UPS, Couchbase reached terms with Haveli Investments late last week. In 2021, the company, which has introduced SQL-like features on its multi-model NoSQL database, raised $200 million in its IPO, valuing it at around $1.2 billion.

Sumit Pande, senior managing director at Haveli Investments, said: “The data layer in enterprise IT stacks is continuing to increase in importance as a critical enabler of next-gen AI applications. Couchbase’s innovative data platform is well positioned to meet the performance and scalability demands of the largest global enterprises.”

The buyout is expected to complete in the second half of 2025, subject to customary closing conditions and regulatory approvals.

Emerging from Membase, Couchbase Server is designed to appeal to developers with a distributed multi-model NoSQL document-oriented database software package optimized for modern, interactive applications.

Notable Couchbase employees include Donald Chamberlin, who co-authored the ubiquitous database language SQL in the 1970s. In that role, he is backing a new query language designed to overcome some of the shortcomings of SQL.

In 2023, Couchbase added a columnar sidecar to boost analytics performance for users who want more insight into their real-time data. Only available as a package with the main DBaaS, the in-memory analytics system offers support for Tableau and Power BI for analytics development and visualization.

In 2021, the NoSQL database began to support multi-statement SQL transactions and an approach to building schema-like structures into the database, allowing it to support multiple applications from the same data. The company said multi-statement SQL transactions allow statements to commit or roll back together. Couchbase moved to the BSL 1.1 license in March of that year. ®



Caylent: Interview With CTO Randall Hunt About The Cloud Native Services Company

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news

Caylent is a cloud native services company that helps organizations bring the best out of their people and technology using AWS. Pulse 2.0 interviewed Caylent CTO Randall Hunt to learn more about the company.

Randall Hunt’s Background


What is Randall Hunt’s background? Hunt said:

“I started programming in middle school, initially to hack into some of the early Massively multiplayer online (MMO) games. Over time, it became less about the games and more about the programming itself. I pursued physics and computer science at Western Carolina University, which led to a summer internship at NASA Langley. There, I worked on designing and implementing a mass properties Application Programming Interface (API) and, weirdly, testing the behavior of ping pong balls in a vacuum and at extreme temperatures. That experience helped me realize I enjoyed the profession of programming much more than the profession of physics, so I joined NASA Ames’ Intelligent Robotics Group (IRG) for a semester, where I was able to do a lot of work in Python.” 

“Following that, I participated in a summer fellowship through hackNY in NYC, where I built an SVD-based restaurant recommendation machine learning (ML) model for a startup called SpotOn. During that summer, I met Eliot Horowitz, the founder of MongoDB, which led me to drop out of school and join MongoDB full-time. At MongoDB, I wore many hats, gaining exposure to various aspects of technology and business.” 

“After several years, I joined AWS as a technical evangelist. My role involved writing blog posts, building demos, giving talks, and traveling the world. I loved both AWS and the job. About a year into my time there, a SpaceX recruiter reached out. I had applied to SpaceX 10 times before without success; I had never made it past the onsite interview. This time I did succeed, and I joined SpaceX. At SpaceX, I managed their AWS usage and the flight software CI/CD team. I was fortunate to be there for the first several Falcon 9 landings. It was an intense but immensely rewarding experience, where I learned a great deal.”

“As SpaceX began to pivot away from AWS, I decided to return to AWS to continue to focus my career on cloud technology. Over the years at AWS, I spoke with thousands of customers across 50+ countries and contributed to 180+ service and feature launches. We got to design and build systems capable of handling billions of requests per second, globally. 2016 to 2020 was a very good time to be at AWS; the growth and pace of innovation were exceptional. The Amazon leadership principles and culture are peculiar and unique. At AWS, they focus on a bias for action and a preference for writing over presentations. Clear written thought is valued over a flashy PowerPoint. Those concepts continue to inform my day-to-day actions.”

“After AWS, I joined Meta to work on the PyTorch team. I was impressed by the initial advancements in language models (pre-ChatGPT) and wanted to contribute to PyTorch, a framework central to the AI research community. After my time at Meta, I invested in several startups, took some time off to be a ski bum, and eventually found my way to Caylent. I remember I signed my offer letter on a chairlift in Mammoth.” 

“Fast forward to the present day, joining Caylent has been the best decision of my career to-date. Each step of my journey has been filled with fantastic memories and incredible colleagues, and I feel fortunate to have worked at such amazing places.”

“I wear a lot of hats. The long-term strategy of my role involves staying ahead of the market on technology and understanding how Caylent can continue to differentiate itself technically from our competitors. Questions I often ask myself include: How can we constantly be improving without changing the fundamentals that got us to where we are? How do we scale? How do we grow well? Some of that long-term strategy involves meeting with many AWS product teams to understand what they’re working on and how it might apply to current or potential customers. The tactical day-to-day involves everything from managing escalations (internal and customer), providing training for our delivery teams, diving deep on a new AWS release, working on sales or marketing content, supporting sales pursuits, and supporting customer deployments, to helping with IT initiatives.”

Favorite Memory

What has been your favorite memory working for the company so far? Hunt reflected:

“On June 9th of 2022, we were in New York City when we learned Caylent had earned our AWS Premier Tier designation. This was the milestone that unlocked all of our subsequent success. At the time, it felt incredible; in hindsight, it seems even more incredible. I don’t think any other AWS Partner has ever accomplished that goal as quickly as Caylent. I knew then that we were on to something exceptional.”

“My second favorite moment in Caylent’s history was very recent. Val Henderson, our president and CRO, presented on stage at AWS re:Invent. Val presented Caylent’s history and evolution into AWS’s most trusted SI partner. If June 9 of 2022 was when I knew we were on to something, then December 4 of 2024 is when I knew we were winning.”

Core Products

What are the company’s core products and features? Hunt explained:

“Caylent helps businesses turn their ideas into impact, faster. We build cool stuff. If a customer has an idea or a need, they come to us to get it done. We’re an all-in-one AWS Premier Tier consulting partner, and we work across everything, from product strategy to frontend, backend, and heterogeneous database migrations. We are biased toward hiring passionate autodidacts who have an innate curiosity and a desire to work across many different fields. Our goal isn’t just to be the best AWS consulting partner, but to be the only consulting partner that can do the things we do (shout out to Val and Jerry Garcia).”

Challenges Faced

What challenges have Hunt and the team faced in building the company? Hunt acknowledged:

“I actually believe the recent macroeconomic turmoil benefited Caylent by giving us an opportunity to talk to customers who otherwise would have continued to be plagued by legacy consulting partners. We’ve grown massively over the last several years, and with this massive growth, it can be hard to maintain our unique culture. Maintaining that culture is vital to our success because customers don’t choose Caylent because we’re competitive in price, they choose Caylent over Accenture or Deloitte because of our unique experience, culture and bias for action.”

“We’ve expanded into product strategy, UI/UX, cloud-native application development, and GenAI and we don’t need to convince the best talent in the industry to join us. They join us because they have the opportunity to build something new, free of the legacy organizational and technical debt of our predecessors. Caylent survives by constantly raising the bar. We overcome challenges, as corny as it sounds, by rising together, one of our core values, and fearlessly questioning the assumptions of the past.”

Evolution Of The Company’s Technology

How has the company’s technology evolved since its launch? Hunt noted:

“Since launching, we have introduced new service offerings and solutions to meet customers’ needs through the various eras of cloud modernization. We have also introduced them at pace with AWS’s product innovations. We leaned into AWS’s custom silicon, like Graviton, early on and helped customers achieve massive price/performance savings in their infrastructure.” 

“In this era of AI, it’s important now more than ever to adapt with speed. There is a real possibility that businesses will fail faster than ever if they don’t evolve.” 

“We found the careful balance between organizational evolution and disruption. Growing too fast without a scalable operational backbone could lead to disruptions. Innovating too quickly without the organizational muscle memory for adaptation could lead to dissonance.” 

“We balanced our rapid growth from 50 employees to 650 by evolving our talent, operations and business strategy to deliver the greatest value for our customers.” 

“It starts with building a culture of curiosity and continuous learning, what we call ‘The Caylent Way.’ This has built the organizational muscle memory needed to adapt to new technologies like GenAI. We also pride ourselves on not being like a typical IT vendor; we truly partner with our customers to accelerate their cloud evolution journey by “doing with” vs. “doing for”. We enable and educate their teams along the whole development and deployment process so they are confident in managing and operating the solution after the engagement ends.”

Customer Success Stories

When asking Hunt about customer success stories, he cited the following:

– BrainBox AI

– Venminder

– Life360

Differentiation From The Competition

What differentiates the company from its competition? Hunt affirmed: 

“Caylent is the only company that does what we do at this level of quality. Our people and our culture are why customers choose Caylent. Our tight alignment with AWS keeps us ahead of the competition and focused on the latest and greatest features that can materially improve our customer’s applications and infrastructure. We’re an innovation engine for the cloud, and customers work with us to jumpstart their own innovation.”

Future Company Goals

What are some of the company’s future goals? Hunt emphasized:

“We want to write the playbook for modern AI-driven tech services. This year, we plan to double down on AI solutions for Applied AI, internal process automation, generative AI-powered processes, and the Caylent Delivery Platform. These investments will streamline customer engagements, optimize delivery cycles, and ensure our teams have the tools and support they need to consistently deliver excellence.” 

“We are also seeing trends in agentic AI, and we would like to build applications for our customers across various industries, from software to retail, that use AI for problem solving, reasoning, planning and execution.”

Additional Thoughts

Any other topics you would like to discuss? Hunt concluded:

“The constantly evolving unit economics of Generative AI are extremely interesting right now. People should pay close attention to what’s happening there.”

Article originally posted on mongodb google news. Visit mongodb google news



Swift Scanner Kingfisher Exposes Active Code Secrets – MENAFN.com

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news

(MENAFN– The Arabian Post)

A high-performance tool named Kingfisher, developed by MongoDB, now enables developers and security teams to detect and validate active secrets, such as API keys and credentials, in codebases in real time. Its release addresses shortcomings in existing scanners by verifying findings through live checks against cloud services.

Kingfisher began as a personal project in July 2024 by MongoDB security engineer Mick Grove, who was dissatisfied with existing open-source secret scanners. Internal testing proved its value, and by April 2025 it had become a core part of MongoDB’s internal security workflows, scanning pre-commit code, CI/CD pipelines, Git histories and on-premise files to identify active secrets. The tool has now been made publicly available under the Apache 2.0 licence.


Crafted in Rust, Kingfisher employs Intel’s Hyperscan for high-speed regex matching and Tree-sitter for language-aware source parsing across more than 20 languages. It runs multi-threaded scans on repositories and file systems and adds entropy-based rules to filter for high-confidence detections. The standout feature is active validation: when a potential secret is found, the tool attempts to authenticate against external APIs, such as AWS, Azure, GCP or Stripe, to determine whether it is still functional.
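The article does not show code, but the entropy-based filtering it mentions is a common technique that can be sketched generically. The Java example below is a hypothetical illustration, not Kingfisher's Rust implementation; the example strings and the 3.5-bit threshold are made up.

// Generic illustration of entropy-based filtering for secret candidates.
// This is not Kingfisher's code (Kingfisher is written in Rust); the example
// strings and the 3.5-bit threshold are made-up values.
import java.util.HashMap;
import java.util.Map;

public class EntropyFilter {

    // Shannon entropy of the string, in bits per character.
    static double shannonEntropy(String s) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : s.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        double entropy = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / s.length();
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    public static void main(String[] args) {
        String randomLookingToken = "AKIA9X7Q2LMZ4TR8WB3J"; // fabricated, secret-like string
        String ordinaryWord = "configuration";              // low-entropy English word
        double threshold = 3.5;                             // arbitrary example cut-off

        System.out.printf("%s -> %.2f bits/char%n", randomLookingToken, shannonEntropy(randomLookingToken));
        System.out.printf("%s -> %.2f bits/char%n", ordinaryWord, shannonEntropy(ordinaryWord));
        System.out.println("flag as candidate: " + (shannonEntropy(randomLookingToken) > threshold));
    }
}

In a real scanner, a check like this only narrows the candidate list; the active validation step described above is what confirms whether a candidate is actually live.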

This real-time validation sharply reduces false positives. For example, Kingfisher identified one active AWS secret and four inactive Slack tokens in illustrative internal tests. The tool ships with over 700 built-in detection rules and supports custom configurations via YAML, making it extensible to new credential types.

Performance benchmarking shows Kingfisher outpaces popular tools such as TruffleHog and Gitleaks in terms of runtime, offering a faster, more efficient scanning solution. Its cloud-agnostic validation ensures organisations obtain unified visibility over secrets, irrespective of the cloud provider in use.


Using Kingfisher aligns with compliance demands, particularly the Supply-chain Levels for Software Artifacts (SLSA) framework. It aids organisations working toward SLSA Level 2 and beyond by preventing embedded credentials in source code and safeguarding build integrity throughout the software supply chain lifecycle.

Unlike cloud-hosted secret scanning, Kingfisher operates entirely on-premise or within authorised infrastructure. This ensures that detected secrets do not leave the user’s environment, addressing data privacy and sovereignty concerns.

Kingfisher is accessible across major operating systems, including Linux, macOS and Windows. Installation options range from pre-built binaries to source compilation via Docker. It also integrates seamlessly with GitHub, GitLab, and CI/CD systems, enabling detection at pre-commit, pull-request and post-merge stages.

Given the surge in credential-related breaches and the market’s growing concern over hidden, hard-coded secrets, Kingfisher directly responds to a critical need. Credential exposure remains a leading cause of data breaches, with stolen secrets frequently exploited by automated botnets and sold on underground markets.

By combining live validation, speed, and extensibility, Kingfisher represents a meaningful shift in the secret-scanning ecosystem. It not only identifies potential security issues but also confirms those that pose genuine risk, allowing developers and security engineers to focus remediation efforts on the threats that truly matter.

Its release as open source ensures broader access: security teams, DevOps practitioners and smaller organisations can now employ an enterprise-grade scanner without incurring licensing fees or relying on proprietary systems. MongoDB’s publication of Kingfisher thus reinforces its commitment to open-source solutions that empower the wider tech community.


Article originally posted on mongodb google news. Visit mongodb google news



From fortune 500s to startups: Mayur Nagarsheth on empowering AI innovation

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news


Opinions expressed by Digital Journal contributors are their own.

As artificial intelligence reshapes how enterprises manage risk, security, and operational integrity, few voices stand out with the depth and clarity of Mayur Nagarsheth. Trusted advisor to Fortune 500 firms and now also a co-founder of Portend AI, Nagarsheth brings an unusually comprehensive blend of technical fluency, business insight, and customer-centric thinking to the evolving AI startup ecosystem. “I don’t really have one title,” Nagarsheth says. “But if I had to summarize, I’d call myself the Chief Customer Officer. My focus is on designing solutions around customer needs and getting them into their hands effectively.”

Bridging technical depth with business acumen

Nagarsheth’s journey began with his pivotal role at MongoDB, where he was instrumental in scaling the company through critical revenue phases. “MongoDB gave me the foundational knowledge that is needed to take a company from x revenue to x plus multi-million. It also exposed me to technical architecture, sales methodology, and a deep understanding of customer thinking,” he shares.

His next chapter advising both Fortune 500 companies and early-stage firms added critical perspective on market dynamics. “What surprised me most was how different the needs were depending on the customer’s size. Large enterprises often struggle to define the problem itself, while startups want to try everything, but they don’t always want to pay for it.” These experiences cemented Nagarsheth’s approach: meeting customers where they are, understanding their intent (even when they can’t articulate it), and engineering clarity through innovation.

Founding Portend AI: risk intelligence reimagined

Nagarsheth’s transition from advisor to founder was a response to a systemic gap he observed firsthand. “Companies are walking into risk blindfolded,” he says. “The collapse of Silicon Valley Bank highlighted how little visibility firms have into their vendors, suppliers, or even their own operations. Trust is fractured, and that’s what Portend AI is built to address.” Portend AI operates at the convergence of AI observability, cybersecurity, and risk management.

The platform continuously monitors and assesses five core business risks: cybersecurity, supply chain, availability, reputation, and compliance. “Reputation is huge,” Nagarsheth stresses. “You may be doing business with a company whose vendor is trending negatively, and that risk becomes your own.” By aggregating and analyzing these signals through machine learning models, Portend AI provides a unified risk profile, and goes a step further. “We’re not just storytellers,” he says. “We provide remediation steps by connecting the dots. That’s the key differentiator.”

Navigating the founder’s journey

While advising offers perspective, founding a company demands skin in the game. “As an advisor, you offer a roadmap. But as a founder, you’re the one behind the wheel,” Nagarsheth reflects. “People may doubt your idea or close doors on you early on. It’s humbling. But the feedback, good or bad, shapes the company you’re building.”

Portend AI is already gaining traction. With more than seven beta customers and growing interest from small and medium businesses (SMBs), Nagarsheth is focused on serving the underserved. “SMBs lack the resources to monitor security, reputation, or compliance in-house. We take that off their plate.”

A scalable vision that connects the dots

Portend’s ambition doesn’t stop at SMBs. The platform is finding a strong use case among venture capital firms, many of whom manage portfolios of hundreds of early-stage companies. “These firms need a magnifying glass to assess and monitor their investments. Our solution offers due diligence, ongoing intelligence, and risk scoring on vendors, people, and product-market alignment.”

According to Nagarsheth, no existing product effectively connects the risk dots across all five critical business pillars. “There are cybersecurity solutions and tools for reputation management, but none that tie everything together,” he says. “Only Portend does that: connecting the dots, telling the story, and offering a path forward.”

A leader recognized for vision and impact

Nagarsheth’s recognition as a Fellow of the British Computer Society (FBCS) reflects his long-standing contributions to innovation, leadership, and technical advancement. But it’s his grounded, customer-first approach that continues to define his impact. With Portend AI, Nagarsheth is not only building a platform but enabling a more transparent and resilient future for AI-driven business. Whether guiding venture capitalists through portfolio risks or helping SMBs safeguard operations, he’s carving out a new standard for intelligent risk management.

To follow Mayur Nagarsheth’s journey and learn more about Portend AI, visit his LinkedIn.

Article originally posted on mongodb google news. Visit mongodb google news



Couchbase Acquired for $1.5 Billion | StartupHub.ai

MMS Founder
MMS RSS

Posted on nosqlgooglealerts. Visit nosqlgooglealerts

June 23, 2025, 1:48 pm IDT


Couchbase, a provider of distributed NoSQL database platforms for cloud-native applications, has been acquired by Haveli Investments, in an all-cash transaction valued at approximately $1.5 billion.

The company offers a data platform combining JSON document storage with in-memory key-value access, enabling scalable and responsive systems. Its platform supports multi-model workloads, including full-text search, real-time analytics, eventing, and time-series data, accessible through a SQL-like language. Couchbase provides both self-managed and fully managed deployments via Couchbase Server and Couchbase Capella (database-as-a-service), and extends functionality to edge devices with Couchbase Mobile.
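As a concrete illustration of the document-plus-SQL model described above, here is a minimal sketch using the Couchbase Java SDK 3.x-style API. The connection string, credentials, bucket name and document id are placeholders, and this is an illustrative example rather than official Couchbase sample code.

// Minimal sketch of Couchbase's document + SQL-like query model (SDK 3.x-style API).
// Connection string, credentials, bucket name and document id are placeholders.
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.GetResult;
import com.couchbase.client.java.query.QueryResult;

public class CouchbaseSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.connect("couchbase://127.0.0.1", "user", "password");
        Bucket bucket = cluster.bucket("example-bucket");
        Collection collection = bucket.defaultCollection();

        // Key-value access: store and fetch a JSON document by key
        JsonObject order = JsonObject.create().put("status", "shipped").put("total", 42.5);
        collection.upsert("order::1001", order);
        GetResult fetched = collection.get("order::1001");
        System.out.println(fetched.contentAsObject());

        // SQL-like query over the same JSON documents
        QueryResult result = cluster.query(
            "SELECT b.* FROM `example-bucket` b WHERE b.status = 'shipped'");
        result.rowsAsObject().forEach(System.out::println);

        cluster.disconnect();
    }
}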

The acquisition will return Couchbase to private ownership after its 2021 initial public offering on the Nasdaq. Haveli Investments held a 9.6% stake in Couchbase prior to the acquisition offer.


Couchbase shareholders will receive $24.50 per share in cash. This represents a 67% premium over the March 27th closing stock price and a 29% premium over the June 18th closing price. The deal includes a go-shop period ending June 23rd, allowing Couchbase to solicit competing offers. The transaction is subject to customary closing conditions, including shareholder approval and regulatory clearances.

“The data layer in enterprise IT stacks is continuing to increase in importance as a critical enabler of next-gen AI applications,” commented Sumit Pande, Senior Managing Director at Haveli Investments.

The global NoSQL database market is projected to experience significant growth in the coming years. The acquisition price reflects the strategic value of Couchbase’s technology within this expanding market. The deal is expected to close subject to customary closing conditions.

Key competitors include MongoDB, offering a document database known for its scalability and flexibility, and Amazon DynamoDB, a fully managed NoSQL database service provided by Amazon Web Services, offering high performance and scalability. DataStax, another competitor, provides a distributed database built on Apache Cassandra, emphasizing high availability and scalability.
