- Agilent Technologies: Baird maintains an Outperform rating and raises the price target from $140 to $141. JP Morgan maintains an Overweight rating but lowers the price target from $160 to $155.
- Chipotle Mexican Grill: JP Morgan maintains a Neutral rating and cuts the price target from $58 to $54.
- Dick’s Sporting Goods: TD Cowen maintains a Market Perform rating and lowers the price target from $216 to $205.
- Elf Beauty: Raymond James maintains a Strong Buy rating and raises the price target from $95 to $105.
- Equity Lifestyle Properties: Barclays initiates coverage with an Equal Weight rating and a price target of $70.
- HP: Citigroup maintains a Neutral rating and lowers the price target from $29 to $27.50. Wells Fargo keeps an Underweight rating and cuts the price target from $35 to $25.
- MongoDB: Baird maintains an Outperform rating and lowers the price target from $300 to $240.
- Nvidia: Deutsche Bank maintains a Hold rating and raises the price target from $125 to $145.
- Performance Food Group: Barclays maintains an Overweight rating and raises the price target from $100 to $112.
- Salesforce: Baird maintains an Outperform rating and cuts the price target from $400 to $365. Goldman Sachs maintains a Buy rating and raises the price target from $340 to $385.
- SentinelOne: Bernstein maintains an Outperform rating and lowers the price target from $27 to $25. Deutsche Bank maintains a Buy rating and cuts the price target from $28 to $22.
- Southwest Airlines: Deutsche Bank upgrades the stock from Hold to Buy and raises the price target from $28 to $40.
- Steris: Jefferies initiates coverage with a Hold rating and a price target of $263.
- Veeva Systems: TD Cowen maintains a Market Perform rating and raises the price target from $261 to $284.


NEW YORK, Feb. 12, 2025 /PRNewswire/ — MongoDB, Inc. (NASDAQ:MDB) today announced it will report its fourth quarter and full year fiscal year 2025 financial results for the three months ended January 31, 2025, after the U.S. financial markets close on Wednesday, March 5, 2025.
In conjunction with this announcement, MongoDB will host a conference call on Wednesday, March 5, 2025, at 5:00 p.m. (Eastern Time) to discuss the Company’s financial results and business outlook. A live webcast of the call will be available on the “Investor Relations” page of the Company’s website at h
Gemma 3n Available for On-Device Inference Alongside RAG and Function Calling Libraries

MMS • Sergio De Simone

Google has announced that Gemma 3n is now available in preview on the new LiteRT Hugging Face community, alongside many previously released models. Gemma 3n is a multimodal small language model that supports text, image, video, and audio inputs. It also supports finetuning, customization through retrieval-augmented generation (RAG), and function calling using new AI Edge SDKs.
Gemma 3n is available in two parameter variants, Gemma 3n 2B and Gemma 3n 4B, both supporting text and image inputs, with audio support coming soon, according to Google. This marks a notable increase in size over the non-multimodal Gemma 3 1B, which launched earlier this year and required just 529MB while processing up to 2,585 tokens per second on a mobile GPU.
Gemma 3n is great for enterprise use cases where developers have the full resources of the device available to them, allowing for larger models on mobile. Field technicians with no service could snap a photo of a part and ask a question. Workers in a warehouse or a kitchen could update inventory via voice while their hands were full.
According to Google, Gemma 3n uses selective parameter activation, a technique for efficient parameter management. This means the two models contain more parameters in total than the 2B or 4B that are active during inference.
Google highlights the ability for developers to fine-tune the base model and then convert and quantize it using new quantization tools available through Google AI Edge.
With the latest release of our quantization tools, we have new quantization schemes that allow for much higher quality int4 post training quantization. Compared to bf16, the default data type for many models, int4 quantization can reduce the size of language models by a factor of 2.5-4X while significantly decreasing latency and peak memory consumption.
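As a rough back-of-envelope check of that size claim (the parameter counts below are illustrative, and real exported files also include embeddings and layers kept at higher precision, which is why the practical factor lands around 2.5-4X rather than a clean 4X):

```python
# Rough model-size estimate: bf16 (16 bits/param) vs. int4 (4 bits/param).
# Parameter counts are illustrative, not Gemma 3n's exact architecture.

def model_size_gb(num_params: float, bits_per_param: float) -> float:
    return num_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

for params in (2e9, 4e9):  # the "2B" and "4B" active-parameter variants
    bf16 = model_size_gb(params, 16)
    int4 = model_size_gb(params, 4)
    print(f"{params / 1e9:.0f}B params: bf16 ~ {bf16:.1f} GB, "
          f"int4 ~ {int4:.1f} GB ({bf16 / int4:.1f}x smaller)")
```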
As an alternative to fine-tuning, the models can be used for on-device retrieval-augmented generation (RAG), which enhances a language model with application-specific data. This capability is powered by the AI Edge RAG library, currently available only on Android, with other platforms to follow.
The RAG library uses a simple pipeline with several steps: data import, chunking and indexing, embedding generation, information retrieval, and response generation using an LLM. It allows full customization of the RAG pipeline, including support for custom databases, chunking strategies, and retrieval functions.
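The article does not show the library's Kotlin API, so the sketch below only illustrates those pipeline stages in generic Python; every name here (chunk, build_index, retrieve, answer) is a placeholder, not part of the AI Edge RAG library.

```python
# Minimal illustration of the RAG pipeline stages described above.
# The embed and generate callables are stand-ins for a real embedder and LLM.
from typing import Callable

def chunk(text: str, size: int = 200) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(chunks: list[str], embed: Callable[[str], list[float]]):
    return [(embed(c), c) for c in chunks]                 # chunking + indexing

def retrieve(query: str, index, embed, k: int = 3) -> list[str]:
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    q = embed(query)
    return [c for _, c in sorted(index, key=lambda e: -dot(e[0], q))[:k]]

def answer(query: str, index, embed, generate: Callable[[str], str]) -> str:
    context = "\n".join(retrieve(query, index, embed))     # information retrieval
    return generate(f"Context:\n{context}\n\nQuestion: {query}")  # LLM response
```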
Alongside Gemma 3n, Google also announced the AI Edge On-device Function Calling SDK, also currently available only on Android, which enables models to call specific functions to execute real-world actions.
Rather than just generating text, an LLM using the FC SDK can generate a structured call to a function that executes an action, such as searching for up-to-date information, setting alarms, or making reservations.
To integrate an LLM with an external function, you describe the function by specifying its name, a description to guide the LLM on when to use it, and the parameters it requires. This metadata is placed into a Tool object that is passed to the large language model via the GenerativeModel constructor. The function calling SDK includes support for receiving function calls from the LLM based on the description you provided, and sending execution results back to the LLM.
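As a rough illustration of that flow, here is a hedged sketch: the Tool and GenerativeModel classes below are Python stand-ins written for this article; the real AI Edge FC SDK is an Android API and its exact signatures differ.

```python
# Hypothetical sketch of the function-calling setup described above.
# Tool / GenerativeModel are stand-ins, not the actual AI Edge FC SDK classes.
from dataclasses import dataclass, field

@dataclass
class Tool:
    name: str
    description: str                      # guides the LLM on when to call it
    parameters: dict = field(default_factory=dict)

@dataclass
class GenerativeModel:
    tools: list                           # tool metadata passed at construction

    def generate(self, prompt: str) -> dict:
        # A real model would decide here whether to emit plain text or a
        # structured call such as {"name": "set_alarm", "args": {"time": "7:00"}}.
        raise NotImplementedError

set_alarm = Tool(
    name="set_alarm",
    description="Set an alarm at a given time.",
    parameters={"time": {"type": "string", "description": "e.g. '7:00 AM'"}},
)
model = GenerativeModel(tools=[set_alarm])
```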
If you want to take a closer look at these new tools, the best place to start is the Google AI Edge Gallery, an experimental app that showcases various models and supports text, image, and audio processing.

MongoDB (NASDAQ:MDB – Get Free Report) had its price target lowered by equities research analysts at Guggenheim from $300.00 to $235.00 in a report released on Wednesday, Benzinga reports. The brokerage presently has a “buy” rating on the stock. Guggenheim’s price target would indicate a potential upside of 25.47% from the stock’s current price.
Several other equities analysts have also recently weighed in on the company. Rosenblatt Securities reissued a “buy” rating and set a $350.00 price objective on shares of MongoDB in a research report on Tuesday, March 4th. Canaccord Genuity Group cut their price target on MongoDB from $385.00 to $320.00 and set a “buy” rating for the company in a report on Thursday, March 6th. Robert W. Baird dropped their target price on MongoDB from $390.00 to $300.00 and set an “outperform” rating for the company in a report on Thursday, March 6th. Daiwa America raised MongoDB to a “strong-buy” rating in a report on Tuesday, April 1st. Finally, Wedbush dropped their price target on MongoDB from $360.00 to $300.00 and set an “outperform” rating for the company in a report on Thursday, March 6th. Nine analysts have rated the stock with a hold rating, twenty-three have issued a buy rating and one has issued a strong buy rating to the company. Based on data from MarketBeat.com, the company currently has an average rating of “Moderate Buy” and a consensus target price of $286.88.
MongoDB Trading Down 0.9%
NASDAQ MDB traded down $1.66 on Wednesday, reaching $187.29. The stock had a trading volume of 237,836 shares, compared to its average volume of 1,927,673. The business has a 50 day moving average price of $174.60 and a 200 day moving average price of $234.52. MongoDB has a 1 year low of $140.78 and a 1 year high of $370.00. The company has a market capitalization of $15.21 billion, a price-to-earnings ratio of -68.35 and a beta of 1.49.
MongoDB (NASDAQ:MDB – Get Free Report) last released its quarterly earnings results on Wednesday, March 5th. The company reported $0.19 earnings per share (EPS) for the quarter, missing the consensus estimate of $0.64 by ($0.45). The company had revenue of $548.40 million for the quarter, compared to the consensus estimate of $519.65 million. MongoDB had a negative return on equity of 12.22% and a negative net margin of 10.46%. During the same quarter last year, the company posted $0.86 EPS. On average, analysts anticipate that MongoDB will post -1.78 earnings per share for the current fiscal year.
Insider Activity at MongoDB
In other news, Director Dwight A. Merriman sold 3,000 shares of the firm’s stock in a transaction that occurred on Monday, March 3rd. The stock was sold at an average price of $270.63, for a total value of $811,890.00. Following the completion of the sale, the director now directly owns 1,109,006 shares of the company’s stock, valued at $300,130,293.78. This trade represents a 0.27% decrease in their ownership of the stock. The transaction was disclosed in a filing with the SEC, which is available at this link. Also, CAO Thomas Bull sold 301 shares of the business’s stock in a transaction that occurred on Wednesday, April 2nd. The stock was sold at an average price of $173.25, for a total value of $52,148.25. Following the transaction, the chief accounting officer now directly owns 14,598 shares in the company, valued at $2,529,103.50. The trade was a 2.02% decrease in their ownership of the stock. The disclosure for this sale can be found here. Over the last three months, insiders sold 25,203 shares of company stock valued at $4,660,459. 3.60% of the stock is owned by corporate insiders.
Institutional Trading of MongoDB
Hedge funds and other institutional investors have recently modified their holdings of the stock. Cloud Capital Management LLC acquired a new position in MongoDB in the 1st quarter valued at $25,000. Hollencrest Capital Management acquired a new stake in MongoDB in the 1st quarter worth about $26,000. Cullen Frost Bankers Inc. boosted its holdings in MongoDB by 315.8% in the 1st quarter. Cullen Frost Bankers Inc. now owns 158 shares of the company’s stock valued at $28,000 after purchasing an additional 120 shares during the period. Strategic Investment Solutions Inc. IL acquired a new position in shares of MongoDB during the 4th quarter worth approximately $29,000. Finally, NCP Inc. bought a new position in shares of MongoDB during the fourth quarter worth approximately $35,000. Institutional investors own 89.29% of the company’s stock.
About MongoDB
MongoDB, Inc., together with its subsidiaries, provides a general-purpose database platform worldwide. The company provides MongoDB Atlas, a hosted multi-cloud database-as-a-service solution; MongoDB Enterprise Advanced, a commercial database server for enterprise customers to run in the cloud, on-premises, or in a hybrid environment; and Community Server, a free-to-download version of its database, which includes the functionality that developers need to get started with MongoDB.


In an age when data is everything to a business, managers and analysts alike are looking to emerging forms of databases to paint a clear picture of how data is delivering value for their businesses. The challenge is an insatiable demand from businesses for AI capabilities, which rely on access to data that has been reliably vetted and is relevant to the prompts or queries at hand. Graph databases and knowledge graphs—especially when combined—fulfill this role.
This is part of a rapidly unfolding movement in which “databases have evolved from merely storage layers for transactional data to having the capacity to store the data, including its own relationships with other entities,” said Gautami Nadkarni, senior customer engineer at Google Cloud. This “takes them from knowledge to wisdom.” Attaining this wisdom is where knowledge databases and graphs step in, she added; they play a crucial role in visualizing data in forms that normal databases don’t have the capacity for.
“More connected, context-rich data ecosystems are essential for scalable AI,” said Scott Gnau, VP of data platforms at InterSystems. “As the pressure grows to deliver more accurate, explainable, and dynamic AI solutions, organizations are rethinking how their data infrastructure supports that goal.”
The rise of interconnected data means businesses need to look past standard relational databases. “Having the ability to store it as interconnected graphs would be more powerful,” said Nadkarni. “The rise of large language models is creating a greater need for knowledge graphs to provide context, improve accuracy, and reduce hallucinations or false results,” she added. “Businesses need to make decisions faster and with the right data. Graph databases enable real-time analysis of relationships, which is essential for dynamic environments.”
With generative AI (GenAI) and retrieval-augmented generation becoming more mainstream, “there’s growing demand for technologies that deliver better context, meaning, and reasoning,” said Ishaan Biswas, director of product management at Aerospike. “Knowledge graphs running on high-performance graph databases are proving especially valuable here. They help build smarter recommendation and retrieval systems, significantly reducing hallucinations and improving AI responses.”
DIFFERENT ROLES
While joined at the hip, graph databases and knowledge graphs serve two distinct purposes. Graph databases open up views of interconnected data to help users discover and analyze relationships between datapoints, be it in customer transactions or internal operations. Knowledge graphs provide the visible structure that supports decision making and analytics.
Both graph databases and knowledge graphs “have similarities but serve different purposes,” said Shalvi Singh, senior product manager at Amazon AI. “Graph databases serve as the underlying infrastructure layer specifically designed for the efficient management of connected data. They are optimized to handle complex queries and traversals. In contrast, knowledge graphs add a semantic reasoning layer, facilitating tailored domain recommendations or precise diagnostics in fields like healthcare.”
A graph database stores and manages data as nodes, or entities, and edges, or relationships, said Philip Miller, senior product marketing manager and AI strategist at Progress. This facilitates the exploration of complex, interconnected data. A knowledge graph, meanwhile, “is a specialized application built on top of a graph database that adds semantic meaning and business context through ontologies and formal relationships,” he added. “In essence, the graph database provides the infrastructure, while the knowledge graph brings structure, reasoning, and domain expertise.”
Often, “people confuse graph databases and knowledge graphs, but they’re complementary, not interchangeable,” noted Biswas. “A graph database is like a car’s engine—it stores and processes data connections efficiently, allowing for rapid data accessibility. A knowledge graph is like that car’s navigation system, providing context, meaning, and direction on top of those connections.”
The bottom line is that “graph databases are a type of NoSQL database that stores data as a network of interconnected nodes and edges,” Nadkarni explained. “They manage complex relationships between data.”
The real value comes when graph databases and knowledge graphs are integrated, said Gnau. “In a multi-model platform, graph and semantic modeling capabilities work together to provide both structure and meaning, enabling AI and large language models to deliver more accurate, relevant, and explainable insights.”
Combining both graph databases and knowledge graphs “enables you to derive maximum benefit with GenAI and large language model technologies,” said Biswas. “A high-performance graph database with a rich knowledge graph provides structured, accurate context. This potent combination greatly diminishes errors or hallucinations and makes AI outputs more reliable and accurate.”
Effective AI systems “will leverage both the semantic layer that knowledge graphs provide as well as the power of graph databases to model complex, real-world relationships within an organization’s data,” said Parker Erickson, senior AI solutions engineer at Snorkel AI. “Both empower AI to retrieve data in a deterministic and symbolic manner—improving explainability and accuracy of the system. Knowledge graphs will help inform AI systems about the rules that govern an organization’s world—all imports of a certain type of product and location are subject to a certain tariff. Graph databases are suited to look up what exact SKUs from what suppliers are then subjected to that tariff and could be used to run simulations on the optimal way to import more product based on the rules contained in the knowledge graph layer of the graph.”
Enabling both the semantic and data layers of graphs “allows AI systems to incorporate the what—specific datapoints—but also the why—the rules contained within the semantic layer—all curated by subject matter experts,” said Erickson.
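To make Erickson's tariff example concrete, here is a toy sketch in plain Python, with the rule sitting in a hand-written "semantic layer" and the SKU/supplier edges in a hand-written "data layer" (all names and numbers are invented; no actual graph database product is used):

```python
# Toy version of the tariff example: the "knowledge graph" holds the rule,
# the "graph database" holds the concrete SKU/supplier/location edges.
knowledge_graph_rules = [
    # rule: imports of this product type into this location carry this tariff
    {"product_type": "electronics", "location": "US", "tariff": 0.25},
]

graph_edges = [  # (sku, supplier, product_type, import_location)
    ("SKU-100", "Acme Ltd", "electronics", "US"),
    ("SKU-200", "Beta GmbH", "textiles", "US"),
    ("SKU-300", "Acme Ltd", "electronics", "EU"),
]

def skus_subject_to_tariff():
    hits = []
    for sku, supplier, ptype, loc in graph_edges:      # traversal over the data layer
        for rule in knowledge_graph_rules:             # rules from the semantic layer
            if ptype == rule["product_type"] and loc == rule["location"]:
                hits.append((sku, supplier, rule["tariff"]))
    return hits

print(skus_subject_to_tariff())   # [('SKU-100', 'Acme Ltd', 0.25)]
```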
ENTER AI
The combined power of graph databases with knowledge graphs can help enterprises boost AI efforts by delivering insights within their proper context. These environments enable AI tools to rapidly traverse and analyze complex data networks of information from across various systems and platforms; adjoining knowledge graphs can integrate these data sources into a semantic framework, providing meaning and context, said Gnau.
“Knowledge graphs are especially powerful,” Miller observed, when it comes to supporting AI and large language models. “They ground AI in real-world facts and relationships, reducing hallucinations and enabling more accurate, explainable results.”

MongoDB, Inc. (NASDAQ:MDB – Get Free Report) saw unusually large options trading activity on Wednesday. Investors acquired 36,130 call options on the stock. This is an increase of 2,077% compared to the average daily volume of 1,660 call options.
Insiders Place Their Bets
In other news, CEO Dev Ittycheria sold 18,512 shares of the firm’s stock in a transaction dated Wednesday, April 2nd. The shares were sold at an average price of $173.26, for a total value of $3,207,389.12. Following the transaction, the chief executive officer now directly owns 268,948 shares in the company, valued at approximately $46,597,930.48. This trade represents a 6.44% decrease in their position. The transaction was disclosed in a filing with the SEC, which can be accessed through the SEC website. Also, CAO Thomas Bull sold 301 shares of the firm’s stock in a transaction dated Wednesday, April 2nd. The stock was sold at an average price of $173.25, for a total transaction of $52,148.25. Following the completion of the transaction, the chief accounting officer now owns 14,598 shares in the company, valued at approximately $2,529,103.50. The trade was a 2.02% decrease in their ownership of the stock. The disclosure for this sale can be found here. Insiders sold a total of 25,203 shares of company stock valued at $4,660,459 in the last ninety days. Corporate insiders own 3.60% of the company’s stock.
Institutional Inflows and Outflows
Several institutional investors have recently made changes to their positions in the company. Vanguard Group Inc. increased its position in shares of MongoDB by 6.6% in the first quarter. Vanguard Group Inc. now owns 7,809,768 shares of the company’s stock worth $1,369,833,000 after acquiring an additional 481,023 shares in the last quarter. Franklin Resources Inc. increased its position in shares of MongoDB by 9.7% in the fourth quarter. Franklin Resources Inc. now owns 2,054,888 shares of the company’s stock worth $478,398,000 after acquiring an additional 181,962 shares in the last quarter. UBS AM A Distinct Business Unit of UBS Asset Management Americas LLC increased its position in shares of MongoDB by 11.3% in the first quarter. UBS AM A Distinct Business Unit of UBS Asset Management Americas LLC now owns 1,271,444 shares of the company’s stock worth $223,011,000 after acquiring an additional 129,451 shares in the last quarter. Geode Capital Management LLC increased its position in shares of MongoDB by 1.8% in the fourth quarter. Geode Capital Management LLC now owns 1,252,142 shares of the company’s stock worth $290,987,000 after acquiring an additional 22,106 shares in the last quarter. Finally, Amundi increased its position in shares of MongoDB by 53.0% in the first quarter. Amundi now owns 1,061,457 shares of the company’s stock worth $173,378,000 after acquiring an additional 367,717 shares in the last quarter. 89.29% of the stock is owned by hedge funds and other institutional investors.
MongoDB Price Performance
Shares of MDB opened at $188.45 on Thursday. The firm’s 50 day moving average is $174.52 and its 200-day moving average is $234.21. The stock has a market capitalization of $15.30 billion, a P/E ratio of -68.78 and a beta of 1.49. MongoDB has a 12-month low of $140.78 and a 12-month high of $370.00.
MongoDB (NASDAQ:MDB – Get Free Report) last announced its quarterly earnings data on Wednesday, March 5th. The company reported $0.19 earnings per share (EPS) for the quarter, missing the consensus estimate of $0.64 by ($0.45). The business had revenue of $548.40 million during the quarter, compared to analyst estimates of $519.65 million. MongoDB had a negative return on equity of 12.22% and a negative net margin of 10.46%. During the same period in the prior year, the firm earned $0.86 EPS. As a group, equities research analysts forecast that MongoDB will post -1.78 EPS for the current fiscal year.
Wall Street Analysts Forecast Growth
A number of research firms have recently issued reports on MDB. Citigroup reduced their price target on shares of MongoDB from $430.00 to $330.00 and set a “buy” rating on the stock in a research report on Tuesday, April 1st. Robert W. Baird reduced their price target on shares of MongoDB from $390.00 to $300.00 and set an “outperform” rating on the stock in a research report on Thursday, March 6th. Wedbush reduced their price target on shares of MongoDB from $360.00 to $300.00 and set an “outperform” rating on the stock in a research report on Thursday, March 6th. Monness Crespi & Hardt upgraded shares of MongoDB from a “sell” rating to a “neutral” rating in a research report on Monday, March 3rd. Finally, Morgan Stanley reduced their price target on shares of MongoDB from $315.00 to $235.00 and set an “overweight” rating on the stock in a research report on Wednesday, April 16th. Nine investment analysts have rated the stock with a hold rating, twenty-three have issued a buy rating and one has assigned a strong buy rating to the stock. According to MarketBeat, the stock currently has a consensus rating of “Moderate Buy” and an average price target of $286.88.
About MongoDB
MongoDB, Inc., together with its subsidiaries, provides a general-purpose database platform worldwide. The company provides MongoDB Atlas, a hosted multi-cloud database-as-a-service solution; MongoDB Enterprise Advanced, a commercial database server for enterprise customers to run in the cloud, on-premises, or in a hybrid environment; and Community Server, a free-to-download version of its database, which includes the functionality that developers need to get started with MongoDB.

MMS • Ye Qi

Transcript
Ye (Charlotte) Qi: I’m Charlotte, working on LLM inference at Meta. You’re going to hear a rapid fire of all the challenges that you’re going to run into when you turn an LLM into real LLM serving infrastructure. Since 2023, everyone has been in this AI gold rush, trying to find compute resources to fit the unprecedented growth from big models and longer context. This year, the demand keeps growing because test-time compute and compound LLM systems are the new hotness. Scaling LLM serving is almost becoming like building a distributed operating system. It’s so foundational to all of the innovations. We are seeing innovations happen everywhere, especially at the intersection.
At Meta’s AI infra, we are building this strong foundation to empower our researchers and ML engineers. I’ve been solving the problems of serving models for the past six years. My current focus is cost-saving and DevEx. LLM serving is probably the most interesting model serving problem that I work on because a model is a system. The best solution really requires you to think comprehensively to get some joint optimization between model, product, and system. I’m part of the zero-to-one launch that brought Meta AI online. Our team optimizes the inference backend for Meta AI and smart glasses. Apart from the public-facing traffic that everybody sees, we also have a lot of internal traffic, like RLHF, data curation, and distillation that goes into the making of the Llama family of models. These are absolutely massive. During a busy week of RLHF, our team has to process hundreds of millions of examples.
Challenge 1 – Fitting
I shared our story at the AI Infra @Scale conference. The question that I get the most is, should I run my own LLM service? How do I optimize my own inference? I think the best way to answer this question is to build everything step-by-step. Let’s imagine we have an awesome product idea. We are going to build a web agent that’s browsing InfoQ’s website, and every time Charlotte posts something, you’re going to summarize it, push it to your customer, and let them ask some follow-up. What an awesome idea. We’re going to go through a journey of four stages of challenges. Let’s start with the simplest one. You get the model. You get some hardware. What to do next? Step one, you need a Model Runner. LLM inference is notoriously expensive to run. The model is trained with next token prediction, which means your inference logic is also token by token. Look at this colorful sentence.
If an LLM has to generate it, every color switching means another model forward pass. We call the generation for the first token, prefill, and all the tokens after the first, decode. This iterative process is very challenging to accelerate. Normally, the end-to-end latency is going to be a few seconds. That’s why almost all LLM applications are using streaming interface. To tackle this special execution pattern, you want to find a runtime that at least supports continuous batching and KV cache. Why continuous batching? The response from LLM will have variable length. The shorter one will exit earlier if you are just using a static batching or dynamic batching, and the resource will be idle. What is continuous batching, though? Imagine like a bus.
At every bus stop, which is the end of each decoding step, it will try to pick up new passengers if there’s an empty spot. The new passengers carry a lot of luggage. They’re very slow to get on the bus. It’s just like prefill. The bus will keep going if nobody is waiting at the bus stops. Fortunately, most of the bus stops are indeed empty, just like what’s in the South Bay. This is how we keep GPU well-utilized. What about KV cache then? Every decoding step is conditioned on all previously generated tokens. The K and V tensors for the same token at the same position basically stay the same across the single generation request. If you don’t do this, your attention computation is going to be cubic instead of quadratic, which is beyond sustainable. Luckily, almost all of the mainstream LLM frameworks support it, and you wouldn’t go too wrong here.
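As a minimal sketch of these two ideas (nothing here resembles Meta's serving stack; decode_step is a placeholder for a real batched forward pass):

```python
# Toy continuous-batching loop: after every decode step, finished requests
# leave the batch and waiting requests are admitted, so slots freed by short
# responses are not left idle. Purely illustrative, not a real serving loop.
import collections

MAX_BATCH = 8
waiting = collections.deque()   # requests waiting at the "bus stop"
active = []                     # requests currently on the "bus"

def decode_step(batch):
    # Placeholder for one batched forward pass; each request would reuse its
    # own KV cache here instead of recomputing attention over past tokens.
    for req in batch:
        req["generated"] += 1

def serving_loop():
    while waiting or active:
        # bus stop: pick up new passengers if there are empty seats
        # (their prefill runs as they board, slowing this step down)
        while waiting and len(active) < MAX_BATCH:
            active.append(waiting.popleft())
        decode_step(active)
        # passengers whose response is finished get off immediately
        active[:] = [r for r in active if r["generated"] < r["max_tokens"]]

waiting.extend({"generated": 0, "max_tokens": n} for n in (3, 12, 5))
serving_loop()
```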
Step two is to find some hardware. High-end data center GPUs today typically come in an 8-GPU setup. This is an example from Meta. You will see some network connectivity with different speeds, but our job is fitting. Let’s only pay attention to HBM size. You will see some 40, 80, 96, 192. These are the numbers from the most popular data center GPUs like A100, H100, MI300. Let’s take some H100s and try to fit some models in. The 8B models fit in one GPU just fine. You’ve got the runner, so just run it. A 70B model cannot fit in one GPU. We use tensor parallelism to partition the weights across multiple GPUs on the same host. You need at least two GPUs to not OOM.
Typically, we do use four to eight GPUs to allow a bigger batch size and improve the throughput, because the KV cache does take quite an amount of memory. The 405B model can’t fit in 8 H100s. The weights are going to take over 800 gig under bf16, so you will need two nodes. We recommend using pipeline parallelism to further partition the weights across the nodes, because the communication from multi-node tensor parallelism is just going to introduce too much overhead. You can also choose MI300, which offers 192 gigs of HBM, so you can serve it on a single host without going through all the trouble. The takeaway is, don’t just grab your training or eval code. Find a runtime specialized in LLM inference serving. Try to have some understanding of the AI hardware that matches your model. You can use tensor or pipeline parallelism to fit your models.
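A back-of-envelope version of that fitting exercise, counting weights only in bf16 (KV cache and activations are ignored, which is exactly why real deployments add more GPUs than this minimum):

```python
# Weight-only fitting check in bf16 (2 bytes per parameter).
def min_gpus_for_weights(params_billion: float, hbm_gb: float = 80) -> int:
    weights_gb = params_billion * 2          # bf16: ~2 GB per billion params
    gpus = 1
    while weights_gb / gpus > hbm_gb:
        gpus += 1
    return gpus

for size in (8, 70, 405):
    print(f"{size}B: >= {min_gpus_for_weights(size)} x 80GB GPUs for weights alone")
# 8B fits on one GPU, 70B needs at least two (tensor parallel on one host),
# and 405B (~810 GB of weights) does not fit on a single 8-GPU H100 node.
```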
Challenge 2 – It’s Too Slow
We solved the fitting problem, and the product is hooked up. The next challenge is, it’s too slow. Most infra problems do get better by throwing capacity at them. You can also throw more GPUs, like the 3 million GPUs on the screen, or you wait for faster next-generation GPUs if you can actually wait. Throwing capacity blindly only takes you this far, not that far. You’re staring at the runtime that you found, and thinking, outside of these 100 arguments, is there anything that I can tune to make it faster? Certainly, there is, but let’s understand where the limit is. Remember we mentioned that LLM inference consists of prefill and decode. Prefill generates the first token. It reads a huge model weight, and it does tons of computation across all of these tokens within your prompt to find all the relations between each pair of tokens, and it then outputs one token. Decode generates subsequent tokens.
Again, it reads a huge model weight, and it reads all the KV cache, but it only attends one new token to all previously generated tokens. The compute density is a lot lower. This makes prefill GPU-compute heavy and decode memory-bandwidth heavy, and the KV cache requires quite an amount of memory capacity. To sum up, making LLM inference faster is really about fitting this LLM math within these three types of system resources, and they scale with the model sizes, sequence lengths, and batch sizes, but, a little annoyingly, they scale with different multipliers. However, the ratios of the system resources on your hardware are fixed once it’s manufactured. This is why we have to try harder to bridge the gap.
Where do we start then? When you say things are slow, let’s be a little bit more precise. If you’re using a streaming interface, you might care about the first token latency, or TTFT, to reduce the awkward silence when your customer is waiting for something to be generated. Or you care about the generation speed, or output speed, or TTIT or TTOT; they’re all the same. This is to let your user see your app churning out the responses super fast. Or it’s not user-facing at all, and it’s a bunch of bots talking to each other. In non-streaming use cases, you might actually care about the end-to-end latency, because your client has some timeouts. These three latencies can be optimized in slightly different ways. You need to set up reasonable expectations around what’s achievable for your model sizes, input-output lengths, and your own cost budget.
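For concreteness, a small sketch of how those three numbers fall out of per-token timestamps on a streamed response (metric names follow the talk; the timestamps are made up):

```python
# Deriving TTFT, average time per output token, and end-to-end latency from
# the timestamps of a streamed response (all values in seconds, made up).
def latency_metrics(request_sent: float, token_times: list[float]) -> dict:
    ttft = token_times[0] - request_sent                 # time to first token
    e2e = token_times[-1] - request_sent                 # end-to-end latency
    tpot = (e2e - ttft) / max(len(token_times) - 1, 1)   # generation speed
    return {"ttft_s": ttft, "time_per_output_token_s": tpot, "e2e_s": e2e}

print(latency_metrics(0.0, [0.80, 0.85, 0.90, 0.95, 1.00]))
# -> TTFT ~= 0.8 s, ~0.05 s per output token, end-to-end = 1.0 s
```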
For example, you cannot expect a 405B model to run faster than a 70B model, unless you can make significant modifications to your hardware. With this context, we can get back to our step one, throwing capacity a little more wisely. I actually ignored the fact that even in continuous batching, you will see some stalls right there. This is because when we’re running prefill in the continuous batch, all of the decoding steps in the same batch will get slowed down. This might be ok, because practically, the per-instance requests per second for an LLM are quite low, because of the resource demand. Most of the steps are actually not interrupted. But imagine you suddenly get a request with 32K input: all of the decoding will get stuck for a few seconds, and your customer will actually see it. They will feel it. We can fix this problem with disaggregation. We replicate the already partitioned weights, and we run prefill and decode in different services.
This way, we can scale their system resources separately, and get rid of this annoying 10x P99 latency for decode, because that P99 is basically the average of your prefill latency. It helps us to maintain the same latency SLOs with much fewer machines. If you only rely on 8-way tensor parallelism, even with disagg in place, and you’re looking at processing an input that’s 128K, you are looking at at least a minute. If your app does look for this type of responsiveness, even for such long input, you do have the expensive option to use context parallelism, which further partitions the entire workload along the context dimension. It helps you throw more targeted compute at your prefill time. Replication and partitioning are our best friends. They help us manipulate the system performance profile into a better shape that can fit into hardware that comes in different shapes. Whenever something is too big to functionally fit, or creates a substantial system bottleneck, we just cut it, copy it, and we can get over it.
Enough about distributed inference. Let’s move on to step two. You should consider making your problem smaller. As long as your product quality can tolerate it, try to make some modifications to lower the burden on your hardware. Number one, just use a more concise prompt. Number two, get a smaller model with specialized fine-tuning, distillation, or pruning. Number three, apply quantization to unlock twice or even more compute available on your hardware. Pay attention here, quantization is not a single technique.
At post-training time, you can mix and match different components, different dtypes, or different policies to use for quantization. The tremendous openness in the LLM community will actually allow you to find implementations for all of these very easily. You can play around and see which option gives you the best ROI. Step three, we can also get better at memory management using caching or virtual memory. Remember we discussed that the KV cache takes quite an amount of memory. This problem can basically be tackled the same way as traditional system performance problems.
First, we’ll try to identify what can actually get cached. For requests using roleplay or integrity chat, you typically have to throw in a very long system prompt to keep the model in context. For a multi-turn chatbot application, each request includes all of the chat history from previous turns. That’s a lot of re-computation. On your GPU server, there’s a decent in-memory storage system right there. You will get half a terabyte of HBM, a few terabytes of DRAM, and a few dozen terabytes of flash. We can build a hierarchical caching system that actually matches the KV cache usage pattern. For example, the common system prompt might get cached in HBM. The active user’s chat history that you need to load every minute or so can sit in DRAM. Chat history from less engaged users can be offloaded to flash.
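A toy sketch of that tiered lookup (the tier names track the talk; the promotion policy and the idea of keying on a hash of a reusable prompt prefix are assumptions for illustration):

```python
# Toy tiered KV-cache lookup over HBM, DRAM, and flash. The key is assumed to
# be a hash of a reusable prompt prefix; the promotion policy is illustrative.
TIERS = ["hbm", "dram", "flash"]                 # fastest to slowest
cache = {tier: {} for tier in TIERS}

def put_kv(prefix_hash: str, kv, tier: str = "dram") -> None:
    cache[tier][prefix_hash] = kv                # e.g. system prompts -> "hbm",
                                                 # idle users' history -> "flash"

def get_kv(prefix_hash: str):
    for tier in TIERS:                           # check the fastest tier first
        if prefix_hash in cache[tier]:
            kv = cache[tier].pop(prefix_hash)
            cache["hbm"][prefix_hash] = kv       # promote hot entries to HBM
            return kv
    return None                                  # miss: prefill must recompute

put_kv("system-prompt-v3", kv=b"...", tier="hbm")
assert get_kv("system-prompt-v3") is not None
```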
If you can manage the system and privacy complexity for this type of problem, this is basically a no-brainer. It’s lossless. It’s very common for us to see over 50% reduction in both latency and capacity. Step four, if you’re looking for even more levers on performance and cost saving, you can try speculative decoding, chunked prefill, dive deeper into the attention kernels, or try some sparsity at the token level. There are tons of LLM domain-specific optimizations. This table is a quick glimpse into this space. Each technique makes different, sometimes conflicting, tradeoffs around TTFT, TTIT, quality, and cost. You need to decide what’s important for your product. You also want to take my words with a grain of salt because the exact behavior really depends on your own unique workload.
Let’s recap what we’ve learned so far. Getting the basics right will give you a 10x foundation. Distributed inference, smaller input and model, and caching will give you another 2x to 4x boost. You are free to explore more advanced decoding algorithms, changes in your engine, to find your next 10x or 100x.
Challenge 3 – Production
People might come and ask, this sounds so complicated. Did you do all of this at Meta? You will find out what we might do if you hear the next two challenges. Suppose our awesome InfoQ agent is fast enough for product launch, what do we solve next? Our toy system now has to turn into a production service that is available for a user at all times. What comes to your mind when you hear the word production? To me, everything could function in a slightly unexpected way. We get a lot more moving pieces. Request distribution is changing. Like you can see in this graph, you get users with different intent of engagement, while some just spam you a lot more often. The input to output ratio has bigger variance. The effective batch size is much smaller and also changes a lot. All of these invalidate our assumptions around creating that perfect ratio of system resources to max our hardware ROI.
What’s more, the temporal patterns are also changing a lot. We get daily peaks and daily off-peaks, and spikes, just like all online services. The spikes might be just some random people running evals. Very strict real-time online services like chatbots are not the only type of upstream for LLMs; human-in-the-loop annotation is also real-time, but it has to align with the evaluator’s schedule. Batch processing for summarization and feature generation can be scheduled with some time shifting, up to minutes or hours. What’s more, LLM deployment has a very complex latency and throughput tradeoff. Throughput is only meaningful when it can meet your latency SLOs. You have to do this for multiple deployments. A mature product typically involves serving a handful of LLMs. You get the main big model for your chat, and you get a bunch of ancillary models for safety, planning, rewarding, and function calling, to complete the end-to-end loop. Putting everything together, you’ll often find that the peak FLOPS a vendor advertises are kind of useless.
It’s very common to lose 50% of effective FLOPS, even at the earliest kernel benchmarking stage, especially for shorter input. Combining the latency bound and some operating buffer, you’re going to lose 10x. This is very common. How do we get back all of these lost FLOPS? Step one, let’s think harder. What can be traded off? Out of quality, latency, throughput, reliability, and cost, you can probably only get to choose two or three of them. I want you to ask this question to yourself: do you really need three nines or four nines if that’s going to cost you 3x? Do your customers only feel meaningful differences when you are serving at an output speed of 100 tokens per second versus 10 tokens per second? As a reference, I only speak at a speed of five tokens per second.
After you’ve thought through it, you come to me and say, “I just want to push harder on the latency. Tell me more about inference optimizations”. Let’s look at the data and decide where to optimize. You get the 70B model running at roughly 200 milliseconds on 8 GPUs, which is pretty decent performance, but your service is only hosted in California. Your New York customers get another 75 milliseconds in network roundtrip. You don’t just hardcode your host name and port. Another 75 milliseconds goes into naive host selection, health checks, load balancing, and everything else. Your LLMs may be multimodal, and you have to download the images in the request, another 150 milliseconds.
Your app has to coordinate a bunch of business logic, adding safety checks before and after the main model call. It might also do a bunch of other things, like fetching information, reacting, doing search. This can easily add another 400 milliseconds. Look at this distribution. You can tell me where to spend time, where we should optimize. Hold on. You actually care about stuckness from this long prefill. Then, let’s do disagg. Earlier, I made disagg sound so simple. You just create two services, and everything is done. Your prefill operations and your decode operations are going to get nicely batched together. What I didn’t mention is, the link between prefill and decode is quite thin. Between prefill and decode, you have to transfer hundreds of megabytes of KV cache. You typically have an upper network bandwidth limit when you are talking TCP/IP. It’s very common to add another 50 to 100 milliseconds to your TTFT if you are doing disagg.
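A rough back-of-envelope for why that transfer hurts, with illustrative 70B-class dimensions rather than Meta's actual numbers:

```python
# Back-of-envelope KV-cache transfer cost between prefill and decode services.
# Dimensions are roughly 70B-class with grouped-query attention; all values
# are illustrative, not production numbers.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2                     # bf16
prompt_tokens = 2000

kv_bytes = prompt_tokens * layers * kv_heads * head_dim * 2 * bytes_per_elem  # K and V
link_gbps = 50                         # effective TCP/IP bandwidth between hosts
transfer_ms = kv_bytes * 8 / (link_gbps * 1e9) * 1e3

print(f"KV cache ~ {kv_bytes / 1e6:.0f} MB, transfer ~ {transfer_ms:.0f} ms")
# ~655 MB and ~105 ms: in the same ballpark as the 50-100 ms TTFT hit above.
```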
This means, by doing disagg, you are basically signing up for these hard optimizations, like request scheduling, and overlapping data transfer with compute at every single layer. You might also want to do a little bit of additional tuning based on your own unique network environment. What’s more, how would you push a binary to it? How would you update model weights? That alone probably needs a full 40-minute session to explain. No disagg, maybe caching is easier. True, but LLMs have this annoying positional encoding. The same tokens showing up in the beginning of the sentence versus the end of the sentence will have different embeddings. Imagine you are trying to hit the model cache, but the requests you are sending include the last 10 messages from your customer. Then, nothing will be cached. To improve the cache hit rate, you not only need a big cache, as we discussed how you can build a hierarchy, you also need prompts that are very cacheable.
To solve this problem, we co-designed chat history management with our product team, and we even further customized the request data flow to not only maximize the cache hit rate, but also leave a longer inference latency budget. To route these requests to the host with the cache, we use consistent hashing to sticky-route requests from the same session to the same host, and we also have to ensure that retries on host failure are also sticky. Maybe we can try quantization then. If you are in a resource-constrained environment and you really don’t have a choice, quantization is basically a functional requirement. When you do have the choice, you might want to be a little bit more paranoid. What we commonly see is you quantize the weights to fp8, you run some benchmark, let’s say MMLU.
The scores are the same, sometimes even higher. When you release the quantization to production, some of your customers just show up and say, something is not working. Most of the benchmarks are saturated, and may not best represent your own product objective. You should build your own product eval, and use a slow-roll process to get these wins. Speaking of slow rolls and testing, how do we maintain these nice scores over time? Inference bugs can manifest as subtle performance degradation, because LLMs are probabilistic models. It’s possible you have something horribly wrong, but the result still comes out decently correct. To prevent your models from getting dumber over time, our approach is no different than traditional CI/CD. Just make small benchmark runs on every single one of your diffs, and run more comprehensive ones on every release of the inference engine.
Time for another round of takeaways. We lose a lot of theoretical FLOPS in a production environment. Don’t get surprised about it. When you look at the end-to-end latency, inference is often just a small portion of it. If you want to make good use of the KV cache, and the persistent KV cache, you often have to do product and infra co-optimization. Lastly, don’t forget to continuously evaluate and test all of these acceleration techniques directly using the signals from your product.
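One concrete piece of that product and infra co-optimization is the session-sticky routing mentioned earlier; here is a minimal sketch (host names invented; real routers add virtual nodes, health checks, and load awareness):

```python
# Session-sticky routing on a consistent-hash ring, so a session's KV cache
# keeps landing on the same host, and retries move deterministically.
import bisect
import hashlib

def _h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, hosts):
        self.ring = sorted((_h(h), h) for h in hosts)

    def route(self, session_id: str, down=frozenset()) -> str:
        points = [p for p in self.ring if p[1] not in down]   # skip failed hosts
        i = bisect.bisect(points, (_h(session_id), ""))
        return points[i % len(points)][1]

ring = HashRing(["infer-host-a", "infer-host-b", "infer-host-c"])
primary = ring.route("session-42")                 # same host for every request
retry = ring.route("session-42", down={primary})   # sticky retry: deterministically
print(primary, retry)                              # falls to the next host on the ring
```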
Challenge 4 – Scaling
Our InfoQ agent is a big hit. We decided to make it a little bit more fun. We want to let our users customize which speakers they want to subscribe to. What else is going to change? This is when we encounter our last stage of challenges, which is scaling, a never-ending topic for all infra engineers. You get so many rocket parts. They’re scattered around. How are we going to put them together to make them fly? To assess that, let’s ask ourselves, what numbers will get bigger as you scale your apps? We’ll have more deployments. We’ll use more GPUs. We’ll involve more developers and serve more models. Let’s look at this one by one.
Step one, we get more deployments. Let’s use disagg as an example, because I like it. We mentioned the prefill service needs more GPU compute, so you find a hardware type, call it hardware P, that uniquely fits this job. The decode service needs more memory bandwidth, so you find a hardware D for this job. When you look at the hosts in the data center, region 1 is full of hardware P, and region 2 is full of hardware D. Interesting. Remember, prefill and decode have to transfer hundreds of megabytes of data to each other for every single request, so you probably don’t want to do that cross-region. This means if you opt in for disaggregated deployments with some hardware preferences, you actually need to maintain and optimize inference deployments, not for one, but for three, because you also need to handle the remainders.
Thinking about this, you might have different answers about what your net win is going to look like here. Remember the idea of using context parallelism for very long input. It helps you to reduce the TTFT. 90% of the time, you’re on this happy path. You get high-bandwidth network from RDMA, and everything is fast and nice. However, with 40 GPUs in one partition, a single bad GPU will take down your entire process group. Assuming at any time we have 3% random failures for your GPU cards, the blast radius will grow exponentially as the smallest inference deployment unit gets bigger. With faster interconnect, these hosts are required to be placed physically closer together, and whenever there are maintenance events going on in the infra, like network device upgrades or daily maintenance from the data center, all of these hosts will be gone at the same time.
At Meta, we actually have to create a dedicated job scheduler to allocate hosts for distributed inference. We also talk about heterogeneous deployment, and hosts with shared fate. What about autoscaling? In the CPU world, if you have a service with changing workloads throughout the day, you pick a throughput metric like QPS or your request queue size, then you turn on autoscaling and forget about it. For LLMs, number one, the right throughput metric is extremely tricky to find, because the bottleneck depends on your workload. QPS obviously does not work. Tokens per second works, with a lot of caveats.
Number two, you cannot freely upsize, as you’re likely to run into the GPU limit first. Putting the allocation challenges together, you need a deployment allocator to place your jobs with awareness of network topologies and maintenance events, to minimize the impact from infrastructure activities. To tackle heterogeneous hardware and autoscaling, you will need a deployment solver that actually treats autoscaling as a shard placement problem. It understands the supply and demand from your inference services, and it can make decent decisions even when the demand exceeds supply. At scale, all of these traditional operational problems are solved with quite sophisticated software and algorithms.
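A toy illustration of treating placement as a supply-and-demand problem (a greedy solver over invented regions, jobs, and GPU counts; a real solver also models topology, maintenance events, and hardware types):

```python
# Toy "deployment solver": greedily place inference jobs onto per-region GPU
# supply, shedding the lowest-priority work when demand exceeds supply.
supply = {"region1": 64, "region2": 48}             # free GPUs per region
demand = [                                          # (job, gpus_needed, priority)
    ("chat-main-decode", 40, 0),
    ("chat-main-prefill", 32, 0),
    ("safety-model", 16, 1),
    ("offline-eval-batch", 24, 2),                  # lowest priority, shed first
]

placement, unplaced = {}, []
for job, gpus, _prio in sorted(demand, key=lambda d: d[2]):
    region = max(supply, key=supply.get)            # region with the most free GPUs
    if supply[region] >= gpus:
        supply[region] -= gpus
        placement[job] = region
    else:
        unplaced.append(job)                        # demand exceeded supply

print(placement)   # e.g. decode -> region1, prefill -> region2, safety -> region1
print(unplaced)    # ['offline-eval-batch']
```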
We talk about allocation at scale. Step two is cost saving. We’ve covered dozens of options to make things faster and cheaper. It’s going to be quite tedious to manually apply every single one of them and tune every job individually. When you draw a graph of latency and throughput for different serving options, like different numbers of GPUs or different inference acceleration techniques you try, you will get a graph that looks like the one on the left. You will see the lines eventually crossing each other at some point. This means your best cost-effective option actually depends on your latency requirement. We have to categorize these workloads to create something like an inference manual to really allow our customers to choose the best option for them. This type of inference manual requires very extensive performance benchmarking automation and a lot of data science.
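A tiny sketch of reading such a latency/throughput table to pick the cheapest option that still meets a latency SLO (every configuration name and number below is invented):

```python
# Pick the cheapest serving configuration that still meets the latency SLO.
options = [
    # (name, p99_latency_ms, tokens_per_sec_per_gpu, min_gpus)
    ("tp8-baseline",   180,  900,  8),
    ("tp8-fp8-quant",  140, 1400,  8),
    ("disagg-prefill",  90, 1100, 16),
]

def cheapest_meeting_slo(slo_ms: int, required_tps: int):
    feasible = []
    for name, p99, tps_per_gpu, min_gpus in options:
        if p99 <= slo_ms:
            needed = max(min_gpus, -(-required_tps // tps_per_gpu))  # ceil division
            feasible.append((needed, name))
    return min(feasible) if feasible else None

print(cheapest_meeting_slo(slo_ms=150, required_tps=20000))
# (15, 'tp8-fp8-quant'): which option is cheapest depends on the latency SLO,
# which is why the lines on the latency/throughput graph cross.
```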
Suppose we solve the volume problem. Cost saving also has to be looked at end-to-end, just like latency. Your tool chain will get so much more effective if you grab some knowledge about the inference accelerations I talk about in this talk. I commonly hear people raise this question: look at this workload, look at the gap between the off-peaks and peaks, can we run something else to leverage this free capacity? It’s like, this green line is your provisioned throughput, and you can actually raise this line 3x by applying some ideas I mentioned in this talk. This is good to hear. I have many deployments. Where do I start? A common perception I see is that 90% of people focus on the one, two, three head models because we think those deployments consume all of the GPUs.
When you actually measure, you will find many more tail deployments than you can imagine, and collectively, they might consume even more GPUs than your main model. This means observability and automation are no longer nice-to-haves; they directly help you to claim all of this low-hanging fruit. To put it together, remember the 10x loss. Your ideal throughput is determined by your workload and your hardware. The inference optimization will help you to improve achievable throughput.
The deployment option you choose is going to determine your provisioned throughput, and you might not choose the best that’s available to you. You want your deployments to be sized automatically, as you are charged by GPUs, not by tokens, when you’re hosting your dedicated deployments. With a good data foundation and also removing some footguns for our customers, we are capable of really minimizing the gap across the four. Next step, running a product is beyond serving one version of each model. You want to split your traffic and evaluate a few launch candidates. This has to happen for multiple models on the platform. What does this mean? Everything we’ve discussed so far gets yet another dimension that multiplies onto the whole problem set.
Summary
You’ve already got a glance into all of the building blocks for a scalable LLM serving infrastructure. We start from model and hardware, which are really the spotlight for everything. To fit the model onto hardware efficiently, you will need a good Model Runner, a performant execution engine with a handful of inference acceleration techniques. To get the engine ready to become a production service, we need to think about end-to-end latency, efficiency, and correctness by investing in monitoring, routing, product integration, scheduling optimization, and continuous evaluation. A serving infrastructure needs to run multiple LLM services.
This is when allocation, deployment management, model management, and experimentation all come into play. We haven’t even talked about quota and interfaces. Look at this graph: model and hardware are really just the tip of the whole LLM serving iceberg. To get the best of all worlds, we often have to do vertical optimizations across the entire stack. I think we spent a good amount of time scratching the surface of LLM serving infrastructure. I hope you find something useful from this talk, and hopefully you can get some insights for your inference strategies. Many of the pains are still very real. I hope to see more innovations to help us scale development and deployment for GenAI applications all together.
Questions and Answers
Participant 1: What are you focusing on right now? Obviously, there’s a lot of areas, but just currently, what is one or two areas that you’re focusing on at the moment?
Ye (Charlotte) Qi: Getting ready for the next generation of Llama, obviously.
Participant 1: What’s the toughest challenge in this space that you worked on in the last six months or a year?
Ye (Charlotte) Qi: I think for model serving, especially if you want to get things end-to-end, it’s not like you only care about modeling, or you hire a bunch of GPU experts, or you hire a bunch of distributed system experts. You actually have to have all three. You actually have to make these people talk to each other and understand the tradeoffs in the same language. Because a lot of the time, when you get things online, there’s only limited stuff that you can do by only applying infra techniques. Sometimes you do have to make some modifications to your models directly. You have to work closely with the modeling people, ensuring that the modeling direction gets manifested in the correct net behavior for all of those techniques together.
As I show here, yes, a lot of the time, people will say, LLM serving, you have a lot of interesting problems to solve. Still, around 70% or even 80% of the time has to be spent on typical system-generalist type of work. We have to build this foundation very solid, and make sure it’s scalable and, especially, debuggable.
Participant 2: At your peak times, what was your maximum GPU utilization that you have gone through? Also, in terms of hardware performance specifically, so do you work with vendor partners like either NVIDIA or AMD to even make it more better in terms of acceleration techniques? They may be providing libraries to even give better performance of using majority of the hardware resources. How does that fit in your work here?
Ye (Charlotte) Qi: First, about GPU utilization. It’s actually quite complicated to use one single metric to measure your GPU utilization, because GPUs are these very massive parallel machines. For NVIDIA hardware, there is this device active metric that you can get, but that metric is very coarse. It tells you that something is happening on this GPU, but it doesn’t measure at a finer grain how you are utilizing everything there, because for a single GPU, there are hundreds of processing units within it. We also have a metric to measure the utilization at the processing unit level.
Typically, in our work, we do cover all of those as well. The exact GPU device utilization and the streaming processors utilization really depends on your workload. Typically, we do see them go up, like go up to more than 50% very easily for our workload. Sometimes, if you’re only looking at the utilization, it can get misleading from time to time. Like we mentioned applying caching, your tokens per second will go up, but your utilization will actually go down. You need to make sure you are tracking the stuff that you care about.
About working with hardware vendors to improve performance: Meta is quite a big company. We have people working on something more like an internal cloud LLM serving platform, and we also partner very closely with PyTorch; we are basically collaborating together. Let's say we first introduce AMD to our fleet: we work very closely with the PyTorch team. They gather signals directly from the fleet to know, are all the features covered? Are all of those operators running? Even today, I think there's still a tiny bit of a gap. They get the signals, both kernel-level performance and end-to-end performance, and they are indeed working closely together. The exact partnership is outside of my own expertise.

MMS • RSS
Posted on mongodb google news. Visit mongodb google news

Every morning, find a selection of analysts' recommendations covering North America. The list only includes companies that have been the subject of a change of opinion or price target, or of new coverage. It may be updated during the session. This information can be used to shed light on movements in certain stocks or on market sentiment towards a company.

Article originally posted on mongodb google news. Visit mongodb google news

MMS • Ben Linders
Article originally posted on InfoQ. Visit InfoQ

Today’s software professionals navigate a maze of technical, business, and social complexity. According to Xin Yao, thriving in this environment requires more than just technical and business expertise. We need fluency in decoupling systems for maintainability, reconnecting them for business value, and working with the messiness of organizational dynamics. At the OOP Conference, Yao explored how sociotechnical design helps us deal with these challenges.
One of the biggest challenges in software is the need to decouple and connect at the same time, as Yao explained:
Developers are expected to break systems into modular, independent parts while also understanding how to reconnect them to create business value. A well-structured API, a message queue, or a data contract may work today, but as business needs evolve, those connections must be revisited.
As software ages and business contexts evolve, the necessary changes – refactoring, re-architecting, or rethinking system boundaries – are often held back not by technical difficulty alone, but by the social complexity of teams, decision-making structures, role boundaries, and organizational dynamics, Yao argued.
Software exists within human systems, and yet many development practices assume that the human aspects of software work, like communication, understanding, decision-making, and collaboration, can be neatly structured and controlled like code. But social complexity is emergent, unpredictable, and full of contradictions, Yao said.
Misaligned incentives, rigid team structures, power dynamics, and change fatigue often create barriers to sustainable architecture, Yao mentioned. Many teams lack the necessary conditions for open, reflective conversations, leading to superficial solutions that don't address underlying social constraints, and thus to brittle designs, she added.
Yao argued that beyond technical excellence, sociotechnical fitness is key, describing it as the ability to engage in deep collaboration, reflective conversations, and participatory decision-making:
Skills like facilitation, surfacing questions, active listening, and collaborative modeling help teams navigate uncertainty and align software with human needs.
A prerequisite of sociotechnical fitness is sociotechnical awareness: seeing our work with software as a system of systems. Yao argued that it's essential to design the relationships between software, business needs, and the humans behind both:
Developers and architects who recognize this aren’t just building software—they’re cultivating an environment where good software can emerge through trust, shared language, and continuous negotiation of meaning.
Software doesn’t exist alongside the social system, it is embedded within it, Yao said. Our work is a complex social system with complicated technical subsystems as parts. Being embedded, a technical subsystem isn’t an isolated machine we can optimize independently; it grows, shifts, and entangles itself with human decisions, workflows, and power structures:
Every piece of code carries traces of the conversations, misunderstandings, constraints, and compromises that shaped it.
This embeddedness makes software inherently more complex than just a collection of technical parts. The same piece of code can function differently depending on who is using it, how teams interpret requirements, or how decision-making authority is distributed, Yao said.
Software is not pure unpredictability. Practices like DDD, CI/CD, containerization, and TDD bring predictability to the technical realm, Yao said. But this predictability breaks down at the level of human interaction:
The challenge is knowing when to lean on predictability, and when to embrace emergence. That’s the shift from software design to sociotechnical design.
Instead of jumping to solutions, we need to engage with the deeper questions shaping our whole sociotechnical systems – questions of meaning, relationships, and power dynamics, Yao suggested. By doing so, we avoid premature convergence and create solutions that reflect the true complexity of the business domain and the human system.
InfoQ interviewed Xin Yao about dealing with sociotechnical complexity.
InfoQ: What are the key pitfalls to avoid when trying to align architecture with organizational structures?
Xin Yao: Many organizations aim to implement the Inverse Conway Maneuver but struggle in practice. A major pitfall is assuming that simply changing team structures will fix architectural issues. It’s more effective to iteratively adjust boundaries based on how teams naturally collaborate and evolve.
Complexity calls for collaboration. Big-bang divide-and-conquer of social complexity (a.k.a. reorganizations or transformations) is not a panacea for decoupling and connecting software.
InfoQ: What’s the role of storytelling in navigating sociotechnical complexity?
Yao: Stories help make abstract complexity tangible. They enable teams to surface assumptions, build shared understanding, foster psychological safety, and engage in productive dialogue.
AWS Launches Centralized Product Lifecycle Page: Transparency and Consolidating Deprecation Info

MMS • Steef-Jan Wiggers
Article originally posted on InfoQ. Visit InfoQ

AWS recently launched its Product Lifecycle page, a new centralized resource providing comprehensive information on service availability changes. With this initiative, the company aims to streamline how customers track service deprecations, end-of-support timelines, and restrictions on new customer access.
The AWS Product Lifecycle page consolidates information into three key categories: services closing access to new customers, services announcing end of support (including migration paths and timelines), and services already reaching their end-of-support date.
Wojtek Szczepucha, a solution architect at AWS, posted on LinkedIn:
Now, you can check if and which services will reach the end-of-support. However, it’s not only that, with the AWS Product Lifecycle page, you will know:
– WHY has this decision been made?
– HOW can you handle this situation? With the alternate solutions proposed.
Adding to the positive reception, cloud economist Corey Quinn offered his analysis, stating he was “happier” about this round of deprecations for two reasons. Firstly, he noted the benefit of a consolidated announcement of multiple deprecations, contrasting it with the potentially unsettling “drip-drip-drip” of individual deprecations seen previously. In addition, Quinn argued that this batching, coupled with clear rationales and transition plans (as highlighted by Szczepucha), builds customer confidence.
Secondly, Quinn emphasized the significant improvement in having a unified location for service and feature deprecation information, moving away from a fragmented landscape of blog posts and documentation updates.
This sentiment was echoed by Luc van Donkersgoed, an AWS Serverless Hero, in a LinkedIn post:
AWS is finally taking a mature stance on service deprecations!
Currently, the new page lists initial updates for services like Amazon Timestream for LiveAnalytics (closing to new customers), Amazon Pinpoint, and others reaching the end of support.
With the new page, the company follows Microsoft Azure and Google Cloud, which already offer mechanisms for communicating service lifecycles through resources like Azure's Lifecycle Policy and Google Cloud's documentation and release notes.
For instance, Microsoft Azure provides its Microsoft Lifecycle Policy page, offering searchable information on support timelines, and Azure Updates for broader announcements. At the same time, Google Cloud communicates through service-specific documentation, release notes, and defined Product Launch Stages.
Lastly, the authors of the AWS News blog post on the new page recommend bookmarking the page and checking out What’s New with AWS? for upcoming AWS service availability updates.

MMS • Daniel Curtis
Article originally posted on InfoQ. Visit InfoQ

TanStack has released the first stable version of TanStack Form, a cross-framework form library with support for React, Vue, Angular, Solid, and Lit. This new addition to the TanStack ecosystem enters a space occupied by popular form libraries such as Formik, React Hook Form, and Final Form.
TanStack Form launches with support for five major front-end frameworks: React, Vue, Angular, Solid and Lit. This aligns with the broader TanStack philosophy of creating headless and framework-agnostic components. Looking at some of the comparisons to other libraries within the ecosystem, such as Formik or React Hook Form, TanStack Form supports a wider variety of UI frameworks from day one.
In addition to cross-framework compatibility, TanStack Form also supports multiple runtimes, including mobile environments like React Native and NativeScript, as well as server-rendered environments such as Next.js and TanStack Start. This broad compatibility means developers can adopt TanStack Form regardless of their platform and runtime.
In general, the development community seems excited about the announcement; however, even Corbin Crutchley, the lead developer of the project, mentioned on Reddit that if you are already happy with your existing React Hook Form setup, he wouldn't suggest migrating right away:
“FWIW if you’re already happy with RHF I wouldn’t inherently suggest migrating away with it; it’s a well maintained library by a talented group of maintainers. It’s clearly not going anywhere – even with our entry in the space …”
You may not want to switch, but TanStack Form does offer first-class type safety, server-side rendering (SSR) support, and a consistent API across frameworks.
A technical detail that is subtle in the documentation but worth highlighting is that TanStack Form uses TanStack Store under the hood, which in turn uses signals for its state management. The architectural choice of signals and their fine-grained reactivity should prevent unnecessary re-renders and ultimately lead to better performance across the board, especially when working with larger forms with complex validation rules.
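To make that field-scoped reactivity concrete, here is a minimal React sketch based on the `useForm`/`form.Field` API described in the TanStack Form documentation; the form and field names are illustrative, and the exact API should be verified against the current docs:

```tsx
// Minimal sketch of TanStack Form in React (field names are illustrative).
// Each <form.Field> subscribes to its own slice of the underlying TanStack
// Store, so typing into one field should not re-render unrelated fields.
import { useForm } from "@tanstack/react-form";

export function SignupForm() {
  const form = useForm({
    defaultValues: { firstName: "", email: "" },
    onSubmit: async ({ value }) => {
      console.log("submitted", value);
    },
  });

  return (
    <form
      onSubmit={(e) => {
        e.preventDefault();
        form.handleSubmit();
      }}
    >
      <form.Field
        name="firstName"
        children={(field) => (
          <input
            value={field.state.value}
            onBlur={field.handleBlur}
            onChange={(e) => field.handleChange(e.target.value)}
          />
        )}
      />
      <button type="submit">Submit</button>
    </form>
  );
}
```

Because field state is read inside the field's render function, a keystroke in one field should only re-render that field's subtree, which is the benefit the signals-based store is meant to provide.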
For validation, the library follows the Standard Schema specification, which is implemented by validation libraries such as Zod and Valibot. The benefit of following the Standard Schema specification is that TanStack Form is not tied to a single validation library and can instead support any validation library that implements the Standard Schema interface.
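As a hedged sketch of what this can look like in practice (based on the TanStack Form and Zod documentation; the field name and message are illustrative), a Zod schema can be handed directly to a field's validators:

```tsx
// Sketch: field-level validation with a Zod schema via the Standard Schema
// interface; in TanStack Form v1 no adapter package should be needed.
import { useForm } from "@tanstack/react-form";
import { z } from "zod";

export function EmailField() {
  const form = useForm({ defaultValues: { email: "" } });

  return (
    <form.Field
      name="email"
      // The Zod schema is passed directly because Zod implements Standard Schema.
      validators={{ onChange: z.string().email("Please enter a valid email") }}
      children={(field) => (
        <>
          <input
            value={field.state.value}
            onChange={(e) => field.handleChange(e.target.value)}
          />
          {/* Issues reported by the schema end up in field.state.meta.errors */}
          {field.state.meta.errors.length > 0 && (
            <em>Please enter a valid email</em>
          )}
        </>
      )}
    />
  );
}
```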
TanStack is a collection of libraries to support common requirements in web development. Created by Tanner Linsley, it began with the popular data fetching library React Query (now known as TanStack Query). Since then the ecosystem has grown to include TanStack Table, TanStack Router, TanStack Virtual, and now TanStack Form. The libraries are widely adopted across the front-end community with millions of downloads per month; TanStack Query alone currently has 9.5 million weekly downloads.
Developers can find the full documentation on the TanStack website, complete with detailed getting started guides, full API documentation and interactive examples. The project is also open source and hosted on GitHub.