Microsoft Officially Supports Rust on Azure with First SDK Beta

MMS Founder
MMS Steef-Jan Wiggers

Article originally posted on InfoQ. Visit InfoQ

Microsoft has released the first beta of its official Azure SDK for Rust, enabling Rust developers to interact with Azure services. The initial release includes libraries for essential components such as Identity, Key Vault (secrets and keys), Event Hubs, and Cosmos DB.

This move signifies Microsoft’s recognition of the growing importance and adoption of the Rust programming language, both within the company and in the broader developer ecosystem. Rust is gaining popularity due to its performance, reliability, and memory safety features, making it well-suited for systems programming and high-performance applications. Its strong type system and ownership model help prevent common programming errors, leading to more secure and stable code. At the same time, its modern syntax and tooling contribute to a positive developer experience.

The beta SDK provides Rust developers with libraries designed to integrate with Rust’s package management system (cargo) and coding conventions. The included libraries, known as “crates” in the Rust ecosystem, can be added as dependencies to Rust projects using the cargo add command.

For example, to use the Identity and Key Vault Secrets libraries, a developer can run the following command:

cargo add azure_identity azure_security_keyvault_secrets tokio --features tokio/full

Next, the developer can import the necessary modules from the Azure SDK crates. The code for creating a new secret client using the DefaultAzureCredential would look like this:

// Imports from the crates added with cargo add above
use azure_identity::DefaultAzureCredential;
use azure_security_keyvault_secrets::SecretClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a credential using DefaultAzureCredential
    let credential = DefaultAzureCredential::new()?;
    // Initialize the SecretClient with the Key Vault URL and credential
    let client = SecretClient::new(
        "https://your-key-vault-name.vault.azure.net/",
        credential.clone(),
        None,
    )?;

    // Additional code will go here...

    Ok(())
}

Following the Azure SDK for Rust release, Microsoft’s Cosmos DB team released the Azure Cosmos DB Rust SDK, which provides an idiomatic API for performing operations on databases, containers, and items. Theo van Kraay, a product manager for Cosmos DB at Microsoft, wrote:

With its growing ecosystem and support for WebAssembly, Rust is increasingly becoming a go-to language for performance-critical workloads, cloud services, and distributed systems like Azure Cosmos DB.

While Microsoft is now officially entering the Rust cloud SDK space with this beta release, Amazon Web Services (AWS) already offers a mature and official AWS SDK for Rust. This SDK provides a comprehensive set of crates, each corresponding to an AWS service, allowing Rust developers to build applications that interact with the vast array of AWS offerings.

Looking ahead, Microsoft plans to expand the Azure SDK for Rust by adding support for more Azure services and refining the existing beta libraries. The goal is to stabilize these libraries and provide a robust and user-friendly experience. Future improvements are expected to include buffering entire responses in the pipeline to ensure consistent policy application (such as retry policies) and deserializing arrays as an empty Vec in most cases to simplify code.

Lastly, developers interested in getting started with the Azure SDK for Rust can find detailed documentation, code samples, and installation instructions on the project’s GitHub repository, and can watch the repository for new releases of the SDK.



MongoDB: Does Hope Remain for the Stock After Massive Post Earnings Fall?

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news

MarketBeat tracked 15 analysts who updated their price target on Mar. 6 or later. On average, they lowered their target by 23%. This indicates that, in general, Wall Street doesn’t see the move in MongoDB stock as a drastic overreaction. However, the average updated price target of $294 still shows an implied upside north of 55% versus MongoDB’s Mar. 20 closing price.

This raises the question: Is MongoDB a buy-the-dip opportunity, or is there too much moving against this stock? Additionally, do its long-term opportunities remain?

Below are details of MongoDB’s latest earnings and a perspective on what the stock’s future holds.

For the last quarter, MongoDB posted results that greatly exceeded analyst estimates. Its adjusted earnings per share of $1.28 were nearly double the $0.66 per share analysts anticipated. Additionally, the company grew revenue by 20% in the quarter, well above the nearly 14% projected.

The company pulled forward significant earnings and revenue in Q4, helping it achieve these big beats. Due to accounting rules for multi-year licenses, MongoDB had to recognize over $10 million more in revenue from its Enterprise Advanced product in the quarter than it expected. Higher-than-expected consumption revenue from its Atlas segment also contributed.

However, even after adjusting for this pull-forward, MongoDB still fell short of expectations on earnings growth. Wall Street expected $4.04 of adjusted EPS in total for fiscal Q4 2025 and fiscal 2026 combined. Based on MongoDB’s results and midpoint guidance, it only sees $3.81 of adjusted EPS over those five quarters. That is a miss of just under 6%.

Atlas is MongoDB’s product that it manages for customers on the cloud. Corporations manage Enterprise Advanced, also known as “Non-Atlas,” on-premises. Enterprise Advanced offers customers greater control and customization because of this.

Overall, markets have been hoping that further adoption of AI will reaccelerate the company’s annual revenue growth. This is because developers can use MongoDB’s database to build AI applications, a need that would drive demand. In fiscal 2024, revenue grew by 31%. In fiscal 2025, it grew by 19%. Now, MongoDB is forecasting growth of just 12% for fiscal 2026.

This big drop in growth has raised concerns. One factor causing this lower growth forecast is the fact that MongoDB signed many multi-year deals in fiscal 2024 and 2025. As a result, MongoDB has fewer customers with whom to renew deals in fiscal 2026.

This declining growth rate comes as MongoDB didn’t offer commentary suggesting that AI demand would have a big impact soon. The company mentioned that it expects AI-related progress to be gradual in fiscal 2026. This is because “most enterprise customers are still developing in-house skills to leverage AI effectively.”

Management added that they “expect the benefits of AI to be only modestly incremental to revenue growth in fiscal 2026.” Overall, the company sees fiscal 2026 as a “transition year” as it looks to prepare for its AI opportunity.

MongoDB Price Chart

At this point, MongoDB has a lot of negative sentiment around it. This is due to its weak guidance and management commentary that the AI opportunity may be farther away than some had hoped. Still, MongoDB remains bullish on AI, seeing it as a “once-in-a-generation” shift.

Yet, on the latest earnings call, the firm did not explicitly reiterate its statement from Q3 that it believes it will capture its “fair share of successful AI applications.” Overall, MongoDB still has a solid chance to capitalize on this opportunity in the long term. However, given the negative sentiment around this stock and the general tech sector, it may be best to wait on the sidelines until more positive news regarding AI adoption arises.

Original Post

Article originally posted on mongodb google news. Visit mongodb google news



OpenSilver 3.2 Extends Silverlight and WPF to Mobile Devices

MMS Founder
MMS Edin Kapic

Article originally posted on InfoQ. Visit InfoQ

OpenSilver 3.2, the latest version of the remake of Microsoft’s retired Silverlight web application framework, extends the platform to mobile devices using a .NET MAUI Hybrid approach.

OpenSilver was launched in October 2021 by a French company, Userware. It is an open-source, MIT-licensed reimplementation of Silverlight. OpenSilver compiles .NET and XAML code into HTML5 and WebAssembly, reimplementing standard and selected third-party controls. It allows developers to reuse their legacy Silverlight or XAML applications instead of rewriting them. The previous update, version 3.1 in December 2024, added a Visual Studio Code XAML designer and modern themes for the UI.

Until now, OpenSilver focused on WPF and Silverlight code parity on modern browsers. In version 3.2, there are new project templates featuring .NET MAUI hybrid applications. In these applications, there is a native app container for every platform that hosts the native WebView, which runs the OpenSilver application.

The benefits for developers include maintaining a single codebase with a consistent UI on all platforms. Native apps can use their platform APIs in the OpenSilver applications. Lastly, native apps are packaged and distributed on the app stores, making it easier for users to find them. The included ToDoCalendar sample application, which illustrates how cross-platform OpenSilver works, was ported from an app originally written for the defunct Windows Phone.

Right now, the supported platforms for .NET MAUI hybrid OpenSilver applications are iOS, Android, Windows, macOS, web browsers and Linux (using Photino, a native Linux framework using web technologies).

When asked how developers have received OpenSilver, Giovanni Albani, Userware founder and CEO, answered:

The reception has been remarkably positive, particularly among enterprise development teams with significant WPF investments. We’ve seen adoption across finance, healthcare, and manufacturing sectors, where preserving complex business logic while expanding platform reach is essential. What resonates most with businesses is the cost-effectiveness of modernizing existing applications without rebuilding from scratch, while leveraging their teams’ existing C# and XAML expertise. Microsoft’s recommendation of WPF for new desktop application development has reinforced this interest. Interestingly, new generations of developers are discovering WPF through OpenSilver and finding its capabilities impressive. They appreciate that it’s a stable, well-documented platform with abundant samples and tutorials.

Other features in this release include a couple of updates for WPF compatibility, such as improved event bubbling (matching behaviour present in WPF) and support for RTL languages. The company intends to further improve WPF code compatibility in future releases. Additionally, the XAML editor previously available for Visual Studio Code is now available for Visual Studio 2022.

According to Albani, the XAML visual editor for VS Code generated particular excitement in the developer community, with its Reddit announcement garnering over 100,000 views and more than 120 comments.

The OpenSilver source code is available on GitHub. The repository containing OpenSilver has 1067 stars and has been forked 122 times. Beyond the Userware developer team, there are other active contributions to the project, with a total of 47 contributors. According to the OpenSilver website, companies that rely on this framework include Bayer, TATA, KPMG, and others.



.NET 10 Preview 2: Enhanced Reconnection UI, Improved Blazor Features, and Updates Across the Stack

MMS Founder
MMS Almir Vuk

Article originally posted on InfoQ. Visit InfoQ

Last week, the .NET Team announced the second preview release of .NET 10, introducing several enhancements across various components, including the .NET Runtime, SDK, libraries, C#, ASP.NET Core, Blazor, .NET MAUI, and more.

One of the key updates in ASP.NET Core focuses on improving both the user experience and development flexibility. A notable addition is the ReconnectModal component for Blazor Web App projects. This new UI component gives developers more control over the reconnection process when a client loses its WebSocket connection.

As stated, the component is bundled with its own stylesheet and JavaScript files, ensuring it adheres to stricter Content Security Policy (CSP) settings. Developers can now better manage the reconnection state with new features such as the retrying state and an event to signal state changes.

Furthermore, the NavigateTo function in Blazor has been updated to prevent the browser from scrolling to the top of the page when navigating to the same page, improving the user experience by maintaining the scroll position.

Another update to the NavLink component allows it to ignore query strings and fragments when using NavLinkMatch.All, offering more precise control over link-matching behavior. Also, a new method, CloseColumnOptionsAsync, is introduced for closing column options in the QuickGrid component, simplifying UI interactions.

Continuing from Preview 1, XML documentation comments are now included in OpenAPI documents when using ASP.NET Core, providing, as stated, richer metadata for APIs. This feature requires enabling XML documentation in the project file and is processed at build time.

Developers should also note that an upgrade to OpenAPI.NET v2.0.0-preview7 brings several improvements, including breaking changes that affect document transformers and other OpenAPI operations.

In .NET MAUI, several new features are introduced, including the ShadowTypeConverter for improved shadow styling, which supports flexible shadow configurations using formatted strings. The SpeechOptions API now includes a Rate property to control the playback speed for text-to-speech functionality.

A platform-specific update allows modals to be displayed as popovers on iOS and Mac Catalyst, giving developers more control over modal presentations. The Switch control now supports setting an OffColor in addition to the existing OnColor. The HybridWebView component introduces a new InvokeJavascriptAsync method, simplifying the interaction with JavaScript in WebView controls.

Entity Framework Core 10 in this preview adds support for the RightJoin operator, allowing developers to perform right joins in queries, complementing the previously introduced LeftJoin operator. Small improvements have also been made to optimize database root handling and other internal processes.

WPF has seen performance improvements, particularly in UI automation, file dialogs, and pixel format conversions. Bug fixes in Fluent style components have addressed issues with default styles for Label and animation for the Expander controls. Several other bug fixes have been applied, including resolving issues with text pointer normalization and localization for various UI components.

C# 14 introduces partial instance constructors and events, expanding the range of partial members that can be used in classes. These features, along with partial methods and properties introduced in previous versions, enhance the flexibility and modularity of class implementations, especially for use with source generators.

For those interested in further details and other changes, the full release notes are available, offering an in-depth look at the second preview of .NET 10.



FaunaDB shuts down but hints at open source future – The Register

MMS Founder
MMS RSS

Posted on nosqlgooglealerts. Visit nosqlgooglealerts

FaunaDB – the database that promised relational power with document flexibility – will shut down its service at the end of May. The biz says it plans to release an open-source version of its core tech.

The startup pocketed $27 million in VC funding in 2020 and boasted that 25,000 developers worldwide were using its serverless database.

However, last week, FaunaDB announced that it would sunset its database services.

“Driving broad based adoption of a new operational database that runs as a service globally is very capital intensive. In the current market environment, our board and investors have determined that it is not possible to raise the capital needed to achieve that goal independently,” the leadership team said.

“While we will no longer be accepting new customers, existing Fauna customers will experience no immediate change. We will gradually transition customers off Fauna and are committed to ensuring a smooth process over the next several months,” it added.

FaunaDB said it plans to release an open-source version of its core database technology. The system stores data in JSON documents but retains relational features like consistency, support for joins and foreign keys, and full schema enforcement. Fauna’s query language, FQL, will also be made available to the open-source community.

Matt Freels, a former technical lead on Twitter’s database team, co-founded FaunaDB along with fellow Twitter veteran Evan Weaver, who left in 2023.

The database biz counted development teams at Tyson Foods, Unilever, Lexmark, and software company Anaplan among its customers.

One commentator suggested that FaunaDB might have gained more market traction if it had employed the open-source model from the beginning.

“While there is no alternative history, I wonder what would have happened if Fauna had chosen to start as Open Source, become 100x more widely adopted but monetize a smaller portion of their customers due to ‘competing’ open source alternative,” posited Peter Zaitsev, an early MySQL engineer and founder of database consultancy Percona. ®



MongoDB Expands Availability of MongoDB Atlas in Southeast Asia to Support Accelerated …

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news


MongoDB Atlas now available on AWS Regions in Malaysia and Thailand // Local availability empowers businesses across the region—including Botnoi and EasyEat—to innovate and improve user experiences

SINGAPORE – Media OutReach Newswire – 25 March 2025 – MongoDB, the leading database for modern applications, today announced that its industry-leading, multi-cloud data platform, MongoDB Atlas, is now available on Amazon Web Services (AWS) Regions in Malaysia and Thailand. This expansion comes on the heels of significant growth in the Southeast Asia region with headcount growing over 200% in the last two years and the region becoming a key part of MongoDB’s Asia Pacific business, which has reached more than $220 million USD in annual revenue.


Article originally posted on mongodb google news. Visit mongodb google news



Which Stocks Benefit From AI Spending? Analyst Names IBM And More – Benzinga

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news

Wedbush analyst Daniel Ives discusses how software companies stand to benefit from the current wave of AI spending.

In recent weeks, the analyst has been tracking enterprise AI spending in 2025, focusing on use cases and identifying leading vendors. AI now makes up about 12% of many IT budgets, up from 10% in January, with some CIOs accelerating their AI strategies.

Nvidia Corp NVDA chips and cloud providers are central to AI deployments, and for every $1 spent on Nvidia, there is an $8-$10 impact across the broader tech ecosystem.

Around 70% of companies have increased their AI budgets, signaling rapid tech spending despite economic uncertainties.


The analyst has been tracking AI adoption across industries like financial services, healthcare, transportation, and manufacturing.

Many IT departments are focusing on large-scale AI deployments with Microsoft Corp MSFT, Amazon.com Inc AMZN, and Alphabet Inc‘s Google GOOGL, targeting software-driven use cases.

While customers were still in the strategy phase in late 2024, there is a shift in recent weeks, with high-priority AI use cases now being rapidly implemented across various sectors in 2025. Amid the widespread buzz in the software industry about monetizing AI, a select group of companies is beginning to stand out from the crowd.

The software sector is poised for significant growth as the AI Revolution accelerates, with the adoption of generative AI and LLM models becoming key drivers.

The analyst sees Palantir Technologies Inc PLTR and Salesforce Inc CRM as top software plays amid the AI Revolution for 2025, with other vendors like Oracle Corp ORCL, International Business Machines Corporation IBM, Snowflake Inc SNOW, Elastic N.V. ESTC, MongoDB, Inc. MDB and Pegasystems Inc. PEGA also well-positioned.

IBM’s cloud penetration has been particularly successful, presenting a major monetization opportunity. Despite investor skepticism, the analyst said the accelerating AI adoption in enterprises will benefit the software industry in 2025. The analyst also added IBM to the Wedbush Best Ideas List, reflecting higher confidence.


Market News and Data brought to you by Benzinga APIs

Article originally posted on mongodb google news. Visit mongodb google news



MongoDB Inc (MDB) Shares Up 3.11% on Mar 24 – GuruFocus

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news

Shares of MongoDB Inc (MDB, Financial) surged 3.11% in mid-day trading on Mar 24. The stock reached an intraday high of $199.48, before settling at $198.53, up from its previous close of $192.54. This places MDB 48.73% below its 52-week high of $387.19 and 14.67% above its 52-week low of $173.13. Trading volume was 533,251 shares, 23.3% of the average daily volume of 2,288,327.

Wall Street Analysts Forecast


Based on the one-year price targets offered by 34 analysts, the average target price for MongoDB Inc (MDB, Financial) is $307.18 with a high estimate of $520.00 and a low estimate of $180.00. The average target implies an upside of 54.73% from the current price of $198.53. More detailed estimate data can be found on the MongoDB Inc (MDB) Forecast page.

Based on the consensus recommendation from 38 brokerage firms, MongoDB Inc’s (MDB, Financial) average brokerage recommendation is currently 2.0, indicating “Outperform” status. The rating scale ranges from 1 to 5, where 1 signifies Strong Buy, and 5 denotes Sell.

Based on GuruFocus estimates, the estimated GF Value for MongoDB Inc (MDB, Financial) in one year is $406.78, suggesting an upside of 104.9% from the current price of $198.53. GF Value is GuruFocus’ estimate of the fair value that the stock should be traded at. It is calculated based on the historical multiples the stock has traded at previously, as well as past business growth and the future estimates of the business’ performance. More detailed data can be found on the MongoDB Inc (MDB) Summary page.

This article, generated by GuruFocus, is designed to provide general insights and is not tailored financial advice. Our commentary is rooted in historical data and analyst projections, utilizing an impartial methodology, and is not intended to serve as specific investment guidance. It does not formulate a recommendation to purchase or divest any stock and does not consider individual investment objectives or financial circumstances. Our objective is to deliver long-term, fundamental data-driven analysis. Be aware that our analysis might not incorporate the most recent, price-sensitive company announcements or qualitative information. GuruFocus holds no position in the stocks mentioned herein.

Article originally posted on mongodb google news. Visit mongodb google news



Java News Roundup: JDK 24, GraalVM for JDK 24, Payara Platform, Kafka 4.0, Spring CVEs, JavaOne 2025

MMS Founder
MMS Michael Redlich

Article originally posted on InfoQ. Visit InfoQ

This week’s Java roundup for March 17th, 2025 features news highlighting: the GA releases of JDK 24 and Apache Kafka 4.0; the March 2025 edition of the Payara Platform; Spring Security CVEs; and JavaOne 2025.

JDK 24

Oracle has released version 24 of the Java programming language and virtual machine, which ships with a final feature set of 24 JEPs. More details may be found in this InfoQ news story.

JDK 25

Build 15 of the JDK 25 early-access builds was also made available this past week featuring updates from Build 14 that include fixes for various issues. Further details on this release may be found in the release notes.

For JDK 24 and JDK 25, developers are encouraged to report bugs via the Java Bug Database.

GraalVM

In conjunction with the release of JDK 24, the release of GraalVM for JDK 24 by Oracle Labs delivers new features such as: Graal Neural Network (GNN), a new static profiler that “provides a new generation of machine learning-powered profile inference” (available only on Oracle GraalVM); SkipFlow, a new experimental extension to Native Image static analysis that “tracks primitive values and evaluates branching conditions dynamically during the analysis process;” and an optimized Foreign Function and Memory API in Native Image with new specialized upcalls for direct method handles. More details on this release may be found in the release notes and the upcoming GraalVM for JDK 24 release stream on YouTube. InfoQ will follow up with a more detailed news story.

BellSoft Liberica

The release of Liberica JDK 24, BellSoft’s downstream distribution of OpenJDK 24, includes 2,597 bug fixes in the JDK and 175 bug fixes in JavaFX. Developers may download this latest version from this website.

Project Loom

Build 25-loom+1-11 of the Project Loom early-access builds was made available to the Java community this past week and is based on Build 13 of the JDK 25 early-access builds. This build improves the implementation of Java monitors (synchronized methods) for enhanced interoperability with virtual threads.

Spring Framework

It was a busy week over at Spring as the various teams have delivered milestone releases of Spring Boot, Spring Security, Spring Authorization Server, Spring for GraphQL, Spring Integration, Spring AMQP, Spring for Apache Kafka and Spring Web Services. There were also point releases of Spring Batch and Spring for Apache Pulsar. Further details may be found in this InfoQ news story.

The Spring team has also disclosed two CVEs that affect Spring Security.

Josh Long, Spring Developer Advocate at Broadcom, has tweeted that March 24, 2025 marks 21 years since the formal GA release of Spring Framework 1.0.

Payara

Payara has released their March 2025 edition of the Payara Platform that includes Community Edition 6.2025.3, Enterprise Edition 6.24.0 and Enterprise Edition 5.73.0. All three releases provide bug fixes, security fixes, dependency upgrades and two improvements: the ability to specify the global context root for any deployed application using the payaramicro.globalContextRoot property; and a CORBA update to use the new implementation of Jakarta Concurrency 3.1 instead of synchronized blocks.

This edition also delivers Payara 7.2025.1.Alpha1 featuring full support for the Jakarta EE 11 Core Profile, released in December 2024, along with the same improvements in Payara Platform 6. More details on these releases may be found in the release notes for Community Edition 6.2025.3 and Enterprise Edition 6.24.0 and Enterprise Edition 5.73.0.

Apache Software Foundation

The release of Apache Kafka 4.0.0 delivers bug fixes, many improvements and new features such as: client support for subscribing with the new SubscriptionPattern class; and the ability for developers to rebootstrap clients based on timeout or error code. Further details on this release may be found in the release notes.

Hibernate

The fifth beta release of Hibernate ORM 7.0.0 features: a migration to the Jakarta Persistence 3.2 specification, the latest version targeted for Jakarta EE 11; a baseline of JDK 17; improved domain model validations; and a migration from Hibernate Commons Annotations (HCANN) to the new Hibernate Models project for low-level processing of an application domain model. More details on this release may be found in the release notes and the migration guide.

The second alpha release of Hibernate Search 8.0.0 ships with: fixes discovered in the Alpha1 release; an alignment with Hibernate ORM 7.0.0.Beta4 that implements the Jakarta Persistence 3.2 specification; and dependency upgrades to Lucene 10.1.0, OpenSearch 2.19 and Elasticsearch Client 8.17.3. Further details on this release may be found in the release notes.

Kotlin

The release of Kotlin 2.1.20 provides new features such as: a new K2 compiler kapt plugin that is enabled by default for all projects; and an experimental DSL that replaces the Gradle Application plugin, which is no longer compatible with the Kotlin Multiplatform Gradle plugin. More details on this release may be found in the release notes.

RefactorFirst

Jim Bethancourt, principal software consultant at Improving, an IT services firm offering training, consulting, recruiting, and project services, has released version 0.7.0 of RefactorFirst, a utility that prioritizes the parts of an application that should be refactored. This release delivers: improved rendering of class maps and cycles using the 3D Force-Directed Graph; and limiting the cycle analysis to 10 cycles that will be parameterized in the future. Further details on this release may be found in the release notes.

JavaOne 2025

JavaOne 2025 was held on March 18-20, 2025 at the Oracle Conference Center in Redwood Shores, California. This three-day event, consisting of keynotes, presentations and hands-on labs, is organized by Oracle and the developer relations team. The session catalog provides all of the details. Details on Day One of JavaOne, in conjunction with the formal release of JDK 24, may be found in this InfoQ news story. InfoQ will follow up with additional news stories.



Presentation: How GitHub Copilot Serves 400 Million Completion Requests a Day

MMS Founder
MMS David Cheney

Article originally posted on InfoQ. Visit InfoQ

Transcript

Cheney: GitHub Copilot is the largest LLM powered code completion service in the world. We serve hundreds of millions of requests a day with an average response time of under 200 milliseconds. The story I’m going to cover in this track is the story of how we built this service.

I’m the cat on the internet with glasses. In real life, I’m just some guy that wears glasses. I’ve been at GitHub for nearly five years. I’ve worked on a bunch of backend components, which none of you know about, but you interact with every day. I’m currently the tech lead on copilot-proxy, which is the service that connects your IDEs to LLMs hosted in Azure.

What is GitHub Copilot?

GitHub is the largest social coding site in the world. We have over 100 million users. We’re very proud to call ourselves the home of all developers. The product I’m going to talk to you about is GitHub Copilot, specifically the code completion part of the product. That’s the bit that I work on. Copilot does many other things: chat, interactive refactoring, and so on. They broadly use the same architecture and infrastructure I’m going to talk to you about, but the details vary subtly. GitHub Copilot is available as an extension. You install it in your IDE. We support most of the major IDEs: VS Code, Visual Studio, obviously, the pantheon of IntelliJ IDEs, Neovim, and we recently announced support for Xcode. Pretty much you can get it wherever you want. We serve more than 400 million completion requests a day. That was when I pitched this talk. I had a look at the number. It’s much higher than that these days.

We peak at about 8,000 requests a second during that peak period between the European afternoon and the U.S. work day. That’s our peak period. During that time, we see a mean response time of less than 200 milliseconds. Just in case there is one person who hasn’t seen GitHub Copilot in action, here’s a recording of me just working on some throwaway code. The goal, what you’ll see here is that we have the inbuilt IDE completions, those are the ones in the box, and Copilot’s completions, which are the gray ones, which we notionally call ghost text because it’s gray and ethereal. You can see as I go through here, every time that I stop typing or I pause, Copilot takes over. You can see you can write a function, a comment describing what you want, and Copilot will do its best to write it for you. It really likes patterns. As you see, it’s figured out the pattern of what I’m doing. We all know how to make prime numbers. You pretty much got the idea. That’s the product in action.

Building a Cloud Hosted Autocompletion Service

Let’s talk about the requirements of how we built the service that powers this on the backend, because the goal is interactive code completion in the IDE. In this respect, we’re competing with all of the other interactive autocomplete built into most IDEs. That’s your anything LSP powered, your Code Sense, your IntelliSense, all of that stuff. This is a pretty tall order because those things running locally on your machine don’t have to deal with network latency. They don’t have to deal with shared server resources. They don’t have to deal with the inevitable cloud outage. We’ve got a pretty high bar that’s been set for us. To be competitive, we need to do a bunch of things.

The first one is that we need to minimize latency before and between requests. We need to amortize any setup costs that we might make because this is a network service. To a point, we need to avoid as much network latency as we can because that’s overhead that our competitors that sit locally in IDEs don’t have to pay. The last one is that the length and thus time to generate a code completion response is very much a function of the size of the request, which is completely variable. One of the other things that we do is rather than waiting for the whole request to be completed and then send it back to the user, we work in a streaming mode. It doesn’t really matter how long the request is, we immediately start streaming it as soon as it starts. This is quite a useful property because it unlocks other optimizations.

I want to dig into this connection setup is expensive idea. Because this is a network service, we use TCP. TCP uses the so-called 3-way handshake, SYN, SYN-ACK, ACK. On top of that, because this is the internet and it’s 2024, everything needs to be secured by TLS. TLS takes between 5 to 7 additional legs to do that handshaking, negotiate keys in both directions. Some of these steps can be pipelined. A lot of work has gone into reducing these setup costs and overlaying the TLS handshake with the TCP handshake. These are great optimizations, but they’re not a panacea. There’s no way to drive this network cost down to zero. Because of that, you end up with about five or six round trips between you and the server and back again to make a new connection to a service.

The duration of each of those legs is highly correlated with distance. This graph that I shamelessly stole from the internet says about 50 milliseconds a leg, which is probably East Coast to West Coast time. Where I live, on the other side of an ocean, we see round trip times far in excess of 100 milliseconds. When you add all that up and doing five or six of them, that makes connection setup really expensive. You want to do it once and you want to keep it open for as long as possible.

Evolution of GitHub Copilot

Those are the high-level requirements. Let’s take a little bit of a trip back in time and look at the history of Copilot as it evolved. Because when we started out, we had an extension in VS Code. To use it, the alpha users would go to OpenAI, get an account, get their key added to a special group, then go and plug that key into the IDE. This worked. It was great for an alpha product. It scaled to literally dozens of users. At that point, everyone got tired of being in the business of user management. OpenAI don’t want to be in the business of knowing who our users are. Frankly, we don’t want that either. What we want is a service provider relationship. That kind of thing is what you’re used to when you consume a cloud service. You get a key to access the server resources. Anytime someone uses that key, a bill is emitted. The who is allowed to do that under what circumstances is entirely your job as the product team. We’re left with this problem of, how do we manage this service key? Let’s talk about the wrong way to do it.

The wrong way to do it would be to encode the key somehow in the extension that we give to users in a way that it can be extracted and used by the service but is invisible to casual or malicious onlookers. This is impossible. This is at best security by obscurity. It doesn’t work, just ask the rabbit r1 folks. The solution that we arrived at is to build an authenticating proxy which sits in the middle of this network transaction. The name of the product is copilot-proxy because it’s just an internal name. I’m just going to call it proxy for the rest of this talk. It was added shortly after our alpha release to move us out of this period of having user provided keys to a more scalable authentication mechanism.

What does this workflow look like now? You install the extension in your IDE just as normal, and you authenticate to GitHub just as normal. That creates a kind of OAuth relationship, where there’s an OAuth key which identifies that installation on that particular machine for that person who’s logged in at that time. The IDE now can use that OAuth credential to go to GitHub and exchange that for what we call a short-lived code completion token. The token is just like a train ticket. It’s just an authorization to use it for a short period of time. It’s signed. When the request lands on the proxy, all we have to do is just check that that signature is valid. If it’s good, we swap the key out for the actual API service key, forward it on, and stream back the results. We don’t need to do any further validation. This is important because it means for every request we get, we don’t have to call out to an external authentication service. The short-lived token is the authentication.

From the client’s point of view, nothing’s really changed in their world. They still think they’re talking to a model, and they still get the response as usual. This token’s got a lifetime in the order of minutes, 10, 20, 30 minutes. This is mainly to limit the liability if, say, it was stolen, which is highly unlikely. The much more likely and sad case is that in the cases of abuse, we need the ability to shut down an account and therefore not generate new tokens. That’s generally why the token has a short lifetime. In the background, the client knows the expiration time of the token it was given, and a couple of minutes before that, it kicks off a refresh cycle, gets a new token, swaps it out, the world continues.
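The talk does not show what these tokens look like, but the core idea of a self-validating, short-lived credential can be sketched in Go. The format below (a payload carrying an embedded expiry plus an HMAC signature) is a hypothetical illustration, not GitHub’s actual scheme:

package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// verifyToken checks a hypothetical "user=...;exp=<unix>;sig=<hex hmac>" token.
// The point of the design: the signature and the embedded expiry are all the
// proxy needs to check on the hot path, so no external auth service is called.
func verifyToken(token string, key []byte) error {
	i := strings.LastIndex(token, ";sig=")
	if i < 0 {
		return errors.New("malformed token")
	}
	payload, sig := token[:i], token[i+len(";sig="):]

	// Recompute the HMAC over the payload and compare in constant time.
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(payload))
	if !hmac.Equal([]byte(hex.EncodeToString(mac.Sum(nil))), []byte(sig)) {
		return errors.New("bad signature")
	}

	// Reject tokens past their expiry time (minutes after issuance).
	for _, field := range strings.Split(payload, ";") {
		if exp, ok := strings.CutPrefix(field, "exp="); ok {
			unix, err := strconv.ParseInt(exp, 10, 64)
			if err != nil || time.Now().Unix() > unix {
				return errors.New("token expired")
			}
			return nil
		}
	}
	return errors.New("token has no expiry")
}

func main() {
	key := []byte("shared-secret") // illustrative only
	payload := fmt.Sprintf("user=alice;exp=%d", time.Now().Add(30*time.Minute).Unix())
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(payload))
	token := payload + ";sig=" + hex.EncodeToString(mac.Sum(nil))
	fmt.Println(verifyToken(token, key)) // <nil>
}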

When Should Copilot Take Over?

We solved the access problem. That’s half the problem. Now to talk about another part of the problem. As a product design, we don’t have an autocomplete key. I remember in Eclipse, you would use command + space, things like that, to trigger the autocomplete or to trigger the refactoring tools. I didn’t use that in the example I showed. Whenever I stop typing, Copilot takes over. That creates the question of, when should it take over? When should we switch from user typing to Copilot typing? It’s not a straightforward problem. Some of the solutions we could use is just a fixed timer. We hook the key presses, and after each key press, we start a timer. If that timer elapses without another key press, then we say, we’re ready to issue the request and move into completion mode.

This is good because that provides an upper bound on how long we wait and that waiting is additional latency. It’s bad, because it provides a lower bound. We always wait at least this long before starting, even if that was the last key press the user was going to make. We could try something a bit more science-y and use a tiny prediction model to look at the stream of characters as they’re typed and predict, are they approaching the end of the word or are they in the middle of a word, and nudge that timer forward and back. We could just do things like a blind guess. Any time there’s a key press, we can just assume that’s it, no more input from the user, and always issue the completion request. In reality, we use a mixture of all these strategies.
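The IDE extensions that do this triggering are not written in Go, but the fixed-timer strategy is essentially a debounce. A minimal, illustrative sketch of the idea follows; the 75 ms threshold and the function names are made up:

package main

import (
	"fmt"
	"time"
)

// debounce returns a function that forwards the latest buffer contents only
// after the user has been idle for the given duration; any pending trigger is
// cancelled each time another keystroke arrives.
func debounce(idle time.Duration, fire func(buffer string)) func(string) {
	var timer *time.Timer
	return func(buffer string) {
		if timer != nil {
			timer.Stop() // the user kept typing: drop the pending trigger
		}
		timer = time.AfterFunc(idle, func() { fire(buffer) })
	}
}

func main() {
	onKeystroke := debounce(75*time.Millisecond, func(buffer string) {
		fmt.Println("issue completion request for:", buffer)
	})

	// Keystrokes arrive faster than the idle threshold, so no request is
	// issued until the simulated pause at the end.
	for _, text := range []string{"f", "fn", "fn m", "fn ma", "fn mai", "fn main"} {
		onKeystroke(text)
		time.Sleep(20 * time.Millisecond)
	}
	time.Sleep(200 * time.Millisecond)
}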

That leads us to the next problem, which is, despite all this work and tuning that went into this, around half of the requests that we issued are what we call typed through. Don’t forget, we’re doing autocomplete. If you continue to type after we’ve made a request, you’ve now diverged from the data we had and our request is now out of date. We can’t use that result. We could try a few things to work around this. We could wait longer before a request. That might reduce the number of times that we issue a request and then have to not use the result. That additional latency, that additional waiting penalizes every user who had stopped and was waiting. Of course, if we wait too long, then users might think that Copilot is broken because it’s not saying anything to them anymore. Instead, what we’ve built is a system that allows us to cancel a request once they’ve been issued. This cancellation request using HTTP is potentially novel. I don’t claim it to be unique, but it’s certainly the first time I’ve come across this in my career.

Canceling a HTTP Request

I want to spend a little bit of time digging into what it means to cancel a HTTP request. You’re at your web browser and you’ve decided you don’t want to wait for that page, how do you say, I want to stop, I want to cancel that? You press the stop button. You could close the browser tab. You could drop off the network. You could throw your laptop away. These are all very final actions. They imply that you’re canceling the request because you’re done. Either you’re done with using the site or you’re just frustrated and you’ve given up. It’s an act of finality. You don’t intend to make another request. Under the hood, they all have the same networking behavior. You reset the TCP stream, the connection that we talked about setting up on previous slides. That’s on the browser side. If we look on the server side, either at the application layer or in your web framework, this idea of cancellation is not something that is very common inside web frameworks.

If a user using your application on your site presses stop on the browser or if they Control-C their cURL command, that underlying thing translates into a TCP reset of the connection. On the other end, in your server code, when do you get to see that signal? When do you get to see that they’ve done that? The general times that you can spot that the TCP connection has been reset is either when you’re reading the request body, so early in the request when you’re reading the headers in the body, or later on when you go to start writing your response back there.

This is a really big problem for LLMs, because the cost of the request, that initial inference before you generate the first token, is the majority of the cost. That happens before you produce any output. All that work is performed. We’ve done the inference. We’re ready to start streaming back tokens. Only then do we find that the user closed the socket and they’ve gone. As you saw, in our case, that’s about 45% of the requests. Half of the time, we’d be performing that inference and then throwing the result on the floor, which is an enormous waste of money, time, and energy, which in AI terms is all the same thing.

If this situation wasn’t bad enough, it gets worse. Because cancellation in HTTP world is the result of closing that connection. In our case, in the usage of networking TCP to talk to the proxy, the reason we canceled that request is because we want to make another one straightaway. To make that request straightaway, we don’t have a connection anymore. We have to pay those five or six round trips to set up a new TCP TLS connection. In this naive idea, in this normal usage, cancellation occurs every other request on average. This would mean that users are constantly closing and reestablishing their TCP connections in this kind of signaling they want to cancel and then reestablishing connection to make a new request. The latency of that far exceeds the cost of just letting the request that we didn’t need, run to completion, and then just ignoring it.

HTTP/2, and Its Importance

Most of what I said on the previous slides applies to HTTP/1, which has this one request per connection model. As you’re reading on this slide, HTTP version numbers go above number 1, they go up to 2 and 3. I’m going to spend a little bit of time talking about HTTP/2 and how that was very important to our solution. As a side note, copilot-proxy is written in Go because it has a quite robust HTTP library. It gave us the HTTP/2 support and control over that that we needed for this part of the product. That’s why I’m here, rather than a Rustacean talking to you. This is mostly an implementation detail. HTTP/2 is more like SSH than good old HTTP/1 plus TLS. Like SSH, HTTP/2 is a tunneled connection. You have a single connection and multiple requests multiplexed on top of that. In both SSH and HTTP/2, they’re called streams. A single network connection can carry multiple streams where each stream is a request. We use HTTP/2 between the client and the proxy because that allows us to create a connection once and reuse it over and again.

Instead of resetting the TCP connection itself, you just reset the individual stream representing the request you made. The underlying connection stays open. We do the same between the proxy and our LLM model. Because the proxy is effectively concatenating requests, like fanning them in from thousands of clients down onto a small set of connections to the LLM model, we use a connection pool, a bunch of connections to talk to the model. This is just to spread the load across multiple TCP streams, avoid networking issues, avoid head-of-line blocking, things like that. We found, just like the client behavior, these connections between the proxy and our LLM model are established when we start the process and they live, assuming there are no upstream problems.

They remain open for the lifetime of the process until we redeploy it, so minutes to hours to days, depending on when we choose to redeploy. By keeping these long-lived HTTP/2 connections open, we get additional benefits to the TCP layer. Basically, TCP has this trust thing. The longer a connection is open, the more it trusts it, the more it allows more data to be in fly before it has to be acknowledged. You get these nice, warmed-up TCP pipes that go all the way from the client through the proxy, up to the model and back.
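In Go terms, the proxy-to-model side of this amounts to building one long-lived http.Client at process start and reusing it for every upstream call. A rough sketch, with pool sizes and timeouts that are illustrative rather than GitHub’s real settings:

package main

import (
	"net/http"
	"time"
)

// newUpstreamClient builds a client meant to be created once at process start
// and reused for every request to the model endpoint, so TCP/TLS handshakes
// and congestion-window warm-up are paid rarely rather than per request.
func newUpstreamClient() *http.Client {
	transport := &http.Transport{
		ForceAttemptHTTP2:   true,             // keep HTTP/2 even with customized TLS/dial settings
		MaxIdleConns:        100,              // pool sizes here are illustrative only
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second, // hold warmed-up connections open between requests
	}
	return &http.Client{Transport: transport}
}

func main() {
	_ = newUpstreamClient()
}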

This is not intended to be a tutorial on Go, but for those who do speak it socially, this is what basically every Go HTTP handler looks like. The key here is this req.Context object. Context is effectively a handle. It allows efficient transmission of cancellations and timeouts and those kind of request-specific type meta information. The important thing here is that the other end of this request context is effectively connected out into user land to the user’s IDE. When, by continuing to type, they need to cancel a request, that stream reset causes this context object to be canceled. That makes it immediately visible to the HTTP server without having to wait until we get to a point of actually trying to write any data to the stream.

Of course, this context can be passed up and down the call stack and used for anything that wants to know, should it stop early. We use it in the HTTP client and we make that call onwards to the model, we pass in that same context. The cancellation that happens in the IDE propagates to the proxy and into the model effectively immediately. This is all rather neatly expressed here, but it requires that all parties speak HTTP/2 natively.
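The handler code shown on the slide is not reproduced in the transcript, so the sketch below is a stand-in for the pattern being described: take r.Context(), attach it to the upstream request, and stream the body back. The endpoint URL, header value, and port are placeholders:

package main

import (
	"io"
	"log"
	"net/http"
)

// In practice this would be the long-lived HTTP/2 client built once at startup.
var upstream = &http.Client{}

// completions forwards a completion request to the model. If the IDE resets its
// HTTP/2 stream because the user kept typing, r.Context() is cancelled and the
// in-flight upstream request is torn down immediately instead of running to
// completion and being thrown away.
func completions(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://model.example.invalid/v1/completions", r.Body) // placeholder endpoint
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	req.Header.Set("Authorization", "Bearer <real service key goes here>")

	resp, err := upstream.Do(req)
	if err != nil {
		http.Error(w, "upstream error", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// Stream the body back as it arrives rather than buffering the whole
	// response; a production proxy would also flush after each chunk.
	w.WriteHeader(resp.StatusCode)
	if _, err := io.Copy(w, resp.Body); err != nil {
		log.Printf("stream aborted: %v", err) // usually: the client cancelled
	}
}

func main() {
	// A real deployment would serve HTTP/2 (TLS or h2c); plain HTTP here keeps
	// the sketch short.
	http.HandleFunc("/v1/completions", completions)
	log.Fatal(http.ListenAndServe(":8080", nil))
}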

It turns out this wasn’t all beer and skittles. In practice, getting this end-to-end HTTP/2 turned out to be more difficult than we expected. This is despite HTTP/2 being nearly a decade old. Just general support for this in just general life was not as good as it could be. For example, most load balancers are happy to speak HTTP/2 on the frontend but downgrade to HTTP/1 on the backend. This includes most of the major ALB and NLBs you get in your cloud providers. It, at the time, included all the CDN providers that were available to us. That fact alone was enough to spawn us doing this project. There are also other weird things we ran into.

At the time, and I don’t believe it’s been fixed yet, OpenAI was fronted with nginx. nginx just has an arbitrary limit of 100 requests per connection. After that, they just closed the connection. At the request rates that you saw, it doesn’t take long to chew through 100 requests, and then nginx will drop the connection and force you to reestablish it. That was just a buzz kill.

All of this is just a long-winded way of saying that the generic advice of, yes, just stick your app behind your cloud provider’s load balancer, it will be fine, didn’t work out for us out of the box. Something that did work out for us is GLB. GLB stands for GitHub Load Balancer. It was introduced eight years ago. It’s one of the many things that has spun out of our engineering group. GLB is based on HAProxy. It uses HAProxy under the hood. HAProxy turns out to be one of the very few open-source load balancers that offers just exquisite HTTP/2 control. I’ve never found anything like it. Not only did it speak HTTP/2 end-to-end, but offered exquisite control over the whole connection. What we have is GLB being the GitHub Load Balancer, which sits in front of everything that you interact with in GitHub, actually owns the client connection. The client connects to GLB and GLB holds that connection open. When we redeploy our proxy pods, their connections are gracefully torn down and then reestablished for new pods. GLB keeps the connection to the client open. They never see that we’ve done a redeploy. They never disconnected during that time.

GitHub Copilot’s Global Nature

With success and growth come yet more problems. We serve millions of users around the globe. We have Copilot users in all the major markets, where I live in APAC, Europe, Americas, EMEA, all over the world. There’s not a time that we’re not busy serving requests. This presents the problem that even though all this HTTP/2 stuff is really good, it still can’t change the speed of light. The round-trip time of just being able to send the bits of your request across an ocean or through a long geopolitical boundary or something like that, can easily exceed the actual mean time to process that request and send back the answer. This is another problem. The good news is that Azure, through its partnership with OpenAI, offers OpenAI models in effectively every region that Azure has. They’ve got dozens of regions around the world. This is great. We can put a model in Europe, we can put a model in Asia. We can put a model wherever we need one, wherever the users are. Now we have a few more problems to solve.

In terms of requirements, we want users, therefore, to be routed to their “closest” proxy region. If that region is unhealthy, we want them to automatically be routed somewhere else so they continue to get service. The flip side is also true, because if we have multiple regions around the world, this increases our capacity and our reliability. We no longer have all our eggs in one basket in one giant model somewhere, let’s just say in the U.S. By spreading them around the world, we’re never going to be in a situation where the service is down because it’s spread around the world. To do this, we use another product, again, that spun out of GitHub’s engineering team, called octoDNS. octoDNS, despite its name, is not actually a DNS server. It’s actually just a configuration language to describe DNS configurations that you want. It supports all the good things: arbitrary weightings, load balancing, splitting, sharing, health checks. It allows us to identify users in terms of the continent they’re in, the country.

Here in the United States, we can even work down to the state level sometimes. It gives us exquisite control to say, you over there, you should primarily be going to that instance. You over there, you should be primarily going to that instance, and do a lot of testing to say, for a user who is roughly halfway between two instances, which is the best one to send them to so they have the lowest latency? On the flip side, each proxy instance is looking at the success rate of requests that it sees and it handles. If that success rate drops, if it goes below the SLO, those proxy instances will use the standard health check endpoint pattern. They set their health check status to 500. The upstream DNS providers who have been programmed with those health checks notice that.

If a proxy instance starts seeing its success rate drop, they vote themselves out of DNS. They go take a little quiet time by themselves. When they’re feeling better, they’re above the SLO, they raise the health check status and bring themselves back into DNS. This is now mostly self-healing. It turns a regional outage when we’re like, “All of Europe can’t do completions”, into a just reduction in capacity because traffic is routed to other regions.
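A toy version of that self-check might look like the following, assuming the request path updates a pair of rolling counters elsewhere in the process; the SLO value is invented for illustration:

package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

const slo = 0.99 // illustrative target, not GitHub's real number

// Rolling counters, assumed to be updated by the request-serving path elsewhere.
var total, failed atomic.Int64

// healthz returns 200 while the observed success rate meets the SLO and 500
// otherwise, so DNS health checks pull this region out of rotation until it
// recovers and then bring it back automatically.
func healthz(w http.ResponseWriter, r *http.Request) {
	t, f := total.Load(), failed.Load()
	if t > 0 && float64(t-f)/float64(t) < slo {
		http.Error(w, "below SLO, draining", http.StatusInternalServerError)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	http.HandleFunc("/healthz", healthz)
	log.Fatal(http.ListenAndServe(":8080", nil))
}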

One thing I’ll touch on briefly is one model we experimented with and then rejected because it just didn’t work out for us was the so-called point of presence model. You might have heard it called PoP. If you’re used to working with big CDNs, they will have points of presence. Imagine every one of these dots on this map, they have a data center in, where they’re serving from. The idea is that users will connect and do that expensive connection as close to them as possible and speed up that bit.

Then those CDN providers will cache that data, and if need to, they can call back to the origin server. In our scenario, where I live in Asia, we might put a point of presence in Singapore. That’s a good equidistant place for most of Asia. A user in Japan would be attracted to that Singapore server. There’s a problem because the model is actually still hosted back here on the West Coast. We have traffic that flows westward to Singapore only to turn around and go all the way back to the West Coast. The networking colloquialism for this is traffic tromboning. This is ok for CDN providers because CDN providers, their goal is to cache as much of the information, so they rarely call back to the origin server. Any kind of like round tripping or hairpinning of traffic isn’t really a problem.

For us doing code completions, every request goes back to a model. What we found after a lot of experimentation was that the idea of having many regions calling back to a few models just didn't pay for itself. The latency wasn't as good, and it carried a very high operational burden. Every point of presence that you deploy is now a thing you have to monitor, and upgrade, and deploy to, and fix when it breaks. It just didn't pay for us. We went with a much simpler model: if there is a model in an Azure region, we colocate a proxy instance in that same Azure region, and that is the location users' traffic is sent to.

A Unique Vantage Point

We started out with a proxy whose job was to authenticate users' requests and then mediate them towards an LLM. It turns out it's very handy to be in the middle of all these requests. Some examples of this: we obviously look at latency from the point of view of the client, but that's a very fraught thing to do. I caution you, it's OK to track that number, just don't put it on a dashboard, because someone will be very incentivized to take the average of it, or something like that. You're essentially averaging the experience of everybody on the internet, from somebody who lives way out in the bush on a satellite link to someone who lives next to the AMS-IX data center, and you're effectively trying to take the average of all their experiences. What you get when you do that is effectively the belief that all your users live on a boat in the middle of the Atlantic Ocean.

This vantage point is also good because, while our upstream provider does give us lots of statistics, they're really targeted at how they view running the service; they're their metrics. They have basic request counts and error rates and things like that, but they're not really the granularity we want. More fundamentally, the way that I think about it, to take food delivery as an example: you use the app, you request some food, and about 5 minutes later you get a notification saying, "Your food's ready. We've finished cooking it. We're just waiting for the driver to pick it up". From the restaurant's point of view, their job is done, they hit their SLO, 5 minutes, done. It's another 45 minutes before there's a knock on the door with your food. You don't care how quickly the restaurant prepared your food. What you care about is the total end-to-end time of the request. We handle that by defining in our SLOs that the point we are measuring is at our proxy. It's OK for our service provider to have their own metrics, but we negotiate our SLOs at the data plane, at the split where the proxy sits.
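To make that measurement point concrete, a small Go sketch, hypothetical rather than the actual implementation, could wrap the proxy's completion handler and observe end-to-end duration exactly where the SLO is defined; the observe callback stands in for whatever metrics library is actually in use.

package proxy

import (
	"net/http"
	"time"
)

// sloTimer wraps a handler and records the request duration as seen at the
// proxy, which is the point where the SLO is measured. The observe callback
// is a stand-in for a real histogram metric.
func sloTimer(next http.Handler, observe func(seconds float64)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		observe(time.Since(start).Seconds())
	})
}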

Dealing with a Heterogeneous Client Population

You saw that we support a variety of IDEs, and within each IDE there is a flotilla of different client versions out there. Dealing with the long tail of client versions is the bane of my life. There is always a long tail. When we do a new client release, we'll get to about 80% of the population within 24 to 36 hours. That last 20% will take until the heat death of the universe. I cannot understand how clients can still be running such old software. The auto-update mechanisms are so pervasive and pernicious about getting you to update, I don't quite understand how they manage it, but they do. What this means is that if we have a bug or we need to make a fix, we can't do it in the client. It just takes too long, and we never reach the population that would make rolling out that fix worthwhile. The good news is that we have a service that sits in the middle, the proxy, where we can do a fix-up on the fly, hopefully.

Over time, that fix will make it into the client versions and roll out to a sufficient population. An example of this: one day, out of the blue, we got a call from a model provider that said, you can't send this particular parameter. It was something to do with log probabilities. You can't send that because it'll cause the model to crash, which is pretty bad, because this is a poison pill. If a particular form of request will cause a model instance to crash, it will blow that one out of the water, then the request will be retried and blow the next one out of the water, and keep working its way down the line. We couldn't fix it in the client because that wouldn't be fast enough. Because we have a proxy in the middle, we could just mutate the request quietly on the way through, and that takes the pressure off our upstream provider to get a real fix so we can restore that functionality.
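A minimal sketch of that kind of on-the-fly fix-up, assuming a JSON request body, might look like this in Go; the field name passed in is an assumption for the example, since the talk only says the parameter was related to log probabilities.

package proxy

import "encoding/json"

// stripPoisonParam removes a problematic field from the completion request
// before it is forwarded upstream.
func stripPoisonParam(body []byte, field string) ([]byte, error) {
	var req map[string]any
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, err
	}
	delete(req, field)
	return json.Marshal(req)
}

The proxy would apply this to the body just before forwarding, for example stripPoisonParam(body, "logprobs"), and remove the rewrite once the upstream fix lands.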

The last thing that we do is, when we have clients that are very old and we need to deprecate some API endpoint or something like that, rather than just letting them get weird 404 errors, we actually have a special status code which triggers logic in the client that puts up a giant modal dialog box. It asks them very politely, would they please just push the upgrade button?
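The exact status code isn't given in the talk, but a hedged sketch of the idea, using 426 Upgrade Required and a hypothetical version header purely for illustration, could look like this:

package proxy

import "net/http"

// deprecatedVersions lists client versions that should be told to upgrade.
// The versions, the header name, and the status code are all illustrative.
var deprecatedVersions = map[string]bool{"0.1.0": true}

func withUpgradeCheck(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if deprecatedVersions[r.Header.Get("X-Client-Version")] {
			// The client treats this status as "show the upgrade dialog".
			http.Error(w, "please update your Copilot extension", http.StatusUpgradeRequired)
			return
		}
		next.ServeHTTP(w, r)
	})
}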

There's even more that we can do with this, because logically the proxy is transparent. Through all of these shenanigans, the clients still believe that they're making a request to a model and getting a response; the rest is invisible to them. From our point of view in the middle, routing requests, we can now split traffic across multiple models. Quite often, the capacity we receive in one region won't all be in one unit. It might be spread across multiple units, especially if it arrives at different times. Being able to do traffic splits to combine all of that into one logical model is very handy. We can also do the opposite: we can mirror traffic. We can take a read-only tap of requests and send it to a new version of the model that we might be performance testing, or validating, or something like that.

Then we can take these two ideas, mix and match them, stack them on top of each other, and make A/B tests, experiments, all those kinds of things, all without involving the client. From the client's point of view, it just thinks it's talking to the same model it talked to yesterday.
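As an illustration of the traffic-splitting half of that (mirroring would be a read-only copy of the request whose response is discarded), a sketch in Go using the standard library's reverse proxy might look like this; the upstream type, URLs, and weights are assumptions, not the real routing code.

package proxy

import (
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// upstream is one model deployment plus the share of traffic it should get.
type upstream struct {
	target *url.URL
	weight int // must be positive
}

// pick selects an upstream in proportion to its weight.
func pick(ups []upstream) *url.URL {
	total := 0
	for _, u := range ups {
		total += u.weight
	}
	n := rand.Intn(total)
	for _, u := range ups {
		if n < u.weight {
			return u.target
		}
		n -= u.weight
	}
	return ups[len(ups)-1].target
}

// splitProxy presents several model deployments as one logical model.
func splitProxy(ups []upstream) http.Handler {
	return &httputil.ReverseProxy{
		Rewrite: func(r *httputil.ProxyRequest) {
			r.SetURL(pick(ups))
		},
	}
}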

Was It Worth the Engineering Effort?

This is the basic gist of how you build a low latency code completion system with the aim of competing with IDEs. I want to step back and ask: as an engineering effort, was this worth it? Did the engineering effort we put into this proxy system pay for itself? One way to look at it is that, for low latency, you want to minimize hops. You certainly want to minimize the number of middlemen, the middleware, anything in that request path adding value but also adding latency. What if we had just gone straight to Azure, with clients connecting straight to Azure? That would have left authentication as the big problem, as well as observability. Those would have really been open questions. It would have been possible to teach Azure to understand GitHub's OAuth token, so the token the IDE natively has from GitHub could be presented to Azure as an authentication method. I'm sure that would be possible. It would probably have resulted in Azure building effectively what I just described.

Certainly, if our roles were reversed and I were the Azure engineer, I would build this with an authentication layer in front of my service. If some customer comes to me with a strange authentication mechanism, I'm going to build a layer which converts that into my real authentication mechanism. We would probably have ended up with exactly the same number of moving parts, just with more of them behind the curtain on the Azure side. Instead, by colocating proxy instances and model instances in the same Azure region, we have for the most part ameliorated the cost of that extra hop. The intra-region traffic is not free, it's not zero, but it's pretty close to zero. It's fairly constant in terms of the latency you see there. You can characterize it and effectively ignore it.

War Stories

I'm going to tell you a few more war stories from the life of this product, just to emphasize that the value of having this intermediary really paid for itself over and over. One day we upgraded to a new version of the model which seemed to be very attracted to a particular token, some end-of-file marker. It was something to do with a mistake in how it was trained; it just really liked to emit this token. We could work around this in the request by saying: in your response, this one particular token, weight it down, give it a negative affinity, we never want to see it. If we didn't have an intermediary like the proxy to do that, we would have had to do it in the client. We would have had to do a client rollout, which would have been slow and ultimately would not have reached all the users.

Then the model would have been fixed and we'd have had to do another client change to reverse what we just did. Instead, it was super easy to add this parameter to the request on the fly, on its way to the model. That solved the problem immediately and gave us breathing room to figure out what had gone wrong with the model training and fix that, without the Sword of Damocles hanging over our heads.
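One plausible way to do this kind of down-weighting at the proxy is the logit_bias parameter that OpenAI-style completion APIs expose, which maps token IDs to biases; the sketch below assumes that parameter and a JSON body, and the token ID is supplied by the caller because the real end-of-file token isn't given in the talk.

package proxy

import "encoding/json"

// banToken injects a strong negative bias against one token ID as the request
// passes through the proxy, so the model effectively never emits it. A value
// of -100 is the strongest bias the OpenAI-style API accepts.
func banToken(body []byte, tokenID string) ([]byte, error) {
	var req map[string]any
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, err
	}
	bias, _ := req["logit_bias"].(map[string]any)
	if bias == nil {
		bias = map[string]any{}
	}
	bias[tokenID] = -100
	req["logit_bias"] = bias
	return json.Marshal(req)
}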

Another story: one day I was looking at the distribution of cancellations. For a request that was cancelled, how long did it live until it was cancelled? There was this bizarre spike at effectively 1 millisecond, that is, immediately. A lot of requests were coming in from clients and being cancelled straight away. As in, you read the request and then, instantly afterwards, the client says, I'm sorry, I didn't mean to send that to you, let me take it back. The problem is that by that time we've already started forwarding it to Azure and they're mulling on it. We immediately send the request to Azure and then say to them, sorry, I didn't mean to send that to you, may I please have it back? Cancellation frees up model resources quicker, but it's not as cheap as just not sending a request that we know we're going to cancel.

It took us some time to figure out exactly what was happening in the client to cause this fast cancellation behavior, but because we had the proxy in the middle, we could add a little check just before we made the request to the model: has this actually been cancelled already? There were mechanisms in the HTTP library to ask that question, and we saved ourselves making and then retracting that request. Another point about metrics: from the metrics that our upstream model provider gives us, we don't get histograms, we don't get distributions, we barely get averages. There would be no way we could have spotted this without our own observability at that proxy layer. If we didn't have the proxy as an intermediary, we still could have had multiple models around the world.
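In Go, for instance, the mechanism for that kind of check is the request context, which is cancelled when the client disconnects; a minimal sketch of the guard, assuming the rest of the proxy plumbing exists, would be:

package proxy

import "net/http"

// forwardUnlessCancelled checks, just before contacting the model, whether
// the client has already cancelled the request. If it has, we simply drop it
// instead of sending it upstream and then retracting it.
func forwardUnlessCancelled(w http.ResponseWriter, r *http.Request, upstream http.Handler) {
	if r.Context().Err() != nil {
		// Client went away between accepting the request and forwarding it.
		return
	}
	upstream.ServeHTTP(w, r)
}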

As you saw, you can have OpenAI models in any Azure region you want; we would just not have a proxy in front of them. We probably would have used something like octoDNS to do the geographic routing, but it would have left open the question of what to do about health checks. When models are unhealthy or overloaded, how do we take them out of DNS? We probably would have had to build some kind of thing that issues synthetic requests or pings the model, and then makes calls to upstream DNS providers to manually thumbs-up and thumbs-down regions. HTTP/2 is critical to the Copilot latency story. Without cancellation, we'd make twice as many requests and waste half of them. It was surprisingly difficult to do with off-the-shelf tools.

At the time, CDNs didn't support HTTP/2 on the backend. That was an absolute non-starter. Most cloud providers didn't support HTTP/2 on the backend either. If you want to do that, you have to terminate TLS yourself. For the first year of our product's existence, TLS, the actual network connection, was terminated directly on the Kubernetes pod. You can imagine our security team were absolutely overjoyed with this situation. It also meant that every time we did a deploy, we were literally disconnecting everybody and they would reconnect, which goes against the whole idea that we want to establish these connections and keep them open for as long as possible.
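For context on why terminating TLS in the process gets you HTTP/2 without extra machinery: Go's standard HTTP server, as one example and not necessarily what copilot-proxy uses, negotiates HTTP/2 over ALPN automatically when it serves TLS itself, as in this minimal sketch; the certificate paths are placeholders.

package main

import (
	"log"
	"net/http"
)

func main() {
	// Serving TLS directly from the process enables HTTP/2 via ALPN, which is
	// what lets each client keep one long-lived, multiplexed connection open
	// to the pod. Certificate and key paths are illustrative.
	srv := &http.Server{Addr: ":443"}
	log.Fatal(srv.ListenAndServeTLS("/etc/tls/tls.crt", "/etc/tls/tls.key"))
}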

GitHub’s Paved Path

This is very GitHub specific, but a lot of you work for medium to large-scale companies, so you probably have a standard tech stack; at GitHub, we call it the paved path. It is the blessed way, the way you're supposed to deploy applications inside the company. Having everything behind GLB and everything managed by octoDNS made our compliance story. You can imagine, we're selling this to large enterprise companies. You need to have your certifications. You need to have your SOC 2 tick in the box. Using these shared components really made that compliance story much easier. The auditors see that this is just another GLB-hosted service using all the regular stuff; not exactly an automatic tick in the box, but it got us a long way towards solving our compliance story. The flip side is that because these are shared components, rather than every individual team knowing every detail of terminating TLS connections on pods hosted in Kubernetes clusters they run themselves, we delegate that work to shared teams who are much better at it than we are.

Key Takeaways

This is a story of what made Copilot a success. It is possible that not all of you are building your own LLM-as-a-service service. Are there broader takeaways for the rest of you? The first one is: definitely use HTTP/2. It's dope. I saw a presentation by the CTO of Fastly, and he viewed HTTP/2 as an intermediate step; he says HTTP/3 is the real standard, the one they really wanted to make. From his position as a content delivery provider whose job is to ship bits as fast as possible, I agree completely with that. Perhaps the advice is not "use HTTP/2"; the advice is probably something more like "use something better than HTTP/1". If you're interested in learning more, look it up on YouTube; there's a presentation by Geoff Huston talking about HTTP/2 from the point of view of application writers and clients, and how it totally disintermediates most of the SSL and middle VPN nonsense that we live with day to day in current web stuff.

The second one is a Bezos quote: if you're gluing your product together from off-the-shelf parts and your role is only supplying the silly putty and the glue, what are your customers paying you for? Where's your moat? As an engineer, I understand very deeply the desire not to reinvent the wheel, so the challenge to you is: find the place where investing your limited engineering budget in a bespoke solution is going to give you a marketable return. In our case, it was writing an HTTP/2 proxy that accelerated one API call. We're very lucky that copilot-proxy as a product is more or less done, and has been for quite a long time, which is great because it gives our small team essentially 90% of our time to dedicate to the operational issues of running this service.

The last one is about latency. Your cloud provider may try to sell you the siren song that they can solve your latency issues with their super-fast network backbone. That can be true up to a point, but remember the words of Montgomery Scott: you cannot change the laws of physics, despite what your engineering title is. If you want low latency, you have to bring your application closer to your users. In our case that was fairly straightforward, because code completion, at least in the request path, is essentially stateless. Your situation may not be as easy. Having multiple models around the globe turns SEV1 incidents into just SEV2 alerts. If a region is down or overloaded, traffic just flows somewhere else. Those users, instead of getting a busy signal, still get service, albeit at a marginally higher latency. I've talked to a bunch of people and said that we would fight holy wars over 20 milliseconds, but the kind of latencies we're talking about here are in the range of 50 to 100 milliseconds, so really not noticeable for the average user.

