Presentation: Comprehensive Approaches to Software Supply Chain Security

MMS Founder
MMS Mykhailo Brodskyi

Transcript

Brodskyi: My name is Mykhailo Brodskyi. As a Principal Software Architect, I focus on platform security and cloud migration. I'm going to walk you through the top four security risk categories in the software supply chain and show you how to mitigate them effectively. I will share real case studies from our projects, highlighting strategies that protect systems from vulnerabilities and ensure the security and resilience of your platform.

Here is how I'm going to do it. First, we'll start with the challenges that we have in the FinTech industry. Then, we will deep dive into the risk categories, and I will show practical examples of how to mitigate them. I prepared some case studies from our real projects. Finally, I will show a real hands-on demo.

Looking Into the Future and Reflecting on the Past

Do any of you know what significant security-related event takes place here in Munich every year? It's not Oktoberfest. Any ideas? Every year in winter? The Munich Security Conference. It has been a global stage for discussing international security issues. Here, we are not talking about geopolitical issues. We are talking about something equally important: software security. Just as the Munich Security Conference shapes global security policy, our goal is to shape how we secure our software supply chain.

Uncover FinTech Landscape Challenges

Let's dive into the FinTech landscape. The FinTech ecosystem is driven by key business domains, such as customer onboarding and payment processing. Each of these domains ensures the smooth operation of financial services, and each houses numerous business applications. For instance, in customer onboarding there is an application for know your customer (KYC). For payment processing, we have applications responsible for alternative payment method (APM) processing, credit card transaction processing, and more. Each of these domains operates under a framework of standards, laws, and principles.

For example, payment processing is subject to PSD2, while fraud detection is governed by AMLD, the EU Anti-Money Laundering Directive. Why do we have all these principles, laws, standards, and regulations in FinTech, and in other areas as well? The answer is simple: risk. All these regulations are designed to mitigate risks, such as financial risk and reputational risk, and they exist to help organizations manage them. As you can see, the FinTech landscape is very complex, both because of these regulations and because of the integrations between applications. We clearly need a really robust approach to securing all the applications inside our landscape.

Explore Software Supply Chain

Let's dive into the software supply chain. I would like to begin with something we are all familiar with: the supply chain for physical goods. The journey begins with an upstream supplier that delivers raw materials to a manufacturer, and then the customer receives the final product. Software development is similar: we rely on suppliers such as third-party libraries, dependencies, and tools, and everything goes into the development flow. If any of these components is compromised, our final product is at risk. In the software supply chain, the development organization plays the role of the manufacturer, and inside it there are different stages of the process.

The process starts with development, goes to integration, and ends with deployment. Each of these stages again relies on third-party libraries, tools, and dependencies, so we clearly need an approach to secure all of them. That's why compliance and security are not a static layer in our software supply chain. They are dynamic and integrated into each step of the process. Based on this overview of the software supply chain, we can define four risk categories. The first is related to third-party libraries and tools. The second is related to internal development. The third covers the process risks of delivery and deployment, and the fourth covers governance and security testing.

Address Mitigation Strategies for Third-Party Risks

Now I'm going to talk about each of these categories in detail, starting with the first one: the third-party components in our software development chain. The first thing we need to understand and ask when we work with third-party libraries is whether they are certified. That way, we can make sure that the final product we build on top of these libraries is also protected. Then we can integrate software composition analysis (SCA), which helps us mitigate the risks related to third-party libraries and tools. Software composition analysis consists of several parts.

The first component is dependency analysis, which analyzes all our dependencies. Then comes vulnerability detection: the tool is integrated with a vulnerability database, which makes it possible to monitor and detect issues in our pipeline. There is also a module responsible for license compliance. In our organization, we usually have a private artifact repository and a version control system. Our journey starts with fetching the dependencies and building the project, and the software composition analysis tool analyzes all of these dependencies. In the next step, we build our pipeline: we can add a job that monitors all these dependencies and notifies us if there are any issues.
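To make the pipeline job more concrete, here is a minimal Python sketch of the checks an SCA stage performs on resolved dependencies: vulnerability detection and license compliance, returning a non-zero exit code to fail the build. The advisory table `KNOWN_VULNERABLE`, the allowlist `ALLOWED_LICENSES`, and the function names are hypothetical stand-ins; a real SCA tool queries its own continuously updated vulnerability database rather than a hard-coded dictionary.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical advisory feed. A real SCA tool queries its own, continuously
# updated vulnerability database instead of a hard-coded dictionary.
KNOWN_VULNERABLE = {
    ("org.apache.logging.log4j", "log4j-core", "2.14.1"): "CVE-2021-44228",
}

# Hypothetical license policy for the organization.
ALLOWED_LICENSES = {"Apache-2.0", "MIT", "BSD-3-Clause"}


@dataclass(frozen=True)
class Dependency:
    group: str
    name: str
    version: str
    license: str


def analyze(deps: List[Dependency]) -> List[str]:
    """Vulnerability detection plus license compliance for resolved dependencies."""
    findings = []
    for d in deps:
        coordinate = f"{d.group}:{d.name}:{d.version}"
        cve = KNOWN_VULNERABLE.get((d.group, d.name, d.version))
        if cve:
            findings.append(f"VULNERABLE {coordinate} ({cve})")
        if d.license not in ALLOWED_LICENSES:
            findings.append(f"LICENSE {coordinate} uses {d.license}")
    return findings


def pipeline_job(deps: List[Dependency]) -> int:
    """Return a process exit code: non-zero fails the pipeline stage."""
    findings = analyze(deps)
    for finding in findings:
        print(finding)  # stand-in for a Teams or e-mail notification hook
    return 1 if findings else 0
```

In a CI job, calling `sys.exit(pipeline_job(resolved_dependencies))` would fail the stage whenever a finding is reported, and the `print` call marks where a notification would be sent.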

Now we can go even further and build more layers of security around third-party libraries. Imagine that we would like to start working on a new feature, and we need new third-party libraries that are not yet available in our repository. There are three parties: the development team, the cyber security team, and the supplier who delivers the third-party libraries and tools. A developer selects, in the public artifact repository, the component that needs to be brought into our private artifact repository. It is first added to an intermediate repository, where we trigger the vulnerability scanning and license scanning I mentioned earlier.

Only after the library passes the vulnerability scan and the license compliance check, and is ready for further checks, can we move it to the next tier, the secure repository. This repository is continuously monitored: we try to identify new vulnerabilities and re-check licenses. Once this validation gives us a good sign, we can include the library in our development repository. This zero-trust dependency management really helps us minimize the risks related to dependencies and tools. Finally, we run security verification, SAST and DAST, and then perform a penetration test.
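The zero-trust onboarding flow can be modeled as a chain of gated repositories. The sketch below assumes three hypothetical tiers (`intermediate`, `secure`, `development`) and stub check functions standing in for the real vulnerability scanner, license scanner, and continuous monitor. It only illustrates the promotion logic: a library stops in the first tier whose checks it fails.

```python
from typing import Callable, List, Tuple

Check = Callable[[str], bool]

# Hypothetical scanner verdicts; real checks would call the SCA and license tools.
FLAGGED_BY_SCANNER = {"org.apache.logging.log4j:log4j-core:2.14.1"}
LICENSE_VIOLATIONS = {"com.example:copyleft-lib:1.0"}


def vulnerability_scan(coordinate: str) -> bool:
    return coordinate not in FLAGGED_BY_SCANNER


def license_scan(coordinate: str) -> bool:
    return coordinate not in LICENSE_VIOLATIONS


def continuous_monitoring(coordinate: str) -> bool:
    # Stub: re-runs the same checks, as a monitor would when new advisories land.
    return vulnerability_scan(coordinate) and license_scan(coordinate)


# Each tier lists the checks a library must pass while it sits in that tier
# before it may be promoted to the next one. The last tier has no gate.
TIERS: List[Tuple[str, List[Check]]] = [
    ("intermediate", [vulnerability_scan, license_scan]),
    ("secure", [continuous_monitoring]),
    ("development", []),
]


def onboard(coordinate: str, tiers: List[Tuple[str, List[Check]]] = TIERS) -> Tuple[str, List[str]]:
    """Promote a library tier by tier and stop at the first failing gate.

    Returns the repository the library ends up in and the names of failed checks.
    """
    for repo, checks in tiers:
        failed = [check.__name__ for check in checks if not check(coordinate)]
        if failed:
            return repo, failed  # quarantined: the library stays in this tier
    return tiers[-1][0], []
```

Keeping each gate as a plain list of callables makes it easy to add tiers or checks per business domain, which matches the idea of building as many repository layers as the domain requires.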

Let's summarize what we need to do to mitigate third-party dependency risks. We need to ask about licenses and make sure the libraries are license compliant. We should use only a private artifact repository and, depending on your business domain, build several layers of repositories if needed. Finally, we integrate continuous verification into the pipeline.

Secure Internal Development

Let's move to internal development. A best practice for improving secure development is to adopt existing principles and standards. For example, in the FinTech industry there is a common set of rules: PCI DSS, which every payment processing domain must follow. Let's start with the definition. PCI DSS is a set of standards that explains how we need to implement our systems and how we need to build our network. There are other standards as well, but this one is essential for FinTech. Its requirements are organized into six groups. One group focuses on network segmentation, another on how you build access control for your systems and environments, and another on how you monitor your environments and applications. If you want to apply this standard in your organization, the process has several stages.

The first stage is scoping: you analyze your infrastructure and your landscape. This means identifying all the components in the chain and analyzing which type of data you store in each of them. Based on this information, you can apply different segmentation strategies. That's step one, scoping and organizational analysis. Step two is categorization. PCI DSS defines categories that you apply to your systems depending on the data they hold. The first is the CDE, the cardholder data environment: the environment where you process or store transactional data, meaning everything related to transactions and cardholder data.

Next are connected-to systems: separate systems that don't store any cardholder or customer data, but are connected to the cardholder data environment that processes or stores it. Then there are security-impacting systems. A good example is configuration management, where you store configuration for a particular microservice or a particular customer. Finally, there are out-of-scope systems, which are not subject to PCI DSS because they don't contain any cardholder data and can be completely isolated from the main environment. The next step is to implement the segmentation and controls, and then to validate them. We also need to maintain this segmentation: in our industry, for example, we go through PCI DSS validation twice a year. Each time, we need to update the documentation and show that we have a monitoring system in place. That's why this is so important.
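The categorization step can be expressed as a small classification function. This is a deliberately simplified, hypothetical model: each system declares whether it stores, processes, or transmits cardholder data, whether it can affect the security of the CDE, and which systems it connects to. Actual PCI DSS scoping involves more criteria and is confirmed during assessment, so treat this only as an illustration of the four categories.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set


class Scope(Enum):
    CDE = "cardholder data environment"
    SECURITY_IMPACTING = "security-impacting"
    CONNECTED_TO = "connected-to"
    OUT_OF_SCOPE = "out of scope"


@dataclass
class System:
    name: str
    stores_or_processes_chd: bool  # stores, processes, or transmits cardholder data
    connects_to: Set[str] = field(default_factory=set)
    affects_cde_security: bool = False  # e.g. configuration management, IAM


def categorize(systems: List[System]) -> Dict[str, Scope]:
    """Simplified, illustrative PCI DSS-style scoping of a set of systems."""
    cde = {s.name for s in systems if s.stores_or_processes_chd}
    scopes: Dict[str, Scope] = {}
    for s in systems:
        # A connection counts in both directions.
        linked = set(s.connects_to) | {o.name for o in systems if s.name in o.connects_to}
        if s.name in cde:
            scopes[s.name] = Scope.CDE
        elif s.affects_cde_security:
            scopes[s.name] = Scope.SECURITY_IMPACTING
        elif linked & cde:
            scopes[s.name] = Scope.CONNECTED_TO
        else:
            scopes[s.name] = Scope.OUT_OF_SCOPE
    return scopes
```

For example, a hypothetical configuration service that feeds the payment service would be classified as security-impacting, while a marketing site with no links to the CDE would fall out of scope.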

Examine Real-World Case Study

I would like to show a real example. It's a very interesting story about a project we have already started. Our company's main business is processing transactions, and all our systems were hosted in a private data center. We initiated a really complex project to migrate all our 100-plus application modules from the physical data center to the cloud. During this migration journey, we had to review all our existing segmentation approaches and communication strategies. I'll show a small part of the architecture where we applied these principles and used the cloud migration to bring in architectural improvements. First, the holistic architecture. In payment processing, there are different architectural layers. First, there is the input channel, where we receive the transaction and send it on to our payment processing gateway. There are different input channels: mobile devices, integrations with external websites, or integrations with external systems, such as airlines.

In this case, we have an environment where we need to consume these transactions. The input channel component receives the transaction from a physical terminal. If you pay in a different currency, for example when you travel from Europe to the U.S. or to other countries, you can be asked which currency you would like to pay in. A separate component, or even a separate service integrated into the payment flow, is responsible for this currency conversion: the currency conversion service. It decides which option is better and how the exchange is performed. Then, the transaction is processed.

The payment processing service then connects to one of the card schemes. When we started to analyze the architecture we had in our data center, we found that the landscape was super complex. In some places there was a shared database approach, with 10 applications connected to one database. In the cloud, that makes it difficult to troubleshoot issues and to implement new features. That's why we decided to separate these components. Remember which categories we have: CDE systems, which make up the cardholder data environment, connected-to systems, security-impacting systems, and out-of-scope systems.

The transaction is first received by the input channel, then processed by the payment service, and then sent to the card schemes. This means these two services belong in the CDE category. We can then separate the currency conversion service and move it independently to another zone. Let's assume we can place this service in a non-cardholder-data environment. What else do we need in order to build this separation and move the service to the out-of-scope category? We need to implement access control, authentication, and encryption. The payment service can no longer simply call the currency conversion service directly; it needs to authenticate, and we need an authorization mechanism as well. We also need to put the service in a separate zone.
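One way to implement the authentication and authorization the payment service needs before calling the currency conversion service is request signing. The sketch below uses HMAC-SHA256 from Python's standard library with hypothetical per-caller keys, a hypothetical allowlist of operations, and a timestamp check to limit replay. It is a swapped-in illustration, not the approach described in the talk; mutual TLS is a common alternative, and encryption in transit (TLS) is assumed and not shown.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical per-caller keys. In practice they would come from a secret
# manager and be rotated, never hard-coded.
SERVICE_KEYS = {"payment-service": b"example-key-do-not-use"}

# Hypothetical authorization policy: which caller may invoke which operation.
ALLOWED_CALLS = {("payment-service", "POST /conversions")}

MAX_CLOCK_SKEW_SECONDS = 300  # reject stale requests to limit replay


def sign(caller: str, operation: str, body: bytes, timestamp: int, key: bytes) -> str:
    message = f"{caller}\n{operation}\n{timestamp}\n".encode() + body
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(caller: str, operation: str, body: bytes, timestamp: int,
           signature: str, now: Optional[float] = None) -> bool:
    """Authenticate the caller via HMAC, then authorize the requested operation."""
    now = time.time() if now is None else now
    key = SERVICE_KEYS.get(caller)
    if key is None:
        return False  # unknown service: authentication fails
    if abs(now - timestamp) > MAX_CLOCK_SKEW_SECONDS:
        return False  # request too old or from a badly skewed clock
    expected = sign(caller, operation, body, timestamp, key)
    if not hmac.compare_digest(expected, signature):
        return False  # tampered request or wrong key
    return (caller, operation) in ALLOWED_CALLS
```

Separating the signature check (authentication) from the allowlist lookup (authorization) mirrors the two requirements named above: the currency conversion service first establishes who is calling, then decides what that caller may do.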

I understand that there are many people here from different industries, so I tried to think about how you could apply this information, say, as early as next week. Think in terms of the categorization used in the FinTech industry and in PCI DSS, and build the same categorization and segmentation into your own systems. Take healthcare: we can build a separate environment for the applications that store and process personal information. Information related to health insurance or a person's health status, for example, can be kept in that separate environment, and even in a separate application. Then, construction. I remember from my past experience in the construction domain that we had a microservice architecture where all the services were deployed in one single zone.

From a communication and security point of view, that's a really bad approach. In the construction domain, it's better to put all customer-related information in one service with its own database, and then move it to a completely separate environment. The same goes for real estate. Customer-related information goes into one database, even in another environment. All information about objects and real estate properties can go into a separate environment, because you also need to protect that information; it would be very valuable to competitors. In the energy sector, information such as telemetry or data about plants and manufacturers can also be separated into a completely different environment and zone. Those are cross-industry applications, showing how other industries can take inspiration from this approach.

The second approach, which also belongs to secure development practices, is threat modeling. I have applied it successfully at my current company and at a previous one in network security protection. What does it mean? There are three questions that we need to answer: what are we going to build, which potential issues can we have, and how are we going to mitigate them? The idea is that if your organization has a design or architecture process, you can integrate threat modeling at an early stage of development. That's exactly what we did at my current company. At this early stage, together with your development team, you can think through the potential vulnerabilities and mitigate them already in the design, in the draft version of the architecture.

It helped us multiple times, because otherwise there is reputational risk, development risk, and the additional cost of fixing issues later on. As for the key components: first, similar to scoping in PCI DSS, here we have asset identification, where we analyze all the components in our system. Then we review the current threats and build mitigations for them. There are several benefits. First, we can improve time to market, because we don't spend additional time on testing, verification, and then fixing these issues. We improve our application security. We also use best practices and frameworks that already exist in the industry. There are many approaches; we have applied the STRIDE approach to threat modeling several times.

Here is a data flow diagram (DFD), which is created during the threat modeling process. With this diagram, you can identify external boundaries, internal systems, processes, and data stores. Then we map potential issues, identify the communication flow from one service to another, and try to add security layers. For example: How are authentication and authorization handled? Is there any encryption? Which type of information do we store in this database? Everything is documented and reviewed together with our cyber security experts and architects, to make sure we are not introducing any additional risk. This approach can be automated with different tools.
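Part of the DFD review can be automated. The following sketch assumes a hypothetical model in which each DFD element is assigned to a trust zone and each data flow records whether it is authenticated and encrypted. It applies two STRIDE-style rules to flows that cross a trust boundary: missing authentication is reported as a spoofing threat and missing encryption as an information disclosure threat. Same-zone flows are skipped here only to keep the example short; a real review covers them too.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Flow:
    source: str
    target: str
    data: str
    authenticated: bool
    encrypted: bool


def review(flows: List[Flow], zones: Dict[str, str]) -> List[str]:
    """Flag trust-boundary crossings that lack authentication or encryption."""
    findings = []
    for f in flows:
        if zones[f.source] == zones[f.target]:
            continue  # same trust zone: skipped in this simplified check
        if not f.authenticated:
            findings.append(f"Spoofing: {f.source} -> {f.target} is not authenticated")
        if not f.encrypted:
            findings.append(
                f"Information disclosure: {f.data} from {f.source} -> {f.target} is unencrypted"
            )
    return findings
```

Applied to the payment example, a flow from the payment service in the CDE zone to a currency conversion service in a separate zone without authentication would be flagged, which is the kind of service-to-service issue threat modeling is meant to surface early.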

Microsoft, for example, offers a threat modeling tool, and it's even possible to use AI to analyze the design and build a list of potential risks. Once we applied this approach, we identified potential vulnerabilities that had not been found during penetration testing. That was a real red flag for us, and we immediately worked to resolve these issues, which were related to service-to-service communication and the data we send there. These two big issues were identified precisely because of this process.

Let's summarize how to mitigate internal development risks. First, once more, apply the existing security standards. If you're in healthcare, for example, you can apply the same industry-standard approach I just explained. Also, do security reviews, really good code reviews, and threat modeling.

Mitigate Software Delivery and Governance Risks

Let's move to the software delivery and deployment category, and the governance and security testing category. I would like to show you how to mitigate the delivery risks we face during deployment. Let's go back once more to our development organization and the different stages of the flow. The first issue can arise at the version control system stage: we can accidentally expose a credential or secret. There have been many examples of secrets found in public repositories on GitLab and GitHub, and this can be a really big issue for all our systems.
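A simple line of defense against committed secrets is a scanner that runs as a pre-commit hook or an early pipeline stage. The sketch below uses three illustrative regular expressions (the well-known `AKIA` prefix of AWS access key IDs, PEM private key headers, and hard-coded password assignments). Dedicated tools such as gitleaks ship far larger rule sets, so this only shows the mechanism.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; dedicated scanners ship far larger rule sets.
PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "Private key block": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "Hard-coded password": re.compile(r"(?i)password\s*[:=]\s*[\"'][^\"']+[\"']"),
}


def scan(text: str) -> List[Tuple[int, str]]:
    """Return (line number, finding) pairs for lines that look like secrets."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits
```

Run against staged files, a non-empty result would block the commit, so the secret never reaches the remote repository; this complements, rather than replaces, a secret management tool.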

We can mitigate this issue with secret management. Let's say that we are building software together right now, building a new feature. There are many secret management tools available. Some are platform agnostic, so it doesn't matter which cloud provider we use. Some are native to a cloud provider, some are container native, and some are DevOps focused. In a separate category, I added identity management systems. They are not secret managers as such, but they form the first layer of how we protect access. Let's say that, because of our cloud migration, we are going to deploy everything from our data center to Azure.

In that case, let's select Azure Key Vault. Then we move to the build stage. Here the risk is that our build infrastructure can be compromised. We can mitigate it with additional security controls, such as those available in version control platforms like GitHub or GitLab. We can also implement SAST and DAST, static and dynamic application security testing. For static testing, there is SonarQube; for dynamic testing, there are Acunetix and Qualys. Let's say that for security controls we select SonarQube and Acunetix, which is what we use in our current company. Next is the package stage, where the risk is insecure artifacts. Here we can also integrate the zero-trust repository approach and software composition analysis I explained earlier. Another approach is artifact signing. There are different tools for this: Cosign, Notary, and code signing in Azure pipelines. We are going to select Cosign.

Then we move to the testing and deployment stages, where the risk is insufficient security testing. I have seen many times that teams don't pay enough attention to this, and there aren't enough security test cases to complete the final verification of the software. That's why it's a really good approach to integrate SAST and DAST first and, depending on your industry, penetration testing as well. I also applied this approach at previous companies, in the construction domain and in network security verification testing. All these issues can be mitigated with security controls and secret management. One more point: have you already integrated a secret management tool into your pipeline? If so, verify that the keys stored in it have an expiration date. Otherwise, they will not be compliant, and any monitoring tool integrated with your environment will flag them.
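The expiration-date check can be automated as a scheduled audit. This sketch works on a hypothetical `SecretMeta` record; with Azure Key Vault, comparable metadata is available through the `expires_on` attribute of the secret properties returned by the Python SDK's `SecretClient.list_properties_of_secrets()`. The 30-day warning window is an assumed default.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


@dataclass
class SecretMeta:
    name: str
    expires_on: Optional[datetime]  # None means no expiration date was set


def audit(secrets: List[SecretMeta], now: datetime,
          warn_within: timedelta = timedelta(days=30)) -> Dict[str, List[str]]:
    """Group secrets by compliance problem: no expiry, expired, or expiring soon."""
    report: Dict[str, List[str]] = {"missing_expiry": [], "expired": [], "expiring_soon": []}
    for s in secrets:
        if s.expires_on is None:
            report["missing_expiry"].append(s.name)
        elif s.expires_on <= now:
            report["expired"].append(s.name)
        elif s.expires_on - now <= warn_within:
            report["expiring_soon"].append(s.name)
    return report
```

Passing `now` explicitly (for example `datetime.now(timezone.utc)` in production) keeps the audit deterministic and easy to test.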

Hands-On Demo

Now I would like to move on to the demo I prepared. I'm going to focus specifically on mitigating third-party library risks, and show you how we can generate a software bill of materials (SBOM) artifact and use it in our verification and analysis. Here I have a simple project in GitHub: a really simple microservice with some dependencies. The pipeline has two stages. The first builds the project and generates the software bill of materials. The second is the integration with Snyk, which uses this artifact for further scanning and verification; that's the software composition analysis part. The tool also has a dashboard. Right now, I don't have any critical or high-severity vulnerabilities. I have also integrated a Nexus Repository, currently running on my EC2 instance. Here are the different types of repositories that I created. The first is a Maven Central repository.

Here is the GitHub repository with the integrated pipeline, which has multiple stages. First, there is the Snyk scan integration. I'm going to trigger the build right now. The pipeline is also integrated with the dashboard, and again there are no high or critical vulnerabilities. There are multiple repositories: this one holds the dependencies of the current project, and a separate one is where I publish the artifact that I build. Here you can see the separation of these two repositories, and this is the EC2 security group configuration. Now I'm going to change the POM configuration. Right now, everything is green. I'm going to introduce a Log4j dependency with known vulnerabilities and see how the tool behaves and how the result shows up in the dashboard. I'm going to uncomment this dependency and trigger a build. The build started, and now it's completed. You can see that new issues were introduced.

Based on the artifact that was created, the tool continuously analyzes my software bill of materials. Now I'm going to remove this dependency and generate the file once more. In the end, it's a big artifact, a big XML or JSON file listing all the dependencies in the application. You can see that this file is already integrated into the pipeline. You can even build business logic on top of the current pipeline, establish continuous monitoring, share the file, and trigger a compliance check. You can then use the outcome in your regulation and compliance process. I remember, in one of my past projects, the compliance team asked the development team to create an Excel file listing all the dependencies. We were really surprised; it's purely manual work. It's better to integrate a software bill of materials, and then to have a stage in the pipeline where the security and risk team can analyze and approve it. In the end, the issue is mitigated and resolved, and the dashboard is green.
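As an illustration of building logic on top of the SBOM in the pipeline, the sketch below reads a CycloneDX JSON SBOM (an assumption; the demo's exact SBOM format isn't specified) and applies a version-floor policy. The policy itself is also an assumption chosen for the example: `log4j-core` releases older than 2.17.1 are rejected, which covers CVE-2021-44228 (Log4Shell) and the follow-up Log4j CVEs. Only top-level `components` are inspected; nested components are ignored for brevity.

```python
import json
from typing import List, Tuple

# Assumed policy for illustration: log4j-core releases older than 2.17.1 are rejected.
MINIMUM_VERSIONS = {("org.apache.logging.log4j", "log4j-core"): (2, 17, 1)}


def parse_version(version: str) -> Tuple[int, ...]:
    """Numeric prefix of a version: "2.14.1" -> (2, 14, 1), "2.0-beta9" -> (2, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if ch not in "0123456789":
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) < len(piece):
            break  # stop at a qualifier such as "-beta9"
    return tuple(parts)


def check_sbom(sbom_text: str) -> List[str]:
    """Return group:name:version of top-level components that violate the policy.

    A component with a missing version parses as () and is treated as a violation.
    """
    bom = json.loads(sbom_text)
    violations = []
    for component in bom.get("components", []):
        key = (component.get("group", ""), component.get("name", ""))
        minimum = MINIMUM_VERSIONS.get(key)
        if minimum and parse_version(component.get("version", "")) < minimum:
            violations.append(f"{key[0]}:{key[1]}:{component.get('version')}")
    return violations
```

A pipeline stage could fail when `check_sbom` returns anything, and the same SBOM file can be archived as the dependency list the compliance team needs, instead of a manually maintained spreadsheet.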

Questions and Answers

Nikolai: You have a step, a Snyk scan, but what if a vulnerability in a dependency is found after the build finished and the service was already deployed? Do you continuously rescan all your dependencies, and then notify and rebuild all the services that depend on this dependency?

Brodskyi: So the question is about integration: how we automate this, and how we notify and protect the next deployment step.

Nikolai: Not the next step. What if it's already in production, and the next day a zero-day vulnerability is found in a dependency we already deployed?

Brodskyi: In this case, you need to establish patch management and make sure that your organization has a process to mitigate the issue and deliver the fix as soon as possible. It comes down to your patch process. In our organization, there is an SLA: when this happens, we need to react within a defined period of time. If we don't, it becomes a problem, a reputational risk for our company. So that's our patch process: an SLA for how fast we react, and a defined mitigation.

Nikolai: But how do you find out that you have this vulnerability in the first place?

Brodskyi: Because of PCI DSS, we need to have a really strong monitoring system. Our monitoring system notifies all development teams immediately, and it is integrated with our other notification channels, such as Teams and email. First, the operational support team receives the alarm. Then the development teams receive the notification.
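Finding already-deployed services that are affected by a newly published vulnerability can be answered from stored SBOMs. The sketch below assumes a hypothetical inventory that maps each deployed service to the component coordinates in its SBOM, and hypothetical patch SLAs per severity; it returns the affected services together with the patch deadline derived from the SLA.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Set, Tuple

# Hypothetical patch SLAs per severity; real values are organization-specific.
PATCH_SLA = {
    "critical": timedelta(hours=24),
    "high": timedelta(days=7),
    "medium": timedelta(days=30),
}


def affected_services(inventory: Dict[str, Set[str]], affected: Set[str],
                      severity: str, published_at: datetime) -> List[Tuple[str, datetime]]:
    """Match a new advisory against deployed SBOMs and attach SLA deadlines.

    inventory maps each deployed service to the group:name:version coordinates
    recorded in its SBOM; affected holds the coordinates named by the advisory.
    """
    deadline = published_at + PATCH_SLA[severity]
    return [(service, deadline)
            for service in sorted(inventory)
            if inventory[service] & affected]
```

Each returned pair could feed the notification channel, so the responsible team learns both that its service is affected and by when the hotfix has to be deployed.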

Nikolai: More practically: say I have a container in my private registry. I know that AWS Inspector, for example, continuously scans containers if you keep them in AWS's registry. As soon as it finds a vulnerability in your container, you can configure a notification pipeline that sends you a message. Then you can rebuild your artifact with a fresh dependency and deploy it again. How do you do that? What tools do you use?

Brodskyi: For container scanning, we use an Azure tool that we integrated there. For applications that are not yet in the cloud, we use Acunetix, Qualys, SonarQube, and, of course, penetration testing when we release a very business-critical update.

Shashi: The DORA regulation comes into force next year and, I think, has to be adopted by all European companies regardless of industry. Will the pipelines you showed us have to be adapted and become faster, because the SLAs might be much shorter under this regulation? If so, is there already work going on for the architectures you showed in this talk?

For example, in our company we use Black Duck for software composition analysis. We have C++ libraries, and because some of them take a really long time to build, we build them locally on our own infrastructure. Say a CVE is found, a zero-day CVE as the previous question described: how would we use what you just showed us to comply with DORA, so that a new patch is immediately created and delivered to the end customer?

Brodskyi: So there is a new regulation coming to the FinTech industry, DORA, and the question is how the pipeline is going to be adapted.

DORA is specifically about resilience: how resilient your system and platform are, and how you deploy. It also covers platform security. Regarding deployment, we are certainly working to improve it right now. As part of the cloud migration, we integrated DevOps principles to speed things up. A recent example: to complete verification of our large APM (alternative payment method) processing application, we spent several hours during the cloud migration optimizing the processing strategy and the feedback loop of its pipeline. With GitLab plus Argo CD, we were able to speed up our deployment. In our company, DORA compliance is running in parallel: we are making these improvements not because of this regulation, but because of our big cloud migration journey and to improve time to market.

Regarding vulnerabilities in production: we have a very comprehensive monitoring tool, and our support team watches it constantly. We also have notifications, and as soon as one arrives, we react immediately. The scrum teams responsible for the affected application or microservice focus on that particular vulnerability. The fix is then verified with all the tools in our pipeline and deployed through our patch process as a hotfix deployment.


Optimize AI Workloads: Google Cloud’s Tips and Tricks

MMS Founder
MMS Claudio Masolo

Google Cloud has announced a suite of new tools and features designed to help organizations reduce costs and improve efficiency of AI workloads across their cloud infrastructure. The announcement comes as enterprises increasingly seek ways to optimize spending on AI initiatives while maintaining performance and scalability.

The new features focus on three key areas: compute resource optimization, specialized hardware acceleration, and intelligent workload scheduling. These improvements aim to address one of the primary challenges enterprises face when deploying AI at scale—balancing innovation with cost management.

In the announcement, Google Cloud’s VP of AI Products said:

Organizations are increasingly looking for ways to optimize their AI costs without sacrificing performance or capability, these new features directly address that need by providing more efficient ways to run machine learning training and inference.

Google Cloud’s approach begins with strategic platform selection. Organizations now have multiple options ranging from fully-managed services to highly customizable solutions. Vertex AI offers a unified, fully managed AI development platform that eliminates infrastructure management concerns, while Cloud Run with GPU support provides a scalable inference service option. For long-running tasks, Cloud Batch combined with Spot Instances can significantly reduce costs. Organizations with existing Kubernetes expertise may benefit from Google Kubernetes Engine (GKE), while those requiring maximum control can utilize Google Compute Engine.

A key recommendation focuses on optimizing container performance. When working with inference containers in environments like GKE or Cloud Run, Google advises keeping containers lightweight by externally storing models using Cloud Storage with FUSE, Filestore, or shared read-only persistent disks. This approach dramatically reduces container startup times and improves scaling efficiency—critical factors in managing both performance and costs.

Storage selection emerges as another critical factor in optimization. Google Cloud recommends Filestore for smaller AI workloads, Cloud Storage for object storage at any scale, and Cloud Storage FUSE for mounting storage buckets as a file system. For workloads requiring lower latency, Parallelstore provides sub-millisecond access times, while Hyperdisk ML delivers high-performance storage specifically engineered for serving tasks.

To prevent costly delays in resource acquisition, Google Cloud emphasizes the importance of Dynamic Workload Scheduler and Future Reservations. These tools secure needed cloud resources in advance, guaranteeing availability when required while optimizing the procurement process for popular hardware components.

The final strategy addresses deployment efficiency through custom disk images. Rather than repeatedly configuring operating systems, GPU drivers, and AI frameworks from scratch, organizations can create and maintain custom disk images that allow new, fully-configured workers to deploy in seconds rather than hours.

AI cost management has become increasingly critical across industries. In response to the growing demand for more efficient and cost-effective AI infrastructure, both AWS and Microsoft Azure have also ramped up their efforts to support enterprise AI workloads. AWS has introduced new cost-aware tools within its SageMaker platform, including Managed Spot Training and model monitoring capabilities to help users optimize both performance and budget. Similarly, Azure is enhancing its AI offering through Azure Machine Learning with features like intelligent autoscaling, reserved capacity pricing, and seamless integration with Azure Kubernetes Service (AKS) for better workload orchestration.

Like Google Cloud, both AWS and Azure are emphasizing hybrid flexibility, storage optimization, and GPU acceleration to give enterprises more control over how they scale and spend. This convergence signals a competitive push across cloud providers to address the pressing challenge of AI cost management while still empowering innovation at scale.


Alliancebernstein L.P. Sells 5,503 Shares of MongoDB, Inc. (NASDAQ:MDB) – MarketBeat

MMS Founder
MMS RSS

Alliancebernstein L.P. lessened its position in shares of MongoDB, Inc. (NASDAQ:MDB) by 5.9% in the 4th quarter, according to its most recent filing with the Securities & Exchange Commission. The fund owned 87,001 shares of the company’s stock after selling 5,503 shares during the quarter. Alliancebernstein L.P. owned approximately 0.12% of MongoDB worth $20,255,000 at the end of the most recent quarter.

Several other large investors have also modified their holdings of the business. Norges Bank acquired a new stake in shares of MongoDB in the fourth quarter worth $189,584,000. Raymond James Financial Inc. acquired a new position in shares of MongoDB in the 4th quarter worth approximately $90,478,000. Amundi raised its holdings in shares of MongoDB by 86.2% during the fourth quarter. Amundi now owns 693,740 shares of the company’s stock worth $172,519,000 after purchasing an additional 321,186 shares during the period. Assenagon Asset Management S.A. grew its position in shares of MongoDB by 11,057.0% during the 4th quarter. Assenagon Asset Management S.A. now owns 296,889 shares of the company’s stock valued at $69,119,000 after buying an additional 294,228 shares during the last quarter. Finally, Pictet Asset Management Holding SA boosted its position in MongoDB by 69.1% during the 4th quarter. Pictet Asset Management Holding SA now owns 356,964 shares of the company’s stock valued at $83,105,000 after purchasing an additional 145,854 shares during the period. 89.29% of the stock is currently owned by institutional investors and hedge funds.
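The percentages in these filing summaries follow from the share counts: a purchase is measured against the position before buying (current holding minus shares bought), and a sale against the position before selling (current holding plus shares sold). A small sketch, assuming the reported quarterly change is the only change in each position; the function names are illustrative:

```python
def pct_increase(shares_after: int, shares_bought: int) -> float:
    """Percent growth of a position, relative to the holding before the purchase."""
    shares_before = shares_after - shares_bought
    return 100 * shares_bought / shares_before


def pct_decrease(shares_after: int, shares_sold: int) -> float:
    """Percent reduction of a position, relative to the holding before the sale."""
    shares_before = shares_after + shares_sold
    return 100 * shares_sold / shares_before


# Figures taken from the paragraphs above.
print(round(pct_increase(693_740, 321_186), 1))  # Amundi
print(round(pct_increase(296_889, 294_228), 1))  # Assenagon
print(round(pct_decrease(87_001, 5_503), 1))     # Alliancebernstein
```

The very large Assenagon figure comes from a small starting position: roughly 2,661 shares before the quarter's purchase.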

MongoDB Stock Performance

Shares of MDB stock traded up $25.49 during trading hours on Wednesday, reaching $171.34. The company had a trading volume of 3,611,865 shares, compared to its average volume of 1,790,746. The firm has a market cap of $13.91 billion, a PE ratio of -62.53 and a beta of 1.49. MongoDB, Inc. has a one year low of $140.78 and a one year high of $387.19. The business has a fifty day moving average price of $228.92 and a 200-day moving average price of $259.09.

MongoDB (NASDAQ:MDB) last announced its quarterly earnings data on Wednesday, March 5th. The company reported $0.19 EPS for the quarter, missing the consensus estimate of $0.64 by ($0.45). MongoDB had a negative net margin of 10.46% and a negative return on equity of 12.22%. The business had revenue of $548.40 million during the quarter, compared to analysts’ expectations of $519.65 million. During the same quarter in the previous year, the firm earned $0.86 earnings per share. Equities research analysts forecast that MongoDB, Inc. will post -1.78 EPS for the current year.

Insider Activity at MongoDB

In other news, CAO Thomas Bull sold 301 shares of the company’s stock in a transaction dated Wednesday, April 2nd. The shares were sold at an average price of $173.25, for a total transaction of $52,148.25. Following the transaction, the chief accounting officer now owns 14,598 shares in the company, valued at $2,529,103.50. This represents a 2.02% decrease in their ownership of the stock. The transaction was disclosed in a legal filing with the SEC. Also, insider Cedric Pech sold 1,690 shares of the company’s stock in a transaction that occurred on Wednesday, April 2nd. The shares were sold at an average price of $173.26, for a total transaction of $292,809.40. Following the transaction, the insider now owns 57,634 shares in the company, valued at $9,985,666.84. The trade was a 2.85% decrease in their position. This sale was also publicly disclosed. Insiders sold 58,060 shares of company stock worth $13,461,875 in the last 90 days. 3.60% of the stock is owned by corporate insiders.

Wall Street Analysts Forecast Growth

MDB has been the topic of a number of research analyst reports. Citigroup lowered their price objective on shares of MongoDB from $430.00 to $330.00 and set a “buy” rating on the stock in a research note on Tuesday, April 1st. Daiwa America upgraded MongoDB to a “strong-buy” rating in a research note on Tuesday, April 1st. JMP Securities reissued a “market outperform” rating and issued a $380.00 target price on shares of MongoDB in a research note on Wednesday, December 11th. Oppenheimer reduced their price target on shares of MongoDB from $400.00 to $330.00 and set an “outperform” rating on the stock in a research report on Thursday, March 6th. Finally, Needham & Company LLC reduced their price objective on MongoDB from $415.00 to $270.00 and set a “buy” rating on the stock in a report on Thursday, March 6th. Seven equities research analysts have rated the stock with a hold rating, twenty-four have given a buy rating and one has assigned a strong buy rating to the stock. According to MarketBeat.com, MongoDB has a consensus rating of “Moderate Buy” and an average target price of $312.84.


MongoDB Company Profile


MongoDB, Inc., together with its subsidiaries, provides a general-purpose database platform worldwide. The company provides MongoDB Atlas, a hosted multi-cloud database-as-a-service solution; MongoDB Enterprise Advanced, a commercial database server for enterprise customers to run in the cloud, on-premises, or in a hybrid environment; and Community Server, a free-to-download version of its database, which includes the functionality that developers need to get started with MongoDB.


Redis Launches Vector Sets and a New Tool for Semantic Caching of LLM Responses

MMS Founder
MMS RSS

Redis, the company behind the eponymous in-memory key-value database, mostly made news in recent months because of its license change, which resulted in the launch of the Valkey project. Now, Redis is hoping to change the conversation a bit with the launch of two new AI-centric products ahead of the launch of Redis 8 on May 1. The first is a new caching tool, LangCache, which allows developers to add large language model (LLM) response caching to their applications. The second is a new data type, vector sets, for…
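Semantic caching reuses a stored LLM response when a new prompt is close enough in meaning to an earlier one, rather than requiring an exact string match. The sketch below illustrates the idea only; it is not LangCache's API. So that it runs without a model, it assumes a toy word-count "embedding" and a cosine-similarity threshold of 0.8; `SemanticCache`, `answer`, and `call_llm` are hypothetical names. A real system would use an embedding model and a vector index.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)  # Counter returns 0 for missing keys
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []  # (prompt embedding, response)

    def get(self, prompt: str) -> Optional[str]:
        """Return the response of the most similar cached prompt, if similar enough."""
        query = embed(prompt)
        best, best_score = None, 0.0
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best, best_score = response, score
        return best if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))


def answer(cache: SemanticCache, prompt: str, call_llm: Callable[[str], str]) -> str:
    """Serve from the cache on a semantic hit; otherwise call the model and cache the result."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = call_llm(prompt)
    cache.put(prompt, response)
    return response
```

The threshold is the key trade-off: set it too low and unrelated prompts receive stale answers; set it too high and the cache rarely hits.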


Why Use a NoSQL Database for AI? There Are Many Great Reasons – The New Stack

MMS Founder
MMS RSS


2025-04-08 08:00:28


Sponsored post contributed by Couchbase.

NoSQL databases play a key role in facilitating AI adoption. A flexible platform with memory, persistence and traceability is needed to power AI agents.


Apr 8th, 2025 8:00am by



Image from whiteMocca on Shutterstock.

With AI increasingly becoming table stakes for organizations, let’s dig into the role NoSQL databases play in facilitating AI adoption, and why a flexible developer data platform with memory, persistence and traceability is needed to power AI agents.

Starting With the Basics on NoSQL

NoSQL databases, short for “Not only SQL,” were developed to address modern data storage and scalability needs that traditional relational databases struggle with.

Unlike relational databases, which were designed to minimize data duplication and scale vertically, NoSQL databases use flexible data models such as key-value, document, column, time series and graph formats to accommodate web, mobile and IoT applications. These databases operate as primary content stores, allowing flexible data access and high availability through horizontal scaling across distributed systems.

Organizations choose NoSQL for its ability to support dynamic, real-time and personalized user experiences, adapting quickly to changing application requirements. NoSQL databases, particularly document-oriented ones, use the JSON format, enabling agile development without rigid schemas.

Additionally, modern NoSQL systems incorporate relational database features, including ACID (atomicity, consistency, isolation and durability) transactions and SQL-like querying, while maintaining scalability, high availability and efficiency. This convergence of relational and NoSQL capabilities simplifies database management, making NoSQL the preferred choice for modern, flexible cloud computing and distributed data applications.

AI Agents Are Operational Applications

AI agents, which automate traditional software and human workflows, require real-time data access for task execution and to support reasoning.

Unlike traditional analytical databases, which are often relational, highly structured and process data in delayed batches, operational databases enable low-latency, high-frequency read and write operations, which are essential for AI-driven applications. In the retail industry, for instance, AI agents can use diverse operational data such as user profiles, inventory, promotions, product vector embeddings and more for powerful semantic search.

To function effectively, agents must integrate multiple data formats, engage with models, cache conversations and maintain those interaction histories. The database needs to support high-velocity workloads, ensuring AI agents remain responsive and scalable.

AI Needs Access to a Variety of Data in a Flexible Way

AI agents require fast data access and a diverse range of data to operate effectively, especially in real-time decision-making scenarios. They need both structured data (such as databases and spreadsheets) and unstructured data (such as text, images and audio) to generate powerful insights and responses. The ability to quickly pull relevant data enables AI to produce responses that are the most contextually relevant to the user and make predictions with minimal latency.

Additionally, real-time data sharing through APIs and functions allows AI systems to integrate seamlessly with other platforms, ensuring up-to-date information flow and facilitating dynamic, automated decision-making. Without rapid access to varied data sources, AI agents risk providing outdated, incomplete or inaccurate responses, limiting their effectiveness, whether supporting internal or customer-facing applications.

Multiagent AI Systems Need To Work Together

In enterprise environments, multiagent AI systems can efficiently handle dynamic workloads and deliver prompt responses but will need real-time performance and scalability. By collaborating through distributed shared memory, these agents can swiftly access and update shared data, enhancing coordination and reducing communication overhead. Implementing low-latency, event-driven synchronization mechanisms ensures that agents remain aligned and can react promptly to changes, thereby maintaining system coherence and responsiveness.

Techniques such as array-based queuing locks can be employed to manage access to shared resources, minimizing contention and ensuring fairness among agents. Additionally, communication protocols like the message passing interface facilitate efficient data exchange and synchronization across distributed systems. Collectively, these strategies enable multiagent AI systems to operate effectively in complex, large-scale enterprise settings.
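An array-based queuing lock (Anderson's lock) gives each waiter its own slot in a fixed-size array, so waiters spin on separate flags and acquire the lock in arrival (FIFO) order, which is where its fairness comes from. The sketch below is a teaching illustration in Python, not a production lock: `ArrayQueueLock` is a hypothetical name, the `threading.Lock` around the ticket counter stands in for the atomic fetch-and-increment a real implementation would use, and `capacity` must be at least the number of threads that can wait at once.

```python
import threading
import time


class ArrayQueueLock:
    """Anderson-style array-based queue lock: each waiter spins on its own slot."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._flags = [True] + [False] * (capacity - 1)  # slot 0 may proceed first
        self._next_ticket = 0
        self._ticket_guard = threading.Lock()  # emulates atomic fetch-and-increment
        self._local = threading.local()        # remembers which slot this thread holds

    def acquire(self) -> None:
        with self._ticket_guard:
            slot = self._next_ticket % self._capacity
            self._next_ticket += 1
        while not self._flags[slot]:  # spin only on this thread's own flag
            time.sleep(0)             # yield so the holder can make progress
        self._local.slot = slot

    def release(self) -> None:
        slot = self._local.slot
        self._flags[slot] = False                        # reset own slot for reuse
        self._flags[(slot + 1) % self._capacity] = True  # hand off to the next waiter

    def __enter__(self) -> "ArrayQueueLock":
        self.acquire()
        return self

    def __exit__(self, *exc) -> bool:
        self.release()
        return False
```

Because each waiter watches a different array element, a release wakes exactly one successor. On real hardware this avoids the cache-line contention of every thread spinning on a single shared flag.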

Memory and Persistence Together

Maintaining short-term, long-term, procedural and shared memory is critical for AI agents to ensure contextual awareness, continuity and efficiency in decision-making. Short-term memory (caching) allows AI to rapidly retrieve recent interactions and computations, reducing redundant processing and improving responsiveness. Long-term memory (persistence) ensures AI agents retain historical context, enabling them to learn from past interactions and refine their outputs over time.

Having both in a unified platform streamlines performance, as agents can seamlessly transition between fast temporary access and deep retained knowledge. Additionally, AI agents need structured storage for critical information such as API definitions, function calls and prompts, allowing them to interact efficiently with data, execute the correct actions and ensure consistency across different sessions. By integrating these memory types, AI systems can provide more intelligent, context-aware and adaptive interactions while optimizing computational efficiency.
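To make the short-term/long-term split concrete, here is a minimal single-process sketch, assuming an in-memory LRU cache as short-term memory and SQLite as the persistent long-term store. `AgentMemory` and its methods are hypothetical names for illustration; this is not Couchbase's API, and a real agent platform would run both tiers as distributed services.

```python
import sqlite3
from collections import OrderedDict
from typing import Optional


class AgentMemory:
    """Write-through memory: a small LRU cache in front of a persistent SQLite table."""

    def __init__(self, path: str = ":memory:", cache_size: int = 128):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memory (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self.cache: "OrderedDict[str, str]" = OrderedDict()
        self.cache_size = cache_size

    def remember(self, key: str, value: str) -> None:
        # Persist first (long-term), then update the fast tier (short-term).
        with self.db:  # commits the transaction on success
            self.db.execute(
                "INSERT OR REPLACE INTO memory (key, value) VALUES (?, ?)", (key, value)
            )
        self._cache_put(key, value)

    def recall(self, key: str) -> Optional[str]:
        if key in self.cache:              # short-term hit
            self.cache.move_to_end(key)
            return self.cache[key]
        row = self.db.execute("SELECT value FROM memory WHERE key = ?", (key,)).fetchone()
        if row is None:
            return None
        self._cache_put(key, row[0])       # promote long-term memory into the cache
        return row[0]

    def _cache_put(self, key: str, value: str) -> None:
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
```

Writing through to the persistent store before the cache means an evicted or lost cache entry can always be reloaded, at the cost of a slower write path.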

Governance and Traceability

Governance and traceability are essential for AI agents, particularly in enterprise environments where compliance, accountability and safe AI behavior are critical. Organizations must ensure that AI-driven decisions are transparent, auditable and explainable to meet regulatory requirements, mitigate risks and build trust in AI systems. Traceability allows enterprises to monitor how AI models reach conclusions, making it possible to detect biases, errors or security vulnerabilities.

By implementing robust governance frameworks, businesses can enforce ethical AI use, prevent unauthorized access or misuse, and maintain consistency in decision-making. Additionally, enterprises need auditable logs of AI interactions, ensuring that every decision can be reviewed, verified and improved over time. Without proper governance and traceability, AI systems may pose compliance risks, erode trust and fail to align with business objectives and legal standards.
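One common technique for tamper-evident, auditable logs is hash chaining: each entry stores a hash of its own content together with the previous entry's hash, so editing any past record breaks verification from that point on. A minimal sketch, assuming JSON-serializable records; `AuditLog` and the record fields are hypothetical, and the article does not prescribe this particular mechanism.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: Dict[str, Any]) -> str:
    # Canonical JSON (sorted keys, no extra spaces) so the same record always hashes the same.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log where each entry commits to everything before it."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []

    def append(self, record: Dict[str, Any]) -> None:
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        self.entries.append({"record": record, "hash": _digest(prev, record)})

    def verify(self) -> bool:
        """Recompute the chain; any edited, removed, or reordered entry makes this False."""
        prev = GENESIS
        for entry in self.entries:
            if entry["hash"] != _digest(prev, entry["record"]):
                return False
            prev = entry["hash"]
        return True
```

Hash chaining detects tampering but does not prevent it. Production systems usually also store the latest hash somewhere the log writer cannot modify.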

The Challenge of Point Solutions

Reliable and unified data architectures are key to successful AI projects. Using multiple database and data cache systems for AI agents creates significant challenges by complicating data access, hindering collaboration, disrupting memory integration, limiting flexibility, increasing operational expenses and undermining governance. Organizations that deploy multiple single-purpose database solutions also introduce data sprawl, risk and complexity, making it difficult to effectively use AI, minimize AI confusion, trace the source of AI hallucinations and debug incorrect variables.

Data complexity is AI’s enemy because AI is imprecise to begin with. Using AI within a complex, multi-database architecture produces unreliable results because the risk of feeding AI models inconsistent or incorrect data is too high.

AI agents require fast, seamless access to diverse data for real-time decisions, but drawing data from disparate systems introduces inefficiencies, backtracing issues and delays. Collaboration falters as multiagent systems face compatibility issues, slowing communication and coordination. Memory management suffers from fragmentation, breaking the continuity needed for contextual awareness and performance. Flexibility is curtailed, delaying adaptation to new needs or features, while governance and compliance become harder to enforce due to inconsistent monitoring and traceability.

By simplifying the data management activities that surround AI, a unified, multipurpose database resolves these issues, enabling reliable, scalable and compliant AI operations.

A NoSQL Data Platform To Support Agentic AI 

Tens of thousands of organizations have adopted NoSQL, making it their choice for modern applications. Supporting AI agents with fast and flexible NoSQL data is the next logical step on that path.

To run critical applications, many enterprises choose Couchbase to improve resiliency, performance and stability while reducing risk, data sprawl and total cost of ownership. Couchbase is the developer data platform that powers critical applications in our AI world. Find out more about how Couchbase Capella and AI services help organizations accelerate the development of agentic AI applications. Start using Capella today for free and sign up for the private preview of Capella AI Services.


Kafka 4.0: KRaft Simplifies Architecture

MMS Founder
MMS Steef-Jan Wiggers

Apache Kafka has reached a significant milestone with the release of version 4.0, a major update that introduces a host of new features and improvements, most notably the default operation in KRaft mode, which, according to Confluent’s documentation, eliminates the dependency on Apache ZooKeeper.

For over a decade, ZooKeeper has served as the backbone of Kafka, and the community has expressed gratitude for its contributions. However, the move to KRaft by default in Kafka 4.0 streamlines deployment and management by removing the need to maintain a separate ZooKeeper ensemble.

(Source: Confluent documentation)

Lalit Moharana, an AWS Community Builder, posted on LinkedIn:

ZooKeeper is stepping aside as Apache Kafka adopts KRaft with the upcoming Kafka 4.0 release, marking the end of a 14-year partnership. This shift simplifies Kafka’s architecture by ditching the separate ZooKeeper system, boosting scalability, and paving the way for a self-sufficient future – all thanks to KRaft’s Raft protocol magic.

In addition:

Why the Change? ZooKeeper’s overhead and limits (think 100,000+ partitions) couldn’t keep up with Kafka’s growth. And:

KRaft Benefits: One system, millions of partitions, faster recovery – Kafka’s ready to soar!

Beyond the architectural shift, Kafka 4.0 brings the general availability of KIP-848, which introduces a next-generation consumer group protocol. This new protocol is designed to dramatically improve rebalance performance, reducing downtime and latency for consumer groups, especially in large-scale environments. By minimizing “stop-the-world” rebalances, Kafka aims to provide a more stable and responsive data streaming experience. The new protocol is enabled by default on the server side, with consumers needing to opt in by setting group.protocol=consumer.
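Opting a Java consumer in to the new protocol is a client configuration change. The helper below is a small sketch that renders a consumer `.properties` file. `bootstrap.servers`, `group.id`, the deserializer settings, and `group.protocol` are standard Kafka consumer configs; the helper function, host, and group name are hypothetical placeholders.

```python
from typing import Dict


def consumer_properties(bootstrap_servers: str, group_id: str, use_new_protocol: bool = True) -> str:
    """Render Kafka consumer settings as a Java .properties file body."""
    props: Dict[str, str] = {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "key.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
        "value.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
    }
    if use_new_protocol:
        # Opt in to the KIP-848 consumer group protocol; the client default stays "classic".
        props["group.protocol"] = "consumer"
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


print(consumer_properties("localhost:9092", "payments-consumers"))
```

Keeping the opt-in behind a flag makes it easy to roll the new protocol out one consumer group at a time, since the brokers already support it by default.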

In a Hacker News thread, a respondent commented:

One thing I immediately noticed after switching from SNS/SQS to Kafka was its speed. Messages seem to get sent/received almost immediately.

Furthermore, Kafka 4.0 offers early access to Queues for Kafka (KIP-932). This feature introduces the concept of “share groups” to enable cooperative consumption using regular Kafka topics, effectively allowing Kafka to support traditional queue semantics. While not a direct addition of a “queue” data structure, this enhancement expands Kafka’s versatility, making it suitable for a broader range of messaging use cases, particularly those requiring point-to-point messaging patterns akin to durable shared subscriptions.
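The difference between a classic consumer group and a share group can be illustrated with a toy model. This is a conceptual simulation under simplifying assumptions, not Kafka's implementation: the function names are hypothetical, round-robin stands in for "whichever consumer fetches next", and the model ignores the per-record acknowledgement and redelivery that the real feature manages. With one partition and three consumers, a classic group leaves two consumers idle, while share-group-style delivery spreads records across all three.

```python
from itertools import cycle
from typing import Dict, List


def consumer_group_delivery(records_by_partition: Dict[int, List[str]],
                            consumers: List[str]) -> Dict[str, List[str]]:
    """Classic consumer group: each partition is owned by exactly one consumer."""
    delivered: Dict[str, List[str]] = {c: [] for c in consumers}
    for i, (_, records) in enumerate(sorted(records_by_partition.items())):
        owner = consumers[i % len(consumers)]  # extra consumers get no partition
        delivered[owner].extend(records)
    return delivered


def share_group_delivery(records_by_partition: Dict[int, List[str]],
                         consumers: List[str]) -> Dict[str, List[str]]:
    """Share-group-style model: individual records go to whichever consumer is next."""
    delivered: Dict[str, List[str]] = {c: [] for c in consumers}
    turn = cycle(consumers)
    for _, records in sorted(records_by_partition.items()):
        for record in records:
            delivered[next(turn)].append(record)
    return delivered


topic = {0: ["r1", "r2", "r3"]}  # a single partition
workers = ["a", "b", "c"]
print(consumer_group_delivery(topic, workers))
print(share_group_delivery(topic, workers))
```

The trade-off is ordering: a classic group preserves per-partition order within one consumer, while share groups give up that guarantee in exchange for scaling consumers beyond the partition count.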

In a LinkedIn post, Govindan Gopalan, an AI & Data Engineering Leader at IBM, concluded:

Early queue support (KIP-932) introduces point-to-point messaging, expanding Kafka’s use cases beyond traditional publish-subscribe workflows.

This major release marks a significant step forward in platform modernization. As part of its evolution, Kafka 4.0 has removed APIs deprecated for at least 12 months. Furthermore, it updates the minimum Java requirements, with Kafka Clients and Kafka Streams now requiring Java 11, and Kafka Brokers, Connect, and Tools requiring Java 17. This move encourages the adoption of newer Java features and aligns Kafka with more current technology stacks. The release also updates the minimum supported client and broker versions (KIP-896) and defines new baseline requirements for supported upgrade paths, as detailed in KIP-1124.


Presentation: Thriving Through Change: Leading Through Uncertainty

MMS Founder
MMS Jennifer Davis

Transcript

Davis: I’m going to be focusing on the positive elements of this. I do not want to talk in depth about change. There has been a lot of change over the last few years, whether from COVID or from industry changes that have caused fear and anxiety in workplaces. There is also positive change, like promotions, and there are reorgs that are negative, or can sometimes be negative, where you don’t really know what’s happening. I’m here to talk to you about thriving through change and being prepared for change. I’m Jennifer. I am an engineering manager at Google. I’m also an author and a builder of communities. I value connection and creating opportunities for people to connect, like we have here with all of our unconferences and our talks to inspire and inform.

Context

Just like any team, my team wants to deliver more value faster. That’s not surprising. The key thing is, it’s not about one individual or one function moving faster. Within DevRel, we have to think about all the cars, or whatever else we’re building, and be part of that process. We have to think for our team, but also for the larger org, and be aware of everything that’s in the process of being built. So we ask: how do we maximize the value we bring? I’m in DevRel engineering. What does that mean? We build code. We write code. Our value is in code that our users find valuable. Just writing code isn’t helpful on its own. Like any other company, we’re not writing code for fun, unless we are. There’s a core value to the code we’re writing.

We think about the things that we want to minimize: building the wrong things, spending a lot of effort writing code for products that are never going to ship, having a lot of things in progress and never delivering them, constantly switching context, and working on technical debt. How many of you have big missions to work on technical debt? I don’t like those, because the existence of a software artifact means it inherently has technical debt, and unless it is valuable to my users, I don’t really care. Unless it’s security; then I have to fix it. I don’t know if you have this experience, but for our team, if we write code, people are going to copy and paste it straight into production, maybe without examining it or understanding the context. Because it’s Google code, of course, it’s perfect.

If it’s a security issue, I want to fix it. We also don’t want to repeat the same problems over and over again without learning from those mistakes. And we don’t want samples that don’t follow established practices, because they end up causing more technical debt.

One of the things I really love about my job is that we get to hold the banner of the zeroth customer. What does that mean? We sit right there and think about, what are teams really building? How are they building? What’s the context that they’re building them in? It’s just so thrilling to be thinking about reliability, operability, sustainability, all of the abilities, and then bringing that into the world and giving it to people so that they can copy and paste it, and try it out and learn. One of the examples I’m sharing here is the Avocano solutions. Avocano is a dynamic website, like, how would you build a dynamic website using Google Cloud Services? We talk through all the different constraints that we have and what choices we were making. Is it the one way? No. There are multiple ways.

Actually, we provide different solutions on how you build a dynamic website. It’s a core solution that we think is important for people to understand. Just like solutions, and there’s more than one way to do it, everything I’m going to share, there is more than one way to team. I’ve tried to focus on some of the things that are important from my context and my team, focused on DevRel engineering. There are different components, and you may not find that my paths are your paths. I’m hoping that you can take some things from it.

I’m going to talk a little bit about change. Last year, after years of looking at our organization, we started a transformation, and it has only just started. We recognized the ever-increasing challenge of what our org, DevRel, and specifically DevRel engineering, was trying to do. How many different APIs? How many products launching? How can we actually deliver samples in a meaningful way? In the previous org, there were dedicated product teams, a combination of product and engineering, that would map to a DevRel team. DevRel engineering teams had multiple products that they were responsible for. My product areas were serverless, DevOps, and orchestration. From that, we had to figure out what the highest priority was. None of those product teams were talking to each other; we were the interface. When you think about DevOps teams and all the things, that’s what we were. There were multiple DevRel engineering teams responsible for different sets of samples.

Collectively, we had virtual teams where we would share the burden of the platform management. How do we write samples in these four main languages? What do we test? What’s our infrastructure? This is a process and set of processes and tools that have evolved over 10 years. There’s a lot of cruft in it and a lot of friction. Ownership of samples all sat in the DevRel engineering team, a team that went from 100 people to 50 people, to 23 people, to 14 people. Last year, they announced a reorg, and my team became 11 people.

The mission is to build a lightweight platform that enables broader stewardship of samples, so that the product teams, external contributors, and all of DevRel, including tech writing, can own and drive sample contribution. The goal is not to apply some standard SLO across all samples, because there are different types of samples. There are how-tos. There are concepts. There are API SDK samples where people really need them to be perfect and do exactly the right thing, because there’s nothing worse than looking something up and finding that it doesn’t work. We need a bigger set of samples, but we cannot manage it the way we have been managing it. That’s the change, or where we want to go.

As I was planning out this talk, I started with one context. As the reorg hit, I realized I had to change how I talk about this, because everything was changing. Yet my team has been prepared to navigate this change because of the things that I’m going to talk about. They were ready and empowered, and they have that autonomy. Other folks have talked about DORA. DORA is research that’s been going on for over nine years now. Effectively, there’s a set of metrics and a set of capabilities. It’s a choose-your-own-adventure: which capabilities are you trying to improve to increase the metrics that matter to your organization, so that you can have a high-performing team? It’s a way to categorize and evaluate how well your team is performing. It’s not meant to say we’re better than this team or that team. High-performing has a lot of feels to it, but the goal is to create context that helps people deliver.
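DORA's delivery metrics are computed from deployment data. A minimal sketch, assuming a hypothetical list of deployment records with commit, deploy, and restore timestamps; the definitions are simplified (for example, time to restore is measured from the failed deployment), so treat this as an illustration rather than DORA's official methodology.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Deployment:
    committed_at: datetime               # when the change was committed
    deployed_at: datetime                # when it reached production
    failed: bool = False                 # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: List[Deployment], period_days: int) -> Dict[str, object]:
    """Simplified versions of the four DORA delivery metrics over one period."""
    n = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": n / period_days,
        "median_lead_time": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / n if n else None,
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used rather than means so that a single unusually slow change does not dominate the lead-time or restore figures.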

Embrace Functional Leadership

There are four areas I want to talk about: the things I have used to make change something that, yes, sometimes sucks, but that we have the power to help teams navigate. We don’t get to control all the changes. As much as we like to have CABs and change boards limiting what happens in production, we don’t control all change. Let’s start with functional leadership. I want to talk about leadership specifically, because words are hard, and we use the same word to mean different things at different times. When I say leadership, you might not be thinking about it the same way, and that’s ok. What is leadership? Often, it’s defined as someone following you, which makes you the leader. That’s not very valuable. Sometimes the manager is the leader, and sometimes they’re not. Some people lead by example, and some people lead by coaching, and both are completely valid approaches.

In complex environments, leadership is about enabling people to come together to figure out the problems, share all the context, and work out what’s ambiguous and what’s possible. We need lots of different perspectives to enable strong choices. Not right choices, but strong choices, because generally there is no right answer. There are lots of wrong ones, but there’s not one right answer. We need to help people understand. That’s what a leader is. Mary Parker Follett was an American management consultant who pioneered a lot of organizational theory and organizational management. She has some really amazing writing out there, and I highly encourage reading it. It’s very under-read, but she’s quoted a lot. Her concepts on functional leadership are amazing to me. When I first became a manager, at one point I wondered, am I inventing some new way of doing things?

Then I started reading her, and I realized, no, we discovered this about 100 years ago, and we’re just not talking about it. She said, "Leadership is not defined by the exercise of power, but by the capacity to increase the sense of power among those led". What does that even mean? It means we need to focus on the individuals first, and we need to empower everyone to be a leader. Everyone can be a leader, but then who’s following? That’s not the point of leadership. Some people talk about situational leadership, with people stepping up into leadership as needed, and that’s part of it too. Every one of us can be a leader. When I approach a new team, what do I do? First, I don’t change anything, because there’s already enough change happening. I figure out the roles and responsibilities that everyone already thinks they have. I’m not going to make assumptions, no matter what anyone says to me, whether another manager or any other person, that this person can’t do this or that person can’t do that, because I believe in my heart that everyone can be a leader.

As I build these relationships, I find out: what are the motivations, goals, and worries each individual has? What is their context? What are they hoping and dreaming of? What do they want out of this? It doesn’t matter if they’re only here for the money, or if they’re altruistic and want to improve the world.

All that matters is that I understand their context and their capabilities. Maybe there’s some context impeding them that I can help with, and maybe it’s just something I need to recognize, so I don’t steer the conversation toward something they can’t do right now. I keep building up this knowledge about them. I see them in action. I understand more: what do they bring to the workplace? What are they bringing to the team? What are the different strengths and weaknesses everybody has? Where do I have opportunities to empower them to grow, to be more? How do I see them in these different lights? I can watch over time as these things change, and I can help connect them with the people and the opportunities that best suit them.

I need to build trust. A lot of people hear, we’re going to talk about values, and think, none of that matters. Why are we talking about values again? Everybody wants to talk about values. It’s important. Even if you’re talking to a team that hates talking about values, it’s important. It’s important to talk about the company’s culture and values, but it’s also important to talk about values as individuals. As a manager, I am the first to step up to the plate, be vulnerable, and share. The three core values I have inform how I want to be present in the world and present for my teams. Authenticity: I want to commit to who I am. Who I’m presenting to you, this is me. This is what I believe, and this is what I value. If I say I’m going to do something, I’m going to do it. Kindness: I am going to be generous and thoughtful. I am going to treat every interaction, to the best of my ability, with kindness. I expect and want that from the people and the teams I work with.

Sometimes kindness is giving people feedback they don’t know they need: crucial feedback. Kindness is not the same thing as niceness. Niceness means not telling someone, "I know that you want to do this thing, but you know how you did it this way"; kindness means having those conversations, because feedback is a gift. Trust: I value trust. When people talk to me, or when I talk to them, I do not assume that I have your trust. I will give you trust, but I am not going to assume you’re giving it to me until you’re ready. I’m your manager, and I’m not going to assume anything. When you’re ready, you can tell me, and that’s good. I value your trust, and I’m going to work hard not to break it.

All of these things are going to build together. The conversations that you can have by talking about your values are tremendous, because everybody has different things that they hold true to themselves. Helping people be authentic to themselves means they bring their best selves and their best perspectives to build the best things.

Functional leadership is about delegation. I care that everyone can achieve this goal of being a leader if they want to. I want to foster that capability, so I’m going to identify ways to make it possible. By doing this, I can respond to change. Energies ebb and flow. People need time off. I don’t want any single points of failure in my organization, where we end up in a fire drill because nobody else knows something, or we have to call someone who’s on leave. No. I want everyone to be enabled and empowered. I want to be able to take time off without checking my phone or my email, and I want to make that same commitment to the people who report to me. I delegate leadership. I also clearly define roles and responsibilities, because I want people to understand what decisions they get to make and what kind of autonomy they have. If they need help, I’m going to coach them.

Enable Healthy Conflict

A key challenge we have is conflict. Healthy conflict is an important part of team cohesion and finding great outcomes. What is healthy conflict? It’s disagreement that fosters creativity and personal development and builds stronger bonds. If we don’t address unhealthy conflict, we can push more pain onto the people who already have so much stacked against them. Within a team, you can recognize healthy conflict when you have open communication, people give constructive feedback in a timely manner, and folks are open to other people’s ideas instead of immediately dismissing them. It’s so crucial. I’m emphasizing this because I’ve dealt with these challenges, and it’s really hard to preserve trust and foster psychological safety if we don’t address it when someone is being contemptuous or disrespectful.

Interteam conflict may not be something you have to worry about as much. In your context, it might even help team cohesion, because we can say, "They’re over there, they’re whatever. It’s all their fault". It builds bonds internally, but within DevRel, it’s not effective. We cannot do that. We have to work with so many product teams and so many other core functions: tech writing, engineering teams across the org, security, OSPO. We cannot be an us and them. When you’re trying to achieve larger, big-scale impacts, turning into us versus them literally prevents your virtual team from being effective. I don’t recommend it.

The core piece of this, and I tell folks this, is that you have to hold a lot of things in your mind that may not feel congruent. At the core, you don’t have to like how someone leads. In shorthand, we might say, I don’t like them. That doesn’t matter. Let’s not be sloppy: really, you don’t like their decisions or the way they’re approaching the work. You still have to respect them. As a leader, if you find that you have people on a team, whether it’s a small team or a larger one, who are showing contempt and not respecting each other, that is when you have to make corrections if things don’t resolve themselves, because you’re not going to have an effective team otherwise. Too often, we’re afraid to say something because we worry it means we’re not respecting people’s opinions or decisions. It doesn’t. It’s ok to disagree, but you have to commit. You have to move forward.

The first step might be to talk to them and say, I get it, but what can you agree on? What are the things you can agree on? It’s ok if it’s just that you both showed up and you both care passionately, because the biggest conflicts usually come from someone saying, I care passionately about this, it needs to be this way, I am right, you’re wrong, and your idea sucks. In some cultures, it’s ok to speak that way, and it’s completely acceptable. You know your context. In others, you have to call it out and address the concerns it raises. It’s ok for people to have different opinions. The way to navigate that is to have clear roles and responsibilities. I said it before, and I’ll say it again: who is responsible? Yes, everyone can be a leader, but not everyone all the time. A clear, articulated set of roles and responsibilities sets the context so people know when they need to disagree but commit.

Another way I navigate this (there might be another term for it) is establishing a common work item vocabulary. What does this mean? We have OKRs. Then we have projects that map to those OKRs and have impacts and business decisions. Everyone gets to choose how they do the work, but how we document the work needs to be consistent. That way, we can improve how we deliver results. If we look at the tree of work that maps from the OKRs down to the epics or projects, and down to the tasks, then at any point in time, as an IC, I can look at my tree of work and know what my work maps to. I know how my work is changing and impacting the org. I know, before I even start the work, what its value is. Am I achieving something that matters? That’s a core part of being happy. There are five metrics of happiness; autonomy is one, and doing work that matters is another.

The org could say, we’re changing priorities, but you still have the record of what you did and what you accomplished. It connects the day-to-day work to the strategic, larger-level projects. As a manager and a leader, it also does something cool for me: I can look across what the team is doing and make sure that all the projects are level-appropriate and the scope of work is appropriate. I can search our internal tools and ask, what’s the tree of work? Who’s getting what opportunities? Am I being fair and equitable and empowering people to take ownership? Are people getting leadership opportunities? Because everything has a scope and a set objective, I can tell. It also provides transparency and accountability, so people can see what everybody is doing, because we’re all talking in the same language. You might see that a person hasn’t done this or that. Are they on a big project? Are they the only one working on something? You have that visibility. It decreases the opportunities for people to have conflict about things that really don’t matter. Those things do matter, but there are ways to navigate them.
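The tree of work described above can be modeled as a simple parent/child structure. The following Python sketch is a hypothetical illustration, not the team’s internal tooling: `WorkItem`, `lineage`, and `items_by_owner` are invented names, and the three levels (OKR, project, task) stand in for whatever hierarchy your tracker uses. It answers the two questions from the talk: what does my task map to, and who is getting which opportunities?

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkItem:
    """Hypothetical work item; kind is "okr", "project", or "task"."""
    kind: str
    title: str
    owner: Optional[str] = None
    # Excluded from repr/eq so the child -> parent back-link doesn't recurse.
    parent: Optional["WorkItem"] = field(default=None, repr=False, compare=False)
    children: list["WorkItem"] = field(default_factory=list)

    def add(self, child: "WorkItem") -> "WorkItem":
        """Attach a child item and return it, so a tree can be built inline."""
        child.parent = self
        self.children.append(child)
        return child


def lineage(item: WorkItem) -> list[str]:
    """Walk up from any item to its root: what does this work map to?"""
    path: list[str] = []
    node: Optional[WorkItem] = item
    while node is not None:
        path.append(f"{node.kind}: {node.title}")
        node = node.parent
    return path[::-1]


def items_by_owner(root: WorkItem) -> dict[str, list[str]]:
    """Group every owned item under root by owner: who gets which opportunities?"""
    result: dict[str, list[str]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.owner is not None:
            result.setdefault(node.owner, []).append(node.title)
        stack.extend(node.children)
    return result
```

Because every item points to its parent, any task can be traced back to the OKR it serves, and a single traversal from the root gives the manager’s view of how work and ownership are distributed.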

To help people understand where they fit into the bigger picture, I think about these seven questions. The opportunities to measure, and the tools you have to engage and create these visualizations, will vary across the org. Every individual on your team and across your org needs to know: what’s going on, and what’s the state of our work? What needs attention right now? What’s urgent and important? Where is my place in this? Am I actually contributing to objectives that matter? What is meaningful to me, and how am I finding that meaning at work? How do I know what good looks like, and how do I know when I’m done? What’s the state of my team? How healthy is my team?

Establish Metrics that Matter

We’ve been talking about measurement, so let’s talk about metrics that matter. Back to DORA: again, nine years of research, lots of context, lots of community, and lots of people have contributed input. It’s distilled down into four key metrics: deployment frequency and lead time for changes, which measure throughput, and change failure rate and time to restore service, which measure stability. Back to my team: that’s data. I can keep that context in mind, but I need to actually think about what I’m doing now. I’ve got a new team. What am I supposed to measure? The first thing is to figure out what the problem is. I’m sharing some metrics here. This is all open source; all of our samples are open source, so there are no secrets. We have 12,963 samples. We have 7,288 distinct use cases. What does that mean? Each sample has an intent: what are you trying to show? That’s what those use cases are.
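To make the four keys concrete, here is a minimal Python sketch that computes them from a list of deployment records. The `Deployment` record and its fields are assumptions made for illustration, not a DORA data format; the sketch uses medians for lead time and time to restore, and reports deployment frequency as deploys per day over the reporting period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta: timedelta) -> float:
    return delta / timedelta(hours=1)


def dora_metrics(deploys: list[Deployment], period_days: int) -> dict[str, float]:
    """Compute the four DORA keys for deployments in one reporting period."""
    if not deploys or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    # Throughput: how often we deploy, and how long a change takes to reach production.
    frequency = len(deploys) / period_days
    lead_time = median(hours(d.deployed_at - d.committed_at) for d in deploys)

    # Stability: how often deployments fail, and how long recovery takes.
    failures = [d for d in deploys if d.failed]
    failure_rate = len(failures) / len(deploys)
    restore_times = [
        hours(d.restored_at - d.deployed_at) for d in failures if d.restored_at is not None
    ]
    time_to_restore = median(restore_times) if restore_times else 0.0

    return {
        "deployment_frequency_per_day": frequency,
        "lead_time_hours": lead_time,
        "change_failure_rate": failure_rate,
        "time_to_restore_hours": time_to_restore,
    }
```

For example, four deploys over two days with lead times of 2, 4, 6, and 8 hours, one of which failed and was restored an hour later, yield 2.0 deploys per day, a 5.0-hour median lead time, a 0.25 change failure rate, and a 1.0-hour time to restore.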

Our goal is for every use case to have a sample in at least the four core languages: Python, Node, Go, and Java. There are additional languages we try to support, and not every engineer knows every language. There are 11,320 files across 118 repositories. That was just our first assessment. Then we discovered that some samples exist only in documentation, not in GitHub, so we weren’t even looking at the full picture. What do I do with this? I started talking to the team, and you’ll see that these are similar: I literally copied and pasted this from my notes in the doc I share with the team. These are very similar to some of the DORA metrics, but this is the start, not the finish. Every team is going to have its own context for how it applies metrics, what it gathers, and what it measures.

For us, "system green" means we’ve tested the whole system and set of samples and know what state they’re in. If we learn that early, while we’re building, validating, and testing, it’s less expensive than discovering at the end, after launch, that customers are finding the problem. Ideally, customers don’t find your problems. Again, we’re DevRel, so we are the zeroth customer; if we are finding the problem, that’s a symptom of a larger issue. Another measurement is time to ship. That sounds obvious: in GitHub, you shipped it. Actually, no. Code isn’t valuable to us when we finish writing it; it’s valuable when people are using it. It has to be in docs. The measurement runs from the point the PR is opened to the point the sample is merged into docs and available for people to copy directly from the documentation. Rollbacks: someone submits a sample, we validate it, and it looks good.

Then we discover, no, this is terrible, we’ve got to roll back. You might wonder how that could happen with samples. It happens more often than you think, because context gets lost when different people do different things and make modifications you wouldn’t imagine. Think about the context of all the samples: we’re validating platforms, and there are lots of small axes of constraints, whether it’s the language version, the runtime version, or different third-party packages. It gets messy. Then there’s release cadence: how often are we adding to our samples catalog? All of these felt like a starting point. These are metrics I can measure right now, and we can see how we’re doing. Then we can make incremental changes and see how we’re adding value to the organization, while also improving processes and reducing how hard it is to add samples.
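The team’s starter metrics can be sketched the same way. In this hypothetical Python example, each `SampleChange` records when its PR was opened, when the sample became available in published docs (or `None` if it hasn’t yet), and whether it was rolled back; system green is modeled as the share of samples whose tests currently pass. These names and definitions are assumptions for illustration, not the team’s actual dashboards.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class SampleChange:
    """Hypothetical record of one sample contribution."""
    pr_opened_at: datetime
    published_in_docs_at: Optional[datetime] = None  # None until copyable from docs
    rolled_back: bool = False


def system_green(test_results: dict[str, bool]) -> float:
    """Share of samples whose tests currently pass (1.0 means fully green)."""
    return sum(test_results.values()) / len(test_results) if test_results else 0.0


def shipped(changes: list[SampleChange]) -> list[SampleChange]:
    """Only changes that actually reached the published docs count as shipped."""
    return [c for c in changes if c.published_in_docs_at is not None]


def time_to_ship_days(changes: list[SampleChange]) -> float:
    """Median days from PR opened to the sample being available in docs."""
    durations = [
        (c.published_in_docs_at - c.pr_opened_at) / timedelta(days=1) for c in shipped(changes)
    ]
    return median(durations) if durations else 0.0


def rollback_rate(changes: list[SampleChange]) -> float:
    """Share of shipped sample changes that later had to be rolled back."""
    done = shipped(changes)
    return sum(c.rolled_back for c in done) / len(done) if done else 0.0


def release_cadence_per_week(changes: list[SampleChange], period_days: int) -> float:
    """Samples added to the docs catalog per week over the reporting period."""
    return len(shipped(changes)) / (period_days / 7)
```

Measuring time to ship up to the docs publication date, rather than to the merge in GitHub, reflects the point above that a sample only counts once people can copy it from the documentation.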

I don’t get to just decide, here are the samples, and here are the metrics we’re going to measure. You have to get buy-in. You have to get your leadership to agree that yes, this is valuable. Sometimes your leadership will say, here is the metric we’re going to measure you against. Result: 50% of technical debt. Then you say, here is what we’re actually going to measure, because I don’t care about technical debt. I do, for certain values of technical debt, but what’s the value of me updating the Node version of a sample? How do I know my customers value that sample at all? How do I know we even have the right set of samples? Those are all things I really care about. That said, I will resolve all the security issues; I can do that.

Craft Supportive Environments

Let’s craft some supportive environments. First, take care of yourself, because leadership is hard. We don’t talk enough about how hard it is to be an effective leader. When we’re leading in an environment that encourages a generative culture, where we encourage people to talk and connect and we don’t have hierarchies, we’re often dealing with a lot of emotional baggage and trauma that people are carrying, and it can be hard. Often it’s not a right-or-wrong situation. We have to be ok with things being 80% good. Everyone is different and unique and special, and you need to know what helps you. I have gone through periods of really stressful work environments and had burnouts, and been too afraid to speak about it or to know how to deal with it. People will tell you, do this, do that. No. You have to find out what works for you.

For me, I know I’m having a problem if I’m forgetting to take a daily walk, because walking is where I process a lot. Walking is how I feel connected to my emotions and to what my body is feeling. I’ve also gotten into fiber arts and crochet, because it’s all math and creating three-dimensional objects, which is fantastic. The things that help you are going to change depending on where you are. Take care of yourself first.

You might think that’s too emotional and frou-frou, but no, it’s really important. Now let’s talk about teams. Think about the boundaries you’re setting. The first step is being explicit and repeating it over and over again. This is an example of the agenda description for my team meeting. It tells people explicitly, every time, that the goal of this meeting is to build shared information and open communication together. We’re going to discuss things. It’s not just a status meeting; we have async standups for that. Together, we’re establishing team norms. We’re going to align on a shared goal and mission. We’re going to share feedback, intentionally. When someone shares what they’re working on, we’re going to talk about it. You’re going to have opportunities to help, and we’re going to connect. We’re going to have team rituals. Every team should have a set of rituals. I introduced these to my new team in the last few months. We start our meetings with music. It gives people an opportunity.

Since we’re a distributed team, we don’t have water cooler moments or time to share little bits and pieces all the time. Music gives people an opportunity to share, here’s a set of music I like. We kick off meetings with a team temperature check. That gives people an opportunity to ask for support if they need it. It’s not all rainbows and unicorns in DevRel; there are challenging times. People get to talk about how they’re feeling and be heard, and that matters. We end the meeting with kudos. That gives everybody the opportunity to practice giving and receiving feedback. Instead of waiting to do it quarterly or once at the end of the year, we regularly think about and express the impact of how someone did something: please do more of it like that, because I appreciated it. That’s just the job. We also get to practice accepting feedback, which is hard sometimes.

Another ritual we don’t talk about a lot, but that we do, is goodbyes. I don’t own my people. I am sad when they leave. The opportunity to give them appreciation, to reflect, and to share this in a team setting is the most amazing gift. We don’t do this enough: say goodbye, and say we wish you well. Because the industry is so small, we are going to connect with these people again and again, and the opportunities are massive. Showing the people on the team that you valued those who are leaving sets the context and shows them that they are valued too.

Also, play. Play is awesome. Our work can be very connected to our identity, and games, especially role-playing games, give us a place to practice outside of it. I encourage people not to tie their sense of identity to their work, because what you do is different from who you are. I recognize that one of the key problems in conflict is often the feeling that questioning my work means questioning my identity, but it doesn’t. Role-playing helps us build team connectedness. It’s ok to practice failing at communication. You can do bad things in your game and it’s ok; everything is cool. Playing also helps you do really awesome things with your work.

This is part of my team, who are in Vegas right now at the Next conference. This is one of the cool apps we built. It’s a meta-inception of building a train and building an app: an app that builds, connects, and validates whether your app is going to work based on the services you select. The train moves or it doesn’t. You can grab the code and play with it yourself if you want to take a look. Play can lead to creativity, fun, and team cohesion, and it can inspire people in ways you wouldn’t expect.

Minimize the human toil. Let’s maximize what the machines do, rather than underusing the machines and giving all the toil to the humans. That’s my explicit boundary. Think about ways to automate things, starting with context. We use GitHub Actions, which is a great feature of GitHub if you are on GitHub; GitLab offers GitLab CI/CD for the same purpose. We automatically determine the context of each PR, so we can identify the best person to take it on. We also automate linting, because not everybody remembers to lint their code, and this gives them fast feedback that there’s a problem. They’re not waiting for a review to find out about a problem before they can even get their code merged. We’re also looking at our standards for how we write samples, because the goal of our samples is not just working code. We want to teach something. To be effective, we have to think about the concepts and how we’re applying them. We’re working toward automating this process too, checking that samples follow our guidelines.
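The talk doesn’t show the team’s workflow, but the idea of automatically attaching context to a PR can be sketched in Python: given the files a PR changes (for example, from `git diff --name-only` against the base branch), derive language labels and suggested reviewer groups. The path conventions, the `pr_context` function, and the team names below are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical convention: file extension -> sample language label.
LANGUAGE_LABELS = {".py": "python", ".js": "nodejs", ".ts": "nodejs", ".go": "go", ".java": "java"}

# Hypothetical convention: top-level directory -> reviewer group that owns it.
OWNERS = {"run": "cloud-run-samples", "functions": "functions-samples", "storage": "storage-samples"}


def pr_context(changed_files: list[str]) -> dict[str, list[str]]:
    """Derive language labels and suggested reviewer groups from a PR's changed files."""
    labels: set[str] = set()
    reviewers: set[str] = set()
    for name in changed_files:
        path = PurePosixPath(name)
        if path.suffix in LANGUAGE_LABELS:
            labels.add(LANGUAGE_LABELS[path.suffix])
        # The first directory in the path identifies the product area.
        if len(path.parts) > 1 and path.parts[0] in OWNERS:
            reviewers.add(OWNERS[path.parts[0]])
    return {"labels": sorted(labels), "reviewers": sorted(reviewers)}
```

A workflow step could then apply the labels and request reviews from the suggested groups. GitHub’s built-in CODEOWNERS file already covers reviewer routing natively, so a custom script like this is mainly useful for richer context such as language labels.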

We want to encourage cross-team projects, because cross-team projects are another way to create leadership opportunities and grow people. Also, when you have teams made up of different specialists, you end up achieving really amazing things. Take one of the key solutions I shared, the Avocano solution: my team built all the code, and we wrote the Neyer’s tutorial. We influenced the script for the demo video. We worked with all these different teams; if we had tried to do it alone, it wouldn’t have been as good.

Now we have a polished solution that can meet people where they are in their learning journey. That’s enabled and empowered by cross-team projects. Include training and education intentionally in planning. I always slice off the top 20% for some kind of training or education, and I factor it in in different ways. One of those ways is friction logging. Friction logging is a way to be customer zero and evaluate whether something is a good experience: if I didn’t know anything, or if I had this context, what would I experience? We provide that feedback to the product teams. We share and talk through it together, so it’s not just what I see in isolation. Now I have a shared context: here’s what Cloud Build does, here’s what Cloud Run does, and I share that with my team. We write down what decisions we made and why. Sometimes we work on really cool things and then get reprioritized onto something else.

These decision records help us set down context, so we don’t have to keep working on something; we can come back to it later. For example, Emblem is a multi-product app, and there’s now a new feature in Cloud Run with Cloud Deploy that we could refactor it to use. Our decision records would let us do that and tell us why we chose what we did at the time.

Have a show and tell. This is also a learning experience. It doesn’t need to be, here’s my polished demo, and now I’m going to present it to everybody. It does give people the opportunity to practice presenting, but more importantly, it gives people the opportunity to practice sharing what they’re doing: what did I learn today? Your team may have policies around open source contributions, so before you encourage your team to contribute, check your OSPO policies and make sure it’s ok. One of the things about working in open source (the Kubernetes project is amazing at this) is that it sets context for how to work with other teams. It empowers people and provides leadership opportunities where there are dynamics, lots of feelings, and personalities involved, but outside of your job, in a separate context. It helps and supports people. I encourage people to contribute to open source. All of this continuous learning matters for you as well.

Conferences like QCon give us the opportunity to connect with one another and talk about all these different problems we’re facing. I’ve had so many amazing conversations that I want to write about because I’m so inspired. One thing I’ve gotten immense value from is Ruth Malan’s technical leadership training. It’s not, here, let me tell you how to do your job. Instead, it’s a conversation. You might think, "That sounds terrible. I already talk to enough people. Why would I want to do that?" It’s leaders from across different industries and at different levels, including CEOs and CTOs. You get the opportunity to talk and connect with people working on different problems, and sometimes the same problems, and see things from a different perspective. Ruth provides context and a common language, but it’s not, let me tell you how to do this. It’s a conversation. I also recommend Lara Hogan’s management and leadership training.

A lot of her material you can work through on your own time, and she provides lots of examples of how to have hard conversations. I would also encourage folks to fill out the DORA survey, because your experience and how you solve problems matter. By filling out the survey, you help us validate and continue to evolve our understanding of how to increase software delivery performance: how can we measure it, and how can we improve it?

Recap

I’ve talked about quite a few things. Embracing functional leadership. Everyone can be empowered to be a leader. Enable healthy conflict, and watch for patterns of contempt or dismissiveness that can impact how your team performs. Establish metrics that matter to you and to your team and your org. Craft those supportive environments that are going to build and nurture and create sustainable paths where humans are doing valuable, impactful work that’s not burning them out.


Top 9 AI News and Stock Ratings Today – Insider Monkey

MMS Founder
MMS RSS

Artificial intelligence is the greatest investment opportunity of our lifetime. The time to invest in groundbreaking AI is now, and this stock is a steal!

My #1 AI stock pick has delivered solid gains since the beginning of 2025, while popular AI stocks like NVDA and AVGO lost around 25%.

The numbers speak for themselves: while giants of the AI world bleed, our AI pick delivers, showcasing the power of our research and the immense opportunity waiting to be seized.

The whispers are turning into roars.

Artificial intelligence isn’t science fiction anymore.


Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.


How Meta is Using a New Metric for Developers: Diff Authoring Time

MMS Founder
MMS Craig Risi

Tracking developer productivity metrics is essential for understanding and improving the efficiency of software development workflows. In fast-paced engineering environments, small inefficiencies can accumulate, impacting overall delivery timelines and code quality. By leveraging precise metrics, organizations can identify bottlenecks, assess the impact of new tools, and make data-driven decisions to enhance developer experience. 

Now we can add another metric to help track the development process: Diff Authoring Time (DAT). DAT was developed by engineers at Meta, who shared it in a recent episode of the Meta Tech Podcast. It measures how long developers take to author and submit changes, known as “diffs,” to the codebase. By tracking the time from the start of a code change to its submission, DAT offers insight into the efficiency of the development process and helps identify areas for improvement.

Implementing DAT involves integrating a privacy-aware telemetry system with version control systems, integrated development environments (IDEs), and operating systems. This setup allows for the precise measurement of the time developers spend authoring code changes without compromising privacy. The data collected through DAT enables Meta to conduct rigorous experiments aimed at enhancing developer productivity.

For instance, DAT has been instrumental in evaluating the impact of introducing a type-safe mocking framework in Hack, leading to a 14% improvement in authoring time. Additionally, the development of automatic memoization in the React compiler resulted in a 33% improvement, and efforts to promote code sharing have saved thousands of DAT hours annually, achieving over a 50% improvement.

The significance of DAT lies in its ability to provide a precise yet comprehensive measure of development productivity, facilitating data-driven decisions to enhance engineering efficiency. By aligning internal development workflows with an experiment-driven culture, DAT supports continuous improvement in software engineering practices.

As highlighted in the Meta Tech Podcast, engineers Sarita and Moritz discuss the challenges of measuring productivity, the implementation of DAT, and the new capabilities it unlocks for developers. Their insights underscore the importance of accurate productivity metrics in fostering an environment of continuous improvement within Meta’s engineering teams.

In summary, Diff Authoring Time serves as a tool for Meta to assess and enhance developer productivity, enabling the company to make informed decisions that streamline workflows and improve the overall efficiency of its engineering processes.


Presentation: Unleashing Llama’s Potential: CPU-based Fine-tuning

MMS Founder
MMS Anil Rajput Rema Hariharan

Transcript

Rajput: I come from a hardware background, and we want to optimize. We run benchmarks, but benchmarks are of limited use, so I always want to understand what customers are actually doing and what their environments are running. At QCon in the 2018 and 2019 time frame, I found that the number one topic was Java. I think 70% to 80% of attendees were Java enterprise developers solving related problems, and most of those deployments ran on CPUs underneath. That’s my experience.

Then COVID happened. At the time I was at Intel, and since then I have moved from Intel to AMD. Now I’m back at QCon, and here is what has happened in the meantime: CPUs have become a small part of the conversation and everything is about GPUs. The conference is no longer about Java; it’s all about LLMs. I’m trying to change with that. The topic we bring you is not exactly the mainstream one, but it is about LLMs, in particular Llama running on the CPU. Set the GPU part aside; we plan to talk about CPUs.

How many folks are familiar with CPU architecture? I wanted to check because many of the optimizations we want to discuss here come down to software-hardware synchronization. When we talk about performance, we hear from many customers who are on-prem or moving to the cloud, and their goals are: I want to save 10% or 20% of TCO, I want to reduce latency, or something else. Sure, a lot depends on the application architecture, but you can get significant performance improvements just from understanding the underlying hardware and leveraging it as well as possible. We want to show you a particular example of how you can be aware of the hardware you will deploy on and design or architect accordingly. This involves both roles: some decisions are made at deployment time, and others when you’re writing the code or application.

Hardware Focused Platform Features

We’re not talking about GPUs or CPU plus GPU interactions, because that analysis becomes quite different, including how data flows between them. This is mostly about a Llama model, or that kind of workload, deployed on a CPU platform. Let me share a few components, and I’ll talk about each one: CPUs, cores, simultaneous multi-threading (SMT, which Intel calls Hyper-Threading), and the different kinds of caches. For caches in particular, I’ll later contrast a chiplet architecture with a unified one. That’s a big change that has happened in deployments over the last three or four years.

Then there is memory, and with memory there are two things: memory capacity and memory bandwidth. There is memory latency too, but you don’t have to worry about it as much; it’s usually capacity and bandwidth. Let me start with the CPU side. I want to show you one CPU and two CPUs. It’s good to know about this because many platforms have two CPUs, which we call dual socket, and that becomes two NUMA nodes. Unless your application has to span both CPUs, which means cross-socket communication, you want to avoid that. The only time you need it is something like a large database that needs the memory capacity of both sockets.

Otherwise, you want to keep each process and its memory local. On a 1P (single-socket) platform you don’t need to worry, but on a 2P platform you want to be a little aware. I/O also becomes more interesting: it matters which CPU the I/O device sits on, and it’s usually much better to keep your process on that socket than on the other one. We are not going into the I/O area. If you work with disks or network cards, they usually hang off one of the sockets, and how that processing works is a separate talk. Just be aware of whether the platform you’re using is 1P or 2P.

Let me go into a little more detail. In a CPU you typically see many cores; each core has its L1 and L2 caches, and the L3 cache sits underneath. Then, of course, you have memory and I/O or NIC cards. That is a typical design. Within the core you have the SMT thread, the L1 cache, and the L2 cache. Of these, the SMT thread is the one you need to think about. Do you need to worry about the SMT thread?

The only thing to keep in mind is that it gives you twice the number of threads. When you design the application side, for example the thread pool size on an N-core system with SMT on, make sure your thread pool or settings are not hardcoded: check how many vCPUs are available and set them accordingly. I have seen this mistake where people hardcode it; we found it even in MongoDB and similar software, where the core count was hardcoded at 32 or 16. Then when you deploy on a scale-up model, it doesn’t scale, and you find that the programmer hardcoded the count.
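A minimal Python sketch of sizing a thread pool from the vCPUs actually available instead of a hardcoded 16 or 32. The helper names `available_vcpus` and `pool_size` are hypothetical. `os.sched_getaffinity` only exists on some platforms (notably Linux), so the sketch falls back to `os.cpu_count()`; note that an affinity mask does not reflect container CPU quotas, which would need separate handling.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def available_vcpus() -> int:
    """Logical CPUs (vCPUs) this process may run on, SMT siblings included."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: honours the CPU affinity mask (e.g. taskset or a cpuset).
        return len(os.sched_getaffinity(0))
    # Other platforms: fall back to the total count reported by the OS.
    return os.cpu_count() or 1


def pool_size(threads_per_vcpu: float = 1.0) -> int:
    """Derive the worker count from the machine instead of hardcoding it."""
    return max(1, int(available_vcpus() * threads_per_vcpu))


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=pool_size()) as pool:
        print(pool_size(), list(pool.map(abs, [-1, -2, 3])))
```

The same binary then uses all 32 vCPUs on a 16-core SMT machine and all 256 on a 128-core one without a code change.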

Now, since I mentioned the chiplet architecture, let me show you the difference between a unified L3 and a chiplet architecture. Most Intel Xeon systems have been unified; their upcoming GNR (Granite Rapids) will be the first chiplet type. In a unified design, all the cores in the socket share the same L3. In a chiplet design, we create groups of cores, each associated with its own L3. That is the chiplet architecture. It has hardware benefits on yield, but it also has a benefit on the software side. Let me show you what that is, with a bit more detail on the chiplet. Because each L3 is tied to one chiplet, a single chiplet cannot consume the whole memory bandwidth. On a unified L3, a noisy application running on just a few cores could consume the whole memory bandwidth, leaving little for the rest of the programs and causing poor latency as soon as the noisy neighbor arrives.

So one benefit of the chiplet architecture is that no single chiplet can consume the full memory bandwidth. For example, if a platform has 400 GB/s of memory bandwidth, one chiplet can’t pull more than about 40 to 60 GB/s. Number one, that protects you from the noisy neighbor scenario. Number two, there is a BIOS setting, NPS (NUMA nodes Per Socket), which you can set up to four. With NPS4, the memory channels are divided into four groups, each exposed as its own NUMA node. Most clouds run with NPS1 because they don’t want to manage memory scheduling. When you run on-prem, you can create four clean NUMA nodes, and each one gets its own share of memory bandwidth.

In this scenario, let’s say you deploy one application on two chiplets, so two L3s, and another application in another NUMA node. The two are not colliding on the same memory channels, and each application gets its own L3. You get consistent performance, and the applications don’t consume each other’s memory or memory bandwidth or clash in the L3. Those are the benefits. A unified L3 does have one advantage: exchanging data across the L3, or between parts of an application, is faster. Other than that, the benefits of the chiplet architecture outweigh it. That’s why you will see GNR going down a very similar chiplet path. As caches grow and core counts increase, you cannot put everything on one unified L3; it’s a limitation of the architecture once you have a huge number of cores.
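The talk does not name a tool for keeping an application on one NUMA node. As one hedged illustration, assuming a Linux host with the `numactl` utility installed, this Python sketch builds a command line per instance that binds both CPU scheduling (`--cpunodebind`) and memory allocation (`--membind`) to the same node, for example one instance per socket on a 2P system or one per quadrant with NPS4. The helper names and the `./server` command are hypothetical; the node count is an input you would read from `numactl --hardware` or `lscpu`.

```python
import subprocess
from typing import List


def numactl_argv(node: int, cmd: List[str]) -> List[str]:
    """Prefix cmd so its threads run on, and its memory comes from, one NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *cmd]


def plan_instances(num_nodes: int, cmd: List[str]) -> List[List[str]]:
    """One instance per NUMA node, so instances don't compete for the same node's memory."""
    return [numactl_argv(node, cmd) for node in range(num_nodes)]


def launch(num_nodes: int, cmd: List[str]) -> List[subprocess.Popen]:
    """Start every planned instance (requires numactl on the host)."""
    return [subprocess.Popen(argv) for argv in plan_instances(num_nodes, cmd)]
```

For example, `launch(4, ["./server"])` on an NPS4 socket would start four processes, each confined to one node's cores and memory channels.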

Focus: Software and Synchronization (AI Landscape and LLMs)

Rema will talk about what this means for LLMs, or Llamas: the role of SMT, simultaneous multi-threading, and what you need to think about to leverage it best. Same thing for cores: as core counts increase, how do you leverage them and what do you need to be aware of? Caches play a very important role; from EDA tools and other areas, I can tell you that you can get a 20% or 30% improvement. When you divide your problem, in the LLM space as elsewhere, the difference between fitting into the cache and not fitting is not 4% or 5%; it is 20% or 30% as soon as the working set fits. For a particular architecture, and the more you are doing high-frequency trading or other response-time-sensitive or throughput-sensitive work, you have profiling tools that show whether the data fits, how much is missing the cache, and what you need to adjust. Those architecture-specific optimizations matter a lot.

Then there is memory capacity and bandwidth. Rema will show you later when LLMs are compute bound, when they are memory bandwidth bound, what decisions you can make, and how far memory capacity matters. I wanted to give you these details first because Rema will use them extensively in her talk: this is my analysis, I’m memory bandwidth bound here, and I’m trying to fit into the cache of the chiplet. Just for reference, even though LLMs get all the attention, in the bigger picture they are a pretty tiny and very complex piece. I’m sure you are all aware where LLMs and ChatGPT fit. And in terms of timelines, this part is changing exponentially. We are in exciting times.

Llama

Hariharan: Let’s Llama. I’m excited to be talking about this Llama, who right now is sitting on top of the Andes. Let’s talk about the real thing we are all interested in. Why Llama in the first place? Why are we not talking about other models? There are so many GPT models that we are all aware of. One of the main reasons we talk about Llama, and use it for benchmarking and workload analysis, is that it is a small model. Particularly when running on a CPU, it is one of the smaller models. It comes in multiple sizes, but a small one is always available.

Relative to other GPT models, this is the smaller one. Not only that: as we all know, Llama was trained on publicly available data, and its weights are openly released. That protects us in whatever usage we have. A lot of our customers like to fine-tune it for their own select areas, and because it was trained on publicly available data, that keeps them more protected; there is less to worry about in terms of lawsuits. So that’s why: it’s a small model, it’s openly available, and it can be fine-tuned.

Let’s look at Llama in action. How does it actually work? There are two phases: the prefill phase and the decoding phase. What happens in the prefill phase? You type something, the model is loaded, and it takes your input, in this case ‘Computer science is’, and predicts the next word, or next token. The whole model works on everything you have typed. Here it’s just three words, but you could send a whole book as the input. It processes all of the data you have submitted and then produces the very first token. Strictly, the output is not one token but a probability distribution over many tokens; without loss of generality, let’s say one token comes out. This portion of the work is extremely compute intensive. We call it the prefill phase.

The second phase is the decoding phase. The model takes the previous token, the KV cache built so far, and whatever else was created earlier, and produces the next token, then the next, and so on until it reaches the end of the sentence. Each step loads the model again. I said the model is small, but it’s still not small enough to fit into our caches, so you have to pull the model weights from memory again and again, portion by portion, at every step. That involves a lot of memory traffic, so the decoding phase is highly memory bandwidth intensive.

Basically, the prefill phase does the tokenization, embedding, encoding, and so on. In the decode phase you iterate over and over, either deterministically or probabilistically. As I said, each step actually produces a probability distribution over a bunch of tokens. You can set configuration parameters to be greedy, picking the most probable token and moving forward. That’s the fast way to do it, but there are other strategies, such as sampling from the distribution. Without loss of generality, let’s say these are the two stages of the model and we take one token at a time.
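To make the two phases concrete, here is a toy Python sketch. `toy_model` is a hypothetical lookup table standing in for Llama, and `generate` and `pick` are illustrative names. Prefill consumes the whole prompt to produce the first token; decode then adds one token per step, either greedily (most probable token) or by sampling from the distribution. A real implementation keeps a KV cache so that each decode step only processes the newest token, rather than passing the full context as this toy does.

```python
import random
from typing import Callable, Dict, List, Optional

# A "model" maps the tokens seen so far to a next-token probability distribution.
Model = Callable[[List[str]], Dict[str, float]]


def toy_model(context: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for Llama: probabilities keyed on the last token only."""
    table = {
        "is": {"the": 0.6, "a": 0.4},
        "the": {"study": 0.9, "</s>": 0.1},
        "study": {"</s>": 1.0},
    }
    return table.get(context[-1], {"</s>": 1.0})


def pick(dist: Dict[str, float], greedy: bool, rng: random.Random) -> str:
    if greedy:
        # Greedy decoding: always take the single most probable token.
        return max(dist, key=dist.get)
    tokens = list(dist)
    # Sampling: draw a token in proportion to its probability.
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def generate(model: Model, prompt: List[str], max_new: int, eos: str = "</s>",
             greedy: bool = True, seed: Optional[int] = None) -> List[str]:
    rng = random.Random(seed)
    context = list(prompt)
    # Prefill: the whole prompt is processed in one pass to produce token #1.
    out = [pick(model(context), greedy, rng)]
    # Decode: one token per step, conditioned on everything generated so far.
    while out[-1] != eos and len(out) < max_new:
        context.append(out[-1])
        out.append(pick(model(context), greedy, rng))
    return out
```

With greedy decoding, `generate(toy_model, ["Computer", "science", "is"], max_new=10)` returns `["the", "study", "</s>"]`; with `greedy=False` the result can differ from run to run unless a seed is fixed.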

Now let’s look at the Llama internals. Think of driving a car and looking inside it: there is the engine, the transmission, and so on. To make sure the car functions well, you need a good engine, a good transmission, and so on. Let’s look at the internals here. When Llama runs, a lot of matrix multiplication happens; it is one of the key operations. Dot products are another. Then there is scaling and softmax computation, and weighted sums. Last but not least, the data passes through multiple layers and everything is aggregated. These are the primitives that make up the Llama internals.
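Several of these primitives (dot products, scaling, softmax, weighted sum) come together in attention. The following is a minimal pure-Python sketch of single-query scaled dot-product attention. It is illustrative only: it omits what a real Llama layer adds (learned projections, multiple heads, causal masking, rotary position embeddings), and in practice these loops are the matrix multiplications that BLAS-style libraries optimize.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vector) -> Vector:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-query scaled dot-product attention."""
    scale = 1.0 / math.sqrt(len(query))
    # Dot product of the query with every key (in practice one matrix multiply).
    scores = [dot(query, k) * scale for k in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors (in practice another matrix multiply).
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

When all keys score equally, the weights are uniform and the output is the average of the values; `attend([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 4.0], [4.0, 8.0]])` returns `[3.0, 6.0]`.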

To get the best performance, we need to optimize these primitives. General optimization of these primitives happens in the BLAS and MKL libraries that have been written, but they also need to be optimized for a given hardware platform. Some optimizations are common; others are specific to a particular piece of hardware. So we have to know the latest work here: which libraries best optimize these primitives for the hardware you are running on?

Next, let’s talk about metrics. Metrics clearly come from the user: the user decides what is important to them. I’m showing pretty much the same diagram as before, placed slightly differently. In any LLM, not just Llama, you give the input. If you have typed into ChatGPT, as all of us have, you wait for some time, especially if the input is long. The cursor goes blink, blink, blink, and you wonder whether your words are lost in the ether. That is the time the prefill phase is working. At the end of it comes the first token. Something is happening, and I’m happy.

After that, all the other tokens follow. When the output is large, you can see that more is still arriving while you read; it doesn’t appear in one shot. The tokens come out gradually. A token is not exactly a word, but tokens are converted to words; I won’t go into the details here. So what are the metrics? The time from giving my input to seeing something appear on my screen is the time to first token (TTFT). I’m reiterating it just for completeness. Then the tokens keep appearing until the end, and the total latency is very important: I care whether I got my entire response. The inverse of that is throughput; for one request at a time, throughput is basically one divided by latency. Throughput is super important for all of us. We are willing to wait a little longer if we are asking for an essay, but if I’m asking a yes or no question, you had better be quick.
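As a hedged sketch, the metrics above can be computed from timestamps taken while streaming a response (for example with `time.perf_counter()`). `measure` and `RequestMetrics` are hypothetical names. Tokens per second here is per request; system throughput would aggregate over all concurrent requests.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestMetrics:
    ttft_s: float           # submit -> first token
    total_latency_s: float  # submit -> last token
    tokens_per_s: float     # output tokens / total latency, for this request


def measure(submit_t: float, token_times: List[float]) -> RequestMetrics:
    """Derive TTFT, end-to-end latency and per-request throughput from timestamps."""
    if not token_times:
        raise ValueError("no tokens were produced")
    ttft = token_times[0] - submit_t
    total = token_times[-1] - submit_t
    if total <= 0:
        raise ValueError("token timestamps must come after the submit time")
    return RequestMetrics(ttft, total, len(token_times) / total)
```

For a request submitted at t = 0 s whose four tokens arrive at 0.5, 1.0, 1.5 and 2.0 s, this gives a TTFT of 0.5 s, a total latency of 2.0 s and 2.0 tokens/s.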

Throughput is the real measure of performance: it shows how well the system is being used. TTFT is also super important, and it is a result of the compute-intensive phase. TTFT can often be reduced substantially with specialized hardware; for example, AMX (Advanced Matrix Extensions) definitely helps reduce TTFT, and GPUs reduce it too. Throughput, on the other hand, is mostly controlled by memory bandwidth, because the model is loaded again and again. The larger your output gets, as more tokens are pumped through, the more the throughput is governed by memory bandwidth.
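A back-of-the-envelope check of why decode throughput tracks memory bandwidth: if every decode step must stream all model weights from memory once, steps per second cannot exceed bandwidth divided by weight size, and each step yields one token per sequence in the batch. This Python sketch is an upper-bound estimate under those assumptions only; it ignores KV cache traffic, cache reuse, and compute limits. The 7-billion-parameter model at 2 bytes per parameter (bf16) and the 400 GB/s figure are illustrative inputs, not measurements from the talk.

```python
def decode_tokens_per_s_bound(params_billion: float, bytes_per_param: float,
                              mem_bw_gb_s: float, batch_size: int = 1) -> float:
    """Upper bound on decode throughput when each step streams all weights once."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes/param = GB
    steps_per_s = mem_bw_gb_s / weights_gb         # full weight passes per second
    return steps_per_s * batch_size                # one token per sequence per step


if __name__ == "__main__":
    one = decode_tokens_per_s_bound(7, 2, 400)
    four = decode_tokens_per_s_bound(7, 2, 400, batch_size=4)
    print(f"batch 1: <= {one:.1f} tok/s, batch 4: <= {four:.1f} tok/s")
```

This prints a bound of about 28.6 tokens/s at batch size 1 and about 114.3 at batch size 4: a larger batch reuses each pass over the weights for more sequences, until compute becomes the limit.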

Deployment Models – How Are Llamas Deployed?

Now let’s talk about deployment models. I think it is important to show where our CPUs fit, whether you’re using GPUs or just CPUs. Smaller models in particular run very well on CPU-only systems. Even when you run on GPUs, a CPU is involved, connected to the GPUs. That setup typically suits larger models and allows mixed precision, better parallelism, and so on. With GPUs you also see deployments built on a network of GPUs, and what matters there is how the GPUs are connected: through NVLink or InfiniBand. They need fast connections among all the GPUs, and between the CPU and the GPUs as well.

Typically, the CPU connects to a GPU through PCIe. It can get even more complex. Beyond a network of GPUs, the inputs can be fed in different ways: audio and video inputs are processed differently and fed into the model, and different layers are handled by different sets of GPUs. You can make this as complex as you want, but we’re not going into those complexities. Life is difficult enough with a simple case, so let’s stick to it.

Llama Parameters

Let’s get into the details. First, let’s get familiar with some of the terms we use. They’re not really jargon, we’re all familiar with them, but let’s put them on the board. The three main parameters we’ll talk about are input tokens, output tokens, and batch size. You will hear these three in any benchmark publication or workload description, not just for Llama but for LLMs in general. Input tokens are what you type on the screen: the input you provide, whether a paragraph, a quick set of prompts, or whatever. Output tokens are what the model produces. Again, tokens are not the words you see, but they are related to them. What is batch size? It can go from one to whatever you want. With a batch size of one, you give one prompt at a time, wait for the response, then give the next prompt. You can say, tell me a story, and Llama tells you a story.

Then, tell me a scary story, and it tells you another one, and so on. You can also give multiple prompts at the same time; the example on the slide shows a batch size of 4. All of them are processed together. Some prompts may finish earlier than others, so the parallelism doesn’t stay the same throughout; people are also working on things like dynamic batch sizes. We won’t get into those details here and will keep it simple: if I say batch size equal to 4, we assume a batch of 4 prompts is given, a batch of 4 outputs is produced, then the next 4, and so on. What is a Llama instance? That’s the first term I put up there. A Llama instance is the Llama program that you’re running. You don’t have to run just one; you can run multiple instances on the same system, and we will talk about how that scales. Each instance is one instantiation of the Llama program.
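The simple, static batching described above (fixed groups of prompts, each group processed together before the next) can be sketched in Python as follows. `static_batches` is a hypothetical helper; the dynamic batching mentioned in the talk, where finished sequences are replaced mid-flight, is not shown.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def static_batches(prompts: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive fixed-size batches; the last one may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(prompts), batch_size):
        yield list(prompts[start:start + batch_size])
```

With 10 prompts and a batch size of 4, this yields batches of 4, 4 and 2 prompts; each batch would be sent to one Llama instance as a unit.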

Selecting the Right Software Frameworks

Let’s talk about selecting the right software frameworks. Everything started with PyTorch, the base framework. It has good community support, but it has nothing special for any particular hardware; it’s not optimized, so consider it the baseline. Then came TPP, which was created by Intel. Much of TPP is optimized for Intel, but it works pretty well on AMD too, and we get a good gain going from the baseline to TPP.

Then came IPEX, which incorporates TPP: IPEX is built on top of TPP. Again, it was done by Intel, and it benefits Intel a bit more than AMD. Last but not least is our favorite, ZenDNN. How many of you have used ZenDNN? It was released recently. It builds on top of what was already there, and it gives a good performance boost when you run on AMD hardware in particular. Let’s look at some numbers, since I’ve been saying one is better than another.

If I mark the baseline as 1, what I’m plotting here are the various software optimizations: baseline, TPP, IPEX, and finally ZenDNN. Going from baseline to TPP is more than a factor of 2. These are all measured on our hardware only; no competitive benchmarking numbers are presented in this talk. Going from baseline to IPEX, you get nearly 3x, a lot more than TPP alone. ZenDNN gives a further boost above IPEX. One thing to note is that the ZenDNN advantage keeps increasing as the batch size increases. Number one, it is optimized to benefit from our higher core counts.

Secondly, it benefits from the large L3 caches we have. A lot of code refactoring and optimization went into it, and the benefits grow with batch size. Of the two batch sizes I’ve shown, 1 and 16, it starts at a 10% advantage over IPEX and goes a little higher. I haven’t added those graphs, but the benefit keeps increasing as the batch size increases.

Hardware Features and How They Affect Performance Metrics

Let’s come to the core of this talk. Here are various hardware features, and the question is how they affect my performance metrics and how we optimize to use them all well. Let’s first talk about cores. This graph shows how Llama scales. I’m using a single Llama instance with 16, 32, 64, and 128 cores. From the leftmost to the rightmost that’s a factor of 8 in size, but the gain is less than 50%. The software does not scale, and there are multiple reasons for that which we can get into.

Basically, the performance you get with size 16 seems mostly good enough, so maybe I should run multiple 16-core instances rather than one large instance. What I showed in the previous graph was throughput, and throughput doesn’t scale a whole lot. As someone trying to get the most out of the system, throughput is what I’m interested in. The user is also interested in TTFT, and TTFT does benefit somewhat when you make the instance larger: going from 16 to 128 cores, it dropped by about 20%. So the extra parallelism and CPU capacity you throw in does help, but not by a whole lot.

Moral of the story: additional cores offer only incremental value, for throughput and for TTFT. The two have to be considered together because, when working with a customer, there may be a TTFT requirement, such as the first token appearing within so many milliseconds or seconds. Bear that in mind when you ask whether you can make each instance really small and run a whole lot of them. There is one more consideration: each instance consumes memory, so with too many instances you may not have enough memory. I’ll come to that later.
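The trade-off above can be put as simple arithmetic: the number of instances is capped both by cores and by memory, since each separate instance holds its own copy of the model weights plus its KV cache. This Python sketch is illustrative; `max_instances` is a hypothetical helper, the memory figures are assumed inputs, and a TTFT target would be checked separately by measuring the chosen instance size.

```python
def max_instances(total_cores: int, cores_per_instance: int,
                  total_mem_gb: float, mem_per_instance_gb: float) -> int:
    """Largest instance count that fits both the core budget and the memory budget."""
    if cores_per_instance <= 0 or mem_per_instance_gb <= 0:
        raise ValueError("per-instance sizes must be positive")
    by_cores = total_cores // cores_per_instance
    by_memory = int(total_mem_gb // mem_per_instance_gb)
    return min(by_cores, by_memory)


if __name__ == "__main__":
    # Assumed: 128 cores, 256 GB of RAM, 40 GB per instance.
    for size in (16, 32, 64, 128):
        print(size, max_instances(128, size, 256, 40))
```

With those assumed numbers, 16-core instances are memory-limited to 6 rather than the 8 the cores would allow, while 32-, 64- and 128-core instances are core-limited (4, 2 and 1).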

Next, let’s talk about SMT, simultaneous multithreading. Each physical core runs two hardware threads: a thread and its sibling. Are you going to get a benefit from using the SMT sibling? Let’s take a look. The blue line here shows the performance improvement. I’m only plotting the improvement, not raw numbers. This is a single instance of size 16, with nothing else running on the system. Remember, my CPU has 128 cores. These runs are all on our Turin system. We have 128 cores, but I’m using only 16 of them. That means the background is very quiet. Nobody else is using the memory bandwidth, so we are not constrained by memory bandwidth at all.

The only constraint here comes from the core itself; it’s CPU bound. In that case, you get a good boost from the SMT thread: if you run it twice, with and without the SMT thread, you can see the difference. The orange line, on the other hand, shows the boost you get when you’re running everything, all the 16-core instances together. Then you’re constrained by memory bandwidth, and there is really no advantage or disadvantage. As you can see, the percentages are in single digits, which is just statistical variation, nothing else.

Moral of the story again: SMT does not hurt, even in the fully loaded case, and it gives you a lot of benefit when the background is quiet. That is particularly important because, let’s say you’re running on a cloud, on AWS or one of these providers, you don’t assume everybody is running Llama. You take an instance of size 16 and run there. Most likely everybody else is quiet or doing very little. You will get that benefit, so use your SMT there.
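To use SMT on Linux, an instance’s CPU set should include each core’s sibling hardware thread. The sysfs file `thread_siblings_list` under `/sys/devices/system/cpu/cpuN/topology/` lists the logical CPUs sharing a physical core. The sketch below parses that list format; the helper names are hypothetical, and sibling numbering (for example, CPU N paired with N+128) varies by platform, so read it rather than assume it.

```python
from pathlib import Path
from typing import Iterable, List


def parse_cpu_list(text: str) -> List[int]:
    """Parse a Linux CPU list such as '0,128', '0-1' or '0-3,8-11'."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def smt_siblings(cpu: int) -> List[int]:
    """Logical CPUs that share a physical core with `cpu` (reads sysfs)."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    return parse_cpu_list(path.read_text())


def with_siblings(cores: Iterable[int]) -> List[int]:
    """Expand a set of cores so it also includes their SMT sibling threads."""
    result = set()
    for c in cores:
        result.update(smt_siblings(c))
    return sorted(result)
```

For a 16-core instance, `with_siblings(range(16))` would give the 32 logical CPUs to pin it to.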

This is the most important one, where you will see a big difference: memory bandwidth. What role does it play? On a Turin system, our memory runs at 6,000 megatransfers per second. I clocked it down to 4,800, a 20% reduction in bandwidth, and asked how much that affects overall performance. Remember I told you that the prefill phase is bound mostly by the CPU, and the decoding phase by memory bandwidth. The result is a substantial difference, and here the roles are reversed compared with SMT. There are two things I plotted here.

The first is a single instance, the dark brown one. When I run a single instance, it doesn’t matter whether my memory bandwidth is 6,000 or 4,800; I have plenty for one instance. When I run all the instances, that is when memory bandwidth really hits you hard, and performance is affected very badly. So the moral of the story here is: get as much bandwidth as you can. If the cloud is going to limit how much bandwidth you can use, it’s worth paying for extra bandwidth if you have to. I know the amount of bandwidth each instance gets can be controlled.
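For a sense of scale, theoretical peak DRAM bandwidth is the transfer rate times the bytes per transfer times the number of channels. This is a minimal sketch assuming 12 DDR5 channels per socket and an 8-byte (64-bit) data path per channel; the channel count is an assumption about the platform, not something stated in the talk. Under those assumptions it works out to 576 GB/s versus 460.8 GB/s per socket, and the 20% reduction holds regardless of channel count.

```python
def peak_dram_bandwidth_gbs(mega_transfers_per_s: float, channels: int,
                            bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (10**9 bytes per second)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


CHANNELS = 12  # assumption: 12 DDR5 channels per socket
full = peak_dram_bandwidth_gbs(6000, CHANNELS)
reduced = peak_dram_bandwidth_gbs(4800, CHANNELS)
print(f"{full:.1f} GB/s -> {reduced:.1f} GB/s, {1 - reduced / full:.0%} less per socket")
```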

Next, let’s talk about the role of caches. I don’t have a graph for this, but I can talk through it. Caching is important. Remember, we are constrained by memory bandwidth, so if we can get the data from the caches, it’s better. However, you cannot fit the whole model into cache, so the weights get loaded over and over again. By the nature of this workload, it’s a use-and-throw model: you load the weights, use them to compute something, and that’s it; you don’t reuse them. The only way to reuse them is a higher batch size. If you process 64 requests together, all 64 use the same weights for the computation.

Otherwise, if you’re using a batch size of 1, it is a use-and-throw model, and you can see a very large L3 miss rate. Using a higher batch size is crucial to make good use of the caches.

Earlier I talked about what happens as you scale a single instance. Now I’m going to talk about what happens when you change the number of instances: how does performance scale as you add more of them? As you can see, there are two bars. The blue one, which I’m using as the baseline, is a single 16-core instance. The orange ones are 16 of the 16-core instances running in parallel.

If everything were ideal, the orange bar would have been 16 times the baseline. But other constraints come into play, and memory bandwidth is a big one. You don’t get 16x, but you get roughly 10x to 12x overall compared with the single instance, which has no memory bandwidth constraint. This holds across different situations. Chat is short input, short output. Essay is short input, long output. Summary is long input, short output. Translation is long input, long output. In all cases, you get at least a 10x improvement over the baseline. So running parallel instances is pretty much the way to go.

I talked about batches a lot and said use higher batch sizes. What is the return? Look at the throughput. As the batch size increases from 1 to 128, I got more than 128x the performance. The reason is that I’m getting much higher L3 hit rates: data comes from the cache instead of memory, which reduces latency and keeps the cores much busier. But you don’t get anything for free. Where it hurts is TTFT: your TTFT goes up too. That’s not a good thing, but it makes sense.

If I’m working on 20 projects at the same time, all my customers are going to complain that I’m not giving them the solution, even though I’m working day and night. As users of computers, that is what we want from our machines: keep them busy day and night and get the maximum throughput. After a month, everybody will have their answer from me, but by the next day, probably not. That’s what we are seeing here. Throughput comes at the cost of TTFT, which grows as you run more in parallel.
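The batch-size effect comes down to weight reuse. A common roofline-style approximation, not the speaker’s own model, says that in the memory-bound decode phase each step streams all the weights from DRAM once and yields one token per sequence in the batch, so the upper bound on decode throughput grows with batch size. The sketch ignores KV-cache and activation traffic and any cache hits, and the 400 GB/s bandwidth is a placeholder.

```python
def decode_tokens_per_s_bound(bandwidth_gbs: float, weights_gb: float,
                              batch_size: int) -> float:
    """Upper bound on decode throughput for a memory-bound model.

    One decode step reads every weight once from DRAM and produces one token
    for each sequence in the batch. KV-cache and activation traffic, and any
    cache hits, are ignored.
    """
    steps_per_s = bandwidth_gbs / weights_gb
    return steps_per_s * batch_size


WEIGHTS_GB = 8e9 * 2 / 1e9   # 8B parameters in BF16, about 16 GB
BANDWIDTH_GBS = 400.0        # placeholder effective bandwidth for one instance
for batch in (1, 8, 64):
    print(batch, decode_tokens_per_s_bound(BANDWIDTH_GBS, WEIGHTS_GB, batch))
```

At batch size 1 every weight fetched serves a single token, which is the use-and-throw pattern; the bound says nothing about TTFT, which rises with batch size.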

These are things I’ve already said. Prefer more instances over larger instances, if you can afford it. To harvest performance, use larger batch sizes. And the whole thing is mostly a balancing act between TTFT and overall memory needs. I just said memory needs, so let me get into that now. TTFT is a requirement that a customer is going to place on you. Memory needs grow as the number of parallel instances increases and as the batch size increases. There are formulae out there, but fundamentally, the memory need comes from three factors. Number one, the model itself: the larger the model, the more you load. If you have an 8-billion-parameter model, each parameter takes 2 bytes, so about 16 GB. The next thing that adds to the memory is the activations.

Last but not least, the KV cache, which keeps growing as you process more tokens. The total requirement comes from all three together. If you have multiple instances, let’s say I computed my memory need per instance as 41 gigabytes; then 32 instances in parallel need 41 times 32. I say 32 instances because on AMD systems we typically like to keep an instance on one CCD, or Core Complex Die, which has 8 cores. We have 32 of them, so 8 times 32 is our whole system. That comes to 1.3 terabytes, which is really close to the total memory on the system, 1.5 terabytes. When you have too many instances, you start seeing swapping. We urge you to do this calculation to get a good idea of how to get the maximum out of the system. You don’t want swapping; swapping is not a good thing.
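A minimal sketch of the three-part estimate (weights, activations, KV cache). The KV-cache term uses the standard formula of two tensors (K and V) per layer, each holding kv_heads times head_dim values per token. The shape assumes a Llama 3 8B-style configuration (32 layers, 8 KV heads, head dimension 128, BF16), the activation size is a placeholder you would measure, and the final check reuses the talk’s 41 GB and 32-instance figures.

```python
from dataclasses import dataclass

GB = 1e9  # decimal gigabytes


@dataclass
class ModelShape:
    params: float              # parameter count
    layers: int
    kv_heads: int              # key/value heads (grouped-query attention)
    head_dim: int
    bytes_per_value: int = 2   # BF16


def kv_cache_bytes(m: ModelShape, seq_len: int, batch: int) -> int:
    # K and V per layer; kv_heads * head_dim values per token per sequence.
    return 2 * m.layers * m.kv_heads * m.head_dim * m.bytes_per_value * seq_len * batch


def instance_memory_gb(m: ModelShape, seq_len: int, batch: int,
                       activations_gb: float) -> float:
    weights = m.params * m.bytes_per_value
    return (weights + kv_cache_bytes(m, seq_len, batch)) / GB + activations_gb


def total_memory_gb(per_instance_gb: float, instances: int,
                    system_gb: float) -> tuple:
    needed = per_instance_gb * instances
    return needed, needed / system_gb


# Assumed Llama 3 8B-style shape; activations_gb is a placeholder to measure.
LLAMA3_8B = ModelShape(params=8e9, layers=32, kv_heads=8, head_dim=128)
print(instance_memory_gb(LLAMA3_8B, seq_len=4096, batch=32, activations_gb=2.0))

# The talk's numbers: 41 GB per instance, one instance per CCD, 32 CCDs.
needed, fraction = total_memory_gb(41, 32, 1500)
print(f"{needed} GB needed, {fraction:.0%} of a 1.5 TB system")
```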

I know this is a back-of-the-envelope calculation, and most of the time it comes out slightly under. It depends on which framework you’re using. With ZenDNN, as you can see, it’s pretty close in most cases. IPEX used even more memory. I have seen this go the other way as well, not for Llama but for some other use cases we have run, where ZenDNN takes more memory. I’m not making a statement about either framework here. The point is that this is only a back-of-the-envelope calculation, and you have to look at what your framework actually uses.
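Since the estimate can be off in either direction depending on the framework, it helps to compare it with the measured peak. On Linux, one option is the `VmHWM` (peak resident set size) line in `/proc/<pid>/status`, reported in kB, where the kernel’s kB means 1024 bytes. This is a hypothetical helper, not the tooling used in the talk, and it is Linux-specific.

```python
import re
from pathlib import Path


def peak_rss_bytes(status_text: str) -> int:
    """Return VmHWM (peak resident set size) from /proc/<pid>/status text."""
    match = re.search(r"^VmHWM:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmHWM line not found")
    return int(match.group(1)) * 1024  # the kernel's "kB" means 1024 bytes


def peak_rss_gb(pid="self") -> float:
    """Read the peak RSS of a running process (Linux only), in GB."""
    text = Path(f"/proc/{pid}/status").read_text()
    return peak_rss_bytes(text) / 1e9
```

After an instance has completed a representative run, `peak_rss_gb(pid)` can be set against the per-instance estimate; if an instance spans several processes, their values would need to be summed.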

Whether you’re using ZenDNN or IPEX, check how much total memory your instance is going to use. One more thing I want to say: free-floating versus dedicated instances. Please pin your instances. What I’m showing is probably the worst case, and it happened in at least one run. It doesn’t always happen, but when you pin, each instance runs on a different set of cores, and you get returns proportionately. When you don’t pin, there is no telling: all of them may run on the same bunch of cores, or they may context switch back and forth. Either way, you pay the penalty. It’s not usually as bad as this, but it can be.
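Pinning can be done with `numactl`, whose `--physcpubind` option restricts a process to the given CPUs and `--membind` restricts its allocations to the given NUMA nodes. The sketch below builds one command line per instance, for example 32 instances of 8 cores each, one per CCD. It assumes core IDs are contiguous within each CCD and that each block of `cores_per_node` cores forms one NUMA node; actual numbering depends on the platform and BIOS settings, so check it with `lscpu` first. The script name `run_llama.py` is hypothetical.

```python
from typing import List, Sequence


def numactl_commands(num_instances: int, cores_per_instance: int,
                     cores_per_node: int, cmd: Sequence[str]) -> List[List[str]]:
    """Build one pinned command line per instance.

    Assumes instance i gets the contiguous core block starting at
    i * cores_per_instance, and that cores [k * cores_per_node,
    (k + 1) * cores_per_node) belong to NUMA node k.
    """
    commands = []
    for i in range(num_instances):
        first = i * cores_per_instance
        last = first + cores_per_instance - 1
        node = first // cores_per_node
        if last // cores_per_node != node:
            raise ValueError(f"instance {i} would straddle NUMA nodes")
        commands.append([
            "numactl",
            f"--physcpubind={first}-{last}",  # run only on these cores
            f"--membind={node}",              # allocate memory on the local node
            *cmd,
        ])
    return commands


# 32 instances x 8 cores (one CCD each), assuming two nodes of 128 cores.
for c in numactl_commands(32, 8, 128, ["python", "run_llama.py"])[:2]:
    print(" ".join(c))
```

Each list can be passed to `subprocess.Popen` to launch the instances side by side; SMT sibling CPUs could be appended to each `--physcpubind` list if you want to use them.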

Summary

Recommendations for optimization: the initial part of the run, prefill, is core bound, and the second part, decode, is memory bound, so use more memory bandwidth if you can get it. Parallelism helps. Use the best software you can for your hardware; for Zen, we definitely recommend zentorch. These things will evolve over time, so do your due diligence and homework to identify the best software for your case. Pin instances as much as possible.

Questions and Answers

Participant 1: How are you capturing some of these metrics? What specific metrics, and what tools, are you using to calculate them?

Hariharan: We know how many tokens we are sending. Typically, when we run Llama, we know our input and output tokens. The output token count is the total number of tokens produced by the model, and we know how long the run took. That’s what we use to compute throughput. For TTFT, for any particular input token size, we set the number of output tokens to 1, run it, and use that to estimate the TTFT. That is the time the user is simply waiting for.
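The measurement approach described in this answer can be sketched as follows: time one full generation to get throughput as output tokens over elapsed time, then time a run with the output length set to 1 to approximate TTFT. `generate` is a hypothetical stand-in for whatever inference call is being benchmarked, and a single unbatched run is assumed.

```python
import time
from typing import Callable, Dict, List


def measure(generate: Callable[[List[int], int], List[int]],
            prompt_tokens: List[int], max_new_tokens: int) -> Dict[str, float]:
    """Derive throughput and an approximate TTFT from two timed runs.

    `generate(prompt_tokens, n)` is a hypothetical inference call that
    returns the list of generated tokens.
    """
    start = time.perf_counter()
    output = generate(prompt_tokens, max_new_tokens)
    elapsed = time.perf_counter() - start

    # Same input size, output length forced to 1: the time the user waits.
    start = time.perf_counter()
    generate(prompt_tokens, 1)
    ttft = time.perf_counter() - start

    return {
        "output_tokens": len(output),
        "throughput_tok_s": len(output) / elapsed,
        "ttft_s": ttft,
    }
```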

Participant 1: How about CPU and other physical metrics, especially on cloud providers? Are you using hardware counters? How are you measuring swapping?

Hariharan: I have not run it on the cloud yet. Running it on bare metal, we have our regular tools to measure the utilization and also the counters and everything. We have our own software, and general-purpose software as well.
