Article originally posted on InfoQ.
Transcript
Srini Penchikala: Hello everyone. Welcome to the 2024 AI and ML Trends Report podcast. Greetings from the InfoQ AI, ML, and Data Engineering team. We also have two special guests with us today for this year's trends report. This podcast is part of our annual report to share with our listeners what's happening in AI and ML technologies. I am Srini Penchikala. I serve as the lead editor for the AI, ML and Data Engineering community on InfoQ, and I will be facilitating our conversation today. We have an excellent panel of subject matter experts and practitioners from different specializations in the AI and ML space. I will go around our virtual room and ask the panelists to introduce themselves. We will start with our special guests first, Namee Oberst and Mandy Gu. Hi, Namee. Thank you for joining us and participating in this podcast. Would you like to introduce yourself and tell our listeners what you've been working on?
Introductions [01:24]
Namee Oberst: Yes. Hi. Thank you so much for having me. It's such a pleasure to be here. My name is Namee Oberst and I'm the founder of an open source library called LLMware. At LLMware, we have a unified framework for building LLM-based applications for RAG and for using AI agents, and we specialize in providing that with small, specialized language models. We also have over 50 fine-tuned models on Hugging Face.
Srini Penchikala: Thank you. Mandy, thank you for joining us. Can you please introduce yourself?
Mandy Gu: Hi, thanks so much for having me. I'm super excited. So my name is Mandy Gu. I lead machine learning engineering and data engineering at Wealthsimple. Wealthsimple is a Canadian FinTech company helping over 3 million Canadians achieve their version of financial independence through our unified app.
Srini Penchikala: Next up, Roland.
Roland Meertens: Hey, I’m Roland, leading the datasets team at Wayve. We make self-driving cars.
Srini Penchikala: Anthony, how about you?
Anthony Alford: Hi, I'm Anthony Alford. I'm a director of software development at Genesys Cloud Services.
Srini Penchikala: And Daniel?
Daniel Dominguez: Hi, I'm Daniel Dominguez. I'm the managing partner of an offshore company that works with cloud computing within the AWS Partner Network. I'm also an AWS Community Builder in the machine learning space.
Srini Penchikala: Thank you everyone. We can get started. I am looking forward to speaking with you about what's happening in the AI and ML space, where we currently are and, more importantly, where we are going, especially with the dizzying pace of AI technology innovation since we discussed the trends report last year. Before we start the podcast topics, a quick housekeeping note for our audience. There are two major components to this report. The first part is this podcast, which is an opportunity to listen to a panel of expert practitioners on how innovative AI technologies are disrupting the industry. The second part of the trend report is a written article that will be available on the InfoQ website. It'll contain the trends graph that shows the different phases of technology adoption and provides more details on individual technologies that have been added or updated since last year's trend report.
I recommend everyone definitely check out the article as well when it's published later this month. Now back to the podcast discussion. It all starts with ChatGPT, right? ChatGPT was rolled out about a year and a half ago, at the end of 2022. Since then, generative AI and LLM technologies feel like they have been moving at maximum speed in terms of innovation, and they don't seem to be slowing down anytime soon. So all the major players in the technology space have been very busy releasing their AI products. Earlier this year, at the Google I/O conference, Google announced several new developments including Google Gemini updates and generative AI in Search, which is going to significantly change the way search works as we know it, right? Around the same time, OpenAI released GPT-4o. 4o is the omni model that can work with audio, vision, and text in real time. So, a multimodal solution.
And then, around the same time, Meta released Llama 3, followed by the recent release of Llama version 3.1, a new Llama release that goes up to 405 billion parameters. Those are billions; the numbers keep going up. Open-source solutions like Ollama are getting a lot of attention. It seems like this space is accelerating faster and faster all the time. The foundation of Gen AI technology is the large language models that are trained on a lot of data, making them capable of understanding and generating natural language and other types of content to perform a wide variety of tasks. So LLMs are a good topic to kick off this year's trend report discussion. Anthony, you've been closely following the LLM models and all the developments happening in this space. Can you talk about the current state of Gen AI and LLM models and highlight some of the recent developments and what our listeners should be watching out for?
The future of AI is Open and Accessible [05:32]
Anthony Alford: Sure. So I would say if I wanted to sum up LLMs in one word, it would be "more" or maybe "scale". We're clearly in the age of the LLM and foundation models. OpenAI is probably the clear leader, but of course there are big players like you mentioned, Google, and also Anthropic with their Claude model. Those are closed; even OpenAI's flagship model is only available through their API. Now Meta is a very significant dissenter to that trend. In fact, I think they're trying to shift the trend toward more open source. I think it was recently that Mark Zuckerberg said, "The future of AI is open." So Meta and Mistral, their models are open weight anyway; you can get the weights. One thing I'll mention about OpenAI: even when they didn't make the model weights available, they would publish some of the technical details of their models. For example, we know that the first GPT-3 had 175 billion parameters. With GPT-4 they didn't say, but the trend indicates that it's almost certainly bigger: more parameters, a bigger dataset, a bigger compute budget.
Another trend that I think we are going to continue to see: the 'P' in GPT stands for pre-trained. So these models, as you said, are pre-trained on a huge dataset, basically the internet. But then they're fine-tuned; one of the key innovations in ChatGPT was that it was fine-tuned to follow instructions. This instruct tuning is now extremely common and I think we're going to continue to see that as well. Why don't we segue now into context length? Because that's another trend. The context length, the amount of data that you can put into the model for it to give you an answer from, is increasing. We could talk about that versus these new SSMs like Mamba, which in theory don't have a context length limitation. I don't know, Mandy, did you have any thoughts on this?
Mandy Gu: Yes, I mean I think that's definitely a trend that we're seeing with longer context windows. Originally, when ChatGPT and LLMs first got popularized, this was a shortcoming that a lot of people brought up. It's harder to use LLMs at scale, or to get "more" as you called it, when we have restrictions around how much information we can pass through them. Earlier this year, Gemini, the GCP foundational model family from Google, introduced the one-million-plus token context window, and this was a game changer because in the past we've never had anything close to it. I think this has sparked the trend where other providers are trying to create similarly long or longer context windows. And one of the second order effects that we're seeing from this is around accessibility. It's made complex tasks such as information retrieval a lot simpler. Whereas in the past we would need a multi-stage retrieval system like RAG, now it's easier, although not necessarily better: we can just pass all of that context into this one-million-plus token window. So that's been an interesting development over the past few months.
Anthony Alford: Namee, did you have anything to add there?
Namee Oberst: Well, we specialize in using small language models. I understand the value of the longer context windows, but we've actually performed internal studies, and there have been various experiments by popular folks on YouTube too, where you take even a 2,000 token passage and pass it to a lot of the larger language models, and they're really not so good at finding passages; they suffer from the lost-in-the-middle problem. So if you really want to do targeted information search, the longer context windows are, I feel, sometimes a little misleading to users, because they make you feel like you can dump in everything and find information with precision and accuracy. But I don't think that that's the case at this point. So I think a really well-crafted RAG workflow is still the answer.
And then basically, for all intents and purposes, even if it's a million token context length or whatever, it could be 10 million. But if you look at the scale of the number of documents that an enterprise has in an enterprise use case, it probably still doesn't move the needle. For a consumer use case, yes, definitely a longer context window for very quick and easy information retrieval is probably very helpful.
Anthony Alford: Okay, so it sounds like maybe there’s a diminishing return, would you say? Or-
Namee Oberst: There is. It really, really depends on the use case. With what we deal with, if you think about a thousand documents, or somebody wants to look through 10,000 documents, then that context window doesn't really help. And there are a lot of studies around how an LLM is really not a search engine; it's really not good at finding pinpointed information. So I don't personally like to recommend longer context LLMs instead of RAG. There are other strategies to look for information. Having said that, where is the longer context window very, very helpful, in my opinion? If you can pass, for instance, a really long paper that wouldn't have fit through a narrow context window and ask it to rewrite it or to absorb it… What I love to use LLMs for is to transform one document into another: take a long Medium article and transform it into a white paper, let's just say as an example, something that would've previously been outside the boundaries of a normal context window. I think this is fantastic, just as an example of a really great use case.
Anthony Alford: So you brought up RAG and retrieval augmented generation. Why don’t we look at that for a little bit? It seems like number one, it lets you avoid the context length problem possibly. It also seems like a very common case, and maybe you could comment on this, the smaller open models. Now people can run those locally or run in their own hardware or their own cloud, use RAG with that and possibly solve problems and they don’t need the larger closed models. Namee, would you say anything about that?
Namee Oberst: Oh yes, no, absolutely. I’m a huge proponent of that and if you look at the types of models that we have available in Hugging Face to start and you look at some of the benchmark testing around their performance, I think it’s spectacular. And then the rate and pace of innovation around these open source models are also spectacular. Having said that, when you look at GPT-4o and the inference speed, the capability, the fact that it can do a million things for a billion people, I think that’s amazing.
But if you’re looking at an enterprise use case where you have very specific workflows and you’re looking to solve a very targeted problem, let’s say, to automate a specific workflow, maybe automate report generation as an example, or to do RAG for rich information retrieval within these predefined 10,000 documents, I think that you can pretty much solve all of these problems using open source models or take an existing smaller language model, fine tune them, invest in that, and then you can basically run it with privacy and security in your enterprise private cloud and then also deploy them on your edge devices increasingly. So I’m really, really bullish on using smaller models for targeted tasks.
Srini Penchikala: Yes, I tried Ollama for a use case a couple of months ago, and I definitely see value in open source solutions like Ollama that you can self-host. You don't have to send all your data to the cloud where you don't know where it's going. So you can use these self-hosted models with RAG techniques; RAG is mainly for the proprietary information knowledge base. I definitely think that combination is getting a lot of attention in corporations. Companies don't want to send data outside but still want to be able to use the power of these models.
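For readers who want to try this pattern, here is a minimal sketch of calling a locally hosted Ollama model over its default REST API with a RAG-style prompt; the model name and the hard-coded context are illustrative placeholders, and it assumes Ollama is already running locally with the model pulled.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def ask_local_model(question: str, context: str, model: str = "llama3") -> str:
    """Send a RAG-style prompt (retrieved context + question) to a locally hosted model."""
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    # In a real RAG setup this context would come from a retrieval step over internal documents.
    context = "Our expense policy reimburses home-office equipment up to $500 per year."
    print(ask_local_model("What is the home-office reimbursement limit?", context))
```

Because nothing leaves the local machine, the same pattern works for proprietary documents that should not be sent to an external provider.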
Roland Meertens: I do still think that at the moment most corporations are starting with OpenAI to prove their business value, and then they can start thinking about, "Oh, how can we really integrate it into our app?" So I think it's fantastic that you can so easily get started with this and then build your own infrastructure to support the app later on.
Srini Penchikala: Yes. For scaling up, right Roland? And you can see what’s the best scale-up model for you, right?
Roland Meertens: Yes.
Srini Penchikala: Yes. Let's continue the LLM discussion. Another area is the multi-modal LLMs, like the GPT-4o omni model, which I think definitely takes LLMs to the next level. It's not about text anymore; we can use audio or video or any of the other formats. Does anyone have any comments on GPT-4o or multi-modal LLMs in general?
Namee Oberst: In preparation for today's podcast, I actually did an experiment. I have a subscription to GPT-4o, so I put in a couple of prompts this morning, just out of curiosity, because we're very text-based, so I don't actually use that feature that much. I asked it to generate a new logo for LLMware using the word itself, and it failed three times; it botched the word LLMware every single time. So having said that, I know it's really incredible and I think they're making fast advances, but I was trying to see where they are today, and it wasn't great for me this morning. But of course I know they're still probably better than anything else out there, before anybody comes for me.
Roland Meertens: In terms of generating images, I must say I was super intrigued last year by how good Midjourney was and how fast they were improving, especially given the small size of the company. That a small company can just beat out the bigger players by having better models is fantastic to see.
Mandy Gu: I think that goes back to the theme Namee was touching on, where big companies like OpenAI are very good at generalization and they're very good at getting new people into the space, but as you get deeper, you find that, as we always say in AI and machine learning, there's no free lunch. You explore, you test, you learn, and then you find what works for you, which isn't always one of these big players. For us, where we benefited the most internally from the multi-modal models is not image generation, but more so the OCR capabilities. So one very common use case is just passing in images or files and then being able to converse with the LLM against, in particular, the images. That has been the biggest value proposition for us, and it's really popular with our developers because a lot of the time when we're helping our end users or our internal teams debug, they'll send us a screenshot of the stack trace or a screenshot of the problem, and being able to just throw that into the LLM as opposed to deciphering the message has been a really valuable time saver.
So not so much image generation, but from the OCR capabilities, we’ve been able to get a lot of value.
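As a rough illustration of that screenshot-debugging workflow, here is a minimal sketch using the OpenAI Python SDK's multimodal chat format; the model choice, file path, and prompt are hypothetical, and a self-hosted multimodal model could sit behind the same pattern instead.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def explain_stack_trace(screenshot_path: str) -> str:
    """Ask a multimodal model to read a screenshot of an error and suggest a likely cause."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "This is a screenshot of an error. Summarize the stack trace and suggest a likely cause."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: print(explain_stack_trace("error_screenshot.png"))
```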
Srini Penchikala: That makes sense. When you take these technologies, OpenAI or anyone else, it’s not a one-size-fits-all when you introduce the company use cases. So everybody has unique use cases.
Daniel Dominguez: I think it's interesting, now that we've mentioned all the Hugging Face libraries and models: looking at Hugging Face right now, there are more than 800,000 models. So it'll definitely be interesting to see next year how many new models are out there. Right now the trending ones are, as we mentioned, the Llama, Google Gemma, Mistral, and Stability models. So in one year, how many new models are going to be out there, not only for text, but also for images and video? It would be interesting to know how many models there were last year, and it will be an interesting number to see how many new models appear in this space next year.
RAG for Applicable Uses of LLMs at Scale [17:42]
Srini Penchikala: Yes, good point, Daniel. I think it's just like the application servers, probably 20 years ago, right? There was one coming out every week. I think a lot of these are going to be consolidated and just a few of them will stand out and last for a longer time. So let's quickly talk about RAG, which you mentioned. This is where I think the real sweet spot is for companies: to take their own company information, whether on-site or out in the cloud, run it through LLM models, and get the insights out. Do you see any real-world use cases for RAG that may be of interest to our listeners?
Mandy Gu: I think RAG is one of the most applicable uses of LLMs at scale, and depending on how you design the retrieval system, it can be shaped into many use cases. So for us, we use a lot of RAG internally, and we have an internal tool that we've developed which integrates our self-hosted LLMs against all of our company's knowledge sources. We have our documentation in Notion, we have our code in GitHub, and then we also have public artifacts from our help center website and other integrations.
And we essentially just built a retrieval augmented generation system on top of these knowledge bases. How we've designed this is that every night we have background jobs which extract this information from our knowledge sources and put it in our vector database, and then through a web app that we've exposed to our employees, they're able to ask questions or give instructions against all of this information. And when we did our benchmarking internally, we also found this to be a lot better from a relevancy and accuracy perspective than just feeding all of this context into the long context window of something like the Gemini 1.5 series. But going back to the question, primarily as a way of boosting employee productivity, we've seen a lot of really great use cases from RAG.
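To make that architecture a little more concrete, here is a minimal sketch of a nightly ingestion job plus a retrieval step of the kind Mandy describes, using Chroma as a stand-in vector database; the connector stub, collection name, and documents are hypothetical and not Wealthsimple's actual implementation.

```python
import chromadb  # pip install chromadb

client = chromadb.PersistentClient(path="./vector_store")
collection = client.get_or_create_collection("company_knowledge")

def fetch_documents() -> list[dict]:
    """Stub for connectors that would pull pages from Notion, GitHub, the help center, etc."""
    return [
        {"id": "notion-42", "source": "notion", "text": "Runbook: restarting the billing service..."},
        {"id": "help-7", "source": "help_center", "text": "How to update your bank account details..."},
    ]

def nightly_sync() -> None:
    """Nightly background job: re-index the latest snapshot of each knowledge source."""
    docs = fetch_documents()
    collection.upsert(
        ids=[d["id"] for d in docs],
        documents=[d["text"] for d in docs],
        metadatas=[{"source": d["source"]} for d in docs],
    )

def retrieve(question: str, n_results: int = 2) -> list[str]:
    """Return the most relevant chunks; a self-hosted LLM would generate the final answer from them."""
    hits = collection.query(query_texts=[question], n_results=n_results)
    return hits["documents"][0]

if __name__ == "__main__":
    nightly_sync()
    print(retrieve("How do I restart the billing service?"))
```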
Namee Oberst: Well Mandy, that is such a classic textbook, but really well-executed project for your enterprise and that’s such a sweet spot for what the capabilities of the LLMs are. And then you said something that’s really interesting. So you said you’re self-hosting the LLMs, so did you take an open source LLM or do you mind sharing some of the information? You don’t have to go into details, but that is a textbook example of a great application of Gen AI.
Mandy Gu: Yes, of course. So yes, they're all open source and a lot of the models we grabbed from Hugging Face as well. When we first started building our LLM platform, we wanted to provide our employees with a way to securely and accessibly explore this technology. And like a lot of other companies, we started with OpenAI, but we put a PII redaction system in front of it to protect our sensitive data. The feedback we got from our employees, our internal users, was that this PII redaction model actually prevented the most effective use cases of generative AI, because if you think about people's day-to-day work, there's a large degree of not just PII but sensitive information they need to work with. And that was our natural segue from, okay, how do we prevent people from sharing sensitive information with external providers, to how do we make it safe for people to share this information with LLMs? So that was our segue from OpenAI to the self-hosted large language models.
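As an illustration only, here is a minimal sketch of the kind of redaction step such a gateway might apply before a prompt leaves the company boundary; the regex patterns and function names are simplified placeholders, and a production system would use a dedicated PII detection model rather than a handful of regular expressions.

```python
import re

# Deliberately simplified patterns; real gateways use trained PII detectors.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SIN":   re.compile(r"\b\d{3}[-\s]?\d{3}[-\s]?\d{3}\b"),
}

def redact(text: str) -> str:
    """Replace anything that looks like PII with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

def gateway_call(prompt: str, send_to_provider) -> str:
    """Redact, record an audit entry, then forward the prompt to the external provider."""
    safe_prompt = redact(prompt)
    # Audit trail: who sent what, and what was redacted (logging omitted in this sketch).
    return send_to_provider(safe_prompt)

print(redact("Client jane.doe@example.com called from 416-555-0199 about her account."))
```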
Namee Oberst: I’m just floored Mandy. I think that’s exactly what we do at LLMware. Actually, that’s exactly the type of solution that we look to provide with using small language models chained at the back-end for inferencing. You mentioned Ollama a couple of times, but we basically have Llama.cpp integrated into our platform so that you can bring in a quantized model and inference it very, very easily and securely. And then I’m a really strong believer that this amazing workflow that you’ve designed for your enterprise, that’s an amazing workflow. But then we’re also going to see other workflow automation type of use cases that will be miniaturized to be used on laptops. So I really almost see a future very, very soon where everything becomes miniaturized, these LLMs become smaller and almost take the footprint of software and we’re all going to start to be able to deploy this very, very easily and accurately and securely on laptops just as an example, and of course private cloud. So I love it. Mandy, you’re probably very far ahead in the execution and it sounds like you just did the perfect thing. It’s awesome.
Mandy Gu: That's awesome to hear that you're finding similar things, and the work you're doing is amazing as well. You mentioned Llama.cpp, and I thought that's super interesting because I don't know if everyone realizes this, but there's so much of an edge that quantized models, smaller models, can give. Right now, when we're still in this phase of rapid experimentation, speed is the name of the game. Sure, we may lose a few precision points by going with more quantized models, but what we get back in latency, what we get back from being able to move faster, is incredible. And I think Llama.cpp is a huge success story in its own right: how this framework, created by an individual and a relatively small group of people, is able to be executed at scale.
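For context, here is a minimal sketch of loading and prompting a quantized GGUF model with the llama-cpp-python bindings; the model path and generation parameters are placeholders for whatever quantized model you pull from Hugging Face.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: any GGUF-quantized checkpoint downloaded from Hugging Face.
llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to Metal/CUDA if available, otherwise run on CPU
)

output = llm(
    "Explain in one sentence why quantized models are attractive for on-device inference:",
    max_tokens=128,
    temperature=0.2,
)
print(output["choices"][0]["text"].strip())
```

The trade-off Mandy mentions shows up directly here: a 4-bit quantized checkpoint gives up a little accuracy in exchange for a much smaller memory footprint and faster, cheaper inference.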
AI Powered Hardware [23:03]
Namee Oberst: Yes, I love that discussion, because Llama.cpp, Georgi Gerganov, amazing, amazing work in open source, but it's optimized for Mac Metal and also works really well with NVIDIA CUDA. So the work that we're doing is to allow data scientists and machine learning groups in enterprises, on top of everything else, to deliver the solution not only on Mac Metal but across all AI PCs. So using Intel OpenVINO and Microsoft ONNX: data scientists like to work on Macs, but they can then also deploy very seamlessly and easily on other AI PCs, because macOS is only about 15% of all the operating systems out there; the other 85% are non-macOS. So just imagine the next phase of all this, when we can deploy across multiple operating systems and access the GPU capabilities of all these AI PCs. It's going to be really exciting as a trend in the future to come, I think.
Small Language Models and Edge Computing [24:02]
Srini Penchikala: Yes, a lot of good stuff happening there. You both mentioned small language models and also edge computing. Maybe we can segue into that. I know we can talk about LLMs for a long time, but I want to hear your perspective on other topics. So regarding small language models, Namee, you've been looking into SLMs at your company, LLMware, and also a RAG framework you mentioned that is specifically designed for SLMs. Can you talk about this space a little bit more? I know this is a recent development. I know Microsoft is doing some research on what they call the Phi-3 model. Can you talk about this? How are they different? What can our listeners do to get up to speed with SLMs?
Namee Oberst: So we’re actually a pioneer in working with small language models. So we’ve been working and focused actually on small language models for well over a year, so almost like too early, but that’s because RAG as a concept, it didn’t just come out last year. You know that probably RAG was being used in data science and machine learning for probably the past, I’d say, three or four years. So basically when we were doing experimentation with RAG and we changed one of our small parameter models very, very early on in our company, we realized that we can make them do very powerful things and we’re getting the performance benefits out of them, but with the data safety and security and exactly for all the reasons that Mandy mentioned and all these things were top of mind for me because my background is I started as a corporate attorney at a big law firm and I was general counsel of a public insurance brokerage company.
So those types of data safety and security concerns were really top of mind. For those types of highly regulated industries, it's almost a no-brainer to use small language models or smaller models for so many different reasons, a lot of which Mandy articulated, but there's also the cost reason. Cost is huge also. There's no reason to deploy these large behemoth models when you can shrink the footprint and bring down the cost significantly. What's really amazing is other people have started to realize this, and on the small language model front they're getting better and better and better. Take the latest iteration by Microsoft, Phi-3: we have RAG fine-tuned models on Hugging Face that are specifically designed to do RAG. We fine-tuned it using our proprietary datasets, the same datasets on which we've fine-tuned 20 other models the exact same way, so we have a true apples to apples comparison. The Phi-3 model really stood out on our test. It was the best performing model out of every model we've ever tested, including 8 billion parameter models.
So our range is from one to 8 billion parameters, and it just performed the highest in terms of accuracy; it blew my mind. The small language models that they're making accessible to everyone in the world for free on Hugging Face are getting better and better at a very rapid clip. So I love that. I think it's such an exciting world, and this is why I made the assertion earlier: with this rate and pace of innovation, they are going to become so small that they're going to take on the footprint of software in the not-so-distant future, and we are going to look to deploy a lot of this on our edge devices. Super exciting.
Srini Penchikala: Yes, definitely, a lot of the use cases include a combination of offline large language model processing and online, on-device, real-time analysis closer to the edge. That's where small language models can help. Roland, Daniel, or Anthony, do you have any comments on small language models? What are you seeing in this space?
Anthony Alford: Yes, exactly. Microsoft's Phi, however you pronounce it, I think first we need to figure that out, but definitely that's been making headlines. The other thing, and we have this on our agenda, Namee, you mentioned that they're getting better. The question is: how do we know how good they are? How good is good enough? There are a lot of benchmarks. There are things like MMLU, there's HELM, there's the Chatbot Arena, there are lots of leaderboards, there are a lot of metrics. I don't want to say people are gaming the metrics, but it's like p-hacking, right? You publish a paper that says you've beaten some other baseline on this metric, but that doesn't always translate into, say, business value. So I think that's a problem that is still to be solved.
Namee Oberst: Yes, no, I fully agree. Anthony, your skepticism around the public…
Anthony Alford: I’m not skeptical.
Namee Oberst: No, actually, I'm not crazy about them. We've developed our own internal benchmarking tests that ask common sense business-type questions and legal questions, just fact-based questions, because our platform is really for the enterprise. In an enterprise you care less about creativity in this instance and more about how well these models are able to answer fact-based questions, basic logic, basic math, like yes or no questions. So we created our own benchmark testing, and the Phi-3 result is based on that, because I'm skeptical of some of the published results. I mean, have you actually looked through some of those questions, like on HellaSwag or whatever? I can't answer some of them, and I am not an uneducated person; I don't know what the right or wrong answer is sometimes either. So we decided to create our own testing, and the Phi-3 results that we've been talking about are based on what we developed. And I'm not sponsored by Microsoft. I wish I were, but I'm not.
Srini Penchikala: Definitely, I want to get into LLM evaluation shortly, but before we go there, any language model thoughts?
Roland Meertens: One thing which I think is cool about Phi is that they trained it using higher quality data and also by generating their own data. For example, for the coding, they asked it to write instructions for a student and then trained on that data. So I really like seeing that if you have higher quality data and you select your data better, you also get better models.
Anthony Alford: “Textbooks Are All You Need”, right? Was that the paper?
Roland Meertens: "Textbooks Are All You Need" is indeed the name of the paper, but there are multiple papers coming out, also from people working at Hugging Face, such as "SantaCoder: don't reach for the stars!". There's so much interest in what data you want to feed into these models, which is still an underrepresented part of machine learning, I feel.
Srini Penchikala: Other than Phi, and I guess that's probably the right way to pronounce it, I know, Daniel, you mentioned TinyLlama. Do you have any comments on these tiny language models, small language models?
Daniel Dominguez: Yes. I think, like Namee said, with many of those language models running now on Hugging Face there are a lot of things to discover. One thing that is interesting on Hugging Face as well is this new GPU-poor versus GPU-rich idea; I don't know if you have seen the tags they're doing on the leaderboard, where according to your machine you are GPU-rich or GPU-poor, but you're still able to run all these language models. And thanks to all the chips in the industry right now, for example those from NVIDIA, these small language models can run in the cloud but are also able to run on the more modest GPUs and systems that people have on their own machines.
So those small language models are able to run thanks to all these GPUs from companies like NVIDIA. And on Hugging Face you can see that you're able to run all of these on your own machine without needing huge machine capacity. So that's also something interesting: you can run small language models on an ordinary machine as well.
Srini Penchikala: Yes. I know there's a lot of other AI innovation happening, so quickly, before we leave the language model discussion, I know you mentioned evaluation. Other than the benchmarks, which can be take-it-with-a-grain-of-salt types of metrics, what about real-world best practices? Like you mentioned, Daniel, there are so many language models, so how can someone new to this space compare these LLMs, eliminate the ones that may not work for them, and choose something that does? Have you seen any industry practices or standards in this space?
Mandy Gu: So I think Anthony mentioned something interesting, which is business value, and I think that's something important we should think about for evaluation. I'm also quite skeptical of these general benchmarking tests, but what we should really do is evaluate the LLMs, not just the foundational models but the techniques and how we orchestrate the system, against the task at hand. If, for instance, the problem I'm trying to solve is summarizing a research paper, or distilling the language, I should be evaluating the LLM's capabilities for that very specific task, because going back to the no free lunch theorem, there's not going to be one set of models or techniques that's the best for every task. And this experimentation process will give me more confidence to find the right set, or the best set. At the end of the day, how we quantify "better" should be based on evaluating the task at hand and the end results against the success criteria we want to see.
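A minimal sketch of what that kind of task-level evaluation can look like, assuming a hypothetical summarization task and a tiny hand-built golden set; the examples, required facts, and the `generate` callable are all illustrative placeholders, not a real benchmark.

```python
from typing import Callable

# A tiny task-specific "golden set": each summary must mention these key facts.
GOLDEN_SET = [
    {"document": "Paper A ... finds that very long context windows degrade retrieval precision ...",
     "must_mention": ["context window", "precision"]},
    {"document": "Paper B ... shows 4-bit quantized models keep most accuracy at far lower latency ...",
     "must_mention": ["quantized", "latency"]},
]

def evaluate(model_name: str, generate: Callable[[str], str]) -> float:
    """Score a candidate model on the concrete task instead of a generic leaderboard metric."""
    hits = 0
    for example in GOLDEN_SET:
        summary = generate(f"Summarize the key finding:\n{example['document']}").lower()
        if all(fact in summary for fact in example["must_mention"]):
            hits += 1
    score = hits / len(GOLDEN_SET)
    print(f"{model_name}: {score:.0%} of summaries covered the required facts")
    return score

# Usage: evaluate("candidate-model", generate=lambda prompt: call_my_model(prompt))
```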
AI Agents [33:27]
Srini Penchikala: Yes, we can definitely add the links to these benchmarks and public leaderboards in our transcript. So in the interest of time, let's jump to the other topics. Next up, AI agents. I know there have been a lot of developments in this area and in AI-powered coding assistants. Roland, what are you seeing here? I know you spent some time with Copilot and other tools.
Roland Meertens: I mean, last year you asked what I thought the trend was going to be next year, and I said AI agents, and I don't think I was completely right. We see some things happening with agents. At some point OpenAI announced that they now have this GPT Store, so you can create your own agents. But to be honest, I've never heard anyone telling me, "Oh man, you should use this agent. It's so good." So in that sense, I think there's not a lot of progress so far, but we see some things like, for example, Devin, this AI software engineer, where you have an agent which has a terminal, a code editor, and a browser, and you can basically assign it a ticket and say, "Hey, try to solve this," and it tries to do everything on its own. I think at the moment Devin had a success rate of maybe 20%, but that's pretty okay for a free software engineer.
The other thing is that you've got some places like AgentGPT. I tried it out; I asked it to create an outline for the Trends in AI podcast and it was like, "Oh, we can talk about trends like CNNs and RNNs." I don't think those are the trends anymore, but it's good that it's excited about it. But yes, overall I think there's still massive potential for "you want to do something, it happens completely automatically". Instead of me trying to figure out which email I should send using ChatGPT, then sending it, then the other person summarizing it and writing a reply using ChatGPT, why not take all of that out and have the emails fly automatically?
Anthony Alford: My question is, what makes something an agent?
Roland Meertens: Yes, that’s a good question. So I think what I saw so far in terms of agents is something which can combine multiple tasks.
Anthony Alford: When I was in grad school, I was studying intelligent agents and essentially we talk about someone having agency. It’s essentially autonomy. So I think that’s probably the thing that the AI safety people are worried about is giving these things autonomy. And regardless of where you stand on AI doom, it’s a very valid point. Probably ChatGPT is not ready for autonomy.
Roland Meertens: It depends.
Anthony Alford: Yes, but very limited scope of autonomy.
Roland Meertens: It depends on what you want to do and where you are willing to give your autonomy away. I am not yet willing to put an autonomous Roland agent in my workspace. I think I wouldn't come across very smart. It would be an improvement over my normal self, but I see that people are doing this for dating apps, for example, where they are automating that part. Apparently they're willing to risk that.
Daniel Dominguez: As Roland said, they're not yet on the big wave, but definitely something is going to happen with them. For example, I saw recently that Meta and Zuckerberg said the new Meta AI agents for small businesses are going to help small business owners automate a lot of things in their own spaces. HuggingChat also has a lot of AI agents for daily workflows. I know that, for example, Slack now has a lot of AI agents to help summarize conversations, tasks, or daily workflows.
So I think we're going to start seeing AI agents in the daily workspace and in small businesses more naturally as this landscape continues to develop, because it's going to help with a lot of the things that we have to do on a daily basis. This is just going to start working more and more, and the different companies are going to start offering their own agents on their own platforms. For example, I know that Google is going to start offering AI agents for, as Roland says, Gmail tasks and so on. So that's something that is probably going to start moving faster over the next year or so.
Roland Meertens: Yes, and especially with LangChain you can just say, "Oh, I've got these API functions you can call, I want this workflow. If you manage to reach this, then do this. If you don't manage that, use this other API." Just combining all the tools in the toolbox and doing it automatically is insanely powerful.
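To illustrate the underlying pattern that frameworks like LangChain automate, here is a minimal framework-agnostic sketch of a tool-calling loop; the two toy tools, the JSON protocol, and the `llm` callable are hypothetical stand-ins rather than any particular library's API.

```python
import json
from typing import Callable

# Toy "API functions" the agent may call; real ones would hit internal services.
def check_inventory(item: str) -> str:
    return json.dumps({"item": item, "in_stock": True})

def place_order(item: str) -> str:
    return json.dumps({"item": item, "status": "ordered"})

TOOLS: dict[str, Callable[[str], str]] = {
    "check_inventory": check_inventory,
    "place_order": place_order,
}

def run_agent(llm: Callable[[str], str], goal: str, max_steps: int = 5) -> str:
    """Loop: ask the model which tool to call next, execute it, and feed the result back."""
    history = f"Goal: {goal}\nAvailable tools: {list(TOOLS)}\n"
    for _ in range(max_steps):
        # The model is prompted to answer with JSON: {"tool": ..., "arg": ...} or {"final": ...}
        decision = json.loads(llm(history))
        if "final" in decision:
            return decision["final"]
        result = TOOLS[decision["tool"]](decision["arg"])
        history += f"\nCalled {decision['tool']}({decision['arg']!r}) -> {result}"
    return "Stopped after max_steps without a final answer"
```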
Mandy Gu: I think that's a great point. Something we take for granted with agents is that they're integrated in the places where we do work. So Roland, your example with Gmail: having this assistant embedded within Google Workspace so it can actually manage your emails, as opposed to going to ChatGPT and asking, "How do I augment this email?" or whatever it is that you want to do. From a behavioral perspective, this movement of information between platforms is just such a huge source of toil, and if we can give our end users one less tab or one less place they have to go to do their work, that's going to be a huge lift, and ultimately what really drives the adoption of these agents.
Srini Penchikala: Mandy, it would actually be nice for these agents to help us decide when to send an email and when not to send an email and make a phone call instead. I mean, that could be even more productive, right?
Roland Meertens: I am wondering, in terms of trends, I think last year was really the year where every company said, "We are now an AI company. We are going to have our own chatbots." And I don't know, I've even seen some co-workers who said, "Oh, I'm trying to make this argument. I let ChatGPT generate something for me, which is three pages of argument. I think it looks pretty good." And then I don't care about your argument anymore. I don't want to chat with your chatbot; I just want to see your website. So I also wonder where this is going to settle in the middle. Is every company, every website, going to be a chatbot now, or can you also just look up what the price of a book is instead of having to ask some agent to order it for you?
Srini Penchikala: We don’t want to over-agentize our applications, right?
Roland Meertens: Don’t over-agentize your life is a tip.
AI Safety and Security [40:14]
Srini Penchikala: Yes, let's keep moving. Anthony, you mentioned AI safety, so let's get into security. Namee and Mandy, you are both working on a lot of different real-world projects. How do you see security versus innovation? How can we make these revolutionary technologies valuable and, at the same time, safe to use in terms of privacy and consumer data?
Mandy Gu: There have definitely been a lot of second order effects in the security space from generative AI; things like fourth-party data sharing and data privacy concerns are on the rise. A lot of the SaaS vendors that we work with, that a lot of companies work with, will have an AI integration, and they don't always make it clear, but a lot of the time they're actually sending your data to OpenAI. And depending on the sensitivity of your data, that's something you want to avoid. So I think there are two things to keep in mind here. One is we need to have a comprehensive lineage and mapping of where our data is going, and this is a lot harder with the rise of AI integrations. And the second part is, if we want our employees to have proper data privacy and security practices, then we have to make the secure path the path of least resistance for them.
So going back to the example I shared earlier, if we add a super strict PII redaction system on top of all the conversations with OpenAI and other providers, then people are going to be discouraged and they're going to just go to ChatGPT directly. But if we offer them alternatives, give them carrots to make this more accessible or add other features that they need, and make that the path of least resistance, then that's how we win over our internal users and how we build up that culture of good data privacy practices.
Namee Oberst: Yes, Mandy, I think the workflow that you described actually underscores what I like to emphasize when we're talking about data safety and security: the way that you design the generative AI workflow in your enterprise has such an impact on the safety and security of all your sensitive data. So you take into consideration, like Mandy did: when will PII come into play? Do we have vendors, for instance, who might inadvertently send our sensitive data to a provider that I don't feel comfortable with, like OpenAI, just as an example? You need to look at that. You need to look at data lineage. You need to make sure that your workflow also has auditability in place so that you can trace back all the interactions that took place across all the inferences, and how AI explainability comes into play. Are there attack surfaces potentially in the workflow that I've designed? What about prompt injection?
By the way, fun fact: small language models are less prone to prompt injection because they're so used to taking on such small, narrow tasks that they almost can't generalize enough to be prone to it. But you're still worrying about prompt injection, RAG poisoning, things like that. So I think there are a lot of considerations that an enterprise needs to take into account when they deploy AI, but Mandy, I think a lot of the points that you brought out are spot on.
Mandy Gu: I like what you mentioned about the attack surfaces, because that's something that can quickly get out of control. One analogy I've heard about generative AI and the AI integrations is that it's like cable versus streaming: so many different companies are coming up with their own AI integrations, and buying them all is like paying for Netflix, Hulu, and all of these streaming services at once. Not only is it not economical, but it really increases the attack surfaces as well. I think this is something we need to build into our build-versus-buy philosophy, and we need to be really cognizant and deliberate about what we pay for and where we send our data.
One trend that we have noticed is that the general awareness of these issues is getting better. I think the vendors, the SaaS providers, are responding to these concerns, and I've seen more and more offerings of, "Hey, maybe we can host this as part of your VPC. If you're on AWS or if you're on GCP, I'll run Gemini for you, so this data still stays within your cloud tenant." I think that's one positive trend that I am seeing when it comes to security awareness in this space.
Namee Oberst: Absolutely.
LangOps or LLMOps [44:27]
Srini Penchikala: Along with security, the other important aspect is how we manage these LLMs and AI technologies in production. So quickly, can we talk about LangOps or LLMOps? There are a few different terms for this. Maybe Mandy, you can lead us off on this. How do you see the production support of LLMs going, and what are some lessons learned there?
Mandy Gu: Yes, absolutely. The way we divide our LLM efforts at Wealthsimple, we have three very distinct streams. The first is boosting employee productivity. The second is optimizing operations for our clients. And the third is this foundational LLMOps, or what we like to call LLM platform work, which enables the other two efforts. We've had a lot of lessons learned, and what has worked for us has been our enablement philosophy. We've really centered this around security, accessibility and optionality. At the end of the day, we just really want to provide optionality so everyone can choose the best techniques and foundational models for the tasks at hand. And this has really helped prevent one of the common problems we see in this space, where people use LLMs as a hammer looking for nails. By providing these reusable platform components, the organic extension and adoption of generative AI has been a lot more prevalent.
This was a lesson we learned over time. For example, when we first started our LLM journey, we built this LLM gateway with an audit trail and a PII redaction system for people to safely converse with OpenAI and other providers. We got feedback that the PII redaction restricted a lot of real-world use cases. So then we started enabling self-hosted models, where we can easily take an open source model, fine-tune it, add it to our platform, and make it available for inferencing for both our systems and our end users through the LLM gateway. From there we looked at building retrieval as a reusable API, building up the scaffolding and accessibility around our vector database. And slowly, as we started platformizing more and more of these components, our end users, the scientists, the developers, various folks within the business, started playing around with it and identifying, "Hey, here's a workflow that would actually really benefit from LLMs." And this is when we step in and help them productionize it and deploy the product at scale.
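As a rough illustration of "retrieval as a reusable API", here is a minimal sketch of a small FastAPI service fronting a vector database; the endpoint name, request shape, and stubbed `search_vector_db` function are hypothetical, not Wealthsimple's actual platform.

```python
from fastapi import FastAPI          # pip install fastapi uvicorn
from pydantic import BaseModel

app = FastAPI(title="retrieval-api")

class RetrieveRequest(BaseModel):
    query: str
    top_k: int = 5

class RetrieveResponse(BaseModel):
    passages: list[str]

def search_vector_db(query: str, top_k: int) -> list[str]:
    """Placeholder for the embedding + vector-store lookup behind the platform."""
    return [f"passage {i} relevant to: {query}" for i in range(top_k)]

@app.post("/retrieve", response_model=RetrieveResponse)
def retrieve(req: RetrieveRequest) -> RetrieveResponse:
    # Any internal team or LLM workflow can reuse this endpoint instead of
    # wiring up its own embedding and vector-store stack.
    return RetrieveResponse(passages=search_vector_db(req.query, req.top_k))

# Run locally with: uvicorn retrieval_api:app --reload
```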
AI Predictions [46:49]
Srini Penchikala: Thanks, Mandy. Let’s wrap up this discussion. A lot of great discussion. I know we can talk about all of these topics for a long time and hopefully we can have some follow up, one-on-one podcast discussions on these. So before we wrap up, I want to ask one question to each one of you. What is your one prediction in AI space that may happen in the next 12 months? So when we come back to discussion next year, what can we talk about in terms of predictions? Mandy, if you want to go first.
Mandy Gu: I think a lot of the hype around LLMs is going to sober up, so to speak. We've seen this rapid explosion in growth over the past year and a half, and for a lot of companies and industries, LLMs are still a bet, a bet that they're willing to continuously finance. But I think that will change over the coming 12 months, where we start building more realistic expectations for this technology and also for how much we're willing to explore before we expect a tangible result. So I'm expecting this to be less of a hype 12 months from now, and also for the companies that still use this technology to have tangible ways in which it's integrated into their workflows or their products.
Srini Penchikala: Daniel, how about you?
Daniel Dominguez: I think with all the data that is being generated with artificial intelligence, there will be some kind of integration with, for example, blockchain. I have seen, for example, that a lot of blockchain projects include data integration with artificial intelligence. Blockchain and artificial intelligence are probably still in the early days, but definitely something will be integrated between them, mainly on data, mainly on integration in that space, meaning in databases or something like that. So we're probably still in the early days, but for me, artificial intelligence and blockchain are still going to be a huge integration.
Srini Penchikala: What about you Roland?
Roland Meertens: I’m still hoping for more robotics, but nowadays, we are calling it embodied AI. That’s the name change, which started somewhere in the last year. I don’t know exactly when, but if you take the agents, right, they can perform the computer tasks for you. But if you can put that into a robot and say, “Get me this thing, pick that thing up for me,” just general behavior, embodied AI is the next big thing, I think. That’s what I’m hoping for.
Srini Penchikala: So those robots will be your pair programmer, right?
Roland Meertens: Well, no. There will be the agents who will be your pair programmer, but the robots will help you in your life. The other thing which I'm really wondering is, now companies have all this data, are they going to fine-tune their own models with their data and sell those models? Or is everybody going to stay on the RAG train? Imagine that you're, I don't know, a gardener, and you have years' worth of photos of gardens and the advice you wrote on how to improve each garden. There must be so many tiny companies which have this data; how are they going to extract value out of it? So I'm super excited to see what smaller companies can do with their data and how they are going to create their own agents or their own chatbots or their own automations using AI.
Srini Penchikala: Anthony, how about you?
Anthony Alford: AI winter. Well, Mandy already said it, right? She said, "Maybe we'll see the hype taper off," which I'd say is the mild form of AI winter. For the strong form of AI winter, maybe you saw this headline, I think it was a paper in Nature, that says if you train generative AI on content generated by generative AI, it gets worse. And I think people are already starting to wonder: is the internet being polluted with generated content? So we shall see. I hope I'm wrong. This is one where I hope I'm wrong, so I'll be glad to take the L on that one.
Srini Penchikala: No, it is very possible, right? And how about you Namee? What do you see as a prediction in the next 12 months?
Namee Oberst: So I foresee a little bit of what Anthony and Mandy described, but actually then moving on very, very quickly to the much more valuable, realistic and tangible use cases, probably involving more automated workflows and the agent work processes and then moving into more edge devices like laptops and even phones. So that’s what I’m foreseeing. So we shall see. It’ll be interesting.
Srini Penchikala: Yes, it’ll be interesting. Yes, that’s what I’m seeing as well. So it’ll be more unified, end-to-end, holistic, AI-powered solutions with these small language models, RAG, the AI-powered hardware. So I think a lot of good things are happening. I think hopefully, Anthony, the AI winter won’t last for too long. That’s the podcast for today. Anybody have any concluding remarks?
Namee Oberst: It was so fun to be on this podcast. Thank you so much for having me. I really enjoyed the experience here.
Anthony Alford: Ditto. Loved it.
Mandy Gu: Yes.
Roland Meertens: I especially like seeing our podcast over the years. If you go back to, I think we started this in 2021, maybe. It’s always fun to see how our predictions change over the years and how our topics change over the years.
Srini Penchikala: I want to thank all the panelists for joining and participating in this discussion of the 2024 AI and ML Trends Report and what to look forward to in this space for the remainder of this year and next year. To our audience, we hope you all enjoyed this podcast, and we hope this discussion has offered a good roundup update on the emerging trends and technologies in the AI and ML space. Please visit the infoq.com website and download the trends report with an updated version of the adoption graph, which will be available along with this podcast recording and will show which trends, technologies, and topics are becoming more mature in terms of adoption and which are still in the emerging phase.
I hope you join us again soon for another episode of the InfoQ Podcast. Check out the previous podcasts on various topics of interest like architecture and design, cloud, DevOps, and of course AI, ML and data engineering in the podcast section of the infoq.com website. Thank you everyone, and have a great one. Until next time.