Podcast: Sam Partee on Retrieval Augmented Generation (RAG)

MMS • Sam Partee

Introduction [00:52]

Roland Meertens: Welcome everybody to the InfoQ podcast. My name is Roland Meertens and I’m your host for today. Today, I’m interviewing Sam Partee, who is a Principal Applied AI engineer at Redis. We are talking to each other in person at the QCon San Francisco conference just after he gave the presentation called Generative Search, Practical Advice for Retrieval Augmented Generation. Keep an eye on InfoQ.com for his presentation as it contains many insights into how one can enhance large language models by adding a search component to retrieval augmented generation. During today’s interview, we will dive deeper into how you can do this and I hope you enjoy it and I hope you can learn from the conversation.

Welcome Sam to the InfoQ podcast.

Sam Partee: Thank you.

Roland Meertens: We are recording this live at QCon in San Francisco. How do you like the conference so far?

Sam Partee: It’s awesome. I’m really glad Ian invited me to this and we’ve had a really good time. I’ve met some really interesting people. I was talking to the Sourcegraph guys earlier and really loved their demo and there’s a lot of tech right now in the scene that even someone like me who every single day I wake up… I went to a meetup for Weaviate last night. I still see new things and it’s one of the coolest things about living here and being in this space and QCon’s a great example of that.

The Redis vector offering [02:09]

Roland Meertens: Yes, it’s so cool that here everybody is working on state-of-the-art things. I think your presentation was also very much towards state-of-the-art and one of the first things people should look at if they want to set up a system with embeddings and want to set up a system with large language models, I think. Can you maybe give a summary of your talk?

Sam Partee: Yes, absolutely. So about two years ago, Redis introduced its vector offering, essentially vector database offering. It turns Redis into a vector database. So I started at Redis around that time and my job was not necessarily to embed the HNSW into the database itself. There was an awesome set of engineers, DeVere Duncan and those guys who are exceptional engineers. There was a gap relative to other vector databases: you couldn’t use Redis with LangChain or LlamaIndex or what have you. So my job is to actually do those integrations and on top of that to work with customers. And so over the past two or so years I’ve been working with those integration frameworks, with those customers, with those users in open source, and one of the things that you kind of learn through doing all of that are a lot of the best practices. And that’s really what the talk is: just a lot of stuff that I’ve learned by building, and that’s essentially the content.

Roland Meertens: Okay, so if you say that you are working on vector databases, why can’t you just simply store a vector in any database?

Sam Partee: So it kind of matters what your use case is. So, for Redis for instance, let’s take that. It’s an incredible real-time platform, but if your vectors never change, if you have a static dataset of a billion embeddings, you’re way better off using something like Faiss and storing it in an S3 bucket, loading it into a Lambda function and calling it every once in a while. Just like programming languages, there’s not a one size fits all programming language. You might think Python is, but that’s just because it’s awesome. But it’s the truth that there’s no tool that fits every use case and there’s certainly no vendor that fits every use case. Redis is really good at what it does and so are a lot of other vendors and so you really just got to be able to evaluate and know your use case and evaluate based on that.

Roland Meertens: So what use cases have you carved out for what people would you recommend to maybe re-watch your talk?

Sam Partee: Yes, so specifically use cases, one thing that’s been really big that we’ve done a lot of is chat conversations. So long-term memory for large language models is this concept where the context window, even in the largest case, is, what is it, 16K? No, 32K, something like that, I think that’s GPT-4. Even then you could have a chat history that is over that 32K token limit and in that case you need other data structures than just a vector index and you need things like sorted sets, ZSETs, in Redis. And so there are other data structures that come into play that act as memory buffers or things like that that for those kind of chat conversations end up really mattering and they’re actually integrated into LangChain. Another one is semantic caching. Caching is what Redis has been best at for its decadal career, something like that, ever since Salvatore wrote it.

Semantic caching is simply the addition. It’s almost the next evolution of caching where instead of just a perfect one-to-one match like you would expect a hash would be, it’s more of a one-to-many in the sense that you can have a threshold for how similar a cached item should be and what that cached item should return is based on that specific percentage of threshold. And so it allows these types of things like, say the chat conversation, where if you want to return, “Oh, what’s the last time that I said something like X?” You can now do that and have the same thing that is not only returning that conversational memory but also have that cached. And with Redis you get that all at a really high speed. And so for those use cases it ends up being really great and so there’s obviously a lot of others, but I’ll talk about some that it’s not.
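To make the threshold idea concrete, here is a minimal in-process sketch of a semantic cache. It is an illustration only: the `SemanticCache` class, its method names, the 0.9 threshold, and the three-dimensional vectors are all assumptions made for this example, not a Redis or LangChain API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Toy semantic cache (hypothetical; a real one would live in a vector store)."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries = []           # list of (embedding, cached_response)

    def store(self, embedding, response):
        self.entries.append((embedding, response))

    def check(self, embedding):
        """Return the most similar cached response if it clears the threshold."""
        best_score, best_response = -1.0, None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.9)
# Hand-written vectors stand in for real prompt embeddings.
cache.store([1.0, 0.0, 0.0], "Redis is an in-memory database.")

hit = cache.check([0.95, 0.05, 0.0])  # a near-duplicate prompt
miss = cache.check([0.0, 1.0, 0.0])   # an unrelated prompt
```

Unlike a hash lookup, the near-duplicate prompt still returns the cached response, while the unrelated prompt falls through and would go to the model.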

So some that it’s not, that we’ve seen that we are not necessarily the best for. It’s an in-memory database. And we have tiering and auto tiering which allows you to go to NVMe, and you can have an NVMe drive, whatever, and go from memory to NVMe and it can actually do that automatically now, which is quite fascinating to me. But even then, and even if you have those kinds of things enabled, there are cases like I mentioned where you have a product catalog that changes once every six months and it’s not a demanding QPS use case. You don’t need the latencies of Redis. You call this thing once every month to set user-based recommendations that are relatively static or something like that.

Those use cases, it’s kind of an impractical expense. And it’s not like I’m trying to down talk the place I work right now. It’s really just so that people understand why it is so good for those use cases and why it justifies, and even in that case of something like a recommendation system that is live or online, it even justifies itself in terms of return on investment. And so those types of use cases it’s really good for, but the types that are static that don’t change, it really isn’t one of the tools that you’re going to want to have in your stack unless you’re going to be doing something more traditional like caching or using it for one of its other data structures, which is also a nice side benefit that it has so many other things it’s used for. I mean, it’s its own streaming platform.

Use cases for vector embeddings [08:09]

Roland Meertens: So maybe let’s break down some of this use case. You mentioned extracting the vectors from documents and you also mentioned if a vector is close enough, then you use it for caching. Maybe let’s first dive into the second part because that’s what we just talked about. So you say if a vector is close enough, does Redis then internally build up a tree to do this fast nearest neighbor search?

Sam Partee: Sure. Yes. So we have two algorithms. We have KNN, K-nearest neighbors, brute force, and you can think of this like an exhaustive search. It’s obviously a little bit better than that, but imagine just going down a list and doing that comparison. That’s a simplified view of it, but that’s called our flat index. And then we have HNSW, which is our approximate nearest neighbors index. And so both of those are integrated; we’ve vendored hnswlib, and that’s what’s included inside of Redis. It’s modified for making it work with things like CRUD operations inside of Redis. But that’s what happens when you have those vectors and they’re indexed inside of Redis. You choose, if you’re using something like RedisVL, you can pass in a dictionary configuration or a YAML file or what have you, and that chooses what index you end up using for search.

And so for the people out there that are wondering, “Which index do I use?” Because that’s always a follow-up question, if you have under a million embeddings, the KNN search is often better because if you think about it in the list example, appending to a list is very fast and recreating that list is very fast. Doing so for a complex tree or graph-based structure is more computationally complex. And so if you don’t need the latencies of something like HNSW, if you don’t have that many documents, if you’re not at that scale, then you should use the KNN index. In the other case, if you’re above that threshold and you do need those latencies, then HNSW provides those benefits, which is why we have both and we have tons of customers using either one.
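The "going down a list" picture of a flat index can be sketched in a few lines. This is a simplified illustration, assuming cosine distance and a plain Python list; the `FlatIndex` class and its keys are hypothetical and say nothing about how Redis implements its flat index internally.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


class FlatIndex:
    """Brute-force index: inserts are list appends, searches scan everything."""

    def __init__(self):
        self.items = []  # list of (key, vector)

    def add(self, key, vector):
        # Cheap write path: no tree or graph to rebalance.
        self.items.append((key, vector))

    def search(self, query, k):
        # Exhaustive comparison against every stored vector.
        ranked = sorted(self.items, key=lambda item: cosine_distance(query, item[1]))
        return [key for key, _ in ranked[:k]]


index = FlatIndex()
index.add("doc:1", [1.0, 0.0])
index.add("doc:2", [0.0, 1.0])
index.add("doc:3", [0.8, 0.2])

nearest = index.search([1.0, 0.0], k=2)
```

The write path is trivially cheap, while search cost grows linearly with the number of vectors, which is the trade-off behind the rough "under a million embeddings" guidance above.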

Roland Meertens: So basically then we’re just down to if you have anything stored in your database, it’s basically like a dictionary with nearest neighbor. Do you then get multiple neighbors back or do you just return what you stored in your database at this dictionary location?

Sam Partee: There are two specific data structures, Hashes and JSON documents. Hashes in Redis are like, at a key, you store a value. JSON documents, at a key, you store a JSON document. When you are doing a vector similarity search within Redis, whatever client you use, whether it’s Go or Java or what have you, or Python, what you get back in terms of a vector search is defined by the syntax of that query. And there are two major ones to know about. First are vector searches, just plain vector searches, which are, “I want a specific number of results that are semantically similar to this query embedding.” And then you have range queries which are, “You can return as many results as you want, but they have to be within this specific range of vector distance away from this specific vector.” And while I said semantically earlier, it could be visual embeddings, it can be semantic embeddings, it doesn’t matter.

And so vector searches, range searches, queries, et cetera, those are the two major methodologies. It’s important to note that also Redis just supports straight up just text search and other types of search features which you can use combinatorially. So all of those are available when you run that and it’s really defined by how you use it. So if you are particular about, let’s say it’s a recommendation system or a product catalog, again to use that example, you might say, “I only want to recommend things to the user,” there’s probably a case for this, “If they’re this similar. If they’re this similar to what is in the user’s cart or this basket.” You might want to use something like a range query, right?

Roland Meertens: Yes, makes sense.

If you’re searching for, I don’t know, your cookbooks on Amazon, you don’t want to get the nearest instruction manual for cars, whatever.

Sam Partee: Yes.

Roland Meertens: Even though it’s near-

Sam Partee: Yes, sure.

Roland Meertens: … at some point there’s a cutoff.

Sam Partee: That might be a semantic similarity or let’s say a score rather than a vector distance, one minus the distance. That might be a score of let’s say point six, right? But that’s not relevant enough to be a recommendation that’s worthwhile. And so, if there’s 700 of them that are worthwhile, you might want 700 of them, but if there’s only two, you might only want two. That’s what range queries are really good for, is because you might not know ahead of time how many results you want back, but you might want to say they can only be this far away and that’s a concept that’s been around in vector search libraries for quite some time. But it is now, you can get it back in milliseconds when you’re using Redis, which is pretty cool.
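The difference between a plain vector (KNN) search and a range query can be sketched as follows. This is an illustrative toy, assuming cosine distance, a made-up three-item catalog, and a cutoff of score 0.6 (distance 0.4) borrowed from the number mentioned above; the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def knn_query(query, vectors, k):
    """Always returns k results, however far away they are."""
    ranked = sorted(vectors, key=lambda key: cosine_distance(query, vectors[key]))
    return ranked[:k]


def range_query(query, vectors, max_distance):
    """Returns every result within max_distance, however many that is."""
    hits = [(cosine_distance(query, vec), key) for key, vec in vectors.items()]
    return [key for dist, key in sorted(hits) if dist <= max_distance]


# Hand-written vectors standing in for product embeddings.
catalog = {
    "cookbook_a": [1.0, 0.0],
    "cookbook_b": [0.9, 0.2],
    "car_manual": [0.1, 1.0],
}
query = [1.0, 0.0]

top_three = knn_query(query, catalog, k=3)
# score = 1 - distance, so a minimum score of 0.6 is a maximum distance of 0.4
relevant = range_query(query, catalog, max_distance=0.4)
```

Asking for three neighbors drags in the car manual, while the range query returns only the two cookbooks because nothing else clears the cutoff.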

Hybrid search: vector queries combined with range queries [13:02]

Roland Meertens: Yes, nice. Sounds pretty interesting. You also mentioned that you can combine this with other queries?

Sam Partee: So we often call this hybrid search. Really hybrid search is weighted search. So I’m going to start saying filtered search for the purposes of this podcast. If you have what I’m going to call a recall set, which is what you get back after you do a vector search, you can have a pre or post filter. This is particular to Redis, but there are tons of other vector databases that support this and you can do a pre or post filter. The pre-filter in a lot of cases is more important. Think about this example. Let’s say I’m using as a conversational memory buffer, and this could be in LangChain, it’s implemented there too, and I only want the conversation with this user. Well, then I would use a tag filter where the tag, it’s basically exact text search, you can think about that or categorical search, where some piece of that metadata in my Hash or JSON document in Redis is going to be a user’s username.

And then I can use that tag to filter all of the records that I have that are specific to that user and then I can do a vector search. So it allows you to almost have, it’s like a schema in a way of, think about it’s like a SQL database. It allows you to define kind of how you’re going to use it, but the benefits here are that if you don’t do something in the beginning, you can then add it later and still alter the schema of the index, adjust and grow your platform, which is a really cool thing. So the hybrid searches are really interesting. In Redis you can do it with text, full text search like BM25. You can do it with tags, geographic by… You can do polygon search now, which is really interesting. Literally just draw a polygon of coordinates and if they’re within that polygon of coordinates, then that is where you do your vector search.
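A pre-filter on a tag followed by a vector search might look roughly like this. It is a sketch under stated assumptions: the records are plain dictionaries rather than Redis Hashes or JSON documents, the `user` field plays the role of the tag, and the embeddings are hand-written.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Hypothetical conversational-memory records.
records = [
    {"user": "alice", "text": "I love hiking", "embedding": [1.0, 0.0]},
    {"user": "alice", "text": "My order is late", "embedding": [0.0, 1.0]},
    {"user": "bob", "text": "Hiking is great", "embedding": [1.0, 0.1]},
]


def filtered_search(records, query, user, k):
    # Pre-filter: exact match on the tag shrinks the candidate set first.
    candidates = [r for r in records if r["user"] == user]
    # Vector search then runs only over that user's records.
    candidates.sort(key=lambda r: cosine_distance(query, r["embedding"]))
    return [r["text"] for r in candidates[:k]]


results = filtered_search(records, query=[1.0, 0.0], user="alice", k=1)
```

Bob's message is semantically close to the query, but it never enters the candidate set because the tag filter runs first.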

Roland Meertens: Pretty good for any mapping application, I assume.

Sam Partee: Or, like say food delivery.

Roland Meertens: Yes.

Sam Partee: I actually think I gave that example in the talk. I gave that example because Ian was in the front. He’s obviously a DoorDash guy. They’re power users of open source and it’s always fun to see how people use it.

Roland Meertens: But so in terms of performance, your embeddings are represented in a certain way to make it fast to search through them?

Sam Partee: Yes.

Roland Meertens: But filters are a completely different game, right?

Sam Partee: Totally.

Roland Meertens: So are there any performance benefits to pre-filtering over post-filtering or the other way around?

Sam Partee: I always hate when I hear this answer, but it depends. If you have a pre-filter that filters it down to a really small set, then yes. But if you have a pre-filter, you can combine them with boolean operators. If you have a pre-filter that’s really complicated and does a lot of operations on each record to see whether it belongs to that set, then you can shoot yourself in the foot trying to achieve that performance benefit. And so it really depends on your query structure and your schema structure. And so, that’s not always obvious. I’ve seen in, we’ll just say an e-commerce company, that had about a hundred and something combined filters in their pre-filter. Actually no, it was post-filter for them because they wanted to do vector search over all the records and then do a post filter, but it was like 140 different filters, right?

Roland Meertens: That’s a very dedicated, they want something very specific.

Sam Partee: Well, it made sense for the platform, which I obviously can’t talk about, but we found a much better way to do it and I can talk about that. Which is that ahead of time, you can just combine a lot of those fields. And so you have extra fields in your schema. You’re storing more, your memory consumption goes up, but your runtime complexity, the latency of the system goes down because it’s almost like you’re pre-computing, which is like an age-old computer science technique. So increase space complexity, decrease runtime complexity. And that really helped.
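The "combine fields ahead of time" trade-off can be illustrated with a small sketch. Everything here is hypothetical: the product fields, the combined flag name, and the three conditions stand in for the much larger filter set described above.

```python
# Hypothetical product records with several independently filterable fields.
products = [
    {"id": "p1", "in_stock": True, "ships_to_us": True, "category": "kitchen"},
    {"id": "p2", "in_stock": False, "ships_to_us": True, "category": "kitchen"},
    {"id": "p3", "in_stock": True, "ships_to_us": True, "category": "garden"},
]


def add_combined_field(record):
    """Write-time step: fold several filter conditions into one stored flag.

    This costs extra storage per record (more space) so that query time
    only has to check a single field (less runtime work).
    """
    enriched = dict(record)
    enriched["us_kitchen_available"] = (
        record["in_stock"]
        and record["ships_to_us"]
        and record["category"] == "kitchen"
    )
    return enriched


indexed = [add_combined_field(p) for p in products]


def query_time_filter(records):
    # One comparison per record instead of re-evaluating every condition.
    return [r["id"] for r in records if r["us_kitchen_available"]]


matching_ids = query_time_filter(indexed)
```

The stored records get larger, but the per-query filter collapses to one field check, which is the space-for-time exchange described above.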

How to represent your documents [16:53]

Roland Meertens: Yes, perfect trade-off. Going back to the other thing you mentioned about documents, I think you mentioned two different ways that you can represent your documents in this embedding space.

Sam Partee: Yes.

Roland Meertens: Can you maybe elaborate on what the two different ways are and when you would choose one over the other?

Sam Partee: Yes, so what I was talking about here was a lot of people… It’s funny, I was talking to a great guy, I ate lunch with him and he was talking about RAG and how people just take LangChain or LlamaIndex or one of these frameworks and they use a recursive character text splitter or something and split their documents up, not caring about overlap, not caring about how many tokens they have and chunk it up. And use those randomly as the text, raw text basically, for the embeddings and then they run their RAG system and wonder why it’s bad. And it’s because you have filler text, you have text that isn’t relevant, you possibly have the wrong size and your embeddings possibly aren’t even relevant. So what I’m suggesting in this talk is a couple ways, and actually a quick shout out to Jerry Liu for the diagram there. He runs LlamaIndex, great guy.

What I’m suggesting is there’s two approaches I talk about. First, is you take that raw text and ask an LLM to summarize it. This approach allows you to have a whole document summary and then the chunks of that document associated with that summary. So first, you go and do a vector search over the summaries of the documents, which are often semantically more like rich in terms of context, which helps that vector search out. And then you can return all of the document chunks and even then sometimes on the client side do either a database, local vector search on the chunks that you return after that first vector search.
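The summary-first approach can be sketched as a two-stage lookup. This is a toy under stated assumptions: the document names, chunk texts, and two-dimensional vectors are invented, and in practice the summaries would come from an LLM and the vectors from an embedding model.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical corpus: each document has one summary vector plus its chunks.
documents = {
    "redis_guide": {
        "summary_embedding": [0.9, 0.1],
        "chunks": [
            {"text": "Installing Redis", "embedding": [0.6, 0.4]},
            {"text": "Vector indexes in Redis", "embedding": [1.0, 0.0]},
        ],
    },
    "cookbook": {
        "summary_embedding": [0.1, 0.9],
        "chunks": [{"text": "Baking bread", "embedding": [0.0, 1.0]}],
    },
}


def summary_first_search(query_embedding, documents, top_chunks=1):
    # Stage 1: vector search over the per-document summaries only.
    best_doc = max(
        documents,
        key=lambda doc_id: cosine_similarity(
            query_embedding, documents[doc_id]["summary_embedding"]
        ),
    )
    # Stage 2: local vector search over just that document's chunks.
    chunks = sorted(
        documents[best_doc]["chunks"],
        key=lambda c: cosine_similarity(query_embedding, c["embedding"]),
        reverse=True,
    )
    return best_doc, [c["text"] for c in chunks[:top_chunks]]


doc_id, chunk_texts = summary_first_search([1.0, 0.0], documents)
```

Only the summaries need to be searched globally; the chunk-level ranking happens over the handful of chunks attached to the winning document.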

And with Redis, you can also combine those two operations. Triggers and functions are awesome. People should check that out. 7.2 release is awesome. But then the second approach is also really interesting and it involves cases where you would like the surrounding context to be included, but your user query is often something that is found in maybe one or two sentences and includes things like, maybe names or specific numbers or phrases. To use this finance example we worked on, it’s like, “the name of this mutual bond in this paragraph” or whatever it was.

What we did there was instead we split it sentence by sentence and so that when the user entered a query, it found that particular sentence through vector search, semantic search. But the context, the text that was retrieved, was a larger window around that sentence and so it had more information when you retrieved that context. And so, the first thing that people should know about this approach is that it absolutely blows up the size of your database. It makes it-

Roland Meertens: Even if I’m embedding per sentence?

Sam Partee: Yes. And you spend way more on your vector database because think about it, you’re not only storing more text, you’re storing more vectors. And it works well for those use cases, but you have to make sure that that’s worth it and that’s why I’m advocating for people, and this is why I made it my first slide in that section is, just go try a bunch. I talk about using traditional machine learning techniques. So weird that we call it traditional now, but do like a K-fold. Try five different things and then have an eval set. Try it against an eval set. Just like we would’ve with XGBoost when it was five years ago. It feels like everything has changed. But yes, that’s what I was talking about.
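The sentence-by-sentence approach with a wider retrieved window can be sketched like this. It is an illustration only: the bag-of-words `embed` function is a crude stand-in for a real embedding model, and the fund text is invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumption standing in for a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical document, already split sentence by sentence.
sentences = [
    "The fund was launched in 2015.",
    "Its manager is Jane Doe.",
    "The Acme Growth Fund charges a 0.5 percent fee.",
    "Fees are reviewed every year.",
    "Past performance does not guarantee future results.",
]
# One vector per sentence: this is what inflates the size of the database.
sentence_vectors = [embed(s) for s in sentences]


def retrieve_with_window(query, window=1):
    query_vec = embed(query)
    # Match on the single most similar sentence...
    best = max(
        range(len(sentences)),
        key=lambda i: cosine_similarity(query_vec, sentence_vectors[i]),
    )
    # ...but hand back a larger window of text around it as context.
    start = max(0, best - window)
    end = min(len(sentences), best + window + 1)
    return " ".join(sentences[start:end])


context = retrieve_with_window("What fee does the Acme Growth Fund charge?")
```

The match is made on one precise sentence, but the text passed on as context includes its neighbors, at the cost of storing a vector for every sentence.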

Roland Meertens: So if you are doing this sentence by sentence to do embeddings and you have the larger context around it, is there still enough uniqueness for every sentence or do these large language models then just kind of make the same vector of everything?

Sam Partee: If you have a situation where the query, or whatever’s being used as the query vector, is a lot of text, is a lot of semantic information, this is not the approach to use. But if it’s something like a one or two liner question, or one or two sentence question, it does work well. What you’re, I think, getting at too, is that imagine the sentences that people write, especially in some PDFs, that just don’t matter. They don’t need to be there and you’re paying for not only that embedding but the storage space. And so, this approach has drawbacks, but who’s going to go through all, I forget how many PDFs there were in that use case, but like 40,000 PDFs which ended up creating, it was like 180 million embeddings or something.

Roland Meertens: Yes, I can imagine if you use this approach on the entire arXiv database of scientific papers, then-

Sam Partee: Docsearch.redisventures.com, you can look at a semantic search app that does only abstracts, which is essentially the first approach, right? But it just doesn’t have the second layer, right? It doesn’t have that, mostly because we haven’t hosted that. It would be more expensive to host. But it does it on the summaries which the thing about the paper summary… It’s actually a great example, thank you for bringing that up, is that the paper summary, think about how much more information is packed into that than random sections of a paper. And so that’s why sometimes using an LLM to essentially create what seems like a paper abstract is actually a really good way of handling this and cheaper usually.

Hypothetical Document Embeddings (HyDE) [22:19]

Roland Meertens: I think the other thing you mentioned during your talk, which I thought was a really interesting trick is if you are having a question and answer retrieval system, that you let the large language model create a possible answer and then search for that answer in your database. Yes. What do you call this? How does this work again? Maybe you can explain this better than I just did.

Sam Partee: Oh no, actually it’s great. I wish I remembered the author’s name of that paper right now because he or she or whoever it is deserves an award and essentially the HyDE approach, it’s called Hypothetical Document Embedding, so HyDE, HyDE, like Jekyll and Hyde. People use the term hallucinations with LLMs when they make stuff up. So I’m going to use that term here even though I don’t really like it. I mentioned that in the talk. It’s just wrong information, but I’ll get off that high horse.

You use a hallucinated answer to a question to look up the right answer, or at least I should say the right context. And so why does this work? Well, you have a question and that question, let’s say it’s something like in the talk, what did I say? I said, what is Redis? Think about how different that question is than the actual answer, which is like, “an in-memory database, yada, yada, yada.” But a fake answer, even if it’s something like it’s a tool for doing yada, yada, yada, it’s still semantically more similar in both sentence structure and most often its actual semantics, so that it returns a greater amount of relevant information because of the way that the semantic representation of an answer is different from the semantic representation of a query.
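The HyDE idea can be sketched in a few lines. This is a toy under loud assumptions: `generate_hypothetical_answer` is a hypothetical stand-in for an LLM call and returns a canned sentence, the bag-of-words `embed` function replaces a real embedding model, and the stored context sentence is invented. With real models the effect is about semantics, not just shared words.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumption standing in for a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def generate_hypothetical_answer(question):
    """Hypothetical stand-in for an LLM call; the answer may be wrong in detail."""
    return "Redis is a database tool for caching data."


stored_context = (
    "Redis is an in-memory database used for caching and real-time workloads."
)
question = "What is Redis?"

# Plain retrieval: compare the question itself to the stored context.
raw_score = cosine_similarity(embed(question), embed(stored_context))

# HyDE: compare a made-up answer to the stored context instead.
fake_answer = generate_hypothetical_answer(question)
hyde_score = cosine_similarity(embed(fake_answer), embed(stored_context))
```

Even in this crude setup the made-up answer sits closer to the stored context than the question does, because an answer is shaped like the text being retrieved.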

Roland Meertens: Kind of like, you dress for a job you want instead of for a job you have.

Sam Partee: That’s pretty funny.

Roland Meertens: You search for the job you want. You search for the data you want, not for the data you have.

Sam Partee: Couldn’t agree more, and that’s also what’s interesting about it. I gave that hotel example. That was me messing around. I just created that app for fun, but I realized how good of an example of a HyDE example it is because it’s showing you that searching for a review with a fake generated review is so much more likely to return reviews that you want to see than saying, this is what I want in a hotel. Because that structurally and semantically is far different from a review than… Some English professor’s probably crying right now with the way I’m describing the English language, I guess not just English, but you get the point. It’s so much more similar to the actual reviews that you want, whereas the query often doesn’t really represent the context you want.

Roland Meertens: I really liked it as an example also with hotels because on any hotel website, you can’t search for reviews, but you-

Sam Partee: Oh, of course not.

Roland Meertens: Yes, but it kind of makes sense to start searching for the holiday you want or others have instead of searching for the normal things you normally search for like locations, et cetera, et cetera.

Sam Partee: It was funny. I think I said that I started doing it because I actually did get mad that day at this travel website because I just couldn’t find the things I was looking for and I was like, “Why can’t I do this?” And I realize I’m a little bit further ahead in the field, I guess, than some enterprise companies are in thinking about these things because I work on it all the time I guess. But I just imagine the next few years it’s going to completely change user experience of so many things.

I’ve seen so many demos lately and obviously just hanging around SF, you talk to so many people that are creating their own company or something, and I’ve seen so many demos where they’re using me for essentially validation of ideas or something, where my mind’s just blown at how good it is, and I really do think it’s going to completely change user experience going forward.

Applications where vector search would be beneficial [26:10]

Roland Meertens: Do you have more applications where you think it should be used for this? This should exist?

Sam Partee: Interesting. Review data is certainly good. So look, right now we’re really good at text representations, at semantics, and the reason for that is we have a lot of that data. The next frontier is definitely multimodal. OpenAI I think has already started on this in some of their models, but one thing I was thinking about and honestly it was in creating this talk, was why can’t I talk to a slide and change the way it looks? And I can basically do that with Stable Diffusion. It’s on my newsletter head. The top of my newsletter is this cool scene where I said the prompt is something like the evolution of tech through time because that’s what I’m curious about.

Roland Meertens: But you still can’t interact with… Also with Stable Diffusion, you can give a prompt, but you can’t say, “Oh, I want this, but make that a bit brighter or replace it.”

Sam Partee: You can refine it and you can optimize it and make it look a little better, but you’re right. It’s not an interaction. The difference with RAG and a lot of these systems like the chat experience, I’ve seen a chatbot pretty recently made by an enterprise company using Redis that is absolutely fantastic and the reason is because it’s interactive. It’s an experience that is different. And I’d imagine that in a few years you’re literally never going to call an agent on a cell phone again.

You’re actually never going to pick up the phone and call a customer service line because there will be a time and place, and maybe it’s 10 years, two years, I don’t know, I’m not Nostradamus. But it will be to the point where it’s so good, it knows you personally. It knows your information, and it’s not because it’s been trained on it. It’s because it’s injected at runtime and it knows the last thing you ordered. It knows what the previous complaints you’ve had are.

It can solve them for you by looking up company documentation and it can address them internally by saying, “Hey, product team, we should think about doing this.” That is where we’re headed to the point where they’re so helpful and it’s not because they actually know all this stuff. It’s because that combined with really careful prompt engineering and injection of accurate, relevant data makes systems that are seemingly incredibly intelligent. And I say seemingly because I’m not yet completely convinced that it’s anything more than a tool. So anybody that personified, that’s why I don’t like the word hallucinations, but it is just a tool. But this tool happens to be really, really good.

Roland Meertens: The future is bright if it can finally solve the issues that you have whenever you have to phone the company.

Sam Partee: God, I hope I never have to call another agent again.

Deploying your solution [28:57]

Roland Meertens: In any case, for the last question, the thing you discussed with another participant here at the QCon conference was, if you want to run these large language models, is there any way to do it or do you have any recommendations for doing this on prem, rather than having to send everything to an external partner?

Sam Partee: That’s a good question. There’s a cool company, I think it’s out of Italy, called Prem, literally, that has a lot of these. So shout out to them, they’re great. But in general, the best way that I’ve seen companies do it is Nvidia Triton, which is a really great tool. The pipelining and being able to feed a Python model’s result to a C++ quantized PyTorch model and whatnot. If you’re really going to go down the route of doing it custom and whatnot, going and talking to Nvidia is never a bad idea. They’re probably going to love that.

But one of the biggest things I’ve seen is that people that are doing it custom, that are actually making their own models, aren’t talking about it a whole lot. And I think that’s because it’s a big source of IP in a lot of these platforms, and that’s why people so commonly have questions about on-prem, and I do think it’s a huge open market, but personally, if you’re training models, you can use things like Determined. Shout out Evan Sparks and HPE, but there’s a lot of ways to train models. There’s really not a lot right now of ways to use those models in the same way that you would use OpenAI’s API. There’s not a lot of ways to say, even Triton has an HTTP API, but the way that you form the thing that you send to Triton versus what you do for OpenAI, the barrier to entry of those two things.

Roland Meertens: Yes, GRPC uses DP for this-

Sam Partee: Oh, they’re just so far apart. So the barrier to adoption for the API level tools is so low, and the barrier to adoption for on-prem is unbelievably high. And let alone, you can probably not even get a data center GPU right now. I actually saw a company recently that’s actually doing this on some AMD chips. I love AMD, but CUDA runs the world in AI right now. And if you want to run a model on prem, you got to have a CUDA enabled GPU, and they’re tough to get. So it’s a hard game right now on premise, I got to say.

Roland Meertens: They’re all sold out everywhere. Also on the Google Cloud platform, they’re sold out.

Sam Partee: Really?

Roland Meertens: Even on Hugging Face, it’s sometimes hard to get one.

Sam Partee: Lambda is another good place. I really liked their Cloud UI. Robert Brooks and co. over there at Lambda are awesome. So that’s another good one.

Roland Meertens: All right. Thanks for your tips, Sam.

Sam Partee: That was fun.

Roland Meertens: And thank you very much for joining the InfoQ Podcast.

Sam Partee: Of course.

Roland Meertens: Thank you very much for listening to this podcast. I hope you enjoyed the conversation. As I mentioned, we will upload the talk by Sam Partee on InfoQ.com sometime in the future. So keep an eye on that. Thank you again for listening, and thanks again to Sam for being a guest.

About the Author

Sam Partee

From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
