Article originally posted on InfoQ. Visit InfoQ
Low: Putting all of the machine learning algorithms, models, and everything else into the database didn’t always sound like a good idea to me. A lot of this is based on things that I learned working at Instacart over the last decade, trying to scale our machine learning infrastructure, our data infrastructure, and our real-time inference systems. When I got to Instacart, one of the first things I did actually was helped pull all of our product catalog data out of our monolithic Postgres database that was already hitting scalability constraints 8, 9 years ago. I moved all of that into Elasticsearch, so that we would have this beautiful, horizontally scalable, amazing natural language processing framework. That actually carried the company for about the next 5 years. It became the real heart of our data infrastructure. It allowed us to grow the business to a multibillion-dollar revenue generating enterprise. Our Elasticsearch cluster grew to several hundred nodes. It was powering several thousand stores worth of data with hundreds of thousands of personal shoppers and millions of customers using it on a regular basis. We were doing everything with that cluster. We were taking our machine learning embeddings and putting it in there. We were putting our feature store data in there. Some of the JSON blobs for our Elastic documents reached the size of megabytes.
Over time, this became slightly less tenable. Once you have the god object in your data architecture that everybody wants to put something in and get something out of, there’s all kinds of organizational problems. There’s also all kinds of scalability problems, regardless of the technology. You really need tight control over everything going on. The crucible moment, I think, for me came during the COVID pandemic, when most of the world moved from an offline grocery experience to an online grocery experience. We were already a multibillion-dollar company. Then we started doubling on a weekly basis. At that point, everything you know about scalability goes out the window, everything you know about engineering best practices goes out the window, because the company is going to die if you can’t actually scale double in a week. Then you have to do it again the next week. You find all of these microservices that had their individual feature stores based on Redis, or Cassandra, or Druid, or we’re talking directly to Snowflake, all of the concurrency issues that you thought you would slowly deal with, all come to a head.
What we did was, we had a Postgres cluster, we had learned a lot about scaling Postgres in the intervening years, between the time that I get there, you can get pretty far with read replicas. Especially in a machine learning context, where a few seconds of data latency is actually pretty state of the art when you start thinking about Kafka and Flink streaming, then you’re used to tolerating and not having ACID compliant transactions. We can actually scale Postgres in a very similar manner horizontally. We get really good control of the sharding capabilities of our database, so that we can very specifically target exactly what criteria we need. We abstracted all of that logic that we built internally in Instacart into a different project called PgCat that sits in front of a massive cluster of Postgres databases to help scale it horizontally. Just know that that’s always the other half of the project that I rarely talk about, because scaling Postgres is so important when you’re talking about machine learning.
Talking about machine learning, when we think about machine learning, a lot of people in the world think that it’s this mystical, dark art. I think that’s really unfortunate. I think that we should try to take an engineering-first approach, because I think engineering is actually where many of the hard problems in deploying machine learning systems are. I think engineers are very used to dealing with systems as black boxes: you have inputs, you expect outputs. It’s just a function in the middle. A machine learning model is just a function. It just takes inputs. It just produces outputs. You don’t need to know how it works to use it. We use hundreds of APIs, hundreds of functions, hundreds of SDKs without really knowing how they work internally, and we do this with very effectively. We have unit tests. We have integration tests. We have end-to-end QA. We can deal with machine learning systems very similarly and get very far. This is an example. It’s very contrived. It’s just a function that takes a birthdate and it returns a float. This is implemented in Python. I hope that I don’t have any bugs in my code. It gives you a rough idea of what we’re talking about.
We can actually pretend that we’re a machine, we don’t understand birthdates. We don’t understand age as a concept. If we’re trying to teach a machine about how to calculate a birthdate, we’ll take a data-driven approach. We’ll collect some samples from the audience. Somebody is born in 1960. Somebody is born in 1980. Somebody is born in the year 2000. We can ask them, what is your age? We don’t know how they’re calculating their age. It’s some internal mystical black box that they know how old they are, and they know when they were born. They’ll actually do some computation very similar to that Python function. We don’t need to know that. We can just put their data in a lookup table, and now our function implementation doesn’t need to know anything about dates or ages. You parse in a birthdate. It will look up that birthdate in our data table and will return the answer. This is actually pretty terrible, because most people wouldn’t have an exact birthdate, and so we’ll get a not found error, rather than actually returning a useful age. This is where machine learning comes into play. What we can do with this very meager sample of data is we want to generalize. We want to be able to tell people their age in years, given any birthdate, not just some birthdate that somebody’s already told us the answer for.
One of the simplest machine learning algorithms is linear regression. All linear regression is, is you take your three data points that you have, you put them on a plot graph. You draw a line through them. If you remember, the slope intercept form of a line from early algebra in high school, is y equals mx plus b. In this case, we know that every year you get 1 year older, and it happens to be the year 2023 today, so m is negative one, b is 2023. We’ve now solved the equation for the linear regression that passes through these points. You don’t need to know how linear regression actually works, is actually implemented. That’s just another function call. It’ll give you this data. There’s libraries that do this, so it’s all implemented. We can now rewrite our Python function. We have a couple constants, we have m and b. You can now parse in your datetime. The year of the datetime is the only feature we care about in our very simple model. We can now multiply that by m and add b. Now we’ve actually generalized our three data points into predicting the correct age from any birthdate without really knowing anything about ages or birthdates. If you start to think about all the hidden functions in our applications, from a user perspective, we don’t know why users behave the way that they do or what they want. We can start to gather data about their behavior. They can start explaining to us these hidden functions, and we can model them. Then we can actually generalize those models across populations. Machine learning is a very powerful technique when you’re dealing in a murky environment.
Neural Networks, Deep Learning, and LLMs
To move this forward beyond the simplest linear regression, neural networks, deep learning, and LLMs are a much more advanced topic. This is a diagram of a very simple neural network. It has three layers. The three layers are the inputs, the hidden layer, and the output. This is just a function. It takes three inputs, it produces one output. What happens in the middle is a black box, nobody needs to know. I’ll walk through an example. Just for reference, all machine learning models only operate on math, they only take numbers. What we’ve seen lately is that LLMs, they take text. How does that work? You start with your words, and you assign all of your words an ID. A is the first word in our dictionary, it gets the ID number 1. Upon is the 74th word in our dictionary, so it gets the ID number 74, and so on and so forth, until you have all of the words in your dictionary assigned numbers. Then you multiply those numbers, you add those numbers.
Every single line in this graph represents a function very similar to linear regression. There’s hundreds of different functions and hundreds of different ways that you can implement those lines. You don’t need to know that right now. You don’t need to be a machine learning expert to know that this is just math. It’s a lot of math. There are a lot of lines, and that’s why GPUs can actually execute all of those lines in parallel, they’re all independent. What you’ll get is just some more magic numbers in an array. In the middle in the hidden layer, you just repeat that process with more lines, more functions, more math, and you’ll get an output. The output is just some numbers, a magic number like 42. To actually understand what 42 means, we look it up in the dictionary, and we get the word timeout. Now we have this model that given the three inputs, once upon a time, is predicting the next word in the sentence is time. This is how LLMs work. This is the magic that they do.
For this talk, we’re going to focus on embeddings and vectors. What are embeddings and vectors? An embedding is just that hidden layer. This is some mystical intermediate representation that the model has. State of the art research doesn’t really understand why these numbers are the way that they are. They are the way that they are because that’s what gets the right answer, is basically what it boils down to. There are lots of clever ways to figure out how to generate those numbers to make sure that you are getting the right answer. Again, we don’t need to know any of that: it’s just a black box, it’s just a function.
The very cool thing is that we’ll have lots of various ways to start a story, like, once upon a, that may all be similar to us. They will also be similar in this hidden layer, in this embedding, even though they may use completely different words. It may be a completely different phrase. It may be a completely different language. As long as the model has been trained well, then the embedding, the intermediate representation of that language will be very similar. When I say similar, I just mean like, by Euclidean geometry, like it’ll be some number close to negative 3, it will be some number close to 23 in the second box. We have lots of ways to measure similarity of large arrays, you can do the Manhattan distance, since we’re in New York, or you might choose the dot product, or you might choose cosine similarity. These are all just for loops to implement these things. Again, modern processors can do a lot of those computations very quickly, and tell you how similar all of these things are.
The reason you need an embedding database, or you might want an embedding database, just think about the Elasticsearch case that I mentioned before. You have hundreds of thousands, or millions, or even billions of documents, and they’re all text. In traditional search, you’re going to do keyword matching against an inverted index using English language. If somebody uses the wrong word, if they say, vanilla ice cream versus old fashioned vanilla, you may not actually get the right keyword match, and you may return the wrong product. Synonym matching is something that embeddings are very good at. You can actually take all of your natural language documents, you will run them all through a neural network like this. You generate the embedding for that document. Then you save just that embedding, just that array in your database. Pretty much every database I know of has an array datatype for storage. It’s a pretty primitive data type. Pretty much every programming language supports arrays, even garbage collected runtime languages like Python, can implement operations on arrays very quickly, especially with optimized libraries like NumPy, and pandas. This is not new stuff. Even though vector databases seem very new today, people have been doing these things for decades. A lot of the functionality you need has actually been baked into hardware by Intel, and by NVIDIA, because this is such a generally useful thing to be able to say, add up the numbers into a race. Everybody needs to do it for all kinds of things.
If you have a database, and it’s full of these arrays, with millions of them, and then you have a user query comment, you can also generate the embedding of that user query. Now you can actually calculate the distance between that user embedding array and each and every one of the arrays in your database, and you’ll get some distance function for all of those. You can sort that list, find the one with the smallest distance, that will give you the array. You translate that array is mapped back to a text document, and now you have the English language document that is most similar to the English language query coming in. Again, this is just a function. You take a single input, which is English language, you produce a single output, which is English language, even though it’s all math under the hood.
I’ve just given you a pretty high-level description of how machine learning works. You can basically forget all of that now, because if you want to generate an embedding, you don’t really need to know how any of it actually happens. We’ve created a very simple function in PostgresML. If you install PostgresML on your Postgres database, you can select the pgml.embed function, and you parse it two arguments. The first argument is the model_name. This can be any model published on Hugging Face. There are hundreds of open source models that will generate embeddings. This can be a model that you’ve trained yourself if you are a natural language processing expert. Then it, of course, takes the text that you want to create an embedding for. You can use this function both to index all of your documents and create embeddings. This is the output of that function for a single call. It’s always a vector. Again, a vector is just a massive blob of numbers. Keep in mind, though, that when you’re talking about these embedding vectors, typically they’re on the scale of 1000 floats each. If you’re talking about a million documents that you’re creating embeddings for, that’s a billion floats that you need to store. That’s a gigabyte. These things can quickly grow large. You do need to be thoughtful about how you’re storing them, how you’re indexing them.
Postgres makes this, again, very simple. If you have a documents table, and that table would normally have a text column that represents the body of the document, we can add a separate column or a second column to that table, and this table will hold the embedding. In this case, the embedding is a vector. It has 768 elements in it. These need to be sized. It’s really nice to have a typed schema in Postgres, where you can check the correctness of everything. This particular vector column, Postgres has this really nice feature where you can say that a column is generated. In this case, the generation of this embedding column is our pgml.embed function. Anybody who inserts some text into this document table is automatically going to also create an embedding alongside that document. They don’t even need to know that they’re creating an embedding. It eliminates a lot of issues with data staleness. If they update the document, the embedding is also regenerated and updated for them.
This is an example of a slightly newer model. It’s the Hong Kong University’s natural language processing instructor-xl model. At the time that I made these slides, it was the leading model, it’s no longer the leading model for embeddings. These things change on a weekly basis. Being able to just swap out your model_name, regenerate all your embeddings is actually really nice. This model is interesting because it takes a prompt for how to actually generate the embedding, similar to all the prompt engineering that people are doing with other large language models. It’s worth calling out that everything is moving very quickly now. You need a lot of flexibility, and you need to constantly be updating dependencies. Python dependencies are not fun to maintain. There’s a very large operational burden there. If you’re doing machine learning on a bunch of laptops for your data scientists, it’s very hard to make sure that every data scientist in a large organization has the latest updates and that their laptops are working. On the other hand, if there’s one large central database cluster, that thing can be managed very effectively by an operational team who has complete control. Then the data scientists can still get access to the latest models. They can still get access to do all of the things that they need to do.
PostgresML isn’t just about large language models. It has all of the classical machine learning algorithms that you expect. We have native bindings for XGBoost that are super-duper fast, we’re very proud of. The implementation is highly optimized so that when you have, what is an array, a C array in memory in Postgres in a buffer from a table, we take a pointer to that, we don’t even copy it. We parse that pointer to XGBoost. XGBoost says, “Fine. I know what an array is.” Again, these things are very primitive. This is the goal of Apache Arrow. This is the goal of protobuf from Google. They have Spanner. Spanner is very similar concept that when you have a protobuf, will store the data in the table in the row in the exact protobuf format. There is no serialization. They can just copy it straight out over the wire. Serialization and going over the wire kill so many machine learning applications. They really limit the amount of data that you can bring to bear in an interactive online context.
The Storage Keyword, and 768
The Hong Kong University natural language processing instructor-xl model, it has a hidden layer in the middle of it that is 768 nodes wide. It will always have 768 floats as its intermediate hidden layer representation. Regardless of the string size that you parse in, it’s going to have that hidden state once it’s read the entire string. That is the size of the vector that we will always be storing in this table, regardless if its one-word input, or 1000-word input. All of these models have what’s called a context window, which means they can only consider 512, or 2000 tokens at a time. If you parse fewer than that, it will get padded out with zeros or the empty word token. If you parse more than that, it will probably just get truncated. There’s a whole bunch of techniques that you might want to do called chunking, where if your documents are larger than the context window of your model, you will want to split those documents up into chunks, create an embedding for each chunk. There’s lots of cool tricks you can do with embeddings.
You can do all kinds of math with arrays. Let’s say you have a document and you break it up into 10 chunks, and you generate 10 different embeddings for those 10 chunks. You can actually add up all the vectors, and the vector that you get by adding up those 10 vectors will actually be the average vector of the entire document. Even though your model may be limited to a context window of 512, the embedding practically that it can consider is unlimited. If you think about what happens when you start adding up a bunch of embeddings together, some of them will have positive numbers, some of them will have negative numbers. A lot of them will just cancel out and trend towards zero. If you add up the embeddings for all of Wikipedia, you might just get zeros across the board, meaning that this is not special in any way. It’s equally relevant to everything or equally close to everything. It will actually be closer to things that are also very generic, or general. That covers I think the 768.
GENERATED ALWAYS AS is just the Postgres syntax that tells Postgres, anytime someone inserts or updates a row in this table, it needs to run this function. This body here is actually a variable that references this body column up here. You only have to parse the body text in once. Postgres has an in memory once. It will reuse it, and it will re-parse it to the model here. STORED is another option. You can have this be stored in the table physically where it will take up space. This is good if you read things more than you write them. You don’t have to store it. It can be generated on the fly at read time. Perhaps you’re storing a lot of documents that you never need the embedding for. Generating embeddings is expensive. In that case, you only want to run the generation function if somebody is reading the embedding column. That might be a savings there.
This is an example of the cosine_distance operator that’s also provided by pgvector. There are three of these operators parsed. This will do the cosine similarity function. There’s one for the Manhattan distance. There’s one for the dot product. These three functions have different tradeoffs. Manhattan doesn’t involve any square roots. You just add up the east-west plus the north-south, and that’s your final distance. Whereas dot product and cosine similarity have a little bit more math involved, they’re a little bit more expensive to compute. The distance is truer. If we actually take the hypotenuse of the triangle, that’s better than if we look at just the East-West, North-South Manhattan distance, although we’ve just had to calculate a square root. My preference is to always start with cosine_distance, which is the most flexible. If your vectors are normalized, then the dot product will give you the same accuracy for free. Not all vector spaces are fully normalized. This is the safest, highest quality answer. If you need more performance later, you can test empirically, and you really have to test empirically. There’s all kinds of metrics you can get for these models, that will give you a quantitative answer of like, the perplexity is blah-blah-blah, and like, what does that mean? I don’t know, it means it’s better than the other one, maybe? What you really have to do is you have to actually run some queries against your database. You have to look at the things that you’re getting back.
If you have 1000 vectors, a modern CPU core or GPU core with several gigahertz of speed, can probably compute 100,000 cosine similarities or dot products per second. If you have 100,000 documents in your database, and you have a single core CPU, and one second is fast enough for your application, then you’re done. You don’t need anything special. If you have fewer documents, you’re done, you don’t need anything special. A lot of document collections that I see are on the order of thousands, they’re not on the order of millions. You can just brute force the computation on a single core. If you have a GPU that has 5000 cores, then you can do 5000 times 100,000 in a second, which will give you 5 billion documents. Five billion is an interesting number for scalability reasons.
In this case, though, let’s say you have a million documents, and you don’t want to wait 10 seconds for a user query to come in, you need a way to actually find your closest vectors in your dataset much more quickly. You want to avoid doing a lot of the direct comparisons. We use indexes to create shortcuts all of the time. Most people are probably familiar with keyword indexes, when they’re doing text recall. We just build the list of all of the documents that contain that keyword, and we sort that list ahead of time, by how many times that keyword appears in that document. Then when somebody searches for that keyword, you just go get your list, and you take the head of it. That gives you your documents that match. It’s very quick. It’s much more quick than actually having to scan every single document and see if the keyword actually exists in it. That would be terribly slow.
Similarly, with vectors, there are indexing operations that we do. pgvector supports the IVFFlat indexing type. This index type, what it does, is, let’s say you’ve got a million vectors that you want to build an index over. In this case, we’ve said we want 2000 lists in this index. It’s going to create 2000 lists of vectors. It’s going to automatically cluster all of your million vectors into 2000 different clusters that are most compact, so that every vector in each list is most similar to all of the other vectors in that list and less similar to vectors in other lists. Then it’s going to compute the centroid of all of those lists. This is just vector math. Each list can then be looked at as the centroid, and everything will be closest to that centroid in each list. Now when I do a query, and I want to look up and find the nearest vectors to my query vector, I only have to brute force the 2000 centroids. I can find the list from those 2000 that has the most likely candidates that are closest to my input. Then, I can brute force all of the ones in that list. I don’t want to do the math of 1 million divided by 2000, I think it’s 500. That would be another 500 vector comparisons. In total, we would do 2000 to find the list. We would do 500 across that list. That’s 2500 comparisons or vector distance calculations, rather than the full million, so 2500 is much faster than a million.
The tradeoff here is around the edge cases. If your vector is on the boundary between two lists and only slightly falls closer to one list than the other, then you might actually miss some other vectors from that other list that are also near the boundary. That’s why this is an approximate nearest neighbor search. What we can do to handle those cases if we find that our recall is bad, that we’re frequently missing vectors near the edge in our results, is we can turn up the number of probes. The number of probes is basically how many lists do we want to actually look at or consider. We can go from one probe of just considering the 500 vectors in the very nearest list, to 10 probes. Then we’ll actually consider 5000 additional vectors across 10 different lists. That will make sure that if anything is on any of the edges, we’ll get that. Again, this is 10 times more expensive given the fixed cost of 2000 up front. You can do the math. The runtimes are actually very predictable. If you’re doing this on a CPU core, it’s always the same number of floating-point operations, and so you can measure. You can say, how much latency budget do I have? How important is it to actually get every single record back or not?
You actually need to build these with the matching operator that you will use. In this case, we are doing a cosine operator against our index, so we’ll generate the index to match against the cosine operator. If you want to use the Euclidean distance or the dot product, you would substitute that here as well. You can build as many of these indexes as you want. You could have three indexes, one for each operator, if you wanted to cover it that way. There’s a lot you can do with these indexes. You can build partial indexes in Postgres that only cover certain portions of the table. Let’s say you have some other criteria in the table where like, maybe some document is part of a collection, and you only want to search for documents in collection 32, you can build a partial index that only includes documents from collection 32. Actually, what we found at Instacart was we didn’t need vector indexes, because all of our queries were scoped to very tight constraints. Where given the user id, given the store that they were shopping at, given all of these extra criteria, we could winnow down the number of possible vectors to some subset far smaller than 10,000. Then we could just brute force the answer, and it would be an exact answer. It’d be very fast.
Calculating Distance from Pairs
What are the pairs that we’re calculating distance from?
It’s whatever you want it to be. In the clustering case, the pair is going to be two documents. In the query case, you’ll be comparing the incoming vector, the query vector to a document vector, into actually all of the document vectors. Clustering is a very expensive operation, because you actually have to compare every vector to every other vector to find out what vectors are closest to each other. This is like an n^2 operation on your vector size. Creating these indexes is a time-consuming process, and so you might want to consider. If you’ve already created your index, one of the very cool things about Postgres and pgvector is that anytime you insert a new record in the table, it goes into the index. The index is persisted to disk. If your database catches on fire or whatever, you can be assured that that data is safe and sound, as long as you’ve gotten the response back from Postgres, that, in fact, your transaction has completed.
Search Query Matching to Repository Items
You have to create the index before you can use the index. If you don’t have an index on the table, and you run that query like this, Postgres will accept this query with or without an index. Postgres has a very advanced query planner. It will look at all of the indexes on the table, given a particular query. It will use statistics about the data, the number of rows, the different predicates in your query. In this case, the query has no predicate. If the table has no index, this will be a full table scan. It will actually do this comparison against every record. If you have an index, on the other hand, it will use the index to take the shortcut that I’ve described. That’s the magic that pgvector gives you.
Clustering (n^2 pairs in a join table)
This is a very generic example. I probably could have picked something more explicit. You could say that the left-hand side is query vector, and the right-hand side is documents.embedding. In that case, that’s what you would write. Then, you would be selecting from documents. That’s how you would query that table. Again, you would have to generate that user query at runtime, because what they’re going to give you is a piece of text. Then you would call pgml.embed on that piece of text.
Vector Search (Summary)
There’s more work being done on pgvector right now to improve. There’s a lot of research going on around vector indexes. There’s a lot of tradeoffs. There’s at least a dozen different vector indexing algorithms. They are all some tradeoff between build cost, runtime cost, accuracy. IVFFlat is a pretty good first stab that makes pretty balanced tradeoffs amongst all of those things.
Common Table Expressions
This is a better example of what you might do. Let’s say we have a set of the Amazon movie reviews from all customer reviews for all of the Amazon DVDs for sale. There are millions of these customer reviews. We can create embeddings out of all those customer reviews. We can store them in a table. That’ll be our documents table for this example, where we have millions of user reviews. Now when somebody comes in with a query, and they want to find a movie similar to that query, we can use PGML to generate the embedding for that. In this case, we’re looking for the best 1980s sci-fi movie. Some people may describe, in their review, they may use the keyword ‘best 1980s sci-fi movie,’ but they might also say, ‘best 1970s sci-fi movie,’ or worse, ‘1980s sci-fi movie’. It’s these very subtle modifiers and nuances that embeddings are really magical about capturing the entire sentiment, and not just over-indexing on particular keywords. The quality you can get with a recall like this is pretty cool. When you have a WITH request AS, this creates a virtual table. In Postgres, this is called a common table expression. You can begin any query with this. The content is any other query. Whatever comes out of that query, in this case, it’ll be an embedding. For the rest of this query, and we can chain multiple ones of these, that’s what the SELECT … is going to do in the next slide. We’ll have a virtual table in-memory, it will have one row in it. That row will be a single vector named embedding, and it will be generated on the fly from the best 1980s sci-fi movie using our model.
Then, we can actually do the cosine operator because we’ve actually built an index on this table in the previous slide. This will be very fast. This will be very efficient. We get the full power of SQL. This doesn’t have a limit. You should put a limit, otherwise you’re going to be pulling back 5 million documents, and Postgres will know that. It’ll know that there’s 5 million documents and you didn’t put a limit, so it’ll just ignore your index because it’s not going to do any good anyway, because you’re going to need to calculate the 5 million dot products anyway. Because that’s what you’re actually selecting, you’re selecting the cosine_distance. The Postgres query planner is smart. It will try to help you out. You also need to be a little bit diligent when you’re thinking about, what do I really want? What do I really need?
PostgresML isn’t only limited to embeddings. There’s lots of things that you can do with these large language models. Hugging Face has this concept of tasks that we’ve also adopted. Some of these tasks can be text generation. It can be text classification. It can be translation from one language to another. There are several dozens of these. This is an example of using pgml.transform, which is the second function. It will also download a model from Hugging Face. In this case, we haven’t even specified the model. Hugging Face has some default model. This is a bit of a footgun, because they may change that default model, and your application may be using a new model that you haven’t tested. Be wary. You can also specify the model to this function. They do take keyword arguments. Anything you can specify to any of these API models, there’s a way to parse the keyword argument down all the way through. The output of this example here will be whatever the output of the Hugging Face model would be. In this case, Hugging Face returns some JSON. It’s going to have a sentiment colon floating-point number. Postgres has all of the JSON operators you need, so then you can dig into that JSON object coming back.
This is where I think the true power of SQL as a platform becomes really interesting, because if you think about all of the thousands of chatbots being built with LangChain today. It’s sort of like the iPhone moment when everybody was building a Flashlight app back in 2008. The very standard model is, I have my user prompt that I want to chat with. You take that, you parse it to OpenAI. That’s a remote data center call, it takes several hundred milliseconds. They have to actually then run their model on it. They give you back, no longer one of the best quality embeddings, but they’ll charge you for it anyway. Then you take that embedding, you parse it to your vector database. You look up a bunch of context that you want to use for prompt engineering with your chatbot. Maybe those are help documents that your support center has. Maybe that’s a movie catalog because this is a movie guru chatbot. You get that context that you’ve written all these documents about movies, or about help for your support center. You pull back the English language text out of your database, and you pull back as much English language text as OpenAI’s models will take in their context windows. You put your prompt on that. You say, “Given this information, you are a helpful chatbot. Please respond to this prompt.” You paste all that together. You send off 30 kilobytes worth of data to OpenAI again. They run it through their text generation model, they send you a response back. Then you go ahead and send that response back to the user.
With PostgresML, you can do all of that in a single query inside the database. There’s no network transit. There’s no memory copies. You create your embedding in the first CTE. You retrieve all of the relevant documents in the second CTE. Then in the third query, you parse those as concatenated inputs with your prompt, to another large language model. You can actually start stacking and chaining models with these CTEs as much as you want. One of the things that we found very useful is that cosine similarity is ok. It’s good and fast, relative to an XGBoost model that re-ranks. An XGBoost model that re-ranks is going to be much more accurate if rankings really matter.
Model Hosting on The Database
The transform is going to happen in the database. All of that text document for the prompt engineering and context will stay inside of the database. There will be a much lower network data transit cost. It’s going to be pointers in memory, or a memcpy if necessary. You’re talking about many kilobytes, potentially megabytes worth of data that doesn’t need to get sent over the wire. Compared to OpenAI and Python implementations doing the same thing, it’s usually a 10x speedup.
If you wanted to do it with PostgresML, you could have a foreign data wrapper to a different database instance. You can set up that and you can query that different database instance. If you wanted to retrieve all of your documents from Postgres into your Python application, and then send those off to OpenAI and do a hybrid approach, you can do that, absolutely, but you’re then now paying the network transit cost.
We covered machine learning, in theory. We covered SQL and how to do all this in depth. We’ve gone a step further, and we’ve created a Python SDK that implements all of these SQL queries for you. It gives you the functions with inputs and outputs that you actually want and need to know. You connect your Python application to a Postgres database. You create a collection of documents. You give it some arbitrary name for your collection. You can then upsert documents into that collection. If you want to go ahead and generate the indexes, you can do that with a single function call. We’ll enforce all of the best practices and the most efficient query patterns with this SDK. Finally, you can just do your vector search. You can get those documents back that you might then want to pass on to OpenAI. We’re extending this, though, to create chainable API SDK function calls, so that you can do a vector search. Then, instead of materializing the results immediately, you can call dot text generation. You can actually do it all in a single execution to get the efficiency back at the application layer.
It’s worth mentioning that if you try to do a bunch of LLMs inside of a single database, you’re probably going to knock that database over pretty quickly. PgCat is our other project that acts as a router. It handles sharding. It handles replication. It handles failover and load balancing across many different Postgres replicas. You might want to look into that project as well.
See more presentations with transcripts