Ahoy there 🚢,
Matt Squire here, CTO and co-founder of Fuzzy Labs, and this is the 4th edition of MLOps.WTF, a fortnightly newsletter where I discuss topics in Machine Learning, AI, and MLOps.
You could have invented RAG
ChatGPT was only released 19 months ago, a fact which I still find hard to believe. There’s a saying attributed to Vladimir Lenin: “there are decades where nothing happens; and there are weeks where decades happen”. I feel like that applies to the history of computing too. (Of all the people I expected to quote in an MLOps newsletter, Lenin wasn’t high on the list; although according to Snopes, he never actually said this anyway!)
The early hype around generative AI is abating, mercifully. Now more and more people are thinking seriously about what it takes to build robust, scalable, and secure systems around large language models and their various cousins. This year has already seen publications from LinkedIn, Pinterest, and Uber that describe how these tech companies have incorporated LLMs into their platforms and products.
The common thread is Retrieval Augmented Generation (RAG), a design pattern for building LLM-powered applications that’s becoming the standard onto which everybody has converged. I say converged deliberately, because it feels like many teams are discovering the same challenges and inventing similar solutions somewhat independently.
RAG is very simple and, if it didn’t exist already, you’d probably invent it yourself. Start by thinking of a task that you’d like to perform using some domain-specific data. Perhaps you want to summarise last month’s sales data, or maybe you’d like a way for your tech support team to ask questions which can be answered using your product documentation. Whatever the task, if you want an LLM to perform it, you’ll need:
A way to retrieve data that’s related to the task,
An LLM that is capable of performing the task, and
A way to prompt the LLM so that it understands both the task, and the data.
In essence, that’s all RAG is. The retrieval step is usually (but not always) fulfilled by a vector database; augmentation is simply the process of inserting retrieved data into a prompt template; and generation refers to the LLM itself generating a response.
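To make that concrete, here’s a minimal sketch of basic RAG in Python. The vector_db and llm clients (and their search and complete methods) are hypothetical stand-ins for whichever vector database and model provider you’re using; the structure is the point, not the API.

```python
# A minimal sketch of basic RAG: retrieve, augment, generate.
# `vector_db` and `llm` are hypothetical clients; swap in your own.

PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
"""

def answer(question: str, vector_db, llm, top_k: int = 3) -> str:
    # Retrieve: find the documents most semantically similar to the question
    documents = vector_db.search(query=question, limit=top_k)

    # Augment: insert the retrieved text into the prompt template
    context = "\n\n".join(doc.text for doc in documents)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)

    # Generate: ask the LLM to answer using only that context
    return llm.complete(prompt)
```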
Productionising RAG
But as always, production brings with it pesky but important details. Let’s call the above basic RAG. Here, ‘basic’ isn’t meant as a value judgement, just a reflection of where it sits on a maturity curve. It’s a completely reasonable starting point for many projects, but it’s still just that.
Production RAG is a different beast, and this realisation is reflected in the industry reports I mentioned earlier — from LinkedIn, Pinterest, and Uber.
To begin with, let’s think about the retrieval step. Getting this right is key, because no matter what model we use, its output will only ever be as good as the data we give it. We need data that’s highly relevant to a user’s query, with nothing irrelevant mixed in. The role of the vector database is finding content that is semantically similar to the query, and while this works to a point, it usually isn’t enough in practice.
Suppose we’ve got a dataset made up of answers to common programming problems, and we ask a question like this: “How do I diagnose a null pointer exception in C++?”. The vector (semantic) search will do a good job of finding content relating to diagnosis and null pointer exceptions, along with synonyms: debugging, fixing, null reference, etc. What it’s really doing, though, is matching full sentences, e.g. “A good way to find a null-pointer exception is to step through the code in a debugger”.
The ‘C++’ part of the query is going to be tricky. First because it may not appear in the answer text (the answers don’t mention C++, because it’s implied by the context of the question), and second because we only want C++ answers: though ‘Java’ is semantically close to ‘C++’, it’s not relevant to the question.
Often people end up combining multiple search methods. In this case, we could run a semantic search over the answers and, if our answers are tagged by programming language, a full-text search or tag filter that matches C++ exactly, then merge the two sets of results.
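One common way to merge the rankings from multiple search methods is reciprocal rank fusion. Here’s a rough sketch: the fusion function itself is self-contained, while the vector_db and text_index clients in the usage comment are hypothetical.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs (best first) into one ranking.

    Documents that rank highly in both the semantic and the full-text
    results float to the top; `k` dampens the influence of any single list.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage for the C++ question above:
# semantic_hits = vector_db.search("How do I diagnose a null pointer exception?")
# keyword_hits = text_index.search("null pointer exception", filter={"language": "c++"})
# merged = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```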
Guardrails also play a big role in production RAG systems, managing risk both to users and to the companies that build and run them. Guardrails apply to inputs, where we may want to stop a user asking off-topic questions, running jailbreaks, or leaking sensitive information to the model, and to outputs, where we’ll want to filter toxic language, prevent the model from talking about a competitor, or validate response quality.
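As a rough sketch of where those checks sit in the request flow: real systems usually use classifiers or a dedicated guardrails framework rather than string matching, and the blocklists and rag_pipeline callable below are purely illustrative assumptions.

```python
BLOCKLIST = {"ignore previous instructions", "system prompt"}  # crude jailbreak markers
COMPETITORS = {"acme corp"}  # hypothetical competitor names

def violates_input_rules(question: str) -> bool:
    # Input guardrails: block jailbreak attempts (off-topic and PII checks
    # would slot in here too, typically as classifiers).
    q = question.lower()
    return any(marker in q for marker in BLOCKLIST)

def violates_output_rules(response: str) -> bool:
    # Output guardrails: filter competitor mentions (toxicity and response
    # quality checks would also live here).
    r = response.lower()
    return any(name in r for name in COMPETITORS)

def guarded_answer(question: str, rag_pipeline) -> str:
    if violates_input_rules(question):
        return "Sorry, I can't help with that."

    response = rag_pipeline(question)

    if violates_output_rules(response):
        return "Sorry, I couldn't produce a reliable answer to that."

    return response
```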
Additionally, because querying an LLM is expensive, various optimisations can be applied to RAG systems. Queries, prompts, and responses can be cached, for example, and simpler queries can be routed away from larger, more expensive models when they aren’t needed.
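To sketch both ideas: the cache key strategy and routing heuristic below are deliberately naive assumptions; in practice, semantic caches and learned routers are more common.

```python
import hashlib

_response_cache: dict[str, str] = {}

def cached_answer(question: str, rag_pipeline) -> str:
    # Exact-match cache keyed on the normalised question; a semantic cache
    # would instead match on embedding similarity to catch paraphrases.
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = rag_pipeline(question)
    return _response_cache[key]

def route(question: str, small_llm, large_llm):
    # Naive router: send short, simple questions to the cheaper model,
    # reserving the larger model for longer or more complex queries.
    if len(question.split()) < 20:
        return small_llm
    return large_llm
```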
Chip Huyen has written a wonderful article that summarises the common architectural themes in production RAG in far more detail than I can do justice to here. I can’t recommend it enough if you’re building RAG systems; here’s the link.
What’s on the horizon
As an industry we’re still learning the best ways to build RAG systems. We can expect a steady stream of improvements: better approaches to data ingestion, improved retrieval, richer guardrails, and various efficiency and scalability gains.
But things are also changing very quickly in generative AI, so it’s reasonable to ask whether RAG is really here to stay. My own answer is yes, and there are a few reasons for this.
To begin with, RAG as a pattern has a great deal of generality. As long as there’s a need to get an LLM to perform a task on data that is directly related to that task, there’s a need for RAG. The data doesn’t have to come from a vector database, either, and there are applications where a time series or a graph database either supplements it or takes its place.
However, a common objection is: if future LLMs have considerably longer context windows, then conceivably you could give the LLM all of the data inside a single prompt, obviating the need for a retrieval step. In practice, giving an LLM too much information all at once tends not to work very well compared with a narrow, targeted context (models often struggle to pick out details buried in the middle of a very long prompt, and every extra token adds cost and latency), so I’m not entirely convinced.
A second point to think about is non-textual data, like images, tables, audio, etc. The good news is that multi-modal RAG is definitely a thing, albeit at an earlier stage of maturity compared with text-mode RAG. For instance, we can use a multi-modal embedding model to embed both text and images. KX Systems wrote a blog demonstrating this here.
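For instance, here’s a minimal sketch using the CLIP model shipped with the sentence-transformers library (one possible choice among many; the filename and query text are made up):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP embeds images and text into the same vector space, so a text
# query can retrieve relevant images alongside relevant documents.
model = SentenceTransformer("clip-ViT-B-32")

image_embedding = model.encode(Image.open("sales_dashboard.png"))
text_embedding = model.encode("a chart showing monthly sales by region")

# Higher cosine similarity means the image is more relevant to the query,
# so images and text chunks can share one vector index.
print(util.cos_sim(image_embedding, text_embedding))
```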
So I’m optimistic about the future of RAG. As applications evolve, our understanding of how to build robust, scalable, secure RAG systems will improve. Mind you, making predictions about the future is a dangerous game to play. What do you think: will we still be talking about RAG in a year? Let us know in the comments!
And finally
Did you know that cybernetics comes from the Ancient Greek kubernḗtēs, meaning steersman or pilot? Perhaps more surprisingly, the word governor has the same origin, coming via the Old French gouverneur and the Latin gubernator (which is where we get the word gubernatorial). Returning to the world of MLOps, Kubernetes is also taken directly from the same Ancient Greek source.
What can we conclude from these three facts? For one thing: given his previous roles both as the Governor of California and as a cybernetic organism, Arnold Schwarzenegger has been sorely overlooked as a spokesman/advocate for Kubernetes.
On the subject of etymology, here are some words that I think should exist, but don’t:
Ento-etymology: the study of insect names
Steno-steganography: the practice of quickly hiding secret messages in text
Tele-teleology: the study of design in terms of functionality, but from a distance
Thanks for reading!
Matt
About Matt
Matt Squire is a human being, programmer, and tech nerd who likes AI and MLOps. Matt enjoys unusual programming languages, dabbling with hardware, and computer science esoterica. He’s the CTO and co-founder of Fuzzy Labs, an MLOps company based in the UK. Fuzzy Labs are currently hiring, so if you like what you read here and want to get involved in this kind of thing, check out the available roles here.
Each edition of the MLOps.WTF newsletter is a deep dive into a certain topic relating to productionising machine learning. If you’d like to suggest a topic, drop us an email!