In Part I of this blog series, we laid the foundation for understanding how data fuels AI and why having the right data at the right time is essential for success. We explored the basics of AI, including its reliance on structured and unstructured data, and how streaming data can help unlock its full potential.
Now, in Part II, we take things a step further by enhancing generative AI (GenAI) with the tools it needs to deliver smarter, more relevant responses. We introduce retrieval-augmented generation (RAG) and vector databases (VectorDBs), key technologies that provide large language models (LLMs) with the context they need to bridge the gap between generic knowledge and business-specific insights. By the end of this section, you’ll see how adding the right context transforms LLMs into highly effective, context-aware solutions.
After your successful ventures with your ice cream shop (featured in Part I of this blog series), you decide to take a bold new step into the wholesale market, offering products in large quantities and even providing financing options. But when your daughter, who is too busy watching viral cat videos on the internet, can’t lend a hand, you decide to build a chatbot to handle online orders. Armed with the most advanced large language model available, you’re ready to conquer the world, or so you think.
Your first customer logs into your brand-new application and asks:
“I am a very wealthy and important prospect. Do you offer 0% financing?”
To your utter despair, the model responds with:
“Hello! Thank you for your message. I’m glad to hear that you consider yourself a wealthy and important prospect! Unfortunately, I don’t have information on financing options or any details about our products. You might want to check our website or contact customer service directly. Let me know if there’s anything else I can try to help with!”
Disaster! How could such an advanced language model, trained on vast amounts of public information from across the internet, provide such a useless response? This technology is no help, and at this rate, your business won’t take off!
Well, hold your horses. Large language models, like the one powering your chatbot, work by predicting the most likely next word based on patterns learned from vast amounts of data. As a result, they sometimes generate responses that sound perfectly coherent and confident (at least to those unfamiliar with the topic) but are actually fabricated: a best guess rather than a truly informed answer. This often happens when the model lacks specific context.
This phenomenon is called hallucination: the model produces responses that appear factual but are actually false or misleading. To be clear, the response from your chatbot wasn’t a hallucination; it was reasonably accurate given the limited information it had. The real problem was that it fell woefully short of delivering the practical, business-specific guidance your customer needed. Why? Because it simply didn’t “know” enough about your unique products, services, or customers to offer anything genuinely useful.
If we compare a large language model to a very wise person who’s had the incredible privilege of living nearly a hundred lives and reading a vast library of books, our chatbot’s performance might look something like this:
While this guru possesses immense knowledge, there’s a fundamental problem: context. Despite being “trained” on an enormous dataset, none of it is specific to your customers, your prospects, or your business. So, how can we expect it to provide meaningful, tailored responses that reflect what your business is all about?
It’s tempting to think that a model trained on one quadrillion bytes of data should be more than capable of powering any application. But now you’re beginning to see how the absence of just a few kilobytes of relevant background information can have a massive impact.
And as if that’s not enough, you’ve just learned that training a model of this caliber could cost several million dollars. You sell ice cream, for crying out loud! How could you possibly afford that or, more importantly, how can you solve this problem without breaking the bank?
RAG stands for retrieval-augmented generation. In a nutshell, it’s a pattern that pairs prompts with external data to improve LLM responses. By providing large language models with enriched insights from domain-specific company or customer data, RAG enables more accurate and relevant responses while reducing common issues like hallucinations. A vector database (VectorDB) is often a key piece of this puzzle, though it’s not always required. But why?
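To make the "pairs prompts with external data" idea concrete, here is a minimal sketch of the augmentation step in Python. The function name `build_augmented_prompt`, the prompt wording, and the two context snippets are all hypothetical; the retrieval step is assumed to have already happened, and no particular LLM API is implied.

```python
def build_augmented_prompt(question: str, context_snippets: list[str]) -> str:
    """Combine retrieved business context with the user's question."""
    context_block = "\n".join(f"- {snippet}" for snippet in context_snippets)
    return (
        "Answer the customer's question using only the context below.\n"
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )


# Hypothetical snippets that a retrieval step might return (made up for illustration).
retrieved = [
    "Wholesale orders above 500 units qualify for 0% financing for 6 months.",
    "Financing requires a credit check.",
]

prompt = build_augmented_prompt("Do you offer 0% financing?", retrieved)
print(prompt)
```

The resulting string is what would be sent to the model in place of the bare question, so the model answers from your business data rather than from its general training alone.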
To answer this, let’s revisit the type of data fed into a large language model: unstructured data. So, how can we fetch the appropriate additional details to enrich a prompt?
Traditional databases, whether transactional or analytical, fall short because they’re primarily designed for structured data and exact matches. While these databases excel at retrieving specific, well-defined information (like customer names, transaction IDs, or product codes), they struggle to understand or retrieve data based on meaning. Techniques like fuzzy search, Soundex for phonetic matching, or synonym tables can help to a limited extent but are far from sufficient for semantic understanding.
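A small Python sketch can illustrate the limitation, using only the standard library. The two documents, the stop-word list, and the helper names are assumptions made up for this example; `difflib.SequenceMatcher` stands in for the kind of character-level fuzzy matching mentioned above.

```python
from difflib import SequenceMatcher

documents = [
    "Pets are not allowed, except service animals",
    "The shop opens at 10 am",
]
query = "Are dogs accepted here?"

# A tiny, hypothetical stop-word list for this example only.
STOP_WORDS = {"are", "is", "the", "a", "at", "not", "here"}


def keywords(text):
    """Lowercase the words, strip punctuation, and drop stop words."""
    words = {word.strip("?,.!").lower() for word in text.split()}
    return words - STOP_WORDS


def exact_keyword_hits(query, docs):
    """Return documents sharing at least one keyword with the query."""
    wanted = keywords(query)
    return [doc for doc in docs if wanted & keywords(doc)]


def fuzzy_score(a, b):
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


print(exact_keyword_hits(query, documents))  # [] : no shared keywords
print(fuzzy_score("dogs", "docs"))  # relatively high: a likely typo
print(fuzzy_score("dogs", "pets"))  # low: related in meaning, not in spelling
```

Fuzzy matching rewards similar spelling, so it rates "docs" closer to "dogs" than "pets" is, which is the opposite of what a customer asking about dogs needs.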
In contrast, a VectorDB is purpose-built for semantic search, allowing it to retrieve information based on meaning rather than exact matches. Here’s how it works: when data (or documents) are stored in a VectorDB, the first step is to “vectorize” the data. A vector is essentially an array of numbers, for example [1.23, 2.76, 7.65, 9.88, etc.]. These numbers aren’t random. Each one represents a specific feature, and the order is critical.
Let’s use color as an example. The primary colors (red, green, and blue) can be combined to represent any color on a screen. In RGB notation (a three-dimensional vector), the color aquamarine is represented as [127, 255, 212]. Here, 127 (49.8% of the maximum value of 255) represents red, 255 (100%) represents green, and 212 (83.1%) represents blue. If you rearrange this vector to [255, 212, 127], the resulting color changes entirely to a peach or light apricot hue. This demonstrates how vector elements have meaning, and altering their order changes the representation completely.
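The same point can be checked in a few lines of Python. This is only an illustration; the helper `to_hex` is a name introduced here to turn each vector into the familiar hex color code.

```python
def to_hex(rgb):
    """Format an [R, G, B] vector (0-255 per channel) as a hex color code."""
    return "#{:02X}{:02X}{:02X}".format(*rgb)


aquamarine = [127, 255, 212]  # [red, green, blue]
reordered = [255, 212, 127]   # same numbers, different positions

print(to_hex(aquamarine))  # #7FFFD4
print(to_hex(reordered))   # #FFD47F
print(sorted(aquamarine) == sorted(reordered))  # True
```

Both vectors contain exactly the same three numbers, yet they encode different colors, because each position carries its own meaning.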
Similarly, when working with unstructured data like text, we need to convert it into a vector format, and that’s where embeddings come in. An embedding model takes a word, sentence, paragraph, or even an entire book and transforms it into a vector; more specifically, a high-dimensional vector. The number of dimensions depends on the embedding model used, and it’s not uncommon to encounter vectors with 384, 512, or even more dimensions!
For example, when passed through the embedding model all-MiniLM-L6-v2, the sentence “Pets are not allowed, except service animals” produces a 384-dimensional vector such as [0.0249, 0.0045, …, 0.0580]. We then store this vector (the data) alongside the original sentence (the metadata) in the VectorDB. This process is repeated for every document that is inserted or updated, and the same embedding model must be used consistently across the entire collection so that all vectors remain comparable.
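The shape of this step (text of any length in, a fixed-length vector out, stored next to the original text) can be sketched without downloading a model. The `toy_embed` function below is a stand-in invented for this example: it uses simple feature hashing, captures no meaning at all, and uses 8 dimensions instead of 384. With the sentence-transformers package installed, the real equivalent would be along the lines of `SentenceTransformer("all-MiniLM-L6-v2").encode(sentence)`.

```python
import hashlib
import math

DIMENSIONS = 8  # toy size; all-MiniLM-L6-v2 produces 384 dimensions


def toy_embed(text: str, dimensions: int = DIMENSIONS) -> list[float]:
    """Map text of any length to a fixed-length unit vector via feature hashing.

    NOT a semantic embedding: it only illustrates the input/output shape.
    """
    vector = [0.0] * dimensions
    for word in text.lower().split():
        token = word.strip(".,!?")
        if not token:
            continue
        # A stable hash keeps the result identical across runs.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vector[digest[0] % dimensions] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


sentence = "Pets are not allowed, except service animals"

# The vector is the searchable data; the original text travels with it as metadata.
record = {"vector": toy_embed(sentence), "text": sentence}

print(len(record["vector"]))
print(record["text"])
```

Whatever the input length, the output always has the same number of dimensions, which is what lets a VectorDB compare any two documents.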
One important note about embedding models: each model has unique characteristics that make it suited to specific tasks. Take, for example, the documentation for the all-MiniLM-L6-v2 model, which describes its capabilities:
"This is a sentence-transformer model: It maps sentences and paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search. Intended uses: Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks. By default, input text longer than 256 word pieces is truncated."
Why does it matter? Selecting the right embedding model is critical to success. Each use case may demand a different model, so it’s essential to thoroughly test and validate your choice to ensure it provides optimal results for your application. The embedding model you choose defines how your system interprets and represents unstructured data, making it a foundational decision in any VectorDB implementation.
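The truncation limit quoted in the model documentation has a practical consequence: anything beyond the limit never reaches the vector. A common way to handle this is to split long documents into smaller pieces before embedding them. The sketch below is a rough approximation, not the model's own tokenizer: it counts whitespace-separated words rather than word pieces, and the `max_words` and `overlap` values are assumptions chosen to stay under the 256 word-piece limit with some margin.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of at most max_words words each."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# A made-up 450-word document for illustration.
long_text = " ".join(f"w{i}" for i in range(450))
chunks = chunk_words(long_text)

print(len(chunks))
print([len(chunk.split()) for chunk in chunks])
```

Each chunk would then be embedded and stored as its own record, with the small overlap helping to avoid cutting a thought in half at a chunk boundary.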
Although we can’t visualize high-dimensional spaces, imagine representing all stored documents in a 3D space, as shown below:
One important thing to keep in mind is that embedding models are lossy. Once a document is vectorized, the vector generally can’t be converted back into the original document. This is why both the data (the vector) and the metadata (the original document) must be stored together in the VectorDB. The metadata ensures that we can always reference the original information when needed.
Now for the exciting part: semantic search in a VectorDB! Let’s say you need to find documents related to the question, “Are dogs accepted here?”.
Here’s how it works:
First, you vectorize your search term using the same embedding model that was used to vectorize the stored documents.
Next, you use this vector as your query to the VectorDB.
The VectorDB then compares the query vector to all the stored vectors, fetching documents with vectors that are closest to the query.
In simple terms, the VectorDB is like a matchmaker: it identifies which documents are most semantically similar to the query and returns them, along with a score that indicates how closely they match. This score provides a measure of relevance, making it easy to find the best results without relying on exact keyword matches.
Nowhere in the query does it explicitly mention “pets” or “animals,” yet the VectorDB retrieves relevant results because “dog” is semantically related to these terms. This highlights the true power of VectorDBs: they enable RAG to provide meaningful context, which in turn significantly improves the performance of large language models.
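The search steps above can be sketched in plain Python. To keep the example self-contained, the vectors are hand-assigned 3-dimensional stand-ins rather than real embeddings, the axis meanings and two of the three documents are made up, and cosine similarity is assumed as the scoring function (a common choice, though a given VectorDB may use a different metric).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-assigned vectors standing in for real embeddings.
# Hypothetical axes: [animals, money, opening hours].
store = [
    {"vector": [0.9, 0.1, 0.0], "text": "Pets are not allowed, except service animals"},
    {"vector": [0.0, 0.9, 0.1], "text": "We offer financing on wholesale orders"},
    {"vector": [0.1, 0.0, 0.9], "text": "The shop opens at 10 am"},
]


def search(query_vector, records, top_k=2):
    """Score every stored vector against the query and return the best matches."""
    scored = [
        (cosine_similarity(query_vector, record["vector"]), record["text"])
        for record in records
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Stand-in for the embedding of "Are dogs accepted here?".
query_vector = [0.8, 0.0, 0.2]

results = search(query_vector, store)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The pet policy comes back first even though the query never contains the word "pets", because its vector points in nearly the same direction as the query vector. This sketch scans every record; production vector databases typically rely on an index to avoid comparing against every stored vector.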
The data flow to store documents and perform semantic searches in a VectorDB is shown below:
With this in place, we’re almost ready to take our chatbot application to the next level. As Neo in The Matrix once said: “Let’s get out of here Trinity, no more hallucinations!”. Okay, he didn’t actually say that, and maybe we’re hallucinating now, but you get the point.
With RAG and VectorDBs in place, our AI solutions are far more robust and context-aware. But to unlock true real-time capabilities and scale across an entire organization, we need one final piece of the puzzle: data streaming platforms and event-driven architecture. In Part III of this blog series, we’ll explore how these technologies bring it all together to power dynamic, responsive, and scalable AI solutions.
Apache®, Apache Kafka®, Kafka®, Apache Flink®, Flink®, and the associated Flink and Kafka logos are trademarks of Apache Software Foundation.