Welcome to this three-part blog series where we explore how to transform generative AI (GenAI) into a powerful, real-time solution that drives business success.
In Part I, we'll lay the groundwork by examining how data fuels AI, why streaming data matters, and the core building blocks of AI technology.
In Part II, we dive into enhancing large language models (LLMs) with tools like retrieval-augmented generation (RAG) and vector databases (VectorDBs) to provide the right context for smarter responses.
Finally, in Part III, we bring it all together by exploring data streaming platforms (DSPs), event-driven architecture (EDA), and real-time data processing to scale AI-powered solutions across your organization. Let’s understand why data is the engine behind every successful AI implementation.
The whole world is buzzing about generative AI (GenAI). Is it a bandwagon your business needs to hop on, or just another passing trend you can safely ignore? And what, if anything, do data streaming platforms (DSPs) have to do with it? At first glance, these technologies seem to serve completely different purposes.
The good news is, if you’re still reading, it means you’re at least curious enough to explore the possibilities. Whether your business truly needs generative AI is something only you can decide, but by the end of this article, you’ll have a much better understanding of these technologies and, more importantly, the synergies between them.
The ultimate goal? To provide you with the foundational knowledge you need to make informed decisions, should you ever choose to embark on this journey into the world of generative AI and data streaming.
Artificial intelligence has evolved from a niche concept into a transformative force that is reshaping how businesses operate. The integration of AI with data streaming platforms has emerged as a crucial driver in this digital transformation era, focusing on the dynamic and real-time flow of data to generate contextually relevant predictions. Across industries, businesses are increasingly adopting AI technology to optimize operations, stay competitive, and enhance user experiences. However, the true potential of AI only unfolds when it is applied to the right datasets, at the right moment, and within the appropriate context.
To achieve this, AI and data streaming must work hand in hand to deliver the freshest and most relevant data, whether it’s customer insights, operational metrics, or market trends. Yet, many organizations struggle to realize this vision due to significant data management challenges.
A recent McKinsey survey highlighted these challenges, revealing that managing data remains one of the main barriers to extracting value from AI. Over 70% of companies surveyed reported difficulties integrating data into AI models, citing issues like data quality and the lack of clear governance processes. Fragmented and inconsistent data further complicates the problem, making it hard to meet the high standards required for AI performance.
Data governance is a critical hurdle. Without robust processes to ensure data consistency, accessibility, and security, integrating data into AI systems becomes a complex and error-prone endeavor. Governance is essential for managing data at scale, yet many organizations lag behind in developing these processes, causing delays and limiting the potential value AI can deliver.
Here’s the spoiler: a data streaming platform can be a game-changer for AI. By enabling instantaneous data integration, a DSP allows companies to process and analyze data as it’s generated, rather than relying on slower, outdated batch processing methods that handle data in bulk at scheduled times.
This continuous, unbounded flow of data ensures that information is always up to date, a critical requirement for producing high-quality AI outputs. But that’s not all. A streaming platform also supports consistent data pipelines, reducing fragmentation and making data more accessible and reusable across different systems.
What really sets a DSP apart, though, are its built-in governance features. With tools for monitoring, security, and data lineage tracking, a DSP provides a structured approach to managing data at scale. This makes it significantly easier to maintain the quality and compliance standards that AI models demand for success.
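To make the batch-versus-streaming contrast above a bit more concrete, here is a minimal Python sketch. It does not use any real streaming platform: the `orders` list and the functions `batch_average` and `streaming_average` are hypothetical names invented for this illustration. The batch function produces a single answer once all data has been collected, while the streaming function updates its answer as each event arrives.

```python
from typing import Iterable, Iterator, List


def batch_average(events: List[float]) -> float:
    """Batch style: wait until all events are collected, then compute once."""
    return sum(events) / len(events)


def streaming_average(events: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the result as each event arrives."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count  # an up-to-date answer after every event


# Hypothetical order amounts, made up for this example.
orders = [10.0, 20.0, 30.0, 40.0]

# One answer, available only after the whole batch has been collected.
print(batch_average(orders))

# A fresh answer after each event, without waiting for the rest.
print(list(streaming_average(orders)))
```

In a real deployment the events would come from a streaming platform rather than an in-memory list, but the idea is the same: the consumer always has a current result instead of waiting for the next scheduled run.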
If we peel AI like an onion, we find these layers:
Machine learning (ML): This involves statistical and probabilistic models primarily used for pattern recognition. By exposing a computer to large datasets, we train it to identify patterns and make predictions or decisions based on past data. Think of it as teaching a machine to recognize trends and act on them, whether it’s spotting fraud, recommending movies, or predicting the weather.
Deep learning (DL): This is a specialized branch of machine learning built on multi-layered neural networks. These models excel at identifying patterns in complex, unstructured data, such as recognizing faces in photos, interpreting spoken language, or analyzing the sentiment of text, and at making accurate predictions based on that information. Deep learning is what powers everything from image recognition in your smartphone to real-time language translation apps.
Large language models (LLMs): How many books does an average person read in their lifetime? Let’s assume 20 books per year over 50 years; that’s around 1,000 books. LLMs, however, are trained on text data equivalent to tens of thousands, if not millions, of books. Early LLMs, for instance, were exposed to the equivalent of over 75,000 books. To match that breadth, a person would need to live more than 75 lifetimes without rereading a single title!
“Ah, The Karamazov Brothers, my favorite book! I remember reading it… what was it, 35 lives ago? I wish I could read it again!”
In essence, LLMs are predictive in nature. They generate text by anticipating the next word in a sequence, producing well-structured phrases, sentences, paragraphs, and even entire books. And they do this with near-human fluency and coherence, making them invaluable for tasks ranging from answering customer queries to writing entire essays.
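The "predict the next word" idea can be sketched with a toy model. The Python example below is a drastic simplification and not how a real LLM works internally: it simply counts which word most often follows each word in a tiny, made-up training text (a so-called bigram model). The function names `train_bigram_model`, `predict_next`, and `generate` are hypothetical, chosen only for this illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def train_bigram_model(text: str) -> Dict[str, Counter]:
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    followers: Dict[str, Counter] = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers


def predict_next(model: Dict[str, Counter], word: str) -> Optional[str]:
    """Return the most frequent follower of `word`, or None if it is unknown."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def generate(model: Dict[str, Counter], start: str, max_words: int = 5) -> List[str]:
    """Generate text one word at a time, always picking the likeliest next word."""
    output = [start.lower()]
    while len(output) < max_words:
        next_word = predict_next(model, output[-1])
        if next_word is None:  # nothing ever followed this word in training
            break
        output.append(next_word)
    return output


# Tiny made-up training text; real LLMs train on vastly more data.
corpus = "the cat sat on the mat and the cat sat on the sofa"
model = train_bigram_model(corpus)

print(predict_next(model, "the"))
print(" ".join(generate(model, "the", max_words=5)))
```

An LLM replaces the simple counting with a deep neural network that looks at the whole preceding context rather than a single word, but the generation loop (predict a word, append it, repeat) follows the same spirit.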
If we oversimplify and classify AI into two main functional approaches, we get predictive AI and generative AI. Let’s explore them one by one.
1. Predictive AI: This uses statistical models to forecast outcomes based on patterns found in historical, structured data. What does that mean in practical terms? Let’s say you own an ice cream shop in the sunny city of Natal, nestled in the northeastern corner of Brazil. Every day, before closing, you jot down in your notebook how many scoops you sold, the weekday, the month, the year, whether it was a public holiday, and the average temperature and humidity. After two years, you’d have about 730 entries, each a snapshot of the factors that influenced your daily sales.
Then, on a cold, rainy day with no customers in sight, you pick up a magazine to pass the time. While flipping through the pages, you stumble across an article about how machine learning can help businesses optimize stock levels. “Voilà! I can use this to boost my business,” you think.
This is where a predictive AI model comes into play. Using the data you’ve recorded (with the number of scoops sold as the “label” and factors like weekday, month, public holiday, humidity, and temperature as the “features”), a supervised machine learning model could create a mathematical function, such as:
$$f(\text{weekday}, \text{month}, \text{year}, \text{public\_holiday}, \text{avg\_humidity}, \text{avg\_temperature}) = \hat{y}$$

To predict tomorrow’s sales, you’d input the weather forecast along with the other relevant factors into the model. The result would be an estimate $\hat{y}$ (y-hat) of how many scoops you’re likely to sell, helping you plan your stock levels for the next day.

However, it’s important to remember that this prediction is essentially an educated guess, relying on patterns from past, and ideally seasonal, data. The accuracy of this forecast can be evaluated by calculating the error:

$$\text{error} = |y - \hat{y}|$$

This absolute difference between the actual value ($y$) and the predicted value ($\hat{y}$) serves as a metric to assess the model’s accuracy. Over time, this error can guide you in retraining the predictive model to improve its performance.
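Here is a minimal sketch of the ice cream example in Python, under some loud assumptions: it uses only one feature (average temperature) instead of all six, it fits a straight line with ordinary least squares (one possible supervised technique, chosen here for brevity), and every number is made up for illustration. The names `fit_line`, `predict`, and `absolute_error` are hypothetical.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope: float, intercept: float, x: float) -> float:
    """The learned function f: maps a feature value to the estimate y-hat."""
    return intercept + slope * x


def absolute_error(y: float, y_hat: float) -> float:
    """error = |y - y_hat|, as defined in the text."""
    return abs(y - y_hat)


# Made-up notebook entries: the feature (avg temperature) and the label (scoops sold).
temperatures = [22.0, 25.0, 28.0, 31.0]
scoops_sold = [120.0, 150.0, 180.0, 210.0]

slope, intercept = fit_line(temperatures, scoops_sold)

# Tomorrow's forecast (assumed): 30 degrees. The model returns y-hat.
y_hat = predict(slope, intercept, 30.0)
print(f"Predicted scoops: {y_hat:.0f}")

# Once the day is over, compare with what was actually sold (assumed: 190).
print(f"Error: {absolute_error(190.0, y_hat):.0f}")
```

A real model would use all the recorded features and far more than four entries, but the loop is the same: fit on past data, predict, measure the error, and retrain when the error grows.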
The workflow for a predictive model looks like this:
This highlights an important limitation of predictive models: they are custom-made for specific input data. For instance, applying this exact ice cream sales model to a shoe store would fail miserably, because factors like humidity and temperature likely have little to no impact on shoe sales. Additionally, the model’s “coupling” with the application (or users) is high. This means the application must know precisely what information to provide as input for the model to generate useful predictions.
2. Generative AI (GenAI): This takes a completely different approach. Using deep learning models, it rapidly creates tailored content based on broad, unstructured data. Unlike predictive AI, which works with structured data to forecast outcomes, a generative AI model can process and understand human language (unstructured data) as input and respond in a human-like way. Because it’s trained on vast amounts of text, it learns how people communicate and can predict, one word at a time, the most appropriate response to a given prompt.

For example, if you prompt ChatGPT with “Write a poem about the future,” it might respond with something like: “Future’s canvas, unwritten tale, in innovation’s breeze, dreams set sail.” Now, whether or not that qualifies as a poem is up for debate, but what’s important is this: the model understood the request and generated a coherent response. And it did so in plain English, no programming or specialized commands required.

Compared to the predictive model, this one has an additional building block:
Prompt engineering is the process of carefully crafting and adjusting prompts at runtime, possibly injecting additional details into the original prompt, to guide a language model’s responses for optimal relevance, clarity, and accuracy. Think of it as giving the model just the right instructions to unlock its full potential.

From a cost perspective, foundational models like LLMs are significantly more expensive to train than many other deep learning models. To put this into perspective, training GPT-3 (the early model behind ChatGPT) is estimated to have cost over USD 40 million in computing power alone. So, unless you have a tech budget to rival Silicon Valley, don’t try this at home!

Lastly, one of the standout advantages of generative AI is its low coupling between the application (or users) and the model. Unlike predictive AI, which requires custom-built models for specific datasets, generative AI models are reusable and trained on broad, generic data. For instance, GPT-3 was trained on a diverse dataset that included Common Crawl, Wikipedia, books, and other open sources, with a cutoff date around 2020. This versatility means you can ask the model almost anything, from cooking tips and children’s stories to aviation concepts and quantum mechanics. It’s like having an encyclopedia, a storyteller, and a curious researcher rolled into one, ready to assist at your command.
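Coming back to prompt engineering for a moment, the idea of "injecting additional details into the original prompt at runtime" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, the context snippets are made up, and no actual model is called. The sketch shows only how the final prompt text could be assembled before it is sent to an LLM.

```python
from typing import List

# Hypothetical template: fixed instructions plus slots filled in at runtime.
PROMPT_TEMPLATE = (
    "You are a helpful assistant for an ice cream shop.\n"
    "Use only the context below to answer.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, context_snippets: List[str]) -> str:
    """Inject runtime context into the user's original question."""
    context = "\n".join(f"- {snippet}" for snippet in context_snippets)
    if not context:
        context = "- (no additional context)"
    return PROMPT_TEMPLATE.format(context=context, question=question.strip())


# Made-up runtime details that the application knows but the model does not.
snippets = [
    "Today's forecast: 31 degrees C, sunny",
    "Tomorrow is a public holiday",
]

prompt = build_prompt("How busy should I expect tomorrow to be?", snippets)
print(prompt)
```

Notice that the user still asks a plain-language question; the application quietly enriches it with fresh details, which is where up-to-date data becomes valuable.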
With a solid understanding of the fundamentals of AI and the role of data, we’re ready to move from theory to action. In Part II, we’ll see how we can enhance large language models (LLMs) with the right tools and context to unlock their full potential. It’s time to dive in and put these principles to work!
Apache®, Apache Kafka®, Kafka®, Apache Flink®, Flink®, and the associated Flink and Kafka logos are trademarks of Apache Software Foundation.