In building the next generation of web agents, we need the simplest, fastest way to extract web data at scale for production use cases. And for every new generative AI (GenAI) application, developers and businesses need reliable data to power the underlying models. But getting that data in a usable, trustworthy format? That’s where things get complicated.
As the co-founder and CTO of Reworkd, I spend a lot of time thinking about how to meet the growing demands of AI-powered applications, not just with reliable data but also in real time and at scale. That’s the mission of Reworkd—we’re making real-time data extraction as seamless and efficient as possible.
I’ll walk you through how we leverage agentic AI and GenAI to automate and scale real-time web scraping and processing, and I’ll share how we use the Confluent data streaming platform to deliver faster, more reliable data extraction services.
Scraping data from the web has always been a tedious process with a ton of manual effort involved. It requires crawling websites, parsing dynamic pages, handling terabytes of data per day, and ensuring data accuracy, all while trying to avoid getting blocked by the websites you're scraping.
When you’re also working with GenAI models, the problem compounds. For example, if you’re building an interactive recommendation engine, you need unstructured and structured product data, reviews, prices, and other relevant fields from thousands or even millions of web pages.
So the objective is not just to retrieve the data but also to process it in a way that makes the data consumable and understandable to a large language model (LLM) behind the recommendation chatbot experience you’re building.
Imagine having to manually scrape data from 10,000 to 100,000 websites, each with anywhere from tens to millions of pages, for this use case. Doing it just once per site wouldn’t be enough. You would need to regularly redo the entire process to ensure the app you’re building continues to produce up-to-date, accurate, and useful outputs for your end user. And it’s not feasible to directly feed each page into an LLM without incurring high token costs.
The amount of effort involved wouldn’t just be a waste of your time—it would be impossible to achieve quickly, frequently, and consistently enough to benefit you, the end user, or the business. That’s what makes web data extraction a space that’s overdue for an AI intervention. It involves a staggering amount of work that simply shouldn’t be done manually anymore. Not when there’s a better way forward and there’s better work that you could be doing with that time.
At Reworkd, we use AI to automate web scraping, which dramatically accelerates data extraction: our agents generate custom code that extracts and validates data while avoiding overloading target sites. We built AI Agents that iteratively write code to crawl and scrape data from webpages and then test that code, self-reviewing to verify it works before it goes to the customer. We also created AgentGPT, a GenAI application that lets you assemble, configure, and deploy autonomous AI Agents in your browser.
These benefits become even more powerful for our customers when the extracted data is available in real time and can scale on demand.
That’s why we made the Confluent data streaming platform the core of Reworkd’s data extraction pipelines.
Powered by its cloud-native Apache Kafka® engine, Confluent delivers the real-time, fault-tolerant solution we need to handle high-throughput data streams with ease. Using Confluent allows us to stream web data in real time, validate it, and make it available to customers in near-real time.
Because it’s not just about streaming data—it’s about ensuring that data is processed, validated, and transformed before it reaches the end user. Here’s how the data streaming platform fits into our data extraction pipelines:
Streaming Data into Kafka Topics: After scraping data from a webpage, we send the output to Kafka topics. This is where we do most of the heavy lifting, from data validation to transformation (a minimal producer sketch follows this list).
Built-In Data Governance: We use Schema Registry to enforce consistency in data extracted at various times and across different sources. This ensures that all data flowing through the system adheres to a defined data contract, which is especially important when you're dealing with large volumes of data from a wide variety of sources.
Shift-Left Stream Processing: We’ve set up consumer groups to manage essential processing—whether it’s deduplication, data validation, or transforming raw web data into structured formats—as early as possible to prevent data errors and duplicative processing. And we’re able to easily add new consumers or adjust processing logic as a customer’s needs change.
Downstream Consumers: From Kafka, we send the transformed data to downstream MySQL databases. We use ClickHouse as our data warehouse, connecting Kafka to ClickHouse via Kafka Connect and Debezium for change data capture (CDC). We also use Grafana for monitoring and visualizing key metrics.
API Consumption: Customers don’t pull data directly from Kafka. Instead, we expose the data via a REST API, which is more flexible for integrating with different types of customer workflows.
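To make the first two steps concrete, here’s a minimal sketch of producing a scraped record to a Kafka topic with a JSON Schema enforced through Schema Registry, using the confluent-kafka Python client. The topic name, schema, credentials, and field names are illustrative assumptions rather than our production data contract.

```python
import json

from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.json_schema import JSONSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Hypothetical data contract for scraped product records.
PRODUCT_SCHEMA = json.dumps({
    "type": "object",
    "properties": {
        "url": {"type": "string"},
        "name": {"type": "string"},
        "price": {"type": "number"},
        "scraped_at": {"type": "string"},
    },
    "required": ["url", "name", "scraped_at"],
})

schema_registry = SchemaRegistryClient({
    "url": "https://<schema-registry-endpoint>",
    "basic.auth.user.info": "<sr-api-key>:<sr-api-secret>",
})
serialize_value = JSONSerializer(PRODUCT_SCHEMA, schema_registry)
producer = Producer({"bootstrap.servers": "<bootstrap-server>:9092"})

def publish(record: dict, topic: str = "scraped-products") -> None:
    """Validate a scraped record against the registered schema and produce it."""
    payload = serialize_value(record, SerializationContext(topic, MessageField.VALUE))
    producer.produce(topic, key=record["url"], value=payload)

publish({
    "url": "https://example.com/item/1",
    "name": "Example Widget",
    "price": 19.99,
    "scraped_at": "2024-10-01T12:00:00Z",
})
producer.flush()
```

Keying each record by URL also keeps every version of a page in the same partition, which makes downstream deduplication and compaction straightforward.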
Web scraping often requires writing a lot of custom code, particularly when you’re working with pages that change frequently or have complex structures. Traditionally, building a custom scraper for each site is a time-consuming, manual process. Using an Agentic AI workflow that includes OpenAI’s GPT-4, we’ve developed tools to automate many of the steps involved.
Most of our Agentic AI work revolves around writing, testing, and iterating on code for web scraping. For example, scraping product listings from an e-commerce page often requires dealing with highly dynamic pages where millions of products are listed.
Feeding every single page into an LLM would be prohibitively expensive due to token costs. Instead, we use retrieval-augmented generation (RAG): we create embeddings of page content and retrieve only the sections that matter for understanding the page’s context, which lets us write efficient scraping code while cutting token costs.
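As a rough illustration of that retrieval step, here’s a minimal sketch in Python: embed sections of the rendered page, then keep only the chunks most relevant to the extraction task for the code-writing prompt. The embedding model, chunking, and task wording are assumptions for illustration, not our production setup.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed text chunks with an OpenAI embedding model (model choice is illustrative)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_chunks(page_chunks: list[str], task: str, k: int = 5) -> list[str]:
    """Return the k page chunks most relevant to the extraction task."""
    chunk_vecs = embed(page_chunks)
    task_vec = embed([task])[0]
    # Cosine similarity between the task description and each chunk.
    sims = chunk_vecs @ task_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(task_vec) + 1e-9
    )
    return [page_chunks[i] for i in np.argsort(sims)[::-1][:k]]

# Only the relevant slices of the page go into the code-writing prompt.
chunks = ["<header markup>", "<product grid markup>", "<footer markup>"]
relevant = top_k_chunks(chunks, "extract product name, price, and rating")
```

Because only the top-ranked chunks reach the model, per-page token usage stays roughly bounded even as the pages themselves grow.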
Our ability to stream and process the resulting data on Confluent allows us to efficiently manage token usage, deduplicate incoming data, and quickly identify the most important changes on target web pages. This is essential for businesses that need to track products, prices, and other dynamic content across large-scale websites.
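Here’s a simplified sketch of that kind of early deduplication with the confluent-kafka Python client: a consumer-group worker hashes each record’s content and only forwards records whose content has actually changed. For brevity it assumes plain JSON payloads and hypothetical topic names, and it keeps state in memory; a production version would use a durable store and a Schema Registry-aware deserializer.

```python
import hashlib
import json

from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "<bootstrap-server>:9092",
    "group.id": "dedup-processor",       # one consumer group per processing stage
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,         # commit only after a record is fully handled
})
producer = Producer({"bootstrap.servers": "<bootstrap-server>:9092"})
consumer.subscribe(["scraped-pages-raw"])

seen: dict[str, str] = {}  # url -> last content hash (in-memory for brevity)

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    content_hash = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    url = record["url"]
    if seen.get(url) != content_hash:
        # New page or changed content: forward it to the cleaned topic.
        seen[url] = content_hash
        producer.produce("scraped-pages-clean", key=url, value=msg.value())
        producer.poll(0)
    consumer.commit(message=msg)  # on restart, this worker resumes right here
```

Because offsets are committed only after a record is processed, a worker that crashes simply resumes from its last committed offset when it comes back online, which is the same property that keeps the wider pipeline fault-tolerant.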
Customers provide a list of websites and schemas, and our AI Agents automatically generate web scrapers in minutes, significantly accelerating a process that would normally take weeks. Here’s how it works in practice:
Webpage Rendering & Parsing: We start by scraping data from a newly targeted web page. First, the page is rendered in a browser, and then we use computer vision to convert it into a 2D text version. This enables us to extract relevant data, like product names, prices, and descriptions.
Agentic AI Code Generation: Next, we feed this text version into GPT-4 via Azure’s OpenAI service, and AI Agents write code to extract the relevant data. Understanding the context of the page is critical at this stage; otherwise, the generated code is unlikely to extract the right data or produce output that downstream consumers can actually use.
Testing and Validation via AI Agents: After the scraping code is generated, AI Agents test and validate it by running it against the webpage. If extraction fails, the agents revise the code and try again until it works (a simplified sketch of this loop follows the list).
Real-Time Data Processing: Once the code is validated, the scraped data is sent to a Kafka topic for further processing and deduplication. We also track token usage and billing for our customers by connecting Kafka with ClickHouse via connectors, and use Grafana for visualizing the results.
Downstream Data Consumption: Our customers consume the data via a REST API, allowing them to integrate it seamlessly into their workflows. If the data already exists in their system, they can easily deduplicate it and avoid re-ingesting unchanged data.
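To make the generate-test-revise loop in steps 2 and 3 concrete, here is a heavily simplified sketch assuming an Azure OpenAI GPT-4 deployment. The deployment name, prompt, validation check, and lack of sandboxing are illustrative assumptions; the real agents do considerably more.

```python
import os

from openai import AzureOpenAI

# Assumes the standard Azure OpenAI environment variables are set.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",
)

def generate_scraper(page_text: str, schema: dict, feedback: str = "") -> str:
    """Ask the model to write an extract(html) function for the rendered page."""
    prompt = (
        "Write a Python function extract(html) that returns a list of dicts "
        f"matching this schema: {schema}. Return only raw Python code, "
        "with no markdown fences or commentary.\n\n"
        f"Rendered page:\n{page_text}\n{feedback}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name (illustrative)
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def agent_loop(page_text: str, html: str, schema: dict, max_attempts: int = 3) -> str:
    """Generate scraping code, run it against the page, and revise it until it validates."""
    feedback = ""
    for _ in range(max_attempts):
        code = generate_scraper(page_text, schema, feedback)
        try:
            scope: dict = {}
            exec(code, scope)  # a real agent sandboxes generated code before running it
            records = scope["extract"](html)
            if records and all(set(schema) <= set(r) for r in records):
                return code  # extraction worked and every record has the required fields
            feedback = "The previous attempt returned records missing required fields; fix it."
        except Exception as exc:
            feedback = f"The previous attempt raised {exc!r}; fix the code."
    raise RuntimeError("No working scraper produced within the attempt budget")
```

Once a generated scraper passes validation, its output flows into the Kafka topic from step 4 for deduplication and delivery to customers.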
Using Confluent as the backbone behind Reworkd has made it significantly easier for our team to work toward accelerating and streamlining data extraction. Instead of having to dedicate weeks to simple additions or updates to our data systems, we can make changes in a few days and focus on the work that matters most to our customers.
For example, an engineer who joined our team was able, within just a week, to build a new dashboard by adding a consumer that tracks network usage for customers, use a connector to send that usage data to ClickHouse, and build streaming data pipelines to power a new billing process.
Without Confluent, those changes would have involved weeks or even months of building bespoke integrations. That’s the payoff of standardizing your data integration, processing, and governance on streaming: you don’t have to rethink your architecture for every new feature, and you can reuse transformed data wherever it’s needed. Confluent makes it easy to experiment with a fast feedback loop, iterate, and build new features quickly.
Additionally, Confluent’s cloud-native design ensures that when a service goes down, it picks up right where it left off when it comes back online. This makes our system incredibly fault-tolerant and ensures data isn’t lost during outages.
Looking ahead, my role is to keep evaluating where the technology is headed and ensure our platform stays ahead of the curve. Models are getting better and cheaper, but there are still gaps in accuracy. Agentic AI and GenAI will continue to evolve, and for Reworkd that means automating even more of the web data extraction process, for example by expanding the role of agentic AI in writing evals, to improve performance and make it even easier to handle dynamic web data.
Using data streaming with our GenAI and agentic AI tools allows us to pull relevant data and deliver high-quality, hallucination-free responses to our customers faster, without inordinately increasing storage costs. Over time, we envision a fully automated process for scraping and extracting web data, one that makes the job easier for both people and GenAI models while improving accuracy. This will not only lower costs but also open up new opportunities for businesses looking to access and analyze web data in real time.
Whether you're developing GenAI applications or managing data-intensive systems, the key to success lies in continuous experimentation, automation of repetitive manual processes, and access to trustworthy data. By leveraging tools like Confluent, you'll be able to build data pipelines that can scale as your needs grow and deliver the data you need, even as your AI apps evolve with new demands and emerging tech.
Stay informed on GenAI best practices and check out the discussions from Confluent AI Day 2024, including a panel that I participated in—“Is Your Data Ready for Trustworthy GenAI?”
You can also learn more about Reworkd or explore additional resources at Confluent’s GenAI hub.