In 2023, business analytics and business intelligence have both been around long enough to have been adopted by most companies as a requirement for effectively operating their businesses. With them have come whole ecosystems of tools and technologies to capture, process, and glean insights from data. But as companies fight to maintain an edge over the competition, the need for more sophisticated and complex analysis has grown.
Enter one of the newer fields of study and technologies—AI and ML. With the promise of “teaching” computers to draw inferences, discover patterns, and make accurate predictions, AI and ML are being widely adopted. Companies are seeking the ability to predict the behaviors of people and things, spot new trends across a variety of metrics, and uncover hidden relationships between products, people, and things.
In this blog, you’ll get to know some of the problems, challenges, and solutions facing the field of AI/ML-driven business analytics focused on what is at its core—data. You’ll walk away with a better understanding of what you’re up against when it comes to doing real-time business analytics with AI/ML, optimizing your data infrastructure, and preparing analyst teams and technologies to use the highest quality data for the job.
The first key challenge that businesses are likely to face when incorporating AI/ML is data quality and integrity, which breaks down into a handful of more specific issues or topics discussed below. This challenge can stem from a variety of different sources, but in essence, it boils down to something that is often difficult to fix: human and organizational behavior.
1a. Data Completeness. Any reasonably established business probably has some degree of siloing in its information systems and data. Since the business or human behavior that you’re trying to model with AI/ML can easily span more than one domain, you’ll need to bring those domains together to build and train the strongest models. However, that’s typically much easier said than done.
1b. Data Context. Bringing all of the necessary data sources together in one place doesn’t mean the data is stitched together in an intelligible way. When preparing data for use with AI and ML, there is frequently a need to join or enrich data from multiple sources. Performing this task can be cumbersome if done incorrectly, and the systems given this responsibility need to be carefully architected for performance and scalability.
1c. Data Consistency. Whether your data is coming from different external sources or parties or from other teams within your organization, there’s no guarantee that everyone will agree on the naming or formatting of data (humans can’t even settle on a standardized format for dates, let alone a large and complex schema). So, some kind of transformation or very strong organizational discipline (likely both) will be necessary to make sure the data you feed to your AI or ML model is consistent and of high enough quality for the model to be effective.
While you can take the approach of unwavering organizational discipline and execution at the human level to solve the above, it’s likely more advisable to adopt tools and technologies that will enforce it for you. (In a later section of this post, you’ll find examples and solutions that can be applied to these challenges to do just that.)
The second key challenge is performance and scalability—like the previous challenge, it also breaks down further into important topics. Ultimately, quality data alone doesn’t cut it. You will need the right supporting tools and infrastructure in order to scale with the business.
2a. Data Volume and Velocity. Having good data is great, but is the supporting infrastructure that transforms, unifies, and cleans that data able to handle massive volumes of data all at once? What about the supporting infrastructure that does the modeling and inference of the data? Just because something works at a small scale doesn’t guarantee that it will continue to do so as the workload size grows, so it’s important to consider this in the initial design.
2b. Infrastructure Scaling. Having a working model with quality data doesn’t ensure that it’s ready for production workloads, where things can break and unexpected spikes in data ingestion are possible. For this reason, it’s important to place a strong emphasis on an implementation that is redundant and resilient without limiting horizontal scalability. Depending on the specifics of how you’re looking to apply AI and ML, this implementation can vary, which makes this one of the most challenging things to get just right.
2c. Continuous Model Training. In many circumstances, you likely want your model to learn over time. That is, rather than cyclically training and then using the model, with some amount of lag between phases before new data is factored in, you want new data incorporated as it becomes available. Training an ML model with batches of data is the easier approach, but it is less reactive to the current state of the data (and therefore, the current state of your business) than its real-time counterpart. Designing for real time is the best option, but it comes at the cost of additional complexity and obstacles.
Many of the performance and scalability challenges can be addressed by a well-architected data infrastructure. Make sure that you design data infrastructure that can handle increases in scale as you grow, lest you find yourself refactoring systems and applications instead of doing the analysis that provides the business with new insights.
When it comes to the challenges that have been laid out thus far, Confluent provides the most complete real-time data platform for addressing them. There are other technologies, such as traditional messaging queues, that integrate business systems similarly but ultimately don’t provide a complete solution to the problem. That is, they might enable two applications to queue messages to each other, but they lack the governance, quality controls, stream processing, and external technology integration framework necessary to be a complete platform. As you’ve seen so far through the discussion of the key challenges, the right platform to enable real-time business analytics with AI/ML needs to be complete in order to address as many of the key issues as possible.
Now that you understand some key challenges, here’s a quick introduction to Confluent Cloud before discussing how cloud-native data streaming in Confluent Cloud can address these challenges. Confluent Cloud provides a complete, cloud-native, and everywhere platform for Apache Kafka and runs on Kora, Confluent’s very own Apache Kafka engine built for the cloud. Confluent Cloud operates somewhere in the ballpark of 30k Kafka clusters, handles 3 trillion messages written per day, and processes more than an exabyte of data per year.
Here’s what Confluent Cloud offers as solutions to the challenges we’ve discussed, and why it’s often the real-time data infrastructure of choice for companies of every size, in every industry and line of business.
When it comes to data integrity and quality, Confluent provides capabilities to help with the main challenges of maintaining high-quality data. Each of these solutions below addresses an aspect of managing the integrity and quality of data in Confluent Cloud.
When addressing data completeness, Confluent provides solutions like fully managed Kafka Connectors to bring the data together from multiple systems. As previously discussed, it’s likely that the data you’re looking to use will come from a variety of sources and you’ll need to bring them all into one place. With hundreds of supported technologies within our connector ecosystem, there’s likely to be a connector you can use to capture the data in real time from these sources. Confluent provides a portfolio of these connectors as fully managed and self-service. They can be provisioned in minutes and require no management or administration.
Taking the example of a retailer, this could be capturing clickstream data from a time-series database, relational data from one or many relational databases, and/or other context from a SaaS platform or tool. Once you’ve identified your source systems, you can capture events from each with fully managed connectors and bring them together in one platform. With everything in one place, you can conduct analysis that spans multiple domains and contexts.
In the event that you need to connect to and consolidate data from systems where we don’t provide pre-built connectors, Confluent provides other options as well. When integrating proprietary applications or services, Kafka Clients serve as supported application libraries that can be integrated into almost any codebase, examples of which can be found here. In circumstances where all that is accessible is HTTP, Confluent offers the Confluent REST Proxy, a simple-to-deploy-and-manage tool that accepts HTTP requests and produces or consumes data to or from Confluent Cloud.
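To make the client path a little more concrete, here is a minimal sketch of a Java producer using the standard Apache Kafka client to write clickstream events into Confluent Cloud. The topic name, event payload, and placeholder connection settings are illustrative assumptions, not anything prescribed above:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickstreamProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder connection details -- substitute your Confluent Cloud
        // bootstrap server and API key/secret.
        props.put("bootstrap.servers", "<BOOTSTRAP_SERVER>");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"<API_KEY>\" password=\"<API_SECRET>\";");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A single clickstream event keyed by user ID; the "clickstream"
            // topic name is hypothetical.
            ProducerRecord<String, String> record = new ProducerRecord<>(
                "clickstream", "user-123",
                "{\"user_id\":\"user-123\",\"page\":\"/checkout\",\"ts\":1699999999}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Wrote to %s-%d@%d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```

In a real deployment you would likely produce structured records (Avro, Protobuf, or JSON Schema) rather than raw JSON strings, which is where Schema Registry, discussed below, comes in.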
When addressing data context, Confluent provides support for stream processing tools like fully managed ksqlDB, fully managed Apache Flink, and Kafka Streams in order to transform, unify, and clean data (among many other tasks). Once you’ve brought all the data together, you’ll need to transform the data into uniform formats, enrich some datasets with additional context from others, and make sure that all the data in the platform is high quality and ready for use. In the case of Flink specifically, Confluent’s fully managed Flink SQL service also provides functions and tools to integrate with OpenAI’s APIs to do things such as request embeddings, moderate text, get text from GPT-3/4 models, and more, all in-flight.
Using the example of the retailer again, imagine you’ve captured clickstream data from a time-series database and master data from a relational one. After performing simple transformations to clean up the data, you can enrich the clickstream data with customer master data by adding things like demographics. With the enriched clickstream events, you can perform deeper analysis.
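As one illustration of that enrichment step, the sketch below uses Kafka Streams (one of the stream processing options mentioned above) to join the clickstream against a customer master table keyed by user ID. The topic names, the JSON-string value format, and the omission of Confluent Cloud security settings (see the producer sketch earlier) are simplifying assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class ClickstreamEnricher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "clickstream-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "<BOOTSTRAP_SERVER>");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Clickstream events keyed by user ID (hypothetical topic names).
        KStream<String, String> clicks = builder.stream("clickstream");
        // Customer master data as a changelog-backed table, also keyed by user ID.
        KTable<String, String> customers = builder.table("customer-master");

        // Enrich each click with the matching customer record (e.g., demographics)
        // and write the result to a downstream topic for analysis.
        clicks.join(customers,
                (click, customer) -> "{\"click\":" + click + ",\"customer\":" + customer + "}")
              .to("clickstream-enriched");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The same join could be expressed declaratively in fully managed ksqlDB or Flink SQL; the Kafka Streams form is shown here simply because it is self-contained Java.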
When addressing data consistency, Confluent provides tools like fully managed Schema Registry and Stream Governance. When bringing together data from multiple sources, things like validation, compatibility, and evolution need to be factored in. Fully managed Schema Registry and Stream Governance provide you just that—a set of tools that can perform data validation before events are produced or consumed, that ensure compatibility of data between different applications, and that can evolve the data within these boundaries over time.
Once again using the example of the retailer, each data source captured is its own data set, and thus an area in which a schema can be enforced. Validation and compatibility provide value in scenarios where an upstream or downstream schema change would otherwise break another application that uses that data. When bringing multiple sources together (perhaps even from different teams or business units), this helps preserve independence without the risk of introducing changes that cause cascading failures. Using the example of the clickstream data and the customer master data, if one team modified or removed a necessary field (such as `user_id`), it could impact another team that consumes that data set and requires that field. With fully managed Schema Registry, this change could be rejected before it happens.
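Below is a rough sketch of what that guardrail can look like in practice with the Avro serializer and Schema Registry: the producer serializes click events against a registered schema that requires `user_id`, so an incompatible change to that schema would be refused by the registry before bad data reaches consumers. The schema, topic name, and placeholder credentials are illustrative assumptions, and the Kafka SASL settings are omitted for brevity:

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class GovernedClickProducer {
    // A simple Avro schema for click events; `user_id` is a required field,
    // so registering a new version without it would be rejected under a
    // backward-compatible subject policy.
    private static final String CLICK_SCHEMA =
        "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
        + "{\"name\":\"user_id\",\"type\":\"string\"},"
        + "{\"name\":\"page\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        // Kafka SASL settings omitted for brevity (see the producer sketch above).
        props.put("bootstrap.servers", "<BOOTSTRAP_SERVER>");
        props.put("schema.registry.url", "<SCHEMA_REGISTRY_URL>");
        props.put("basic.auth.credentials.source", "USER_INFO");
        props.put("basic.auth.user.info", "<SR_API_KEY>:<SR_API_SECRET>");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "io.confluent.kafka.serializers.KafkaAvroSerializer");

        Schema schema = new Schema.Parser().parse(CLICK_SCHEMA);
        GenericRecord click = new GenericData.Record(schema);
        click.put("user_id", "user-123");
        click.put("page", "/checkout");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            // The serializer validates the record against the registered schema
            // before the event ever lands on the topic.
            producer.send(new ProducerRecord<>("clickstream", "user-123", click));
            producer.flush();
        }
    }
}
```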
When it comes to performance and scalability, Confluent Cloud handles the main challenges right off the shelf. Each Kafka cluster deployed in Confluent Cloud provides solutions to help you scale your AI and ML workloads over time.
When addressing data volume and velocity, Confluent Cloud is one of the best solutions to handle massive data volumes with high velocity and low latency. When designing a scalable and performant data infrastructure for your AI and ML pipelines, consolidating data transfer and processing on Confluent Cloud will allow you to grow the volume of data and the complexity of your AI and ML usage over time without having to worry about large architecture changes. As previously introduced, Confluent Cloud runs on the proprietary Kora Engine, which boasts 10x performance over Apache Kafka. Put simply, Confluent Cloud was designed to address this exact concern.
When addressing infrastructure scaling, Confluent Cloud provides a few different options. Each of these options provides high enough ceilings with respect to capacity that in most cases the infrastructure can be considered “infinitely scalable,” since it can scale far beyond what most customers need. If you are interested in seeing the latest capacities as well as a comparison between the different available cluster types, view the latest documentation. This scalability in conjunction with the performance provided makes Confluent Cloud a strong choice for your data infrastructure when enabling real-time AI and ML data pipelines.
When addressing continuous training, a combination of most things already discussed (connectors, stream processing, Kafka, and Kafka Clients with their language-specific APIs) can simplify and consolidate AI and ML ops. Tasks like normalization, feature extraction, embedding, and clustering can be handled by independent, horizontally scalable services that prepare and process the data in real time and in order. Depending on the goals and outcomes, you can use these building blocks to design efficient AI and ML models that are trained continuously while still being available for use.
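As a purely illustrative sketch of that pattern, the consumer below performs an online stochastic gradient descent update on a toy linear model as feature-engineered events arrive. The topic name, CSV-style payload, and three-feature model are assumptions for the example; in practice you would use a real ML library and a schema-governed feature format:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OnlineTrainer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Connection and security settings abbreviated as in the earlier sketches.
        props.put("bootstrap.servers", "<BOOTSTRAP_SERVER>");
        props.put("group.id", "online-trainer");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        // A toy linear model: weights updated by stochastic gradient descent
        // as each enriched, feature-engineered event arrives.
        double[] weights = new double[3];
        double learningRate = 0.01;

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("features-enriched"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Assumed value format: "f1,f2,f3,label" produced by the
                    // upstream feature-engineering service.
                    String[] parts = record.value().split(",");
                    double[] x = new double[3];
                    for (int i = 0; i < 3; i++) x[i] = Double.parseDouble(parts[i]);
                    double label = Double.parseDouble(parts[3]);

                    // One SGD step on squared error: w <- w - lr * (pred - label) * x
                    double prediction = 0.0;
                    for (int i = 0; i < 3; i++) prediction += weights[i] * x[i];
                    double error = prediction - label;
                    for (int i = 0; i < 3; i++) weights[i] -= learningRate * error * x[i];
                }
            }
        }
    }
}
```

Because each such service is just another consumer group, you can scale training, feature extraction, and inference independently by adding partitions and instances.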
At the end of the day, your ability to derive insight and analysis from data for business analytics or business intelligence using AI/ML is limited by the quality of the data and your access to it. Confluent helps its customers achieve scale, integrity, and resilience in their data infrastructure, enabling them to focus on using the data rather than preparing or procuring it. As a quick summary of the solutions discussed, the following tools will help you build high-quality, highly scalable, and highly performant data infrastructure for your AI and ML pipelines:
Data governance with fully managed Confluent Schema Registry: Enabling ubiquitous data validation, compatibility, and evolution across all your data streams.
Stream processing with fully managed ksqlDB, Kafka Streams, and Flink: Enabling dynamic transformation, enrichment, and manipulation when creating or preparing data streams.
Data aggregation with fully managed connectors: Enabling integration with external systems to create data streams from your internal or external data sources.
Enterprise performance: Enough speed and capacity to handle any workload, at any size, anywhere.
Enterprise scalability: Enough scalability to grow with your business, dynamically or manually.
Throughout this blog, the example of a retailer has been used in order to describe Confluent Cloud technology in what is hopefully a familiar context. To leave you with parting thoughts and ideas for relating this blog to your own real-time business analytics use cases with AI/ML, check out these additional examples of how you can bring these concepts to your industry.
Objectives: Improve the quality of patient care and decrease hospital readmission through predictive analytics.
Data Sources: Electronic health records (EHR) stored in healthcare systems like Epic Systems, patient data and demographics from customer relationship management (CRM) systems like Salesforce, and historical data from data warehouses or data lakes.
Data Preparation/Processing: Prepare the data by combining EHRs with customer data from CRM and normalizing patient records. Then, process the data by conducting feature extraction and engineering, cleaning, and standardization so it’s ready for analysis.
Data Analysis: Build and train ML models like recurrent neural networks or gradient boosting continuously and in real time using the cleaned/processed historical data.
Outcomes: Identify patterns and circumstances that lead to hospital readmission and implement mitigation strategies to reduce costs and increase patient satisfaction.
Objectives: Detect and prevent fraud in real time to minimize financial loss and improve customer satisfaction and trust.
Data Sources: Change data capture of OLTP databases like Postgres or MySQL, credit card data and transactions from core banking systems, external fraud data from systems like Experian, and customer data and demographics from CRM systems like Salesforce.
Data Preparation/Processing: Prepare the data by combining and enriching things like credit card data with customer data from CRM systems and external fraud data. Then, process the data by standardizing, aggregating, and cleaning it before doing feature engineering for fraud modeling.
Data Analysis: Build and train deep neural networks to model things like credit transaction amount, location, and timing as they relate to user demographics, patterns, and more.
Outcomes: Detect and prevent fraud in real time before settling transactions, improving customer satisfaction and decreasing financial loss to both customers and the business.
Objectives: Optimize inventory and supply chain management to reduce costs and increase revenues.
Data Sources: Point of sale (POS) data, historical sales data, data from e-commerce platforms like Shopify, organizational master data, and supply chain and supplier data.
Data Preparation/Processing: Prepare the data by combining and enriching POS transaction data with things like store locations, demographics, and metadata. Then, process the data by standardizing, normalizing, and building a well-contextualized timeline of events for analysis.
Data Analysis: Using the timeline of contextualized events, apply real-time forecasting models like state space models to continuously update forecasts, allowing for real-time prediction using the latest data available.
Outcomes: Make the best purchasing, inventory, and supply decisions based on predictions that factor in all available data, reducing wasted inventory costs and the likelihood of out-of-stock events.
Objectives: Decrease customer churn and implement customer retention strategies.
Data Sources: Call detail records (CDRs), customer information and demographics in CRM systems like Salesforce, support interactions with online chat services, and billing and service usage data from internal platforms.
Data Preparation/Processing: Prepare the data by joining and enriching data where context and detail can be added, like enriching CDRs and chat transcripts with customer details and loyalty data. Then, process the data by cleaning and standardizing it before doing feature engineering to prepare it for analysis.
Data Analysis: Build neural networks using clean and contextualized data and train them continuously with stochastic gradient descent as new updates and data arrive.
Outcomes: Identify at-risk customers earlier, allowing intervention that can apply retention strategies and decrease churn.
Objectives: Reduce costs of maintenance, increase equipment reliability, and improve operational efficiency.
Data Sources: Sensor data from devices/machinery, part and product data from ERPs like SAP, and maintenance histories.
Data Preparation/Processing: Prepare the data by combining and enriching datasets that provide additional context and detail to each other, such as adding device/machinery details from an ERP to sensor data. Then, process the data by creating a timeline of well-contextualized events that is ready for time-series analysis.
Data Analysis: Train and apply online variants of models like random forests or gradient boosting, allowing you to ingest sensor data continuously and in real time while predicting likely maintenance events using the latest data available.
Outcomes: Increase the likelihood of predicting necessary maintenance, which both increases the quality of service of devices/machinery and decreases the rate of failures that can accumulate costs in a variety of ways.