[Webinar] Kafka + Disaster Recovery: Are You Ready? | Register Now
Change data capture (CDC) refers to the process of tracking all changes in data sources, such as databases and data warehouses, so they can be captured in destination systems. CDC methods like transaction logs and database triggers allow organizations to achieve data integrity and consistency across multiple systems and deployment environments. Additionally, they allow organizations to use the right tool for the right job by moving data from legacy databases to purpose-built data platforms, such as document or search databases, or data warehouses.
Initially, CDC became popular as an alternative solution to batch data replication for populating data warehouses for Extract Transform Load (ETL) jobs. In recent years, CDC has become the de facto method for migrating to the cloud.
Here are some of the main benefits of CDC:
Data is constantly changing, which can cause databases, data lakes, and data warehouses to be out of sync. Additionally, organizations are increasingly migrating to the cloud, which increases the chances of ending up with data silos. CDC has become a popular solution to bridge on-premises and cloud environments, allowing enterprises to migrate to the cloud at their own pace, or continue to operate in a hybrid environment.
See Confluent’s Oracle CDC Premium Connector in action here.
When data is changed through INSERT, UPDATE, or DELETE in a source database—which is usually a relational database such as MySQL, Microsoft SQL, Oracle, or PostgreSQL—it needs to be propagated to downstream systems such as caches, search indexes, data warehouses, and data lakes. Change data capture (CDC) tracks changes in a source dataset and transfers those changes to a target dataset.
Traditionally, teams used batch processing to sync data, which meant data was not synced right away, production databases were slowed down to allocate resources, and data replication only occurred during specified batch windows. CDC ensures that changes are synced in real-time, which eliminates the delays associated with traditional batch processing. CDC continuously tracks changes and updates target databases immediately, ensuring data is always up to date.
CDC is typically implemented using two main approaches: push and pull: Either the source database pushes the updates to downstream services and applications, or downstream services and applications poll the source database at fixed intervals to pull the updated data. Each approach has its own advantages and disadvantages. It’s important to consider all of these aspects in the context of your own use case.
Push: In this approach, the source database does the heavy lifting. It captures changes in the database and sends those updates to target systems so they can take appropriate action.
The advantage of this method is that target systems are updated with the latest data in near real-time. However, if the target systems are unreachable or offline, the changed data can be lost.
To mitigate these risks, a messaging system is typically implemented between the source and target systems to buffer changes until they can be committed to their final destinations.
Pull: In this method, the source database’s job is lighter than the push method. Instead of actively sending updates, the source database logs data changes in a specific column on each table. It is the target systems’ responsibility to continuously poll the source database to retrieve the changes and take the right actions on them.
Like in the push method, a messaging system is required between the source and target systems, in order to ensure that the changed data isn’t lost when the target systems are unavailable.
The disadvantage of the pull approach is that target systems won’t be immediately notified if the data changes. Since the changes are batched between pull requests, there is a lag before the target systems will learn about these changes.
If your application needs real-time data, you should use the push method: it ensures the immediate propagation of changes. However, it requires robust messaging systems to handle the potential downtime of target systems.
If you want to continuously transfer big batches of data, and your applications don’t need real-time data, you should opt for the pull method. However, there will be a lag due to batch processing.
CDC detects changes in data using different methods. Following are the most commonly used methods:
Timestamp-based: e most recent change. This column can be called LAST_MODIFIED, LAST_UPDATED, etc. Downstream applications or systems can query this field and get the records that have been updated since the last execution time.
Pros:
Cons:
An example is shown of a snapshot of the table when a new record is created and a snapshot of the table when the record with ID=101 is updated.
Trigger-based: Most databases support trigger functions. These are stored procedures that are automatically executed once a specific event—such as to INSERT, UPDATE, or DELETE a record—occurs on a table. One trigger for each operation per table is needed to capture any data changes. These data changes are stored in a separate table (commonly referred to as a “shadow table” or “event table”) in the same database. Additionally, developers can include messaging systems so that these data changes can be published to queues, where relevant target systems subscribe to them.
Pros:
Cons:
Log-based: Transactional databases log all changes—INSERT, UPDATE, and DELETE— committed against the database and their corresponding timestamps into files called transaction logs. These logs are primarily used for backup and disaster-recovery purposes, but they can also be used to propagate changes to target systems. Data changes are captured in real time. Since target systems can read from transaction logs, this method doesn’t impose computational overhead on source databases.
Pros:
Cons:
Change Data Capture (CDC) is used across modern data integration systems to ensure compliance and enable real-time analytics. By continuously capturing and replicating data changes, CDC supports seamless data migration, integration with microservices, and cloud adoption.
There are many use cases for CDC. Let’s review a few examples below.
When copying an entire source database to a destination system in batch mode, the source database cannot accept new writes, including schema changes, until the process is complete. The longer the copying process, the greater the risk of delaying important changes to the source. There is also a greater potential for further delays in passing the changes to the target.
Neither of these scenarios is acceptable, because consumers demand real-time experiences from modern applications. CDC addresses these challenges by continuously replicating changed data (a subset of the entire database) to downstream consumers.
As organizations continue to break down their monolithic architectures and adopt microservices, they need to transfer data from source databases, and possibly direct it to more than one destination system. Since these transitions take time, CDC can be used to keep both source and target data stores synchronized during this process.
Organizations are increasingly migrating to the cloud to reduce TCO and improve agility and elasticity. By leveraging cloud-native services, companies can focus on building new digital experiences rather than spending time and resources configuring, maintaining, and managing their databases and infrastructure.
CDC supports this migration and ensures that data remains consistent and up-to-date across on-premises and cloud environments. This seamless data integration helps businesses fully leverage cloud capabilities without disruption.
CDC allows you to capture data changes in source databases, but you still need to communicate and propagate these changes to downstream systems. This is where Apache Kafka and Kafka Connect shine. Apache Kafka is an open-source event streaming platform that can durably write and store streams of events, and process them in real-time or retrospectively. Kafka is a distributed system of servers and clients that provides reliable and scalable performance.
The Kafka Connect API is a core component of Apache Kafka, introduced in version 0.9. Kafka Connect provides scalable and resilient integration for Kafka with other systems, both sending to and receiving data from them. Kafka Connect is configuration-driven, meaning that you don’t need to write any code to use it. It is driven purely by configuration files and provides an easy integration point for developers.
One of the popular use cases for Kafka Connect is database change data capture. You can leverage Confluent’s JDBC or Debezium CDC connectors to integrate Kafka with your database, and easily stream your data into Confluent.
The JDBC connector polls source databases for new or changed data based on update timestamp columns. Confluent Cloud offers fully-managed source connectors for Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
Additionally, Confluent offers log-based CDC source connectors for Microsoft SQL Server, PostgreSQL, MySQL, and Oracle. These connectors obtain a snapshot of existing data at startup time, and then monitor and record all subsequent row-level changes.
Learn how to build streaming data pipelines with Apache Kafka and Confluent with this free on-demand course.
Confluent’s fully-managed CDC source connectors for MySQL, Oracle, Microsoft SQL Server, and PostgreSQL simplify streaming data pipelines. These connectors help you easily migrate data across different environments, meeting you everywhere your data systems live. Our rich ecosystem of 120+ pre-built source and sink connectors makes it easy to connect multiple data systems, and keep them in sync throughout the migration. Furthermore, you can reduce your total cost of ownership using our fully managed, cloud-native services with no operational overhead or prohibitive licensing costs.