Yesterday I had the opportunity to present Debezium and the idea of change data capture (CDC) to the Darmstadt Java User Group. It was a great evening with lots of interesting discussions and questions. One of the questions being the following: what is the advantage of using a log-based change data capturing tool such as Debezium over simply polling for updated records?
So first of all, what’s the difference between the two approaches? With polling-based (or query-based) CDC you repeatedly run queries (e.g. via JDBC) for retrieving any newly inserted or updated rows from the tables to be captured. Log-based CDC in contrast works by reacting to any changes to the database’s log files (e.g. MySQL’s binlog or MongoDB’s op log).
As this wasn’t the first time this question came up, I thought I could provide a more extensive answer also here on the blog. That way I’ll be able to refer to this post in the future, should the question come up again :)
So without further ado, here’s my list of five advantages of log-based CDC over polling-based approaches.
- All Data Changes Are Captured
By reading the database’s log, you get the complete list of all data changes in their exact order of application. This is vital for many use cases where you are interested in the complete history of record changes. In contrast, with a polling-based approach you might miss intermediary data changes that happen between two runs of the poll loop. For instance it could happen that a record is inserted and deleted between two polls, in which case this record would never be captured by poll-based CDC.
Related to this is the aspect of downtimes, e.g. when updating the CDC tool. With poll-based CDC, only the latest state of a given record would be captured once the CDC tool is back online, missing any earlier changes to the record that occurred during the downtime. A log-based CDC tool will be able to resume reading the database log from the point where it left off before it was shut down, causing the complete history of data changes to be captured.
- Low Delays of Events While Avoiding Increased CPU Load
With polling, you might be tempted to increase the frequency of polling attempts in order to reduce the chances of missing intermediary updates. While this works to some degree, polling too frequently may cause performance issues (as the queries used for polling cause load on the source database). On the other hand, expanding the polling interval will reduce the CPU load but may not only result in missed change events but also in a longer delay for propagating data changes. Log-based CDC allows you to react to data changes in near real-time without paying the price of spending CPU time on running polling queries repeatedly.
- No Impact on Data Model
Polling requires some indicator to identify those records that have been changed since the last poll. So all the captured tables need to have some column like
LAST_UPDATE_TIMESTAMPwhich can be used to find changed rows. This can be fine in some cases, but in others such requirement might not be desirable. Specifically, you’ll need to make sure that the update timestamps are maintained correctly on all tables to be captured by the writing applications or e.g. through triggers.
- Can Capture Deletes
Naturally, polling will not allow you to identify any records that have been deleted since the last poll. Often times that’s a problem for replication-like use cases where you’d like to have an identical data set on the source database and the replication targets, meaning you’d also like to delete records on the sink side if they have been removed in the source database.
- Can Capture Old Record State And Further Meta Data
Depending on the source database’s capabilities, log-based CDC can provide the old record state for update and delete events. Whereas with polling, you’ll only get the current row state. Having the old row state handy in a single change event can be interesting for many use cases, e.g. if you’d like to display the complete data change with old and new column values to an application user for auditing purposes.
In addition, log-based approaches often can provide streams of schema changes (e.g. in form of applied DDL statements) and expose additional metadata such as transaction ids or the user applying a certain change. These things may generally be doable with query-based approaches, too (depending on the capabilities of the database), I haven’t really seen it being done in practice, though.
And that’s it, five advantages of log-based change data capture. Note that this is not to say that polling-based CDC doesn’t have its applications. If for instance your use case can be satisfied by propagating changes once per hour and it’s not a problem to miss intermediary versions of records that were valid in between, it can be perfectly fine.
But if you’re interested in capturing data changes in near real-time, making sure you don’t miss any change events (including deletions), then I’d recommend very much to explore the possibilities of log-based CDC as enabled by Debezium. The Debezium connectors do all the heavy-lifting for you, i.e. you don’t have to deal with all the low-level specifics of the individual databases and the means of getting changes from their logs. Instead, you can consume the generic and largely unified change data events produced by Debezium.
Gunnar is a software engineer at Decodable and an open-source enthusiast by heart. He has been the project lead of Debezium over many years. Gunnar has created open-source projects like kcctl, JfrUnit, and MapStruct, and is the spec lead for Bean Validation 2.0 (JSR 380). He’s based in Hamburg, Germany.
Debezium is an open source distributed platform that turns your existing databases into event streams, so applications can see and respond almost instantly to each committed row-level change in the databases. Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, so your application can be stopped and restarted at any time and can easily consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely. Debezium is open source under the Apache License, Version 2.0.
We hope you find Debezium interesting and useful, and want to give it a try. Follow us on Twitter @debezium, chat with us on Zulip, or join our mailing list to talk with the community. All of the code is open source on GitHub, so build the code locally and help us improve ours existing connectors and add even more connectors. If you find problems or have ideas how we can improve Debezium, please let us know or log an issue.