
Debezium 0.9 Alpha1 and 0.8.1 Released

Just two weeks after the Debezium 0.8 release, I’m very happy to announce the release of Debezium 0.9.0.Alpha1!

The main feature of the new version is a first work-in-progress version of the long-awaited Debezium connector for MS SQL Server. Based on the CDC functionality available in the Enterprise and Standard editions, the new connector lets you stream data changes out of Microsoft’s popular RDBMS.

Besides that we’ve continued the work on the Debezium Oracle connector. Most notably, it supports initial snapshots of captured tables now. We’ve also upgraded Apache Kafka in our Docker images to 1.1.1 (DBZ-829).

Please take a look at the change log for the complete list of changes in 0.9.0.Alpha1 and general upgrade notes.

Note: At the time of writing (2018-07-26), the release artifacts (connector archives) are available on Maven Central. The Docker images for 0.9.0.Alpha1 have been uploaded to Docker Hub and are ready for use under the tags 0.9.0.Alpha1 and 0.9 (rolling).

SQL Server Connector

Support for SQL Server had been on the wish list of Debezium users for a long time (the original issue was DBZ-40). Thanks to lots of basic infrastructure created while working on the Oracle connector, we were finally able to come up with a first preview of this new connector in a comparatively short time.

Just like the Oracle connector, the one for SQL Server is under active development and should be considered an incubating feature at this point. For instance, the structure of emitted change messages may change in upcoming releases. In terms of features, it supports initial snapshotting and capturing changes via SQL Server’s CDC functionality. There’s support for the most common column types, table whitelisting/blacklisting and more. The most significant missing feature is support for structural changes to tables while the connector is running. This is the next feature we’ll work on, and it’s planned to be delivered as part of the next 0.9 release (see DBZ-812).

We’d be very happy to learn about any feedback you may have on this newest connector of the Debezium family. If you spot any bugs or have feature requests for it, please create a report in our JIRA tracker.

Oracle Connector

The Debezium connector for Oracle is able to take initial snapshots now. By means of the new connector option snapshot.mode you can control whether read events for all the records of all the captured tables should be emitted.

In addition, the support for numeric data types has been honed (DBZ-804): any integer column (i.e. NUMBER with a scale <= 0) will be emitted using the corresponding int8/int16/int32/int64 field type, if the column’s precision allows for that.
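
To illustrate the idea behind this mapping, here is a minimal sketch that picks the narrowest signed integer type able to hold every value of a NUMBER column with a given precision and scale. The function name and the thresholds are assumptions derived purely from the value ranges of signed integers; the connector’s exact rules may differ, so please refer to the connector documentation for the authoritative mapping.

```python
# Illustrative only: picks the narrowest signed integer type that can hold
# every value of an Oracle NUMBER(precision, scale) column with scale <= 0.
# The thresholds follow from the integer ranges; the connector's exact rule
# may differ.

SIGNED_WIDTHS = [("int8", 8), ("int16", 16), ("int32", 32), ("int64", 64)]


def integer_field_type(precision, scale=0):
    """Return a field type name, or None if no 64-bit integer type fits."""
    if scale > 0:
        return None  # fractional digits: not an integer column
    # A negative scale shifts digits to the left of the decimal point,
    # e.g. NUMBER(2, -3) can store 99000, which has 5 digits.
    digits = precision - scale
    largest = 10 ** digits - 1
    for name, bits in SIGNED_WIDTHS:
        if largest <= 2 ** (bits - 1) - 1:
            return name
    return None


for precision in (2, 4, 9, 18, 19):
    print(precision, integer_field_type(precision))
```

Under these assumptions, a column with a precision too large for 64 bits yields None, meaning it would have to be represented by some non-integer type.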

We’ve also spent some time on expanding the Oracle connector documentation, which covers the structure of emitted change events and all the data type mappings in detail now.

Debezium 0.8.1.Final

Together with Debezium 0.9.0.Alpha1 we also did another release of the current stable Debezium version 0.8.

While 0.9 at this point is more interesting to those eager to try out the latest developments in the Oracle and SQL Server connectors, 0.8.1.Final is a recommended upgrade especially to the users of the Postgres connector. This release fixes an issue where it could happen that WAL segments on the server were retained longer than necessary, in case only records of non-whitelisted tables changed for a while. This has been addressed by means of supporting heartbeat messages (as already known from the MySQL connector) also for Postgres (DBZ-800). This lets the connector regularly commit offsets to Kafka Connect which also serves as the hook to acknowledge processed LSNs with the Postgres server.
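
As a rough sketch, a Postgres connector registration with heartbeats enabled could look like the following. The host names, credentials, table names and the interval are made-up placeholders, and the option name heartbeat.interval.ms is given as I understand it from the connector documentation; please verify it against the docs of the version you are running.

```python
import json

# Hypothetical connection settings; replace them with your own.
postgres_connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "inventory",
        "database.server.name": "dbserver1",
        # Only changes to this table are captured.
        "table.whitelist": "public.orders",
        # Emit a heartbeat every 10 seconds so that offsets are committed
        # even while only non-whitelisted tables change.
        "heartbeat.interval.ms": "10000",
    },
}

# The resulting JSON document is what you would send to Kafka Connect.
print(json.dumps(postgres_connector, indent=2))
```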

You can find the list of all changes done in Debezium 0.8.1.Final in the change log.

What’s next?

As discussed above, we’ll work on supporting structural changes to captured tables while the SQL Server connector is running. The same applies to the Oracle connector. This will require some work on our DDL parsers, but thanks to the foundations provided by our recent migration of the MySQL DDL parser to Antlr, this should be manageable.

The other big focus of work will be to provide an alternative implementation for getting changes from Oracle which isn’t based on the XStream API. We’ve done some experiments with LogMiner and are also actively exploring further alternatives. While some details are still unclear, we are optimistic that we’ll have something to release in this area soon.

If you’d like to learn more about some mid- and long-term ideas, please check out our roadmap. Also, please get in touch with us if you have any ideas or suggestions for future development.

About Debezium

Debezium is an open source distributed platform that turns your existing databases into event streams, so applications can see and respond almost instantly to each committed row-level change in the databases. Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, so your application can be stopped and restarted at any time and can easily consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely. Debezium is open source under the Apache License, Version 2.0.

Get involved

We hope you find Debezium interesting and useful, and want to give it a try. Follow us on Twitter @debezium, chat with us on Gitter, or join our mailing list to talk with the community. All of the code is open source on GitHub, so build the code locally and help us improve our existing connectors and add even more connectors. If you find problems or have ideas for how we can improve Debezium, please let us know or log an issue.


Five Advantages of Log-Based Change Data Capture

Yesterday I had the opportunity to present Debezium and the idea of change data capture (CDC) to the Darmstadt Java User Group. It was a great evening with lots of interesting discussions and questions. One of the questions was the following: what is the advantage of using a log-based change data capturing tool such as Debezium over simply polling for updated records?

So first of all, what’s the difference between the two approaches? With polling-based (or query-based) CDC you repeatedly run queries (e.g. via JDBC) for retrieving any newly inserted or updated rows from the tables to be captured. Log-based CDC in contrast works by reacting to any changes to the database’s log files (e.g. MySQL’s binlog or MongoDB’s op log).

As this wasn’t the first time this question came up, I thought I could provide a more extensive answer also here on the blog. That way I’ll be able to refer to this post in the future, should the question come up again :)

So without further ado, here’s my list of five advantages of log-based CDC over polling-based approaches.

All Data Changes Are Captured

By reading the database’s log, you get the complete list of all data changes in their exact order of application. This is vital for many use cases where you are interested in the complete history of record changes. In contrast, with a polling-based approach you might miss intermediary data changes that happen between two runs of the poll loop. For instance it could happen that a record is inserted and deleted between two polls, in which case this record would never be captured by poll-based CDC.

Related to this is the aspect of downtimes, e.g. when updating the CDC tool. With poll-based CDC, only the latest state of a given record would be captured once the CDC tool is back online, missing any earlier changes to the record that occurred during the downtime. A log-based CDC tool will be able to resume reading the database log from the point where it left off before it was shut down, causing the complete history of data changes to be captured.
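
The effect of missed intermediary changes can be reproduced with a small, self-contained sketch. It uses an in-memory SQLite database and a hypothetical customers table with a version column as the change indicator (a stand-in for whatever marker a real polling setup would use); none of this is Debezium code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)"
)


def poll(last_seen_version):
    """Return rows changed since the previous poll, ordered by version."""
    return conn.execute(
        "SELECT id, name, version FROM customers WHERE version > ? ORDER BY version",
        (last_seen_version,),
    ).fetchall()


# Changes before the first poll.
conn.execute("INSERT INTO customers VALUES (1, 'Anne', 1)")
first = poll(0)

# Between two polls: customer 1 is renamed twice, and customer 2 is
# inserted and deleted again.
conn.execute("UPDATE customers SET name = 'Anne K.', version = 2 WHERE id = 1")
conn.execute("UPDATE customers SET name = 'Anne Kretchmar', version = 3 WHERE id = 1")
conn.execute("INSERT INTO customers VALUES (2, 'Bob', 4)")
conn.execute("DELETE FROM customers WHERE id = 2")
second = poll(first[-1][2])

print(first)   # the initial insert
print(second)  # only the final state of customer 1
```

The second poll returns a single row for customer 1 with its final name. The intermediate name 'Anne K.' and the short-lived customer 2 never show up, although four statements were executed between the two polls.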

Low Delays of Events While Avoiding Increased CPU Load

With polling, you might be tempted to increase the frequency of polling attempts in order to reduce the chances of missing intermediary updates. While this works to some degree, polling too frequently may cause performance issues (as the queries used for polling cause load on the source database). On the other hand, expanding the polling interval will reduce the CPU load but may not only result in missed change events but also in a longer delay for propagating data changes. Log-based CDC allows you to react to data changes in near real-time without paying the price of spending CPU time on running polling queries repeatedly.

No Impact on Data Model

Polling requires some indicator to identify those records that have been changed since the last poll. So all the captured tables need to have some column like LAST_UPDATE_TIMESTAMP which can be used to find changed rows. This can be fine in some cases, but in others such a requirement might not be desirable. Specifically, you’ll need to make sure that the update timestamps are maintained correctly on all tables to be captured, either by the writing applications or e.g. through triggers.

Can Capture Deletes

Naturally, polling will not allow you to identify any records that have been deleted since the last poll. Often times that’s a problem for replication-like use cases where you’d like to have an identical data set on the source database and the replication targets, meaning you’d also like to delete records on the sink side if they have been removed in the source database.

Can Capture Old Record State And Further Meta Data

Depending on the source database’s capabilities, log-based CDC can provide the old record state for update and delete events, whereas with polling you’ll only get the current row state. Having the old row state handy in a single change event can be interesting for many use cases, e.g. if you’d like to display the complete data change with old and new column values to an application user for auditing purposes.
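
For the auditing example, here is a minimal sketch of how a consumer could derive the changed columns from a single change event. The event below is a hand-written, simplified sample with before and after states; real events carry more fields, and the helper name changed_columns is made up for this illustration.

```python
# Hand-written sample payload of an update event; real events carry
# more fields (e.g. source metadata) which are omitted here.
event = {
    "op": "u",
    "before": {"id": 1004, "first_name": "Anne", "email": "annek@noanswer.org"},
    "after": {"id": 1004, "first_name": "Anne Marie", "email": "annek@noanswer.org"},
}


def changed_columns(payload):
    """Map each changed column to its (old, new) value pair."""
    before = payload.get("before") or {}
    after = payload.get("after") or {}
    return {
        column: (before.get(column), after.get(column))
        for column in sorted(before.keys() | after.keys())
        if before.get(column) != after.get(column)
    }


print(changed_columns(event))
```

With a polling-based approach only the after part would be available, so the old value would have to be looked up from some separately maintained copy of the data.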

In addition, log-based approaches often can provide streams of schema changes (e.g. in the form of applied DDL statements) and expose additional metadata such as transaction ids or the user applying a certain change. While these things may generally be doable with query-based approaches, too (depending on the capabilities of the database), I haven’t really seen it being done in practice.

Summary

And that’s it, five advantages of log-based change data capture. Note that this is not to say that polling-based CDC doesn’t have its applications. If for instance your use case can be satisfied by propagating changes once per hour and it’s not a problem to miss intermediary versions of records that were valid in between, it can be perfectly fine.

But if you’re interested in capturing data changes in near real-time, making sure you don’t miss any change events (including deletions), then I’d very much recommend exploring the possibilities of log-based CDC as enabled by Debezium. The Debezium connectors do all the heavy lifting for you, i.e. you don’t have to deal with all the low-level specifics of the individual databases and the means of getting changes from their logs. Instead, you can consume the generic and largely unified change data events produced by Debezium.


Debezium 0.8 Final Is Released

I’m very happy to announce the release of Debezium 0.8.0.Final!

The key features of Debezium 0.8 are the first work-in-progress version of our Oracle connector (based on the XStream API) and a brand-new parser for MySQL DDL statements. Besides that, there are plenty of smaller new features (e.g. propagation of default values to corresponding Connect schemas, optional propagation of source queries in CDC messages and a largely improved SMT for sinking changes from MongoDB into RDBMS) as well as lots of bug fixes (e.g. around temporal and numeric column types, large transactions with Postgres).

Please see the previous announcements (Beta 1, CR 1) to learn about all the changes in more depth. The Final release largely resembles CR1; apart from further improvements to the Oracle connector (DBZ-792) there’s one nice addition to the MySQL connector contributed by Peter Goransson: when doing a snapshot, it will now expose information about the processed rows via JMX (DBZ-789), which is very handy when snapshotting larger tables.

Please take a look at the change log for the complete list of changes in 0.8.0.Final and general upgrade notes.

What’s next?

We’re continuing our work on the Oracle connector. The work on initial snapshotting is progressing well and it should be part of the next release. Other improvements will be support for structural changes to captured tables after the initial snapshot has been made, more extensive source info metadata and more. Please track DBZ-716 for this work; the improvements are planned to be released incrementally in the upcoming versions of Debezium.

We’ve also started to explore ingesting changes via LogMiner. This is more involved in terms of engineering efforts than using XStream, but it comes with the huge advantage of not requiring a separate license (LogMiner comes with the Oracle database itself). It’s not quite clear yet when we can release something on this front, and we’re also actively exploring further alternatives. But we are quite optimistic and hope to have something some time soon.

The other focus of work is a connector for SQL Server (see DBZ-40). Work on this has started as well, and there should be an Alpha1 release of Debezium 0.9 with a first drop of that connector within the next few weeks.

To find out about some more long-term ideas, please check out our roadmap, and get in touch with us if you have any ideas or suggestions for future development.


Debezium 0.8.0.CR1 Is Released

A fantastic Independence Day to all the Debezium users in the U.S.! But that’s not the only reason to celebrate: it’s also with great happiness that I’m announcing the release of Debezium 0.8.0.CR1!

Following our new release scheme, the focus for this candidate release of Debezium 0.8 has been to fix bugs reported for last week’s Beta release, accompanied by a small number of newly implemented features.

Thanks a lot to everyone testing the new Antlr-based DDL parser for the MySQL connector; based on the issues you reported, we were able to fix a few bugs in it. As announced recently, for 0.8 the legacy parser will remain the default implementation, but you are strongly encouraged to test out the new one (by setting the connector option ddl.parser.mode to antlr) and report any findings you may have. We’ve planned to switch to the new implementation by default in Debezium 0.9.

In terms of new features, the CR1 release brings support for CITEXT columns in the Postgres connector (DBZ-762). All the relational connectors can now convey the original name and length of captured columns using schema parameters in the emitted change messages (DBZ-644). This can come in handy to properly size columns in a sink database for types such as VARCHAR.
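
To give an idea of how a sink could make use of such schema parameters, here is a small sketch that sizes a target column based on them. The field schema is hand-written, and the parameter keys as well as the helper sink_column_ddl are assumptions made for this illustration; please check the connector documentation for the exact parameter names and for how to enable their propagation.

```python
# Hand-written field schema as a sink might see it. The parameter keys
# below are assumptions for illustration; check the connector
# documentation for the exact names used by your Debezium version.
field_schema = {
    "field": "email",
    "type": "string",
    "parameters": {
        "__debezium.source.column.type": "VARCHAR",
        "__debezium.source.column.length": "255",
    },
}


def sink_column_ddl(schema, default_type="TEXT"):
    """Build a column definition, sized from the source column if known."""
    params = schema.get("parameters") or {}
    source_type = params.get("__debezium.source.column.type")
    length = params.get("__debezium.source.column.length")
    if source_type and length:
        return f'{schema["field"]} {source_type}({int(length)})'
    return f'{schema["field"]} {source_type or default_type}'


print(sink_column_ddl(field_schema))
```

Without the parameters, the sketch falls back to an unbounded text type, which is roughly the situation a sink is in when only the Connect type string is known.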

Thanks a lot to the following community members who contributed to this release: Andreas Bergmeier, Olavi Mustanoja and Orr Ganani.

Please take a look at the change log for the complete list of changes in 0.8.0.CR1 and general upgrade notes.

What’s next?

Barring any unforeseen issues and critical bug reports, we’ll release Debezium 0.8.0.Final next week.

Once that’s out, we’ll continue work on the Oracle connector (e.g. exploring alternatives to using XStream for ingesting changes from the database as well as initial snapshotting), which remains a "tech preview" component as of 0.8.

We’ll also work towards a connector for SQL Server (see DBZ-40). The first steps were made just today by preparing a Docker-based setup with a CDC-enabled SQL Server instance, which will allow us to implement and test the connector.

To find out about some more long-term ideas, please check out our roadmap, and get in touch with us if you have any ideas or suggestions for future development.


Debezium 0.8.0.Beta1 Is Released

It’s with great excitement that I’m announcing the release of Debezium 0.8.0.Beta1!

This release brings many exciting new features as well as bug fixes, e.g. the first drop of our new Oracle connector, a brand new DDL parser for the MySQL connector, support for MySQL default values and the update to Apache Kafka 1.1.

Due to the large number of changes (the release contains exactly 42 issues overall), we decided to alter our versioning scheme a little bit: going forward we may do one or more Beta and CR ("candidate release") releases before doing a final one. This will allow us to get feedback from the community early on, while still completing and polishing specific features. Final (stable) releases will be named like 0.8.0.Final etc.

This release would not have been possible without our outstanding community; a huge "thank you" goes out to the following open source enthusiasts who all contributed to the new version: Echo Xu, Ivan Vucina, Listman Gamboa, Omar Al-Safi, Peter Goransson, Roman Kuchar (who did a tremendous job with the new DDL parser implementation!), Sagar Rao, Saulius Valatka, Sairam Polavarapu, Stephen Powis and WenZe Hu.

Thank you all very much for your help!

Now let’s take a closer look at some of the features new in Debezium 0.8.0.Beta1; as always, you can find the complete list of changes of this release in the change log. Please take a special look at the breaking changes and the upgrade notes.

XStream-based Oracle Connector (Tech Preview)

Support for a Debezium Oracle connector has been one of the most asked for features for a long time (its original issue number is DBZ-20!). So we are very happy that we eventually can release a first work-in-progress version of that connector. At this point this code is still very much evolving, so it should be considered as a first tech preview. This means it’s not feature complete (most notably, there’s no support for initial snapshots yet), the emitted message format may still change etc. So while we don’t recommend using it in production quite yet, you should definitely give it a try and report back about your experiences.

One challenge for the Oracle connector is how to get the actual change events out of the database. Unlike with MySQL and Postgres, there’s unfortunately no free-to-use and easy-to-work-with API which would allow us to do the same for Oracle. After some exploration we decided to base this first version of the connector on the Oracle XStream API. While this (kinda) checks the box for "easy-to-work-with", it doesn’t do so for "free-to-use": using this API requires you to have a license for Oracle’s separate GoldenGate product. We’re fully aware that this is not ideal, but we decided to still go this route as a first step, allowing us to gain some experience with Oracle and also get a connector into the hands of those who have the required license. Going forward, we are going to explore alternative approaches. We already have some ideas and discussions around this, so please stay tuned (the issue to track is DBZ-137).

The Oracle connector is going to evolve within the next 0.8.x releases. To learn more about it, please check its connector documentation page.

Antlr-based MySQL DDL Parser

In order to build up an internal meta-model of the captured database’s structure, the Debezium MySQL connector needs to parse all issued DDL statements (CREATE TABLE etc.). This used to be done with a hand-written DDL parser which worked reasonably well, but over time it also revealed some shortcomings; as the DDL language is quite extensive, we repeatedly saw bug reports caused by specific DDL constructs not being parseable.

So we decided to go back to the drawing board and came up with a brand new parser design. Thanks to the great work of Roman Kuchar, we now have a completely new DDL parser which is based on the proven and very mature Antlr parser generator (luckily, the Antlr project provides a complete MySQL grammar). So we should see far fewer issue reports related to DDL parsing going forward.

For the time being, the old parser is still in place and remains the default parser for Debezium 0.8.x. You are strongly encouraged, though, to test the new implementation by setting the connector option ddl.parser.mode to antlr and to report back if you run into any issues doing so. We plan to improve and polish the Antlr parser during the 0.8.x release line (specifically we’re going to measure its performance and optimize as needed) and switch to it by default as of Debezium 0.9. Eventually, the old parser will be removed in a future release after that.
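
As a sketch, opting in to the new parser comes down to one additional entry in the connector configuration. All connection settings below (hosts, credentials, server id, topic names) are placeholders in the style of a typical MySQL connector setup; only ddl.parser.mode is the option discussed here.

```python
import json

# Hypothetical connection settings; only "ddl.parser.mode" is the
# option discussed above.
mysql_connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "dbz",
        "database.server.id": "184054",
        "database.server.name": "dbserver1",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.inventory",
        # Opt in to the new Antlr-based parser; the legacy parser
        # remains the default in 0.8.x.
        "ddl.parser.mode": "antlr",
    },
}

# The resulting JSON document is what you would send to Kafka Connect.
print(json.dumps(mysql_connector, indent=2))
```

Removing the ddl.parser.mode entry again brings back the legacy parser, which makes it easy to compare the behavior of both implementations against the same database.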

Further MySQL Connector Changes

The MySQL Connector propagates column default values to corresponding Kafka Connect schemas now (DBZ-191). That’s beneficial when using Avro as serialization format and the schema registry with compatibility checking enabled.

By setting the include.query connector option to true, you can add the original query that caused a data change to the corresponding CDC events (DBZ-706). While disabled by default, this feature can be a useful tool for analyzing and interpreting data changes captured with Debezium.

Some other changes in the MySQL connector include configurability of the heartbeat topic name (DBZ-668), fixes around timezone handling for TIMESTAMP (DBZ-578) and DATETIME columns (DBZ-741) and correct handling of NUMERIC column without an explicit scale value (DBZ-727).

Postgres Connector

The Debezium Connector for Postgres has seen quite a number of bugfixes, including the following ones:

  • wal2json can handle transactions now that are bigger than 1 GB (DBZ-638)

  • the transaction ID is consistently handled as long now (DBZ-673)

  • multiple fixes related to temporal column types (DBZ-681, DBZ-696)

  • OIDs are handled correctly as unsigned int now (DBZ-697, DBZ-701)

MongoDB Connector

Also for the MongoDB Connector, a number of small features and bugfixes have been implemented:

  • Tested against MongoDB 3.6 (DBZ-529)

  • Nested documents can be flattened using a provided SMT now (DBZ-561), which is useful when sinking changes from MongoDB into a relational database

  • The unwrapping SMT can be used together with Avro now (DBZ-650)

  • The unwrapping SMT can handle arrays with mixed element types (DBZ-649)

  • When interrupted during snapshotting before completion, the connector will redo the snapshot after restarting (DBZ-712)
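
To illustrate what flattening of nested documents means (see DBZ-561 in the list above), here is a plain Python sketch of the general idea. It is not the SMT’s implementation; the function name, the delimiter and the sample document are assumptions chosen for this example.

```python
def flatten(document, delimiter="_", prefix=""):
    """Turn nested dictionaries into a single level of delimited keys."""
    flat = {}
    for key, value in document.items():
        name = f"{prefix}{delimiter}{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into the nested document, carrying the path so far.
            flat.update(flatten(value, delimiter, name))
        else:
            flat[name] = value
    return flat


customer = {"id": 1, "address": {"city": "Darmstadt", "geo": {"lat": 49.87}}}
print(flatten(customer))
```

Each key of the flattened result can be mapped to one column of a table, which is what makes this representation convenient for relational sinks.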

What’s next?

As per the new Beta/CR/Final release scheme, we hope to get some feedback from the community (i.e. you :) on this Beta release. Depending on the number of issues reported, we’ll either release another Beta or go to CR1 with the next version. The 0.8.0.Final version will be released within a few weeks. Note that the Oracle connector will remain a "tech preview" component also in the final version.

After that, we’ve planned to do a few 0.8.x releases with bug fixes mostly, while work on Debezium 0.9 will commence in parallel. For that we’ve planned to work on a connector for SQL Server (see DBZ-40). We’d also like to explore means of creating consistent materializations of joins from multiple tables' CDC streams, based on the ids of originating transactions. Also there’s the idea and a first prototype of exposing Debezium change events as a reactive event stream (DBZ-566), which might be shipped eventually.

Please take a look at the roadmap for some more long-term ideas, and get in touch with us if you have thoughts around that.
