Debezium 2.2.0.Alpha3 Released

Breaking Changes

We typically try to avoid any breaking changes, even during minor releases such as this; however, sometimes breaking changes are inevitable given the circumstances. Debezium 2.2.0.Alpha3 includes one breaking change:

PostgreSQL zoned date-time data types truncated

PostgreSQL zoned date-time data types truncated

It was identified (DBZ-6163) that PostgreSQL time zone based column values whose millisecond and microsecond parts were zero (0) were being serialized incorrectly: the resulting string omitted the millisecond and microsecond portions of the time instead of padding them with zeroes.

This does not create any data loss!

What’s important to note is that prior to this release, consumers evaluating the values of such columns had to be prepared to parse these string-based time values without a milli or microsecond part. In effect, events had an inconsistent pattern: some included the milli and microsecond portions, while others omitted them when the source value had 0 milliseconds or 0 microseconds.

Going forward, these string-based time values will be emitted consistently, padded with zeroes (0) for the milli and microsecond parts of the string-based time, even when the source value has neither milliseconds nor microseconds.
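Consumers that still need to read events produced by older connector versions may want a parser that tolerates both shapes. The following is a minimal sketch, not part of Debezium: it assumes the values are ISO-8601 strings such as 2023-03-08T10:15:30Z or 2023-03-08T10:15:30.000000Z (illustrative values, not taken from the release), and parse_zoned_timestamp is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed shape: date, 'T', time, optional fractional seconds, then 'Z' or an offset.
_ZONED = re.compile(
    r"^(?P<base>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r"(?:\.(?P<frac>\d{1,9}))?"
    r"(?P<tz>Z|[+-]\d{2}:\d{2})$"
)


def parse_zoned_timestamp(value: str) -> datetime:
    """Parse a zoned timestamp string whether or not it carries fractional seconds."""
    match = _ZONED.match(value)
    if match is None:
        raise ValueError(f"unsupported timestamp format: {value!r}")

    # A missing fraction is treated as zero; anything finer than microseconds is cut off.
    micros = int((match.group("frac") or "")[:6].ljust(6, "0"))

    tz = match.group("tz")
    if tz == "Z":
        tzinfo = timezone.utc
    else:
        sign = 1 if tz[0] == "+" else -1
        offset = timedelta(hours=int(tz[1:3]), minutes=int(tz[4:6]))
        tzinfo = timezone(sign * offset)

    base = datetime.strptime(match.group("base"), "%Y-%m-%dT%H:%M:%S")
    return base.replace(microsecond=micros, tzinfo=tzinfo)
```

With a helper like this, a value without the fractional part and the same value padded with zeroes parse to equal datetime objects, so downstream code does not need to care which connector version produced the event.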

Optional parallel snapshots

Debezium’s relational database initial snapshot process has always been single-threaded. This limitation primarily stems from the complexities of ensuring data consistency across multiple transactions.

Starting in Debezium 2.2, we’re adding a new and initially optional way to use multiple threads to perform a consistent database snapshot for a connector. This implementation uses these threads to execute table-level snapshots in parallel.

To take advantage of this new feature, specify snapshot.max.threads in your connector’s configuration. When this property has a value greater than 1, parallel snapshots will be used.

Example configuration using parallel snapshots

snapshot.max.threads=4

In the example above, if the connector needs to snapshot more than 4 tables, there will be at most 4 tables being snapshot in parallel. When one thread finishes processing a table, it will get a new table to snapshot from the queue and the process continues until all tables have been snapshot.
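The scheduling behaviour described above resembles a fixed-size worker pool draining a queue of tables. The following sketch only illustrates that model and is not Debezium's implementation: snapshot_tables, snapshot_one, and the table names are hypothetical, and the peak counter exists purely to show that concurrency never exceeds the configured limit.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def snapshot_tables(tables, max_threads, snapshot_one):
    """Run snapshot_one(table) for every table using at most max_threads workers.

    Returns the per-table results (in input order) and the peak number of
    tables that were being processed at the same time.
    """
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def run(table):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        try:
            return snapshot_one(table)
        finally:
            with lock:
                state["active"] -= 1

    # The executor queues the tables; a worker that finishes one table
    # picks up the next one until the queue is empty.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        results = list(pool.map(run, tables))
    return results, state["peak"]


if __name__ == "__main__":
    tables = [f"public.table_{i}" for i in range(10)]
    done, peak = snapshot_tables(tables, 4, lambda t: f"snapshotted {t}")
    print(len(done), "tables processed, peak concurrency", peak)
```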

This feature is considered incubating, but we strongly suggest that new connector deployments give this feature a try. We would welcome any and all feedback on how to improve this going forward.

MongoDB server-side change stream filtering

Debezium presently subscribes to the MongoDB change stream and evaluates whether an event is relevant on the connector side. On the surface, there is nothing technically wrong with this approach, and it has worked well; however, a recent contributor explained how this decision impacts them.

Overall, the current process effectively serializes all changes from MongoDB across the network to the connector. If you have a lower volume of changes, you likely don’t see any issue with this approach; however, in a high volume scenario, especially when you’re only interested in a subset of the data generated by change streams, you quickly begin to see how this approach is inefficient. Furthermore, if you’re running the connector in a cloud environment like AWS, a high volume scenario is likely to have a noticeable impact on utilization costs.

Moving the evaluation of the include/exclude list filters from the connector to the MongoDB server’s change stream subscription brings a number of advantages for all MongoDB connector users.

Reducing the number of events seen by the connector affects both network and CPU utilization. When events are sent that the connector simply discards due to include/exclude filters, this leads to network usage that could be avoided. When the connector is configured with full document or pre-image settings, this adds even more network utilization that is entirely unnecessary. Furthermore, receiving more events than the connector configuration is interested in means the connector does more processing, which raises CPU utilization.

While network and CPU utilization are critical regardless of one’s environment, these are often more scrutinized when operating in a cloud-based environment, as these two metrics directly impact the operating budget. Users should see an overall lower network and CPU utilization with Debezium MongoDB 2.2 connectors.
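To make the idea of server-side filtering concrete, here is a rough sketch of how an include list could be turned into a change stream aggregation pipeline. It is an illustration under stated assumptions, not the pipeline the connector actually builds: build_change_stream_pipeline is a hypothetical helper, the include list is treated as literal database.collection names rather than the patterns the connector accepts, and the only MongoDB facts relied on are that change events carry an ns document with db and coll fields and that a $match stage can filter on them.

```python
def build_change_stream_pipeline(collection_include_list):
    """Build a change stream pipeline that keeps only the listed collections.

    Each entry is expected as '<database>.<collection>'. An empty include
    list yields an empty pipeline, meaning no server-side filter.
    """
    clauses = []
    for name in collection_include_list:
        # Database names cannot contain dots, so split at the first one.
        db, _, coll = name.partition(".")
        if not db or not coll:
            raise ValueError(f"expected '<database>.<collection>', got {name!r}")
        clauses.append({"ns.db": db, "ns.coll": coll})

    if not clauses:
        return []
    return [{"$match": {"$or": clauses}}]


if __name__ == "__main__":
    # 'inventory.customers' and 'inventory.orders' are made-up names.
    print(build_change_stream_pipeline(["inventory.customers", "inventory.orders"]))
```

A pipeline like this could be handed to a driver's change stream call (for example, the pipeline argument of PyMongo's watch method), so that events for other collections are never sent to the client in the first place.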

We hope to share more details about the benefits of this change in a future blog post, so stay tuned!

Incremental snapshot surrogate key support

Debezium’s incremental snapshot feature has been a tremendous success. It provides an efficient way to perform a consistent snapshot of data that can be resumed, which is critical when the snapshot consists of large volumes of data.

However, incremental snapshots do have specific requirements that must be met before the feature can be used. One of those requirements is that every table being snapshot must have a primary key. You may ask why a table would have no primary key, and we aren’t going to debate that here today; suffice it to say that this occurs more often than you may think.

With Debezium 2.2, incremental snapshots can be performed on key-less tables as long as there is one column that is unique and can be considered a "surrogate key" for incremental snapshot purposes.

The surrogate key feature is not supported by the MongoDB connector; it is only available for relational connectors.

To provide the surrogate key column data in an incremental snapshot signal, the signal’s payload must include the new surrogate key attribute, surrogate-key.

An example incremental snapshot signal payload specifying a surrogate key

{
  "data-collections": [ "public.mytab" ],
  "surrogate-key": "customer_ref"
}

In the above example, an incremental snapshot will be started for table public.mytab and the incremental snapshot will use the customer_ref column as the primary key for generating the snapshot windows.

A surrogate key cannot be defined using multiple columns, only a single column.
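As a rough illustration of how such a signal might be assembled programmatically, the sketch below builds the three values typically written to a signaling table. Several things here are assumptions not stated in this post: build_incremental_snapshot_signal is a hypothetical helper, the (id, type, data) layout and the execute-snapshot signal type come from Debezium's general signaling documentation, and the random UUID is just one way to pick a signal id.

```python
import json
import uuid


def build_incremental_snapshot_signal(data_collections, surrogate_key=None):
    """Return an (id, type, data) tuple for an incremental snapshot signal.

    data is the JSON payload as a string; surrogate_key, when given,
    must name exactly one column.
    """
    payload = {"data-collections": list(data_collections)}
    if surrogate_key is not None:
        # Only a single column may act as the surrogate key.
        if "," in surrogate_key or not surrogate_key.strip():
            raise ValueError("a surrogate key must be a single column name")
        payload["surrogate-key"] = surrogate_key.strip()

    signal_id = str(uuid.uuid4())        # assumption: any unique string works as the id
    signal_type = "execute-snapshot"     # assumption: signal type from the signaling docs
    return signal_id, signal_type, json.dumps(payload)


if __name__ == "__main__":
    print(build_incremental_snapshot_signal(["public.mytab"], "customer_ref"))
```

The returned tuple could then be bound as parameters of an INSERT into whatever signaling table the connector is configured to watch.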

However, the surrogate key feature isn’t just applicable for tables with no primary keys. There are a series of advantages when using this feature with tables that have primary keys:

One clear advantage is when the table’s primary key consists of multiple columns. The query generates a disjunction predicate for each column in the primary key, and its performance is highly dependent on the environment. Reducing the number of columns down to a single column often performs well universally.
Another advantage is when the surrogate key is based on a numeric data type while the primary key column is based on a character-based data type. Relational databases generally perform predicate evaluation more efficiently with numeric comparisons rather than character comparisons. By adjusting the query to use a numeric data type in this case, query performance could be better.
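To see why a multi-column key makes the chunk query heavier, consider the general shape of a "rows after the last seen key" predicate. The sketch below is an approximation for illustration only and does not reproduce the exact SQL Debezium generates: chunk_predicate is a hypothetical helper, and the region and order_id column names are made up.

```python
def chunk_predicate(key_columns):
    """Build a 'greater than the last seen key' predicate for ordered chunking.

    For key (a, b, c) this yields:
    (a > ?) OR (a = ? AND b > ?) OR (a = ? AND b = ? AND c > ?)
    """
    if not key_columns:
        raise ValueError("at least one key column is required")

    branches = []
    for i, column in enumerate(key_columns):
        # All earlier key columns must match exactly; this one must be greater.
        terms = [f"{previous} = ?" for previous in key_columns[:i]]
        terms.append(f"{column} > ?")
        branches.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(branches)


if __name__ == "__main__":
    print(chunk_predicate(["region", "order_id"]))  # composite primary key
    print(chunk_predicate(["customer_ref"]))        # single surrogate key
```

A key with n columns produces n branches and n(n+1)/2 bound parameters, whereas a single surrogate key column collapses the predicate to one simple comparison.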

Other fixes

There were quite a number of other improvements, bug fixes, and stability changes in this release. Some noteworthy ones are:

When using snapshot.collection.include.list, relational schema isn’t populated correctly DBZ-3594
Debezium UI should use fast-jar again with Quarkus 2.x DBZ-4621
Create a Datastax connector based on Cassandra connector DBZ-5951
Add support for honouring MongoDB read preference in change stream after promotion DBZ-5953
Add support for header to all Debezium Server sinks DBZ-6017
GCP Spanner connector start failing when there are multiple indexes on a single column DBZ-6101
Negative remaining attempts on MongoDB reconnect case DBZ-6113
Support String type for key in Mongo incremental snapshot DBZ-6116
Tables with spaces or non-ASCII characters in their name are not captured by Oracle because they must be quoted. DBZ-6120
Offsets are not advanced in a CDB deployment with low frequency of changes to PDB DBZ-6125
Allow TestContainers test framework to expose ConnectorConfiguration as JSON DBZ-6136
Oracle TIMESTAMP WITH TIME ZONE is emitted as GMT during snapshot rather than the specified TZ DBZ-6143
Upgrade impsort-maven-plugin from 1.7.0 to 1.8.0 DBZ-6144
Debezium UI E2E Frontend build failing randomly with corrupted Node 16 tar file DBZ-6146
Debezium UI SQL Server tests randomly fail due to slow agent start-up DBZ-6149
Upgrade Quarkus dependencies to 2.16.3.Final DBZ-6150
Remove hardcoded list of system database exclusions that are not required for change streaming DBZ-6152
RelationalSnapshotChangeEventSource swallows exception generated during snapshot DBZ-6179
Create SSL scenarios for integration tests for MySQL connector DBZ-6184

Altogether, 33 issues were fixed for this release. A big thank you to all the contributors from the community who worked on this release: Anisha Mohanty, Bob Roldan, Bobby Tiernay, Chris Cranford, Eugene Abramchuk, Gabor Andras, Gunnar Morling, Harvey Yue, Jakub Cechacek, Jeremy Ford, Jiri Pechanec, Mehmet Firat Komurcu, Plugaru Tudor, Stefan Miklosovic, Subodh Kant Chaturvedi, Vojtech Juranek, and Xinbin Huang!

Outlook & What’s Next?

We are nearing the end of the Debezium 2.2 development cycle. Assuming no unexpected problems, we intend to release Beta1 next week, followed by a release candidate two weeks thereafter. Our goal is to finalize the Debezium 2.2 release in late March or early April at the latest.

We would love to hear your feedback or suggestions about our roadmap, the changes in this release, any issues that are still outstanding, or anything we may not have mentioned. Be sure to get in touch with us on the mailing list or our chat.

Also, the DevNexus 2023 conference is coming up in early April in Atlanta, and I have the privilege to be a guest speaker discussing Debezium and CDC patterns. Be sure to check out that talk in person if you have an opportunity!

And finally, be on the lookout for our first installment of our 2023 Newsletter later this month. I also will be wrapping up the blog series, "Debezium for Oracle" where I cover performance, debugging, and frequently asked questions about the Oracle connector.

Until next time…

Chris Cranford

Chris is a software engineer at Red Hat. He previously was a member of the Hibernate ORM team and now works on Debezium. He lives in North Carolina just a few hours from Red Hat towers.

About Debezium

Debezium is an open source distributed platform that turns your existing databases into event streams, so applications can see and respond almost instantly to each committed row-level change in the databases. Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, so your application can be stopped and restarted at any time and can easily consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely. Debezium is open source under the Apache License, Version 2.0.

Get involved

We hope you find Debezium interesting and useful, and want to give it a try. Follow us on Twitter @debezium, chat with us on Zulip, or join our mailing list to talk with the community. All of the code is open source on GitHub, so build the code locally and help us improve our existing connectors and add even more connectors. If you find problems or have ideas for how we can improve Debezium, please let us know or log an issue.