Debezium 2.0.0.Beta2 Released

New connector property namespaces

One of the largest overhauls going into Debezium 2.0 is the introduction of new connector property namespaces. Starting with Debezium 2.0.0.Beta2, many connector properties have been relocated under new names. This is a breaking change that affects most, if not all, connector deployments during the upgrade process.

Debezium previously used the prefix "database." for a wide variety of connector properties. Some of these properties were meant to be passed directly to the JDBC driver, others to the database history implementations, and so on. Unfortunately, we identified situations where some properties were being passed to underlying implementations for which they weren’t intended. While this wasn’t causing any regression or problem, it could introduce a future issue if property names collided, for example, if a JDBC driver property matched a "database."-prefixed Debezium connector property.

The following describes the changes to the connector properties:

All configurations previously prefixed as database.history. are now to be prefixed using schema.history.internal. instead.
All JDBC pass-through options previously specified using the database. prefix should now be prefixed using driver. instead.
The database.server.name connector property has been renamed to topic.prefix.
The MongoDB mongodb.name connector property has been aligned to use topic.prefix instead.

Again, please review your connector configurations prior to deployment and adjust accordingly.
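
As a rough aid for reviewing a configuration, the renames listed above can be sketched as a small helper. This is an illustrative sketch only, not a tool shipped with Debezium: the function name migrate_properties and the database.connectTimeout option are hypothetical, and because pass-through options cannot be told apart from regular "database." properties automatically, the caller has to name them explicitly.

```python
def migrate_properties(config, jdbc_passthrough=()):
    """Return a copy of a pre-2.0 connector config using the 2.0 property names.

    jdbc_passthrough names the old "database."-prefixed keys that were meant
    for the JDBC driver. They cannot be detected automatically, so the caller
    must list them (an assumption of this sketch).
    """
    history_prefix = "database.history."
    migrated = {}
    for key, value in config.items():
        if key.startswith(history_prefix):
            # database.history.* -> schema.history.internal.*
            new_key = "schema.history.internal." + key[len(history_prefix):]
        elif key in ("database.server.name", "mongodb.name"):
            # Both properties are replaced by topic.prefix
            new_key = "topic.prefix"
        elif key in jdbc_passthrough and key.startswith("database."):
            # JDBC pass-through options move to the driver.* namespace
            new_key = "driver." + key[len("database."):]
        else:
            # Everything else keeps its name
            new_key = key
        migrated[new_key] = value
    return migrated


old_config = {
    "database.hostname": "localhost",
    "database.server.name": "dbserver1",
    "database.history.kafka.topic": "schema-changes.inventory",
    "database.connectTimeout": "30000",  # hypothetical JDBC pass-through option
}
new_config = migrate_properties(
    old_config, jdbc_passthrough={"database.connectTimeout"}
)
```

Properties such as database.hostname are left untouched because only the history, pass-through, and server name properties are covered by the rename rules above.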

All Debezium event schemas are named and versioned

Debezium change events are emitted with a schema definition, which contains metadata about the fields such as the type, whether it’s required, and so on. In previous iterations of Debezium, some schema definitions did not have explicit names nor were they being explicitly versioned. In this release, we’ve moved to making sure that all schema definitions have an explicit name and version associated with them. The goal of this change is to help with future event structure compatibility, particularly for those who are using schema registries. However, if you are currently using a schema registry, be aware that this change may lead to schema compatibility issues during the upgrade process.

Skipped operations default to truncate events

Debezium supports skipping specific event types by including the skipped.operations connector property in the connector’s configuration. This feature can be useful if you’re only interested in a subset of operations, such as only inserts and updates but not deletions.

One specific event type, truncates (t), is only supported by a subset of relational connectors and whether these events were to be skipped wasn’t consistent. In this release, we have aligned the skipped.operations behavior so that if the connector supports truncate events, these events are skipped by default.

Please review the following rule-set:

Connector supports truncate events and isn’t the Oracle connector
Connector configuration does not specify the skipped.operations property

If both of the above are true, the connector’s behavior will change after the upgrade. If you wish to continue emitting truncate events, you must set skipped.operations=none in the connector configuration.
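
The rule-set above can be expressed as a single predicate. The following is a minimal sketch for checking your own deployments; the function name and its boolean parameters are hypothetical and simply mirror the two conditions listed.

```python
def truncate_behavior_changes(supports_truncate, is_oracle, config):
    """Return True if upgrading changes how truncate events are handled.

    Mirrors the two rules above: the connector supports truncate events and
    is not the Oracle connector, and skipped.operations is not set explicitly.
    """
    return (
        supports_truncate
        and not is_oracle
        and "skipped.operations" not in config
    )


# A connector with truncate support and no explicit setting is affected.
affected = truncate_behavior_changes(True, False, {"topic.prefix": "dbserver1"})

# Setting skipped.operations explicitly keeps the configured behavior.
pinned = truncate_behavior_changes(True, False, {"skipped.operations": "none"})
```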

MySQL binlog compression support

In this release, Debezium supports reading binlog entries that were written with compression enabled. In version 8.0.20, MySQL added the ability to compress binlog events using the ZSTD algorithm. To enable compression, set the binlog_transaction_compression system variable on the MySQL server to ON. When compression is enabled, the binlog behaves as usual, except that the contents of the binlog entries are compressed to save space and are sent to replicas in compressed format, significantly reducing network overhead for larger transactions.

If you’re interested in reading more about MySQL binlog compression, you can refer to the Binary Log Transaction Compression section of the MySQL documentation for more details.

Cassandra 4 incremental commit log support

Cassandra 4 has improved its CDC integration: whenever an fsync operation occurs, Cassandra updates a CDC-based index file with the latest offset values. This index file allows CDC implementations to read up to the offset that Cassandra considers durable.

In this release, Debezium now uses this CDC-based index file to eliminate the inherent delay in processing CDC events from Cassandra that previously existed. This should provide Cassandra users a substantial improvement in CDC with Debezium, and gives an incentive to consider Cassandra 4 over Cassandra 3.
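
To make the idea of reading up to a durable offset more concrete, here is a small sketch of interpreting such an index file. The file layout is an assumption on our part, based on our reading of the Cassandra CDC documentation: a first line holding the durable offset, and an optional second line reading COMPLETED once the commit log segment is finished. The function name is hypothetical and this is not the Debezium connector's code.

```python
def parse_cdc_index(text):
    """Parse the text of a commit log CDC index file.

    Assumed layout (not confirmed by this post): the first line is the
    durable byte offset, and an optional second line reads COMPLETED.
    Returns a tuple (durable_offset, segment_completed).
    """
    lines = text.split()
    if not lines:
        # Nothing has been flushed yet, so nothing is safe to read
        return 0, False
    durable_offset = int(lines[0])
    segment_completed = len(lines) > 1 and lines[1] == "COMPLETED"
    return durable_offset, segment_completed


in_progress = parse_cdc_index("1024\n")
finished = parse_cdc_index("2048\nCOMPLETED\n")
```

A reader built on this would process commit log bytes only up to durable_offset and re-check the index file later, rather than waiting for the whole segment to complete.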

Pause and resume incremental snapshots

Incremental snapshots have become an integral feature in Debezium. The incremental snapshot feature allows users to re-run a snapshot on one or more collections/tables for a variety of reasons. Incremental snapshots were originally introduced with just a start signal. We eventually added the ability to stop an ongoing incremental snapshot or to remove a subset of collections/tables from an in-progress incremental snapshot.

In this release, we’ve built on the existing signal foundation and introduced two new signals: one to pause an in-progress incremental snapshot and another to resume it if it was previously paused. To pause an incremental snapshot, send a pause-snapshot signal; to resume it, send a resume-snapshot signal.

These two new signals can be sent using the signal table strategy or the Kafka signal topic strategy for MySQL. Please refer to the signal support documentation for more details on signals and how they work.
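
As a sketch of the signal table strategy, the snippet below inserts pause-snapshot and resume-snapshot rows into a signal table. Several things here are assumptions for illustration: the table name debezium_signal, the send_signal helper, and the use of an in-memory SQLite database as a stand-in so the example runs anywhere. In a real deployment the insert would target the signal table in the captured source database.

```python
import sqlite3
import uuid


def send_signal(conn, signal_type, data=None):
    """Insert one signal row and return its generated id.

    The id, type, and data columns follow the signal table layout; the
    pause and resume signals are sent here without a data payload.
    """
    signal_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO debezium_signal (id, type, data) VALUES (?, ?, ?)",
        (signal_id, signal_type, data),
    )
    conn.commit()
    return signal_id


# SQLite is only a stand-in for the source database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE debezium_signal ("
    "id VARCHAR(42) PRIMARY KEY, type VARCHAR(32) NOT NULL, data VARCHAR(2048))"
)

send_signal(conn, "pause-snapshot")
send_signal(conn, "resume-snapshot")

sent_types = [
    row[0] for row in conn.execute("SELECT type FROM debezium_signal ORDER BY rowid")
]
```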

Custom SQL filtering for incremental snapshots

Although uncommon, there may be scenarios, such as a connector misconfiguration, where a specific record or subset of records needs to be re-emitted to the topic. Unfortunately, incremental snapshots have traditionally been an all-or-nothing process, in which all records from a collection or table are re-emitted as part of the snapshot.

In this release, a new additional-condition property can be specified in the signal payload, allowing the signal to dictate a SQL-based predicate to control what subset of records should be included in the incremental snapshot instead of the default behavior of all rows.

The following example illustrates sending an incremental snapshot signal for the products table, but instead of sending all rows from the table to the topic, the additional-condition property has been specified to restrict the snapshot to only send events that relate to product id equal to 12:

{
  "type": "execute-snapshot",
  "data": {
    "data-collections": ["inventory.products"],
    "type": "INCREMENTAL",
    "additional-condition": "product_id=12"
  }
}

We believe this new incremental snapshot feature will be tremendously helpful, since it avoids re-snapshotting all rows when only a subset of the data is required.
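
If you generate signals from a script, the payload shown earlier can be assembled programmatically. The helper below is a hypothetical convenience, not part of Debezium; it only reproduces the payload structure from the example, adding the additional-condition property when a predicate is supplied.

```python
import json


def incremental_snapshot_signal(data_collections, additional_condition=None):
    """Build an execute-snapshot signal payload as a dictionary.

    When additional_condition is omitted, the property is left out and the
    snapshot covers all rows, which is the default behavior.
    """
    data = {"data-collections": list(data_collections), "type": "INCREMENTAL"}
    if additional_condition is not None:
        data["additional-condition"] = additional_condition
    return {"type": "execute-snapshot", "data": data}


# Restrict the snapshot of inventory.products to a single product id.
signal = incremental_snapshot_signal(["inventory.products"], "product_id=12")
payload = json.dumps(signal, indent=2)
```

Since the condition is a SQL-based predicate, build it from trusted input only.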

Signal collection automatically added to include filters

In prior releases of Debezium, the signal collection/table used for incremental snapshots had to be manually added to your table.include.list connector property. A big theme in this release was improvements on incremental snapshots, so we’ve taken this opportunity to streamline this as well. Starting in this release, Debezium will automatically add the signal collection/table to the table inclusion filters, avoiding the need for users to manually add it.

This change does not introduce any compatibility issues. Connector configurations that already include the signal collection/table in the table.include.list property will continue to work without requiring any changes. However, if you wish to align your configuration with the current behavior, you can safely remove the signal collection/table from table.include.list, and Debezium will handle this for you automatically.
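
Conceptually, the new behavior amounts to the following. This is a simplified sketch rather than Debezium's implementation: the function name is hypothetical, include list entries are treated as plain names instead of regular expressions, and an empty include list is taken to mean that every table, including the signal table, is already captured.

```python
def effective_table_filter(table_include_list, signal_data_collection=None):
    """Return the table names the connector effectively includes.

    Simplification: entries are compared as literal names, whereas the real
    table.include.list property holds regular expressions.
    """
    tables = [name.strip() for name in table_include_list.split(",") if name.strip()]
    if not tables:
        # No include filter configured, so nothing needs to be added
        return tables
    if signal_data_collection and signal_data_collection not in tables:
        tables.append(signal_data_collection)
    return tables


# The signal table is added when the configuration leaves it out.
added = effective_table_filter(
    "inventory.products", "inventory.debezium_signal"
)

# Listing it explicitly, as older configurations do, changes nothing.
unchanged = effective_table_filter(
    "inventory.products,inventory.debezium_signal", "inventory.debezium_signal"
)
```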

Multitasking support for Vitess connector

The Vitess connector previously operated in one of two modes, depending entirely on whether the connector configuration specified any shard details. Unfortunately, both modes resulted in a single task being responsible for the VStream processing. For larger Vitess installations with many shards, this architecture could begin to show latency issues, as a single task may not be able to keep up with the changes across all shards. Specifying the shard details was more complex still: it required manually resolving the shards across the cluster and starting a single Debezium connector per shard, which is error-prone and, more importantly, could result in deploying many Debezium connectors.

The Vitess community recognized this and sought to find a solution that addresses all these problems, both from a maintenance and error perspective. In Debezium 2.0 Beta2, the Vitess connector now automatically resolves the shards via a discovery mechanism, quite similar to that of MongoDB. This discovery mechanism will then split the load across multiple tasks, allowing for a single deployment of Debezium running a task per shard or shard lists, depending on the maximum number of allowed tasks for the connector.
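
To illustrate how discovered shards might be spread across tasks, here is a minimal round-robin sketch. It is not the connector's actual assignment algorithm, which this post does not describe; the function name and the shard names are hypothetical. It shows only the general idea: one task per shard when enough tasks are allowed, and shard lists per task otherwise.

```python
def assign_shards(shards, max_tasks):
    """Distribute shards over at most max_tasks tasks, round-robin.

    Returns one list of shards per task. No more tasks are created than
    there are shards.
    """
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    task_count = min(max_tasks, len(shards))
    assignments = [[] for _ in range(task_count)]
    for index, shard in enumerate(shards):
        assignments[index % task_count].append(shard)
    return assignments


shards = ["-40", "40-80", "80-c0", "c0-"]

# With two tasks allowed, each task handles a list of two shards.
two_tasks = assign_shards(shards, max_tasks=2)

# With more tasks allowed than shards, each shard gets its own task.
per_shard = assign_shards(shards, max_tasks=8)
```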

During the upgrade, the Vitess connector will automatically migrate the offset storage to the new format used with the multitasking behavior. But be aware that once you’ve upgraded, you won’t be able to downgrade to an earlier version as the offset storage format will have changed.

Other fixes & improvements

There are many bugfixes and stability changes in this release. Some of the noteworthy ones are:

Source info of incremental snapshot events exports wrong data DBZ-4329
Deprecate internal key/value converter options DBZ-4617
"No maximum LSN recorded" log message can be spammed on low-activity databases DBZ-4631
Redis Sink config properties are not passed to DB history DBZ-5035
Upgrade SQL Server driver to 10.2.1.jre8 DBZ-5290
HTTP sink not retrying failing requests DBZ-5307
Translation from mongodb document to kafka connect schema fails when nested arrays contain no elements DBZ-5434
Duplicate SCNs on same thread Oracle RAC mode incorrectly processed DBZ-5439
Deprecate legacy topic selector for all connectors DBZ-5457
Remove the dependency of JdbcConnection on DatabaseSchema DBZ-5470
Missing the regex properties validation before start connector of DefaultRegexTopicNamingStrategy DBZ-5471
Create Index DDL fails to parse when using TABLESPACE clause with quoted identifier DBZ-5472
Outbox doesn’t check array consistency properly when it determines its schema DBZ-5475
Misleading statistics written to the log DBZ-5476
Remove SQL Server SourceTimestampMode DBZ-5477
Debezium connector task didn’t retry when failover in mongodb 5 DBZ-5479
Better error reporting for signal table failures DBZ-5484
Oracle DATADUMP DDL cannot be parsed DBZ-5488
Upgrade PostgreSQL driver to 42.4.1 DBZ-5493
Mysql connector parser the ddl statement failed when including keyword "buckets" DBZ-5499
duplicate call to config.validateAndRecord() in RedisDatabaseHistory DBZ-5506
DDL statement couldn’t be parsed : mismatched input 'ENGINE' DBZ-5508
Use “database.dbnames” in SQL Server docs DBZ-5516
LogMiner DML parser incorrectly interprets concatenation operator inside quoted column value DBZ-5521
Mysql Connector DDL Parser does not parse all privileges DBZ-5522
CREATE TABLE with JSON-based CHECK constraint clause causes MultipleParsingExceptions DBZ-5526
Disable preferring DDL before logical schema in history recovery DBZ-5535
EmbeddedEngine should initialize Connector using SourceConnectorContext DBZ-5534
Support EMPTY column identifier DBZ-5550
Use TCCL as the default classloader to load interface implementations DBZ-5561
max.queue.size.in.bytes is invalid DBZ-5569
Language type for listings in automatic topic creation DBZ-5573
Upgrade mysql-binlog-connector-java library version DBZ-5574
Vitess: Handle VStream close unexpectedly DBZ-5579
Error when parsing alter sql DBZ-5587
Field validation errors are misleading for positive, non-zero expectations DBZ-5588
Mysql connector can’t handle the case-sensitive of rename/change column statement DBZ-5589
LIST_VALUE_CLAUSE not allowing TIMESTAMP LITERAL DBZ-5592
Oracle DDL does not support comments on materialized views DBZ-5595
Oracle DDL does not support DEFAULT ON NULL DBZ-5605
Datatype mdsys.sdo_geometry not supported DBZ-5609

Altogether, a total of 107 issues were fixed for this release.

A big thank you to all the contributors from the community who worked on this release: Ahmed ELJAMI, Alexander Schwartz, Alexey Loubyansky, Anisha Mohanty, Bob Roldan, Chris Cranford, Claus Ibsen, Debjeet Sarkar, Gabor Andras, Gunnar Morling, Hang Ruan, Harvey Yue, Henry Cai, Inki Hwang, Jakub Cechacek, Jannik Steinmann, Jeremy Ford, Jiri Novotny, Jiri Pechanec, Katerina Galieva, Marek Winkler, Martin Medek, Nitin Chhabra, Phạm Ngọc Thắng, Robert Roldan, Ruud H.G. van Tol, Seo Jae-kwon, Sergei Morozov, Stefan Miklosovic, Vadzim Ramanenka, Vivek Wassan, Vojtech Juranek, Zhongqiang Gong, 合龙张, 崔世杰, and 민규 김!

What’s next?

With the release of Debezium 2.0 Beta2, we’re in the home stretch toward 2.0.0.Final. The community should expect a CR1 by the end of September and 2.0.0.Final released by the middle of October.

In addition, our very own Gunnar Morling and I will be guests on the upcoming Quarkus Insights podcast, episode #103. We will discuss Debezium and Quarkus, how Debezium leverages the power of Quarkus, and all the new features in Debezium 2.0, and we will walk through a virtual how-to on embedding Debezium in a Quarkus-based application. Be sure to check out the podcast and let us know what you think!

Chris Cranford

Chris is a software engineer at Red Hat. He previously was a member of the Hibernate ORM team and now works on Debezium. He lives in North Carolina just a few hours from Red Hat towers.

About Debezium

Debezium is an open source distributed platform that turns your existing databases into event streams, so applications can see and respond almost instantly to each committed row-level change in the databases. Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, so your application can be stopped and restarted at any time and can easily consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely. Debezium is open source under the Apache License, Version 2.0.

Get involved

We hope you find Debezium interesting and useful, and want to give it a try. Follow us on Twitter @debezium, chat with us on Zulip, or join our mailing list to talk with the community. All of the code is open source on GitHub, so build the code locally and help us improve our existing connectors and add even more connectors. If you find problems or have ideas for how we can improve Debezium, please let us know or log an issue.