I am excited to announce the release of Debezium 2.0.0.Beta2!
This release contains several breaking changes, stability fixes, and bug fixes, all inching us closer to 2.0.0.Final. In total, 107 issues were fixed in this release.
If you intend to upgrade to 2.0.0.Beta2, we strongly recommend that you read the release notes before upgrading to understand all of the breaking changes. The following noteworthy changes, some of which are breaking, are covered in this blog post:
- [breaking] New connector property namespaces
- [potentially breaking] All event schemas properly named and versioned
- [potentially breaking] Skipped operations now include truncate events by default
- Signal collection now added to table include list automatically
New connector property namespaces
One of the largest overhauls going into Debezium 2.0 is the introduction of new connector property namespaces. Starting in Debezium 2.0 Beta2 and onward, many connector properties have been relocated with new names. This is a breaking change and affects most, if not all, connector deployments during the upgrade process.
Debezium previously used the prefix "database." for a plethora of varied connector properties. Some of these properties were meant to be passed directly to the JDBC driver, others to the database history implementation, and so on. Unfortunately, we identified situations where some properties were being passed to underlying implementations that weren't intended to receive them. While this wasn't creating any regression or problem, it could introduce a future issue if there were property name collisions, for example a JDBC driver property that matched a "database."-prefixed Debezium connector property.
The following describes the changes to the connector properties:
- All configurations previously prefixed with `database.history.` are now prefixed with `schema.history.internal.` instead.
- All JDBC pass-through options previously specified with the `database.` prefix should now use the `driver.` prefix instead.
- The `database.server.name` connector property has been renamed to `topic.prefix`.
- The MongoDB `mongodb.name` connector property has been aligned to use `topic.prefix` instead.
Again, please review your connector configurations prior to deployment and adjust accordingly.
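To illustrate, a minimal before-and-after fragment of a MySQL connector configuration might look as follows; the values and the `connectTimeout` JDBC pass-through property are hypothetical placeholders:

```properties
# Before (Debezium 1.x): hypothetical MySQL connector fragment
database.server.name=dbserver1
database.history.kafka.topic=schema-changes.inventory
database.connectTimeout=30000

# After (Debezium 2.0.0.Beta2 and later): same settings, new namespaces
topic.prefix=dbserver1
schema.history.internal.kafka.topic=schema-changes.inventory
driver.connectTimeout=30000
```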
All Debezium event schemas are named and versioned
Debezium change events are emitted with a schema definition, which contains metadata about the fields such as the type, whether it’s required, and so on. In previous iterations of Debezium, some schema definitions did not have explicit names nor were they being explicitly versioned. In this release, we’ve moved to making sure that all schema definitions have an explicit name and version associated with them. The goal of this change is to help with future event structure compatibility, particularly for those who are using schema registries. However, if you are currently using a schema registry, be aware that this change may lead to schema compatibility issues during the upgrade process.
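As an abbreviated sketch, when using the Kafka Connect JSON converter with schemas enabled, the envelope schema of a change event now carries an explicit name and version; the schema name shown here is illustrative:

```json
{
  "schema": {
    "type": "struct",
    "name": "dbserver1.inventory.customers.Envelope",
    "version": 1,
    "fields": [ "... field definitions elided ..." ]
  },
  "payload": { "...": "..." }
}
```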
Skipped operations default to truncate events
Debezium supports skipping specific event types by including the `skipped.operations` connector property in the connector's configuration. This feature can be useful if you're only interested in a subset of operations, such as only inserts and updates but not deletes.
One specific event type, truncates (`t`), is only supported by a subset of relational connectors, and whether these events were skipped wasn't consistent. In this release, we have aligned the `skipped.operations` behavior so that if a connector supports truncate events, they are skipped by default.
Please review the following rule-set:
- The connector supports truncate events and isn't the Oracle connector.
- The connector configuration does not specify `skipped.operations`.
If both of the above are true, the connector's behavior will change after the upgrade. If you wish to continue to emit truncate events, the `skipped.operations=none` configuration will be required.
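For example, a connector that should keep emitting truncate events after the upgrade would add this fragment to its configuration (all other required connector properties omitted):

```properties
# Emit all operations, including truncates, as before the upgrade
skipped.operations=none
```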
MySQL binlog compression support
In this release, Debezium now supports reading binlog entries that were written with compression enabled. In version 8.0.20, MySQL added the ability to compress binlog events using the ZSTD algorithm. To enable compression, set the `binlog_transaction_compression` system variable on the MySQL server to `ON`. When compression is enabled, the binlog behaves as usual, except that the contents of the binlog entries are compressed to save space and are replicated in compressed format to replicas, significantly reducing network overhead for larger transactions.
If you’re interested in reading more about MySQL binlog compression, you can refer to the Binary Log Transaction Compression section of the MySQL documentation for more details.
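As a sketch, compression can be enabled at runtime on a MySQL 8.0.20+ server; the second statement is optional and shows the MySQL default level:

```sql
-- Enable ZSTD compression of binlog transactions (MySQL 8.0.20+)
SET GLOBAL binlog_transaction_compression = ON;
-- Optionally tune the ZSTD compression level (1-22, default 3)
SET GLOBAL binlog_transaction_compression_level_zstd = 3;
```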
Cassandra 4 incremental commit log support
Cassandra 4 has improved its integration with CDC: when an fsync operation occurs, Cassandra updates a CDC-based index file to contain the latest offset values. This index file allows CDC implementations to read up to the offset that Cassandra considers durable.
In this release, Debezium now uses this CDC-based index file to eliminate the inherent delay in processing CDC events from Cassandra that previously existed. This should give Cassandra users a substantial improvement in CDC with Debezium, and an incentive to consider Cassandra 4 over Cassandra 3.
Pause and resume incremental snapshots
Incremental snapshots have become an integral feature in Debezium. The incremental snapshot feature allows users to re-run a snapshot on one or more collections/tables for a variety of reasons. Incremental snapshots were originally introduced with just a start signal. We eventually added the ability to stop an ongoing incremental snapshot or to be able to remove a subset of collections/tables from an in-progress incremental snapshot.
In this release, we've built on top of the existing signal foundation and introduced two new signals: one to pause an in-progress incremental snapshot, and another to resume it if it has previously been paused. To pause an incremental snapshot, send a `pause-snapshot` signal; to resume one, send a `resume-snapshot` signal.
These two new signals can be sent using the signal table strategy or the Kafka signal topic strategy for MySQL. Please refer to the signal support documentation for more details on signals and how they work.
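As a sketch using the signal table strategy, and assuming a signal table named `debezium_signal` with the usual `id`, `type`, and `data` columns (the table name and ids here are illustrative):

```sql
-- Pause the in-progress incremental snapshot
INSERT INTO debezium_signal (id, type, data)
VALUES ('pause-signal-1', 'pause-snapshot', '{}');

-- Later, resume the paused incremental snapshot
INSERT INTO debezium_signal (id, type, data)
VALUES ('resume-signal-1', 'resume-snapshot', '{}');
```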
Custom SQL filtering for incremental snapshots
Although uncommon, there may be scenarios such as a connector misconfiguration, where a specific record or subset of records needs to be re-emitted to the topic. Unfortunately, incremental snapshots have traditionally been an all-or-nothing type of process, where we would re-emit all records from a collection or table as a part of the snapshot.
In this release, a new `additional-condition` property can be specified in the signal payload, allowing the signal to dictate a SQL-based predicate that controls which subset of records should be included in the incremental snapshot instead of the default behavior of all rows.
The following example illustrates sending an incremental snapshot signal for the `products` table, but instead of sending all rows from the table to the topic, the `additional-condition` property restricts the snapshot to events that relate to a product id of `12`:
```json
{
  "type": "execute-snapshot",
  "data": {
    "data-collections": ["inventory.products"],
    "type": "INCREMENTAL",
    "additional-condition": "product_id=12"
  }
}
```
We believe this new incremental snapshot feature will be tremendously helpful for a variety of reasons, without always having to re-snapshot all rows when only a subset of data is required.
Signal collection automatically added to include filters
In prior releases of Debezium, the signal collection/table used for incremental snapshots had to be manually added to your `table.include.list` connector property. A big theme in this release was improvements to incremental snapshots, so we've taken this opportunity to streamline this as well. Starting in this release, Debezium automatically adds the signal collection/table to the table inclusion filters, avoiding the need for users to add it manually.
This change does not impose any compatibility issues. Connector configurations that already include the signal collection/table in the `table.include.list` property will continue to work without requiring any changes. However, if you wish to align your configuration with current behavior, you can also safely remove the signal collection/table from `table.include.list`, and Debezium will handle this for you automatically.
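For example, assuming a signal table named `inventory.debezium_signal` (the names here are illustrative), the two fragments below now behave the same:

```properties
# Prior releases: the signal table had to be listed explicitly
table.include.list=inventory.customers,inventory.debezium_signal
signal.data.collection=inventory.debezium_signal

# This release onward: the signal table may be omitted from the include list
table.include.list=inventory.customers
signal.data.collection=inventory.debezium_signal
```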
Multitasking support for Vitess connector
The Vitess connector previously allowed operation in two different modes, depending entirely on whether the connector configuration specified any shard details. Unfortunately, both cases resulted in a single task responsible for performing the VStream processing. For larger Vitess installations with many shards, this architecture could begin to show latency issues, as a single task may not be able to keep up with all the changes across all shards. Worse, when specifying shard details, users had to manually resolve the shards across the cluster and start a single Debezium connector per shard, which is error-prone and, more importantly, could result in deploying many Debezium connectors.
The Vitess community recognized this and sought to find a solution that addresses all these problems, both from a maintenance and error perspective. In Debezium 2.0 Beta2, the Vitess connector now automatically resolves the shards via a discovery mechanism, quite similar to that of MongoDB. This discovery mechanism will then split the load across multiple tasks, allowing for a single deployment of Debezium running a task per shard or shard lists, depending on the maximum number of allowed tasks for the connector.
During the upgrade, the Vitess connector will automatically migrate the offset storage to the new format used with the multitasking behavior. But be aware that once you’ve upgraded, you won’t be able to downgrade to an earlier version as the offset storage format will have changed.
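A sketch of what this looks like in a connector configuration: the standard Kafka Connect `tasks.max` property caps how many tasks the discovered shards are spread across (all other required Vitess connection properties are omitted):

```properties
connector.class=io.debezium.connector.vitess.VitessConnector
# Discovered shards are distributed across up to four VStream tasks
tasks.max=4
```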
Other fixes & improvements
There are many bugfixes and stability changes in this release; some noteworthy ones are:
- Source info of incremental snapshot events exports wrong data DBZ-4329
- Deprecate internal key/value converter options DBZ-4617
- "No maximum LSN recorded" log message can be spammed on low-activity databases DBZ-4631
- Redis Sink config properties are not passed to DB history DBZ-5035
- Upgrade SQL Server driver to 10.2.1.jre8 DBZ-5290
- HTTP sink not retrying failing requests DBZ-5307
- Translation from mongodb document to kafka connect schema fails when nested arrays contain no elements DBZ-5434
- Duplicate SCNs on same thread Oracle RAC mode incorrectly processed DBZ-5439
- Deprecate legacy topic selector for all connectors DBZ-5457
- Remove the dependency of JdbcConnection on DatabaseSchema DBZ-5470
- Missing the regex properties validation before start connector of DefaultRegexTopicNamingStrategy DBZ-5471
- Create Index DDL fails to parse when using TABLESPACE clause with quoted identifier DBZ-5472
- Outbox doesn’t check array consistency properly when it determines its schema DBZ-5475
- Misleading statistics written to the log DBZ-5476
- Remove SQL Server SourceTimestampMode DBZ-5477
- Debezium connector task didn’t retry when failover in mongodb 5 DBZ-5479
- Better error reporting for signal table failures DBZ-5484
- Oracle DATADUMP DDL cannot be parsed DBZ-5488
- Upgrade PostgreSQL driver to 42.4.1 DBZ-5493
- Mysql connector parser the ddl statement failed when including keyword "buckets" DBZ-5499
- duplicate call to config.validateAndRecord() in RedisDatabaseHistory DBZ-5506
- DDL statement couldn’t be parsed : mismatched input 'ENGINE' DBZ-5508
- Use “database.dbnames” in SQL Server docs DBZ-5516
- LogMiner DML parser incorrectly interprets concatenation operator inside quoted column value DBZ-5521
- Mysql Connector DDL Parser does not parse all privileges DBZ-5522
- CREATE TABLE with JSON-based CHECK constraint clause causes MultipleParsingExceptions DBZ-5526
- Disable preferring DDL before logical schema in history recovery DBZ-5535
- EmbeddedEngine should initialize Connector using SourceConnectorContext DBZ-5534
- Support EMPTY column identifier DBZ-5550
- Use TCCL as the default classloader to load interface implementations DBZ-5561
- max.queue.size.in.bytes is invalid DBZ-5569
- Language type for listings in automatic topic creation DBZ-5573
- Upgrade mysql-binlog-connector-java library version DBZ-5574
- Vitess: Handle VStream close unexpectedly DBZ-5579
- Error when parsing alter sql DBZ-5587
- Field validation errors are misleading for positive, non-zero expectations DBZ-5588
- Mysql connector can’t handle the case-sensitive of rename/change column statement DBZ-5589
- LIST_VALUE_CLAUSE not allowing TIMESTAMP LITERAL DBZ-5592
- Oracle DDL does not support comments on materialized views DBZ-5595
- Oracle DDL does not support DEFAULT ON NULL DBZ-5605
- Datatype mdsys.sdo_geometry not supported DBZ-5609
Altogether, a total of 107 issues were fixed for this release.
A big thank you to all the contributors from the community who worked on this release: Ahmed ELJAMI, Alexander Schwartz, Alexey Loubyansky, Anisha Mohanty, Bob Roldan, Chris Cranford, Claus Ibsen, Debjeet Sarkar, Gabor Andras, Gunnar Morling, Hang Ruan, Harvey Yue, Henry Cai, Inki Hwang, Jakub Cechacek, Jannik Steinmann, Jeremy Ford, Jiri Novotny, Jiri Pechanec, Katerina Galieva, Marek Winkler, Martin Medek, Nitin Chhabra, Phạm Ngọc Thắng, Robert Roldan, Ruud H.G. van Tol, Seo Jae-kwon, Sergei Morozov, Stefan Miklosovic, Vadzim Ramanenka, Vivek Wassan, Vojtech Juranek, Zhongqiang Gong, 合龙 张, 崔世杰, and 민규 김!
What’s next?
With the release of Debezium 2.0 Beta2, we’re in the home stretch toward 2.0.0.Final. The community should expect a CR1 by the end of September and 2.0.0.Final released by the middle of October.
In addition, our very own Gunnar Morling and I will be guests on the upcoming Quarkus Insights podcast, episode #103. We will be discussing Debezium and Quarkus, how Debezium leverages the power of Quarkus, a virtual how-to on embedding Debezium in a Quarkus-based application, and all of the new features in Debezium 2.0. Be sure to check out the podcast and let us know what you think!
About Debezium
Debezium is an open source distributed platform that turns your existing databases into event streams, so applications can see and respond almost instantly to each committed row-level change in the databases. Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, so your application can be stopped and restarted at any time and can easily consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely. Debezium is open source under the Apache License, Version 2.0.
Get involved
We hope you find Debezium interesting and useful, and want to give it a try. Follow us on Twitter @debezium, chat with us on Zulip, or join our mailing list to talk with the community. All of the code is open source on GitHub, so build the code locally and help us improve our existing connectors and add even more connectors. If you find problems or have ideas for how we can improve Debezium, please let us know or log an issue.