Despite summer being well underway, Debezium contributors remain hard at work, and it’s my pleasure to announce the next preview release of the Debezium 2.4 series, 2.4.0.Alpha2.
This preview release includes a mix of improvements, bug fixes, and new features that are available for the Debezium community to test and offer feedback on. Some highlights from this release include ad-hoc blocking snapshots, source-to-sink column name propagation, support for alternative MySQL drivers, and support for all Cassandra connectors in Debezium Server. Let’s take a few moments to dive into these and more.
Breaking changes
Debezium Server and Cassandra connectors
For Cassandra connector users who may have been using Debezium Server or who may have wanted to use Debezium Server, we previously only shipped Cassandra 4 with the Debezium Server distribution. With Debezium 2.4, we now include all three Cassandra connector variants with the distribution, meaning that Cassandra 3 and DSE can now be used directly.
However, for this to work, a new environment variable, EXTRA_CONNECTOR, was introduced to specify which Cassandra connector variant Debezium Server should use. This means that if you were using Cassandra 4 with Debezium Server, you must include this environment variable when upgrading so that your existing configuration continues to work as it did in prior versions.
This new environment variable should be set to dse, cassandra-3, or cassandra-4, depending on the Cassandra version you intend to use for your Debezium Server source connector.
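For example, here is a minimal sketch of running the Debezium Server container image with the Cassandra 4 variant selected (the image tag and configuration mount path are assumptions based on a standard Debezium Server container setup; adjust them to match your deployment):

# Select the Cassandra 4 connector variant via the new environment variable
docker run -it --rm \
  -e EXTRA_CONNECTOR=cassandra-4 \
  -v $PWD/conf:/debezium/conf \
  quay.io/debezium/server:2.4.0.Alpha2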
MySQL BIGINT precision changed
The Debezium connector for MySQL was not properly setting the precision for BIGINT data types when the connector was configured with bigint.unsigned.handling.mode set to precise. Unfortunately, this led to a situation where the schema for such fields did not include the correct precision value.
Debezium 2.4 includes DBZ-6714, which fixes the incorrect precision for such fields. This can lead to schema incompatibilities when using a schema registry, so you may need to adjust your compatibility settings or take other action if you require strict compatibility rules.
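For reference, the affected setting is the following connector property (shown in isolation; the rest of the connector configuration is omitted):

# Emit unsigned BIGINT values precisely; with DBZ-6714, the emitted
# schema for these fields now carries the correct precision value
bigint.unsigned.handling.mode=precise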
Oracle snapshot and query fetch sizes increased
Debezium 2.4 changes the default values of the snapshot.fetch.size and query.fetch.size Oracle connector configuration properties. Previously, these properties defaulted to 2000; however, thanks to a community contributor, it was identified that these values are likely too low for production use.
With this release, the Oracle connector now uses a default of 10000 for both properties, which should improve throughput for most users who were not explicitly setting these values. If you were already using custom values for these settings in your connector configurations, you will not see any change in behavior; only users who relied on the old defaults will notice the new values being used.
These configuration properties are meant to act as tuning knobs, as a specific configuration for one JDBC environment may not work as well in a different environment. While we believe this change will have no negative impact, if you do notice a drop in performance, you can add these properties to your connector configuration and set them to their previous default of 2000.
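If you do need to revert to the previous behavior, a minimal configuration excerpt looks like this:

# Restore the pre-2.4 Oracle connector fetch size defaults
snapshot.fetch.size=2000
query.fetch.size=2000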
Vitess incorrectly mapped _bin columns
For collations that end with the _bin designator, Vitess maps these to a data type of VARBINARY. As a result, the Vitess connector inferred that such columns should be emitted as binary data; however, for character-based columns that use these collations, this was incorrect.
Debezium 2.4 now properly emits character-based columns that are collated with a _bin designator as string-based data rather than binary data. This means that if you are using a schema registry, you may observe some schema incompatibilities and may need to adjust your compatibility settings or take other actions to mitigate this change.
New Features
Ad-hoc blocking snapshots
Incremental snapshots were first introduced nearly two years ago in Debezium 1.6 and have remained quite popular in the community for handling a variety of re-snapshot use cases. However, there are some use cases where the intertwined nature of read events with creates, updates, and deletes may be less than ideal, or even unsupported by some consumer applications. For those use cases, Debezium 2.4 introduces ad-hoc blocking snapshots.
An ad-hoc blocking snapshot works in a similar way to an ad-hoc incremental snapshot, but with one major difference. The snapshot is still triggered by sending a signal to Debezium; however, when the connector processes the signal, streaming is put on hold while the snapshot runs. This means you won’t receive a series of read events interwoven with create, update, or delete events. It also means the snapshot is processed much like a traditional snapshot, so throughput should generally be higher than with incremental snapshots.
Be aware that an ad-hoc blocking snapshot puts the reading of the transaction logs on hold while the snapshot is performed. This means the same requirements that a traditional snapshot has on transaction log availability also apply when using this type of ad-hoc snapshot. When streaming resumes, if a transaction log that is needed has since been removed, the connector will raise an error and stop.
The signal to initiate an ad-hoc blocking snapshot is very similar to its ad-hoc incremental snapshot counterpart. The following signal shows the payload to snapshot a specific table with a condition, using the new blocking snapshot rather than an incremental snapshot:
{
"type": "execute-snapshot",
"data": {
"data-collections": ["public.my_table"],
"type": "BLOCKING", (1)
"additional-condition": "last_update_date >= '2023-01-01'"
}
}
(1) The use of BLOCKING rather than INCREMENTAL differentiates the two ad-hoc snapshot modes.
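Ad-hoc snapshot signals are typically delivered through the connector’s signaling table. As a sketch, the signal above could be sent with the following SQL (the table name debezium_signal is an assumption; use the table configured in signal.data.collection):

-- Trigger an ad-hoc blocking snapshot of public.my_table
-- (the id value is arbitrary and only needs to be unique)
INSERT INTO debezium_signal (id, type, data)
VALUES (
  'blocking-snapshot-1',
  'execute-snapshot',
  '{"data-collections": ["public.my_table"], "type": "BLOCKING", "additional-condition": "last_update_date >= ''2023-01-01''"}'
);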
Source-to-sink column name propagation
Normally, column names map directly to field names and vice versa when events are consumed by sink connectors such as a JDBC connector. However, there are situations where the serialization technology, such as Avro, has very specific rules about field naming conventions. When a column’s name in a database table conflicts with the serialization format’s naming rules, Debezium renames the field in the event so that it adheres to those rules. This often means that a field is prefixed with underscores or has invalid characters replaced with underscores.
This can create problems for certain types of sinks, such as a JDBC sink connector, because the sink cannot easily deduce the original column name for the destination table, nor can it adequately map between the event’s field name and a column name when they differ. This typically means users must use transformation chains on the sink side to reconstruct the event’s fields with names that match the source.
Debezium 2.4 introduces a way to minimize, and potentially avoid, this problem entirely by propagating the original column name as a field schema parameter, much in the same way that we do for data types, precision, scale, and length. The schema parameter __debezium.source.column.name now includes the original column name when column or data type propagation is enabled.
The Debezium JDBC sink connector already works with column and data type propagation, allowing the sink connector to more accurately deduce column types, length, precision, and scale. With this new feature, the JDBC sink connector automatically uses the column name from this parameter when it is provided, guaranteeing that the destination table is created with the same column names as the source, even when using Avro or similar serialization. This means no transformations are needed when using the Debezium JDBC sink connector.
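To take advantage of this, column or data type propagation must be enabled on the source connector. A minimal excerpt might look like the following (the regular expression is illustrative; scope it to the fully-qualified column names you want to propagate):

# Attach source column metadata, including __debezium.source.column.name,
# to the schema of columns matching the expression
column.propagate.source.type=inventory.customers.*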
Alternative MySQL JDBC drivers
In order to use IAM authentication on AWS, a special MySQL driver is required. With Debezium 2.4, you can now reference this specific driver in the connector configuration, and the connector will use it instead of the default driver shipped with the connector.
As an example, to connect using IAM authentication on AWS, the following configuration is needed:
database.jdbc.driver=software.aws.rds.jdbc.mysql.Driver
database.jdbc.protocol=jdbc:mysql:aws
The database.jdbc.driver property specifies the driver that should be loaded by the connector and used to communicate with the MySQL database. The database.jdbc.protocol property is a supplemental configuration option that may not be required in all contexts. It defaults to jdbc:mysql, but since AWS requires jdbc:mysql:aws, this property allows you to specify that derivative in the configuration.
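Putting this together, here is a minimal sketch of a connector registration using the AWS driver (the hostname, server id, and topic names are placeholders, and the AWS JDBC driver jar must be available on the connector’s classpath):

{
  "name": "inventory-connector-aws",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mydb.cluster-abc123.us-east-1.rds.amazonaws.com",
    "database.port": "3306",
    "database.user": "debezium",
    "database.server.id": "184054",
    "database.jdbc.driver": "software.aws.rds.jdbc.mysql.Driver",
    "database.jdbc.protocol": "jdbc:mysql:aws",
    "topic.prefix": "inventory",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}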
We’d love to hear your feedback on whether something like this might be useful for other scenarios.
Other fixes
In addition, there were quite a number of stability and bug fixes that made it into this release. These include the following:
- Switch tracing to OpenTelemetry DBZ-2862
- Connector drop down causes a scroll bar DBZ-5421
- Provide outline for drawer component showing connector details DBZ-5831
- Modify scroll for the running connector component DBZ-5832
- Connector restart regression DBZ-6213
- Highlight information about how to configure the schema history topic to store data only for intended tables DBZ-6219
- Document Optimal MongoDB Oplog Config for Resiliency DBZ-6455
- JDBC Schema History: When the table name is passed as dbName.tableName, the connector does not start DBZ-6484
- Update the Edit connector UI to incorporate the feedback received from team in demo DBZ-6514
- Support blocking ad-hoc snapshots DBZ-6566
- Add new parameters to RabbitMQ consumer DBZ-6581
- Document read preference changes in 2.4 DBZ-6591
- Oracle DDL parser does not properly detect end of statement when comments obfuscate the semicolon DBZ-6599
- Received an unexpected message type that does not have an 'after' Debezium block DBZ-6637
- When Debezium Mongodb connector encounter authentication or under privilege errors, the connection between debezium and mongodb keeps going up. DBZ-6643
- Log appropriate error when JDBC connector receive SchemaChange record DBZ-6655
- Send tombstone events when partition queries are finished DBZ-6658
- Snapshot will not capture data when signal.data.collection is present without table.include.list DBZ-6669
- Retriable operations are retried infinitely since error handlers are not reused DBZ-6670
- Oracle DDL parser does not support column visibility on ALTER TABLE DBZ-6677
- Propagate source column name and allow sink to use it DBZ-6684
- Partition duplication after rebalances with single leader task DBZ-6685
- JDBC Sink Connector Fails on Loading Flat Data Containing Struct Type Fields from Kafka DBZ-6686
- SQLSyntaxErrorException using Debezium JDBC Sink connector DBZ-6687
- Should use topic.prefix rather than connector.server.name in MBean namings DBZ-6690
- CDC - Debezium x RabbitMQ - Debezium Server crashes when an UPDATE/DELETE on source database (PostgreSQL) DBZ-6691
- Missing operationTime field on ping command when executed against Atlas DBZ-6700
- MongoDB SRV protocol not working in Debezium Server DBZ-6701
- Disable jdk-outreach-workflow.yml in forked personal repo DBZ-6702
- Custom properties step not working correctly in validation of the properties added by user DBZ-6711
- Add tzdata-java to UI installation Dockerfile DBZ-6713
- Refactor EmbeddedEngine::run method DBZ-6715
- Oracle fails to process a DROP USER DBZ-6716
- Support alternative JDBC drivers in MySQL connector DBZ-6727
- Oracle LogMiner mining distance calculation should be skipped when upper bounds is not within distance DBZ-6733
- Add STOPPED and RESTARTING connector states to testing library DBZ-6734
- MariaDB: Unparseable DDL statement (ALTER TABLE IF EXISTS) DBZ-6736
- Update Quarkus to 3.2.3.Final DBZ-6740
- Decouple Debezium Server and Extension Quarkus versions DBZ-6744
- SingleProcessor remove redundant filter logic DBZ-6745
- MySQL dialect does not properly recognize non-default value longblob types due to typo DBZ-6753
- Add a new parameter for selecting the db index when using Redis Storage DBZ-6759
- Postgres tests for toasted byte array and toasted date array fail with decoderbufs plugin DBZ-6767
- Table schemas should be updated for each shard individually DBZ-6775
- Notifications and signals leaks between MBean instances when using JMX channels DBZ-6777
- Oracle XML column types are not properly resolved when adding XMLTYPE column during streaming DBZ-6782
- Bump the MySQL binlog client version to 0.28.1 which includes significant GTID event performance improvements DBZ-6783
- Add new Redis Sink connector parameter description to the documentation DBZ-6784
- Upgrade Kafka to 3.5.1 DBZ-6785
Altogether, 62 issues were fixed for this release. A big thank you to all the contributors from the community who worked on this release: Bob Roldan, Chao Tian, Chris Cranford, Chris Egerton, David Remy, Fai Ho Fu, Gurps Bassi, Harvey Yue, Indra Shukla, Jakub Cechacek, Jiri Pechanec, Jochen Schalanda, Mario Fiore Vitale, Massimo Fortunat, Nancy Xu, Nikhil Benesch, Paul Cheung, Robert Roldan, Ronak Jain, Ryan van Huuksloot, Sergey Eizner, Thomas Thornton, Vojtech Juranek, Yanjie Wang, Yashashree Chopada, david remy, and ibnubay!
What’s next?
The Debezium 2.4 series is already packed with lots of new features, and we’re only scratching the surface. We have more in store, including the new Oracle OpenLogReplicator adapter coming with Debezium 2.4 Alpha3 next week. After that, we’ll begin to wind down development and shift our focus to the beta and release candidate cycle, targeting the end of September for the Debezium 2.4 final release.
Don’t forget about the Debezium Community Event, which I shared with you on the mailing list. If you have any ideas or suggestions, I’d love your feedback. We will be making an announcement in the next two weeks about the date/time and agenda.
Additionally, if you’re going to Current 2023 this year in San Jose, I’d love to meet up and discuss your experiences with Debezium. I’ll be there doing a talk on event-driven design with Debezium and Apicurio with my good friends Hans-Peter Grahsl and Carles Arnal. If you’re interested in more details, feel free to drop me a line in chat, on the mailing list or directly via email.
As always, if you have any ideas or suggestions, you can also get in touch with us on the mailing list or our chat. Until next time, don’t be a stranger and stay cool out there!
About Debezium
Debezium is an open source distributed platform that turns your existing databases into event streams, so applications can see and respond almost instantly to each committed row-level change in the databases. Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, so your application can be stopped and restarted at any time and can easily consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely. Debezium is open source under the Apache License, Version 2.0.
Get involved
We hope you find Debezium interesting and useful, and want to give it a try. Follow us on Twitter @debezium, chat with us on Zulip, or join our mailing list to talk with the community. All of the code is open source on GitHub, so build the code locally and help us improve our existing connectors and add even more connectors. If you find problems or have ideas for how we can improve Debezium, please let us know or log an issue.