We are pleased to announce the release of Debezium 2.6.0.Beta1. We enter the home stretch with this release, which is packed with improvements, enhancements, bug fixes, and, yes, a brand-new Db2 connector for iSeries. There is a lot to cover in this release, so let's dive right in!

Breaking changes

The team aims to avoid any potential breaking changes between minor releases; however, such changes are sometimes inevitable.


In older versions of Debezium, users were required to manually install the ojdbc8.jar JDBC driver. With 2.6, the Oracle connector now bundles the JDBC driver, so manual installation is no longer necessary (DBZ-7364).

We’ve also updated the driver version; after upgrading to Debezium 2.6, please verify that you do not have multiple versions of the driver installed (DBZ-7365).

Container Images

The handling of the MAVEN_DEP_DESTINATION environment variable has changed in the connect-base container image, which is the basis for debezium/connect. It is no longer used for downloading all dependencies, including connectors, but only for general purpose Maven Central located dependencies (DBZ-7551). If you were using custom images that relied on this environment variable, your image build steps may require modifications.

Improvements and changes

Db2 for iSeries connector

Debezium 2.6 introduces a brand-new connector for IBM fans to stream changes from Db2 on iSeries/AS400 using the IBM iJournal system. This connector is the result of a multi-year development effort from the community, and we’re pleased that the community has allowed it to be distributed under the Debezium umbrella.

The new connector can be obtained from Maven Central using the following coordinates or a direct download.


The documentation for this new connector is still a work-in-progress. If you have any questions, please be sure to reach out to the team on Zulip or the mailing list.

Incremental snapshot row-value constructors for PostgreSQL

The PostgreSQL driver supports a SQL syntax called a row-value constructor, using the ROW() function. This allows a query to express predicate conditions more efficiently when working with multi-column primary keys that have a suitable index. The incremental snapshot process is an ideal candidate for the ROW() function, as the process involves issuing a series of SELECT statements to fetch data in chunks. Each statement, aka chunk query, should be as efficient as possible to minimize the overhead of these queries and maximize the throughput of your WAL changes to your topics.

There are no specific changes needed, but the query issued for PostgreSQL incremental snapshots has been adjusted to take advantage of this new syntax, and therefore users who utilize incremental snapshots should see performance improvements.

An example of the old query for a simple table might look like this:

SELECT *
  FROM users
 WHERE (a = 10 AND (b > 2 OR b IS NULL)) OR (a > 10) OR (a IS NULL)
 ORDER BY a, b LIMIT 1024

The new implementation constructs this query using the ROW() function as follows:

SELECT *
  FROM users
 WHERE row(a,b) > row(10,2)
 ORDER BY a, b LIMIT 1024

We’d be interested in any feedback on this change, and what performance improvements are observed.
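The row-value comparison follows lexicographic (tuple) ordering, which is easy to sanity-check outside the database. The following Python sketch (key values are illustrative, and NULL handling is set aside) shows that the compact row-value form agrees with the expanded predicate:

```python
# Sanity check: SQL row(a, b) > row(10, 2) uses lexicographic ordering,
# matching the expanded form (a > 10) OR (a = 10 AND b > 2) for
# non-NULL key values.

def expanded_predicate(a, b):
    """The long-hand chunk-query condition (NULLs aside)."""
    return a > 10 or (a == 10 and b > 2)

def row_value_predicate(a, b):
    """Python tuples compare lexicographically, like SQL row values."""
    return (a, b) > (10, 2)

# Both forms agree on every non-NULL key pair in this range.
for a in range(8, 13):
    for b in range(0, 5):
        assert expanded_predicate(a, b) == row_value_predicate(a, b)
print("predicates agree")
```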

Signal table watermark metadata

An incremental snapshot process requires a signal table to write open/close markers to coordinate the change boundaries with the data recorded in the transaction logs, unless you’re using MySQL’s read-only flavor. In some cases, users would like to track the snapshot window, knowing when the window was opened and closed.

Starting with Debezium 2.6, the data column in the signal table is populated with the time window details, allowing users to determine when the window was opened and closed. The following shows the contents of the data column for each of the two signal markers:

Window Open Marker
{"openWindowTimestamp": "<window-open-time>"}
Window Close Marker
{"openWindowTimestamp": "<window-open-time>", "closeWindowTimestamp": "<window-close-time>"}
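As a sketch of how this metadata might be consumed, the window duration can be derived from the close marker's data column. The timestamp format below is an assumption (ISO-8601) for illustration only; check the actual values written by your connector:

```python
import json
from datetime import datetime

# Hypothetical close-marker payload; the real timestamp format may differ
# (ISO-8601 is assumed here purely for illustration).
data = ('{"openWindowTimestamp": "2024-03-01T10:15:00+00:00", '
        '"closeWindowTimestamp": "2024-03-01T10:15:02+00:00"}')

marker = json.loads(data)
opened = datetime.fromisoformat(marker["openWindowTimestamp"])
closed = datetime.fromisoformat(marker["closeWindowTimestamp"])
duration = (closed - opened).total_seconds()
print(f"window was open for {duration} seconds")  # prints 2.0 seconds
```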

Oracle Redo SQL per event with LogMiner

We have improved the Oracle connector’s event structure for inserts, updates, and deletes to optionally contain the SQL that was reconstructed by LogMiner in the source information block. This is an opt-in feature that you must explicitly enable, as it can easily more than double the size of your existing event payload.

To enable the inclusion of the REDO SQL as part of the change event, add the following connector configuration:

"log.mining.include.redo.sql": "true"

With this option enabled, the source information block contains a new field redo_sql, as shown below:

"source": {
  "redo_sql": "INSERT INTO \"DEBEZIUM\".\"TEST\" (\"ID\",\"DATA\") values ('1', 'Test');"
}

This feature cannot be used with lob.enabled set to true due to how LogMiner reconstructs the SQL related to CLOB, BLOB, and XML data types. If the above configuration is added with lob.enabled set to true, the connector will start with an error about this misconfiguration.

Oracle LogMiner transaction buffer improvements

A new delay-strategy for transaction registration has been added when using LogMiner. This strategy effectively delays the creation of the transaction record in the buffer until we observe the first captured change for that transaction.

For users who use the Infinispan cache or who have set lob.enabled to true, this delayed strategy cannot be used due to how specific operations are handled in these two modes of the connector.

Delaying transaction registration has a number of benefits, which include:

  • Reduced overhead on the transaction cache, especially in highly concurrent transaction scenarios.

  • Avoidance of long-running transactions that have no changes captured by the connector.

  • More efficient advancement of the low-watermark SCN in the offsets in specific scenarios.

We are looking into extending this change to Infinispan-based users in a future build; however, due to the nature of how lob.enabled works with LogMiner, this feature won’t be possible for that use case.

Improved event timestamp precision

Debezium 2.6 introduces a new community-requested feature to improve the precision of timestamps in change events. Users will now notice the addition of four new fields, two at the envelope level and two in the source information block, as shown below:

  "source": {
    "ts_us": "1559033904863123",
    "ts_ns": "1559033904863123000"
  },
  "ts_us": "1580390884335451",
  "ts_ns": "1580390884335451325"

The envelope values will always provide both microsecond (ts_us) and nanosecond (ts_ns) precision, while the values in the source information block may be truncated to a lower precision if the source database does not provide that level of precision.
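The three precisions are simple multiples of one another, so a consumer can always down-convert safely. A minimal sketch, using the envelope values from the example above:

```python
# Envelope example values from above: nanoseconds carry the full
# precision, and the coarser fields are integer truncations of it.
ts_ns = 1580390884335451325
ts_us = ts_ns // 1_000          # microseconds (the new ts_us field)
ts_ms = ts_us // 1_000          # milliseconds (the pre-existing ts_ms)

assert ts_us == 1580390884335451
print(ts_ms)  # prints 1580390884335
```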

Informix appends LSN to Transaction Identifier

Informix only increases the transaction identifier when there are concurrent transactions; otherwise, the value remains identical across sequential transactions. This can prove difficult for users who want to use the transaction metadata to order change events in a post-processing step.

Debezium 2.6 for Informix will now append the log sequence number (LSN) to the transaction identifier so that users can easily sort change events based on the transaction metadata. The transaction identifier field will now use the format <id>:<lsn>. This change affects transaction metadata events and the source information block for change events, as shown below:

Transaction Begin Event
  "status": "BEGIN",
  "id": "571:53195829",
Transaction End Event
  "status": "END",
  "id": "571:53195832",
Change Events
  "source": {
    "id": "571:53195832"
  }
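For that post-processing step, the composite identifier can be split on the colon to sort events numerically by LSN. A minimal sketch assuming the <id>:<lsn> format described above (the event payloads are illustrative):

```python
# Sort change events by the numeric LSN suffix of the <id>:<lsn>
# transaction identifier.
events = [
    {"id": "571:53195832"},
    {"id": "571:53195829"},
]

def lsn_of(event):
    # Split "571:53195832" into the transaction id and the LSN suffix.
    tx_id, lsn = event["id"].split(":")
    return int(lsn)

events.sort(key=lsn_of)
print([e["id"] for e in events])  # lowest LSN first
```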

New Arbitrary-based payload formats

While it’s common for users to utilize serialization based on JSON, Avro, Protobuf, or CloudEvents, there may be reasons to use a simpler format. Thanks to a community contribution as part of DBZ-7512, Debezium can be configured to use two new formats called simplestring and binary.

The simplestring and binary formats are configured in Debezium Server using the debezium.format configuration properties. For simplestring, the payload is serialized into the topic as a single STRING data type; for binary, it is serialized as BYTES using a byte[] (byte array).
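For example, a Debezium Server configuration selecting the new format for the message value might look like the following; the property name follows the usual debezium.format.value convention, so verify it against the Debezium Server documentation:

```properties
debezium.format.value=simplestring
```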

Oracle LogMiner Hybrid Mining Strategy

Debezium 2.6 also introduces a new Oracle LogMiner mining strategy called hybrid, which can be enabled by setting the configuration property log.mining.strategy to the value hybrid. This new strategy is designed to support all schema evolution features of the default mining strategy while taking advantage of all the performance optimizations of the online catalog strategy.

The main problem with the online_catalog strategy is that if a mining step observes a schema change and a data change together, LogMiner is incapable of reconstructing the SQL correctly, which results in the table name appearing as OBJ# xxxxxx or the columns represented as COL1, COL2, and so on. To avoid this while using the online catalog strategy, users are advised to perform schema changes in a lock-step pattern so that no mining step observes both a schema change and a data change together; however, this is not always feasible.

The new hybrid strategy works by tracking a table’s object id at the database level and then using this identifier to look up the schema associated with the table from Debezium’s relational table model. In short, this allows Debezium to do what Oracle LogMiner is unable to do in these specific corner cases. The table name will be taken from the relational model’s table name and columns will be mapped by column position.

Unfortunately, Oracle does not provide a way to reconstruct failed SQL operations for CLOB, BLOB, and XML data types. This means that the new hybrid strategy cannot be used with lob.enabled set to true. If a connector is started using the hybrid strategy with lob.enabled set to true, the connector will fail to start and report a configuration failure.
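Enabling the new strategy is a single connector configuration change, shown here alongside the incompatible option for clarity:

```json
"log.mining.strategy": "hybrid",
"lob.enabled": "false"
```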

Other changes

Altogether, 86 issues were fixed in this release:

  • MySQL config values validated twice DBZ-2015

  • PostgreSQL connector doesn’t restart properly if database is not reachable DBZ-6236

  • NullPointerException in MongoDB connector DBZ-6434

  • Tests in RHEL system testsuite throw errors without ocp cluster DBZ-7002

  • Move timeout configuration of MongoDbReplicaSet into Builder class DBZ-7054

  • Several Oracle tests fail regularly on Testing Farm infrastructure DBZ-7072

  • Remove obsolete MySQL version from TF DBZ-7173

  • Add Oracle 23 to CI test matrix DBZ-7195

  • Refactor sharded mongo ocp test DBZ-7221

  • Implement Snapshotter SPI Oracle DBZ-7302

  • Align snapshot modes for SQLServer DBZ-7303

  • Update snapshot mode documentation DBZ-7309

  • Cassandra-4: Debezium connector stops producing events after a schema change DBZ-7363

  • Upgrade ojdbc8 to DBZ-7365

  • Document relation between column type and serializers for outbox DBZ-7368

  • Callout annotations rendered multiple times in downstream User Guide DBZ-7418

  • Test testEmptyChangesProducesHeartbeat tends to fail randomly DBZ-7453

  • Align snapshot modes for PostgreSQL, MySQL, Oracle DBZ-7461

  • PreparedStatement leak in Oracle ReselectColumnsProcessor DBZ-7479

  • Allow special characters in signal table name DBZ-7480

  • Document toggling MariaDB mode DBZ-7487

  • Poor snapshot performance with new reselect SMT DBZ-7488

  • Debezium Oracle Connector ParsingException on XMLTYPE with lob.enabled=true DBZ-7489

  • Add informix to main repository CI workflow DBZ-7490

  • Db2ReselectColumnsProcessorIT does not clean-up after test failures DBZ-7491

  • Disable Oracle Integration Tests on GitHub DBZ-7494

  • Unify and adjust thread time outs DBZ-7495

  • Completion callback called before connector stop DBZ-7496

  • Add "IF [NOT] EXISTS" DDL support for Oracle 23 DBZ-7498

  • Deployment examples show attribute name instead of its value DBZ-7499

  • Fix MySQL 8 event timestamp resolution logic error where fallback to seconds occurs erroneously for non-GTID events DBZ-7500

  • Remove incubating from Debezium documentation DBZ-7501

  • Add ability to parse Map<String, Object> into ConfigProperties DBZ-7503

  • LogMinerHelperIT test shouldAddCorrectLogFiles randomly fails DBZ-7504

  • Support Oracle 23 SELECT without FROM DBZ-7505

  • Add Oracle 23 Annotation support for CREATE/ALTER TABLE statements DBZ-7506

  • TestContainers MongoDbReplicaSetAuthTest randomly fails DBZ-7507

  • MySQl ReadOnlyIncrementalSnapshotIT testStopSnapshotKafkaSignal fails randomly DBZ-7508

  • Add Informix to Java Outreach DBZ-7510

  • Disable parallel record processing in DBZ server tests against Apicurio DBZ-7515

  • Add Start CDC hook in Reselect Columns PostProcessor Tests DBZ-7516

  • Remove the unused 'connector' parameter in the createSourceTask method in EmbeddedEngine.java DBZ-7517

  • Update commons-compress to 1.26.0 DBZ-7520

  • Promote JDBC sink from Incubating DBZ-7521

  • Allow to download containers also from Docker Hub DBZ-7524

  • Update rocketmq version DBZ-7525

  • signalLogWithEscapedCharacter fails with pgoutput-decoder DBZ-7526

  • Move RocketMQ dependency to debezium server DBZ-7527

  • Rework shouldGenerateSnapshotAndContinueStreaming assertions to deal with parallelization DBZ-7530

  • Multi-threaded snapshot can enqueue changes out of order DBZ-7534

  • AsyncEmbeddedEngineTest#testTasksAreStoppedIfSomeFailsToStart fails randomly DBZ-7535

  • MongoDbReplicaSetAuthTest fails randomly DBZ-7537

  • SQLServer tests taking long time due to database bad state DBZ-7541

  • Explicitly import jakarta dependencies that are excluded via glassfish filter DBZ-7545

  • ReadOnlyIncrementalSnapshotIT#testStopSnapshotKafkaSignal fails randomly DBZ-7553

  • Include RocketMQ and Redis container output into test log DBZ-7557

  • Allow XStream error ORA-23656 to be retried DBZ-7559

  • Numeric default value decimal scale mismatch DBZ-7562

  • Wait for Redis server to start DBZ-7564

  • Documentation conflict DBZ-7565

  • Fix null event timestamp possible from FORMAT_DESCRIPTION and PREVIOUS_GTIDS events in MySqlStreamingChangeEventSource::setEventTimestamp DBZ-7567

  • AsyncEmbeddedEngineTest.testExecuteSmt fails randomly DBZ-7568

  • Debezium fails to compile with JDK 21 DBZ-7569

  • Upgrade PostgreSQL driver to 42.6.1 DBZ-7571

  • Upgrade Kafka to 3.7.0 DBZ-7574

  • Redis tests fail randomly with JedisConnectionException: Unexpected end of stream DBZ-7576

  • RedisOffsetIT.testRedisConnectionRetry fails randomly DBZ-7578

  • Oracle connector always brings OLR dependencies DBZ-7579

  • Correct JDBC connector dependencies DBZ-7580

  • Improved logging in case of PostgreSQL failure DBZ-7581

  • Unavailable Toasted HSTORE Json Storage Mode column causes serialization failure DBZ-7582

  • Reduce debug logs on tests DBZ-7588

  • Server SQS sink doesn’t support quick profile DBZ-7590

  • Oracle Connector REST Extension Tests Fail DBZ-7597

  • Serialization of XML columns with NULL values fails using Infinispan Buffer DBZ-7598

Outlook & What’s next?

The next few weeks will be focused primarily on stability and bug fixes. We expect to release Debezium 2.6.0.Final in just under three weeks, so we encourage you to download and test the latest Beta and provide your feedback.

If you have any questions or are interested in what the roadmap holds, not only for 2.6 but also for the road to Debezium 3.0 later this fall, we encourage you to take a look at our road map. If you have any suggestions or ideas, please feel free to get in touch with us on our mailing list or in our Zulip chat.

And in closing, our very own Mario Vitale will be speaking at Open Source Day 2024, where he will talk about Dealing with data consistency - a CDC approach to dual writes. Please be sure to check out his session on Day 1 as a part of the Beta track at 10:45am!

Until next time…​

Chris Cranford

Chris is a software engineer at Red Hat. He previously was a member of the Hibernate ORM team and now works on Debezium. He lives in North Carolina just a few hours from Red Hat towers.


About Debezium

Debezium is an open source distributed platform that turns your existing databases into event streams, so applications can see and respond almost instantly to each committed row-level change in the databases. Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, so your application can be stopped and restarted at any time and can easily consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely. Debezium is open source under the Apache License, Version 2.0.

Get involved

We hope you find Debezium interesting and useful, and want to give it a try. Follow us on Twitter @debezium, chat with us on Zulip, or join our mailing list to talk with the community. All of the code is open source on GitHub, so build the code locally and help us improve our existing connectors and add even more connectors. If you find problems or have ideas how we can improve Debezium, please let us know or log an issue.