Today, I am pleased to announce the third alpha release in the 2.2 release stream, Debezium 2.2.0.Alpha3.
This release includes a plethora of bug fixes, improvements, breaking changes, and a number of new features including, but not limited to, optional parallel snapshots, server-side MongoDB change stream filtering, surrogate keys for incremental snapshots, a new Cassandra connector for Cassandra Enterprise, and much more.
Let’s take a moment and dive into some of these new features, improvements, and breaking changes.
Breaking Changes
We typically try to avoid any breaking changes, even during minor releases such as this; however, sometimes breaking changes are inevitable given the circumstances. Debezium 2.2.0.Alpha3 includes one breaking change:
PostgreSQL zoned date-time data types truncated
It was identified (DBZ-6163) that PostgreSQL timezone-based column values with zero (0) milli- and microsecond parts were being serialized incorrectly: the emitted string included neither the millisecond nor the microsecond portion of the time.
This does not create any data loss!
What’s important to note is that prior to this release, consumers had to be prepared to parse these string-based time values without the presence of a milli- or microsecond part. In effect, events had an inconsistent pattern: some included the milli- and microsecond portions, while others did not if their source value had 0 milliseconds or 0 microseconds.
These string-based time values will now be emitted consistently, padded with zeroes (0) for the milli- and microsecond parts, even when the source value has neither milliseconds nor microseconds.
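As a hypothetical illustration (the field name and values below are invented, and the exact format depends on the connector's time handling configuration), a timestamp with zero fractional seconds might previously have been serialized without its fractional part, whereas it is now padded:

```json
// previously emitted (fractional seconds omitted when zero):
{ "created_at": "2023-03-17T10:15:30Z" }

// now emitted (padded with zeroes for consistency):
{ "created_at": "2023-03-17T10:15:30.000000Z" }
```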
Optional parallel snapshots
Debezium’s relational database initial snapshot process has always been single-threaded. This limitation primarily stems from the complexities of ensuring data consistency across multiple transactions.
Starting in Debezium 2.2, we’re adding a new, initially optional way to utilize multiple threads to perform a consistent database snapshot for a connector. This implementation uses these threads to execute table-level snapshots in parallel.
In order to take advantage of this new feature, specify snapshot.max.threads in your connector’s configuration; when this property has a value greater than 1, parallel snapshots will be used.
snapshot.max.threads=4
In the example above, if the connector needs to snapshot more than 4 tables, at most 4 tables will be snapshotted in parallel. When one thread finishes processing a table, it takes a new table to snapshot from the queue, and the process continues until all tables have been snapshotted.
This feature is considered incubating, but we strongly suggest that new connector deployments give this feature a try. We would welcome any and all feedback on how to improve this going forward.
MongoDB server-side change stream filtering
Debezium presently subscribes to the MongoDB change stream and evaluates on the connector side whether an event is of relevance. On the surface, there is nothing technically wrong with this approach, and it has worked well; however, a recent contributor explained how this decision impacts them.
Overall, the current process effectively serializes across the network all changes from MongoDB to the connector. If you have a lower volume of changes, you likely won’t see any issue with this approach; however, in a high-volume scenario, especially when you’re only interested in a subset of the data generated by change streams, you quickly begin to see how this approach is inefficient. Furthermore, if you’re running the connector in a cloud environment like AWS, you’ll likely see higher utilization costs in such high-volume scenarios.
Moving the evaluation of the include/exclude list filters from the connector to the MongoDB server’s change stream subscription brings a number of advantages for all MongoDB connector users.
Reducing the number of events seen by the connector lowers both network and CPU utilization. Events that the connector would simply discard due to include/exclude filters no longer consume network bandwidth; when the connector is configured with full-document or pre-image settings, the savings are even greater. Furthermore, because the connector no longer receives events its configuration isn’t interested in, it does less processing, lowering CPU utilization.
While network and CPU utilization are critical regardless of one’s environment, these are often more scrutinized when operating in a cloud-based environment, as these two metrics directly impact the operating budget. Users should see overall lower network and CPU utilization with Debezium MongoDB 2.2 connectors.
We hope to share more details about the benefits of this change in a future blog post, so stay tuned!
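The connector configuration itself does not change; a minimal sketch (the connection string and collection names below are hypothetical) shows the filters that are now evaluated server-side:

```properties
connector.class=io.debezium.connector.mongodb.MongoDbConnector
mongodb.connection.string=mongodb://mongo1:27017/?replicaSet=rs0
topic.prefix=dbserver1
# These include filters are now applied as part of the change stream
# subscription on the MongoDB server, so filtered-out events never
# cross the network to the connector.
database.include.list=inventory
collection.include.list=inventory.orders
```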
Incremental snapshot surrogate key support
Debezium’s incremental snapshot feature has been a tremendous success. It provides an efficient way to perform a consistent snapshot of data that can be resumed, which is critical when the snapshot consists of large volumes of data.
However, incremental snapshots do have specific requirements that must be met before the feature can be used. One of those requirements is that all tables being snapshotted must use a primary key. Why a table might have no primary key is not something we’re going to debate here today; suffice it to say, this occurs more often than you may think.
With Debezium 2.2, incremental snapshots can be performed on key-less tables as long as there is one column that is unique and can be considered a "surrogate key" for incremental snapshot purposes.
The surrogate key feature is supported only by relational connectors, not by MongoDB.
To provide the surrogate key column data in an incremental snapshot signal, the signal’s payload must include the new surrogate key attribute, surrogate-key.
{
"data-collections": [ "public.mytab" ],
"surrogate-key": "customer_ref"
}
In the above example, an incremental snapshot will be started for the table public.mytab, and the incremental snapshot will use the customer_ref column as the primary key for generating the snapshot windows.
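When using a signaling table, the same payload can be delivered with a simple INSERT. The table name debezium_signal below is an assumption; use whatever table your signal.data.collection configuration points at:

```sql
-- Trigger an ad-hoc incremental snapshot that uses customer_ref as the
-- surrogate key (the signal table name here is hypothetical).
INSERT INTO debezium_signal (id, type, data)
VALUES (
  'adhoc-snapshot-1',
  'execute-snapshot',
  '{"data-collections": ["public.mytab"], "type": "incremental", "surrogate-key": "customer_ref"}'
);
```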
A surrogate key cannot be defined using multiple columns, only a single column.
However, the surrogate key feature isn’t applicable only to tables without primary keys. There are several advantages to using this feature with tables that do have primary keys:
- One clear advantage is when the table’s primary key consists of multiple columns. The query generates a disjunction predicate for each column in the primary key, and its performance is highly dependent on the environment. Reducing the predicate to a single column often improves performance across environments.
- Another advantage is when the surrogate key is based on a numeric data type while the primary key column is based on a character-based data type. Relational databases generally evaluate numeric comparisons more efficiently than character comparisons, so adjusting the query to use a numeric data type can improve query performance.
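As a sketch of the first point, the chunk-boundary predicate for a hypothetical composite key (tenant_id, order_id) versus a single numeric surrogate key might look like this (table and column names are invented, and the actual query Debezium generates may differ):

```sql
-- Composite primary key: one disjunction term per key column.
SELECT * FROM orders
WHERE (tenant_id > ?) OR (tenant_id = ? AND order_id > ?)
ORDER BY tenant_id, order_id
LIMIT 1024;

-- Single numeric surrogate key: one simple range comparison.
SELECT * FROM orders
WHERE order_seq > ?
ORDER BY order_seq
LIMIT 1024;
```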
Other fixes
There were quite a number of other improvements, bug fixes, and stability changes in this release; some noteworthy ones are:
- When using snapshot.collection.include.list, relational schema isn’t populated correctly DBZ-3594
- Debezium UI should use fast-jar again with Quarkus 2.x DBZ-4621
- Create a Datastax connector based on Cassandra connector DBZ-5951
- Add support for honouring MongoDB read preference in change stream after promotion DBZ-5953
- Add support for header to all Debezium Server sinks DBZ-6017
- GCP Spanner connector start failing when there are multiple indexes on a single column DBZ-6101
- Negative remaining attempts on MongoDB reconnect case DBZ-6113
- Support String type for key in Mongo incremental snapshot DBZ-6116
- Tables with spaces or non-ASCII characters in their name are not captured by Oracle because they must be quoted DBZ-6120
- Offsets are not advanced in a CDB deployment with low frequency of changes to PDB DBZ-6125
- Allow TestContainers test framework to expose ConnectorConfiguration as JSON DBZ-6136
- Oracle TIMESTAMP WITH TIME ZONE is emitted as GMT during snapshot rather than the specified TZ DBZ-6143
- Upgrade impsort-maven-plugin from 1.7.0 to 1.8.0 DBZ-6144
- Debezium UI E2E Frontend build failing randomly with corrupted Node 16 tar file DBZ-6146
- Debezium UI SQL Server tests randomly fail due to slow agent start-up DBZ-6149
- Upgrade Quarkus dependencies to 2.16.3.Final DBZ-6150
- Remove hardcoded list of system database exclusions that are not required for change streaming DBZ-6152
- RelationalSnapshotChangeEventSource swallows exception generated during snapshot DBZ-6179
- Create SSL scenarios for integration tests for MySQL connector DBZ-6184
Altogether, 33 issues were fixed for this release. A big thank you to all the contributors from the community who worked on this release: Gabor Andras, Anisha Mohanty, Bob Roldan, Bobby Tiernay, Chris Cranford, Eugene Abramchuk, Gunnar Morling, Harvey Yue, Jakub Cechacek, Jeremy Ford, Jiri Pechanec, Mehmet Firat Komurcu, Plugaru Tudor, Stefan Miklosovic, Subodh Kant Chaturvedi, Vojtech Juranek, and Xinbin Huang!
Outlook & What’s Next?
We are nearing the end of the Debezium 2.2 development cycle. Assuming no unexpected problems, we intend to release Beta1 next week, followed by a release candidate two weeks thereafter. Our goal is to finalize the Debezium 2.2 release in late March or early April at the latest.
We would love to hear your feedback or suggestions about our roadmap, changes in this release, or anything outstanding that we may not have mentioned. Be sure to get in touch with us on the mailing list or in our chat if you have any.
Also, the DevNexus 2023 conference is coming up in early April in Atlanta, and I have the privilege to be a guest speaker discussing Debezium and CDC patterns. Be sure to check out that talk in person if you have an opportunity!
And finally, be on the lookout for our first installment of our 2023 Newsletter later this month. I also will be wrapping up the blog series, "Debezium for Oracle" where I cover performance, debugging, and frequently asked questions about the Oracle connector.
Until next time…
About Debezium
Debezium is an open source distributed platform that turns your existing databases into event streams, so applications can see and respond almost instantly to each committed row-level change in the databases. Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, so your application can be stopped and restarted at any time and can easily consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely. Debezium is open source under the Apache License, Version 2.0.
Get involved
We hope you find Debezium interesting and useful, and want to give it a try. Follow us on Twitter @debezium, chat with us on Zulip, or join our mailing list to talk with the community. All of the code is open source on GitHub, so build the code locally and help us improve our existing connectors and add even more connectors. If you find problems or have ideas for how we can improve Debezium, please let us know or log an issue.