Debezium Blog

A Happy New Year to the Debezium Community!

May all your endeavours be successful, your data be consistent, and most importantly, everyone stay safe and healthy. With 2020 in the books, I thought it’d be nice to take a look back and do a quick recap of what has happened around Debezium over the last year.

First, some facts and numbers for you stats lovers out there:

Over the last five years, Debezium has become a leading open-source solution for change data capture for a variety of databases. Users from all kinds of industries work with Debezium for use cases like replication of data from operational databases into data warehouses, updating caches and search indexes, driving streaming queries via Kafka Streams or Apache Flink, synchronizing data between microservices, and many more.

When talking to Debezium users, we generally receive very good feedback on the range of applications enabled by Debezium and on its flexibility: e.g. each connector can be configured and fine-tuned in many ways, depending on your specific requirements. A large number of metrics provide deep insight into the state of running Debezium connectors, allowing CDC pipelines to be operated safely even in huge installations with thousands of connectors.

All this comes at the cost of a learning curve, though: users new to Debezium need to understand the different options and settings as well as learn about best practices for running Debezium in production. We’re therefore constantly exploring how the user experience of Debezium can be further improved, allowing people to set up and operate its connectors more easily.

Create new topics / pipes

When you are working with Kafka Connect in distributed mode, you might have realized that once you start Kafka Connect, some internal Kafka Connect-related topics are created for you:

$ kafka-topics.sh --bootstrap-server $HOSTNAME:9092 --list

connect_configs
connect_offsets
connect_statuses

This is done automatically for you by Kafka Connect, using sane default topic settings that are customized to the needs of these internal topics.
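In case you’re wondering where those names come from: they are defined in the distributed worker configuration. A minimal sketch of the relevant settings (broker address and topic names are just illustrative values, matching the listing above) could look like this:

# connect-distributed.properties (excerpt)
bootstrap.servers=kafka:9092
group.id=connect-cluster

# the internal topics listed above
config.storage.topic=connect_configs
offset.storage.topic=connect_offsets
status.storage.topic=connect_statuses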

When you start a Debezium connector, the topics for the captured events are created by the Kafka broker, based on the broker’s default (and possibly customized) topic configuration, provided that auto.create.topics.enable = true is set in the broker config:

auto.create.topics.enable = true
default.replication.factor = 1
num.partitions = 1
compression.type = producer
log.cleanup.policy = delete
log.retention.ms = 604800000  ## 7 days

Often, however, when using Debezium and Kafka in a production environment you might choose to disable Kafka’s topic auto-creation capability with auto.create.topics.enable = false, or you might want the connector topics to be configured differently from the default. In this case you have to create the topics for Debezium’s captured data sources upfront.
But there’s good news! Beginning with Kafka Connect version 2.6.0, this can be automated, as KIP-158 adds support for customizable topic creation to Kafka Connect.
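To give an idea of what this looks like, topic creation rules can then be added directly to a connector’s configuration. The following sketch (replication factor, partition count and the extra topic settings are purely illustrative) asks Kafka Connect to create any missing topics for this connector accordingly:

# connector configuration (excerpt)
topic.creation.default.replication.factor=3
topic.creation.default.partitions=10
topic.creation.default.cleanup.policy=compact
topic.creation.default.compression.type=lz4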

Setting up change data capture (CDC) pipelines with Debezium is typically a matter of configuration, without any programming being involved. It’s still a very good idea to have automated tests for your CDC set-up, making sure that everything is configured correctly and that your Debezium connectors are set up as intended.

There are two main components involved whose configuration needs consideration:

  • The source database: it must be set up so that Debezium can connect to it and retrieve change events; details depend on the specific database, e.g. for MySQL the binlog must be in "row" mode, for Postgres, one of the supported logical decoding plug-ins must be installed, etc.

  • The Debezium connector: it must be configured with the right database host and credentials, possibly using SSL, applying table and column filters, potentially one or more single message transformations (SMTs), etc. (a configuration sketch follows this list).
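For illustration, a Debezium MySQL connector configuration touching on the second point could look like the sketch below; host names, credentials, server id and table filters are placeholders, and some property names vary between Debezium versions (e.g. older releases use table.whitelist instead of table.include.list):

name=inventory-connector
connector.class=io.debezium.connector.mysql.MySqlConnector
database.hostname=mysql
database.port=3306
database.user=debezium
database.password=dbz
database.server.id=184054
database.server.name=dbserver1
# capture only selected tables
table.include.list=inventory.customers,inventory.orders
# history topic for schema changes
database.history.kafka.bootstrap.servers=kafka:9092
database.history.kafka.topic=schema-changes.inventory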

We have developed a Debezium connector for usage with Db2, which is now available as part of the Debezium incubator. Here we describe the use case we have for Change Data Capture (CDC), the various approaches that already exist in the Db2 ecosystem, and how we came to Debezium. In addition, we motivate the approach we took to implementing the Db2 Debezium connector.

This article is a dive into the realms of Event Sourcing, Command Query Responsibility Segregation (CQRS), Change Data Capture (CDC), and the Outbox Pattern. It aims to bring some much-needed clarity on the value of these solutions. Additionally, two differing designs will be explained in detail, along with the pros and cons of each.

So why do all these solutions even matter? They matter because many teams are building microservices and distributing data across multiple data stores. One system of microservices might involve relational databases, object stores, in-memory caches, and even searchable indexes of data. Data can quickly become lost, out of sync, or even corrupted, resulting in disastrous consequences for mission-critical systems.

Solutions that help avoid these serious problems are of paramount importance for many organizations. Unfortunately, many vital solutions are somewhat difficult to understand; Event Sourcing, CQRS, CDC, and Outbox are no exception. Please look at these solutions as an opportunity to learn and understand how they could apply to your specific use cases.

As you will find out at the end of this article, I will propose that three of these four solutions have high value, while the other should be discouraged except for the rarest of circumstances. The advice given in this article should be evaluated against your specific needs, because, in some cases, none of these four solutions would be a good fit.

Outbox as in that folder in my email client? No, not exactly but there are some similarities!

The term outbox describes a pattern that allows independent components or services to obtain "read your own writes" semantics while concurrently providing a reliable, eventually consistent view of those writes across component or service boundaries.

You can read more about the Outbox pattern and how it applies to microservices in our blog post, Reliable Microservices Data Exchange With the Outbox Pattern.

So what exactly is an Outbox Event Router?

In Debezium version 0.9.3.Final, we introduced a ready-to-use Single Message Transform (SMT) that builds on the Outbox pattern to propagate data change events using Debezium and Kafka. Please see the documentation for details on how to use this transformation.
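As a quick preview, the SMT is enabled by adding it to the transformation chain of a connector’s configuration; a minimal sketch (the transform alias "outbox" is arbitrary, and all routing options are left at their defaults) looks like this:

# connector configuration (excerpt)
transforms=outbox
transforms.outbox.type=io.debezium.transforms.outbox.EventRouter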

As a follow up to the recent Building Audit Logs with Change Data Capture and Stream Processing blog post, we’d like to extend the example with admin features to make it possible to capture and fix any missing transactional data.

In the above-mentioned blog post, there is a log enricher service used to combine data inserted or updated in the Vegetable database table with transaction context data such as

  • Transaction id

  • User name who performed the work

  • Use case that was behind the actual change e.g. "CREATE VEGETABLE"

This all works well as long as all the changes are done via the vegetable service. But is this always the case?

What about maintenance activities or migration scripts executed directly on the database level? There is still a lot of such activity going on, either on purpose or because of old habits we are trying to change…

Let’s talk about TOAST. Toast? No, TOAST!

So what’s that? TOAST (The Oversized-Attribute Storage Technique) is a mechanism in Postgres which stores large column values in multiple physical rows, circumventing the page size limit of 8 KB.

TOAST!

Typically, TOAST storage is transparent to the user, so you don’t really have to care about it. There’s an exception, though: if a table row has changed, any unchanged values that were stored using the TOAST mechanism are not included in the message that Debezium receives from the database, unless they are part of the table’s replica identity. Consequently, such unchanged TOAST column values will not be contained in the Debezium data change events sent to Apache Kafka. In this post we’re going to discuss different strategies for dealing with this situation.
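One knob relevant to that discussion is the table’s replica identity: setting it to FULL makes Postgres log the complete old row for updates and deletes, so unchanged TOAST values become available to logical decoding, at the cost of larger WAL entries. As a quick preview (database and table name are just examples):

$ psql -d inventory -c "ALTER TABLE public.products REPLICA IDENTITY FULL;"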

It is a common requirement for business applications to maintain some form of audit log, i.e. a persistent trail of all the changes to the application’s data. If you squint a bit, a Kafka topic with Debezium data change events is quite similar to that: sourced from database transaction logs, it describes all the changes to the records of an application. What’s missing though is some metadata: why, when and by whom was the data changed? In this post we’re going to explore how that metadata can be provided and exposed via change data capture (CDC), and how stream processing can be used to enrich the actual data change events with such metadata.

This is a guest post by Apache Pulsar PMC Member and Committer Jia Zhai.

Debezium is an open source project for change data capture (CDC). It is built on Apache Kafka Connect and supports multiple databases, such as MySQL, MongoDB, PostgreSQL, Oracle, and SQL Server. Apache Pulsar includes a set of built-in connectors based on the Pulsar IO framework, which is the counterpart to Apache Kafka Connect.

As of version 2.3.0, Pulsar IO comes with support for the Debezium source connectors out of the box, so you can leverage Debezium to stream changes from your databases into Apache Pulsar. This tutorial walks you through setting up the Debezium connector for MySQL with Pulsar IO.

As part of their business logic, microservices often not only have to update their own local data store, but also need to notify other services about data changes that happened. The outbox pattern describes an approach for letting services execute these two tasks in a safe and consistent manner; it provides source services with instant "read your own writes" semantics, while offering reliable, eventually consistent data exchange across service boundaries.

The second-level cache of Hibernate ORM / JPA is a proven and efficient way to increase application performance: caching read-only or rarely modified entities avoids roundtrips to the database, resulting in improved response times of the application.

Unlike the first-level cache, the second-level cache is associated with the session factory (or entity manager factory in JPA terms), so its contents are shared across transactions and concurrent sessions. Naturally, if a cached entity gets modified, the corresponding cache entry must be updated (or purged from the cache), too. As long as the data changes are done through Hibernate ORM, this is nothing to worry about: the ORM will update the cache automatically.

Things get tricky, though, when bypassing the application, e.g. when modifying records directly in the database. Hibernate ORM then has no way of knowing that the cached data has become stale, and it’s necessary to invalidate the affected items explicitly. A common way of doing so is to provide some admin functionality that allows an application’s caches to be cleared. For this to work, it’s vital not to forget to call that invalidation functionality, or the application will keep working with outdated cached data.

In the following we’re going to explore an alternative approach for cache invalidation, which works in a reliable and fully automated way: by employing Debezium and its change data capture (CDC) capabilities, you can track data changes in the database itself and react to any applied change. This allows affected cache entries to be invalidated in near real-time, without the risk of stale data due to missed changes. If an entry has been evicted from the cache, Hibernate ORM will load the latest version of the entity from the database the next time it is requested.

Updating external full text search indexes (e.g. Elasticsearch) after data changes is a very popular use case for change data capture (CDC).

As we discussed in a blog post a while ago, the combination of Debezium’s CDC source connectors and Confluent’s sink connector for Elasticsearch makes it straightforward to capture data changes in MySQL, Postgres etc. and push them towards Elasticsearch in near real-time. This results in a 1:1 relationship between tables in the source database and a corresponding search index in Elasticsearch, which is perfectly fine for many use cases.

It gets more challenging though if you’d like to put entire aggregates into a single index. An example could be a customer and all their addresses; those would typically be stored in two separate tables in an RDBMS, linked by a foreign key, whereas you’d like to have just one index in Elasticsearch, containing documents of customers with their addresses embedded, allowing you to efficiently search for customers based on their address.

Following up on the KStreams-based solution to this that we described recently, in this post we’d like to present an alternative for materializing such aggregate views, driven by the application layer.

Most of the time, Debezium is used to stream data changes into Apache Kafka. But what if you’re using another streaming platform such as Apache Pulsar, or a cloud-based solution such as Amazon Kinesis, Azure Event Hubs and the like? Can you still benefit from Debezium’s powerful change data capture (CDC) capabilities and ingest changes from databases such as MySQL, Postgres, SQL Server etc.?

Turns out, with just a bit of glue code, you can! In the following we’ll discuss how to use Debezium to capture changes in a MySQL database and stream the change events into Kinesis, a fully-managed data streaming service available on the Amazon cloud.
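One way to run a connector outside of Kafka Connect for this kind of glue code is Debezium’s embedded engine, which is driven by plain properties rather than by a Connect cluster. A minimal sketch of such a configuration (file paths, credentials and the server id/name are illustrative) could look like this:

name=kinesis-demo
connector.class=io.debezium.connector.mysql.MySqlConnector
# keep offsets and schema history in local files instead of Kafka topics
offset.storage=org.apache.kafka.connect.storage.FileOffsetBackingStore
offset.storage.file.filename=/tmp/offsets.dat
offset.flush.interval.ms=10000
database.history=io.debezium.relational.history.FileDatabaseHistory
database.history.file.filename=/tmp/dbhistory.dat
database.hostname=localhost
database.port=3306
database.user=debezium
database.password=dbz
database.server.id=85744
database.server.name=kinesis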

Yesterday I had the opportunity to present Debezium and the idea of change data capture (CDC) to the Darmstadt Java User Group. It was a great evening with lots of interesting discussions and questions, one of them being the following: what is the advantage of using a log-based change data capture tool such as Debezium over simply polling for updated records?

So first of all, what’s the difference between the two approaches? With polling-based (or query-based) CDC you repeatedly run queries (e.g. via JDBC) for retrieving any newly inserted or updated rows from the tables to be captured. Log-based CDC in contrast works by reacting to any changes to the database’s log files (e.g. MySQL’s binlog or MongoDB’s op log).

As this wasn’t the first time this question came up, I thought I could provide a more extensive answer here on the blog as well. That way I’ll be able to refer to this post in the future, should the question come up again :)

So without further ado, here’s my list of five advantages of log-based CDC over polling-based approaches.

Microservice-based architectures can be considered an industry trend and are thus often found in enterprise applications these days. One possible way to keep data synchronized across multiple services and their backing data stores is to make use of an approach called change data capture, or CDC for short.

Essentially, CDC allows you to listen to any modifications occurring at one end of a data flow (i.e. the data source) and communicate them as change events to other interested parties, or store them in a data sink. Instead of doing this in a point-to-point fashion, it’s advisable to decouple this flow of events between data sources and data sinks. Such a scenario can be implemented based on Debezium and Apache Kafka with relative ease and effectively no coding.

As an example, consider the following microservice-based architecture of an order management system: