Frequently Asked Questions
- What is Debezium?
- Where did the name "Debezium" come from?
- What is Change Data Capture?
- What databases can Debezium monitor?
- What are some uses of Debezium?
- Why is Debezium a distributed system?
- Can my application directly monitor a single database?
- What does the Debezium platform look like?
- How many databases can be monitored?
- How does Debezium affect source databases?
- How are events for a database organized?
- Why are events so large?
- How do I use Confluent’s Avro Converter?
- What happens when an application stops or crashes?
- What happens when Debezium stops or crashes?
- What happens when a monitored database stops or crashes?
- Why must consuming applications expect duplicate events?
- What is Kafka?
- What is Kafka Connect?
- How to connect to Kafka when using the Debezium Docker images?
- What database size can be handled with the default memory settings for the Connect image?
- How to retrieve DECIMAL field from binary representation?
Debezium is a set of distributed services capture row-level changes in your databases so that your applications can see and respond to those changes. Debezium records in a transaction log all row-level changes committed to each database table. Each application simply reads the transaction logs their interested in, and they see all of the events in the same order in which they occurred.
The name is a combination of "DBs", as in the abbreviation for multiple databases, and the "-ium" suffix used in the names of many elements of the periodic table. Say it fast: "DBs-ium". If it helps, we say it like "dee-BEE-zee-uhm".
Change Data Capture, or CDC, is an older term for a system that monitors and captures the changes in data so that other software can respond to those changes. Data warehouses often had built-in CDC support, since data warehouses need to stay up-to-date as the data changed in the upstream OLTP databases.
Debezium is essentially a modern, distributed open source change data capture platform that will eventually support monitoring a variety of database systems.
Note that monitoring PostgreSQL requires installing an extension (plugin) into the PostgreSQL server. If your database is hosted by a service such as Amazon RDS or Heroku Postgres you may be unable to install the extension. If so, you’ll be unable to monitor your database with Debezium (support for RDS is planned).
Support for other DBMSes will be added in future releases. See our list of connectors.
The primary use of Debezium is to enable applications to respond almost immediately whenever data in databases change. Applications can do anything with the insert, update, and delete events. They might use the events to know when to remove entries from a cache. They might update search indexes with the data. They might update a derived data store with the same information or with information computed from the changing data, such as with Command Query Responsibility Separation (CQRS). They might send a push notification to one or more mobile devices. They might aggregate the changes and produce a stream of patches for entities.
Debezium is architected to be tolerant of faults and failures, and the only effectively way to do that is with a distributed system. Debezium distributes the monitoring processes, or connectors, across multiple machines so that, if anything does wrong, the connectors can be restarted. The events are recorded and replicated across multiple machines to minimize risk of information loss.
Yes. Although we recommend most people use the full Debezium platform, it is possible for a single application to embed a Debezium connector so it can monitor a database and respond to the events. This approach is indeed far simpler with few moving parts, but it is more limited and far less tolerant of failures. If your application needs at-least-once delivery guarantees of all messages, please consider using the full distributed system.
A running Debezium system consists of several pieces. A cluster of Apache Kafka brokers provides the persistent, replicated, and partitioned transaction logs where Debezium records all events and from which applications consume all events. The number of Kafka brokers depends largely on the volume of events, the number of database tables being monitored, and the number of applications that are consuming the events. Kafka does rely upon a small cluster of Zookeeper nodes to manage responsibilities of each broker.
Each Debezium connector monitors one database cluster/server, and connectors are configured and deployed to a cluster of Kafka Connect services that ensure that each connector is always running, even as Kafka Connect service instances leave and join the cluster. Each Kafka Connect service cluster (a.k.a., group) is independent, so it is possible for each group within an organization to manage its own clusters.
All connectors record their events (and other information) to Apache Kafka, which persists, replicates, and partitions the events for each table in separate topics. Multiple Kafka Connect service clusters can share a single cluster of Kafka brokers, but the number of Kafka brokers depends largely on the volume of events, the number of database tables being monitored, and the number of applications that are consuming the events.
Applications connect to Kafka directly and consume the events within the appropriate topics.
Debezium can monitor any number of databases. The number of connectors that can be deployed to a single cluster of Kafka Connect services depends upon upon the volume and rate of events. However, Debezium supports multiple Kafka Connect service clusters and, if needed, multiple Kafka clusters as well.
Most databases have to be configured before Debezium can monitor them. For example, a MySQL server must be configured to use the row-level binlog, and to have a user privileged to read the binlog; the Debezium connector must be configured with the correct information, including the privileged user. See the specific connector documentation for details.
Debezium connectors do not store any information inside the upstream databases. However, running a connector may place additional load on the source database.
Most connectors will record all events for a single database table to a single topic. Additionally, all events within a topic are totally-ordered, meaning that the order of all of those events will be maintained. (Even if events are duplicated during failures, the end result after applying all of the events will remain the same.)
For example, a MySQL connector monitoring a MySQL server/cluster (logically named "dbserverA") records all of the changes to the "Addresses" table within the "Customers" database in the topic named
dbserverA.Customers.Addresses. Likewise, all of the changes to the "PaymentMethods" table in the same database will be recorded in the topic named
Debezium is designed to monitor upstream databases and produce for each row-level change one or more corresponding events that completely describe those changes. But Debezium connectors work continuously, and its events have to make sense even as the structure of the tables in the upstream databases change over time. A consumer is also much easier to write if it only has to deal with a single event at a time, rather than having to track state over the entire history of the event stream.
That means each event needs to be completely self-describing: an event’s key and value each contain a payload with the actual information and a schema that fully describes the structure of the information. Consuming applications can process each event, use the schema to understand the structure of the information in that event, and then correctly process the event’s payload. The consuming application can take advantage of the fact that the schema will remain the same for many events in a row, and only when the schema changes might the consuming application need to do a bit more work preparing for the changed structure.
Meanwhile, the Kafka Connect services serialize the connector’s events and record them in Kafka. The JSON converter is very generic and very simple, but it has no choice but to serialize the entire event information. Therefore, events represented in JSON are indeed verbose and large.
However, Confluent’s Avro Converter is much smarter in two ways. First, it converts the connector’s schema into an Apache Avro schema, so the payload can be serialized into a very compact binary form. Secondly, it uses the fact that many events in a row will use the same schema (and thus Avro schema), and by registering those Avro schemas in a separate Schema Registry, it can place into each serialized event a small identifier of the schema version used by the message. The Avro Converter and the Schema Registry can work together to track the history of each schema over time.
Meanwhile, in the consumer, the same Avro Converter decodes the compact binary form of the event, reads the identifier of the schema version used by that message, if it hasn’t yet seen that schema version downloads the Avro schema from the Schema Registry, and finally uses that Avro schema to decode the binary payload of the event. Again, many events in sequence will share the same schema (and Avro schema version), so most of the time the converter can simply decode the raw compact event into the same schema and payload expected by the consumer.
Although our tutorial doesn’t explicitly use them, you can certainly use Confluent’s Avro Converter with Debezium. As mentioned above, the Avro Converter is much smarter and serializes the event messages much more compactly than the JSON converter that is used by default.
If you are deploying Debezium connectors to a Kafka Connect worker service, simply make sure the Avro Converter JARs are available and configure the worker service to use the Avro Converter. You will, for example, need to point the converter to your Confluent Schema Registry. Then, simply deploy the Debezium connectors (or really, any other Kafka Connect connectors) to your worker service. See Avro Serialization for a detailed description of how to use the Avro converter.
Our Docker images for Kafka Connect include the Avro Converter as an option.
To consume the change events for a database, an application creates a Kafka consumer that will connect to the Kafka brokers and consume all events for the topics associated with that database. The consumer is configured to periodically record its position (aka, offset) in each topic. When an application stops gracefully and closes the consumer, the consumer will record the offsets for the last event in each topic. When the application restarts at any later time, the consumer looks up those offsets and starts reading the very next events in each topic. Therefore, under normal operating scenarios, the application sees every event exactly one time.
If the application crashes unexpectedly, then upon restart the application’s consumer will look up the last recorded offsets for each topic, and start consume events from the last offset for each topic. In most cases, the application will see some of the same events it saw prior to the crash (but after it recorded the offset), followed by the events it had not yet seen. Thus, the application sees every event at least once. The application can reduce the number of events seen more than once by recording the offsets more frequently, although doing so will negatively affect performance and throughput of the client.
Note that a Kafka consumer can be configured to connect and start reading with the most recent offset in each topic. This can result in missed events, though this is perfectly acceptable for some use cases.
The behavior of Debezium varies depending upon which components are stopped or crashed. If enough of the Kafka broker were to stop or crash such that the each topic partition is housed by fewer than the minimum number of in-sync replicas, then the connectors writing to those topics and the consuming applications reading from those topics will simply block until the Kafka brokers can be restarted or new brokers brought online. Therefore, the minimum number of in-sync replicas has a very large impact on availability, and for consistency reasons should always be at least 1 (if not 3).
The Kafka Connect service is configured to periodically record the position and offsets of each connector. If one of the Kafka Connect service instances in its cluster is stopped gracefully, all connectors running in that process will be stopped gracefully (meaning all positions and offsets will be recorded) and those same connectors will be restarted on other Kafka Connect service instances in the same cluster. When those connectors are restarted, they will continue recording events exactly where they left off, with no duplicate events being recorded.
When one of the connectors running in a Kafka Connect service cluster is stopped gracefully, it will complete its current work and record the latest positions and offsets in Kafka. Downstream applications consume from the topics will simply wait until new events are added.
If any of the Kafka Connect service instances in its cluster crashes unexpectedly, then all connectors that were running in the crashed process will be restarted on other Kafka Connect service instances in the same cluster. However, when those connectors are restarted, they will begin recording events from the database starting at the position/offset last recorded by the connector before it crashed. This means the newly-restarted connectors may likely record some of the same events it previously recorded prior to the crash, and these duplicates will always be visible to downstream consuming applications.
When a database server monitored by Debezium stops or crashes, the Debezium connector will likely try to re-establish communication. Debezium periodically records the connector’s positions and offsets in Kafka, so once the connector establishes communication the connector should continue to read from the last recorded position and offset.
When all systems are running nominally or when some or all of the systems are gracefully shut down, then consuming applications can expect to see every event exactly one time. However, when things go wrong it is always possible for consuming applications to see events at least once.
When the Debezium’s systems crash, they are not always able to record their last position/offset. When they are restarted, they recover by starting where were last known to have been, and thus the consuming application will always see every event but may likely see at least some messages duplicated during recovery.
Additionally, network failures may cause the Debezium connectors to not receive confirmation of writes, resulting in the same event being recorded one or more times (until confirmation is received).
Apache Kafka is a fast, scalable, durable, and distributed messaging system that records all messages in replicated, partitioned, and totally-ordered transaction logs. Consumers keep track of their position in the logs, and can control this position indepdently of all other consumers. This means that some consumers can start from the very beginning of the log while others are keeping up with the most recently-recorded messages. Kafka operates as a dynamic cluster of brokers. Each log partition is replicated to multiple brokers so that, should any broker fail, the cluster still has multiple copies of the partition.
Debezium connectors record all events to a Kafka cluster, and applications consume those events through Kafka.
Kafka Connect is a framework for scalably and reliably streaming data between Apache Kafka and other systems. It is a recent addition to the Kafka community, and it makes it simple to define connectors that move large collections of data into and out of Kafka, while the framework does most of the hard work of properly recording the offsets of the connectors. A Kafka Connect service has a RESTful API for managing and deploying connectors; the service can be clustered and will automatically distribute the connectors across the cluster, ensuring that the connector is always running.
Debezium use the Kafka Connect framework. All of Debezium’s connectors are Kafka Connector source connectors, and as such they can be deployed and managed using the Kafka Connect service.
When using Docker for Mac or Docker for Windows, the Docker containers run within a light-weight VM. In order to connect to Kafka from your host system, e.g. with a Kafka Consumer started in a test in your IDE, you need to specify your host system’s IP address or host name as
ADVERTISED_HOST_NAME for the Kafka container:
docker run -it --rm --name kafka -p 9092:9092 -e ADVERTISED_HOST_NAME=<%YOUR_HOST_NAME%> --link zookeeper:zookeeper debezium/kafka:0.7. This name will be published by Zookeeper to clients asking for the Kafka broker’s name.
The memory consumption during start-up and runtime depends on the total number if tables in the database that is monitored by Debezium, the number of columns in each table and also the amount of events coming from the database. As a rule of thumb the default memory settings (maximum heap set to 256 MB) will manage to handle databases where the total count of columns across all tables is less than 10000.
If Debezium is configured to handle DECIMAL values as precise then it encodes it as
org.apache.kafka.connect.data.Decimal. This type is converted into a
BigInteger and serialized as a byte array. To decode it back we need to know the scale of value either in advance or it has to be obtained from the schema. The code for unwrapping then can look like one of the following snippets depending whether the encoded value is available as a byte array or as a string.
byte encoded = ...; int scale = ...; final BigDecimal decoded = new BigDecimal(new BigInteger(encoded), scale); String encoded = ...; int scale = ...; final BigDecimal decoded = new BigDecimal(new BigInteger(Base64.getDecoder().decode(encoded)), scale);