Serializing Debezium events with Avro

Kafka serializers and deserializers

Before we get too far, let’s back up and review how Kafka producers and consumers normally do this serialization and deserialization. Because the keys and values are simple opaque byte arrays, you can use anything for your keys and values. For example, consider a case where we’re using simple whole numbers for the keys and strings for the values. Here, a producer of these messages would use a long serializer to convert the long keys to binary form and a string serializer to convert the String values to binary form. Meanwhile, the consumers use a long deserializer to convert the binary keys into usable long values, and a string deserializer to convert the binary values back into String objects.

In cases where the keys and/or values need to be a bit more structured, the producers and consumers can be written to use JSON structures for keys and/or values, and the Kafka-provided JSON serializer and deserializer to do the conversion to and from binary form stored within the Kafka messages. As we said earlier, using JSON for keys and/or values is very flexible and language agnostic, but it is also produces keys and values that are relatively large since the fields and structure of the JSON values need to be encoded as well.

Avro serialization

Avro is a data serialization mechanism that uses a schema to define the structure of data. Avro relies upon this schema when writing the data to the binary format, and the schema allows it to encode the fields within the data in a much more compact form. Avro also relies upon the schema when reading the data, too. But interestingly, Avro schemas are designed to evolve, so it is actually possible to use a slightly different schema for reading than what was used for writing. This feature makes Avro a great choice for Kafka serialization and deserialization.

Confluent provides a Kafka serializer and deserializer that uses Avro and a separate Schema Registry, and it works like this: when a numeric or string object are to be serialized, the Avro serializer will determine the corresponding Avro Schema for the given type, register with the Schema Registry this schema and the topic its used on, get back the unique identifier for the schema, and then encode in the binary form the unique identifier of the schema and the encoded value. The next message is likely to have the same type and thus schema, so the serializer can quickly encode the schema identifier and value for this message without having to talk to the Schema Registry. Only when needing to serialize a schema it hasn’t already seen does the Avro serializer talk with the Schema Registry. So not only is this fast, but it also produces very compact binary forms and allows for the producer to evolve its key and/or value schemas over time. The Schema Registry can also be configured to allow new versions of schemas to be registered only when they are compatible with the Avro schema evolution rules, ensuring that producers do not produce messages that consumers will not be able to read.

Consumers, meanwhile, use the Avro deserializer, which works in a similar manner, albeit backwards: when it reads the binary form of a key or value, it first looks for the schema identifier and, if it hasn’t seen it before asks the Schema Registry for the schema, and then uses that schema to decode the remainder of the binary representation into its object form. Again, if the deserializer has previously seen a particular schema identifier, it already has the schema needed to decode the data and doesn’t have to consult the Schema Registry.

Kafka Connect converters

Kafka Connect is a bit different than many Kafka producers/consumers, since the keys and values will often be structured. And rather than require connectors to work with JSON objects, Kafka Connect defines its own lightweight framework for defining data structures with a schema, making it much easier to write connectors to work with structured data. Kafka Connect defines its own converters that are similar to Kafka (de)serializers, except that Kafka Connect’s converters know about these structures and schemas and can serialize the keys and values to binary form. Kafka Connect provides a JSON converter that converts the structures into JSON and then uses the normal Kafka JSON serializer, so downstream consumers can just use the normal Kafka JSON deserializer and get a JSON representation of the Kafka Connect structs and schema. This is exactly what the Debezium tutorial is using, and the watch-topic consumer knows to use the JSON deserializer.

One great feature of Kafka Connect is that the connectors simply provide the structured messages, and Kafka Connect takes care of serializing them using the configured converter. This means that you can use any Kafka Connect converters with any Kafka Connect connector, including all of Debezium’s connectors.

Kafka Connect’s schema system was designed specifically with Avro in mind, so there is a one-to-one mapping between Kafka Connect schemas and Avro schemas. Confluent provides an Avro Converter for Kafka Connect that serializes the Kafka Connect structs provided by the connectors into the compact Avro binary representation, again using the Schema Registry just like the Avro serializer. The consumer just uses the normal Avro deserializer as mentioned above.

Using Avro for serialization of Debezium events brings several significant advantages:

The encoded binary forms of the Debezium events are significantly smaller than the JSON representations. Not only is the structured data encoded in a more compact form, but the schema associated with that structured data is represented in the binary form as a single integer.
Encoding the Debezium events into their Avro binary forms is fast. Only when the converter sees a new schema does it have to consult with the Schema Registry; otherwise, the schema has already been seen and its encoding logic already precomputed.
The Avro Converter for Kafka Connect produces messages with Avro-encoded keys and values that can be read by any Kafka consumers using the Avro deserializer.
Debezium event structures are based upon the structure of the table from which the changes were captured. When the structure of the source table changes (e.g., because an ALTER statement was applied to it), the structure and schema of the events will also change. If this is done in a manner such that the new Avro schema is compatible with the older Avro schema, then consumers will be able to process the events without disruption, even though the event structures evolve over time.
Avro’s schema mechanism is far more formal and rigorous than the free-form JSON structure, and the changes in the schemas are clearly identified when comparing any two messages.
The Avro converter, Avro (de)serializers, and Schema Registry are all open source.

It is true that using the Avro converter and deserializer requires a running Schema Registry, and that the registry becomes an integral part of your streaming infrastructure. However, this is a small price to pay for the benefits listed above.

Using the Avro Converter with Debezium

As mentioned above, in the interest of keeping the Debezium tutorial as simple as possible, we avoid using the Schema Registry or the Avro converter in the tutorial. We also don’t (yet) include the Avro converter in our Docker images, though that will change soon.

Nevertheless, it is absolutely possible to use the Avro Converter with the Debezium connectors when you are installing the connectors into either the Confluent Platform or into your own installation of Kafka Connect. Simply configure the Kafka Connect workers to use the Avro converter for the keys and values:

key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter

And, if you want to use the Avro Converter for Kafka Connect internal messages, then set these as well:

internal.key.converter=io.confluent.connect.avro.AvroConverter
internal.value.converter=io.confluent.connect.avro.AvroConverter

Once again, there is no need to configure the Debezium connectors any differently.

Randall Hauch

Randall is an open source software developer at Red Hat, and has been working in data integration for almost 20 years. He is the founder of Debezium and has worked on several other open source projects. He lives in Edwardsville, IL, near St. Louis.

About Debezium

Debezium is an open source distributed platform that turns your existing databases into event streams, so applications can see and respond almost instantly to each committed row-level change in the databases. Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, so your application can be stopped and restarted at any time and can easily consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely. Debezium is open source under the Apache License, Version 2.0.

Get involved

We hope you find Debezium interesting and useful, and want to give it a try. Follow us on Twitter @debezium, chat with us on Zulip, or join our mailing list to talk with the community. All of the code is open source on GitHub, so build the code locally and help us improve ours existing connectors and add even more connectors. If you find problems or have ideas how we can improve Debezium, please let us know or log an issue.