Capturing changes from MySQL

Streaming data at Yelp

The Yelp Engineering Blog recently began a series describing their real-time streaming data infrastructure. The first post provides a good introduction and explains how moving from a monolith to a service-oriented architecture increased productivity, but also made it more challenging to work with data spread across the 100 services that own it. It’s totally worth your time to read it right now.

As Justin writes in the post, several reasons prompted them to create their own real time streaming data pipeline:

Ensuring data always remains consistent across services is always a difficult task, but especially so when things can and do go wrong. Transactions across services may be useful in some situations, but they’re not straightforward, are expensive, and can lead to request amplification where one service calls another, which coordinates with two others, etc.
Services that update data in multiple backend services suffer from the dual write problem, which is where a failure occurs after one backing service was updated but before the other could be updated and that always results in data inconsistencies that are difficult to track down and correct.
Combining and integrating data spread across multiple services can also be difficult and expensive, but it is even harder when that data is continously changing. One approach is to use bulk APIs, but these can beprohibitive to create, can result in inconsistencies, and pose real scalability problems when services need to continually receive the never-ending updates to data.

Yelp’s Real-Time Data Pipeline records changes to data on totally ordered distributed logs so that downstream consumers can receive and process the same changes in exactly the same order. Services can consume changes made by other services, and can therefore stay in sync without explicit interservice communication. This system uses among other things Kafka for event logs, a homegrown system named MySQLStreamer to capture committed changes to MySQL tables, Avro for message format and schemas, and a custom Schematizer service that tracks consumers and enforces the Avro schemas used for messages on every Kafka topic.

How Yelp captures MySQL changes

Perhaps most interesting for Debezium is how Yelp captures the committed changes in their MySQL databases and write them to Kafka topics. Their second post in the series goes into a lot more detail about their MySQLStreamer process that reads the MySQL binary log and continously processes the DDL statements and DML operations that appear in the log, generating the corresponding insert, update, delete, and refresh events, and writing these event messages to a separate Kafka topic for each MySQL table. We’ve mentioned before that MySQL’s row-level binlog events that result from the DML operation don’t include the full definition of the columns, so knowing what the columns mean in each event requires process the DDL statements that also appear in the binlog. Yelp uses a separate MySQL instance it calls the schema tracker database, which behaves like a MySQL slave to which are applied only the DDL statements they read from the binlog. This technique lets Yelp’s MySQLStreamer system know the state of the database schema and the structure of its tables at the point in the binlog where they are processing events. This is pretty interesting, because it uses the MySQL engine to handle the DDL parsing.

Yelp’s MySQLStreamer process uses another MySQL database to track internal state describing its position in the binlog, what events have been successfully published to Kafka, and, because the binlog position varies on each replica, replica-independent information about each transaction. This latter information is similar to MySQL GTIDs, although Yelp is using earlier versions of MySQL that do not support GTIDs.

Of course, special consideration has to be taken for databases that have been around for a long time. The MySQL binlogs are capped and will not contain the entire history of the databases, so Yelp’s MySQLStreamer process bootstraps the change data capture process of old databases by starting another clean MySQL replica, which will use the built-in MySQL replication mechanism with the MySQL blackhole database engine to obtain a consistent snapshot of the master and so that all activity is logged in the replica’s binlog while the replica actually stores no data.

Yelp’s MySQLStreamer mechanism is quite ingenious in its use of MySQL and multiple extra databases to capture changes from MySQL databases and write them to Kafka topics. The downside, of course, is that doing so does increase the operational complexity of the system.

Similar purpose, similar approach

Debezium is an open source project that is building a change data capture for a variety of DBMSes. Like Yelp’s MySQLStreamer, Debezium’s MySQL Connector can continously capture the committed changes to MySQL database rows and record these events in a separate Kafka topic for each table. When first started, Debezium’s MySQL Connector can perform an initial consistent snapshot and then begin reading the MySQL binlog. It uses both DDL and DML operations that appear in the binlog, directly parsing and using the DDL statements to learn the changes to each table’s structure and the mapping of each insert, update, and delete binlog event. And each resulting change event written to Kafka includes information about the originating MySQL server and its binlog position, as well as the before and/or after states of the affected row.

However, unlike Yelp’s MySQLStreamer, the Debezium MySQL connector doesn’t need or use extra MySQL databases to parse DDL or to store the connector’s state. Instead, Debezium is built on top of Kafka Connect, which is a new Kafka library that provides much of the generic functionality of reliably pulling data from external systems, pushing it into Kafka topics, and tracking what data has already been processed. Kafka Connect stores this state inside Kafka itself, simplifying the operational footprint. Debezium’s MySQL connector can then focus on the details of performing a consistent snapshot when required, reading the binlog, and converting the binlog events into useful change events.

Yelp’s real time data pipeline makes use of a custom Avro schema registry, and uses those Avro schemas to encode each event into a compact binary representation while keeping the metadata about the structure of the event. It’s possible to do this with Debezium, too: simply run Confluent’s Schema Registry as a service and then configure the Kafka Connect worker to use the Avro Converter. As the converter serializes each event, it looks at the structure defined by the connector and, when that structure changes, generates an updated Avro Schema and registers it with the Schema Registry. That new Avro schema is then used to encode the event (and others with an identical structure) into a compact binary form written to Kafka. And of course, consumers then also use the same Avro converter so that as events are deserialized, the converter coordinates with the Schema Registry whenever it needs an Avro schema it doesn’t know about. As a result, the events are stored in a compact manner while the events' content and metadata remain available, while Schema Registry captures and maintains the history of the Avro schema for each table as it evolves over time.

Capturing changes from MySQL with Debezium

If you’re interested in change data capture with MySQL (or any other DBMSes), give Debezium a try by going through our tutorial that walks you through starting Kafka, Kafka Connect, and Debezium’s MySQL Connector to see exactly what change data events look like and how they can be used. Best of all, it’s open source with a growing community of developers that has had the benefit of building on top of recently-created Kafka Connect framework. Our MySQL connector is ready now, but we’re working on connectors for other DBMSes. Specifically, our upcoming 0.3 release will include our MongoDB Connector, with 0.4 including connectors for PostgreSQL and/or Oracle.

Correction: A previous version of this post incorrectly stated that Yelp was using a MySQL version that did support GTIDs, when in fact they are using a version that does not support MySQL GTIDs. The post has been corrected, and the author regrets the mistake.

Randall Hauch

Randall is an open source software developer at Red Hat, and has been working in data integration for almost 20 years. He is the founder of Debezium and has worked on several other open source projects. He lives in Edwardsville, IL, near St. Louis.

About Debezium

Debezium is an open source distributed platform that turns your existing databases into event streams, so applications can see and respond almost instantly to each committed row-level change in the databases. Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, so your application can be stopped and restarted at any time and can easily consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely. Debezium is open source under the Apache License, Version 2.0.

Get involved

We hope you find Debezium interesting and useful, and want to give it a try. Follow us on Twitter @debezium, chat with us on Zulip, or join our mailing list to talk with the community. All of the code is open source on GitHub, so build the code locally and help us improve ours existing connectors and add even more connectors. If you find problems or have ideas how we can improve Debezium, please let us know or log an issue.