In the realm of data streaming optimization, even subtle improvements can make a significant impact. This article focuses on one such refinement: the introduction of batch support in Debezium’s JDBC connector. We’ll guide you through the process of enabling batches and share the practical outcomes of our performance testing.

Configuring batches

Since our previous releases were primarily focused on core functionalities, our latest release is dedicated to addressing the main pain point associated with the connector: performance.

The goal was to improve throughput in terms of events processed per second (EPS). To achieve this, we revisited the connector to enable the processing of batches of events.

You can now fine-tune the size of the batch using a new property: batch.size.

The batch.size property specifies how many records the connector will attempt to group into a single batch when writing to the destination table. However, it’s crucial to note that the actual number of records available for batching depends on the consumer.max.poll.records property of the Kafka Connect worker.

It’s important to be aware that if you set consumer.max.poll.records in the Connect worker properties to a value lower than batch.size, batch processing will be constrained by consumer.max.poll.records, and the intended batch.size may not be achieved.
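For illustration, here is a minimal sketch of the relevant part of a sink connector configuration; the connector class and batch.size property belong to the Debezium JDBC connector, while the topic name and connection details are placeholders:

# Hypothetical excerpt of a Debezium JDBC sink connector configuration
connector.class=io.debezium.connector.jdbc.JdbcSinkConnector
topics=aviation
connection.url=jdbc:mysql://localhost:3306/inventory
connection.username=user
connection.password=password
# Attempt to group up to 1000 records into a single batch
batch.size=1000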

Configuring consumer.max.poll.records

If you prefer not to configure the consumer.max.poll.records property globally on the Connect worker, you have the option to set the underlying consumer’s max.poll.records for a specific connector using consumer.override.max.poll.records in the connector configuration.

To enable per-connector configuration properties and override the default worker properties, add the following parameter to the worker properties file: connector.client.config.override.policy (see the Kafka Connect documentation on overriding the worker configuration).
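For example, in the worker properties file:

# Allow connectors to override worker-level Kafka client configurations
connector.client.config.override.policy=All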

This property defines which configurations can be overridden by the connector. The default implementation is All, but other possible policies include None and Principal.

When connector.client.config.override.policy=All, each connector belonging to the worker is permitted to override the worker configuration. For sink connectors, you can then override the underlying consumer’s settings using the prefix consumer.override.<consumer-configuration-property>.
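For instance, to cap the poll size for one connector without touching the worker defaults, you could add a line like this to that connector’s configuration (the value 1000 is only an example):

# Override the embedded consumer's max.poll.records for this connector only
consumer.override.max.poll.records=1000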

It’s worth noting that even if you set max.poll.records (default value: 500), you may still receive fewer records per poll. This is because other properties limit how much data can be fetched from the topic’s partitions:

  • fetch.max.bytes (consumer), default 52428800 (50 MB): the maximum amount of data the broker returns for a single fetch request.

  • max.partition.fetch.bytes (consumer), default 1048576 (1 MB): the maximum amount of data the broker returns per partition.

  • message.max.bytes (broker), default 1048588 (~1 MB): the largest record batch size the broker accepts.

  • max.message.bytes (topic), default 1048588 (~1 MB): the per-topic equivalent of message.max.bytes.

Adjust these properties based on your expected payload size to reach the desired number of records per poll.
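To make the interplay concrete, here is a rough worked example (the 3 KB event size is an assumption for illustration, not a measurement from these tests): with the default max.partition.fetch.bytes of 1048576 bytes and an average serialized event of about 3 KB, a poll that reads from a single partition returns at most 1048576 / 3072 ≈ 341 records, no matter how high max.poll.records is set.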

Results from performance testing

The objective of the performance tests was to give a sense of how batch support improves EPS. These numbers do not reflect any real-world scenario; rather, they showcase the relative improvement over the previous, non-batching version of the JDBC connector.

Configuration used for the tests

All tests were executed on a ThinkPad T14s Gen 2i:

CPU: Intel® Core™ i7-1185G7 @ 3.00GHz (8 cores)
RAM: 32GB
Disk: 512GB NVMe

All required components (Kafka, Connect, ZooKeeper, etc.) ran inside Docker containers.

The table used for the test has the following structure:

CREATE TABLE `aviation` (
  `id` int NOT NULL,
  `aircraft` longtext,
  `airline` longtext,
  `passengers` int DEFAULT NULL,
  `airport` longtext,
  `flight` longtext,
  `metar` longtext,
  `flight_distance` double DEFAULT NULL,
  -- primary key assumed here, matching the primary-key index discussed in the results below
  PRIMARY KEY (`id`)
);

Test plan

We planned to execute these tests:

  • 100K events from a single table

    • MySQL batch vs without batch

  • 100K events from three different tables

    • MySQL batch vs without batch

  • 1M events from single table

    • MySQL batch with batch size: 500, 1000, 5000, 10000 vs without batch

    • MySQL batch with batch size: 500, 1000, 5000, 10000 with JSONConverter

    • MySQL batch with batch size: 500, 1000, 5000, 10000 with Avro

    • MySQL batch with batch size: 500, 1000, 5000, 10000 with Avro and no index on destination table

Figure 1. 100K events from a single table: batch vs. no batch.

Figure 1 illustrates the total execution time required to process 100,000 events from a single table, comparing the MySQL connector with and without batch support.

Despite the default values being set to 500 for both batch.size and consumer.max.poll.records, the observed actual batch size was only 337 records, due to the payload size limits described earlier.

We can observe, as expected, that the Debezium JDBC connector with batch support is faster.

Figure 2. 100K events split across three tables: batch vs. no batch.

In Figure 2, we observe that splitting 100,000 events into three tables does not impact the results. The Debezium JDBC connector with batch support remains faster compared to the non-batch version.

Figure 3. 1M events from a single table: batch vs. no batch.

In Figure 3, it is evident that the performance gain becomes more pronounced with 1,000,000 events. The Debezium JDBC connector with batch support took approximately 7 minutes to insert all events, an average throughput of about 2,300 EPS (1,000,000 events ÷ 2,300 EPS ≈ 435 s ≈ 7.2 minutes), while the run without batch support took 570 minutes (9.5 hours). The connector with batch support is therefore about 79 times faster (570 ÷ 7.2 ≈ 79).

Figure 4. 1M events with the JSON converter at different batch.size settings.

In Figure 4, we observe the behavior of the Debezium JDBC connector using the org.apache.kafka.connect.json.JsonConverter converter and writing to MySQL with different batch.size settings. While the initial differences are noticeable, the throughput degrades as the run progresses, and on average every batch.size configuration ends up taking about 7 minutes to process all events.

This raised a concern for us. After profiling the connector, we identified another issue: event deserialization, which was most likely the reason why increasing batch.size did not scale.

Although changing the serialization format improved scalability (as the Avro results below show), we still lack an explanation for the EPS slowdown over the course of a test run. One hypothesis is that some kind of buffer is involved somewhere in the pipeline.

Figure 5. 1M events with Avro at different batch.size settings.

We then conducted experiments with Avro, and as depicted in Figure 5, the results show a significant improvement. As expected, processing 1,000,000 events with batch.size=500 is slower than with batch.size=10000. Notably, in our test configuration, the optimal value for batch.size is 1000, resulting in the fastest processing time.
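Switching a connector to Avro means changing its key and value converters. As a minimal sketch, here is one common setup using the Confluent Avro converter with a schema registry (the registry URL is a placeholder, and the tests do not prescribe this exact setup; other registries such as Apicurio work as well):

# Hypothetical Avro converter settings for the sink connector
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081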

Although the results are better compared to JSON, there is still some performance degradation.

To identify potential bottlenecks in the code, we added some metrics and found that the majority of time was spent executing batch statements on the database.

Further investigation revealed that our table had an index defined on the primary key, which was slowing down the inserts.
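For the no-index variant shown in Figure 6, the index can be dropped on the destination table; a sketch, assuming MySQL and a primary key on id as in the table definition above:

-- Drop the primary-key index before bulk loading (sketch; MySQL syntax)
ALTER TABLE `aviation` DROP PRIMARY KEY;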

Figure 6. 1M events with Avro, different batch.size settings, and no index on the destination table.

In Figure 6, you can see the improved performance with Avro and without the primary key index. The performance boost from higher batch.size values is also evident.

Conclusion

We’ve explored how adjusting batch.size can enhance the performance of the Debezium JDBC connector and discussed the configuration required to maximize its benefits. Equally important is following the performance tips and general guidelines for efficient inserts that apply to your specific database.

While some settings are specific to particular databases, several general principles apply across most of them: for example, keeping batches large, minimizing indexes during bulk loads, and enabling driver-level batch optimizations.
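For example, with MySQL a well-known tip is to let the JDBC driver rewrite batched inserts into multi-row INSERT statements; a sketch of the connector’s connection URL (host, port, and database are placeholders):

# Hypothetical connection.url enabling MySQL's batch statement rewriting
connection.url=jdbc:mysql://localhost:3306/inventory?rewriteBatchedStatements=true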

Fiore Mario Vitale

Mario is a Senior Software Engineer at Red Hat. He lives in Italy.



About Debezium

Debezium is an open source distributed platform that turns your existing databases into event streams, so applications can see and respond almost instantly to each committed row-level change in the databases. Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, so your application can be stopped and restarted at any time and can easily consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely. Debezium is open source under the Apache License, Version 2.0.

Get involved

We hope you find Debezium interesting and useful, and want to give it a try. Follow us on Twitter @debezium, chat with us on Zulip, or join our mailing list to talk with the community. All of the code is open source on GitHub, so build the code locally and help us improve our existing connectors and add even more connectors. If you find problems or have ideas for how we can improve Debezium, please let us know or log an issue.