Streaming Data Changes from Your Database to Elasticsearch

We wish all the best to the Debezium community for 2018!

While we’re working on the 0.7.2 release, we thought we’d publish another post describing an end-to-end data streaming use case based on Debezium. A few weeks ago we saw how to set up a change data stream to a downstream database. In this blog post we will follow the same approach to stream the data to an Elasticsearch server, leveraging its excellent capabilities for full-text search on our data. To make matters a little more interesting, we will stream the data to both a PostgreSQL database and Elasticsearch, optimizing access to the data via the SQL query language as well as via full-text search.

Topology

Here’s a diagram that shows how the data is flowing through our distributed system. First, the Debezium MySQL connector is continuously capturing the changes from the MySQL database, and sending the changes for each table to separate Kafka topics. Then, the Confluent JDBC sink connector is continuously reading those topics and writing the events into the PostgreSQL database. And, at the same time, the Confluent Elasticsearch connector is continuously reading those same topics and writing the events into Elasticsearch.

 

Figure 1: A general topology

 

We are going to deploy these components into several different processes. In this example, we’ll deploy all three connectors to a single Kafka Connect instance that will write to and read from Kafka on behalf of all of them (in production you might want to run the connectors separately to achieve better performance).

 

Figure 2: A simplified topology

Configuration

We will use this Docker Compose file for a fast deployment of the demo. The deployment consists of the following Docker images (a sketch of the Compose file follows the list):

  • Apache ZooKeeper

  • Apache Kafka

  • An enriched Kafka Connect / Debezium image with a few changes:

    • PostgreSQL JDBC driver placed into /kafka/libs directory

    • The Confluent JDBC connector placed into /kafka/connect/kafka-connect-jdbc directory

  • Pre-populated MySQL as used in our tutorial

  • Empty PostgreSQL

  • Empty Elasticsearch
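
For orientation, here is a minimal sketch of what such a Compose file can look like; the service names and image tags are illustrative and the complete file is available in our examples repository:

version: '2'
services:
  zookeeper:
    image: debezium/zookeeper:${DEBEZIUM_VERSION}
  kafka:
    image: debezium/kafka:${DEBEZIUM_VERSION}
    links:
      - zookeeper
  mysql:
    image: debezium/example-mysql:${DEBEZIUM_VERSION}
  postgres:
    image: debezium/postgres:9.6
  elastic:
    image: docker.elastic.co/elasticsearch/elasticsearch:5.5.2
    ports:
      - "9200:9200"
  connect:
    # the enriched Kafka Connect / Debezium image described above
    image: debezium/connect-jdbc-es:${DEBEZIUM_VERSION}
    ports:
      - "8083:8083"
    links:
      - kafka
      - mysql
      - postgres
      - elastic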

The message format is not the same for the Debezium source connector and for the JDBC and Elasticsearch sink connectors, as they are developed separately and each focuses on slightly different objectives. Debezium emits a more complex event structure so that it captures all of the information available; in particular, the change events contain both the old and the new state of a changed record. The sink connectors, on the other hand, expect a simple message that just represents the record state to be written.
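
For illustration, here is an abbreviated update event for the customers table as Debezium emits it (the source metadata block and some further fields are omitted):

{
  "before": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "after": {
    "id": 1004,
    "first_name": "Anne Marie",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "op": "u",
  "ts_ms": 1515161382000
}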

Debezium’s UnwrapFromEnvelope single message transformation (SMT) collapses this complex change event structure into the flat, row-based format expected by the two sink connectors, effectively acting as a message translator between the two formats.
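
After the SMT has been applied, the same change arrives at the sink connectors as a plain record carrying only the new row state:

{
  "id": 1004,
  "first_name": "Anne Marie",
  "last_name": "Kretchmar",
  "email": "annek@noanswer.org"
}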

Example

Let’s move directly to our example, as that’s where the changes compared to the previous post become visible. First of all, we need to deploy all of the components:

export DEBEZIUM_VERSION=0.7
docker-compose up

Once all the components are started, we will register the Elasticsearch sink connector, which writes into the Elasticsearch instance. We want to use the same key (the primary id) in the source and in both PostgreSQL and Elasticsearch:

curl -i -X POST -H "Accept:application/json" \
    -H  "Content-Type:application/json" http://localhost:8083/connectors/ \
    -d @es-sink.json

We’re using this registration request:

{
  "name": "elastic-sink",
  "config": {
    "connector.class":
        "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "1",
    "topics": "customers",
    "connection.url": "http://elastic:9200",
    "transforms": "unwrap,key",
    "transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",         (1)
    "transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key", (2)
    "transforms.key.field": "id",                                                  (2)
    "key.ignore": "false",                                                         (3)
    "type.name": "customer"                                                        (4)
  }
}

The request configures these options:

  1. extracting only the new row’s state from Debezium’s change data message

  2. extracting the id field from the key struct; the same key is then used for the source and both destinations. This addresses the fact that the Elasticsearch connector only supports numeric types and strings as keys. If we did not extract the id, the messages would be filtered out by the connector because of the unknown key type (see the example after this list)

  3. using the key from the event instead of generating a synthetic one

  4. the type under which the events will be registered in Elasticsearch
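
To illustrate option 2: the message key Debezium emits for the customers table is a struct holding the primary key column, and ExtractField$Key replaces that struct with the plain field value, which the Elasticsearch connector accepts as a document id.

Key before the transformation (a struct, rendered as JSON):

{ "id": 1001 }

Key after ExtractField$Key:

1001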

Next we are going to register the JDBC sink connector, which writes into the PostgreSQL database:

curl -i -X POST -H "Accept:application/json" \
    -H  "Content-Type:application/json" http://localhost:8083/connectors/ \
    -d @jdbc-sink.json
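
The jdbc-sink.json request mirrors the Elasticsearch one. A sketch of what such a configuration can look like (the connection URL and credentials are illustrative and must match the PostgreSQL service in the Compose file):

{
  "name": "jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "customers",
    "connection.url": "jdbc:postgresql://postgres:5432/inventory?user=postgresuser&password=postgrespw",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
    "auto.create": "true",
    "insert.mode": "upsert",
    "pk.fields": "id",
    "pk.mode": "record_value"
  }
}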

Finally, the source connector must be set up:

curl -i -X POST -H "Accept:application/json" \
    -H  "Content-Type:application/json" http://localhost:8083/connectors/ \
    -d @source.json
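
Here is a sketch of what such a source registration can look like (hostnames, credentials and the server id are illustrative). Note the RegexRouter transform, which shortens the topic name to just the table name, customers, so that it matches the topics setting of both sink connectors:

{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.whitelist": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory",
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
    "transforms.route.replacement": "$3"
  }
}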

Let’s check whether the databases and the search server are synchronized. All the rows of the customers table should be found in the source database (MySQL) as well as in the target database (PostgreSQL) and in Elasticsearch:

docker-compose exec mysql bash -c 'mysql -u $MYSQL_USER  -p$MYSQL_PASSWORD inventory -e "select * from customers"'
+------+------------+-----------+-----------------------+
| id   | first_name | last_name | email                 |
+------+------------+-----------+-----------------------+
| 1001 | Sally      | Thomas    | sally.thomas@acme.com |
| 1002 | George     | Bailey    | gbailey@foobar.com    |
| 1003 | Edward     | Walker    | ed@walker.com         |
| 1004 | Anne       | Kretchmar | annek@noanswer.org    |
+------+------------+-----------+-----------------------+
docker-compose exec postgres bash -c 'psql -U $POSTGRES_USER $POSTGRES_DB -c "select * from customers"'
 last_name |  id  | first_name |         email
-----------+------+------------+-----------------------
 Thomas    | 1001 | Sally      | sally.thomas@acme.com
 Bailey    | 1002 | George     | gbailey@foobar.com
 Walker    | 1003 | Edward     | ed@walker.com
 Kretchmar | 1004 | Anne       | annek@noanswer.org
curl 'http://localhost:9200/customers/_search?pretty'
{
  "took" : 42,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "customers",
        "_type" : "customer",
        "_id" : "1001",
        "_score" : 1.0,
        "_source" : {
          "id" : 1001,
          "first_name" : "Sally",
          "last_name" : "Thomas",
          "email" : "sally.thomas@acme.com"
        }
      },
      {
        "_index" : "customers",
        "_type" : "customer",
        "_id" : "1004",
        "_score" : 1.0,
        "_source" : {
          "id" : 1004,
          "first_name" : "Anne",
          "last_name" : "Kretchmar",
          "email" : "annek@noanswer.org"
        }
      },
      {
        "_index" : "customers",
        "_type" : "customer",
        "_id" : "1002",
        "_score" : 1.0,
        "_source" : {
          "id" : 1002,
          "first_name" : "George",
          "last_name" : "Bailey",
          "email" : "gbailey@foobar.com"
        }
      },
      {
        "_index" : "customers",
        "_type" : "customer",
        "_id" : "1003",
        "_score" : 1.0,
        "_source" : {
          "id" : 1003,
          "first_name" : "Edward",
          "last_name" : "Walker",
          "email" : "ed@walker.com"
        }
      }
    ]
  }
}

With the connectors still running, we can add a new row to the MySQL database and then check that it was replicated into both the PostgreSQL database and Elasticsearch:

docker-compose exec mysql bash -c 'mysql -u $MYSQL_USER  -p$MYSQL_PASSWORD inventory'

mysql> insert into customers values(default, 'John', 'Doe', 'john.doe@example.com');
Query OK, 1 row affected (0.02 sec)
docker-compose exec postgres bash -c 'psql -U $POSTGRES_USER $POSTGRES_DB -c "select * from customers"'
 last_name |  id  | first_name |         email
-----------+------+------------+-----------------------
...
 Doe       | 1005 | John       | john.doe@example.com
(5 rows)
curl 'http://localhost:9200/customers/_search?pretty'
...
{
  "_index" : "customers",
  "_type" : "customer",
  "_id" : "1005",
  "_score" : 1.0,
  "_source" : {
    "id" : 1005,
    "first_name" : "John",
    "last_name" : "Doe",
    "email" : "john.doe@example.com"
  }
}
...

Summary

We set up a complex streaming data pipeline to synchronize a MySQL database with another database and also with an Elasticsearch instance. We managed to keep the same identifier across all systems, which allows us to correlate records across the system as a whole.

Propagating data changes from a primary database in near real-time to a search engine such as Elasticsearch enables many interesting use cases. Besides the different applications of full-text search, one could for instance also create dashboards and all kinds of visualizations using Kibana, to gain further insight into the data.

If you’d like to try out this set-up yourself, just clone the project from our examples repo. In case you need help, have feature requests or would like to share your experiences with this pipeline, please let us know in the comments below.

About Debezium

Debezium is an open source distributed platform that turns your existing databases into event streams, so applications can see and respond almost instantly to each committed row-level change in the databases. Debezium is built on top of Kafka and provides Kafka Connect compatible connectors that monitor specific database management systems. Debezium records the history of data changes in Kafka logs, so your application can be stopped and restarted at any time and can easily consume all of the events it missed while it was not running, ensuring that all events are processed correctly and completely. Debezium is open source under the Apache License, Version 2.0.

Get involved

We hope you find Debezium interesting and useful, and want to give it a try. Follow us on Twitter @debezium, chat with us on Gitter, or join our mailing list to talk with the community. All of the code is open source on GitHub, so build the code locally and help us improve our existing connectors and add even more connectors. If you find problems or have ideas for how we can improve Debezium, please let us know or log an issue.

Jiri Pechanec

Jiri is a software developer (and a former quality engineer) at Red Hat. He spent most of his career with Java and system integration projects and tasks. He lives near Brno, Czech Republic.

   
