Reselect columns

Overview

To improve performance and reduce storage overhead, databases can use external storage for certain columns. This type of storage is used for columns that store large amounts of data, such as the PostgreSQL TOAST (The Oversized-Attribute Storage Technique), Oracle Large Object (LOB), or the Oracle Exadata Extended String data types. To reduce I/O overhead and increase query speed, when data changes in a table row, the database retrieves only the columns that contain new values, ignoring data in externally stored columns that remain unchanged. As a result, the value of the externally stored column is not recorded in the database log, and Debezium subsequently omits the column when it emits the event record. Downstream consumers that receive event records that omit required values can experience processing errors.

IF a value for an externally stored column is not present in the database log entry for an event, when Debezium emits a record for the event, it replaces the missing value with an unavailable.value.placeholder sentinel value. These sentinel values are inserted into appropriately typed fields, for example, a byte array for bytes, a string for strings, or a key-value map for maps.

To retrieve data for columns that were not available in the initial query, you can apply the Debezium reselect columns post processor (ReselectColumnsPostProcessor). You can configure the post processor to reselect one or more columns from a table. After you configure the post processor, it monitors events that the connector emits for the column names that you designate for reselection. When it detects an event with the specified columns, the post processor re-queries the source tables to retrieve data for the specified columns, and fetches their current state.

You can configure the post processor to reselect the following column types:

  • null columns.

  • Columns that contain the unavailable.value.placeholder sentinel value.

You can use the ReselectColumnsPostProcessor post processor only with Debezium source connectors.
The post processor is not designed to work with the Debezium JDBC sink connector.

Keyless tables

The reselect columns post processor generates a reselect query that returns the row to be modified. To construct the WHERE clause for the query, by default, the post processor uses a relational table model that is based on the table’s primary key columns or on the unique index that is defined for the table.

For keyless tables, the SELECT query that ReselectColumnsPostProcessor submits might return multiple rows, in which case Debezium always uses only the first row. You cannot prioritize the order of the returned rows. To enable the post processor to return a consistently usable result for a keyless table, it’s best to designate a custom key that can identify a unique row. The custom key must be capable of uniquely identify records in the source table based on a combination of columns.

To define such a custom message key, use the message.key.columns property in the connector configuration. After you define a custom key, set the reselect.use.event.key configuration property to true. Setting this option enables the post processor to use the specified event key fields as selection criteria in lieu of a primary key column. Be sure to test the configuration to ensure that the reselection query provides the expected results.

Configuration example

Configuring a post processor is similar to configuring a custom converter or single message transformation (SMT). To enable the connector to use the ReselectColumnsPostProcessor, add the following entries to the connector configuration:

  "post.processors": "reselector", (1)
  "reselector.type": "io.debezium.processors.reselect.ReselectColumnsPostProcessor", (2)
  "reselector.reselect.columns.include.list": "<schema>.<table>:<column>,<schema>.<table>:<column>", (3)
  "reselector.reselect.unavailable.values": "true", (4)
  "reselector.reselect.null.values": "true" (5)
  "reselector.reselect.use.event.key": "false" (6)
Item Description

1

Comma-separated list of post processor prefixes.

2

The fully-qualified class type name for the post processor.

3

Comma-separated list of column names, specified by using the following format: <schema>.<table>:<column>.

4

Enables or disables reselection of columns that contain the unavailable.value.placeholder sentinel value.

5

Enables or disables reselection of columns that are null.

6

Enables or disables reselection based event key field names.

Configuration options

The following table lists the configuration options that you can set for the Reselect Columns post-processor.

Table 1. Reselect columns post processor configuration options

Property

Default

Description

No default

Comma-separated list of column names to reselect from the source database. Use the following format to specify column names: + <schema>.<table>:_<column>_

Do not set this property if you set the reselect.columns.exclude.list property.

No default

Comma-separated list of column names in the source database to exclude from reselection. Use the following format to specify column names: + <schema>.<table>:_<column>_

Do not set this property if you set the reselect.columns.include.list property.

true

Specifies whether the post processor reselects a column that matches the reselect.columns.include.list filter if the column value is provided by the connector’s unavailable.value.placeholder property.

true

Specifies whether the post processor reselects a column that matches the reselect.columns.include.list filter if the column value is null.

false

Specifies whether the post processor reselects based on the event’s key field names or uses the relational table’s primary key column names.

By default, the reselection query is based on the relational table’s primary key columns or unique key index. For tables that do not have a primary key, set this property to true, and configure the message.key.columns property in the connector configuration to specify a custom key for the connector to use when it creates events. The post processor then uses the specified key field names as the primary key in the SQL reselection query.