Reselect columns
Overview
To improve performance and reduce storage overhead, databases can use external storage for certain columns. This type of storage is used for columns that store large amounts of data, such as the PostgreSQL TOAST (The Oversized-Attribute Storage Technique), Oracle Large Object (LOB), or the Oracle Exadata Extended String data types. To reduce I/O overhead and increase query speed, when data changes in a table row, the database retrieves only the columns that contain new values, ignoring data in externally stored columns that remain unchanged. As a result, the value of the externally stored column is not recorded in the database log, and Debezium subsequently omits the column when it emits the event record. Downstream consumers that receive event records that omit required values can experience processing errors.
IF a value for an externally stored column is not present in the database log entry for an event, when Debezium emits a record for the event, it replaces the missing value with an unavailable.value.placeholder
sentinel value.
These sentinel values are inserted into appropriately typed fields, for example, a byte array for bytes, a string for strings, or a key-value map for maps.
To retrieve data for columns that were not available in the initial query, you can apply the Debezium reselect columns post processor (ReselectColumnsPostProcessor
).
You can configure the post processor to reselect one or more columns from a table.
After you configure the post processor, it monitors events that the connector emits for the column names that you designate for reselection.
When it detects an event with the specified columns, the post processor re-queries the source tables to retrieve data for the specified columns, and fetches their current state.
You can configure the post processor to reselect the following column types:
-
null
columns. -
Columns that contain the
unavailable.value.placeholder
sentinel value.
You can use the ReselectColumnsPostProcessor post processor only with Debezium SQL database connectors.
|
Keyless tables
The reselect columns post processor generates a reselect query that returns the row to be modified.
To construct the WHERE
clause for the query, by default, the post processor uses a relational table model that is based on the table’s primary key columns or on the unique index that is defined for the table.
For keyless tables, the SELECT
query that ReselectColumnsPostProcessor
submits might return multiple rows, in which case Debezium always uses only the first row.
You cannot prioritize the order of the returned rows.
To enable the post processor to return a consistently usable result for a keyless table, it’s best to designate a custom key that can identify a unique row.
The custom key must be capable of uniquely identify records in the source table based on a combination of columns.
To define such a custom message key, use the message.key.columns
property in the connector configuration.
After you define a custom key, set the reselect.use.event.key
configuration property to true
.
Setting this option enables the post processor to use the specified event key fields as selection criteria in lieu of a primary key column.
Be sure to test the configuration to ensure that the reselection query provides the expected results.
Configuration example
Configuring a post processor is similar to configuring a custom converter or single message transformation (SMT).
To enable the connector to use the ReselectColumnsPostProcessor
, add the following entries to the connector configuration:
"post.processors": "reselector", (1)
"reselector.type": "io.debezium.processors.reselect.ReselectColumnsPostProcessor", (2)
"reselector.reselect.columns.include.list": "<schema>.<table>:<column>,<schema>.<table>:<column>", (3)
"reselector.reselect.unavailable.values": "true", (4)
"reselector.reselect.null.values": "true" (5)
"reselector.reselect.use.event.key": "false" (6)
Item | Description |
---|---|
1 |
Comma-separated list of post processor prefixes. |
2 |
The fully-qualified class type name for the post processor. |
3 |
Comma-separated list of column names, specified by using the following format: |
4 |
Enables or disables reselection of columns that contain the |
5 |
Enables or disables reselection of columns that are |
6 |
Enables or disables reselection based event key field names. |
Configuration options
The following table lists the configuration options that you can set for the Reselect Columns post-processor.
Property |
Default |
Description |
No default |
Comma-separated list of column names to reselect from the source database.
Use the following format to specify column names: + |
|
No default |
Comma-separated list of column names in the source database to exclude from reselection.
Use the following format to specify column names: + |
|
|
Specifies whether the post processor reselects a column that matches the |
|
|
Specifies whether the post processor reselects a column that matches the |
|
|
Specifies whether the post processor reselects based on the event’s key field names or uses the relational table’s primary key column names. |