Embeddings Transformation

An important task in preparing content for use with large language models, or for natural language processing in general, is converting text to a numeric representation, a step known as text vectorization. One way to implement this conversion is through text embeddings, a process in which text is converted to high-dimensional numerical vectors. Representing text as a vector enables computers to perform semantic similarity searches and other advanced operations.

Debezium offers a built-in feature to transform specified text fields into numerical embedding vectors, and to add the resulting embeddings to the event record. Embedding inference is performed by the embedding model that the configured provider serves. Debezium offers several embedding model providers.

Debezium can use a single message transformation (SMT) to generate embeddings. To interact with different embedding providers, this Embeddings transformation uses the LangChain4j framework.

Behavior

The Embeddings transformation takes as input specific fields in the original event record, and passes the text content of those fields to the configured embedding model for inference. That is, the SMT creates embeddings from the text contained in the specified fields. The resulting embeddings are appended to the record. The original source field is also preserved in the record.

The source field must be a string field. The value of an embedding field is a vector of 32-bit floating-point numbers. The size of the vector depends on the selected model. To provide the internal representation of an embedding, Debezium uses the FloatVector data type. The schema type of the embedding field in the record is io.debezium.data.FloatVector.

Both the source field and the embedding field specifications support nested structures, for example, after.product_description_embedding.
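
To illustrate the behavior, the following simplified sketch shows an event value before and after the transformation, assuming the field.source and field.embedding settings used in the configuration examples below. The field contents and vector values are invented for illustration, and real models produce far more dimensions. Before the transformation, the event value might contain:

{
  "after": {
    "id": 1,
    "product": "a family room with a comfortable sofa"
  }
}

After the transformation, the source field is preserved and the embedding is appended; the schema type of product_embedding is io.debezium.data.FloatVector:

{
  "after": {
    "id": 1,
    "product": "a family room with a comfortable sofa",
    "product_embedding": [0.0176, -0.0528, 0.0391]
  }
}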

Configuration

To configure a connector to use the embeddings transformation, add the following lines to your connector configuration:

transforms=embeddings
transforms.embeddings.type=io.debezium.ai.embeddings.FieldToEmbedding

You must specify at least one field to use as the input for the embedding. Specifying a destination field for the embedding is optional; if you omit it, the message value contains only the embedding itself. However, it is recommended that you specify the embedding destination field, for example:

transforms.embeddings.field.source=after.product
transforms.embeddings.field.embedding=after.product_embedding

Finally, you must place the model provider JAR file in the connector class path, and configure the provider. For example, for the Ollama provider, add debezium-ai-embeddings-ollama-$VERSION.jar to your connector class path, and add the Ollama URL and model name to the connector configuration, as shown in the following example:

transforms.embeddings.ollama.url=http://localhost:11434
transforms.embeddings.ollama.model.name=all-minilm

The following example shows what the full configuration for an Ollama provider might look like:

transforms=embeddings
transforms.embeddings.type=io.debezium.ai.embeddings.FieldToEmbedding
transforms.embeddings.field.source=after.product
transforms.embeddings.field.embedding=after.product_embedding
transforms.embeddings.ollama.url=http://localhost:11434
transforms.embeddings.ollama.model.name=all-minilm

General configuration options

Table 1. Descriptions of embedding SMT configuration options

Option names are relative to the transforms.embeddings. prefix shown in the preceding examples.

field.source
    Default: No default value
    Specifies the field in the source record to use as the input for the embeddings. The data type of the specified field must be string.

field.embedding
    Default: No default value
    Specifies the name of the field that the SMT adds to the record to contain the text embedding. If no value is specified, the resulting record contains only the embedding value.

Model provider configuration

Hugging Face

Embeddings provided by models available via Hugging Face.

Table 2. Configuration options for Hugging Face embeddings

Access token
    Default: No default value
    Hugging Face access token.

Model name
    Default: No default value
    Name of the embedding model. Use the Hugging Face REST API to retrieve the list of available models, and specify the inference provider in the REST call (for example, the Hugging Face inference provider).

Base URL
    The base Hugging Face inference API URL.

Operation timeout
    Default: 15000 (15 seconds)
    Maximum amount of time, in milliseconds, to wait for the embeddings reply.

Hugging Face has begun to support different embedding inference providers, including external providers. However, the LangChain4j framework does not yet support external providers. As a result, the Hugging Face provider is currently the only inference provider that you can use with Debezium.
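
The Hugging Face option keys are not shown in the configuration examples above, so the following is only a sketch. The huggingface.* keys follow the provider-prefix naming pattern of the Ollama example and are assumptions, as is the model name; verify both against the Debezium reference documentation for your release.

# Hypothetical Hugging Face configuration; the huggingface.* option keys,
# the access token placeholder, and the model name are assumptions:
transforms=embeddings
transforms.embeddings.type=io.debezium.ai.embeddings.FieldToEmbedding
transforms.embeddings.field.source=after.product
transforms.embeddings.field.embedding=after.product_embedding
transforms.embeddings.huggingface.access.token=<your-access-token>
transforms.embeddings.huggingface.model.name=sentence-transformers/all-MiniLM-L6-v2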

Ollama

Supports any embedding model that an Ollama server provides.

Table 3. Ollama embeddings configuration options

ollama.url
    Default: No default value
    URL of the Ollama server, including the port number, for example, http://localhost:11434.

ollama.model.name
    Default: No default value
    Name of the embedding model.

Operation timeout
    Default: 15000 (15 seconds)
    Maximum amount of time, in milliseconds, to wait for the embeddings reply.

ONNX MiniLM

Provides the in-process ONNX all-minilm-l6-v2 model, which is included directly in the JAR file. No configuration other than the general options is needed.

This model is especially suited to prototyping and testing, because it does not depend on any external infrastructure or remote requests.
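
Because the model ships inside the provider JAR, a working configuration needs only the general options. The following minimal sketch assumes that the ONNX provider JAR is already on the connector class path:

# Minimal configuration for the in-process ONNX MiniLM provider;
# no provider-specific options are required:
transforms=embeddings
transforms.embeddings.type=io.debezium.ai.embeddings.FieldToEmbedding
transforms.embeddings.field.source=after.product
transforms.embeddings.field.embedding=after.product_embedding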

Voyage AI

Embeddings provided by Voyage AI models.

Table 4. Voyage AI embeddings configuration options

Access token
    Default: No default value
    The Voyage AI access token.

Model name
    Default: No default value
    Name of the embedding model. The list of Voyage AI models can be found in the Voyage AI Text Embeddings documentation.

Base URL
    Base URL of the Voyage AI API server.

Operation timeout
    Default: 15000 (15 seconds)
    Maximum amount of time, in milliseconds, to wait for the embeddings reply.
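
For illustration, a Voyage AI setup might look like the following sketch. The voyage.* option keys follow the provider-prefix naming pattern of the other providers and are assumptions, as is the model name; verify both against the Voyage AI documentation and the Debezium reference for your release.

# Hypothetical Voyage AI configuration; the voyage.* option keys and the
# model name are assumptions:
transforms=embeddings
transforms.embeddings.type=io.debezium.ai.embeddings.FieldToEmbedding
transforms.embeddings.field.source=after.product
transforms.embeddings.field.embedding=after.product_embedding
transforms.embeddings.voyage.access.token=<your-access-token>
transforms.embeddings.voyage.model.name=voyage-3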