Embeddings Transformation
An important task in preparing content for use in large language models, or in natural language processing in general, is converting text to a numeric representation, also known as text vectorization. One way to implement this conversion is through text embeddings, a process in which text is converted into high-dimensional numerical vectors. Representing text as a vector enables computers to perform semantic similarity searches and other advanced operations.
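To illustrate why vector representations enable semantic similarity search, the following sketch compares toy embedding vectors with cosine similarity, the measure typically used over embeddings. The vectors and their values are invented for illustration; a real model such as all-MiniLM-L6-v2 produces vectors with hundreds of dimensions.

```java
public class EmbeddingSimilarity {

    // Cosine similarity between two equal-length float vectors:
    // dot(a, b) / (|a| * |b|), ranging from -1 to 1.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy 3-dimensional "embeddings" with invented values.
        float[] laptop   = {0.9f, 0.1f, 0.0f};
        float[] notebook = {0.8f, 0.2f, 0.1f};
        float[] banana   = {0.0f, 0.1f, 0.9f};

        // Semantically related texts should score higher than unrelated ones.
        System.out.printf("laptop ~ notebook: %.3f%n", cosine(laptop, notebook));
        System.out.printf("laptop ~ banana:   %.3f%n", cosine(laptop, banana));
    }
}
```

A similarity search over a database of embeddings simply returns the rows whose vectors score highest against the query vector.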
Debezium offers a built-in feature to transform specified text fields into numerical embedding vectors, and to add the resulting embeddings to the event record. Embedding inference is performed by the embedding model served by the configured provider. Debezium offers several embedding model providers.
Debezium can use a single message transformation (SMT) to generate embeddings. To interact with different embeddings providers, this Embeddings transformation uses the langchain4j framework.
Behavior
The Embeddings transformation takes as input specific fields in the original event record, and passes the text content of those fields to the configured embedding model for inference. That is, the SMT creates embeddings from the text contained in the specified fields. The resulting embeddings are appended to the record. The original source field is also preserved in the record.
The source field must be a string field.
The value of an embedding field is a vector of 32-bit floating-point numbers.
The size of the vector depends on the selected model.
To provide the internal representation of an embedding, Debezium uses the FloatVector data type.
The schema type of the embedding field in the record is io.debezium.data.FloatVector.
Both source field and embedding field specifications support nested structures, such as after.product_description_embedding.
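For illustration, an event value after the transformation might look like the following, assuming a source field after.product and an embedding field after.product_embedding. The record content and embedding values are invented, and the vector is truncated; a real vector has as many entries as the model's output dimension.

```json
{
  "after": {
    "id": 101,
    "product": "scooter",
    "product_embedding": [0.0132, -0.0427, 0.0861]
  }
}
```

Note that the original source field, product, remains in the record alongside the new embedding field.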
Configuration
To configure a connector to use the embeddings transformation, add the following lines to your connector configuration:
transforms=embeddings
transforms.embeddings.type=io.debezium.ai.embeddings.FieldToEmbedding
You must specify at least one field to use as the input for the embedding. The destination field in which to place the embedding is optional; if it is not specified, the message value contains only the embedding itself. However, it is recommended that you specify the embedding destination field, for example:
transforms.embeddings.field.source=after.product
transforms.embeddings.field.embedding=after.product_embedding
Finally, you must place the model provider JAR file in the connector class path, and configure the provider.
For example, for the Ollama provider, add debezium-ai-embeddings-ollama-$VERSION.jar
to your connector class path, and add the Ollama URL and model name to the connector configuration, as shown in the following example:
transforms.embeddings.ollama.url=http://localhost:11434
transforms.embeddings.ollama.model.name=all-minilm
The following example shows what the full configuration for an Ollama provider might look like:
transforms=embeddings
transforms.embeddings.type=io.debezium.ai.embeddings.FieldToEmbedding
transforms.embeddings.field.source=after.product
transforms.embeddings.field.embedding=after.product_embedding
transforms.embeddings.ollama.url=http://localhost:11434
transforms.embeddings.ollama.model.name=all-minilm
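Embedded in a complete connector registration, the same settings might look like the following sketch. The connector class, database settings, and names are hypothetical; only the transforms.embeddings.* entries come from the example above.

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "inventory",
    "topic.prefix": "inventory",
    "transforms": "embeddings",
    "transforms.embeddings.type": "io.debezium.ai.embeddings.FieldToEmbedding",
    "transforms.embeddings.field.source": "after.product",
    "transforms.embeddings.field.embedding": "after.product_embedding",
    "transforms.embeddings.ollama.url": "http://localhost:11434",
    "transforms.embeddings.ollama.model.name": "all-minilm"
  }
}
```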
General configuration options
Option | Default | Description |
---|---|---|
field.source | No default value | Specifies the field in the source record to use as an input for the embeddings. The data type of the specified field must be STRING. |
field.embedding | No default value | Specifies the name of the field that the SMT adds to the record to contain the text embedding. If no value is specified, the resulting record contains only the embedding value. |
Model provider configuration
Hugging Face
Embeddings provided by models available via Hugging Face.
Option | Default | Description |
---|---|---|
huggingface.access.token | No default value | Hugging Face access token. |
huggingface.model.name | No default value | Name of the embedding model. Use the REST API to retrieve the list of available models. Specify the provider in the REST call, for example, Hugging Face inference provider. |
huggingface.baseurl | | The base Hugging Face inference API URL. |
huggingface.operation.timeout.ms | 15000 (15 seconds) | Maximum amount of time in milliseconds to wait for the embeddings reply. |
Hugging Face has started to support multiple embedding inference providers, including external providers. However, the LangChain4j framework does not yet support external providers. As a result, the only inference provider currently available for use with Debezium is the Hugging Face provider.
Ollama
Supports any embedding model that is served by an Ollama server.
Option | Default | Description |
---|---|---|
ollama.url | No default value | URL of the Ollama server, including port number, for example, http://localhost:11434. |
ollama.model.name | No default value | Name of the embedding model. |
ollama.operation.timeout.ms | 15000 (15 seconds) | Maximum amount of time in milliseconds to wait for the embeddings reply. |
ONNX MiniLM
Provides the in-process ONNX all-minilm-l6-v2 model, which is included directly in the JAR file. No configuration other than the general options is needed.
This model is especially suited to prototyping and testing, because it does not depend on any external infrastructure or remote requests.
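A minimal configuration using the ONNX provider could therefore look like the following sketch. The field names are illustrative, and the assumption is that the ONNX provider JAR (named analogously to debezium-ai-embeddings-ollama-$VERSION.jar) is on the connector class path.

```properties
transforms=embeddings
transforms.embeddings.type=io.debezium.ai.embeddings.FieldToEmbedding
transforms.embeddings.field.source=after.product
transforms.embeddings.field.embedding=after.product_embedding
```

Because the model runs in-process, no provider-specific URL, token, or model name is required.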
Voyage AI
Embeddings provided by Voyage AI models.
Option | Default | Description |
---|---|---|
voyageai.access.token | No default value | The Voyage AI access token. |
voyageai.model.name | No default value | Name of the embedding model. The list of Voyage AI models can be found in the Voyage AI Text Embeddings documentation. |
voyageai.baseurl | | Base Voyage AI API server. |
voyageai.operation.timeout.ms | 15000 (15 seconds) | Maximum amount of time in milliseconds to wait for the embeddings reply. |