Create Embeddings

Synopsis

Calculates embeddings from the given column.

Description

This operator calculates an embedding (a vector in a high-dimensional space) for each row of the given text column. These embeddings can serve as input to machine learning algorithms or be stored in a vector store for similarity-based retrieval.
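To illustrate what similarity-based retrieval over embeddings means, here is a minimal sketch in plain Python. It is not part of the operator: the function names `cosine_similarity` and `most_similar` are hypothetical, the vectors are made-up 4-dimensional stand-ins for real embeddings, and cosine similarity is assumed as the metric (a vector store may use a different one).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_similar(query, rows):
    """Return the index of the row whose embedding is closest to the query."""
    return max(range(len(rows)), key=lambda i: cosine_similarity(query, rows[i]))


# Toy vectors (assumption): one embedding per table row.
rows = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.6, 0.0],
    [0.1, 0.0, 0.0, 0.9],
]
query = [0.8, 0.2, 0.0, 0.0]

best = most_similar(query, rows)  # index of the nearest row
```

Rows whose text is semantically close tend to get embeddings that point in a similar direction, which is why a nearest-vector lookup like this can retrieve related documents.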

Differentiation

Insert Documents

The Insert Documents operator is used to add embeddings to a vector store.

Retrieve Documents

The Retrieve Documents operator is used to find related embeddings in a vector store.

Input

input

The table containing the column which will be converted into an embedding.

onnx model file

Optional ONNX model file, used when the embedding model parameter is set to custom.

tokenizer json file

Optional pre-trained Hugging Face tokenizer JSON file for the given onnx_model_file.

Output

embedding

The embeddings created from the given column.

through

The table that was provided at the input port is delivered through this output port without any modifications. This is usually used to reuse the same table in further operators of the process.

Parameters

column

The column that should be converted into an embedding.

embedding model

The embedding model that should be used.

all-minilm-l6-v2: All MiniLM L6 V2. An English-language sentence transformer with 384 dimensions that supports up to 256 word pieces. For more details, visit https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
bge-small-en-v1.5: BGE small en v1.5. An English-language sentence transformer with 384 dimensions that supports up to 512 word pieces. For more details, visit https://huggingface.co/BAAI/bge-small-en-v1.5
e5-small-v2: E5 Small V2. An English-language sentence transformer with 384 dimensions that supports up to 512 word pieces. For more details, visit https://huggingface.co/intfloat/e5-small-v2
custom: Provide your own onnx_model_file, tokenizer_json_file and pooling_mode.

pooling mode

The pooling mode used by the custom embedding model to reduce the per-token embeddings of a sequence to a single embedding vector.

CLS: Uses the first embedding (a special summarizing CLS embedding) of each sequence.
MEAN: Uses the average of all embeddings in the sequence.
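The two pooling modes can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the operator's implementation: the helper names `cls_pool` and `mean_pool` are hypothetical, the token embeddings are toy values, and the sketch averages over every token without handling padding.

```python
def cls_pool(token_embeddings):
    """CLS pooling: take the embedding of the first token in the sequence."""
    if not token_embeddings:
        raise ValueError("sequence must contain at least one token embedding")
    return list(token_embeddings[0])


def mean_pool(token_embeddings):
    """MEAN pooling: average the token embeddings dimension by dimension."""
    if not token_embeddings:
        raise ValueError("sequence must contain at least one token embedding")
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / count for d in range(dim)]


# Toy sequence (assumption): three tokens with 3-dimensional embeddings,
# where the first entry plays the role of the CLS token.
sequence = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 4.0],
]

cls_vector = cls_pool(sequence)    # [1.0, 0.0, 2.0]
mean_vector = mean_pool(sequence)  # [2.0, 2.0, 2.0]
```

Either way, a sequence of any length is reduced to one vector with the model's embedding dimension. Which mode is appropriate depends on how the custom model was trained, so it is worth checking the model's documentation.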
