Create Embeddings
Synopsis
Calculates embeddings from the given column.
Description
This operator calculates an embedding (a vector in a high-dimensional space) for each row of the given text column. These embeddings can be used as input to machine learning algorithms, and they can also be stored in vector stores for similarity-based retrieval.
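To illustrate what similarity-based retrieval over embeddings means, the following is a minimal sketch in plain Python. It is not the operator's implementation: the function names, row identifiers, and the tiny 3-dimensional vectors are hypothetical, and cosine similarity is assumed as the similarity measure. All vectors are assumed to be non-zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, rows):
    """Return (row_id, score) pairs sorted from most to least similar."""
    scored = [(row_id, cosine_similarity(query, vec)) for row_id, vec in rows.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; the built-in models produce 384 dimensions.
rows = {
    "row_1": [1.0, 0.0, 0.0],
    "row_2": [0.6, 0.8, 0.0],
    "row_3": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, rows)
for row_id, score in ranking:
    print(row_id, round(score, 3))
```

Rows whose embeddings point in a similar direction to the query embedding receive a higher score, which is the basic idea behind retrieving related texts from a vector store.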
Differentiation
Insert Documents
The Insert Documents operator is used to add embeddings to a vector store.
Retrieve Documents
The Retrieve Documents operator is used to find related embeddings in a vector store.
Input
input
The table containing the column that will be converted into embeddings.
onnx model file
Optional ONNX model file for a custom embedding model.
tokenizer json file
Optional pre-trained Hugging Face tokenizer JSON file for the given onnx_model_file.
Output
embedding
The embeddings created from the given column.
through
The table that was provided at the input port is delivered through this output port without any modifications. This is usually used to reuse the same table in further operators of the process.
Parameters
column
The column that should be converted into an embedding.
embedding model
The embedding model that should be used.
- all-minilm-l6-v2: All MiniLM L6 V2. An English-language sentence transformer with 384 dimensions that supports up to 256 word pieces. For more details, visit https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- bge-small-en-v1.5: BGE small en v1.5. An English-language sentence transformer with 384 dimensions that supports up to 512 word pieces. For more details, visit https://huggingface.co/BAAI/bge-small-en-v1.5
- e5-small-v2: E5 Small V2. An English-language sentence transformer with 384 dimensions that supports up to 512 word pieces. For more details, visit https://huggingface.co/intfloat/e5-small-v2
- custom: Provide your own onnx_model_file, tokenizer_json_file and pooling_mode.
pooling mode
The pooling mode used by the custom embedding model to reduce the per-token embeddings of a sequence to a single embedding vector.
- CLS: Uses the first embedding (a special summarizing CLS embedding) of each sequence.
- MEAN: Uses the average of all embeddings in the sequence.
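The difference between the two pooling modes can be sketched in plain Python as follows. This is an illustration only, not the operator's code: the function names and the 2-dimensional token embeddings are hypothetical, and the sketch assumes a sequence without padding tokens (practical implementations typically exclude padding from the average).

```python
def cls_pooling(token_embeddings):
    """CLS: take the embedding of the first token of the sequence."""
    return list(token_embeddings[0])

def mean_pooling(token_embeddings):
    """MEAN: average the embeddings of all tokens, dimension by dimension."""
    count = len(token_embeddings)
    dims = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / count for d in range(dims)]

# Hypothetical per-token embeddings for one sequence (one inner list per token).
token_embeddings = [
    [1.0, 2.0],  # first token, i.e. the special CLS position
    [3.0, 4.0],
    [5.0, 6.0],
]

cls_vector = cls_pooling(token_embeddings)    # [1.0, 2.0]
mean_vector = mean_pooling(token_embeddings)  # [3.0, 4.0]
print(cls_vector, mean_vector)
```

Either way, a sequence of any length is reduced to one vector with the model's embedding dimension. Which mode is appropriate depends on how the custom model was trained, so it is worth checking the model's documentation.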