Text to Embedding Id

Synopsis

Maps "tokens" into their respective indices using an embedding model.

Description

This operator uses a pre-trained embedding model to transform its input. The output holds the indices of the tokens instead of the tokens themselves, thus preparing the dataset for training a neural network with an embedding layer.
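Conceptually, the mapping can be sketched as follows. This is an illustrative Python sketch, not the operator's actual implementation; the vocabulary and token values are hypothetical:

```python
# Hypothetical vocabulary of a pre-trained embedding model:
# each token maps to its row index in the embedding matrix.
vocab = {"the": 0, "cat": 1, "sat": 2}

def tokens_to_ids(tokens, vocab):
    """Replace each token with its embedding index."""
    return [vocab[t] for t in tokens]

tokens_to_ids(["the", "cat", "sat"], vocab)  # -> [0, 1, 2]
```

The resulting integer column is what an embedding layer expects as input: it looks up the corresponding vector for each index during training.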

Input

example set

Input ExampleSet. Contains a column holding the tokens to be mapped by this operator.

Output

example set

Transformed input; the token column values are mapped to their respective indices.

original example set

Original input data, passed through unchanged.

Parameters

Model type

Select the model type used for the embedding.

  • WORD2VEC: Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.
  • STATIC_MODEL: A static model is a model for distributed word representation. Select this option if the representation is stored in a text file with one token per line, followed by the numbers of its representation (decimal, English notation) separated by spaces, e.g. word 0.123 0.432 0.445 as one line of the file. A prime example of this structure is the GloVe model, coined from Global Vectors. GloVe is an unsupervised learning algorithm for obtaining vector representations for words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

Embedding model file

Provide path to the model file that contains the embedding weights/parameters.

Unknown token

Tokens that are part of the dataset but not part of the model need to be handled somehow. A word containing a typo, for instance, will probably not appear in a Word2vec vocabulary. With this parameter you can set a default token to be used when such anomalies occur: a word that has no embedding of its own is automatically mapped to the vector representation of this "unknown token".
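The fallback behavior can be sketched like this (hypothetical vocabulary and unknown-token name; "caat" stands in for a typo not present in the model):

```python
# Hypothetical vocabulary; "<unk>" is the designated unknown token.
vocab = {"<unk>": 0, "the": 1, "cat": 2}

def tokens_to_ids(tokens, vocab, unknown_token="<unk>"):
    """Map tokens to indices; out-of-vocabulary tokens
    fall back to the unknown token's index."""
    unk_id = vocab[unknown_token]
    return [vocab.get(t, unk_id) for t in tokens]

tokens_to_ids(["the", "caat"], vocab)  # "caat" is a typo -> [1, 0]
```

Every out-of-vocabulary token thus shares the unknown token's embedding vector, rather than breaking the mapping.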

Token attribute

Select the attribute that contains the tokens to be mapped into indices.