Text Vectorization

Synopsis

This operator performs basic feature extraction from text columns, such as TFIDF vectorization, sentiment extraction, and language detection.

Description

This operator is a simplified version of the text processing operators available in extensions. It takes one or several text or nominal columns and transforms them into a vectorized format using TFIDF. While it is simpler to use, it offers only a subset of the features of the text mining extension. A major advantage is that users simply select the number of features to add: if a label column is defined, only the most relevant features are added to the example set. If no label is defined, frequency-based pruning reduces the number of columns to the desired number.

For vectorization, this operator performs only tokenization, conversion to lower case, and TFIDF calculation. In addition, it can extract the sentiment of each text column and detect its language out of a set of 20 languages. Please note, however, that generic sentiment analysis often delivers only directionally correct results and is not comparable in accuracy to domain-specific models. These two additional columns (sentiment and language) are added for each input text column as special attributes. To use them as inputs for machine learning models, they must be converted to regular attributes afterwards.
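The three vectorization steps named above (tokenization, lower-casing, TFIDF calculation) can be sketched in a few lines of Python. This is only an illustration and not the operator's implementation: the source does not state which TFIDF variant is used, so the sketch assumes term frequency as count divided by document length and inverse document frequency as log(N / df). The function names `tokenize` and `tfidf` and the `\W+` split pattern are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lower-case first, then split on runs of non-word characters.
    # The split pattern is an assumption, not the operator's documented default.
    return [t for t in re.split(r"\W+", text.lower()) if t]


def tfidf(docs):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token occur?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vocab = sorted(df)
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        # Assumed weighting: (count / length) * log(N / df)
        rows.append([(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows


docs = ["The cat sat", "the dog sat", "A bird flew"]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

Each entry of `vocab` corresponds to one generated column, and each row holds the weights for one input document. Tokens that occur in every document receive a weight of zero under this weighting.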

The operator delivers a pre-processing model which can be applied to new data sets to perform the same processing on this data. This is necessary for transforming scoring data sets in the same way as training data sets.
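The reason a stored model is needed can be shown with a small sketch: the vocabulary and the document-frequency weights are learned once from the training data and then reused unchanged on scoring data, so both data sets end up with identical columns. The class `VectorizerSketch` and its `fit` and `transform` methods are hypothetical names chosen for this illustration, and the TFIDF weighting is an assumed variant, not the operator's documented behavior.

```python
import math
import re
from collections import Counter


class VectorizerSketch:
    """Hypothetical stand-in for the preprocessing model."""

    @staticmethod
    def _tokens(text):
        return [t for t in re.split(r"\W+", text.lower()) if t]

    def fit(self, docs):
        # Learn vocabulary and IDF weights from the training documents only.
        tokenized = [self._tokens(d) for d in docs]
        df = Counter(t for toks in tokenized for t in set(toks))
        self.vocab = sorted(df)
        self.idf = {t: math.log(len(docs) / df[t]) for t in self.vocab}
        return self

    def transform(self, docs):
        # Reuse the stored vocabulary; tokens unseen during training are ignored.
        rows = []
        for d in docs:
            toks = self._tokens(d)
            counts = Counter(toks)
            total = len(toks) or 1
            rows.append([counts[t] / total * self.idf[t] for t in self.vocab])
        return rows


model = VectorizerSketch().fit(["good movie", "bad movie", "good plot"])
scored = model.transform(["good unseen words", "bad plot"])
print(model.vocab)
print(scored)
```

In this sketch the scoring rows always have one value per training-time vocabulary entry, regardless of which words appear in the new texts.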

Input

example set input

This input port expects a data set. At least one column should contain free text; that column can be of type nominal or text.

Output

example set output

This output port provides the transformed data, i.e. the original data set with the extracted TFIDF columns, sentiment scores, or languages.

original

The original input data without any changes.

preprocessing model

The text processing model. It shows useful information about the extracted features and can be applied to new (scoring) data sets to perform the same transformations there.

Parameters

Add sentiment

Indicates if a column with the most likely sentiment (positive, negative, neutral) should be added for each of the processed text columns.

Add language

Indicates if a column with the most likely content language should be added for each of the processed text columns.

Keep original

Indicates if the original text attributes should be kept or removed.

Store training documents

Indicates if the documents used for building the word vectors should be stored as part of the model. This is useful for visualizations but will increase memory usage.

Store scoring documents

Indicates if the documents which are transformed during scoring should also be stored as part of the model. This is useful for visualizations but will increase memory usage, especially in continuous use.

Document class attribute

The name of the nominal attribute which should be used for deriving the document classes (typically the label attribute).

Token split

This regular expression is used to split the tokens from each other. Default is word boundaries.
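How a split expression changes the resulting tokens can be illustrated with Python's `re.split`. The source does not give the operator's actual default pattern, so both patterns below are assumptions, and the operator's regular expression dialect may differ from Python's. The helper `split_tokens` is a hypothetical name.

```python
import re


def split_tokens(text, pattern=r"\W+"):
    # Split wherever the pattern matches and drop empty strings.
    return [t for t in re.split(pattern, text) if t]


text = "Fast, cheap; reliable-ish"

# Splitting on runs of non-word characters also breaks hyphenated words.
by_non_word = split_tokens(text)
print(by_non_word)  # ['Fast', 'cheap', 'reliable', 'ish']

# Splitting only on commas and semicolons keeps "reliable-ish" as one token.
by_punctuation = split_tokens(text, r"[,;]\s*")
print(by_punctuation)  # ['Fast', 'cheap', 'reliable-ish']
```

A narrower pattern yields fewer, longer tokens, which in turn changes which columns the vectorization can generate.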

Apply pruning

Indicates if pruning should be applied to the resulting columns.

Max number of new columns

The maximum number of columns generated, after pruning, for each input text column.
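One plausible reading of frequency-based pruning with a column limit is sketched below: tokens are ranked by the number of documents they occur in, and only the top entries are kept. The exact criterion the operator applies is not specified in this description, so the ranking rule, the tie-breaking, and the function name `prune_by_frequency` are all assumptions.

```python
import re
from collections import Counter


def prune_by_frequency(docs, max_columns):
    # Count, for each token, the number of documents that contain it.
    df = Counter()
    for d in docs:
        df.update({t for t in re.split(r"\W+", d.lower()) if t})
    # Rank by descending document frequency; break ties alphabetically.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:max_columns]


kept = prune_by_frequency(["red apple", "red car", "red apple pie"], 2)
print(kept)  # ['red', 'apple']
```

Since the limit applies to each input text column separately, a data set with several text columns would run such a selection once per column.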
