Text Vectorization
Synopsis
This operator performs basic feature extraction from text columns, such as TFIDF vectorization, sentiment scoring, and language detection.
Description
This operator is a simplified version of the text processing operators available in extensions. It takes one or several text or nominal columns and transforms them into a vectorized format using TFIDF. While simpler to use, this operator offers only a subset of the features of the text mining extension. One major advantage is that users can simply select the number of features to be added. If a label column is defined, only the most relevant features are added to the example set. If no label is defined, pruning based on frequency is applied to bring the number of columns down to the desired number.
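The frequency-based fallback described above can be illustrated with a minimal Python sketch. It is not the operator's implementation: the regex tokenizer, the function name select_by_frequency, and the choice of document frequency as the pruning criterion are all assumptions made for illustration. The label-based relevance criterion is not specified in this description, so it is not sketched here.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, then keep runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())

def select_by_frequency(texts, max_features):
    """Keep the max_features tokens that occur in the most documents."""
    doc_freq = Counter()
    for text in texts:
        # Count each token once per document.
        doc_freq.update(set(tokenize(text)))
    # Sort by descending document frequency, then alphabetically for stable ties.
    ranked = sorted(doc_freq, key=lambda tok: (-doc_freq[tok], tok))
    return ranked[:max_features]

texts = [
    "The battery life is great",
    "Battery died after a week",
    "Great screen, great battery",
]
vocabulary = select_by_frequency(texts, max_features=2)
print(vocabulary)
```

Here "battery" appears in all three texts and "great" in two, so those two tokens survive when only two features are requested.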
This operator only performs tokenization, lowercasing, and TFIDF calculation. In addition, it can extract the sentiment of each text column and detect its language from a set of 20 languages. Please note, however, that generic sentiment analysis often delivers only directionally correct results and is not comparable in accuracy to domain-specific models. The two additional columns for each input text column become special attributes and need to be transformed to regular attributes afterwards if they are to be used as inputs for machine learning models.
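The three processing steps (tokenization, lowercasing, TFIDF calculation) can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the operator's actual code: the regex tokenizer, the function name tfidf_rows, and the weighting formula (relative term frequency times log(N / document frequency)) are chosen for illustration, and the operator may use a different tokenizer or TFIDF variant.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, then keep runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())

def tfidf_rows(texts):
    """Return the sorted vocabulary and one {token: weight} dict per text."""
    docs = [tokenize(t) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    vocabulary = sorted(doc_freq)
    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty texts
        rows.append({
            # Assumed weighting: relative term frequency * log(N / df).
            tok: (counts[tok] / total) * math.log(n_docs / doc_freq[tok])
            for tok in vocabulary
        })
    return vocabulary, rows

vocabulary, rows = tfidf_rows(["Good phone", "Bad phone", "Good GOOD value"])
print(vocabulary)
print(rows[2])
```

Because of the lowercasing step, "Good" and "GOOD" are counted as the same token. Each entry in the vocabulary corresponds to one new column in the vectorized output.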
The operator delivers a pre-processing model that can be applied to new data sets to perform the same processing on them. This is necessary for transforming scoring data sets in the same way as the training data sets.
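The reason a reusable model is needed can be shown with a small Python sketch: the vocabulary and the document-frequency weights are learned once from the training texts and then reused unchanged on scoring texts, so both data sets end up with identical columns. The class name TfidfModel, its fit and transform methods, and the weighting formula are hypothetical stand-ins, not the operator's model format.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, then keep runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())

class TfidfModel:
    """Hypothetical stand-in for the pre-processing model."""

    def fit(self, texts):
        # Learn the vocabulary and IDF weights from the training texts only.
        docs = [set(tokenize(t)) for t in texts]
        doc_freq = Counter(tok for doc in docs for tok in doc)
        self.vocabulary = sorted(doc_freq)
        self.idf = {
            tok: math.log(len(docs) / doc_freq[tok]) for tok in self.vocabulary
        }
        return self

    def transform(self, texts):
        # Reuse the stored vocabulary; tokens unseen during fit are ignored.
        rows = []
        for text in texts:
            tokens = tokenize(text)
            counts = Counter(tokens)
            total = len(tokens) or 1  # avoid division by zero for empty texts
            rows.append(
                [counts[tok] / total * self.idf[tok] for tok in self.vocabulary]
            )
        return rows

model = TfidfModel().fit(["cheap flights", "cheap hotels", "luxury hotels"])
scored = model.transform(["cheap luxury cruise"])
print(model.vocabulary)
print(scored[0])
```

The scoring text contains the token "cruise", which was not present in the training data. It produces no new column, so the scored row has exactly the columns a downstream machine learning model was trained on.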
Input
example set input
This input port expects a data set. At least one of the columns should contain free text. That column can be either nominal or of type text.
Output
example set output
This output port provides the transformed data, i.e. the original data set with the extracted TFIDF columns, sentiment scores, or languages.
original
The original input data without any changes.