Text Vectorization
Synopsis
This operator performs basic feature extraction from text columns, such as TFIDF vectorization, sentiment scoring, and language detection.
Description
This operator is a simplified version of the text processing operators available in extensions. It takes one or several text or nominal columns and transforms them into a vectorized format using TFIDF. It is simpler to use, but it offers only a subset of the features of the text mining extension. One major advantage is that users can simply select the number of features to be added. If a label column is defined, only the most relevant features are added to the example set. If no label is defined, pruning based on frequency is applied to bring the number of columns down to the desired number.
This operator only performs tokenization, lowercasing, and TFIDF calculation. In addition, it can extract the sentiment of each text column and detect its language out of a set of 20 languages. Please note, however, that generic sentiment analysis often delivers only directionally correct results and is not comparable in accuracy to models built for a specific domain. These two additional columns for each input text column become special attributes and need to be transformed to regular attributes afterwards if they are to be used as inputs for machine learning models.
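The three processing steps named above can be sketched in plain Python. This is only an illustration, not the operator's implementation: the function names, the `\W+` split pattern, and the exact weighting (relative term frequency times the logarithm of the inverse document frequency) are assumptions, since the operator's formula is not stated here.

```python
import math
import re
from collections import Counter


def tokenize(text, split_pattern=r"\W+"):
    """Lowercase the text and split it into tokens (assumed split pattern)."""
    return [tok for tok in re.split(split_pattern, text.lower()) if tok]


def tfidf(documents):
    """Return one {token: weight} dict per document.

    Assumed weighting: (count / document length) * log(N / document frequency).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Count in how many documents each token occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return vectors


docs = ["The cat sat", "The dog barked", "A cat and a dog"]
vectors = tfidf(docs)
```

In this sketch a token that occurs in fewer documents, such as "sat", receives a higher weight than an equally frequent token that occurs in more documents, such as "the".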
The operator delivers a pre-processing model which can be applied to new data sets to perform the same processing on this data. This is necessary for transforming scoring data sets in the same way as training data sets.
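The idea of a reusable pre-processing model can be sketched as follows. The class `TextVectorizerModel` and its `fit` and `apply` methods are hypothetical names chosen for illustration, and the weighting is an assumed simple TFIDF variant. The point is only that the vocabulary and document frequencies are learned once from the training data and then reused unchanged on scoring data.

```python
import math
import re
from collections import Counter


class TextVectorizerModel:
    """Hypothetical stand-in for a pre-processing model (illustration only)."""

    def __init__(self):
        self.idf = {}

    @staticmethod
    def _tokens(text):
        # Lowercase and split on non-word characters (assumed tokenization).
        return [t for t in re.split(r"\W+", text.lower()) if t]

    def fit(self, documents):
        """Learn the vocabulary and inverse document frequencies from training data."""
        n_docs = len(documents)
        doc_freq = Counter()
        for doc in documents:
            doc_freq.update(set(self._tokens(doc)))
        self.idf = {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}
        return self

    def apply(self, documents):
        """Vectorize new documents using only the columns learned in fit()."""
        rows = []
        for doc in documents:
            counts = Counter(self._tokens(doc))
            total = sum(counts.values())
            rows.append({
                tok: (counts[tok] / total if total else 0.0) * idf
                for tok, idf in sorted(self.idf.items())
            })
        return rows


model = TextVectorizerModel().fit(["good movie", "bad movie"])
scored = model.apply(["good good unknown"])
```

Tokens that were not seen during training, such as "unknown" here, do not create new columns, so scoring data always ends up with the same columns as the training data.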
Input
example set input
This input port expects a data set. At least one of the columns should contain free text. It can be either nominal or of type text.
Output
example set output
This output port provides the transformed data, i.e. the original data set with the extracted TFIDF columns, sentiment scores, or languages.
original
The original input data without any changes.
preprocessing model
The text processing model, which shows useful information about the extracted features and can be applied to new (scoring) data sets in order to perform the same transformations there.
Parameters
Add sentiment
Indicates if a column with the most likely sentiment (positive, negative, neutral) should be added for each of the processed text columns.
Add language
Indicates if a column with the most likely content language should be added for each of the processed text columns.
Keep original
Indicates if the original text attributes should be kept or removed.
Store training documents
Indicates if the documents used for building the word vectors should be stored as part of the model. This is useful for visualizations but will increase memory usage.
Store scoring documents
Indicates if the documents which are transformed during scoring should also be stored as part of the model. This is useful for visualizations but will increase memory usage, especially in continuous use.
Document class attribute
The name of the nominal attribute which should be used for deriving the document classes (typically the label attribute).
Token split
This regular expression is used to split the text into tokens. The default splits on word boundaries.
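The effect of the split expression can be illustrated with Python's `re.split`. The helper `split_tokens` and both patterns below are assumptions for illustration. The operator's actual default expression is not given here, and its regular expression dialect may differ from Python's.

```python
import re


def split_tokens(text, pattern):
    """Split text on the given regular expression and drop empty tokens."""
    return [tok for tok in re.split(pattern, text) if tok]


text = "state-of-the-art results, finally!"

# Splitting on runs of non-word characters separates hyphenated words
# and strips punctuation.
by_non_word = split_tokens(text, r"\W+")

# Splitting on whitespace only keeps hyphens and punctuation attached.
by_whitespace = split_tokens(text, r"\s+")
```

A stricter pattern yields more, shorter tokens and a looser one yields fewer, longer tokens, which directly changes which columns are generated.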
Apply pruning
Indicates if pruning should be applied to the resulting columns.
Max number of new columns
The maximum number of columns generated, after pruning, for each of the input text columns.
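Frequency-based pruning down to a maximum number of columns can be sketched as below. This is an assumption about one plausible strategy (keep the tokens that occur in the most documents), not the operator's documented behavior. The function `prune_by_frequency` is a hypothetical name, and the label-based relevance selection is not covered because its criterion is not specified here.

```python
import re
from collections import Counter


def prune_by_frequency(documents, max_columns):
    """Keep at most max_columns tokens, ranked by document frequency."""
    doc_freq = Counter()
    for doc in documents:
        # Count each token once per document (assumed tokenization).
        doc_freq.update({t for t in re.split(r"\W+", doc.lower()) if t})

    # Sort by descending frequency; break ties alphabetically for stability.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [tok for tok, _ in ranked[:max_columns]]


docs = ["great phone great battery", "great screen", "poor battery"]
kept = prune_by_frequency(docs, max_columns=2)
```

Here "battery" and "great" each occur in two documents and are kept, while the tokens that occur in only one document are dropped.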