Process Documents from Data
Synopsis
Generates word vectors from string attributes.
Input
word list
The word list port.
example set
The example set port.
Output
example set
The example set port.
word list
The word list port.
Parameters
Create word vector
If checked, the tokens of a document will be used to generate a vector numerically representing the document.
Vector creation
Select the scheme used to create the word vector.
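As a rough illustration of what a vector creation scheme does, the sketch below builds word vectors from a list of texts under four common weighting schemes. The scheme names, the whitespace tokenizer, and the exact formulas are assumptions made for this example; the operator itself may tokenize, weight, and normalize differently.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def word_vectors(texts, scheme="tf-idf"):
    """Return (vocabulary, vectors) for the given texts.

    The scheme names below are illustrative assumptions, not the
    operator's documented option values.
    """
    docs = [Counter(tokenize(t)) for t in texts]
    vocab = sorted(set().union(*docs))
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for d in docs if w in d) for w in vocab}

    vectors = []
    for counts in docs:
        total = sum(counts.values())
        row = []
        for w in vocab:
            c = counts[w]
            tf = c / total if total else 0.0
            if scheme == "term occurrences":
                row.append(float(c))
            elif scheme == "binary term occurrences":
                row.append(1.0 if c else 0.0)
            elif scheme == "term frequency":
                row.append(tf)
            elif scheme == "tf-idf":
                # Plain tf * log(N / df), without length normalization.
                row.append(tf * math.log(n_docs / df[w]))
            else:
                raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(row)
    return vocab, vectors
```

For example, word_vectors(["a b a", "b c"], scheme="term occurrences") yields one row per text and one column per distinct word. Under the tf-idf variant used here, a word that occurs in every document gets a weight of zero.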
Add meta information
If checked, available meta information about the text, such as filename and date, is added as attributes.
Keep text
If checked, the input text will be stored as a special String attribute with the role text.
Prune method
Specifies whether too frequent or too infrequent words should be ignored when building the word list, and how the frequencies are specified.
Prune below percent
Ignore words that appear in less than this percentage of all documents.
Prune above percent
Ignore words that appear in more than this percentage of all documents.
Prune below absolute
Ignore words that appear in fewer than this many documents.
Prune above absolute
Ignore words that appear in more than this many documents.
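The percentual and absolute pruning parameters can be pictured as filters on document frequency, that is, the number of documents a word appears in. The following is a minimal sketch under that reading; the function and argument names are hypothetical and do not correspond to the operator's internals.

```python
from collections import Counter


def document_frequencies(tokenized_docs):
    # Count, for each word, how many documents contain it at least once.
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df


def prune_word_list(tokenized_docs, below=None, above=None, percentual=False):
    """Return the sorted word list after pruning by document frequency.

    With percentual=False, `below` and `above` are document counts.
    With percentual=True, they are percentages of all documents.
    """
    n_docs = len(tokenized_docs)
    df = document_frequencies(tokenized_docs)
    kept = []
    for word, count in df.items():
        value = 100.0 * count / n_docs if percentual else count
        if below is not None and value < below:
            continue  # too infrequent
        if above is not None and value > above:
            continue  # too frequent
        kept.append(word)
    return sorted(kept)
```

Pruning in both directions is a common way to drop rare words, which are often typos, and very frequent words that carry little information about any single document.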
Prune below rank
Words are ordered by frequency, and words with a frequency lower than the frequency at the rank given by this percentage are pruned.
Prune above rank
Words are ordered by frequency, and words with a frequency higher than the frequency at the rank given by this percentage are pruned.
Datamanagement
Determines how the data is represented internally.
Select attributes and weights
If checked, you can select which text attributes are used and set their weights. Otherwise, all text attributes are used.
Specify weights
This parameter sets a weight for each attribute. Text from attributes with a higher weight will be more important during analysis.
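One plausible way to picture attribute weights is to scale each token's contribution by the weight of the attribute it came from. The sketch below assumes exactly that; how the operator actually applies the weights is not stated here, and the attribute names in the usage note are made up.

```python
from collections import Counter


def weighted_term_counts(example, weights):
    """Combine token counts from several text attributes.

    `example` maps attribute names to text; `weights` maps attribute
    names to numeric weights. Attributes without a weight are ignored.
    """
    combined = Counter()
    for attribute, weight in weights.items():
        text = example.get(attribute, "")
        for token in text.lower().split():
            # Each occurrence counts `weight` times instead of once.
            combined[token] += weight
    return dict(combined)
```

With an example such as {"title": "text mining", "body": "mining documents"} and weights {"title": 2.0, "body": 1.0}, a word in the title contributes twice as much as the same word in the body.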
Parallelize vector creation
Determines whether the execution of Vector Creation should be parallelized.
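Because each document's vector can be computed independently once the word list is fixed, vector creation lends itself to parallel execution. The sketch below shows the general idea with a thread pool; the function names and the worker count are assumptions for illustration and say nothing about how the operator schedules its work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def vectorize(text):
    # Hypothetical per-document step: count whitespace-separated tokens.
    return Counter(text.lower().split())


def create_vectors(texts, parallelize=True, workers=4):
    """Vectorize each text, optionally spreading the work over a pool."""
    if not parallelize:
        return [vectorize(t) for t in texts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input texts.
        return list(pool.map(vectorize, texts))
```

Both modes return the same vectors in the same order, so the setting only affects run time. In CPython, a process pool rather than a thread pool would be needed to see a real speedup for CPU-bound tokenization.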