Tokenize
Synopsis
Tokenizes a document.
Description
This operator splits the text of a document into a sequence of tokens. There are several ways to specify the splitting points. By default, every non-letter character is used as a separator, so each token consists of a single word. This is the most appropriate option before building the word vector.
If you are going to build windows of tokens or something similar, you will probably want to split the text into complete sentences instead. This is possible by setting the split mode to specify characters and entering all splitting characters.
The third option lets you define regular expressions and is the most flexible for very special cases.
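The first two splitting strategies can be illustrated outside the operator with a short Python sketch. It is not the operator's implementation. The function names are hypothetical, and dropping empty tokens and trimming surrounding whitespace are choices made for this sketch only.

```python
import re


def split_on_non_letters(text):
    """Treat every non-letter character as a separator (the default idea)."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
        elif current:
            # A non-letter ends the current word.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def split_on_characters(text, characters):
    """Split only at the given characters, e.g. '.' for sentences.

    Assumes `characters` is a non-empty string. Each character is escaped
    so that it is matched literally, not as regular expression syntax.
    """
    pattern = "[" + re.escape(characters) + "]"
    # Trimming whitespace and dropping empty pieces is a choice of this sketch.
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


if __name__ == "__main__":
    sample = "First sentence. Second one."
    print(split_on_non_letters(sample))
    print(split_on_characters(sample, "."))
```

With the sample text, the first function yields one token per word, while the second yields one token per sentence, which is the form needed for building windows of tokens.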
Input
document
The document port.
Output
document
The document port.
Parameters
Mode
This selects the tokenization mode. Depending on the mode, split points are chosen differently.
Characters
The incoming document is split into tokens at each of these characters. For example, enter '.' to split the text into sentences.
Expression
This regular expression defines the splitting points. The document is split wherever the expression matches.
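As a rough illustration of splitting by regular expression, the following Python sketch uses the standard `re` module. The function name is hypothetical, the example pattern is only an assumption about a typical use, and Python's regular expression syntax may differ in details from the syntax accepted by the operator.

```python
import re


def split_on_expression(text, expression):
    """Split `text` wherever `expression` matches and drop empty tokens."""
    return [token for token in re.split(expression, text) if token]


if __name__ == "__main__":
    # Split at runs of semicolons and whitespace (example pattern only).
    print(split_on_expression("alpha; beta;gamma delta", r"[;\s]+"))
```

Because the whole match is treated as the separator, a pattern such as `[;\s]+` removes a run of separator characters in one step instead of producing empty tokens between them.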
Language
The language used by the part-of-speech (POS) tagger.
Max token length
The maximum length of a token.