Tokenize

Synopsis

Tokenizes a document.

Description

This operator splits the text of a document into a sequence of tokens. There are several options for specifying the split points. By default, every non-letter character is used as a separator. This results in tokens that each consist of a single word, which is the most appropriate option before building a word vector.

Alternatively, if you intend to build windows of tokens or something similar, you will probably want to split the text into complete sentences. This is possible by setting the split mode to specify characters and entering all splitting characters.

The third option lets you define regular expressions and is the most flexible choice for very special cases. In the default mode, by contrast, each non-letter character is used as a separator, so each word in the text is represented by a single token.
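The default behaviour described above can be approximated outside the operator. The following is a minimal Python sketch, not the operator's actual implementation: the function name tokenize_non_letters is hypothetical, and the pattern treats everything that is not a Unicode word character, plus digits and underscores, as a separator, which is an assumption about what "non-letter" means.

```python
import re

def tokenize_non_letters(text):
    """Approximate the default mode: split at every run of non-letter characters."""
    # \W matches non-word characters; \d and _ are added so that digits and
    # underscores also act as separators (assumed meaning of "non-letter").
    tokens = re.split(r"[\W\d_]+", text)
    # Drop the empty strings that re.split yields at the text boundaries.
    return [t for t in tokens if t]

print(tokenize_non_letters("Hello, world! It's 2 late."))
```

Note that the apostrophe in "It's" also counts as a non-letter in this sketch, so the word is split into two tokens.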

Input

document

The document port.

Output

document

The document port.

Parameters

Mode

This selects the tokenization mode. Depending on the mode, split points are chosen differently.

Characters

The incoming document will be split into tokens at each of these characters. For example, enter a '.' to split the text into sentences.
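To illustrate splitting on a user-supplied set of characters, here is a small Python sketch. The function name tokenize_on_characters is hypothetical, and trimming the whitespace around each token is an assumption made for readability; the operator itself may keep it.

```python
import re

def tokenize_on_characters(text, characters):
    """Split text at every occurrence of any character in `characters`."""
    if not characters:
        # No split characters given: the whole text is a single token.
        return [text] if text else []
    # re.escape makes characters such as '.' or '?' literal inside the class.
    pattern = "[" + re.escape(characters) + "]"
    # Trimming whitespace and dropping empty pieces is an assumption.
    return [t.strip() for t in re.split(pattern, text) if t.strip()]

print(tokenize_on_characters("First sentence. Second one. Third", "."))
```

Passing several characters, for example ".!?", splits at any one of them.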

Expression

This regular expression defines the splitting points.
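The idea of a regular expression marking the split points can be sketched in Python as follows. This is an illustration only: the function name tokenize_on_expression is hypothetical, and Python's re syntax is assumed here, which may differ in detail from the regular expression dialect the operator accepts.

```python
import re

def tokenize_on_expression(text, expression):
    """Split text wherever `expression` matches; the matches are discarded."""
    # Filtering removes empty strings produced by adjacent or boundary matches.
    return [t for t in re.split(expression, text) if t]

# Split on commas or semicolons together with any surrounding whitespace.
print(tokenize_on_expression("a; b,c ;d", r"\s*[;,]\s*"))
```

Because the expression describes the separators rather than the tokens, a pattern such as \s+ yields whitespace-delimited tokens.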

Language

The language used by the part-of-speech (POS) tagger.

Max token length

The maximum length of a token.