Process Documents from Files

Synopsis

Generates word vectors from a text collection stored in multiple files.

Input

word list

The word list port.

Output

example set

The example set port.

word list

The word list port.

Parameters

Text directories

A list of directories to read from; any number of directories can be specified. All files matching the given file ending are loaded and assigned the class value provided with their directory.
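As a rough illustration of the idea (not the operator's actual implementation), the sketch below walks a mapping of class values to directories and labels each loaded file with the class value of its directory. The function name `load_labeled_documents`, the `suffix` argument, and the UTF-8 encoding are assumptions made for this example.

```python
from pathlib import Path


def load_labeled_documents(text_directories, suffix=".txt"):
    """Return (class_value, filename, text) triples.

    text_directories is a hypothetical mapping of class value -> directory.
    Only regular files whose name ends with `suffix` are loaded.
    """
    documents = []
    for class_value, directory in text_directories.items():
        # Sort for a reproducible order; the real operator's order is not documented here.
        for path in sorted(Path(directory).iterdir()):
            if path.is_file() and path.name.endswith(suffix):
                text = path.read_text(encoding="utf-8")  # encoding is an assumption
                documents.append((class_value, path.name, text))
    return documents
```

Each directory therefore acts as one class: every document found in it receives the same label.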

File pattern

A pattern for the files to be read. The usual wildcards ? and * are supported.
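To show how such wildcards typically behave, here is a small sketch using Python's standard `fnmatch` module, where ? matches exactly one character and * matches any run of characters. It is assumed, not confirmed, that the operator's matching follows the same conventions; the file names are made up.

```python
from fnmatch import fnmatchcase


def select_files(filenames, pattern):
    """Keep the file names that match a wildcard pattern (case-sensitive)."""
    return [name for name in filenames if fnmatchcase(name, pattern)]


# Hypothetical file names, for illustration only.
names = ["report1.txt", "report12.txt", "notes.html"]
single_char = select_files(names, "report?.txt")  # ? stands for one character
any_text = select_files(names, "*.txt")           # * stands for any run of characters
```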

Extract text only

If checked, structural information such as XML or HTML tags is ignored and discarded.
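A minimal sketch of what discarding tags can look like, using Python's standard `html.parser`. This only illustrates the concept; the operator's own extraction logic is not described here, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the character data between tags and drops the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    # Join the pieces and collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())
```

Note that this simple version also keeps the contents of script and style elements, which a production extractor would likely filter out.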

Use file extension as type

If checked, the type of each file is determined by its extension. Files with unknown extensions are treated as text files.
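The fallback rule can be sketched as a lookup with a default. The extension table below is purely hypothetical; the set of types the operator actually recognizes is not listed in this document.

```python
from pathlib import Path

# Hypothetical extension table, for illustration only.
KNOWN_TYPES = {
    ".txt": "txt",
    ".html": "html",
    ".htm": "html",
    ".xml": "xml",
    ".pdf": "pdf",
}


def content_type_for(filename):
    """Map a file name to a content type; unknown extensions fall back to text."""
    extension = Path(filename).suffix.lower()
    return KNOWN_TYPES.get(extension, "txt")
```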

Content type

The content type of the input texts.

Encoding

The encoding used for reading or writing files.

Create word vector

If checked, the tokens of a document will be used to generate a vector numerically representing the document.
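As a sketch of the general idea, the example below turns tokenized documents into count vectors over a shared word list. The whitespace tokenizer and the use of raw occurrence counts are assumptions for illustration; the operator's tokenization and weighting depend on its configuration.

```python
from collections import Counter


def tokenize(text):
    """Very simple tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_word_vectors(documents):
    """Return (word_list, vectors), one count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # The word list fixes the order of the vector components.
    word_list = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in word_list])
    return word_list, vectors
```

Every document is mapped to a vector of the same length, so the result can be stored as one row per document with one attribute per word.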

Vector creation

Select the schema for creating the word vector.
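One widely used weighting scheme is TF-IDF. The sketch below uses a common textbook variant (relative term frequency multiplied by the logarithm of the inverse document frequency); this document does not state which schemas the operator offers or how it computes them, so the formula here is an assumption.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return (word_list, vectors) using tf = count / doc length, idf = ln(N / df)."""
    n_docs = len(tokenized_docs)
    word_list = sorted({token for doc in tokenized_docs for token in doc})

    # Document frequency: number of documents that contain each word.
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[word] / total) * math.log(n_docs / doc_freq[word])
            for word in word_list
        ])
    return word_list, vectors
```

With this variant, a word that occurs in every document gets a weight of zero, because it does not help to tell documents apart.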

Add meta information

If checked, available meta information about the text, such as file name and date, is added as attributes.

Keep text

If checked, the input text will be stored as a special String attribute with the role text.

Prune method

Specifies whether too frequent or too infrequent words should be ignored when building the word list, and how the frequencies are specified.

Prune below percent

Ignore words that appear in less than this percentage of all documents.

Prune above percent

Ignore words that appear in more than this percentage of all documents.

Prune below absolute

Ignore words that appear in fewer than this many documents.

Prune above absolute

Ignore words that appear in more than this many documents.
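The percentage and absolute pruning rules can both be sketched on top of document frequencies, that is, the number of documents in which a word appears. The function names are hypothetical, and treating the thresholds as inclusive bounds is an assumption read from the wording "less than" and "more than".

```python
from collections import Counter


def document_frequencies(tokenized_docs):
    """Count, for each word, the number of documents that contain it."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # set(): count each word once per document
    return doc_freq


def prune_absolute(tokenized_docs, below, above):
    """Keep words found in at least `below` and at most `above` documents."""
    doc_freq = document_frequencies(tokenized_docs)
    return sorted(w for w, count in doc_freq.items() if below <= count <= above)


def prune_percent(tokenized_docs, below, above):
    """Keep words found in at least `below` and at most `above` percent of documents."""
    n_docs = len(tokenized_docs)
    doc_freq = document_frequencies(tokenized_docs)
    return sorted(
        w for w, count in doc_freq.items()
        if below <= 100.0 * count / n_docs <= above
    )
```

For four documents in which "the" occurs in all four, "cat" in two, and "dog" and "bird" in one each, `prune_absolute(docs, 2, 3)` and `prune_percent(docs, 30, 90)` would both keep only "cat".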

Prune below rank

Words are ordered by frequency, and words with a frequency lower than that of the word at the rank given by this percentage are pruned.

Prune above rank

Words are ordered by frequency, and words with a frequency higher than that of the word at the rank given by this percentage are pruned.

Datamanagement

Determines how the data is represented internally.

Parallelize vector creation

Determines whether the vector creation should be executed in parallel.
