Process Documents from Mail Store

Synopsis

Generates word vectors from a text collection stored on an IMAP or POP3 mail server.

Input

word list

The word list port.

Output

example set

The example set port.

word list

The word list port.

Parameters

Create word vector

If checked, the tokens of a document will be used to generate a vector numerically representing the document.

Vector creation

Select the schema for creating the word vector.
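The following is a minimal sketch of what a vector creation schema typically computes. The schema names used here (term occurrences, binary term occurrences, term frequency, tf-idf) and the function `word_vectors` are assumptions for illustration; the formulas are the common textbook definitions, and the operator's actual weighting and normalization may differ.

```python
import math
from collections import Counter

def word_vectors(docs, schema="tf-idf"):
    """Turn tokenized documents into numeric vectors.

    docs is a list of token lists. Returns (vocabulary, vectors).
    The schema names are hypothetical labels, not the operator's API.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        row = []
        for word in vocab:
            c = counts[word]
            if schema == "term occurrences":
                value = float(c)
            elif schema == "binary term occurrences":
                value = 1.0 if c else 0.0
            elif schema == "term frequency":
                value = c / total
            elif schema == "tf-idf":
                value = (c / total) * math.log(n_docs / df[word])
            else:
                raise ValueError(f"unknown schema: {schema}")
            row.append(value)
        vectors.append(row)
    return vocab, vectors

docs = [["a", "b", "a"], ["b", "c"]]
vocab, vectors = word_vectors(docs, schema="term occurrences")
```

A word that occurs in every document gets a tf-idf weight of zero under this definition, which is why tf-idf tends to suppress very common words.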

Add meta information

If checked, available meta information about the text, such as file name and date, is added as attributes.

Keep text

If checked, the input text will be stored as a special String attribute with the role text.

Prune method

Specifies whether too frequent or too infrequent words should be ignored when building the word list, and how the frequency thresholds are specified.

Prune below percent

Ignore words that appear in less than this percentage of all documents.

Prune above percent

Ignore words that appear in more than this percentage of all documents.

Prune below absolute

Ignore words that appear in fewer than this many documents.

Prune above absolute

Ignore words that appear in more than this many documents.
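As a rough illustration of the percent and absolute thresholds, the sketch below prunes a vocabulary by document frequency. The function `prune_vocabulary` and its parameters are hypothetical and only mirror the descriptions above; the handling of words that sit exactly on a threshold is an assumption.

```python
from collections import Counter

def prune_vocabulary(docs, below=None, above=None, percent=False):
    """Return the words kept after pruning by document frequency.

    docs is a list of token lists. With percent=True the thresholds are
    percentages of all documents, otherwise absolute document counts.
    Words exactly on a threshold are kept (an assumption).
    """
    n_docs = len(docs)
    # Count each word once per document, not once per occurrence.
    df = Counter(tok for doc in docs for tok in set(doc))
    kept = []
    for word, count in df.items():
        value = 100.0 * count / n_docs if percent else count
        if below is not None and value < below:
            continue  # too infrequent
        if above is not None and value > above:
            continue  # too frequent
        kept.append(word)
    return sorted(kept)

docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"], ["the"]]
kept = prune_vocabulary(docs, below=30, above=90, percent=True)
```

In this example "the" appears in all four documents and "dog" and "sat" in only one each, so only "cat" (two of four documents, 50 percent) survives thresholds of 30 and 90 percent.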

Prune below rank

Words are ordered by frequency; words whose frequency is lower than the frequency at the rank given by this percentage are pruned.

Prune above rank

Words are ordered by frequency; words whose frequency is higher than the frequency at the rank given by this percentage are pruned.

Datamanagement

Determines how the data is represented internally.

Define store

The mail store connection can be defined either by using a session bound to a JNDI name or explicitly by specifying host and user.

Jndi name

JNDI name referencing a mail session.

Host

IMAP or POP3 host name

User

IMAP or POP3 user name

Password

IMAP or POP3 password

Connection properties

Additional properties for the mail store.

Protocol

IMAP or POP3

Only unseen

If checked, only new unseen messages will be processed.

Mark seen

If checked, all processed messages will be marked as read. This only works with IMAP, not with POP3.

Delete messages

If checked, all processed messages will be deleted. Especially useful for POP3.

Recursive

If checked, subfolders are scanned recursively.

Folder

Name of the IMAP or POP3 folder to scan.
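To make the interaction of the folder, unseen, seen, and delete options more concrete, here is a minimal sketch using Python's standard `imaplib`. It is not how the operator is implemented; the function names and option names are hypothetical, and `conn` is assumed to be an already logged-in `imaplib.IMAP4` or `imaplib.IMAP4_SSL` connection.

```python
import imaplib  # conn below is expected to be an imaplib.IMAP4 instance

def search_criterion(only_unseen):
    """IMAP search key: only messages without the \\Seen flag, or all."""
    return "UNSEEN" if only_unseen else "ALL"

def fetch_messages(conn, folder="INBOX", only_unseen=True,
                   mark_seen=False, delete=False):
    """Return the raw bytes of each matching message in one folder."""
    conn.select(folder)  # opened read-write so that flags can change
    _, data = conn.search(None, search_criterion(only_unseen))
    # BODY.PEEK[] reads a message without setting \Seen;
    # RFC822 sets \Seen as a side effect on a read-write mailbox.
    parts = "(RFC822)" if mark_seen else "(BODY.PEEK[])"
    messages = []
    for num in data[0].split():
        _, msg_data = conn.fetch(num, parts)
        messages.append(msg_data[0][1])
        if delete:
            conn.store(num, "+FLAGS", "\\Deleted")
    if delete:
        conn.expunge()  # permanently remove messages flagged \Deleted
    return messages
```

POP3 has no message flags, which is consistent with the note that marking messages as seen only works with IMAP; with POP3, deleting processed messages is the usual way to avoid handling them twice.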

Download attachments

Select to download mails and attachments.

Attachment file-pattern

A pattern for the attachment you want to select. Usual wildcards like ? and * are supported.

Attachment mime-type

Type in the MIME type you want to select. If this label and all additional labels are empty, all MIME types are selected.
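The attachment filters can be pictured with a short sketch based on Python's standard `email` and `fnmatch` modules. The function `select_attachments` is hypothetical; case-sensitive matching and the rule that an empty MIME type list selects everything are assumptions drawn from the descriptions above, not confirmed operator behavior.

```python
from email.message import EmailMessage
from fnmatch import fnmatchcase

def select_attachments(message, file_pattern="*", mime_types=()):
    """Return (filename, MIME type) pairs for the matching attachments.

    file_pattern supports the wildcards ? and *. An empty mime_types
    sequence selects all MIME types.
    """
    selected = []
    for part in message.walk():
        filename = part.get_filename()
        if filename is None:
            continue  # body parts and containers carry no file name
        if not fnmatchcase(filename, file_pattern):
            continue
        if mime_types and part.get_content_type() not in mime_types:
            continue
        selected.append((filename, part.get_content_type()))
    return selected

# Build a small message with two attachments to try the filter on.
msg = EmailMessage()
msg.set_content("See attached.")
msg.add_attachment(b"%PDF-", maintype="application", subtype="pdf",
                   filename="report.pdf")
msg.add_attachment("some notes", subtype="plain", filename="notes.txt")

pdf_only = select_attachments(msg, file_pattern="*.pdf")
```

Here `?` matches exactly one character and `*` matches any run of characters, so `*.pdf` selects `report.pdf` and `?????.txt` selects `notes.txt`.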

Parallelize vector creation

Determines whether the execution of Vector Creation should be parallelized.