Extract Content

Synopsis

Extracts content from an HTML document.

Description

This operator extracts textual content from a given HTML document and returns the extracted text blocks as documents. Only text blocks containing at least a given minimum number of words are extracted, which prevents single words (e.g. in navigation bars) from being kept.

Input

document

The HTML document from which content is extracted.

Output

document

The extracted text blocks, returned as documents.

Parameters

Extract content

Specifies whether content is extracted or not.

Minimum text block length

The minimum length (in words/tokens) of text blocks.
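
The effect of this parameter can be illustrated with a minimal Python sketch. It is not the operator's actual implementation: the class `BlockCollector`, the function `extract_blocks`, and the default of 5 words are hypothetical names and values chosen for illustration, and every tag is assumed to act as a block divider.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collects text and starts a new block at every tag (assumed behaviour)."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def _flush(self):
        # Normalise whitespace and store the block if it is not empty.
        text = " ".join(" ".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_blocks(html, min_words=5):
    """Return only the text blocks with at least min_words words."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [block for block in parser.blocks if len(block.split()) >= min_words]
```

With a minimum of 5 words, short navigation entries such as "Home" or "About" are dropped, while full sentences are kept as separate blocks.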

Override content type information

Specifies whether potentially existing content type information and used encoding information should be overridden using the HTML meta http-equiv tag.
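
As a rough illustration of what such an override could look like, the following Python sketch reads the charset from a meta http-equiv tag and prefers it over a previously declared encoding. The function `decode_html`, its parameters, and the regular expression are assumptions made for this example and do not describe how the operator itself is implemented.

```python
import re

# Matches e.g. <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
META_CHARSET = re.compile(
    rb'<meta[^>]+http-equiv=["\']?content-type["\']?[^>]*'
    rb'content=["\'][^"\'>]*charset=([\w-]+)',
    re.IGNORECASE,
)


def decode_html(raw: bytes, declared: str = "utf-8", override: bool = True) -> str:
    """Decode raw HTML, optionally letting the meta tag override `declared`."""
    encoding = declared
    if override:
        # Only the start of the document is searched for the meta tag.
        match = META_CHARSET.search(raw[:4096])
        if match:
            encoding = match.group(1).decode("ascii")
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # Unknown encoding name in the meta tag: fall back to the declared one.
        return raw.decode(declared, errors="replace")
```

If the override is disabled, or no meta tag is found, the previously declared encoding is used unchanged.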

Neglegt span tags

Specifies whether <span> tags should be neglected or used as text block dividers.

Neglect p tags

Specifies whether <p> tags should be neglected or used as text block dividers.

Neglect b tags

Specifies whether <b> tags should be neglected or used as text block dividers.

Neglect i tags

Specifies whether <i> tags should be neglected or used as text block dividers.

Neglect br tags

Specifies whether <br> tags should be neglected or used as text block dividers.
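
The difference between neglecting a tag and using it as a text block divider can be sketched in Python as follows. This is an illustrative assumption about the behaviour, not the operator's own code; `TagAwareSplitter` and `split_blocks` are hypothetical names.

```python
from html.parser import HTMLParser


class TagAwareSplitter(HTMLParser):
    """Splits text into blocks at every tag that is not in `neglected`."""

    def __init__(self, neglected=()):
        super().__init__()
        self.neglected = {tag.lower() for tag in neglected}
        self.blocks = [[]]

    def _divide(self, tag):
        # A neglected tag is ignored, so the text around it stays in one block.
        if tag not in self.neglected and self.blocks[-1]:
            self.blocks.append([])

    def handle_starttag(self, tag, attrs):
        self._divide(tag)

    def handle_endtag(self, tag):
        self._divide(tag)

    def handle_data(self, data):
        self.blocks[-1].append(data)


def split_blocks(html, neglected=()):
    """Return the non-empty text blocks of `html` with normalised whitespace."""
    splitter = TagAwareSplitter(neglected)
    splitter.feed(html)
    splitter.close()
    joined = (" ".join("".join(parts).split()) for parts in splitter.blocks)
    return [text for text in joined if text]
```

For the input `<p>Text mining is <b>really</b> useful.</p>`, neglecting `b` keeps the sentence together as one block, whereas treating `b` as a divider splits it into three short blocks. Combined with a minimum text block length, such short fragments would likely be discarded, which is why inline tags are often worth neglecting.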

Ignore non html tags

Specifies whether tags that are not common HTML should be ignored.
