Extract Content

Synopsis

Extracts content from an HTML document.

Description

This operator extracts textual content from a given HTML document and returns the extracted text blocks as documents. Only text blocks containing at least a given number of words are extracted, which prevents single words (e.g. in navigation bars) from being kept.
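
The filtering idea described above can be sketched in a few lines of Python. This is only an illustration, not the operator's actual implementation: the function name `extract_blocks`, the parameter `min_words`, and the crude rule that every tag ends a text block are all assumptions made for the example.

```python
import re


def extract_blocks(html, min_words=5):
    """Return text blocks that contain at least `min_words` words.

    Assumption for this sketch: every tag acts as a block divider.
    """
    # Split the markup at each tag; the pieces in between are candidate blocks.
    candidates = re.split(r"<[^>]+>", html)
    # Collapse runs of whitespace inside each candidate block.
    blocks = [" ".join(c.split()) for c in candidates]
    # Drop empty pieces and blocks that are too short (e.g. navigation entries).
    return [b for b in blocks if b and len(b.split()) >= min_words]


page = (
    "<ul><li>Home</li><li>About</li></ul>"
    "<p>This paragraph has more than five words in it.</p>"
)
print(extract_blocks(page, min_words=5))
```

With a threshold of five words, the one-word navigation entries are dropped and only the paragraph text is returned.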

Input

document

The document port.

Output

document

The document port.

Parameters

Extract content

Specifies whether content is extracted or not.

Minimum text block length

The minimum length (in words/tokens) of text blocks.

Override content type information

Specifies whether any existing content type and encoding information should be overridden using the HTML meta http-equiv tag.
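
As a rough illustration of what such an override could look like, the following Python sketch reads the charset from a `<meta http-equiv="Content-Type">` tag and prefers it over an externally declared encoding when an override flag is set. The names `MetaCharsetFinder`, `resolve_encoding`, `declared` and `override`, as well as the `utf-8` fallback, are hypothetical and not taken from the operator itself.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remembers the charset of the first meta http-equiv Content-Type tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "content-type":
            return
        # The content attribute looks like "text/html; charset=ISO-8859-1".
        for part in attr.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                self.charset = value.strip()


def resolve_encoding(raw_bytes, declared=None, override=False):
    """Pick an encoding name; `declared` stands for previously known information."""
    finder = MetaCharsetFinder()
    # latin-1 maps every byte to a character, so it is safe for scanning markup.
    finder.feed(raw_bytes.decode("latin-1"))
    if override and finder.charset:
        return finder.charset
    # Assumed fallback for this sketch when nothing else is known.
    return declared or "utf-8"


raw = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=ISO-8859-1"></head><body>x</body></html>'
)
print(resolve_encoding(raw, declared="utf-8", override=True))
```

Without the override flag, the declared encoding is kept even if the meta tag names a different one.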

Neglect span tags

Specifies whether <span> tags should be neglected or used as a text block divider.

Neglect p tags

Specifies whether <p> tags should be neglected or used as a text block divider.

Neglect b tags

Specifies whether <b> tags should be neglected or used as a text block divider.

Neglect i tags

Specifies whether <i> tags should be neglected or used as a text block divider.

Neglect br tags

Specifies whether <br> tags should be neglected or used as a text block divider.
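
The difference between neglecting a tag and using it as a divider can be illustrated with a small Python sketch. It is not the operator's code: `TagAwareSplitter`, `split_blocks` and the `neglected` argument are hypothetical names, and the sketch assumes that every tag that is not neglected ends the current text block.

```python
from html.parser import HTMLParser


class TagAwareSplitter(HTMLParser):
    """Splits text into blocks at every tag that is not in `neglected`."""

    def __init__(self, neglected):
        super().__init__()
        self.neglected = set(neglected)
        self.blocks = []
        self._current = []

    def _flush(self):
        # Join the collected text, normalise whitespace, and store it if non-empty.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.neglected:
            self._flush()

    def handle_endtag(self, tag):
        if tag not in self.neglected:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def split_blocks(html, neglected=()):
    splitter = TagAwareSplitter(neglected)
    splitter.feed(html)
    splitter.close()
    return splitter.blocks


snippet = "<p>A <b>bold</b> claim<br>and more</p>"
print(split_blocks(snippet, neglected={"b"}))
print(split_blocks(snippet))
```

When `b` is neglected, the bold word stays inside its surrounding sentence. When it is used as a divider, the sentence is broken into several short blocks, which a minimum text block length would then be likely to discard.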

Ignore non html tags

Specifies whether tags that are not common HTML should be ignored.