Extract Content
Synopsis
Extracts content from an HTML document.
Description
This operator extracts textual content from a given HTML document and returns the extracted text blocks as documents. Only text blocks containing at least a given number of words are extracted, which prevents single words (e.g. in navigation bars) from being kept.
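The filtering idea described above can be illustrated with a minimal Python sketch. This is not the operator's actual implementation: it assumes that every tag acts as a block divider, that words are whitespace-separated tokens, and the names `BlockCollector`, `extract_blocks` and `min_words` are hypothetical.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collects text blocks, starting a new block at every tag (assumption)."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def flush(self):
        # Join buffered text, collapse whitespace, and store non-empty blocks.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        self.flush()

    def handle_endtag(self, tag):
        self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def extract_blocks(html, min_words=5):
    """Return text blocks that contain at least `min_words` words."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()  # keep any trailing text after the last tag
    return [b for b in collector.blocks if len(b.split()) >= min_words]


page = (
    "<ul><li>Home</li><li>About</li></ul>"
    "<p>This paragraph has more than five words in it.</p>"
)
print(extract_blocks(page, min_words=5))
# ['This paragraph has more than five words in it.']
```

In this sketch the one-word navigation entries are dropped because they fall below the minimum length, while the paragraph is kept.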
Input
document
The document port.
Output
document
The document port.
Parameters
Extract content
Specifies whether content is extracted or not.
Minimum text block length
The minimum length (in words/tokens) of text blocks.
Override content type information
Specifies whether any existing content type and encoding information should be overridden with the information from the HTML meta http-equiv tag.
Neglect span tags
Specifies whether <span> tags should be neglected or used as text block dividers.
Neglect p tags
Specifies whether <p> tags should be neglected or used as text block dividers.
Neglect b tags
Specifies whether <b> tags should be neglected or used as text block dividers.
Neglect i tags
Specifies whether <i> tags should be neglected or used as text block dividers.
Neglect br tags
Specifies whether <br> tags should be neglected or used as text block dividers.
Ignore non html tags
Specifies whether tags that are not common HTML should be ignored.
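The effect of neglecting a tag versus using it as a text block divider can be sketched in Python as follows. This is only an illustration under stated assumptions, not the operator's own code: the names `DividerAwareCollector`, `split_blocks` and `neglected_tags` are hypothetical, and every tag that is not neglected is assumed to divide text blocks.

```python
from html.parser import HTMLParser


class DividerAwareCollector(HTMLParser):
    """Splits text into blocks at every tag that is not neglected (assumption)."""

    def __init__(self, neglected_tags=()):
        super().__init__()
        self.neglected = {t.lower() for t in neglected_tags}
        self.blocks = []
        self._buffer = []

    def flush(self):
        # Join buffered text, collapse whitespace, and store non-empty blocks.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # A neglected tag is skipped, so the text around it stays in one block.
        if tag not in self.neglected:
            self.flush()

    def handle_endtag(self, tag):
        if tag not in self.neglected:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def split_blocks(html, neglected_tags=()):
    """Return the text blocks of `html`, ignoring the given tags as dividers."""
    collector = DividerAwareCollector(neglected_tags)
    collector.feed(html)
    collector.close()
    collector.flush()  # keep any trailing text after the last tag
    return collector.blocks


snippet = "<p>The <b>quick</b> brown fox jumps</p>"
print(split_blocks(snippet, neglected_tags=["b"]))
# ['The quick brown fox jumps']
print(split_blocks(snippet))
# ['The', 'quick', 'brown fox jumps']
```

With <b> neglected, the sentence stays a single five-word block; with <b> used as a divider, it breaks into three short blocks that a minimum text block length of five would discard.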