Split File by Content

Synopsis

Segments documents based on regular expressions or XPath.

Description

This operator extracts segments from a set of text documents in a directory, using regular expressions, XPath, or simple string matching. It supports several formats, including XML, HTML, text, and PDF, although XPath works only on XML and HTML documents. Where possible, the written files have the same extension as the input files. PDF input, for example, is always converted to text files.

Input

through

The through port.

Output

through

The through port.

Parameters

Preview

Shows a preview of the results that the current configuration will produce.

Matching mode

Determines which mode is used to select the segments.

Xpath query

Specifies the XPath expression that matches against substrings of the content which should be treated as individual segments. Each match is treated as a single segment.

Namespaces

Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (x)html is bound automatically to the identifier h.
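To illustrate how an identifier/namespace pair is used in an XPath query, here is a minimal sketch in Python. It is not the operator's own implementation: it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, and the sample document and the `post` class name are invented for the example. The identifier `h` is bound to the XHTML namespace by hand, mirroring the automatic binding described above.

```python
import xml.etree.ElementTree as ET

# The XHTML namespace URI; "h" is the identifier used in the query below.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample document, invented for this illustration.
doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="post"><p>First</p></div>
    <div class="post"><p>Second</p></div>
  </body>
</html>"""

root = ET.fromstring(doc)

# Pairs of identifier and namespace, as in the Namespaces parameter.
namespaces = {"h": XHTML_NS}

# Each element matched by the query is treated as one segment.
matches = root.findall(".//h:div[@class='post']", namespaces)
texts = ["".join(m.itertext()).strip() for m in matches]
print(texts)
```

Without the `h:` prefix the query would match nothing, because every element in an XHTML document lives in the XHTML namespace.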

Ignore cdata

Specifies whether CDATA should be ignored when parsing HTML.

Assume html

If checked, a more tolerant XML parser is used that copes with invalid HTML constructions, but it always assumes HTML and adds missing tags. Uncheck this for plain XML.

Regular expression

Specifies the regular expression that matches against substrings of the content which should be treated as individual segments. Each match is treated as a single segment.

Segment expression

Specifies the expression used to replace each match of the regular expression above. Matching groups can be used, for example, to extract the content of an attribute without including the surrounding attribute syntax itself.
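The interplay between the regular expression and the segment expression can be sketched as follows. This is an illustration in Python, not the operator's code: the helper `split_by_regex` and the sample HTML are hypothetical, and the group reference is written `\1` as Python requires; the operator's own group-reference syntax may differ.

```python
import re

def split_by_regex(text, pattern, segment_expression):
    """Return one segment per regex match, rewritten by the segment expression."""
    # match.expand() substitutes group references such as \1 into the template.
    return [m.expand(segment_expression) for m in re.finditer(pattern, text)]

# Hypothetical input: keep only the attribute value, not the attribute itself.
html = '<a href="a.html">A</a> <a href="b.html">B</a>'
segments = split_by_regex(html, r'href="([^"]*)"', r"\1")
print(segments)  # ['a.html', 'b.html']
```

Each match becomes one segment, and the segment expression decides which part of the match is kept.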

Start string

Specifies the string used as the start point in string matching. The text between the start string and the end string, both exclusive, is treated as a segment.

End string

Specifies the string used as the end point in string matching. The text between the start string and the end string, both exclusive, is treated as a segment.
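String matching with exclusive delimiters can be sketched as below. The function `split_by_strings` is a hypothetical helper written for this illustration. How the operator itself handles a start string without a matching end string is not documented here, so this sketch simply stops at that point.

```python
def split_by_strings(text, start, end):
    """Return the text between each start/end pair, excluding both delimiters."""
    if not start or not end:
        raise ValueError("start and end strings must be non-empty")
    segments = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)           # skip the start string (exclusive)
        stop = text.find(end, begin)
        if stop == -1:
            break                     # assumption: unmatched start is ignored
        segments.append(text[begin:stop])
        pos = stop + len(end)         # continue searching after the end string
    return segments

sample = "BEGIN first END noise BEGIN second END"
parts = split_by_strings(sample, "BEGIN", "END")
print(parts)  # [' first ', ' second ']
```

Text outside a start/end pair, such as `noise` in the sample, does not appear in any segment.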

Json path query

Specifies the JSONPath expression that matches against substrings of the content which should be treated as individual segments. Each match is treated as a single segment.

Texts

The directory containing the documents to be segmented.

Output

The directory to which the segments are written.

Use file extension as type

If checked, the type of the files will be determined by their extensions. Unknown extensions will be treated as text files.
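The fallback rule for unknown extensions can be pictured with a small sketch. The extension table and the function `content_type_for` are assumptions made for this example. The operator's actual list of recognized extensions is not given here.

```python
from pathlib import Path

# Hypothetical mapping, based only on the formats named in the description.
KNOWN_TYPES = {
    ".xml": "xml",
    ".html": "html",
    ".htm": "html",
    ".pdf": "pdf",
    ".txt": "text",
}

def content_type_for(filename):
    """Pick a content type from the file extension, defaulting to text."""
    return KNOWN_TYPES.get(Path(filename).suffix.lower(), "text")

print(content_type_for("report.PDF"))  # pdf
print(content_type_for("notes.log"))   # text
```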

Content type

The content type of the input texts.

Encoding

The encoding used for reading and writing files.