Split File by Content
Synopsis
Segments documents based on regular expressions or XPath.
Description
This operator extracts segments from a set of text documents in a directory, using regular expressions, XPath, or simple string matching. It supports several formats, including XML, HTML, text, and PDF, although XPath works on XML and HTML documents only. Where possible, the written files have the same extension as the input files. PDF input, for example, is always written as text files.
Input
through
The through port.
Output
through
The through port.
Parameters
Preview
Shows a preview of the results that the current configuration will produce.
Matching mode
Determines which mode is used to select the segments.
Xpath query
Specifies the XPath expression that is matched against the content. Each match is treated as a single segment.
Namespaces
Specifies pairs of identifier and namespace for use in XPath queries. The (X)HTML namespace is bound automatically to the identifier h.
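To illustrate how an identifier/namespace pair is used in an XPath query, here is a minimal Python sketch. It is not the operator's implementation: the function name split_by_xpath is hypothetical, and Python's xml.etree.ElementTree supports only a limited subset of XPath. The identifier h is bound by hand to the XHTML namespace here.

```python
import xml.etree.ElementTree as ET

# Namespace URI of XHTML; "h" is the identifier used in the query below.
XHTML = "http://www.w3.org/1999/xhtml"


def split_by_xpath(xml_text, query, namespaces):
    """Return the text content of every element matched by the query."""
    root = ET.fromstring(xml_text)
    # Each matching element becomes one segment.
    return ["".join(el.itertext()) for el in root.findall(query, namespaces)]


doc = f'<html xmlns="{XHTML}"><body><p>one</p><p>two</p></body></html>'
segments = split_by_xpath(doc, ".//h:p", {"h": XHTML})
print(segments)
```

Without the namespace binding, the query .//p would match nothing in this document, because every element lives in the XHTML namespace.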
Ignore cdata
Specifies whether CDATA should be ignored when parsing HTML.
Assume html
If checked, a more tolerant XML parser is used that copes with forbidden HTML constructions, but it always assumes HTML and adds missing tags. Uncheck this for plain XML.
Regular expression
Specifies the regular expression that is matched against substrings of the content. Each match is treated as a single segment.
Segment expression
Specifies the expression that replaces each match found by the regular expression above. Matching groups can be used to select, for example, the content of an attribute without including the surrounding attribute itself.
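The interplay of regular expression and segment expression can be sketched in Python as follows. This is an illustration under assumptions, not the operator's code: split_by_regex is a hypothetical name, and the group reference is written in Python's \1 syntax, which may differ from the syntax the operator expects.

```python
import re


def split_by_regex(text, pattern, segment_expression):
    """Return one segment per regex match, rewritten by the segment expression."""
    # Match.expand substitutes group references such as \1 into the template.
    return [m.expand(segment_expression) for m in re.finditer(pattern, text)]


html = '<a href="a.html">A</a> <a href="b.html">B</a>'
# The pattern matches the whole opening tag; group 1 captures only the value.
segments = split_by_regex(html, r'<a href="([^"]*)">', r"\1")
print(segments)
```

Each segment here holds only the attribute value, because the segment expression keeps group 1 and drops the rest of the match.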
Start string
Specifies the string used as the start point in string matching. The text between the start string and the end string, both exclusive, is treated as a segment.
End string
Specifies the string used as the end point in string matching. The text between the start string and the end string, both exclusive, is treated as a segment.
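String matching with a start string and an end string can be sketched as below. The function name split_by_strings is hypothetical, and the handling of an unterminated start string (it is dropped) is an assumption of this sketch, not documented behavior of the operator.

```python
def split_by_strings(text, start, end):
    """Return the text between each start/end pair, both markers excluded."""
    if not start or not end:
        raise ValueError("start and end strings must be non-empty")
    segments = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)  # skip the start string itself
        finish = text.find(end, begin)
        if finish == -1:
            break  # assumption: an unterminated segment is dropped
        segments.append(text[begin:finish])
        pos = finish + len(end)  # continue after the end string
    return segments


print(split_by_strings("[a] noise [b]", "[", "]"))
```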
Json path query
Specifies the JSONPath expression that is matched against the content. Each match is treated as a single segment.
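As a rough illustration of one segment per JSONPath match, the sketch below evaluates a very small, hand-written subset of JSONPath ($, .key, [index], and [*]). It is an assumption-laden toy, not the operator's evaluator and not a complete JSONPath implementation. The name split_by_json_path is hypothetical.

```python
import json
import re


def split_by_json_path(json_text, path):
    """Evaluate a tiny JSONPath subset and return each match as JSON text."""
    if not path.startswith("$"):
        raise ValueError("path must start with $")
    # Each token is either (.key, "") or ("", index-or-*).
    tokens = re.findall(r"\.(\w+)|\[(\d+|\*)\]", path)
    nodes = [json.loads(json_text)]
    for key, index in tokens:
        next_nodes = []
        for node in nodes:
            if key:
                if isinstance(node, dict) and key in node:
                    next_nodes.append(node[key])
            elif index == "*":
                if isinstance(node, list):
                    next_nodes.extend(node)
                elif isinstance(node, dict):
                    next_nodes.extend(node.values())
            elif isinstance(node, list) and int(index) < len(node):
                next_nodes.append(node[int(index)])
        nodes = next_nodes
    # Serialize every matched node as its own segment.
    return [json.dumps(node) for node in nodes]


data = '{"items": [{"name": "a"}, {"name": "b"}]}'
print(split_by_json_path(data, "$.items[*]"))
```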
Texts
A directory containing the documents to be segmented.
Output
The directory to which the segments are written.
Use file extension as type
If checked, the type of the files will be determined by their extensions. Unknown extensions will be treated as text files.
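The extension-based lookup with a text fallback might look like the following sketch. The mapping table and the name content_type_for are assumptions made for illustration. Only the four formats named in the description (XML, HTML, text, PDF) and the fallback to text come from this page.

```python
from pathlib import Path

# Hypothetical mapping from file extension to content type.
KNOWN_TYPES = {
    ".xml": "xml",
    ".html": "html",
    ".htm": "html",
    ".txt": "text",
    ".pdf": "pdf",
}


def content_type_for(filename):
    """Pick a content type from the extension; unknown extensions become text."""
    return KNOWN_TYPES.get(Path(filename).suffix.lower(), "text")


for name in ["report.PDF", "page.html", "notes.log"]:
    print(name, content_type_for(name))
```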
Content type
The content type of the input texts.
Encoding
The encoding used for reading or writing files.