Cut Document

Synopsis

Cuts an input document into segments using regular expressions that specify the start and end of each segment.

Description

This operator segments a text based on a starting and ending regular expression.
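As a rough illustration of this kind of segmentation (not the operator's actual implementation), the sketch below cuts a string into segments with Python's `re` module. The function name `cut_document` is hypothetical, and it assumes that the matched start and end delimiters are kept in each segment; the operator itself may behave differently.

```python
import re

def cut_document(text, start_pattern, end_pattern):
    """Return every span that runs from a start match to the next end match."""
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)
    segments = []
    pos = 0
    while pos <= len(text):
        start = start_re.search(text, pos)
        if start is None:
            break
        end = end_re.search(text, start.end())
        if end is None:
            break
        # Assumption: the delimiters themselves are part of the segment.
        segments.append(text[start.start():end.end()])
        # Always move forward, even if both patterns matched the empty string.
        pos = end.end() if end.end() > pos else pos + 1
    return segments

doc = "<item>apple</item><item>pear</item>"
segments = cut_document(doc, r"<item>", r"</item>")
print(segments)
```

Each element of `segments` corresponds to one document in the output collection.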

Input

document

Output

documents

Description

Collection of the segmented document.

Parameters

Query type

Specifies the type of the query. The available query types are: String Matching, Regular Expression, Regular Region, Indexed, XPath and JSONPath.

String matching queries

Specifies a list of string matching start and end sequences. The text between the start and end sequence is used as the result. See the operator documentation for details on string matching.
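A minimal sketch of the idea in Python, using literal (non-regex) start and end sequences. The helper `between` and the attribute name `title` are made up for illustration, and the sketch assumes that only the first occurrence is of interest.

```python
def between(text, start, end):
    """Return the text between the first `start` and the following `end`."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return None
    return text[i:j]

# Hypothetical query list: attribute name -> (start sequence, end sequence)
queries = {"title": ("<title>", "</title>")}
doc = "<html><title>Quarterly report</title></html>"

result = {name: between(doc, s, e) for name, (s, e) in queries.items()}
print(result)
```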

Attribute type

Specifies the type of the resulting attributes. If Numerical or Binominal is chosen, ensure that the returned result can be interpreted as a value of that type. The available types are: Nominal, Numerical and Binominal.

Regular expression queries

Specifies a list of attribute names and their corresponding regular expressions. The first matching group is used as value. See the operator documentation for details on regular expressions.
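The "first matching group" rule can be sketched with Python's `re` module as follows. The function name, attribute names and sample text are hypothetical, and Python's regular expression syntax differs in some details from the one the operator uses.

```python
import re

def extract_first_group(text, queries):
    """Map each attribute name to the first capture group of its pattern."""
    result = {}
    for name, pattern in queries.items():
        match = re.search(pattern, text)
        # No match, or a pattern without any group, yields a missing value.
        if match is not None and match.re.groups >= 1:
            result[name] = match.group(1)
        else:
            result[name] = None
    return result

doc = "Order 4711 shipped on 2024-03-15"
values = extract_first_group(doc, {
    "order_id": r"Order (\d+)",
    "ship_date": r"on (\d{4}-\d{2}-\d{2})",
})
print(values)
```

Only the parenthesized part of each pattern ends up in the attribute; the surrounding text merely anchors the match.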

Regular region queries

Specifies a list of attribute names and their corresponding regular expressions. Two regular expressions can be specified to define the start and the end of a region. The text between the two matches is returned as the result.
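A region query can be pictured as follows. In this Python sketch the region runs from the end of the start match to the beginning of the next end match, so the delimiters themselves are excluded; that reading, the helper name and the sample data are assumptions.

```python
import re

def extract_region(text, start_pattern, end_pattern):
    """Return the text between a start match and the next end match."""
    start = re.search(start_pattern, text)
    if start is None:
        return None
    end = re.compile(end_pattern).search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()]

doc = "Name: Ada Lovelace; Born: 1815;"
regions = {
    "name": extract_region(doc, r"Name:\s*", r";"),
    "born": extract_region(doc, r"Born:\s*", r";"),
}
print(regions)
```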

Xpath queries

Specifies a list of attribute names and their corresponding XPath queries. See the operator documentation for details on XPath.

Namespaces

Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (x)html is bound automatically to the identifier h.
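To show how an identifier such as h relates to a namespace in an XPath query, here is a small sketch using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The sample document is made up, and the prefix mapping is passed explicitly because nothing is bound automatically in this library.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

doc = (
    f'<html xmlns="{XHTML_NS}">'
    "<head><title>Report</title></head>"
    "<body><p>First</p><p>Second</p></body>"
    "</html>"
)
root = ET.fromstring(doc)

# Identifier -> namespace pairs, analogous to the Namespaces parameter.
namespaces = {"h": XHTML_NS}

title = root.findtext("h:head/h:title", namespaces=namespaces)
paragraphs = [p.text for p in root.findall(".//h:p", namespaces)]
print(title, paragraphs)
```

Without the `h:` prefix these queries would find nothing, because every element of the document lives in the XHTML namespace.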

Ignore cdata

Indicates whether CDATA sections should be ignored when evaluating XPath expressions.

Assume html

If checked, a more tolerant XML parser is used. It copes with constructs that are not valid XML but occur in HTML, but it always assumes HTML and adds missing tags. Uncheck this for plain XML.

Index queries

Specifies a list of attribute names and their corresponding regions. Each region is specified by the offset index and the length of the match.
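In Python terms, an offset and length pair corresponds to a slice. The sketch below assumes zero-based offsets, which the text above does not state; the function name and the sample record are hypothetical.

```python
def extract_by_index(text, queries):
    """Map each attribute name to text[offset : offset + length]."""
    return {
        name: text[offset:offset + length]
        for name, (offset, length) in queries.items()
    }

record = "20240315INV0042"
fields = extract_by_index(record, {
    "date": (0, 8),      # assumed zero-based offset 0, length 8
    "invoice": (8, 7),   # offset 8, length 7
})
print(fields)
```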

Jsonpath queries

Specifies a list of attribute names and their corresponding JSONPath queries.
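The following sketch shows how such queries map attribute names to values. It is a hand-written resolver for a very small subset of JSONPath (the root `$`, dotted member names and numeric array indices), not a full JSONPath implementation; the function name and sample data are hypothetical.

```python
import json
import re

# One path step: either ".name" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")

def resolve(data, path):
    """Resolve a simple path such as $.store.books[0].title."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = data
    pos = 1
    while pos < len(path):
        step = _STEP.match(path, pos)
        if step is None:
            raise ValueError(f"unsupported path syntax at position {pos}")
        key, index = step.group(1), step.group(2)
        current = current[key] if key is not None else current[int(index)]
        pos = step.end()
    return current

doc = json.loads(
    '{"store": {"books": ['
    '{"title": "Dune", "price": 9.5}, '
    '{"title": "Emma", "price": 7.0}]}}'
)
queries = {
    "first_title": "$.store.books[0].title",
    "second_price": "$.store.books[1].price",
}
values = {name: resolve(doc, path) for name, path in queries.items()}
print(values)
```

Wildcards, filters and recursive descent are deliberately left out; a dedicated JSONPath library would be needed for those.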