Crawl Web

Synopsis

This operator crawls the web and stores the retrieved links and pages in an ExampleSet or on disk.

Description

The crawler starts at the specified starting URL, loads pages, and follows links as directed by the crawling rules. There are three types of rules, each applied in a different situation:

  • store_with_matching_url: If the regular expression matches the URL, this page will be stored in the resulting ExampleSet and on disk (if selected).
  • store_with_matching_content: If the page content contains the given term, this page will be stored in the resulting ExampleSet. Note that using this filter will slow down crawling considerably. Also note that this is NOT a regular expression but a simple contains filter.
  • follow_link_with_matching_url: If the regular expression matches the URL, the crawler will follow the link and load the URL.
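
The three rule types above could be sketched as follows; the function names are illustrative, not the operator's internal API. The point is that the two URL rules are regular expressions while the content rule is a plain substring check:

```python
import re

# Hypothetical illustration of the three crawling rule types
# (function names assumed, not the operator's actual API).

def matches_store_url(url, pattern):
    # store_with_matching_url: a regular expression applied to the URL
    return re.search(pattern, url) is not None

def matches_store_content(content, term):
    # store_with_matching_content: a simple contains filter, NOT a regex
    return term in content

def matches_follow_url(url, pattern):
    # follow_link_with_matching_url: a regular expression applied to the
    # URL of each link found on the current page
    return re.search(pattern, url) is not None

print(matches_store_url("https://example.com/blog/post-1", r"/blog/"))    # True
print(matches_store_content("<html>machine learning</html>", "learning")) # True
```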

To avoid crawling a potentially unlimited number of pages, the maximum number of pages and the maximum depth the crawler will retrieve can be specified with the max pages and max depth parameters. To speed up loading, the delay can be lowered, but please be friendly to web site owners and avoid causing high traffic on their sites; otherwise you may get blacklisted. Note that while crawling makes use of your available CPU cores (license limits apply), crawling speed is usually limited by your bandwidth, disk I/O (if applicable), the crawling delay, and the fact that this crawler is benign and queries the robots.txt for each page it visits.
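
How max pages, max depth, and the delay interact can be sketched as a breadth-first crawl loop. This is a minimal illustration, not the operator's implementation; fetch_page and extract_links are hypothetical stand-ins for real HTTP fetching and HTML link extraction, and a real crawler would also consult robots.txt per host (e.g. with urllib.robotparser) before each fetch:

```python
import time
from collections import deque

def crawl(start_url, fetch_page, extract_links,
          max_pages=100, max_depth=2, delay_ms=1000):
    queue = deque([(start_url, 0)])    # (url, depth) pairs, breadth-first
    seen = {start_url}
    stored = []
    while queue and len(stored) < max_pages:
        url, depth = queue.popleft()
        page = fetch_page(url)
        stored.append((url, page))
        if depth < max_depth:          # a depth of 1 = only direct links
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        time.sleep(delay_ms / 1000.0)  # be friendly to site owners
    return stored

# Toy link graph standing in for real pages:
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
result = crawl("a", lambda u: u, lambda p: graph[p],
               max_pages=10, max_depth=1, delay_ms=0)
print([url for url, _ in result])  # ['a', 'b', 'c']
```

With max_depth=1 the crawler stores the initial page and its direct links ("b" and "c") but never follows the link from "b" to "d".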

Leave the ignore robot exclusion parameter unchecked unless you are crawling your own sites. Some site owners forbid crawling of their content, and for legal reasons you may be bound to their wishes.

Output

Example Set

This output port delivers the crawling results as an example set.

Parameters

Url

The root page from which the crawler will start.

Crawling rules

Specifies a set of rules that determine which links to follow and which pages to process.

Retrieve as html

If selected, the actual HTML is returned instead of a textual representation.

Enable basic auth

If selected, all requests will send basic authentication information in their headers. Use this only when crawling HTTPS pages, since the credentials are otherwise transmitted effectively in plain text!
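
The warning above follows from how HTTP Basic authentication works: the credentials travel in every request, only base64-encoded, not encrypted. A standard-library sketch (the function name is illustrative, not part of the operator):

```python
import base64

def basic_auth_header(username, password):
    # Basic auth is just "username:password", base64-encoded -- anyone
    # who can read the plain-HTTP traffic can decode it trivially.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}

print(basic_auth_header("alice", "secret"))
# {'Authorization': 'Basic YWxpY2U6c2VjcmV0'}
```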

Username

Username for basic authentication.

Password

Password for basic authentication.

Add content as attribute

Specifies whether the pages' content should be added as a text attribute.

Write pages to disk

Specifies if the crawled pages should be saved as files.

Include binary content

If selected, the crawler will also consider binary content instead of only text pages. This can be useful, for example, to download all .pdf files from a web site by making use of the crawling rules parameter.

Output dir

Specifies the directory on disk into which the files are written if write pages to disk is selected.

Output file extension

Specifies the file extension of the stored files.

Max crawl depth

Specifies the maximal depth of the crawling process. A depth of 1 means 'only crawl direct links on the initial page'.

Max pages

The maximal number of pages to store.

Max page size

Specifies the maximum page size (in KB): pages larger than this limit are not downloaded.
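
One way such a limit could be enforced is via the Content-Length response header; this is a hedged sketch, not the operator's implementation (the header may be absent or wrong, so a real crawler would also cap the bytes actually read from the body):

```python
def within_size_limit(content_length_header, max_page_size_kb):
    # Hypothetical helper: decide from the Content-Length header (a
    # string, or None if absent) whether a page fits the KB limit.
    if content_length_header is None:
        return True  # size unknown up front: decide while streaming
    return int(content_length_header) <= max_page_size_kb * 1024

print(within_size_limit("204800", 1000))   # True  (200 KB <= 1000 KB)
print(within_size_limit("2097152", 1000))  # False (2048 KB > 1000 KB)
```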

Delay

Specifies the delay between page visits in milliseconds.

Max concurrent connections

Maximum number of HTTP connections used at the same time.

Max connections per host

Maximum number of simultaneous HTTP connections used to connect to a single host. Increasing this parameter can put a heavy load on a host, so please be careful!

User agent

The identity the crawler uses while accessing a server.

Ignore robot exclusion

Specifies whether the crawler should ignore the robot exclusion rules set by the crawled site. Enable this only for your own sites; otherwise you may end up violating laws!
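
The robot exclusion rules in question live in a site's robots.txt file. Python's standard library can evaluate them, which illustrates what the crawler checks before each visit (the rules and user agent below are made-up examples):

```python
import urllib.robotparser

# Parse an example robots.txt that blocks everything under /private/.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("my-crawler", "https://example.com/public/page"))   # True
print(rp.can_fetch("my-crawler", "https://example.com/private/page"))  # False
```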