Extract Text From Images

Synopsis

This operator extracts text from one or more images using the Tesseract OCR library.

Description

This operator creates a new ExampleSet containing the text extracted from the provided images together with the input image locations. If only one image is to be processed, look for the separate "extract text from image" operator in RapidMiner Studio. For multiple images, provide an ExampleSet that holds the image locations on disk in one Attribute. The "Read Image Meta Data" operator is an easy way to create an ExampleSet with image locations.
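For readers who think in code, the expected input can be pictured as a one-column table of file paths. The following is a minimal Python sketch of that shape only, not how RapidMiner builds an ExampleSet; the function name `collect_image_locations`, the column name `path`, and the list of file extensions are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; the formats the operator accepts are not stated here.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def collect_image_locations(folder):
    """Return one row per image file in `folder`, each with a single 'path' column."""
    rows = []
    # Sort the entries so the row order is deterministic.
    for entry in sorted(Path(folder).iterdir()):
        if entry.is_file() and entry.suffix.lower() in IMAGE_SUFFIXES:
            rows.append({"path": str(entry)})
    return rows
```

Each dictionary plays the role of one Example, and the `path` key plays the role of the Attribute that holds the image location.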

The operator uses version 4.1.1 of the Tesseract OCR library, which identifies text in images using a pre-trained LSTM model. Currently, the operator ships with an English language model only.
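To illustrate what this processing amounts to outside RapidMiner, here is a rough Python sketch that calls the `tesseract` command-line tool for each image location and returns a table of locations and extracted text. It is not the operator's implementation. It assumes a Tesseract 4 installation with English data is available on the PATH; the function names, the `path` and `text` column names, and the injectable `ocr` parameter are hypothetical.

```python
import subprocess


def tesseract_ocr(image_path):
    """Run the tesseract CLI (assumed to be on PATH) and return the recognized text."""
    result = subprocess.run(
        # "stdout" sends the text to standard output; "-l eng" selects English;
        # "--oem 1" selects the LSTM engine.
        ["tesseract", image_path, "stdout", "-l", "eng", "--oem", "1"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def extract_text(rows, path_column="path", ocr=tesseract_ocr):
    """Return new rows holding each image location and the text extracted from it."""
    return [
        {path_column: row[path_column], "text": ocr(row[path_column])}
        for row in rows
    ]
```

Passing the OCR step as a parameter keeps the table logic separate from the engine, so a stub such as `extract_text(rows, ocr=lambda p: "")` can stand in when Tesseract is not installed.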

Input

ExampleSet

This input port expects an ExampleSet containing an Attribute with image locations. Provide an ExampleSet when text is to be extracted from more than one image. When this port is connected, the Path Attribute selection parameter appears.

Output

ExampleSet (Data Table)

This port delivers a new ExampleSet containing the extracted text and the locations of the images it was extracted from.