Read Excel
Synopsis
This operator reads an ExampleSet from the specified Excel file.
Description
This operator can be used to load data from Microsoft Excel spreadsheets. This operator is able to read data from Excel 95, 97, 2000, XP, and 2003. The user has to define which of the spreadsheets in the workbook should be used as data table. The table must have a format such that each row is an example and each column represents an attribute. Please note that the first row of the Excel sheet might be used for attribute names which can be indicated by a parameter. The data table can be placed anywhere on the sheet and can contain arbitrary formatting instructions, empty rows and empty columns. Missing data values in Excel should be indicated by empty cells or by cells containing only "?".
For complete understanding of this operator read the parameters section. The easiest and shortest way to import an Excel file is to use the import_configuration_wizard from the Parameters panel. The best way, which may require some extra effort, is to first set all the parameters in the Parameters panel and then use the wizard. Please make sure that the Excel file is read correctly before building a process using it.
Input
file
An Excel file is expected as a file object which can be created with other operators with file output ports like the Open File (Deprecated) operator.
Output
output
This port delivers the Excel file in tabular form along with the meta data. This output is similar to the output of the Retrieve (Deprecated) operator.
Parameters
import configuration wizard
This option allows you to configure this operator by means of a wizard. This user-friendly wizard makes the use of this operator easy.
sheet selection
This option allows you to change the sheet selection between sheet number and sheet name.
sheet number
The number of the sheet which you want to import should be specified here.
sheet name
The name of the sheet which you want to import should be specified here.
imported cell range
The range of cells to be imported from the specified sheet is given here. It is specified in 'xm:yn' format where 'x' is the column of the first cell of range, 'm' is the row of the first cell of range, 'y' is the column of the last cell of range, 'n' is the row of the last cell of range. 'A1:E10' will select all cells of the first five columns from row 1 to 10.
use header row
If this parameter is set to true, the header_row parameter needs to specify the row with the names of the Attributes. If this is activated no annotations can be used.
header row
Sets the index of the row that should be used for Attribute names. The Attributes names are determined by the content of this row. As a consequence, data will only be read below this row. The parameter data_set_meta_data_information overwrites the data from the header row if it is set.
annotations
If the first row as names parameter is not set to true, annotations can be added using the 'Edit List' button of this parameter which opens a new menu. This menu allows you to select any row and assign an annotation to it. Name, Comment and Unit annotations can be assigned. If row 0 is assigned Name annotation, it is equivalent to setting the first row as names parameter to true. If you want to ignore any rows you can annotate them as Comment.
date format
The date and time format is specified here. Many predefined options exist; users can also specify a new format. If text in an Excel file column matches this date format, that column is automatically converted to date type. Some corrections are automatically made in the date type values. For example a value '32-March' will automatically be converted to '1-April'. Columns containing values which can't be interpreted as numbers will be interpreted as nominal, as long as they don't match the date and time pattern of the date format parameter. If they do, this column of the Excel file will be automatically parsed as date and the according attribute will be of date type.
time zone
This is an expert parameter. A list of time zones is provided; users can select any of them. Dates (from Excel date columns or text converted dates, see below) are read with the time zone specified in this parameter.
If text in an Excel file column matches the format provided in the date_format parameter, that column is automatically converted to date type (see date_format parameter info).
locale
This is an expert parameter. A long list of locales is provided; users can select any of them.
read all values as polynominal
This option allows you to disable the type handling for this operator. Every column will be read as a polynominal attribute. To parse an Excel date afterwards, use 'date_parse(86400000 * (parse(date_attribute) - 25569))' (- 24107 for Mac Excel 2007) in the Generate Columns operator.
data set meta data information
This option is an important one. It allows you to adjust the meta data of the ExampleSet created from the specified Excel file. Column index,name,type and role can be specified here. The Read Excel operator tries to determine an appropriate type of the attributes by reading the first few lines and checking the occurring values. If all values are integers, the attribute will become an integer. Similarly if all values are real numbers, the attribute will become of type real. Columns containing values which can't be interpreted as numbers will be interpreted as nominal, as long as they don't match the date and time pattern of the date_format parameter. If they do, this column of the Excel file will be automatically parsed as date and the according attribute will be of type date. Automatically determined types can be overridden using this parameter.
read not matching values as missings
If this value is set to true, values that do not match with the expected value type are considered as missing values and are replaced by '?'. For example if 'abc' is written in an integer column, it will be treated as a missing value. A question mark (?) or an empty cell in the Excel file is also read as a missing value.