The Catalog

Data in RapidMiner

Data is critical for any data science project: the work starts with data, and the results often include enriched data.

RapidMiner provides the catalog as an easily accessible shared repository for both:

  • uploaded data and
  • generated data.

Depending on the need, access to a data file can be restricted to a limited number of users or it can be shared by multiple projects.

Global view / project view

The catalog provides global access to your data, but in fact there are two access points:

  • Catalog
  • Projects

Within the catalog, you can see any data that you have uploaded or to which you have access.

Note, however, that all work is done within the scope of a project, and a project has more limited data access. Each project provides a Data tab, where you can see the data accessible to that particular project.

Upload your data

In both the catalog and the Data tab of any project, you can upload new data files using the Add Data button. In either case, the data lands in the catalog.

[Screenshot: Add Data]

Notice in the screenshot that the data set Churn-Go is owned by an individual user, whereas churn-sample is owned by a project. The reason is that churn-sample was generated from Churn-Go within a project called Churn9, and therefore Churn9 is the owner.

Supported formats

You can upload files of any data format to the catalog. Nevertheless, we distinguish between two different cases:

  • HDF5: the native data file format of RapidMiner. You can find your HDF5 data files in RapidMiner Studio in the folder Documents/RapidMiner, with the extension rmhdf5table.
  • Other: to use any other data format, such as CSV or Excel, you need to connect the input to the relevant operator in the workflow designer (e.g., Read CSV).

In practice, the difference is minor: it usually means one extra step when developing your workflow.
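The distinction between the two cases can be sketched in a few lines of Python. This is only an illustration, not part of RapidMiner: the helper `required_reader` and the extension-to-operator table are hypothetical, and only the `rmhdf5table` extension and the Read CSV operator are taken from the text above.

```python
from pathlib import Path
from typing import Optional

# Extension of the native format, as stated in the text above.
NATIVE_EXTENSION = ".rmhdf5table"

# Hypothetical lookup table: which reader operator a workflow would need.
READER_OPERATORS = {
    ".csv": "Read CSV",
    ".xls": "Read Excel",
    ".xlsx": "Read Excel",
}


def required_reader(filename: str) -> Optional[str]:
    """Return the reader operator a file needs, or None for the native format."""
    suffix = Path(filename).suffix.lower()
    if suffix == NATIVE_EXTENSION:
        return None  # native HDF5 table: no extra read step in the workflow
    # Any format can be uploaded; unknown ones get a generic placeholder here.
    return READER_OPERATORS.get(suffix, "format-specific reader")


for name in ["churn-sample.rmhdf5table", "Churn-Go.csv"]:
    print(name, "->", required_reader(name))
```

The point of the sketch is that a non-native file costs one extra lookup, which corresponds to the extra read operator in the workflow.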

In order to do anything with the data, you must first link it to a project.

  • Does the project exist? If not, create the project.

  • Once you have uploaded the data, click on Link to Project and select a project.

  • If the data is generated inside a project, the data is automatically linked.

Data linked to a project is available to all project contributors and visible to all project viewers. You may link the same data file to multiple projects.

Organize your data

There are four elements that help you to organize and find your data.

  • Name: A search field allows you to filter by file name, by typing any substring. Search is the easiest way to locate a data file, if you know its name.

  • Tags: Each data file can have multiple tags, and you can use them to filter the set of files you want to see or exclude from the view.

  • Projects: Knowing which projects the data file is linked to helps you understand where it's used and what its potential dependencies are.

  • Filter type: Restricts the view to specific file types, such as Excel or CSV.

[Screenshot: Data filters]
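To make the interaction of the four elements concrete, here is a minimal Python sketch of how such filters might combine. It is not RapidMiner code: the `DataFile` class and `filter_files` function are hypothetical, the file types and tags in the sample data are invented for illustration, and two behaviors are assumptions (name matching is case-insensitive, and a file must carry every included tag).

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    name: str
    file_type: str                                # e.g. "csv", "excel", "hdf5"
    tags: set = field(default_factory=set)
    projects: set = field(default_factory=set)    # projects the file is linked to


def filter_files(files, name_contains="", include_tags=(), exclude_tags=(),
                 project=None, file_type=None):
    """Keep only the files that pass every active filter."""
    result = []
    for f in files:
        if name_contains.lower() not in f.name.lower():
            continue  # name: substring match (case-insensitivity is assumed)
        if not set(include_tags) <= f.tags:
            continue  # tags to see: file must carry all of them (assumed)
        if set(exclude_tags) & f.tags:
            continue  # tags to exclude: any match hides the file
        if project is not None and project not in f.projects:
            continue  # projects: file must be linked to the chosen project
        if file_type is not None and f.file_type != file_type:
            continue  # filter type: exact file-type match
        result.append(f)
    return result


# Sample data: names come from the screenshots, everything else is made up.
files = [
    DataFile("Churn-Go", "csv", {"raw"}, {"Churn9"}),
    DataFile("churn-sample", "hdf5", {"sample"}, {"Churn9"}),
]

print([f.name for f in filter_files(files, name_contains="churn")])
print([f.name for f in filter_files(files, exclude_tags=["raw"])])
```

Each filter only narrows the result, so applying them in any order gives the same set of files.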

Details

Click the name of a data file to open its Details page.

[Screenshot: Data details]

Here you can:

  • See the data table by selecting the Data tab.
  • Plot the data by selecting the Chart tab.
  • Add a Description to help other users to understand the content of the data file.
  • Add or remove Tags, for better organization.
  • Link to projects that will have access to the data file.
  • Set Permissions for users who will have access to the file, independent of projects.

Permissions

Note that if the data file is linked to a project and a user has access to the project, that user does not need additional permissions.

Comparing the screenshots above and below, you can see that any user who has access to the Churn9 project has access to both data sets, churn-sample and Churn-Go, but for different reasons:

  • Churn-Go is explicitly linked to the Churn9 project, whereas
  • churn-sample is linked implicitly, because Churn9 is its owner.

When explicit permission is required, because a user has no access to the project, data files can be shared as:

  • Read (read only) or
  • Write (read-write).

[Screenshot: Data permissions]
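The access rules described in this section can be summarized in a short Python sketch. It is a simplified model under stated assumptions, not RapidMiner's implementation: `CatalogEntry`, `access_level`, and the user names are hypothetical, project roles (contributor versus viewer) are not modeled, and user and project names are assumed never to collide.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str                                           # a user name or a project name
    linked_projects: set = field(default_factory=set)    # explicit project links
    permissions: dict = field(default_factory=dict)      # user -> "read" or "write"


def access_level(user, entry, project_members):
    """Return 'owner', 'project', 'write', 'read', or None if there is no access."""
    if entry.owner == user:
        return "owner"
    # A project reaches a file through an explicit link or, implicitly, by owning it.
    projects = set(entry.linked_projects)
    if entry.owner in project_members:
        projects.add(entry.owner)
    if any(user in project_members.get(p, set()) for p in projects):
        return "project"  # project access: no additional permission needed
    # Otherwise only an explicit Read or Write permission grants access.
    return entry.permissions.get(user)


# Hypothetical users; data set and project names follow the screenshots.
project_members = {"Churn9": {"alice", "bob"}}
churn_go = CatalogEntry("Churn-Go", owner="alice",
                        linked_projects={"Churn9"}, permissions={"carol": "read"})
churn_sample = CatalogEntry("churn-sample", owner="Churn9")

for user in ["alice", "bob", "carol"]:
    print(user, access_level(user, churn_go, project_members),
          access_level(user, churn_sample, project_members))
```

In this model, bob reaches both data sets through Churn9, while carol, who is outside the project, sees only Churn-Go and only because of her explicit Read permission.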