Build models
Auto ML is designed to help you build predictive models from your data quickly and easily. All you need is a data set and something you want to predict. It's that simple!
As discussed in the introduction, we will guide you through the following steps:
- Start Auto ML -- assuming you have a data file in rmhdf5table format linked to a project, select Start Auto ML
- Choose Column -- choose the column whose values you want to predict
- Select Inputs -- decide what's relevant and eliminate what's irrelevant
- Select Models -- select and build one or more models
By the end of the fourth step, you will have created one or more models. After that, you can inspect the models and decide which one best suits your purpose.
Step 1: Start Auto ML
To follow the documentation step by step, note that the use of Auto ML presupposes that:
- you have a data file in rmhdf5table format (if you want to use your own data set and it is not in rmhdf5table format, find out how to convert it), and
- that file is linked to a project.
From within the Data tab or the Content tab of the project, select Start Auto ML.
Step 2: Choose Column
In what follows, we'll work with the sample data set ChurnPredictionData. The data concerns customers of a phone company, who may or may not cancel their subscription.
One of the data columns -- we'll call it the target column -- has values that you want to predict.
In our current example, the target column is Churn, since we want to predict who will churn.
From the dropdown menu, choose Churn before clicking Next.
In general, the values of the target column can be numerical (like CustServ Calls) or categorical (like Churn).
Depending on your target column, the problem will fall into one of the following three categories:
- Binary classification - Categorical data, two possible values (like Churn)
- Multiclass classification - Categorical data, three or more possible values
- Regression - Numerical data (like CustServ Calls)
Choose a column, and Auto ML will automatically detect what type of problem it has to solve. Additional details for each type of problem are given below.
- Binary Classification (predicting one of exactly two possible values)
  Some questions have a yes-or-no answer. For example, if you take a medical test, the results are often described as positive or negative:
  - Positive: the test found what you were looking for (e.g., an infection)
  - Negative: the test did not find what you were looking for (e.g., no infection)
  If the result is positive, a more thorough investigation may be necessary; if the result is negative, no more work is needed. Arguably, the positive result is more important and deserves a higher degree of attention, because the focus of medical work is to treat the infection.
  Our current problem, where Churn takes the values "yes" or "no", is an example of a binary classification problem, with the focus on "yes", since we want to predict which customers will churn.
- Multiclass Classification (predicting one of three or more possible values)
  If your target column has three or more non-numerical values, your problem is called a multiclass classification problem.
- Regression (predicting numerical values)
  If your target column is numerical, and you want to predict the numbers in that column, your problem is called a regression problem. For example, in the ChurnPredictionData data set, there is a column called CustServ Calls whose value is the number of times a customer has called customer service.
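The three problem types described above can be told apart with a simple rule on the target column. The following is a minimal sketch of such a rule, not Auto ML's actual detection logic: the function name `detect_problem_type` is hypothetical, the sample values are made up for illustration, and the `plan` column does not come from the sample data set.

```python
from numbers import Number


def detect_problem_type(values):
    """Classify a target column as regression, binary, or multiclass.

    Hypothetical helper for illustration; not Auto ML's own logic.
    """
    # Ignore missing entries when inspecting the target column.
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("target column has no values")

    # A purely numerical target means regression.
    # bool is excluded because True/False behaves like a category.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in present):
        return "regression"

    # Otherwise the target is categorical: count the distinct values.
    distinct = set(present)
    if len(distinct) == 2:
        return "binary classification"
    if len(distinct) >= 3:
        return "multiclass classification"
    raise ValueError("target column has only one distinct value")


# Made-up sample values, for illustration only.
churn = ["yes", "no", "no", "yes", "no"]
custserv_calls = [1, 0, 3, 2, 1]
plan = ["basic", "plus", "premium", "basic"]  # hypothetical column

print(detect_problem_type(churn))
print(detect_problem_type(custserv_calls))
print(detect_problem_type(plan))
```

Note that this sketch treats every numerical column as a regression target; a real tool may apply further checks.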
Step 3: Select Inputs
Not all of your data columns will help you make a prediction. By discarding some of the columns, you may speed up your model-building and/or improve the model's performance. But how do you make that decision? A key point is that you're looking for patterns. Without some variation in the data and some discernible patterns, the data is not likely to be useful.
The four criteria that Auto ML uses to determine if a particular column is useful are:
- Correlation - how closely do the values resemble the target column?
- ID-ness - how different are the values from one another?
- Stability - how similar are the values to one another?
- Missing - how many missing values are in the column relative to the total?
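To make three of these criteria more concrete, here is a minimal sketch that computes ID-ness, Stability, and Missing for a single column. The formulas are assumptions chosen for illustration and are not necessarily the ones Auto ML uses: ID-ness is taken as the share of distinct values among the non-missing values, Stability as the share of the most frequent value among the non-missing values, and Missing as the share of missing values among all values. The function name `column_metrics` and the sample columns are hypothetical.

```python
from collections import Counter


def column_metrics(values):
    """Return assumed ID-ness, stability, and missing ratios for a column.

    Missing values are represented by None. The formulas are
    illustrative assumptions, not Auto ML's documented definitions.
    """
    total = len(values)
    if total == 0:
        raise ValueError("column is empty")

    present = [v for v in values if v is not None]
    missing = (total - len(present)) / total
    if not present:
        return {"id_ness": 0.0, "stability": 0.0, "missing": missing}

    counts = Counter(present)
    # ID-ness: how different are the values from one another?
    id_ness = len(counts) / len(present)
    # Stability: how dominant is the single most frequent value?
    stability = counts.most_common(1)[0][1] / len(present)
    return {"id_ness": id_ness, "stability": stability, "missing": missing}


# Made-up columns, for illustration only.
customer_id = ["a1", "a2", "a3", "a4"]   # every value is different
country = ["US", "US", "US", "US"]       # every value is the same
notes = [None, "x", None, "y"]           # half of the values are missing

for name, column in [("customer_id", customer_id),
                     ("country", country),
                     ("notes", notes)]:
    print(name, column_metrics(column))
```

Under these assumed definitions, a column such as `customer_id` has maximal ID-ness and a column such as `country` has maximal stability; neither shows a pattern that could help a prediction.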
Each column is marked with a quality tag: green, yellow, or red.
Green: Good quality | Yellow: Needs examination | Red: Poor quality |
---|---|---|
No problem! | | |
By default, Auto ML will deselect the columns marked with a red or yellow quality tag, but you are of course free to select or deselect any columns you like! Usually the defaults will work well, but you should pay careful attention if a column is marked with a yellow tag and has high correlation.
To understand the issue with high correlation, consider an extreme example: perfect correlation. If you have two columns called X and Y, and X = Y, then the correlation is 100% and X is just another name for Y. If you are predicting X, you would discard the column called Y, because it's redundant. It may be redundant even if the correlation is less than 100%. Ask yourself the following question: will I have access to the data in the highly-correlated column prior to making a prediction? If not, the data is not useful.
In some cases, however, the column is useful for prediction, precisely because it is highly correlated with the target column; if you exclude it, you risk damaging your model. Only you can tell for certain. In case of doubt, you can create two models: one with the highly-correlated column and one without, to help you decide which is best.
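The perfect-correlation example above can be checked numerically. The sketch below uses the Pearson correlation coefficient for two numerical columns; this is an assumption made for illustration, since the text does not say which correlation measure Auto ML computes. The function name `pearson` and the sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("columns must have the same length (at least 2)")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Made-up values, for illustration only.
x = [1, 2, 3, 4, 5]      # target column X
y = list(x)              # Y is an exact copy of X
z = [2, 1, 4, 3, 5]      # Z follows X closely, but not exactly

print(pearson(x, y))     # perfect correlation: Y is just another name for X
print(pearson(x, z))     # high, but below 100%
```

A value close to 1 for a column like `y` is a signal to ask whether that column would really be available before the prediction is made.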