Deep Learning

Synopsis

This operator is used to configure and train a feed-forward neural network with different types of layers using the DeepLearning4J library.

Description

Use this operator to configure a feed-forward neural network and train it on a CPU or GPU. The general network configuration is set through this operator's parameters. Double-click the operator to access its subprocess, where you can configure the layer architecture. Layers can be added and configured through operators like "Add Fully-Connected Layer". Also visit https://docs.rapidminer.com/latest/studio/installation/deep-learning-extension.html for additional information, e.g. about deployments and ready-to-use GPU-enabled Docker containers.

Loss Function and Updater:

Training a neural network is an iterative process of updating the so-called weights and biases of its layers. Each layer may contain neurons, weights, biases or functions that are used to calculate a prediction for a given example. After a prediction has been created for an example, the result is compared to a desired value; in the case of supervised classification (labels are available), the prediction is compared to the label value. To set this up, first specify the learning objective you are aiming at: use the loss function parameter of the output layer to define the type of learning task (classification or regression) and the method that measures how far the algorithm is off the desired value. This measure, called error or loss, is then used to successively change all previously used weight and bias values, a process called back propagation. The method for updating these values is chosen via the updater parameter. Each update method has its own set of parameters that become available for configuration after choosing it.
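
As a plain-Java illustration of this loop (not the extension's or DeepLearning4J's actual code), the following sketch uses the smallest possible model: one weight, one bias, a squared-error loss and a single SGD-style update derived from the loss gradient. All numbers are made up.

```java
public class LossUpdateSketch {
    public static void main(String[] args) {
        double w = 0.1, b = 0.0;          // current weight and bias
        double x = 2.0, label = 5.0;      // one training example
        double learningRate = 0.01;

        double prediction = w * x + b;                    // forward pass
        double loss = Math.pow(prediction - label, 2);    // squared-error loss

        // Back propagation (chain rule) for this tiny model:
        double dLossDPrediction = 2 * (prediction - label);
        double gradW = dLossDPrediction * x;              // dLoss/dw
        double gradB = dLossDPrediction;                  // dLoss/db

        // SGD-style update: step against the gradient, scaled by the learning rate.
        w -= learningRate * gradW;
        b -= learningRate * gradB;

        System.out.printf("loss=%.4f, new w=%.4f, new b=%.4f%n", loss, w, b);
    }
}
```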

Weights, Bias and Epochs:

Since the training of a neural network is an iterative process, starting values for the weights and biases are needed. Various strategies for initializing weights can be chosen through the weight initialization parameter; an initial bias can be set through the bias initialization parameter. Using all examples once to calculate a loss is called an epoch. Calculated loss values can be logged for every epoch or for every given number of epochs; this is configured via the epochs per log parameter, which becomes available after unchecking the log each epoch parameter. These values can also be obtained as an ExampleSet via the history port. It is recommended to check the development of the loss score over the computed epochs to understand the training behaviour and to adjust the parameters of the chosen updater accordingly.
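
The following sketch (plain Java with made-up data, assuming a single linear unit and a squared-error loss) shows how starting values, epochs and the epochs per log idea fit together: the weight is initialized from a small Gaussian value, every full pass over the data is one epoch, and the mean per-example loss is printed every epochsPerLog epochs.

```java
import java.util.Random;

public class EpochSketch {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 8};
        Random rnd = new Random(1992);

        double w = rnd.nextGaussian() * 0.1;  // weight initialization: small Gaussian value
        double b = 0.0;                       // bias initialization
        double lr = 0.01;
        int epochs = 100, epochsPerLog = 10;

        for (int epoch = 1; epoch <= epochs; epoch++) {
            double epochLoss = 0.0;
            for (int i = 0; i < x.length; i++) {        // one epoch = one full pass over the data
                double pred = w * x[i] + b;
                double err = pred - y[i];
                epochLoss += err * err;
                w -= lr * 2 * err * x[i];
                b -= lr * 2 * err;
            }
            if (epoch % epochsPerLog == 0) {            // log the mean per-example loss
                System.out.printf("epoch %d: loss=%.5f%n", epoch, epochLoss / x.length);
            }
        }
    }
}
```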

Mini Batches:

How often an update is performed depends on the selected updater and optimization method. For many problems it is helpful to first collect the loss over a small subset of examples before updating the model. This can be done by selecting the use miniBatch parameter and setting the size of the subset with the batch size parameter.
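
A minimal sketch of the mini-batch idea, assuming the same toy linear model as above: gradients are collected over batch size examples and the weights are updated once per batch.

```java
public class MiniBatchSketch {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6, 7, 8};
        double[] y = {3, 5, 7, 9, 11, 13, 15, 17};
        double w = 0.0, b = 0.0, lr = 0.01;
        int batchSize = 4;

        for (int start = 0; start < x.length; start += batchSize) {
            double gradW = 0.0, gradB = 0.0;
            int end = Math.min(start + batchSize, x.length);
            for (int i = start; i < end; i++) {           // collect the gradient over the batch
                double err = (w * x[i] + b) - y[i];
                gradW += 2 * err * x[i];
                gradB += 2 * err;
            }
            int n = end - start;
            w -= lr * gradW / n;                          // one update per mini-batch
            b -= lr * gradB / n;
        }
        System.out.printf("w=%.3f, b=%.3f%n", w, b);
    }
}
```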

Regularization:

By connecting multiple layers, a network can easily end up with a huge number of values that need to be updated. To avoid overfitting, regularization methods can be used. Network-wide L1 and/or L2 regularization can be enabled by selecting the use regularization parameter and setting the parameters that appear. Another option is to randomly deactivate neurons during training. This method is called dropout and is available as an option in some layers or via the "Add Dropout Layer" operator.
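
The following sketch illustrates (inverted) dropout on a made-up activation vector: each neuron's output is kept with a given probability and rescaled so the expected sum stays the same. This only illustrates the concept, it is not the extension's implementation.

```java
import java.util.Random;

public class DropoutSketch {
    public static void main(String[] args) {
        double[] activations = {0.7, 1.2, 0.3, 0.9, 1.5};
        double keepProbability = 0.8;   // i.e. 20% of the neurons are randomly deactivated
        Random rnd = new Random(42);

        double[] dropped = new double[activations.length];
        for (int i = 0; i < activations.length; i++) {
            boolean keep = rnd.nextDouble() < keepProbability;
            // kept activations are scaled up so the expected layer output is unchanged
            dropped[i] = keep ? activations[i] / keepProbability : 0.0;
        }
        System.out.println(java.util.Arrays.toString(dropped));
    }
}
```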

Non-ExampleSet Data:

Neural networks can be trained not only on ExampleSets, but also on collections of ExampleSets called tensors. A tensor can be used to represent, for example, text encoded as numbers, images or other media data. To create models on these kinds of data, use the "Deep Learning on Tensors" operator in combination with the "Apply Model on Tensors" operator.

Input Shape:

Neural networks are applicable to a wide range of use cases, and the way data is handled can differ between them. Therefore the structure of the available data has to be configured for the network. In many cases this can be done automatically by checking the infer input shape parameter, but sometimes the desired behavior differs or an automatic estimation is not possible. In those cases, uncheck the parameter, select the type of network you are building and fill in the input dimensions of your data.

Switching between CPU and GPU:

Currently an NVIDIA GPU in combination with CUDA is needed in order to execute a Deep Learning process on a GPU. Make sure to add CUDA to your system PATH. If you want to use CuDNN as well, the currently supported version is 7.4.

The actual switch between CPU and GPU can be done through the General Preference named "Backend". This setting is part of the RapidMiner Studio preferences. On RapidMiner Server add the key "rapidminer.backend.nd4j" to the settings with the appropriate value.

Further Remarks:

Find tips on configuring available parameters in the respective parameter descriptions below.

This operator uses the DeepLearning4J Java library in version 1.0-beta6.

Input

training set

Input Training ExampleSet. Currently only supervised problems are supported, therefore provide an ExampleSet containing a label. Other attributes have to be numeric and free of missing values.

test set

Input Test ExampleSet. Currently only supervised problems are supported, therefore provide an ExampleSet containing a label. Other attributes have to be numeric and free of missing values. This set will be used for model evaluation at the end of an epoch.

Output

model

Neural Network Model. This model can be applied using the Apply Model operator or trained further using the Update Model operator.

exampleSet

The ExampleSet that was given as input is passed through without changes.

history

An ExampleSet containing example-based loss values and the respective epoch counts, representing the training and test behavior. The training loss is derived from the training model and thus includes dropped-out neurons and other regularization mechanisms; the test loss is derived from the test model without these mechanisms. Loss values are given on a per-example basis. Plot these loss values, for example as a scatter plot, to check whether the learning rate needs to be changed, another weight initialization should be chosen, or regularization should be applied. A decreasing curve is expected; as long as it keeps decreasing, more epochs could be used.

weights

A collection of ExampleSets containing the neuron/unit weights for each layer (one ExampleSet per layer). Attributes represent the neurons/units of the layer, hence their names (e.g. layer_0_unit_1). A row in an ExampleSet contains the weights coming from a single neuron/unit of the previous layer.

biases

A collection of ExampleSets containing the neuron/unit biases for each layer (one ExampleSet per layer). Attributes represent the neurons/units of the layer, hence their names (e.g. layer_0_unit_1). A row in an ExampleSet contains the biases for every neuron/unit in that layer.

Parameters

Cudnn algo mode

This parameter only influences the runtime environment of the network if it is executed on a GPU. NVIDIA (manufacturer of the supported GPU architecture) provides a library called CuDNN, which contains efficient implementations of various layers. CuDNN can accelerate training, but at the cost of a potentially larger memory footprint. In certain edge cases the higher memory consumption can lead to strange errors. When that happens, it is advised to trade performance for memory by setting this parameter to "No workspace".

  • Prefer fastest: Default setting. Offers better performance at the cost of higher memory consumption. If memory constraints become a problem, it is advised to switch to "No workspace" and give that a try.
  • No workspace: Has a lower memory footprint than "Prefer fastest", but at the cost of lower performance.

Epochs

Number of times the whole data set is passed through the network. Use the advanced parameter use early stopping to select a strategy that enables early stopping. These strategies often result in a shorter training time, since the training process is stopped when a desired criterion is reached.

Use early stopping

When training neural networks, numerous decisions need to be made regarding the settings (hyperparameters) used in order to obtain good performance. One such hyperparameter is the number of training epochs, i.e. the number of full passes over the data set. Early stopping attempts to remove the need to manually set this value. It can also be considered a type of regularization method (like L1/L2 weight decay and dropout) in that it can stop the network from over-fitting. The number of epochs set using the epochs parameter is always used as an upper limit. Available conditions can be selected through the condition strategy selector; a minimal sketch of the patience-based condition is shown after the condition lists below.

Available epoch conditions, that are tested every epoch:

  • score_improvement: Uses the parameters patience and minimal score improvement to check whether the score has stopped improving, which leads to a stop. The patience defines the number of epochs for which the score needs to be considered constant. Whether a score is considered constant compared to the previous one depends on how much it has changed; the minimum change that still counts as an improvement is set using minimal score improvement.
  • best epoch score: Use this strategy to define a target score with the best epoch score parameter. If this score is reached, training is stopped.
  • score improvement: Allowed number of epochs without any improvement in score compared to the best score so far.

Available iteration conditions, that are tested on each mini-batch:

  • max iteration score: Score threshold for each iteration. If an iteration exceeds this score threshold, training is stopped. This can occur, for example, with a poorly tuned (too high) learning rate.
  • maximum time: Maximum amount of time (in seconds) an iteration can last during training.
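
A minimal sketch of the patience-based score improvement condition, using a made-up sequence of epoch scores (lower is better): training stops once the score has not improved by at least minimal score improvement for patience consecutive epochs.

```java
public class EarlyStoppingSketch {
    public static void main(String[] args) {
        double[] epochScores = {0.90, 0.60, 0.45, 0.44, 0.439, 0.438, 0.438, 0.437};
        int patience = 3;                         // epochs allowed without a real improvement
        double minimalScoreImprovement = 0.01;    // smallest change that counts as an improvement

        double bestScore = Double.MAX_VALUE;
        int epochsWithoutImprovement = 0;

        for (int epoch = 0; epoch < epochScores.length; epoch++) {
            if (bestScore - epochScores[epoch] >= minimalScoreImprovement) {
                bestScore = epochScores[epoch];   // meaningful improvement: reset the counter
                epochsWithoutImprovement = 0;
            } else {
                epochsWithoutImprovement++;
            }
            if (epochsWithoutImprovement >= patience) {
                System.out.println("early stop after epoch " + (epoch + 1));
                return;
            }
        }
        System.out.println("ran all epochs");
    }
}
```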

Use minibatch

Pass data in batches through the network and update weights and biases after each of those batches.

Batch size

Number of examples used in one batch for a single weight update. Values are often chosen as powers of 2. When switching from CPU to GPU execution, increase the batch size considerably (by a factor of 100-1,000) to take advantage of the GPU's extra memory.

Updater

Method used to calculate new weight and bias values in order to minimize the chosen loss. A plain-Java sketch of the SGD and Adam update rules is shown after the list below.

  • SGD: Stochastic gradient descent. Uses a learning rate to adjust the extent to which weights and biases are updated.
  • Adam: Adaptive momentum change. http://arxiv.org/abs/1412.6980
  • AdaMax: Similar to Adam but using the infinity norm. Recommended parameter settings: learning rate = 0.002, beta 1 = 0.9, beta 2 = 0.999. http://arxiv.org/abs/1412.6980
  • AdaDelta: Similar to AdaGrad but adjusts learning rate based on moving window averages instead of all collected gradients. Recommended parameter settings: learning rate = 1.0, rho = 0.95.
  • Nesterovs: SGD with Nesterov momentum.
  • NAdam: Similar to Adam, but using the Nesterov mechanism for the momentum change. Recommended parameter settings: learning rate = 0.002, beta 1 = 0.9, beta 2 = 0.999.
  • AdaGrad: Uses set learning rate as a baseline and decreases it during training. The rate is adjusted for each weight and reduced when more updates are performed. Recommended parameter settings: learning rate = 0.01.
  • RMSProp: Divides the gradient by a moving average of squared gradients. Well suited for recurrent networks. Recommended parameter settings: learning rate = 0.001, beta 1 = 0.9, beta 2 = 0.999.
  • None: Don't update weights and biases.
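
The following plain-Java sketch shows the SGD and Adam update rules from the list above, applied to a single weight and a made-up gradient sequence. It is illustrative only; the operator delegates the actual updates to DeepLearning4J.

```java
public class UpdaterSketch {
    public static void main(String[] args) {
        double[] gradients = {0.4, 0.3, 0.35, 0.2, 0.25};
        double lr = 0.01, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8;

        // SGD: step against the gradient, scaled by the learning rate.
        double wSgd = 1.0;
        for (double g : gradients) {
            wSgd -= lr * g;
        }

        // Adam: keep exponentially decaying averages of the gradient (m) and of the
        // squared gradient (v), correct their bias, then take an adaptive step.
        double wAdam = 1.0, m = 0.0, v = 0.0;
        for (int t = 1; t <= gradients.length; t++) {
            double g = gradients[t - 1];
            m = beta1 * m + (1 - beta1) * g;
            v = beta2 * v + (1 - beta2) * g * g;
            double mHat = m / (1 - Math.pow(beta1, t));
            double vHat = v / (1 - Math.pow(beta2, t));
            wAdam -= lr * mHat / (Math.sqrt(vHat) + epsilon);
        }

        System.out.printf("SGD: %.5f, Adam: %.5f%n", wSgd, wAdam);
    }
}
```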

Learning rate

Speed of navigating the landscape of potential weight and bias values. 0.005 and 0.01 are often good starting points. Higher learning rates can reduce the number of epochs needed to reach a 'good' result, but they also increase the chance of missing the optimal point.

Momentum

Acceleration of a chosen learning rate. Reduces fluctuating values and can help avoid local minima / results getting stuck too early.

Rho

Exponential decay rate of the learning rate. Slows down learning while decreasing the chance of missing weight and bias values that result in lower losses.

Epsilon

Jitter value used to ensure numerical stability of updates. Should be very small.

Beta1

Fine tuning parameter for some updaters. In many cases this should be close to one.

Beta2

Fine tuning parameter for some updaters. In many cases this should be close to one.

Rmsdecay

Decay rate for the RMSProp update mechanism.

Weight initialization

A Deep Learning model is defined by so-called weights. Weights are set within most layers and define the model. Finding the best weight values during training is an iterative process and requires start values. Weight values are multiplied with the respective input data: at the first layer the input data is the data provided at the training port of the Deep Learning operator, while for successive layers the weights are multiplied with the output of the previous layer. Select one of the provided pre-defined methods to initialize all weights with the given strategy. Change this parameter if the loss is not decreasing during training or if it takes a long time before the loss value goes down. A sketch of the Normal and Xavier strategies is shown after the list below.

  • Identity: Use identity matrices.
  • Normal: Use a Gaussian distribution with a mean of zero and a standard deviation of 1 / sqrt(number of layer inputs).
  • Ones: Use ones.
  • ReLU: Use a Gaussian distribution with a mean of zero and a variance of 2 / (number of layer inputs).
  • ReLU Uniform: Use a Uniform distribution from -a to a, where a = sqrt(6/(number of layer inputs)).
  • Sigmoid Uniform: Use a Uniform distribution from -a to a, where a = sqrt(6 / (number of layer inputs + number of layer outputs)).
  • Uniform: Use a Uniform distribution from -a to a, where a = 1 / sqrt(number of layer inputs).
  • Xavier: Use a Gaussian distribution with a mean of zero and a variance of 2 / (number of layer inputs + number of layer outputs).
  • Xavier Uniform: Use a Uniform distribution from -a to a, where a = sqrt(6/(number of layer inputs + number of layer outputs)).
  • Zero: Initialize all weights with zero. This is rarely a good idea.
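
A plain-Java sketch of the Normal, Xavier and Xavier Uniform strategies from the list above, for an assumed layer with 64 inputs and 32 outputs. The actual initialization is performed by DeepLearning4J; this only illustrates the distributions.

```java
import java.util.Random;

public class WeightInitSketch {
    public static void main(String[] args) {
        int nIn = 64, nOut = 32;
        Random rnd = new Random(1992);

        // Normal: Gaussian with mean 0 and standard deviation 1 / sqrt(nIn).
        double normalWeight = rnd.nextGaussian() / Math.sqrt(nIn);

        // Xavier: Gaussian with mean 0 and variance 2 / (nIn + nOut).
        double xavierWeight = rnd.nextGaussian() * Math.sqrt(2.0 / (nIn + nOut));

        // Xavier Uniform: uniform in [-a, a] with a = sqrt(6 / (nIn + nOut)).
        double a = Math.sqrt(6.0 / (nIn + nOut));
        double xavierUniformWeight = -a + 2 * a * rnd.nextDouble();

        System.out.printf("%.4f %.4f %.4f%n", normalWeight, xavierWeight, xavierUniformWeight);
    }
}
```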

Bias initialization

As described for the weight initialization parameter, a Deep Learning model needs starting values for the training process. While the weights are multiplied with the input data, the bias values are added on top of this product. When training a regression model on a data set with a mean target value of 10, starting with a bias initialization value of 10 could enable the network to find a fitting bias value more quickly.
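
A tiny sketch of this idea, using made-up regression targets: the mean of the targets is computed and used as the bias initialization value.

```java
public class BiasInitSketch {
    public static void main(String[] args) {
        double[] labels = {9.0, 10.5, 11.0, 9.5, 10.0};   // made-up regression targets

        double sum = 0.0;
        for (double label : labels) {
            sum += label;
        }
        double initialBias = sum / labels.length;         // mean target value as starting bias

        System.out.printf("bias initialization = %.2f%n", initialBias);
    }
}
```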

Use regularization

Define whether to use regularization for the weight calculation. Regularization can help reduce overfitting. Use a scatter plot of the data provided at the history port to check the development of the training and test loss across epochs; oscillating / jumping loss values often indicate a need for regularization. Set one of the two values (L1 or L2) to 0.0 to use only the other. A small L2 regularization (~0.1) is often a good starting point.

L1 strength

Define strength of L1 (sum of all absolute weight values) for regularization.

L2 strength

Define strength of L2 (sum of all squared weight values; its gradient is proportional to the weight itself) for regularization.
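
The following sketch shows how the two strengths enter the loss that is minimized, assuming a made-up weight vector and data loss: the data loss is increased by the L1 strength times the sum of absolute weights and by the L2 strength times the sum of squared weights.

```java
public class RegularizationSketch {
    public static void main(String[] args) {
        double[] weights = {0.5, -1.2, 0.8, -0.3};
        double dataLoss = 0.42;           // loss computed from the predictions (made up)
        double l1Strength = 0.0;          // set one strength to 0.0 to use only the other
        double l2Strength = 0.1;

        double l1 = 0.0, l2 = 0.0;
        for (double w : weights) {
            l1 += Math.abs(w);            // L1: sum of absolute weight values
            l2 += w * w;                  // L2: sum of squared weight values
        }
        double regularizedLoss = dataLoss + l1Strength * l1 + l2Strength * l2;

        System.out.printf("regularized loss = %.4f%n", regularizedLoss);
    }
}
```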

Optimization method

An optimization method defines the strategy for when and how parameters (weights and biases) are updated. The provided methods allow switching between batched and non-batched approaches. Batched methods like Conjugate Gradient Line Search and L-BFGS perform an update after the loss has been calculated for the full data set passed through the network once; hence these methods are more memory demanding. Non-batched methods like Stochastic Gradient Descent and Line Gradient Descent perform updates for each example. Enabling use miniBatch alters this behaviour by first collecting a predefined number of examples (set with the batch size parameter) before performing an update. A sketch contrasting the two update frequencies is shown after the list below.

Most of the time, a combination of Stochastic Gradient Descent with miniBatch is very performant while providing good results.

For ExampleSets with fewer than 10,000 examples, or for big ExampleSets without much redundancy, it is recommended to use a batch optimization method in combination with an adaptive update mechanism like Adam.

  • Line Gradient Descent: Stochastic gradient descent with line search. This method performs weight and bias updates for each example.
  • Conjugate Gradient Line Search: This method performs weight and bias updates after calculating losses for the full data set. Might be memory intensive.
  • L-BFGS: This method performs weight and bias updates after calculating losses for the full data set. Might be memory intensive.
  • Stochastic Gradient Descent: Stochastic gradient descent. This method performs weight and bias updates for each example.
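
The following sketch contrasts the update frequency of a batched method (one update after a full pass over the data) with a per-example method, using the same toy linear model as in the earlier sketches. The real algorithms do considerably more (line searches, curvature estimates); only the update frequency is shown here.

```java
public class OptimizationFrequencySketch {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 8};
        double lr = 0.01;

        // Batched: accumulate the gradient over the whole data set, then update once.
        double wBatched = 0.0;
        double grad = 0.0;
        for (int i = 0; i < x.length; i++) {
            grad += 2 * ((wBatched * x[i]) - y[i]) * x[i];
        }
        wBatched -= lr * grad / x.length;

        // Per-example: update immediately after every single example.
        double wStochastic = 0.0;
        for (int i = 0; i < x.length; i++) {
            double g = 2 * ((wStochastic * x[i]) - y[i]) * x[i];
            wStochastic -= lr * g;
        }

        System.out.printf("batched: %.4f, per-example: %.4f%n", wBatched, wStochastic);
    }
}
```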

Infer input shape

Infer the input shape from the given training data and log it to the console.

Network type

Choose network type to configure data shape for.

  • Simple Neural Network: Simple two dimensional neural network only consisting of fully-connected, dropout, activation and batch normalization layers.
  • Convolutional: A four dimensional neural network using convolutional layers, pooling and others.
  • Convolutional Flattened: A two dimensional neural network using convolutional layers, pooling and others, but converting the input from four dimensions to two. Conversion to: [miniBatchSize, height * width * channels] (see the sketch after this list).
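
A small sketch of the flattening conversion, assuming channels-first ordering of the four dimensional input and made-up image dimensions:

```java
public class FlattenedShapeSketch {
    public static void main(String[] args) {
        int miniBatchSize = 32, channels = 3, height = 28, width = 28;

        // Four dimensional input as used by the convolutional network type.
        int[] convolutionalShape = {miniBatchSize, channels, height, width};

        // Two dimensional input after flattening: [miniBatchSize, height * width * channels].
        int[] flattenedShape = {miniBatchSize, height * width * channels};

        System.out.println(java.util.Arrays.toString(convolutionalShape));
        System.out.println(java.util.Arrays.toString(flattenedShape));   // [32, 2352]
    }
}
```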

Input dimension

Dimension of the data as expected by the network. For simple neural networks this is often the number of regular attributes.

Height

Height dimensionality of the data.

Width

Width dimensionality of the data.

Depth

Depth dimensionality of the data.

Log each epoch

Disable this option to choose the number of epochs after which loss values are logged. This affects the output of the history port as well as the process log.

Epochs per log

Number of epochs after which the loss values are logged. This parameter is available if log each epoch is disabled.

Use local random seed

This parameter indicates if a local random seed should be used.

Local random seed

If the use local random seed parameter is checked this parameter determines the local random seed.