Fine-tune Model
Synopsis
This operator updates an existing neural network model and thereby makes transfer learning directly available. Supported are models that either originate from the extension (built and trained there) or were imported by one of the import operators (existing, Keras).
Description
To modify a network correctly, one has to know the details of the model in question. For deep-learning models, one can obtain the list of layers (and see the general structure of the entire model) by printing the network: connect the model to an output port, or set a breakpoint (after) on an operator that produces such a model.
In many transfer-learning use cases, the initial (pre-trained) model has to be changed to fit our needs. A good example is the following scenario: we train a network to classify handwritten digits (MNIST) and later want to apply the model to recognize handwritten letters (EMNIST). In this case, the number of classes does not match (10 digits vs. 26 English letters), so at the very least the last layer has to be modified (it must have 26 neurons instead of the original 10). For this purpose, we remove the last layer, freeze the layers before it and then continue the network from there with a new last layer that has 26 neurons. Freezing prior layers makes sense if we do not intend to change their weights and biases, i.e. we keep using the same learnt features. This can speed up training immensely and shows the benefit of transfer learning.
The new layer is designed inside this operator, similarly to how one would use the "normal" deep-learning operators when building a completely new model. Besides replacing the last layer, we could also extend the network with more than one layer to enrich its capabilities. In the above example, however, we simply replace the last layer and thus have only one layer with 26 neurons in the nest.
For more information about the possible modifications, see the documentation of the following parameters below: "freeze up to", "continue from layer", "list removable layers".
Please make sure to check out the tutorial processes. The one called "MNIST Retrain Last Layer" is an example of transfer learning with a slight change to the model. The other tutorial process, "Iris Improvement", presents the use case in which the structure of the network should not change, i.e. the model is left intact. In that case the nest operator contains no layers and simply trains the model further. This is similar to the "Update Model" operation in RapidMiner for other types of models that support it.
Caution: When performing transfer learning, it is advised to first understand the model being used. Basing an analysis on an existing model bears the risk of inheriting potentially unwanted biases.
Input
model
Model to fine-tune. The weights and biases of the provided model are used as the initial state, whose values are improved upon in each epoch by the selected optimization method.
training set
ExampleSet or TensorIOObject holding the training data.
test set
ExampleSet or TensorIOObject holding the evaluation/test data.
Output
model
Updated model. This model is the result of updating the initially provided model and its weights over the defined number of epochs, based on the provided training data. Frozen layers still have their initial weight and bias values.
throughput
Input sample set passed through.
history
An ExampleSet containing per-example loss values and the respective epoch counts, representing the training and test behavior. The training loss is derived from the training model and thus includes dropped-out neurons and other regularization methods. The test loss is derived from the test model without these mechanisms. Loss values are given on a per-example basis. Plot these loss values, for example as a scatter plot, to check whether the learning rate needs to be changed, another weight initialization needs to be chosen, or regularization should be applied. The expected curve is decreasing; as long as it keeps decreasing, more epochs could be used.
Parameters
Epochs
Number of times the whole data set is passed through the network. Use the advanced parameter use early stopping to select an early-stopping strategy. These strategies often result in a shorter training time, since the training process is stopped as soon as a desired criterion is reached.
Use minibatch
Pass data in batches through the network and update weights and biases after each of those batches.
Batch size
Number of examples to be used in one batch for a single weight update. Values are often chosen as powers of 2. When switching from CPU to GPU backend execution, increase the batch size considerably (by a factor of 100-1,000) to take advantage of the GPU's extra memory.
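The relation between batch size and the number of weight updates per epoch can be sketched as follows. This is only an illustration in plain Python; the helper name minibatches is hypothetical and not part of the extension.

```python
def minibatches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# Ten examples with a batch size of 4 give three batches,
# i.e. three weight updates per epoch (the last batch is smaller).
examples = list(range(10))
batches = list(minibatches(examples, 4))
updates_per_epoch = len(batches)
```

With a batch size equal to the number of examples, the same helper yields a single batch, which corresponds to one update per epoch.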
Log each epoch
Disable this option to choose the number of epochs after which loss values should be logged. This affects the output of the history port as well as the actual process log.
Epochs per log
Number of epochs after which loss values are logged. This parameter is available if log each epoch is disabled.
Use early stopping
When training neural networks, numerous decisions need to be made regarding the settings (hyperparameters) used, in order to obtain good performance. One such hyperparameter is the number of training epochs, i.e. the number of full passes over the data set. Early stopping attempts to remove the need to set this value manually. It can also be considered a type of regularization method (like L1/L2 weight decay and dropout), in that it can stop the network from overfitting. The number of epochs set using the epochs parameter is always used as an upper limit. Available conditions can be selected through the condition strategy selector.
Available epoch conditions, that are tested every epoch:
- score_improvement: Uses the parameters patience and minimal score improvement to detect a score that no longer improves, which leads to a stop. The patience defines the number of epochs for which the score has to remain constant. Whether a score counts as constant compared to the previous one depends on how much it has changed; the minimum change that counts as an improvement is set using minimal score improvement.
- best epoch score: Use this strategy to define a targeted score with the best epoch score parameter. If this score is reached training is stopped.
- score improvement: Allowed number of epochs without any improvement in score compared to the best score so far.
Available iteration conditions, that are tested on each mini-batch:
- max iteration score: Score threshold for every iteration. If an iteration exceeds this score threshold the training will be stopped. This can occur for example with a poorly tuned (too high) learning rate.
- maximum time: Maximum amount of time (in seconds) an iteration can last during training.
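The interplay of patience, minimal score improvement and the epochs upper limit can be sketched as follows. This is a minimal illustration under stated assumptions, not the extension's implementation: the function name stop_epoch is hypothetical, and the score is assumed to be a loss (lower is better).

```python
def stop_epoch(scores, patience, min_improvement, max_epochs):
    """Return the 1-based epoch at which training would stop.

    Assumes lower scores are better. An epoch counts as an improvement
    only if it beats the best score so far by more than min_improvement.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores[:max_epochs], start=1):
        if best - score > min_improvement:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # No condition triggered: the epochs parameter acts as the upper limit.
    return min(len(scores), max_epochs)


losses = [1.0, 0.8, 0.7, 0.69, 0.69, 0.695, 0.68]
stopped_at = stop_epoch(losses, patience=2, min_improvement=0.05, max_epochs=100)
```

In this example the loss improves clearly during the first three epochs, then changes by less than the minimal score improvement for two consecutive epochs, so training would stop after epoch 5.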
Override configuration
Whether to override fine-tuning-specific settings (frozen and removable layers) as well as generic training configurations.
Freeze up to
Name/ID of the layer up to which all layers should be frozen. Freezing a layer means that its parameters (biases, weights) will not change during training/fine-tuning. This concept is sometimes also referred to as feature extraction.
Continue from layer
Name/ID of the layer which the sub-network (designed inside this nest-operator) will be attached to.
List removable layers
List of layer names/IDs that will be removed from the network before training/fine-tuning.
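How the three fine-tuning parameters act together can be sketched on a plain list of layer names. This is only a conceptual illustration: the function and all layer names are hypothetical, and it assumes that "freeze up to" includes the named layer, which should be verified by printing the resulting model.

```python
def fine_tune_structure(layers, removable, continue_from, freeze_up_to, new_layers):
    """Return (layer name, is_frozen) pairs for the modified network."""
    # 1. "list removable layers": drop these layers from the original network.
    kept = [name for name in layers if name not in removable]
    # 2. "continue from layer": attach the new sub-network after this layer.
    attach_at = kept.index(continue_from)
    structure = kept[:attach_at + 1] + new_layers
    # 3. "freeze up to": freeze every layer up to and including this one (assumption).
    frozen_until = structure.index(freeze_up_to)
    return [(name, index <= frozen_until) for index, name in enumerate(structure)]


# MNIST -> EMNIST scenario: replace the 10-neuron output with a 26-neuron one.
result = fine_tune_structure(
    layers=["dense_1", "dense_2", "output_10"],
    removable=["output_10"],
    continue_from="dense_2",
    freeze_up_to="dense_2",
    new_layers=["output_26"],
)
```

Only the newly added output layer remains trainable here, which matches the scenario of reusing the learnt features unchanged.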
Configure optimization
Whether to override default training configurations regarding network optimization.
Loss function
A loss function defines a quantitative measure for the correctness of a result. Therefore a distance is defined between estimates created within the network and the provided label information. This distance is also called loss, score or error. Make sure to choose a function that fits your problem type. The problem types regression and classification in different variations are displayed next to each method name.
- Mean Squared Error (Linear Regression): The error is calculated as the mean squared distance between estimates and label values. Use this as a starting point for regression problems.
- Mean Absolute Error (Linear Regression): The error is calculated as the mean absolute distance between estimates and label values.
- Exponential Log Likelihood (Poisson Regression): The error is calculated through the exponential log likelihood method. This is applicable for poisson distributed label values.
- L1 (Lasso Regression): This error is calculated through taking the absolute difference between prediction and label values.
- L2 (Ridge Regression): This error is calculated through calculating the squared differences between prediction and label values.
- Cross Entropy (Binary Classification): The error is calculated using cross entropy. It is assumed, that the label only consists of zeros and ones.
- Hinge Loss (Binary Classification): The error is calculated using the hinge loss function, which is used for maximum-margin classification problems. It is assumed, that the label only consists of zeros and ones. The method is similar to an indicator function.
- Squared Hinge Loss (Binary Classification): The error is calculated using the squared hinge loss function, which is used for maximum-margin classification problems. It is assumed, that the label only consists of zeros and ones.
- Multiclass Cross Entropy (Classification): The error is calculated using the negative log likelihood. Therefore this method is suitable for multi-class problems.
- Negative Log Likelihood (Classification): This is identical to multiclass cross entropy, but added for convenience.
- Cosine Proximity (Classification): The error is calculated as a cosine similarity between estimates and label values. This method is suitable for multi-class problems.
- Reconstruction Loss Entropy (Classification): The error is calculated as the Kullback Leibler Divergence between estimates and label values. This method is suitable for multi-class problems.
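For orientation, three of the losses listed above can be written out in their textbook form. This is a sketch in plain Python, not the extension's implementation, which may differ in details such as averaging versus summing; the function names are chosen for illustration only.

```python
import math


def mean_squared_error(estimates, labels):
    """Mean of the squared differences between estimates and labels."""
    return sum((e - l) ** 2 for e, l in zip(estimates, labels)) / len(labels)


def mean_absolute_error(estimates, labels):
    """Mean of the absolute differences between estimates and labels."""
    return sum(abs(e - l) for e, l in zip(estimates, labels)) / len(labels)


def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Cross entropy for labels that consist only of zeros and ones."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


mse = mean_squared_error([2.0, 4.0], [1.0, 2.0])
mae = mean_absolute_error([2.0, 4.0], [1.0, 2.0])
bce = binary_cross_entropy([0.5, 0.5], [1, 0])
```

Note how the squared error penalizes the larger deviation more strongly than the absolute error does, which is one reason to prefer one over the other for a given regression problem.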
Optimization method
An optimization method defines the strategy that determines when and how parameters (weights and biases) are updated. The provided methods allow switching between batched and non-batched methods. Batched methods like Conjugate Gradient Line Search and L-BFGS perform updates based on the loss calculated after the full data set was passed through the network once. Hence these methods are more memory demanding. Non-batched methods like Stochastic Gradient Descent and Line Gradient Descent perform updates for each example. Enabling use minibatch alters this behaviour by first collecting a predefined number of examples (set with the batch size parameter) before performing an update.
Most of the time a combination of Stochastic Gradient Descent with miniBatch is very performant, while providing good results.
For ExampleSets with fewer than 10,000 examples or big ExampleSets without much redundancy, it is recommended to use a batch optimization method in combination with adaptive update mechanisms like Adam.
- Line Gradient Descent: Stochastic gradient descent with line search. This method performs weight and bias updates for each example.
- Conjugate Gradient Line Search: This method performs weight and bias updates after calculating losses for the full data set. Might be memory intensive.
- L-BFGS: This method performs weight and bias updates after calculating losses for the full data set. Might be memory intensive.
- Stochastic Gradient Descent: Stochastic gradient descent. This method performs weight and bias updates for each example.
Backpropagation
Choose the type of error propagation through the network. For most scenarios the standard setting is sufficient. For recurrent networks, e.g. when using an LSTM layer, the truncated option might be used.
- Standard: Standard backpropagation method propagating errors (defined by the chosen loss function) back through the network for updating parameter weights.
- Truncated: You can choose this option, when using a recurrent network architecture (one that uses recurrent layers like the LSTM layer). It enables another option called backpropagation length which defines the number of steps to use for one backpropagation step. Using the full length often increases the training time by a lot, due to the complexity added by hidden states of recurrent layers.
Backpropagation length
This option is available when selecting truncated as the backpropagation method. Define a value for the number of backpropagation steps to use for one error propagation.
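The effect of the backpropagation length can be pictured as cutting a long sequence into shorter segments, with errors propagated back only within one segment at a time. The following sketch only computes the segment boundaries; the function name is hypothetical and the actual handling inside the extension may differ.

```python
def truncated_segments(sequence_length, backprop_length):
    """Return (start, end) index pairs of the segments used for
    truncated backpropagation; the last segment may be shorter."""
    return [
        (start, min(start + backprop_length, sequence_length))
        for start in range(0, sequence_length, backprop_length)
    ]


# A sequence of 10 time steps with a backpropagation length of 4.
segments = truncated_segments(10, 4)
```

Each error propagation then covers at most 4 time steps instead of all 10, which is where the reduction in training time comes from.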
Configure updater
Whether to override the default network update configurations.
Updater
Method used to calculate new weight and bias values in order to minimize the chosen loss.
- SGD: Stochastic gradient descent. Uses a learning rate to adjust the extent to which weights and biases are updated.
- Adam: Adaptive moment estimation. http://arxiv.org/abs/1412.6980
- AdaMax: Similar to Adam but using the infinity norm. Recommended parameter settings: learning rate = 0.002, beta 1 = 0.9, beta 2 = 0.999. http://arxiv.org/abs/1412.6980
- AdaDelta: Similar to AdaGrad but adjusts learning rate based on moving window averages instead of all collected gradients. Recommended parameter settings: learning rate = 1.0, rho = 0.95.
- Nesterovs: Use SGD with Nesterov momentum.
- NAdam: Similar to Adam, but using the Nesterov mechanism for momentum change. Recommended parameter settings: learning rate = 0.002, beta 1 = 0.9, beta 2 = 0.999.
- AdaGrad: Uses set learning rate as a baseline and decreases it during training. The rate is adjusted for each weight and reduced when more updates are performed. Recommended parameter settings: learning rate = 0.01.
- RMSProp: Divides the gradient by a moving average of squared gradients. This is well suited for recurrent networks. Recommended parameter settings: learning rate = 0.001. The decay of the moving average is set through the rmsdecay parameter rather than through beta 1 and beta 2.
- None: Don't update weights and biases.
Learning rate
Speed of navigating through the landscape of potential weight and bias values. 0.005 and 0.01 are often good starting points. Higher learning rates can reduce the number of epochs needed for reaching a 'good' result, but they also increase the chance of missing the optimal point.
Momentum
Acceleration of a chosen learning rate. Reduces fluctuating values and can help avoid local minima / results getting stuck too early.
Rho
Exponential decay rate of the learning rate. Slows down learning, while decreasing the possibility of missing weight and bias values that result in lower losses.
Epsilon
Jitter value used to ensure numerical stability of updates. Should be very small.
Beta1
Fine tuning parameter for some updaters. In many cases this should be close to one.
Beta2
Fine tuning parameter for some updaters. In many cases this should be close to one.
Rmsdecay
Decay rate for the RMSProp update mechanism.
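To show where learning rate, beta1, beta2 and epsilon enter an update, the following sketch implements a single Adam step for one scalar weight, following the formulation in the paper linked above. The function name adam_step and the default values are chosen for illustration and are not taken from the extension.

```python
import math


def adam_step(weight, gradient, m, v, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Perform one Adam update for a single weight.

    m and v are the running averages of the gradient and the squared
    gradient; t is the 1-based update counter used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * gradient
    v = beta2 * v + (1.0 - beta2) * gradient ** 2
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    weight = weight - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return weight, m, v


# First update of a weight of 1.0 with a gradient of 2.0.
new_weight, m, v = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1)
```

In the first step the bias-corrected ratio is close to 1, so the weight moves by roughly the learning rate regardless of the gradient's magnitude. Epsilon only guards against a division by zero, which is why it should be very small.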
Configure layers
Whether to override default layer level configurations.
Weight initialization
A Deep Learning model is defined by so-called weights. Weights are set within most layers and define the model. Finding the best weight values during training is an iterative process and requires start values. Weight values are multiplied with the respective input data. At the first layer, the input data is the data provided at the training port of the Deep Learning operator. For successive layers, weights are multiplied with the output of the previous layer. Select one of the provided pre-defined methods to initialize all weights by the given strategy. Change this parameter if the loss is not decreasing during training or if it takes a long time before the loss values go down.
- Identity: Use identity matrices.
- Normal: Use a Gaussian distribution with a mean of zero and a standard deviation of 1 / sqrt(number of layer inputs).
- Ones: Use ones.
- ReLU: Use a Gaussian distribution with a mean of zero and a variance of 2 / (number of layer inputs).
- ReLU Uniform: Use a Uniform distribution from -a to a, where a = sqrt(6/(number of layer inputs)).
- Sigmoid Uniform: Use a Uniform distribution from -a to a, where a = sqrt(6 / (number of layer inputs + number of layer outputs)).
- Uniform: Use a Uniform distribution from -a to a, where a = 1 / sqrt(number of layer inputs).
- Xavier: Use a Gaussian distribution with a mean of zero and a variance of 2 / (number of layer inputs + number of layer outputs).
- Xavier Uniform: Use a Uniform distribution from -a to a, where a = sqrt(6/(number of layer inputs + number of layer outputs)).
- Zero: Initialize all weights with zero. This is rarely a good idea.
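The formulas above translate directly into code. The following sketch computes the Xavier Uniform limit and the ReLU standard deviation and draws a weight matrix using Python's standard library; the helper names and the layer sizes (784 inputs, 26 outputs) are chosen purely for illustration.

```python
import math
import random


def xavier_uniform_limit(n_in, n_out):
    """Limit a for U(-a, a) with a = sqrt(6 / (n_in + n_out))."""
    return math.sqrt(6.0 / (n_in + n_out))


def relu_stddev(n_in):
    """Standard deviation for a Gaussian with variance 2 / n_in."""
    return math.sqrt(2.0 / n_in)


def init_uniform(n_in, n_out, limit, rng):
    """Draw an n_in x n_out weight matrix from U(-limit, limit)."""
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]


rng = random.Random(42)  # fixed seed so the draw is reproducible
limit = xavier_uniform_limit(784, 26)
weights = init_uniform(784, 26, limit, rng)
```

The more inputs and outputs a layer has, the narrower the range becomes, which keeps the scale of the layer's output roughly independent of its size.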
Bias initialization
As described for the weight initialization parameter, a Deep Learning model needs starting values for the training process. While the weights are multiplied with the input data, the bias values are added on top of this product. When training a regression model for a data set with a mean target value of 10, starting with a bias initialization value of 10 could enable the network to find a fitting bias value more quickly.
Use regularization
Define whether to use regularization for weight calculation or not. Regularization can help reduce overfitting. Use a scatter plot of the data provided at the history port to check the development of the training and test loss across epochs. Oscillating / jumping loss values often indicate a need for regularization. Set either L1 or L2 to 0.0 to use only the other one. A small L2 regularization (~0.1) is often a good starting point.
L1 strength
Define the strength of L1 regularization (a penalty on the sum of all absolute weight values).
L2 strength
Define the strength of L2 regularization (a penalty on the squared weight values, so that its gradient is proportional to the weight itself).
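How the two strengths contribute to the loss can be sketched as a penalty term added on top of the chosen loss function. This is the textbook form with hypothetical names; some implementations scale the L2 term by an additional factor of 0.5, so treat the exact constants as an assumption.

```python
def regularization_penalty(weights, l1_strength, l2_strength):
    """Penalty added to the loss: l1 * sum(|w|) + l2 * sum(w^2)."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return l1_strength * l1_term + l2_strength * l2_term


weights = [0.5, -1.0, 2.0]
only_l1 = regularization_penalty(weights, l1_strength=0.1, l2_strength=0.0)
only_l2 = regularization_penalty(weights, l1_strength=0.0, l2_strength=0.1)
```

Setting one strength to 0.0 removes that term entirely, which corresponds to using only the other kind of regularization.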
Cudnn algo mode
This parameter only influences the runtime environment of the network if it is executed on a GPU. Nvidia (manufacturer of the supported GPU architecture) provides a library called cuDNN, which contains efficient implementations of various layers. cuDNN can accelerate training, but at the cost of a potentially larger memory footprint. In certain edge cases, the higher memory consumption can lead to unexpected errors. When that happens, it is advised to trade performance for lower memory usage by setting this parameter to "No workspace".
- Prefer fastest: Default setting. Has better performance, but at the cost of higher memory consumption, which could potentially pose a problem. If memory constraints are a problem, it is advised to switch over to "No workspace" and give that a try.
- No workspace: Has a lower memory footprint than "Prefer fastest", but at the cost of lower performance.
Infer input shape
Guess and log to console the input shape of the given training data.
Network type
Choose network type to configure data shape for.
- Simple Neural Network: Simple two-dimensional neural network consisting only of fully-connected, dropout, activation and batch normalization layers.
- Convolutional: A four-dimensional neural network using convolutional layers, pooling and others.
- Convolutional Flattened: A two-dimensional neural network using convolutional layers, pooling and others, but converting the input from four dimensions to two. Conversion to: [miniBatchSize, height * width * channels]
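The conversion used by the Convolutional Flattened type amounts to multiplying the three spatial dimensions into one. A minimal sketch, with a hypothetical function name and example sizes chosen for illustration:

```python
def flattened_shape(mini_batch_size, height, width, channels):
    """Convert a four-dimensional input description into the
    two-dimensional shape [miniBatchSize, height * width * channels]."""
    return [mini_batch_size, height * width * channels]


# 32 grayscale images of 28 x 28 pixels (one channel).
shape = flattened_shape(32, 28, 28, 1)
```

Each example is thus represented by a single row of 784 values instead of a 28 x 28 x 1 block.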
Input dimension
Dimension of the data as expected by the network. For simple neural networks this is often the number of regular attributes.
Height
Height dimensionality of the data.
Width
Width dimensionality of the data.
Depth
Depth dimensionality of the data.
Use local random seed
This parameter indicates if a local random seed should be used.
Local random seed
If the use local random seed parameter is checked this parameter determines the local random seed.