Add Convolutional Layer
Synopsis
Adds a convolutional layer to your neural net structure.
Description
This operator has to be placed into the subprocess of the Deep Learning, Deep Learning (Tensor) or Autoencoder operator. It adds a convolutional layer to the neural net structure.
A convolutional layer uses a randomly initialized filter, also called a kernel, that is moved over the input data. The kernel values determine how strongly each input value contributes, so applying convolution results in a selective use of the data at each step. Since the kernel is moved across the whole input data set, all available data is taken into account for the filtering. The movement of a kernel can be controlled with the so-called stride mechanism: setting stride values essentially means setting the step size for each direction the kernel is moved in.
Multiple kernels can be used on the same input data; the result of filtering with one kernel is called an activation map. The number of activation maps parameter sets how many such filters are used and thus how many new, more selective attributes are created.
A convolutional layer is often followed by a pooling layer, which aggregates values across the created activation maps.
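Below is a minimal sketch of the kernel and stride mechanics described above, written with NumPy purely for illustration; the input values, kernel size and stride are made up and are not defaults of this operator.

    import numpy as np

    def convolve2d(data, kernel, stride=(1, 1)):
        """Slide a kernel over the input with the given stride (no padding)."""
        kh, kw = kernel.shape
        sh, sw = stride
        out_h = (data.shape[0] - kh) // sh + 1
        out_w = (data.shape[1] - kw) // sw + 1
        activation_map = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                # Element-wise product of the kernel with the current patch, summed up
                patch = data[i * sh:i * sh + kh, j * sw:j * sw + kw]
                activation_map[i, j] = np.sum(patch * kernel)
        return activation_map

    data = np.arange(36, dtype=float).reshape(6, 6)       # toy 6x6 input
    kernel = np.random.randn(3, 3)                        # randomly initialized 3x3 kernel
    print(convolve2d(data, kernel, stride=(2, 2)).shape)  # -> (2, 2), one activation map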
Input
layerArchitecture
A network configuration setup with previous operators. Connect this port to the layerArchitecture output port of another add layer operator or to the layer port of the "Deep Learning" operator if this layer is the first one.
Output
layerArchitecture
The network with the configuration for this convolutional layer added. Connect this port to the next input port of another layer or the layer port on the right side of the "Deep Learning" operator.
Parameters
Number of activation maps
Provide a number that defines how many kernels are applied to the input data, each producing a so-called activation map. An activation map is the result of calculating the convolution of one kernel with the input data. Since the kernels are randomly initialized, selecting a higher number of activation maps results in more diverse activation maps and thus in more newly created attributes.
Kernel size
Provide two values to set the kernel height and width. A kernel is randomly initialized and used to convolve the input data. It is moved over the input data using a step size set by the stride size parameter.
When performing convolution on text data, the second kernel size value should be equal to the number of dimensions of the used Word2Vec model. E.g. the standard Google News Word2Vec model uses 300 dimensions to describe one word, hence a common setting for the kernel size would be 3 and 300, where the first value is the one to tune.
Stride size
Provide two values for the step sizes in the height and width dimension. The stride value is used to move the kernel by a given number of inputs.
The values chosen for the stride influence the shape of this layer's output data. Using a stride of 2, for example, results in a down-sampling of the input.
When performing convolution on text data, the second stride size value should be equal to the number of dimensions of the used Word2Vec model. E.g. the standard Google News Word2Vec model uses 300 dimensions to describe one word, hence a common setting for the stride size would be 1 and 300, where the first value is the one to tune.
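As a hedged illustration of these shape recommendations, the following NumPy sketch shows how a 3 x 300 kernel and a 1 x 300 stride relate to a sentence encoded with a 300-dimensional Word2Vec model; the sentence length and values are made up.

    import numpy as np

    sentence_length, embedding_dim = 20, 300
    sentence = np.random.randn(sentence_length, embedding_dim)  # one row per word

    kernel_size = (3, embedding_dim)   # 3 x 300: each position spans 3 whole word vectors
    stride_size = (1, embedding_dim)   # 1 x 300: move one word at a time

    # Because kernel width and stride width both equal the embedding dimension,
    # the kernel never splits a word vector; it produces one value per window of
    # 3 consecutive words.
    num_positions = (sentence_length - kernel_size[0]) // stride_size[0] + 1
    print(num_positions)  # 18 windows for a 20-word sentence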
Padding mode
Select a padding mode to use. Available padding modes are: truncated, same and causal. The padding mode defines how zeros are set around the data table.
- Truncated: Output layer size = (input size - kernel size + 2 * padding) / stride + 1 (see the sketch after this list);
- Same: Automatically add zeros around the data table based on the chosen stride size. Using a stride of '1' results in input size = output size;
- Causal: Used for 1D convolutions only. Use this for time-based data sets to ensure only previous time steps are used. Effectively it is like the option 'same', but padding is only applied on the left side of the data set. Selecting this value ignores a potentially set y-value of the padding parameter, since it would have no effect.
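The output-size arithmetic for the truncated mode can be checked with a short sketch; the numbers below are illustrative only.

    def output_size(input_size, kernel_size, padding, stride):
        """Truncated mode: (input size - kernel size + 2 * padding) / stride + 1."""
        return (input_size - kernel_size + 2 * padding) // stride + 1

    print(output_size(input_size=28, kernel_size=3, padding=0, stride=1))  # 26 (truncated)
    print(output_size(input_size=28, kernel_size=3, padding=1, stride=1))  # 28 ('same' behaviour with stride 1)
    print(output_size(input_size=28, kernel_size=3, padding=0, stride=2))  # 13 (stride 2 down-samples)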
Padding size
Provide two values for the padding size in both the height and width dimension. The padding size is used to extend the data table by adding new rows/columns filled with zeros on both sides of the chosen dimension.
Dilation factor
Provide two values for the dilation factors in both the height and width dimension. The dilation factor defines that only every n-th data point in one direction is used. Using dilation increases the receptive field of each data point without increasing the number of parameters or using a larger kernel size.
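The effect of dilation on the area a kernel sees can be summarized with a small sketch; the kernel sizes and dilation factors below are examples, not defaults.

    def effective_kernel_size(kernel_size, dilation):
        """A dilated kernel skips (dilation - 1) data points between its taps."""
        return kernel_size + (kernel_size - 1) * (dilation - 1)

    print(effective_kernel_size(3, 1))  # 3: no dilation
    print(effective_kernel_size(3, 2))  # 5: every 2nd data point is used
    print(effective_kernel_size(3, 4))  # 9: the same 3 weights cover a 9-wide area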
Activation function
Activation functions allow networks to create complex nonlinear decision boundaries. Mathematically speaking, the chosen activation function is wrapped around the result of multiplying the weights with the input data and adding the bias. Hence activation functions ensure that a layer's output is within a given range and that a general decision whether to use the output or not can be made.
Because these nonlinear functions increase the computational load during training, choosing a simple function (with a monotonic derivative) is recommended for many situations.
Choosing the activation function for the last layer of a network is slightly different from previous layers. At this point the activation function provides a conversion from the internal network state to the expected output. For regression tasks "None (identity)" might be chosen, while for classification problems "Softmax" converts the results to probabilities for the given class values. A small sketch of some of the listed functions follows the list below.
- ReLU (Rectified Linear Unit): Rectified linear unit. Activation function is max(0, x). Monotonic derivative.
- Sigmoid: Sigmoid or logistic function. Non-monotonic derivative. Sensitive to small changes in the present data. Results are in the range between 0 and 1.
- Softmax: Softmax or normalized exponential function. Resulting values are in a range between 0 and 1, while adding up to one. Hence this function can be used to map values to probability-like values.
- TanH: TanH function, similar to the sigmoid function. Non-monotonic derivative with values in the range -1 to +1.
- Cube: Cubic function. Output is the cubic of input values. https://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf
- ELU (Exponential Linear Unit): Same as ReLU for values above zero, but an exponential function below zero. Hence the derivative is only monotonic for values above zero.
- GELU (Gaussian Error Linear Unit): Gaussian Error Linear Unit. Activation function is x * Phi(x), with Phi(x) as the standard Gaussian cumulative distribution function. Difference to ReLU: input is weighted based on its value instead of its sign. https://arxiv.org/abs/1606.08415 Sigmoid version of the implementation is used.
- MISH: A self-regularized non-monotonic activation function. Activation function is x tanh (ln(1 + exp(x))). https://arxiv.org/abs/1908.08681 Sigmoid version of the implementation is used.
- Leaky ReLU: Same as ReLU for values above zero, but with a linear function for values below. Monotonic derivative.
- Rational TanH: Rational TanH approximation, element-wise function.
- Randomized ReLU: Similar to ReLU but with a randomly chosen scaling factor for the linearity. Monotonic derivative.
- Rectified TanH: Similar to ReLU, but with a TanH function for positive values instead of a linearity. Non-monotonic derivative.
- Softplus: A logarithmic function with values ranging from zero to infinity. Monotonic derivative.
- Softsign: Similar to TanH with same range and monotonicity but less prone to changes.
- SELU (Scaled ELU): Scaled exponential linear unit. Similar to ELU, but with a scaling factor. Non-monotonic derivative. https://arxiv.org/pdf/1706.02515.pdf
- None (identity): Output equals input. This function can be used e.g. within the last layer of a network to obtain a regression result. Monotonic derivative.
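For reference, a few of the listed functions are sketched below with NumPy; the Deep Learning extension uses its own internal implementations, so this only illustrates the value ranges described above.

    import numpy as np

    def relu(x):     return np.maximum(0.0, x)
    def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
    def softplus(x): return np.log1p(np.exp(x))
    def softmax(x):
        e = np.exp(x - np.max(x))   # shift for numerical stability
        return e / e.sum()

    x = np.array([-2.0, 0.0, 3.0])
    print(relu(x))                       # [0. 0. 3.]
    print(sigmoid(x))                    # values between 0 and 1
    print(softmax(x), softmax(x).sum())  # probability-like values that add up to 1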
Layer name
Provide a name for the layer for easier identification when inspecting the model or re-using it.
Use dropout
Enable dropout regularization for this layer.
Dropout rate
Define a rate between 0 and 1 that sets the probability of randomly dropping a neuron of this layer during training. Dropout is only applied during training and helps to reduce overfitting.
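A minimal sketch of (inverted) dropout, assuming NumPy; at prediction time the layer output is used unchanged, which is why dropout only affects training.

    import numpy as np

    def dropout(activations, rate, training=True):
        if not training or rate == 0.0:
            return activations
        keep = np.random.rand(*activations.shape) >= rate   # drop each neuron with probability `rate`
        return activations * keep / (1.0 - rate)            # rescale to preserve the expected output

    layer_output = np.random.randn(4, 8)
    print(dropout(layer_output, rate=0.5))   # roughly half the values are set to 0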
Overwrite networks weight initialization
Enabling this parameter allows choosing a weight initialization method for this layer that is different from the general network setting.
Weight initialization
A Deep Learning model is defined by so-called weights. Weights are set within most layers and define the model. The process of finding the best weight values during training is iterative and requires start values. Weight values are multiplied with the respective input data. At the first layer the input data is the data provided at the training port of the Deep Learning operator; for successive layers the weights are multiplied with the output of the previous layer. Select one of the provided pre-defined methods to initialize all weights with the given strategy (see the sketch after this list). Change this parameter if during training the score is not decreasing or if it takes a long time before the score goes down.
- Identity: Use identity matrices.
- Normal: Use a Gaussian distribution with a mean of zero and a standard deviation of 1 / sqrt(number of layer inputs).
- Ones: Use ones. This is rarely a good idea.
- ReLU: Use a Gaussian distribution with a mean of zero and a variance of 2 / (number of layer inputs).
- ReLU Uniform: Use a Uniform distribution from -a to a, where a = sqrt(6/(number of layer inputs)).
- Sigmoid Uniform: Use a Uniform distribution from -a to a, where a = sqrt(6 / (number of layer inputs + number of layer outputs)).
- Uniform: Use a Uniform distribution from -a to a, where a = 1 / sqrt(number of layer inputs).
- Xavier: Use a Gaussian distribution with a mean of zero and a variance of 2 / (number of layer inputs + number of layer outputs).
- Xavier Uniform: Use a Uniform distribution from -a to a, where a = sqrt(6/(number of layer inputs + number of layer outputs)).
- Zero: Initialize all weights with zero. This is rarely a good idea.
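A sketch of three of the listed strategies, assuming NumPy, with fan_in and fan_out standing for the number of layer inputs and outputs; the actual initializations are provided by the underlying library.

    import numpy as np

    def normal_init(fan_in, fan_out):
        # Gaussian with mean 0 and standard deviation 1 / sqrt(fan_in)
        return np.random.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

    def xavier_init(fan_in, fan_out):
        # Gaussian with mean 0 and variance 2 / (fan_in + fan_out)
        return np.random.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

    def xavier_uniform_init(fan_in, fan_out):
        # Uniform from -a to a, where a = sqrt(6 / (fan_in + fan_out))
        a = np.sqrt(6.0 / (fan_in + fan_out))
        return np.random.uniform(-a, a, size=(fan_in, fan_out))

    print(xavier_init(300, 64).std())  # close to sqrt(2 / 364), i.e. about 0.074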
Overwrite networks bias initialization
Enabling this parameter allows choosing a bias initialization value for this layer that is different from the general network setting.
Bias initialization
As described for the weight initialization method parameter, a Deep Learning model needs starting values for the training process. While the weights are multiplied with the input data, the bias values are added on top of this product. When training a regression model on a data set with a mean target value of 10, starting with a bias initialization value of 10 could enable the network to find a fitting bias value more quickly.
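A tiny sketch of the regression example above, assuming NumPy; the target values are made up and only illustrate taking the mean of the label as a bias starting value.

    import numpy as np

    targets = np.array([9.2, 10.5, 10.1, 9.8, 10.4])   # hypothetical regression labels
    bias_initialization = targets.mean()                # 10.0, matching the example above
    print(round(float(bias_initialization), 2))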