Understanding the back-propagation algorithm for training neural networks can sometimes be challenging, because often there is a lot of confusing terminology which varies between sources. Also it is commonly described just in terms of the mathematics. Here I present a diagrammatic explanation of back-propagation for the visually inclined. I also summarize the non-linear stages that are commonly used, and provide some philosophical insight.
The forward pass though a neural net consists of alternating stages of linear multiplication by a weight matrix and non-linear activation functions which transform the output of each linear unit independently. We can write the transformation in vector form as , and where is the input, is the output of the linear stage, is the output of the non-linear stage, and is the activation function which acts on each element of independently. For subsequent stages, the input is the output of the previous stage.
During stochastic gradient descent each input pattern from the training set is presented to the network and the final output is computed. In the case that the network is set up for regression, the desired output vector is equal to the activation of the last units of the network. The error for a single training pattern is the scalar , where is a training vector. The total error which we are trying to minimize is the sum of these errors over the whole training set.
In order to minimize the error we need to adjust the weight matrices. In order to update the weights we compute the derivative of the error with respect to each weight in the network using the chain rule of derivatives:
Here is the weight that connects linear unit with input . The first term is the derivative of the error with respect to the network output which we call and is equal to . The second term is value of the derivative of the activation function at the current working point . The last term is the derivative with respect to the weight, and this is just equal to the value of the input unit in the case of the first layer, or equal to from the previous layer in the case of hidden layers.
This formula applies for the output layer, but in order to compute the weight updates for earlier layers we have to back-propagate the errors, so that preceding stages get an error from the later stage feeding backwards to , rather than the error from direct comparison with the training vector.
This is where a diagram really helps to explain things. Below you can see the flow of information in a two layer neural net which is set up for regression:
In the forward pass the input vector is multiplied by the weights for the first layer and the result is summed to produce the left most value of . The activation function is applied to get the output of the first layer . This process is repeated for the second layer. The output of the network is subtracted from the training vector to form the first error vector for the output. This then gets multiplied by the value of the derivative of the activation function which uses the forward value of from the same layer. This gives the error .
The derivative with respect to the weights is computed by multiplying the layer input or with the value of (using the path shown in red). This derivative is used for gradient descent using the stochastic update formula
where is the learning rate.
The new values of the error for the previous layer are formed by ‘pushing’ the error back through the weights (summing the contributions). This step is only necessary for hidden layers because the error vector is not needed at the network input.
If the network has bias weights then these are updated by assigning them a virtual input unit which has an activation value of 1.
If a network is set up for classification, the output layer is normally a softmax function:
This converts the linear activation into a probability. The training set consists of vectors where all elements have the value zero except for the one which is the correct class, which has the value 1. The error is given by:
You can see that it is almost the same structurally except that the output nonlinearity has been replaced by the softmax forward calculation and the error feedback is simplified.
With regards to back propagating the error through the weights in a fully connected network, this operation is simply multiplication by the matrix transpose.
For a convolutional neural network we have a slightly more complex situation because there is only one small set of weights shared for the whole layer and this affects the weights update and the back propagation. For the weights update, the situation is fairly straightforward – one sums the contribution of weight deltas from all the layer input and output pairs that are connected through that weight.
For back propagating the error the situation seems quite complex because there may be downsampling and so keeping track of which weights connect back from into the earlier stage from the perspective of the input seems quite complex. For this reason it is useful to think of the operation as pushing the error back through the weights because then you do a loop over the output error variables and for each one sequentially add in its contribution to the input error variables. While in the forward pass you gather the network inputs together through the convolution kernel, for the backward pass you spread the errors back out by pushing them back through the same convolution kernel. This is slightly less computationally efficient but avoids bookkeeping headaches.
Various activation functions are possible for the nonlinear part of the network. Here are some of these and their derivatives:
Rectified linear units:
Soft rectified units:
Note that one must diligently check the gradients of a neural network because it is very easy to learn garbage with slight errors in the gradient calculations. The network might generate half-decent results and still be incorrect leading to all kinds of headaches.
I find the symbolic derivative calculator web site to be an excellent resource.
One thing that is not commonly mentioned is that the popular rectified linear units are not amenable to numerical gradient checking. This is because when one computes the difference in the output error by changing a weight ever so slightly and then computes the gradient with respect to these weights, the gradient may change suddenly when the unit activations go from slightly negative to slightly positive. This step change in gradient makes it impossible to do a gradient check in this way. The solution is to bypass the rectified units by making them linear during the gradient check procedure.
On a final philosophical note, its interesting to see that back-propagation is almost a local operation. In the brain, weight modification takes place between neurons using local update rules, such as Hebbian synaptic modification. Back propagation seems like it does not do that because the forward and backwards paths are seperate. But if the corresponding forward variables and error variables were somehow ‘merged’ then the weight update would be a local operation.
The only way that the variables could be merged is if there is a change from neurons representing feedforward activation to feedback errors during the temporal evolution of the response to a stimulus presentation. There are many theories that due to the large number of back-projecting connections in the cortex the brain actually does learn from an error signal that is created by subtracting a reconstructed expectation of the input from its actual input in a backward pass. This idea is consistent with the merging of the roles of feedforward activation and error back propagation in the same neurons, where each neuron codes an evolving fraction of activation and error signals.