# Understanding back-propagation

Understanding the back-propagation algorithm for training neural networks can be challenging, because the terminology is often confusing and varies between sources. It is also commonly described purely in terms of the mathematics. Here I present a diagrammatic explanation of back-propagation for the visually inclined. I also summarize the non-linear stages that are commonly used, and provide some philosophical insight.

The forward pass through a neural net consists of alternating stages: a linear stage, which multiplies by a weight matrix, and a non-linear activation function, which transforms the output of each linear unit independently. We can write the transformation in vector form as ${\bf z}={\bf Wx}$ and ${\bf y}=g({\bf z})$, where ${\bf x}$ is the input, ${\bf z}$ is the output of the linear stage, ${\bf y}$ is the output of the non-linear stage, and $g({\bf z})$ is the activation function, which acts on each element of ${\bf z}$ independently. For subsequent stages, the input ${\bf x}$ is the output ${\bf y}$ of the previous stage.
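To make the notation concrete, here is a minimal sketch of one forward stage in plain Python. The function names (`matvec`, `relu`, `forward_stage`, `forward`) and the small weight matrices are my own illustrative choices, and the rectified linear unit is just one possible choice for $g$.

```python
# Minimal sketch of the forward pass: z = W x, then y = g(z).
# All names and numbers here are illustrative assumptions.

def matvec(W, x):
    """Linear stage: multiply weight matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def relu(z):
    """One possible activation g, applied to each element of z independently."""
    return [max(0.0, z_i) for z_i in z]

def forward_stage(W, x, g=relu):
    """Return both z (linear output) and y (non-linear output) for one stage."""
    z = matvec(W, x)
    y = g(z)
    return z, y

def forward(weights, x, g=relu):
    """Chain stages: the output y of one stage is the input x of the next."""
    for W in weights:
        _, x = forward_stage(W, x, g)
    return x

W1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 1.0]]  # 3x2 matrix, made-up values
W2 = [[1.0, 1.0, 1.0]]                       # 1x3 matrix, made-up values
x = [2.0, 1.0]

z1, y1 = forward_stage(W1, x)   # z1 = [0.0, 1.5, -1.0], y1 = [0.0, 1.5, 0.0]
out = forward([W1, W2], x)      # [1.5]
```

Keeping both ${\bf z}$ and ${\bf y}$ from each stage is convenient, since back-propagation needs them again when computing gradients.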

# Energy pooling in neural networks for digit recognition

Having trained a two-layer neural network to recognize handwritten digits with reasonable accuracy, as described in my previous blog post, I wanted to see what would happen if neurons were forced to pool the outputs of pairs of rectified units according to a fixed weight schedule.

I created a network which is almost a three-layer network: the outputs of pairs of first-layer rectified units are combined additively before being passed to the second fully connected layer. The first layer takes a 28×28 input and produces a 50-unit output (the hidden layer) with rectified linear units. Pairs of these units are then averaged to reduce the neuron count to 25, and the second fully connected layer reduces this to 10. Finally, the softmax classifier is applied.
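The pooling step can be sketched as follows. This assumes that adjacent units (0 and 1, 2 and 3, and so on) form the pairs and that each pair is combined with fixed weights of 0.5. The pairing scheme is my assumption for illustration, and the pre-activation values are made up.

```python
# Sketch of the fixed pooling stage: 50 rectified units -> 25 pooled units.
# Pairing adjacent units is an assumption made for this example.

def relu(z):
    """Rectified linear activation, applied element-wise."""
    return [max(0.0, v) for v in z]

def pool_pairs(h):
    """Average adjacent pairs (h[0], h[1]), (h[2], h[3]), ...; halves the length."""
    if len(h) % 2 != 0:
        raise ValueError("expected an even number of units")
    return [0.5 * (h[i] + h[i + 1]) for i in range(0, len(h), 2)]

z = [float(i - 25) for i in range(50)]  # made-up first-layer linear outputs
h = relu(z)                             # 50 rectified hidden units
pooled = pool_pairs(h)                  # 25 units fed to the second layer
```

Because the pooling weights are fixed rather than learned, this stage adds no trainable parameters, which is why the network is only "almost" a three-layer network.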

# Training two-layer neural nets on MNIST digits

In my last blog post I talked about trying out my code for training neural nets on a simple one-layer network which consists of a single weight layer and a softmax output. In this post I share results for training a fully connected two-layer network.

In this network, the input goes from 28×28 image pixels down to 50 hidden units. Then there is a rectified linear activation function. The second layer goes from the 50 hidden units down to 10 units, and finally there is the softmax output stage for classification.
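As a rough sketch of this architecture's forward pass, written in Python rather than my C++ library: the weights are random placeholders instead of trained values, bias terms are omitted for brevity, and all function names are illustrative.

```python
import math
import random

# Sketch of the 784 -> 50 -> 10 network's forward pass with random weights.
# Nothing here is trained; the shapes are the point of the example.

def matvec(W, x):
    """Multiply weight matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(z):
    """Rectified linear activation, applied element-wise."""
    return [max(0.0, v) for v in z]

def softmax(z):
    """Convert scores to probabilities; subtracting the max avoids overflow."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def init_weights(n_out, n_in, rng, scale=0.01):
    """Small random Gaussian weights (an assumed initialization)."""
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]

def predict(W1, W2, pixels):
    """28x28 pixels -> 50 hidden units -> ReLU -> 10 units -> softmax."""
    hidden = relu(matvec(W1, pixels))
    return softmax(matvec(W2, hidden))

rng = random.Random(0)
W1 = init_weights(50, 28 * 28, rng)             # first layer: 50 x 784
W2 = init_weights(10, 50, rng)                  # second layer: 10 x 50
image = [rng.random() for _ in range(28 * 28)]  # stand-in for a flattened digit

probs = predict(W1, W2, image)                  # 10 class probabilities
label = max(range(10), key=lambda k: probs[k])  # most probable digit
```

The predicted digit is simply the index of the largest softmax output.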

When I train this network on the MNIST handwriting dataset, I get a test error rate of 2.89%, which is pretty good and actually lower than other results quoted on the MNIST web site. It is interesting to inspect the patterns of the weights for the first layer below (here I organized the weights for the 50 hidden units as a 10×5 matrix):

# Training neural nets on MNIST digits

Recently I have been experimenting with a C++ deep learning library that I have written, testing it out on the MNIST handwritten digits dataset. This dataset contains 60,000 training images and 10,000 test images, each of size 28×28 pixels. I have been trying to reproduce some of the error rates that Yann LeCun reports on the MNIST site. The digits are written in many different styles and some of them are quite hard to classify, so the dataset makes a good test for neural net learning.
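For reference, the error rate here is just the percentage of test images whose predicted digit differs from the true label. A minimal sketch, with a hypothetical helper name and made-up labels:

```python
# Sketch of how a test error rate is computed from predicted and true labels.
# error_rate_percent is a hypothetical helper, not part of my library.

def error_rate_percent(predicted, actual):
    """Percentage of examples where the predicted label differs from the true one."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return 100.0 * wrong / len(actual)

# Made-up labels for a test set of 10,000 images, with 150 forced mistakes.
actual = [i % 10 for i in range(10000)]
predicted = list(actual)
for i in range(150):
    predicted[i] = (actual[i] + 1) % 10

rate = error_rate_percent(predicted, actual)  # 1.5
```

With 10,000 test images, each misclassified digit contributes 0.01 percentage points to the error rate.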