# Training neural nets on MNIST digits

Recently I have been testing a C++ deep learning library that I wrote by running it on the MNIST handwritten digits data set. The data set contains 60,000 training images and 10,000 test images, each 28×28 pixels. I have been trying to reproduce some of the error rates that Yann LeCun reports on the MNIST site. The digits are written in many different styles and some of them are quite hard to classify, so the data set makes a good test for neural net learning.

I have experimented with four-layer convolutional nets and achieved sub-1% error rates with those, but I wanted to break the problem down, so I have been trying out a single-layer linear classifier with a softmax output stage. That is a fully connected network that goes from 28×28 = 784 inputs down to 10 outputs. This is a very simple network that generalizes poorly, for example when the input is additionally translated or warped.
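To make the structure concrete, here is a minimal sketch of the forward pass of such a classifier. It is not the code from my library: the names `LinearSoftmax`, `Forward`, `kInputs` and `kClasses` are made up for illustration, and the row-major weight layout is an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t kInputs = 28 * 28;  // one input per pixel
constexpr std::size_t kClasses = 10;      // digits 0 through 9

// Hypothetical parameter container; weights are stored row-major,
// one row of kInputs values per output class (an assumed layout).
struct LinearSoftmax {
    std::vector<float> weights = std::vector<float>(kClasses * kInputs, 0.0f);
    std::array<float, kClasses> biases{};
};

// Returns the class probabilities for one flattened 28x28 image.
std::array<float, kClasses> Forward(const LinearSoftmax& net,
                                    const std::vector<float>& pixels) {
    assert(pixels.size() == kInputs);

    // Linear stage: one weighted sum plus bias per output class.
    std::array<float, kClasses> logits{};
    for (std::size_t c = 0; c < kClasses; ++c) {
        float sum = net.biases[c];
        for (std::size_t i = 0; i < kInputs; ++i) {
            sum += net.weights[c * kInputs + i] * pixels[i];
        }
        logits[c] = sum;
    }

    // Softmax stage: subtracting the largest logit keeps exp() from
    // overflowing and does not change the result.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::array<float, kClasses> probs{};
    float total = 0.0f;
    for (std::size_t c = 0; c < kClasses; ++c) {
        probs[c] = std::exp(logits[c] - max_logit);
        total += probs[c];
    }
    for (float& p : probs) {
        p /= total;
    }
    return probs;
}
```

The predicted digit is simply the index of the largest probability, so the whole model has 784 × 10 weights plus 10 biases.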

I was able to achieve about a 7.5% error rate using this network. I used stochastic gradient descent with a mini-batch size of 100 and a starting learning rate of 0.1, which gets multiplied by 0.99 every 10,000 presentations of randomly selected inputs. The weights were initialized to Gaussian random values with a mean of zero and a standard deviation of $\sqrt{2/n}$, where $n$ is the number of inputs to a unit.
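The schedule and initialization described above could be sketched roughly as follows. The function names `LearningRate` and `InitWeights` are hypothetical, and I am assuming here that the decay is applied as a step function of the total number of examples presented so far and that the Gaussian has zero mean.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Learning rate after `presentations` training examples have been shown:
// starts at 0.1 and is multiplied by 0.99 once per 10,000 presentations.
double LearningRate(std::size_t presentations) {
    const double initial = 0.1;
    const double decay = 0.99;
    const std::size_t interval = 10000;
    const std::size_t steps = presentations / interval;  // integer division
    return initial * std::pow(decay, static_cast<double>(steps));
}

// Draws fan_in * units weights from a zero-mean Gaussian with standard
// deviation sqrt(2 / fan_in), where fan_in is the number of inputs to a unit.
std::vector<float> InitWeights(std::size_t fan_in, std::size_t units,
                               unsigned seed) {
    assert(fan_in > 0);
    std::mt19937 rng(seed);
    const float stddev = std::sqrt(2.0f / static_cast<float>(fan_in));
    std::normal_distribution<float> gauss(0.0f, stddev);
    std::vector<float> weights(fan_in * units);
    for (float& w : weights) {
        w = gauss(rng);
    }
    return weights;
}
```

With a mini-batch size of 100, the presentation counter advances by 100 per update, so the rate drops once every 100 updates. For the 784-input layer the initial standard deviation works out to about 0.05.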

You can see here the learned weights for this single-layer network, and you can make out the outlines of the numbers 0 through 9 in the weight patterns. For example, in the case of zero, the classifier has negative weights in the center of the zero, meaning that any foreground there reduces the probability that a zero output is generated. Likewise for the one, the white central area represents positive weights that increase the likelihood of a one if any mark is present there.

This network obtained a 7.7% test error rate. However, if the training and testing images are each independently bias-gain normalized, the network performs better. Bias-gain normalization shifts and rescales each image so that its pixels have a mean of zero and a standard deviation of 1. This improves the conditioning of the Hessian of the error at the minimum and allows the weights to adapt more independently. The neuron biases can only compensate for additive DC levels over the whole data set, not for within-class variations, so removing the mean of each source image is helpful. The difference is small for this network, though: the test error was only reduced to about 7.3%, but the initial learning was faster.
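A per-image normalization along these lines might look like the sketch below. The name `NormalizeImage` and the small `epsilon` guard against constant images are my additions for illustration, not necessarily what the library does.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalizes one image in place so that its pixels have zero mean and
// (approximately) unit standard deviation.
void NormalizeImage(std::vector<float>& pixels) {
    assert(!pixels.empty());
    const double n = static_cast<double>(pixels.size());

    double mean = 0.0;
    for (float v : pixels) {
        mean += v;
    }
    mean /= n;

    double variance = 0.0;
    for (float v : pixels) {
        const double d = v - mean;
        variance += d * d;
    }
    variance /= n;

    // Assumption: a tiny epsilon avoids dividing by zero on a blank image.
    const double epsilon = 1e-8;
    const double inv_std = 1.0 / std::sqrt(variance + epsilon);
    for (float& v : pixels) {
        v = static_cast<float>((v - mean) * inv_std);
    }
}
```

The same function is applied to every training and test image on its own, so no statistics are shared between images.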

I also tried replacing the softmax output stage by tanh, rectified linear, or logistic non-linearities. The best test results I got for these were around 11%. I found that the learning rates had to be hand tuned and were significantly different for these methods, and too high a value could easily blow up the error.

The next thing I tried was to look at the effect of random translations on the ability of the single-layer network to recognize the digits. I expected this spatial generalization ability to be poor, and it was. In particular, the pattern of weights becomes very spatially blurred with additional translation, leading to lower performance. I trained the system for 20 epochs with random shifts, and the error rate went from 7.7% to 42.6% for $\pm 4$ pixel offsets.
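For reference, a random shift of this kind can be sketched as below. The names `Translate` and `RandomTranslate` are hypothetical, and two details are assumptions rather than a description of the actual experiment: offsets are drawn uniformly from the integers in $[-4, 4]$ independently for each axis, and pixels shifted in from outside the frame are filled with a background value of 0.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

constexpr int kSide = 28;  // MNIST images are 28x28

// Returns a copy of `src` shifted right by dx and down by dy pixels.
// Pixels that would come from outside the frame are set to `background`.
std::vector<float> Translate(const std::vector<float>& src, int dx, int dy,
                             float background = 0.0f) {
    assert(src.size() == static_cast<std::size_t>(kSide * kSide));
    std::vector<float> dst(src.size(), background);
    for (int y = 0; y < kSide; ++y) {
        const int sy = y - dy;  // source row for this destination row
        if (sy < 0 || sy >= kSide) continue;
        for (int x = 0; x < kSide; ++x) {
            const int sx = x - dx;  // source column for this destination column
            if (sx < 0 || sx >= kSide) continue;
            dst[static_cast<std::size_t>(y * kSide + x)] =
                src[static_cast<std::size_t>(sy * kSide + sx)];
        }
    }
    return dst;
}

// Shifts the image by independent uniform integer offsets in
// [-max_shift, max_shift] along each axis (assumed distribution).
std::vector<float> RandomTranslate(const std::vector<float>& src,
                                   int max_shift, std::mt19937& rng) {
    std::uniform_int_distribution<int> offset(-max_shift, max_shift);
    const int dx = offset(rng);
    const int dy = offset(rng);
    return Translate(src, dx, dy);
}
```

Calling `RandomTranslate(image, 4, rng)` on each presentation gives the $\pm 4$ pixel case. Because a linear classifier assigns one weight per pixel position, each shifted copy pulls the weights toward a different location, which is consistent with the blurring seen in the weight patterns.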

Next I am going to try out a two layer fully connected network. The reason I am doing this set of experiments is partly to test the library, and partly because I am trying to get a really good digit recognizer that uses not too many weights and is very fast computationally, because I want to use it in a real-time demo on a very low power embedded processor.