Training two-layer neural nets on MNIST digits

In my last blog post I talked about trying out my code for training neural nets on a simple one-layer network, consisting of a single weight layer and a softmax output. In this post I share results from training a fully connected two-layer network.

In this network, the first layer maps the 28×28 = 784 input pixels down to 50 hidden units, followed by a rectified linear activation function. The second layer maps the 50 hidden units down to 10 units, and finally a softmax output stage produces the classification.
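To make the architecture concrete, here is a minimal NumPy sketch of the forward pass. The function and variable names (forward, W1, b1, W2, b2) are my own shorthand for this post, not identifiers from the actual training code.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Forward pass: 784 pixels -> 50 ReLU hidden units -> 10 units -> softmax.

    X  : (batch, 784) flattened 28x28 images
    W1 : (784, 50),  b1 : (50,)
    W2 : (50, 10),   b2 : (10,)
    """
    hidden = np.maximum(0.0, X @ W1 + b1)         # first layer + rectified linear activation
    logits = hidden @ W2 + b2                     # second layer
    logits -= logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # softmax class probabilities
    return hidden, probs
```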

When I train this network on the MNIST handwriting dataset I get a test error rate of 2.89%, which is pretty good and actually lower than some comparable results quoted on the MNIST web site. It is interesting to inspect the patterns of the first-layer weights, shown below (here I organized the weights for the 50 hidden units as a 10×5 grid):

[Figure: first-layer weights after training, 50 hidden units tiled as 28×28 patches]

It is apparent that adding units to a middle layer allows the network to focus on particular aspects of the input patterns, instead of trying to come up with a single weight set for each class. There are still quite a number of significant weights at many spatial locations over the input square, reminiscent of template matching, but there are also many strongly dominant sub-regions.
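For reference, a tiling like the one above can be produced with a few lines of matplotlib, assuming the first-layer weights are stored as a 784×50 matrix (an assumption; the actual storage layout is not described here):

```python
import matplotlib.pyplot as plt

def plot_first_layer(W1, rows=5, cols=10):
    """Tile the 50 hidden-unit weight vectors (the columns of a 784x50 W1)
    as 28x28 images in a rows x cols grid."""
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for i, ax in enumerate(axes.flat):
        ax.imshow(W1[:, i].reshape(28, 28), cmap="gray")
        ax.axis("off")
    plt.tight_layout()
    plt.show()
```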

It is also interesting to examine the weight update for a random mini-batch, which looks like this:

[Figure: example first-layer weight update for one mini-batch]

You can see that the one hundred training digits of the mini-batch have been accumulated into the weight-update matrix in various places, with each input weighted by the errors back-propagated from the network output. The slow accumulation of these weight updates gives rise to the final trained weights.
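For concreteness, here is a sketch of how such a mini-batch update can be computed by backpropagation through a softmax/cross-entropy output, reusing the names from the forward-pass sketch above; the actual training code may organize this differently.

```python
import numpy as np

def first_layer_update(X, hidden, probs, labels, W2):
    """Cross-entropy gradient for the first weight layer over one mini-batch.

    X      : (batch, 784) input digits
    hidden : (batch, 50) ReLU activations from the forward pass
    probs  : (batch, 10) softmax outputs
    labels : (batch,) integer class labels
    """
    batch = X.shape[0]
    d_logits = probs.copy()
    d_logits[np.arange(batch), labels] -= 1.0     # output error: probs minus one-hot target
    d_hidden = (d_logits @ W2.T) * (hidden > 0)   # back-propagate through W2 and the ReLU
    # Each input digit, scaled by its back-propagated error, is accumulated
    # into the update matrix -- this is what the figure above shows.
    dW1 = X.T @ d_hidden / batch
    return dW1
```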

The weights for the second layer are typically not very interesting, because they retain a very noisy appearance despite obviously changing from their initialization. Probably the network is not constrained enough to put strong organizational pressure on these weights; that is, there may be many degrees of freedom in choosing combinations of first- and second-layer weights, so the second-layer weights never appear to organize very strongly. I am not sure.

After doing this experiment, I decided to add ±4 pixels of random jitter to the position of the numerals. This forces the network to be more position invariant (one way to implement such jitter is sketched after the next figure). This level of jitter is handled poorly by a single-layer network, whose error rate rises from 7.7% to 42.6%, because its “template matching” is unable to generalize over space. The two-layer fully connected network does much better: the error rate only increases from 2.89% to 5.36%. In addition, the first-layer weights begin to look quite interesting:

[Figure: first-layer weights after training with ±4 pixel jitter]

It is evident that the weights are more focused and indeed look like oriented spatial filters with different frequencies, bandwidths, and phases. This is typical of receptive fields in the visual system, and the network learns it even though it is not trained on natural images. It seems that jittering the input forces the system to ignore long-range spatial correlations in the input images and to focus on a limited region instead.
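Here is one plausible way to implement the ±4 pixel jitter, by zero-padding each image and cropping a randomly offset 28×28 window; this is a sketch, not necessarily how it was done in these experiments.

```python
import numpy as np

def jitter(img, max_shift=4, rng=None):
    """Shift a 28x28 image by a random integer offset in [-max_shift, +max_shift]
    along each axis, filling the exposed border with zeros."""
    if rng is None:
        rng = np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    padded = np.pad(img, max_shift)               # zero-pad the border
    return padded[max_shift + dy : max_shift + dy + 28,
                  max_shift + dx : max_shift + dx + 28]
```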

The accumulated weight updates for each mini-batch appear much more blurred out, because the digits are no longer precisely aligned on top of each other.

One annoying thing when initializing these networks is that one or more of the hidden units commonly end up “dead”, producing a zero output for most of the training patterns. Sometimes these units restart after a while, but often their weights just remain as noise. Optimal weight initialization is hard; Erhan et al. discuss doing it with greedy unsupervised pre-training.
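A simple diagnostic for this, not part of the original experiments, is to measure how often each hidden unit's ReLU output is exactly zero over a set of training patterns:

```python
import numpy as np

def dead_unit_mask(X, W1, b1, threshold=0.99):
    """Flag hidden units whose ReLU output is zero for nearly all patterns in X.
    Returns a boolean mask over the hidden units."""
    hidden = np.maximum(0.0, X @ W1 + b1)
    zero_fraction = (hidden == 0.0).mean(axis=0)   # fraction of patterns giving zero output
    return zero_fraction > threshold
```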

In my final experiment for this blog article, I added Gaussian noise to the input images (both testing and training), on top of the jitter mentioned earlier. The input values range from 0 to 1, and the added noise has a standard deviation of 0.2, which is a lot. To give some examples, here are a few digits processed in this way:

[Figure: example digits with jitter and added Gaussian noise]
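The noise step itself is straightforward; here is a sketch, where clipping the noisy values back into [0, 1] is my own assumption rather than something stated in the post:

```python
import numpy as np

def add_input_noise(X, sigma=0.2, rng=None):
    """Add zero-mean Gaussian noise with standard deviation sigma to images
    whose pixel values lie in [0, 1], then clip back into that range."""
    if rng is None:
        rng = np.random.default_rng()
    return np.clip(X + rng.normal(0.0, sigma, size=X.shape), 0.0, 1.0)
```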

The updates to the first-layer weights for each mini-batch are now quite diffuse and noisy, for example:

[Figure: first-layer weight update for one mini-batch with noisy inputs]

The result of the training process is that the network weights end up having somewhat lower spatial frequencies than before, but they retain a similar localized receptive-field structure:

[Figure: first-layer weights after training with jitter and input noise]

The final error rate on the test set ends up being 7.9%, which is still pretty good for this level of input corruption.