# Energy pooling in neural networks for digit recognition

Having trained a two-layer neural network to recognize handwritten digits with reasonable accuracy, as described in my previous blog post, I wanted to see what would happen if the network were forced to pool the outputs of pairs of rectified units using fixed, equal weights.

I created a network that is almost a three-layer network: the outputs of pairs of first-layer rectified units are averaged before being passed to the second fully connected layer. The first layer takes a 28×28 input and produces a 50-unit hidden layer of rectified linear units. Pairs of these units are then averaged, reducing the unit count to 25, and the second fully connected layer reduces this to 10 outputs. Finally, a softmax classifier is applied.
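To make the layer sizes concrete, here is a minimal NumPy sketch of the forward pass described above. It is only an illustration, not the code used for these experiments: the function name `forward`, the random initial weights, and the choice to pair consecutive units (0,1), (2,3), ... are all assumptions.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass: 784 inputs -> 50 ReLU units -> 25 pair averages -> 10 softmax outputs."""
    h = np.maximum(0.0, x @ W1 + b1)                      # (batch, 50) rectified hidden units
    # Average consecutive units (0,1), (2,3), ...; this pairing order is an assumption.
    pooled = h.reshape(h.shape[0], -1, 2).mean(axis=2)    # (batch, 25)
    logits = pooled @ W2 + b2                             # (batch, 10)
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Hypothetical random inputs and weights, just to check the shapes.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 28 * 28))
W1 = rng.normal(scale=0.01, size=(28 * 28, 50))
b1 = np.zeros(50)
W2 = rng.normal(scale=0.1, size=(25, 10))
b2 = np.zeros(10)
probs = forward(x, W1, b1, W2, b2)
```

The pooling step has no trainable parameters, which is why the network is only "almost" three layers.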

This architecture is equivalent to a two-layer network in which the second layer has half as many free weights, because the two hidden units in each pair share the same weight. I was interested to know how well this would perform compared to the earlier network with more parameters, and what kind of first-layer weights would be learned.
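The equivalence can be checked numerically. The sketch below uses random stand-in values (not trained weights) and assumes consecutive units are paired; it compares "average pairs, then apply a 25-to-10 layer" against a 50-to-10 layer whose weights are tied in pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
h = np.maximum(0.0, rng.normal(size=(3, 50)))   # stand-in for rectified hidden outputs
W2 = rng.normal(size=(25, 10))                  # weights of the 25 -> 10 layer

# Route 1: average each pair, then apply the 25 -> 10 layer.
pooled = h.reshape(3, 25, 2).mean(axis=2)
out_pooled = pooled @ W2

# Route 2: a 50 -> 10 layer in which rows 2k and 2k+1 are the same weights,
# halved because the pooling averages rather than sums.
W2_tied = np.repeat(W2, 2, axis=0) / 2.0        # (50, 10)
out_tied = h @ W2_tied

same = np.allclose(out_pooled, out_tied)
```

Both routes give the same outputs, so the pooled network has 25 × 10 free second-layer weights instead of 50 × 10.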

The network achieves an error rate of 8.45% on MNIST digits with $\pm 4$ pixel position jitter and added Gaussian noise (described previously). This compares to 7.9% for the earlier full network without shared weights, which has twice as many free second-layer weights. However, it learns interesting weight patterns:

Notice that consecutive weight maps are similar, particularly in the orientation of the features they select for. Often the weights are complementary in sign or are shifted spatially relative to each other. This provides some position independence, in a manner similar to complex cell sub-units in the visual cortex: the rectified positive output of one unit partially overlaps with that of the other unit in the pair, which increases the area over which a positive response is generated without changing the linear spatial selectivity.

In a similar manner, I tried combining pairs of first-layer linear outputs geometrically (as a root sum of squares), without the rectification layer. The formula is $y = \sqrt{x_1^2 + x_2^2 + \epsilon}$. (The small constant $\epsilon$ is needed to remove the derivative singularity at zero.) Without rectification, both the positive and negative parts of the first-layer responses contribute positively to the hidden layer inputs, which introduces a significant second-order nonlinearity. In particular, if the weights of the input layer end up generating responses in spatial phase quadrature, the unit will be completely phase independent, as in complex cell receptive fields. This contributes to spatial location invariance.
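The phase-independence argument can be illustrated with a toy example. The sketch below is an assumption-laden stand-in rather than the network itself: it uses a hand-built 1-D cosine/sine filter pair instead of learned 2-D weights, and the names `energy_pool` and `energy_pool_grad` are hypothetical. It shows that the pooled response to a sinusoid stays constant as the input phase changes, and that the gradient $\partial y / \partial x_i = x_i / y$ stays finite at zero thanks to $\epsilon$.

```python
import numpy as np

def energy_pool(a, eps=1e-6):
    """y = sqrt(x1^2 + x2^2 + eps) over consecutive pairs of linear outputs."""
    pairs = a.reshape(a.shape[0], -1, 2)
    return np.sqrt((pairs ** 2).sum(axis=2) + eps)

def energy_pool_grad(a, eps=1e-6):
    """dy/dx_i = x_i / y, which is finite at zero because eps > 0."""
    pairs = a.reshape(a.shape[0], -1, 2)
    y = np.sqrt((pairs ** 2).sum(axis=2, keepdims=True) + eps)
    return (pairs / y).reshape(a.shape)

# A quadrature pair: cosine and sine filters at the same spatial frequency.
n = 64
t = np.arange(n)
freq = 2 * np.pi * 4 / n                                  # 4 cycles across the window
filters = np.stack([np.cos(freq * t), np.sin(freq * t)], axis=1)   # (64, 2)

# Sinusoidal inputs at that frequency, shifted through 8 different phases.
phases = np.linspace(0, 2 * np.pi, 8, endpoint=False)
signals = np.cos(freq * t[None, :] + phases[:, None])     # (8, 64)

linear = signals @ filters                                # each output varies with phase
responses = energy_pool(linear)                           # pooled output does not
grad_at_zero = energy_pool_grad(np.zeros((1, 2)))
```

Each linear output on its own swings between positive and negative as the phase shifts, while the pooled response is the same for every phase.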

The results show that, as if by magic, most of the input weights for the geometrically summed units do end up nicely in phase quadrature:

The error rate for this network is 7.53%, somewhat better than both the pair-averaged network (8.45%) and the earlier full network (7.9%). This is probably because the geometric addition introduces better spatial invariance than adding rectified outputs.

Incidentally, if you run this on the raw MNIST data (without the added distortions that make recognition harder), the test error rate is a very respectable 2.1%, with the training error down at 0.19%. This is a good result for a two-layer net with 50 hidden units.

Next, I will be exploring convolutional layers.