The code below generates such a random basis by concatenating random Householder transforms.

import numpy
import random
import math

def make_orthonormal_matrix(n):
    """
    Makes a square matrix which is orthonormal
    by concatenating random Householder transformations.
    """
    A = numpy.identity(n)
    d = numpy.zeros(n)
    d[n-1] = random.choice([-1.0, 1.0])
    for k in range(n-2, -1, -1):
        # generate random Householder transformation
        x = numpy.random.randn(n-k)
        s = math.sqrt((x**2).sum())  # norm(x)
        sign = math.copysign(1.0, x[0])
        s *= sign
        d[k] = -sign
        x[0] += s
        beta = s * x[0]
        # apply the transformation
        y = numpy.dot(x, A[k:n, :]) / beta
        A[k:n, :] -= numpy.outer(x, y)
    # change sign of rows
    A *= d.reshape(n, 1)
    return A

n = 100
A = make_orthonormal_matrix(n)

# test the matrix for orthonormality
maxdot = 0.0
maxlen = 0.0
for i in range(n-1):
    maxlen = max(math.fabs(math.sqrt((A[i, :]**2).sum()) - 1.0), maxlen)
    for j in range(i+1, n):
        maxdot = max(math.fabs(numpy.dot(A[i, :], A[j, :])), maxdot)
print("max dot product = %g" % maxdot)
print("max vector length error = %g" % maxlen)

Another way to do this is to do a QR decomposition of a random Gaussian matrix. However the code above avoids calculating the R matrix.

Postscript:

I did some timing tests and it seems like the QR method is about 3 times faster in Python 3:

import numpy
from scipy.linalg import qr

n = 4
H = numpy.random.randn(n, n)
Q, R = qr(H)
print(Q)

This brings us to the shell property of high dimensional spaces.

Let’s consider a normal (Gaussian) distribution in D-dimensions. In 1D it is obvious that all the probability bulk is in the middle, near zero. In 2D the peak is also in the middle. One might imagine that for any number of dimensions this would continue to hold, but this is false. The shell property of high dimensional spaces shows that the probability mass of a D-dimensional Gaussian distribution where D>>3 is all concentrated in a thin shell at a distance of sqrt(D) away from the origin, and the larger the value of D, the thinner that shell becomes. This is because the volume of the shell grows exponentially with D compared with the volume around the origin, and so with large D there is essentially zero probability that a point will end up near the center: Mr Average does not exist.

This can be seen on the following graph which plots a frequency histogram of the distance from the origin of Gaussian random vectors in D-dimensions. You can see that the distribution becomes more and more narrow.

From this graph you can see that if you have a high dimensional Gaussian random vector you can approximately normalize its length by simply dividing each element by sqrt(D), instead of actually dividing by the L2 norm.
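This concentration is easy to verify numerically. The following sketch (my own illustration, not from the original post; sample counts are arbitrary) draws Gaussian vectors and compares their lengths with sqrt(D):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1000
# draw 2000 standard Gaussian vectors in D dimensions
X = rng.standard_normal((2000, D))
norms = np.linalg.norm(X, axis=1)

# the norms concentrate tightly around sqrt(D)
print(norms.mean() / np.sqrt(D))   # close to 1
print(norms.std() / np.sqrt(D))    # small relative spread
```

Dividing each element by sqrt(D) therefore rescales the vector to approximately unit length without computing the L2 norm.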

Now let's consider the related property of points which are constrained to lie on the surface of a hypersphere. We might then ask how the properties of high dimensional spaces affect the distance between these points. If you plot the geometric distances between pairs of points on a hypersphere you find that the distances between them all tend to about sqrt(2) as the number of dimensions increases.

So with 100 dimensions any new point that you generate on this sphere will most likely be about sqrt(2) away from any other Gaussian distributed random point on the sphere.
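A quick simulation (a sketch; the sample counts are arbitrary) confirms the sqrt(2) figure:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 1000
# random points on the unit hypersphere: normalized Gaussian vectors
X = rng.standard_normal((200, D))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# pairwise Euclidean distances between distinct points,
# via d^2 = 2 - 2 cos(theta) for unit vectors
G = X @ X.T
d2 = 2.0 - 2.0 * G[np.triu_indices(200, k=1)]
dists = np.sqrt(np.maximum(d2, 0.0))
print(dists.mean())  # close to sqrt(2)
```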

Lastly let’s consider the cosine (dot product) distance between pairs of vectors going from the origin to random points on the surface. The cosine distance ranges from -1 where vectors point in opposite directions to 1 when they are exactly parallel. The distribution of cosine distances is shown below.

Here you can see that the cosine distance is distributed around zero and that this distribution gets narrower as the number of dimensions increases. What this means is that the random vectors are on average 90 degrees apart. This is consistent with the sqrt(2) geometric distance, which means that on average they are orthogonal. With higher numbers of dimensions the chances of generating a pair of vectors which share considerable extent along the same direction rapidly reduce.
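The narrowing can be checked directly; this sketch (my own illustration) measures the spread of pairwise cosines for a few values of D:

```python
import numpy as np

rng = np.random.default_rng(2)
spread = {}
for D in (10, 100, 1000):
    X = rng.standard_normal((500, D))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    cos = (X @ X.T)[np.triu_indices(500, k=1)]
    spread[D] = cos.std()  # shrinks roughly as 1/sqrt(D)
    print(D, round(cos.mean(), 4), round(cos.std(), 4))
```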

These properties relate to the problem of matching high-dimensional media descriptors in image, text or audio recognition. When you have a query descriptor vector and want to match it into a database of many other descriptors, you typically compute the descriptor space distance. The shell property described here dictates that the distance from the query to most of the non-matching descriptors will be about the same. Only the closely matching descriptors will fall significantly inside this shell. If you sort the distances from the query descriptor to the closest N candidate descriptor vectors, this distance graph will flatten out and in the limit become the shell distance. This can be used to establish the threshold bound for a matching process and discriminate between inliers and outliers.

For small networks, there can be some on-line learning, which might be useful to learn control systems with a few inputs and outputs, connecting for example IMU axes or simple sensors to servos or motors, trained with deep reinforcement learning. This is the scenario that I’m experimenting with and trying to enable for small, low power, and cheap interactive robots and toys.

For more complex processing where insufficient RAM is available to store weights, a fixed network can be stored in ROM built from weights that have been trained off line using python code.

Anyway, watch this space because I’m currently working on this library and intend to make it open source through my company Impressive Machines.

I have worked hard to bring a production version of my Nixie display controller to market. You can now actually order these units from my Etsy store here.

- High quality gold plated surface mount PCB
- Four digit Nixie display; product includes tubes.
- RGB LED back-lighting on each tube independently programmable to generate multiple colors
- The colon indicator can also be turned on and off
- Modules can be stacked next to each other for more digits
- Runs from 9-12V, with on-board 180V power supply
- Easily controlled by a serial line from an Arduino or any micro-controller or laptop to display any digits
- The board can also function as a stand-alone voltmeter
- Based on the familiar ATMega328
- Comes pre-programmed with open source display software
- Easily customized via the ISP port using standard tools
- Most spare micro-controller pins are accessible at the connector
- Based on plug-in IN-4 Nixies which are easily replaced
- Schematics and code are available for easy hacking

Sign up here to keep up to date.

I contributed to this project for a year or so when I was employed at Microsoft, working on 3D reconstruction from multiple infra-red camera views, so it was nice to get an acknowledgment. Some of this work was inspired by our earlier work at Microsoft Research which I co-presented at SIGGRAPH in 2004.

It’s very nice to see how far they have progressed with this project and to see the possible links that it can have with the HoloLens mixed reality system.


During on-line training, such as with a robot, or when people learn, adjacent training examples are highly correlated. Visual scenes have temporal coherence and people spend a long time at specific tasks, such as playing a card game, where their visual input, over perhaps hours, is not representative of the general statistics of natural scenes. During on-line training we would expect that a neural net's weights would become artificially biased by highly correlated consecutive training examples, so that the network would not be as effective at tasks requiring balanced knowledge of the whole training set.

In a neural network consisting of multiple layers, we ideally would like all neurons to be equally active with similar (sparse) activation histograms. Typically for unsupervised networks with ICA-like sparse learning, training images are manually decorrelated and normalized as a pre-processing step. This ensures that the variance of each unit’s activation remains around unity so that it can learn sparse structure instead of conventional second order statistics. However this is not always done, and may not typically be done for deep classification models. Also, there is often a requirement to normalize the energy of the weight vectors. Making them unit length means that the variance of a unit’s activation is going to be equal to the variance in the source pattern, which is somewhat artificial. When this source pattern is the non-linear output of the previous layer’s neurons in a multilayer network, the variance may not be very well defined.

Without justifying it thoroughly, it seems to me that a network will do well if the variance of the activity of its neurons is similar over the whole network when considering each neuron’s activation over a training set corpus. The idea is to have all neurons similarly involved – they will still provide a sparse code, but ideally they are all in general responsive to equal numbers of patterns in the training set and engage equally frequently in the design task.

One way to ensure this behavior without the need to keep a unit norm on the weight vectors is to adjust the weights adaptively to ensure that the output of each neuron is activated in general at about the same level as training progresses. This is a sort of homeostatic adaptation which is certainly one of the mechanisms that is used by the brain to keep neuron outputs in the biochemical working range. One way to do this is to have the weights constantly update in the usual way during training and at the same time renormalize them with respect to their outputs so that the measured variance of each associated unit’s activation is kept constant. Another way to do it is to have the weights change freely during learning, and then have a dedicated renormalization cycle, which may interfere less with learning. This is where the idea of sleep comes in.
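As a toy illustration of the second scheme (the function name, target variance, and adaptation rate here are my own choices, not an established algorithm), one could repeatedly measure each unit's activation variance on sample inputs and scale its incoming weights part of the way toward a target:

```python
import numpy as np

rng = np.random.default_rng(0)
# linear layer with deliberately unbalanced row gains
W = rng.standard_normal((8, 20)) * rng.uniform(0.2, 3.0, size=(8, 1))

def renormalize(W, samples, target_var=1.0, rate=0.5):
    # measure each unit's activation variance over the sample batch
    z = samples @ W.T
    v = z.var(axis=0)
    # scale each weight row part of the way toward the target variance
    gain = (target_var / (v + 1e-8)) ** (0.5 * rate)
    return W * gain[:, None]

samples = rng.standard_normal((5000, 20))
for _ in range(20):  # dedicated renormalization cycle
    W = renormalize(W, samples)
print((samples @ W.T).var(axis=0))  # all units settle near the target variance
```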

In this idea the neurons learn from input samples which may be randomly chosen, or they may be temporally correlated. As a result the weights change and the general balance of activity across the network will possibly depart from being even, especially if a lot of correlated samples are seen, since correlated samples repeatedly train the same neuron, which can cause its weights to grow excessively. This could well produce effects that are similar to the human experience of constantly internally hallucinating visual scenes (like playing hands of cards) when you have been repeating the same task for hours. The associated neurons are over-trained and have excessive responses to everything.

A renormalization cycle would expose the network to random training examples without training the weights in the usual sense, but just scaling them up and down to even out the network activities. Obviously this will affect the training somewhat, so it needs to be alternated with training.

One method of doing renormalization is to just show the network sequences of independent training examples in order to generate activations whose statistics can be measured and used for rescaling.

Another method might be to create a multi-layer network that is capable of dream-like recall. For example using the Google deep dream network approach it is possible to randomly choose neurons in the network to consider and then optimize patterns of activation in other levels that provide a memory-like recall of related concepts.

The relationship with renormalization is that this dream-like method would be a way to internally generate representative activity in the network that is then used for rebalancing neural activity gains. This really is a sort of dreaming process because, in this idea, spontaneous activity is generated in the network and propagated to all layers solely to generate responses so that excessively high or low activations can be detected and fixed by weight renormalization. The network is effectively generating its own training set for this purpose by recalling patterns starting from random noise, and then fixing the statistical biases that arise from over-training particular subsets of neurons.

I generated an image patch database that contains 500,000 28×28 or 64×64 sized monochrome patches that were randomly sampled from 5000 representative natural images, including a mix of landscape, city, and indoor photos. I am offering them here for download from Dropbox. There are two files:

image_patches_28x28_500k_nofaces.dat (334MB compressed)

image_patches_64x64_500k_nofaces.dat (1.66GB compressed)

The first file contains 28×28 pixel patches and the second one contains 64×64 patches. The patches were sampled from a corpus of personal photographs at many different locations and uniformly in log scale. A concerted effort was made to avoid images with faces, so that these could be used as a non-face class for face detector training. However there are occasional faces that have slipped through but the frequency is less than one in one thousand.

The format of the files is very simple. It is just raw concatenated byte valued pixel data without any header, so for example for the 28×28 case there are 784 bytes for each image.
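Given that format, loading the patches with numpy is a one-liner; this sketch (the file name is illustrative) round-trips some synthetic data through the same headerless layout:

```python
import numpy as np

def load_patches(path, size=28):
    # raw concatenated uint8 pixel data, no header: size*size bytes per patch
    data = np.fromfile(path, dtype=np.uint8)
    return data.reshape(-1, size, size)

# round-trip check with synthetic data in the same headerless format
demo = (np.random.rand(10, 28, 28) * 255).astype(np.uint8)
demo.tofile("demo_patches.dat")
patches = load_patches("demo_patches.dat")
print(patches.shape)  # (10, 28, 28)
```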

You can see 10,000 examples below:

These images are provided freely and without any usage restrictions and can be used for commercial or non-commercial purposes. The only requirement is that you provide attribution to Simon Winder, Impressive Machines LLC.

If you find that the links don’t work, then try again the next day as Dropbox only gives me 20GB of bandwidth per day and these are large files.

I may expand this set to include color images, and I will update this page if I do.

The forward pass through a neural net consists of alternating stages of linear multiplication by a weight matrix and non-linear activation functions which transform the output of each linear unit independently. We can write the transformation in vector form as z = Wx and y = f(z), where x is the input, z is the output of the linear stage, y is the output of the non-linear stage, and f is the activation function which acts on each element of z independently. For subsequent stages, the input is the output of the previous stage.
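In code, one such stage might look like this (a minimal sketch, with tanh chosen arbitrarily as the activation):

```python
import numpy as np

def forward_layer(W, x, f=np.tanh):
    z = W @ x   # linear stage
    y = f(z)    # non-linear stage, applied elementwise
    return z, y

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))
W2 = rng.standard_normal((2, 5))
x = rng.standard_normal(3)

z1, y1 = forward_layer(W1, x)
z2, y2 = forward_layer(W2, y1)  # next stage takes the previous output as input
```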

During stochastic gradient descent each input pattern from the training set is presented to the network and the final output is computed. In the case that the network is set up for regression, the desired output vector is compared directly with the activation y of the last units of the network. The error for a single training pattern is the scalar E = (1/2) Σ_i (y_i − t_i)², where t is a training vector. The total error which we are trying to minimize is the sum of these errors over the whole training set.

In order to minimize the error we need to adjust the weight matrices. In order to update the weights we compute the derivative of the error with respect to each weight in the network using the chain rule of derivatives:

∂E/∂w_ij = (∂E/∂y_i) (∂y_i/∂z_i) (∂z_i/∂w_ij)

Here w_ij is the weight that connects linear unit i with input j. The first term is the derivative of the error with respect to the network output, which we call e_i and is equal to (y_i − t_i). The second term is the value of the derivative of the activation function f′(z_i) at the current working point z_i. The last term is the derivative with respect to the weight, and this is just equal to the value x_j of the input unit in the case of the first layer, or equal to y_j from the previous layer in the case of hidden layers.

This formula applies for the output layer, but in order to compute the weight updates for earlier layers we have to back-propagate the errors, so that preceding stages receive an error vector fed backwards from the later stage, rather than the error from direct comparison with the training vector.

This is where a diagram really helps to explain things. Below you can see the flow of information in a two layer neural net which is set up for regression:

In the forward pass the input vector x is multiplied by the weights for the first layer and the result is summed to produce the left-most value of z. The activation function is applied to get the output of the first layer, y. This process is repeated for the second layer. The output of the network is subtracted from the training vector t to form the first error vector e for the output. This then gets multiplied by the value of the derivative of the activation function f′(z), which uses the forward value of z from the same layer. This gives the error δ_i = e_i f′(z_i).

The derivative with respect to the weights is computed by multiplying the layer input (x, or y from the previous layer) with the value of δ (using the path shown in red). This derivative is used for gradient descent using the stochastic update formula

w_ij ← w_ij − η δ_i x_j

where η is the learning rate.

The new values of the error for the previous layer are formed by ‘pushing’ the error back through the weights, summing the contributions: e = Wᵀδ. This step is only necessary for hidden layers because the error vector is not needed at the network input.

If the network has bias weights then these are updated by assigning them a virtual input unit which has an activation value of 1.
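The recipe above can be condensed into a short numpy sketch of a single stochastic update for a two-layer regression network with tanh units (my own illustration, not code from the post; biases omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3)) * 0.5   # first layer weights
W2 = rng.standard_normal((2, 4)) * 0.5   # second layer weights
x = rng.standard_normal(3)               # input pattern
t = rng.standard_normal(2)               # training target
eta = 0.1                                # learning rate

# forward pass
z1 = W1 @ x
y1 = np.tanh(z1)
z2 = W2 @ y1
y2 = np.tanh(z2)

# backward pass: delta = error times activation derivative
e2 = y2 - t
d2 = e2 * (1.0 - y2**2)    # tanh'(z) = 1 - tanh(z)^2
e1 = W2.T @ d2             # push the error back through the weights
d1 = e1 * (1.0 - y1**2)

# stochastic gradient descent update
dW2 = np.outer(d2, y1)
dW1 = np.outer(d1, x)
W2 -= eta * dW2
W1 -= eta * dW1
```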

If a network is set up for classification, the output layer is normally a softmax function:

y_i = exp(z_i) / Σ_j exp(z_j)

This converts the linear activation into a probability. The training set consists of vectors where all elements have the value zero except for the one which is the correct class, which has the value 1. The error is given by the cross-entropy:

E = −Σ_i t_i log(y_i)

and the derivative of this with respect to z_i happens to be equal to (y_i − t_i), which is the same as for the regression case. The following diagram shows the situation when we have one layer of neurons with a nonlinear output function and a second layer with a softmax error (loss) layer.

You can see that it is almost the same structurally except that the output nonlinearity has been replaced by the softmax forward calculation and the error feedback is simplified.
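That convenient derivative is easy to verify numerically; this sketch compares the analytic gradient y − t against finite differences:

```python
import numpy as np

def softmax(z):
    ez = np.exp(z - z.max())   # subtract max for numerical stability
    return ez / ez.sum()

def cross_entropy(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.5, -1.2, 2.0])
t = np.array([0.0, 0.0, 1.0])   # one-hot target

analytic = softmax(z) - t
eps = 1e-6
numeric = np.array([(cross_entropy(z + eps*e, t) - cross_entropy(z - eps*e, t)) / (2*eps)
                    for e in np.eye(3)])
print(np.max(np.abs(analytic - numeric)))  # tiny
```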

With regards to back propagating the error through the weights in a fully connected network, this operation is simply multiplication by the matrix transpose.

For a convolutional neural network we have a slightly more complex situation because there is only one small set of weights shared for the whole layer and this affects the weights update and the back propagation. For the weights update, the situation is fairly straightforward – one sums the contribution of weight deltas from all the layer input and output pairs that are connected through that weight.

For back propagating the error the situation seems quite complex because there may be downsampling, and so keeping track of which weights connect back into the earlier stage, from the perspective of each input, involves awkward bookkeeping. For this reason it is useful to think of the operation as pushing the error back through the weights, because then you do a loop over the output error variables and for each one sequentially add in its contribution to the input error variables. While in the forward pass you gather the network inputs together through the convolution kernel, for the backward pass you spread the errors back out by pushing them back through the same convolution kernel. This is slightly less computationally efficient but avoids bookkeeping headaches.
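A 1D example (my own illustration) makes the gather/scatter symmetry concrete:

```python
import numpy as np

def conv_forward(x, w):
    # valid convolution (strictly, correlation): gather inputs through the kernel
    n, k = len(x), len(w)
    return np.array([np.dot(x[i:i+k], w) for i in range(n - k + 1)])

def conv_backward(err_out, w, n):
    # scatter: loop over output errors, push each back through the same kernel
    k = len(w)
    err_in = np.zeros(n)
    for i, e in enumerate(err_out):
        err_in[i:i+k] += e * w
    return err_in

x = np.arange(8.0)
w = np.array([1.0, -2.0, 0.5])
y = conv_forward(x, w)
e_in = conv_backward(np.ones_like(y), w, len(x))  # errors pushed back to the input
```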

Various activation functions are possible for the nonlinear part of the network. Here are some of these and their derivatives:

Rectified linear units: f(z) = max(0, z), with derivative f′(z) = 1 for z > 0 and 0 otherwise.

Soft rectified units: f(z) = log(1 + exp(z)), with derivative f′(z) = 1 / (1 + exp(−z)).

Logistic units: f(z) = 1 / (1 + exp(−z)), with derivative f′(z) = f(z)(1 − f(z)).

Tanh units: f(z) = tanh(z), with derivative f′(z) = 1 − tanh²(z).
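Collected in code, with a finite-difference sanity check of each derivative (a sketch):

```python
import numpy as np

activations = {
    "relu":     (lambda z: np.maximum(z, 0.0),
                 lambda z: (z > 0).astype(float)),
    "softplus": (lambda z: np.log1p(np.exp(z)),
                 lambda z: 1.0 / (1.0 + np.exp(-z))),
    "logistic": (lambda z: 1.0 / (1.0 + np.exp(-z)),
                 lambda z: np.exp(-z) / (1.0 + np.exp(-z))**2),
    "tanh":     (np.tanh,
                 lambda z: 1.0 - np.tanh(z)**2),
}

z = np.linspace(-3.0, 3.0, 12)  # even count avoids the relu kink at exactly zero
eps = 1e-6
worst = {}
for name, (f, df) in activations.items():
    numeric = (f(z + eps) - f(z - eps)) / (2.0 * eps)
    worst[name] = np.max(np.abs(numeric - df(z)))
    print(name, worst[name])
```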

Note that one must diligently check the gradients of a neural network because it is very easy to learn garbage with slight errors in the gradient calculations. The network might generate half-decent results and still be incorrect leading to all kinds of headaches.

I find the symbolic derivative calculator web site to be an excellent resource.

One thing that is not commonly mentioned is that the popular rectified linear units are not amenable to numerical gradient checking. This is because when one computes the difference in the output error by changing a weight ever so slightly and then computes the gradient with respect to these weights, the gradient may change suddenly when the unit activations go from slightly negative to slightly positive. This step change in gradient makes it impossible to do a gradient check in this way. The solution is to bypass the rectified units by making them linear during the gradient check procedure.

On a final philosophical note, it's interesting to see that back-propagation is *almost* a local operation. In the brain, weight modification takes place between neurons using local update rules, such as Hebbian synaptic modification. Back propagation seems like it does not do that because the forward and backward paths are separate. But if the corresponding forward variables and error variables were somehow ‘merged’ then the weight update would be a local operation.

The only way that the variables could be merged is if there is a change from neurons representing feedforward activation to feedback errors during the temporal evolution of the response to a stimulus presentation. There are many theories that due to the large number of back-projecting connections in the cortex the brain actually does learn from an error signal that is created by subtracting a reconstructed expectation of the input from its actual input in a backward pass. This idea is consistent with the merging of the roles of feedforward activation and error back propagation in the same neurons, where each neuron codes an evolving fraction of activation and error signals.

I created a network which is almost a three layer network where the output of pairs of the first layer rectified units are combined additively before being passed to the second fully connected layer. This means that the first layer has a 28×28 input and a 50 unit output (hidden layer) with rectified linear units, and then pairs of these units are averaged to reduce the neuron count to 25, and then the second fully connected layer reduces this down to 10. Finally the softmax classifier is applied.

This architecture is identical to a two layer network where the second layer has half as many weights and pairs of input units share the same weight. I was interested to know how well this would perform compared to the earlier network with more parameters, and what kind of first layer weights would be learned.

The results are that the network gets an error rate of 8.45% for pixel position jittered MNIST figures with added Gaussian noise (described previously). This compares to 7.9% for the earlier full network without shared weights (double the number of free parameters). However it learns interesting weight patterns:

Notice that consecutive weight maps are similar, particularly in the orientation of the features that are selected for. Often the weights are complementary in sign or else are shifted spatially compared to each other. This has the effect of providing some position independence in a similar manner to complex cell sub-units in the visual cortex, because the rectified positive output of one unit will partially overlap with the output of the other one in the pair, increasing the area over which a positive response is generated, without changing the linear spatial selectivity.

In a similar manner I tried actually using a geometric combination of pairs of outputs of the first layer linear units without the rectification layer. The formula is y = sqrt(z1² + z2² + ε). (The addition of ε is necessary to remove the derivative singularity at zero.) Without the rectification layer, both positive and negative parts of the first layer unit responses can contribute positively to the hidden layer inputs, which introduces a significant second order nonlinearity. In particular, if the weights of the input layer end up generating responses in spatial phase quadrature then the unit will be completely phase independent, as in complex cell receptive fields. This contributes to spatial location invariance.
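A sketch of this pairing and its derivative (the ε value here is my own choice):

```python
import numpy as np

def geometric_pairs(z, eps=1e-4):
    # combine consecutive pairs of linear outputs: sqrt(z1^2 + z2^2 + eps)
    z1, z2 = z[0::2], z[1::2]
    return np.sqrt(z1**2 + z2**2 + eps)

def geometric_pairs_grad(z, eps=1e-4):
    # derivative of each pair output with respect to its members: z_i / y
    y = geometric_pairs(z, eps)
    g = np.empty_like(z)
    g[0::2] = z[0::2] / y
    g[1::2] = z[1::2] / y
    return g

z = np.array([0.0, 0.0, 1.0, -2.0])
print(geometric_pairs(z))  # the first pair stays smooth at zero thanks to eps
```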

The results show that magically most of the input weights for the geometrically summed units do end up nicely in phase quadrature:

The error rate for this network is 7.53%, somewhat better than before. This is probably because the geometric addition introduces better spatial invariance than adding rectified outputs.

Incidentally, if you run this on the raw MNIST data (without added distortions to make the recognition harder), the test error rate is a very respectable 2.1%, with the training error down at 0.19%. This is a good result for a two layer net with 50 hidden units.

Next, I will be exploring convolutional layers.
