How to Make a Random Orthonormal Matrix

To initialize neural networks it is often desirable to generate a set of vectors that spans the space. In the case of a square weight matrix, this means we want a random orthonormal basis.

The code below generates such a random basis by composing random Householder transformations.


import numpy
import random
import math

def make_orthonormal_matrix(n):
	"""
	Makes a square matrix which is orthonormal by concatenating
	random Householder transformations
	"""
	A = numpy.identity(n)
	d = numpy.zeros(n)
	d[n-1] = random.choice([-1.0, 1.0])
	for k in range(n-2, -1, -1):
		# generate a random Householder reflection acting on the last n-k coordinates
		x = numpy.random.randn(n-k)
		s = math.sqrt((x**2).sum())  # norm(x)
		sign = math.copysign(1.0, x[0])
		s *= sign                    # give s the sign of x[0] to avoid cancellation below
		d[k] = -sign
		x[0] += s                    # x is now the Householder vector v
		beta = s * x[0]              # beta = v'v / 2, so H = I - v v' / beta
		# apply the reflection to rows k..n-1 of A
		y = numpy.dot(x, A[k:n,:]) / beta
		A[k:n,:] -= numpy.outer(x, y)
	# change sign of rows
	A *= d.reshape(n,1)
	return A

n = 100
A = make_orthonormal_matrix(n)

# test matrix
maxdot = 0.0
maxlen = 0.0
for i in range(n):
	# deviation of row i from unit length
	maxlen = max(math.fabs(math.sqrt((A[i,:]**2).sum()) - 1.0), maxlen)
	for j in range(i+1, n):
		maxdot = max(math.fabs(numpy.dot(A[i,:], A[j,:])), maxdot)
print("max dot product = %g" % maxdot)
print("max vector length error = %g" % maxlen)

Another way to do this is to take the QR decomposition of a random Gaussian matrix; the code above, however, avoids calculating the R matrix.

Postscript:

I did some timing tests, and the QR method appears to be about 3 times faster under Python 3:

import numpy
from scipy.linalg import qr

n = 4
H = numpy.random.randn(n, n)
Q, R = qr(H)
print(Q)
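
For reference, a minimal way one might time the two approaches (reusing make_orthonormal_matrix from above; the actual numbers will depend on n and on the NumPy/SciPy and BLAS builds):

import timeit
import numpy
from scipy.linalg import qr

n = 100
# time 100 runs of each method
t_house = timeit.timeit(lambda: make_orthonormal_matrix(n), number=100)
t_qr = timeit.timeit(lambda: qr(numpy.random.randn(n, n)), number=100)
print("Householder: %.4f s  QR: %.4f s" % (t_house, t_qr))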

A role for sleep and dreaming in neural networks

When training neural networks it is a good idea to have a training set whose examples are presented in random order. We want to ensure that any sequence of training examples, long or short, has statistics that are representative of the whole set. During training we adjust the weights, often by stochastic gradient descent, so ideally the source statistics should remain stationary.
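
For an offline training set, randomly ordering the examples can be as simple as reshuffling them each epoch; a minimal sketch (X and y are my own names for NumPy arrays of examples and labels):

import numpy

def shuffled_epochs(X, y, num_epochs):
	"""Yield the training set in a fresh random order each epoch."""
	for epoch in range(num_epochs):
		perm = numpy.random.permutation(len(X))
		yield X[perm], y[perm]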

During on-line training, such as with a robot, or when people learn, adjacent training examples are highly correlated. Visual scenes have temporal coherence, and people spend long stretches on specific tasks, such as playing a card game, where their visual input over perhaps hours is not representative of the general statistics of natural scenes. During on-line training we would therefore expect a neural net's weights to become artificially biased by the highly correlated consecutive training examples, so that the network would not be as effective at tasks requiring balanced knowledge of the whole training set.

Natural image patch database

If you are training neural networks, experimenting with natural image statistics, or even just making art, then you may want a database of natural images.

I generated an image patch database containing 500,000 monochrome patches, at either 28×28 or 64×64 pixels, randomly sampled from 5000 representative natural images, including a mix of landscape, city, and indoor photos. I am offering them here for download from Dropbox. There are two files:

image_patches_28x28_500k_nofaces.dat (334MB compressed)
image_patches_64x64_500k_nofaces.dat (1.66GB compressed)

The first file contains the 28×28 pixel patches and the second one the 64×64 patches. The patches were sampled from a corpus of personal photographs taken at many different locations, and uniformly in log scale. A concerted effort was made to avoid images with faces, so that these could be used as a non-face class for face detector training; the occasional face has still slipped through, but at a frequency of less than one in one thousand.
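
The on-disk format is not described in this excerpt, so purely as an illustrative sketch, here is how one might load the 28×28 file if (and this is an assumption) it stores raw unsigned 8-bit pixels after decompression:

import numpy

# Assumed layout: raw uint8 pixels, 500,000 patches of 28x28 each (decompress the download first).
# Adjust the dtype and shape to whatever the actual file format turns out to be.
patches = numpy.fromfile("image_patches_28x28_500k_nofaces.dat", dtype=numpy.uint8)
patches = patches.reshape(-1, 28, 28)
print(patches.shape)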

Understanding back-propagation

Understanding the back-propagation algorithm for training neural networks can be challenging, partly because the terminology varies confusingly between sources and partly because the algorithm is usually described purely in terms of the mathematics. Here I present a diagrammatic explanation of back-propagation for the visually inclined. I also summarize the non-linear stages that are commonly used, and provide some philosophical insight.

The forward pass through a neural net consists of alternating stages: linear multiplication by a weight matrix, and a non-linear activation function that transforms the output of each linear unit independently. We can write the transformation in vector form as $\mathbf{z} = \mathbf{W}\mathbf{x}$ and $\mathbf{y} = g(\mathbf{z})$, where $\mathbf{x}$ is the input, $\mathbf{z}$ is the output of the linear stage, $\mathbf{y}$ is the output of the non-linear stage, and $g(\mathbf{z})$ is the activation function, which acts on each element of $\mathbf{z}$ independently. For subsequent stages, the input $\mathbf{x}$ is the output $\mathbf{y}$ of the previous stage.
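
As a minimal sketch of one such forward stage in NumPy (the function name is mine, and the rectified linear unit is just one common choice of g):

import numpy

def forward_stage(W, x):
	"""One linear stage followed by an elementwise non-linearity."""
	z = numpy.dot(W, x)        # linear stage: z = W x
	y = numpy.maximum(z, 0.0)  # activation y = g(z), here a rectified linear unit
	return z, y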

Training two-layer neural nets on MNIST digits

In my last blog post I talked about trying out my code for training neural nets on a simple one-layer network which consists of a single weight layer and a softmax output. In this post I share results for training a fully connected two-layer network.

In this network, the input goes from 28×28 image pixels down to 50 hidden units. Then there is a rectified linear activation function. The second layer goes from the 50 hidden units down to 10 units, and finally there is the softmax output stage for classification.
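
A minimal NumPy sketch of this forward pass (the names are my own, biases are omitted, and the shapes assume the 28×28 input is flattened to a 784-vector):

import numpy

def two_layer_forward(x, W1, W2):
	"""784 inputs -> 50 rectified linear hidden units -> 10-way softmax."""
	h = numpy.maximum(numpy.dot(W1, x), 0.0)  # W1 is 50x784, rectified linear activation
	z = numpy.dot(W2, h)                      # W2 is 10x50
	e = numpy.exp(z - z.max())                # softmax, shifted for numerical stability
	return e / e.sum()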

When I train this network on the MNIST handwriting dataset I get a test error rate of 2.89%, which is pretty good and actually lower than some of the comparable results quoted on the MNIST web site. It is interesting to inspect the patterns of the first-layer weights below (here I organized the weights for the 50 hidden units as a 10×5 grid):

(Image: first-layer weight patterns for the 50 hidden units, arranged in a 10×5 grid.)
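
For anyone wanting to reproduce that kind of figure, a rough sketch of the tiling (my own plotting code, assuming W1 is the 50×784 first-layer weight matrix):

import numpy
import matplotlib.pyplot as plt

def plot_first_layer(W1):
	"""Tile the 50 hidden-unit weight vectors as 28x28 images in a 10x5 grid."""
	fig, axes = plt.subplots(10, 5, figsize=(5, 10))
	for i, ax in enumerate(axes.flat):
		ax.imshow(W1[i].reshape(28, 28), cmap="gray")
		ax.axis("off")
	plt.show()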