When training neural networks it is a good idea to present the training examples in random order. We want to ensure that any sequence of training examples, long or short, has statistics that are representative of the whole set. During training we will be adjusting weights, often using stochastic gradient descent, so we would ideally like the source statistics to remain stationary.
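A minimal sketch of this (the array names are just illustrative) is to draw a fresh random permutation of the training set each epoch, so that any contiguous run of examples is an unbiased sample of the whole:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                  # stand-in for a training set
for epoch in range(3):
    order = rng.permutation(len(data))  # fresh random order each epoch
    for example in data[order]:
        pass                          # ...train on `example` here
```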
During on-line training, such as with a robot, or when people learn, adjacent training examples are highly correlated. Visual scenes have temporal coherence, and people spend a long time at specific tasks, such as playing a card game, where their visual input, over perhaps hours, is not representative of the general statistics of natural scenes. During on-line training we would expect a neural network’s weights to become artificially biased by highly correlated consecutive training examples, so that the network would be less effective at tasks requiring balanced knowledge of the whole training set.
In a neural network consisting of multiple layers, we ideally would like all neurons to be equally active, with similar (sparse) activation histograms. Typically, for unsupervised networks with ICA-like sparse learning, training images are manually decorrelated and normalized as a pre-processing step. This ensures that the variance of each unit’s activation remains around unity, so that it can learn sparse structure rather than conventional second-order statistics. However this is not always done, and may not typically be done for deep classification models. There is also often a requirement to normalize the energy of the weight vectors. Making them unit length means that the variance of a unit’s activation will equal the variance in the (whitened) source pattern, which is somewhat artificial. When the source pattern is the non-linear output of the previous layer’s neurons in a multilayer network, that variance may not be very well defined.
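To make the unit-length point concrete, here is a small numerical sketch (my own illustration, not from the text above): for a linear unit y = w·x with a unit-norm weight vector and whitened input (identity-shaped covariance), the output variance simply inherits the per-component input variance.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
w = rng.normal(size=d)
w /= np.linalg.norm(w)            # unit-length weight vector

# whitened input: independent components, per-component variance 4
x = rng.normal(scale=2.0, size=(100_000, d))
y = x @ w
print(np.var(y))                  # close to 4.0, the input variance
```

So with unit-norm weights, the unit's activation variance is fixed entirely by the input statistics, which is the "somewhat artificial" constraint noted above.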
Without justifying it thoroughly, it seems to me that a network will do well if the variance of the activity of its neurons is similar over the whole network when considering each neuron’s activation over a training set corpus. The idea is to have all neurons similarly involved – they will still provide a sparse code, but ideally they are all in general responsive to equal numbers of patterns in the training set and engage equally frequently in the design task.
One way to ensure this behavior, without needing to keep a unit norm on the weight vectors, is to adjust the weights adaptively so that the output of each neuron stays at about the same general level as training progresses. This is a sort of homeostatic adaptation, which is certainly one of the mechanisms the brain uses to keep neuron outputs in their biochemical working range. One way to do this is to let the weights update in the usual way during training and at the same time renormalize them with respect to their outputs, so that the measured variance of each unit’s activation is kept constant. Another way is to let the weights change freely during learning and then run a dedicated renormalization cycle, which may interfere less with learning. This is where the idea of sleep comes in.
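The first variant, continuous homeostatic rescaling, might be sketched like this (the update rule, gain constant, and names are my own assumptions, not a prescription from the text): keep a running estimate of each unit's output variance and multiplicatively nudge its weight vector toward a target variance on every step.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, d = 8, 20
W = rng.normal(scale=0.5, size=(n_units, d))   # deliberately uneven gains
running_var = np.ones(n_units)
target_var, ema, gain = 1.0, 0.99, 0.01

for _ in range(2000):
    x = rng.normal(size=d)                     # one input pattern
    y = W @ x                                  # linear activations
    # running per-unit variance estimate (exponential moving average)
    running_var = ema * running_var + (1 - ema) * y**2
    # small multiplicative homeostatic correction toward the target
    W *= (target_var / running_var)[:, None] ** gain

print(np.round(running_var, 2))                # all drift toward target_var
```

The small exponent `gain` keeps the correction gentle, so it coexists with ordinary gradient learning rather than fighting it.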
In this idea the neurons learn from input samples which may be randomly chosen, or may be temporally correlated. As a result the weights change, and the general balance of activity across the network will possibly depart from being even, especially if many correlated samples are seen, since correlated samples repeatedly train the same neurons, which can cause their weights to grow excessively. This could well produce effects similar to the human experience of constantly internally hallucinating visual scenes (like playing hands of cards) when you have been repeating the same task for hours. The associated neurons are over-trained and have excessive responses to everything.
A renormalization cycle would expose the network to random training examples without training the weights in the usual sense, just scaling them up and down to even out the network activities. Obviously this will affect the training somewhat, so it needs to be alternated with training.
One method of doing renormalization is to just show the network sequences of independent training examples in order to generate activations whose statistics can be measured and used for rescaling.
Another method might be to create a multi-layer network that is capable of dream-like recall. For example, using the Google Deep Dream approach, it is possible to randomly choose neurons in the network to consider and then optimize patterns of activation in other layers that provide a memory-like recall of related concepts.
The relationship with renormalization is that this dream-like method would be a way to internally generate representative activity in the network that is then used for rebalancing neural activity gains. This really is a sort of dreaming process because, in this idea, spontaneous activity is generated in the network and propagated to all layers solely to generate responses so that excessively high or low activations can be detected and fixed by weight renormalization. The network is effectively generating its own training set for this purpose by recalling patterns starting from random noise, and then fixing the statistical biases that arise from over-training particular subsets of neurons.
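As a toy sketch of the dream-like recall step (the tiny architecture and all names here are my own assumptions): starting from random noise, run gradient ascent on a randomly chosen hidden unit's activation, Deep-Dream style, to generate internal activity that could then feed the renormalization measurements.

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_hid = 40, 10
W1 = rng.normal(scale=0.3, size=(d_hid, d_in))   # one hidden layer, tanh units

def hidden(x):
    return np.tanh(W1 @ x)

j = rng.integers(d_hid)                 # randomly chosen neuron to "dream" about
x = rng.normal(scale=0.1, size=d_in)    # start from noise, not a real input
for _ in range(200):
    h = hidden(x)
    grad = (1 - h[j] ** 2) * W1[j]      # d h_j / d x for the tanh unit
    x += 0.1 * grad                     # ascend that unit's activation

print(hidden(x)[j])                     # driven close to its maximum of 1.0
```

The activations produced by such internally generated inputs could then be measured and used exactly as the externally driven statistics are in the renormalization step, which is the sense in which this is a dreaming process.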