I’m doing some simple exploration of image statistics on a large database of natural images. The first thing that I tried was computing the histogram of neighboring image pixel intensity differences. Here is the graph for that using a log y axis, for a few pixel separations.

It is clear that large differences occur much more rarely and that the most probable pixel to pixel spatial change in intensity is zero. However the tails are heavy, so it is nothing like a Gaussian distribution. The adjacent pixel intensity difference log probabilities were fairly well fitted by a function that goes like , and the pixels further apart require a smaller exponent.

Below is an image of the 2D joint log probability function for two adjacent pixels.

You can see that there are some checker artifacts which most likely come from the jpeg encoding scheme.

What is interesting here and not commonly pointed out is that the fact that since we have a limited range of intensity from 0 to 255, this means that the distribution is clipped. This explains the strange behavior of the tails of the graphs above, particularly where they start to fall off rapidly with large for pixels that are not adjacent. Some probability mass is missing that would lift those tails if we were directly plotting luminance in a natural scene.

Now lets take a quick look at color statistics. Once can compute the 3D space PDF of RGB components, but that’s difficult to visualize. Below you can see three 2D PDFs of RG, RB, and GB pairs.

The first thing to notice is that there are a lot of weird lines and artificial looking black areas. These are caused by artifacts of JPEG compression and also one or more color space conversions which have warped the space slightly making places where histogram values bunch up or are rarer. It does not affect the result much.

These PDFs show that there is a lot of mass and therefore correlation along the luminance direction. After that RG looks narrower in the R-G direction meaning that R and G are quite correlated, whereas RB is very spread out, meaning that R and B are quite independent. G and B is in the middle.

If we carry out PCA on the color components this is equivalent to estimating a 3D Gaussian PDF that approximates the true joint PDF. From this we find that the directions of the components are

The standard deviation of that direction of the distribution is given in brackets. This experiment shows that the variance is strongest in the white direction, then in the blue – red direction, and last in the green – magenta direction. Of course since PCA vectors are forced to be orthogonal, it does not say much about the psychophysics of color, but it is well known that the vision system has channels that code for luminance, blue-yellow, and red-green color opponent directions because this choice de-correlates the components and creates a whitened colorspace in the statistical sense.

If one computes the PCA of image patches including both RGB and spatial data then the following component vectors are generated:

There are many more of these patches with progressively higher spatial frequency components. They represent a set of basis vectors with increasing variance.

The first ones represent the DC component and include the luminance, blue-yellow and magenta green directions. There are also horizontal and vertical low frequency gradients, followed by higher order spatial frequency mixes. Blue yellow first order gradients are visible on the second row. Note that the luminance direction dominates the variance, and the blue yellow direction is probably set by typical sky color which is common in natural images. The green/magenta direction is forced to be orthogonal to these other two directions.

The PCA breakdown is approximating the PDF of the image region as a single Gaussian which is oriented in space and color. The variance (eigenvalues of the covariance matrix) fall off with each patch as the frequencies get higher.

Representing the image with these second order statistics does not really characterize the interesting structure in the image because this just encodes the energy in each frequency band of the image FFT. The image phase structure which is not represented by this model is more critical for image recognition. It is the departures from a Gaussian model which are interesting. The most useful thing that can be done with the information in the covariance matrix is to use it to whiten or de-correlate the image data to normalize out the second order statistics prior to later analysis.

Next let’s look to see how each pixel is correlated with its neighbors. We can compute the autocorrelation which is where is an image intensity value at after the mean has been removed. (Note that the mean typically varies from the top to the bottom of an image because most natural images are lighter at the top.) This is a 2D function and is shown below, together with a cross section graph in the x direction of the log of this function.

The function is approximately a circular spot, although there is widening in the x and y directions. The graph includes an approximate fit of the function , which is the same function that fit the derivative distribution. One can see that the image pixels are statistically correlated over a very wide region.

Now the power spectral density is the Fourier transform of the autocorrelation function and this ends up being highest for low spatial frequencies and falls off continuously as frequency is increased. In order to whiten the data we must flatten the spectrum in the frequency domain, assuming that these second order statistics remain constant with changes in spatial location, which is a reasonable assumption in images.

Many forms of whitening of the data are possible because the second order statistics don’t constrain the whitening process. Once the data (e.g. square image patch) has been whitened then any matrix rotation of the space can be made without loss of the whitening.

If we whiten it with a high pass filter defined in the frequency domain as the reciprocal of the square root of the power spectrum (using zero for the phase), this leads to an impulse response which is approximately circular symmetric and has a center surround arrangement similar to retinal processing stages.

Now actually there is noise in the signal who’s power increases with frequency and so the optimal pre-processing to maintain information content is to use a whitening filter based on a Wiener filter which does not exaggerate the highest frequencies (boosting the noise) but allows the gain to roll off above a certain point to increase the signal to noise ratio. The resulting filter has a Gaussian like center in the spatial domain, instead of an impulse.

In summary, we can create this impulse response by Fourier transforming the autocorrelation image, taking the square root, replacing the resulting amplitude spectrum by its reciprocal and multiplying by a Gaussian function of frequency to damp out the high frequency components, and then computing the inverse Fourier transform.

The result of doing this is to generate a 2D filter with the impulse response shown here. The graph shows a section along the X-axis. Clearly this is a center-surround filter not dissimilar to the difference of Gaussians that are popularly used when modeling the visual system.

I spent some time to get a good model of the frequency domain falloff of the natural image energy and came up with a good parametric model for a FFT domain whitening filter. This is , where is the radial frequency and is always positive. For images with pixel values in the range 0 to 1, I find that works very well to largely diagonalize the covariance matrix and bring the pixel variances near unity.

Here is an example image whitened using this transform. The frequency domain filter is a combination of an exponential high pass filter convolved with a small diameter Gaussian low pass filter. The low pass filter retains some very short range correlations, but makes the output less sensitive to camera noise and removes the frequencies in the high diagonal frequency bands. This gives nicer results when this preprocessing stage is used to provide the input to a neural network.

One thing to note here however is that despite this whitening, there are still areas where the contrast is persistently low, such as in the shadows, and areas where the contrast is high, in the sunlit areas. So it is clear that there is still an obvious correlation of a multiplicative sort, such that if the contrast is low at a certain place, then it is likely to be low nearby.

Once can see this if we compute correlation on even functions of the whitened data.

On this graph, the pink curve shows the original raw image pixel to pixel correlation score which is high, even very far from the central pixel. After whitening, the blue correlation curve shows that whitening has removed most of the correlation except for regions very close to the central pixel. However if one computes correlations using a nonlinear function of whitened pixel values, then this new correlation is still present to a large extent out to far distances, particularly when the function is the absolute value function (red curve). These correlations represent the contrast variations that multiply signals differently from region to region. These correlations also fall off as following the same general form as before.

If the underlying reflectance function of the surfaces in the image has a particular probability distribution, then the varying lighting will increase and decrease the contrast in the image and will act to multiply the spread of that probability distribution as it appears in the image statistics. This acts to make the final image distribution more sparse than the real world reflectance distribution. If the intention of recognition is to ignore lighting changes, then these contrast variations are to some extent a nuisance. The obvious way to get rid of them is to divide by a function of local contrast.

Let us suppose that we have a source vector (pixel patch) which has been multiplied by an independent contrast variable to give . Let us suppose that has some unknown distribution but is Gaussian. If we divide by the L2 norm of this cancels out the but leaves which no longer has a Gaussian distribution because it has been projected onto a unit sphere. However, if the dimensionality is large enough, the joint distribution still approximates a Gaussian because of the shell property of high dimensional spaces.

This last statement pertains to the concept that even in a pure Gaussian distribution in a large number of dimensions, the probability mass is primarily concentrated at a shell a fixed distance from the origin. So normalization does not make a lot of difference. In experiments, I found that the distribution of each element of a Gaussian vector was still very Gaussian after normalization provided the dimensionality was larger than around 8.