Let’s say that we make measurements of a large group of people. Such measurements might include height, weight, IQ, blood pressure, credit score, hair length, preference, personality traits, etc. You can imagine obtaining a mass of data about people like this where each measurement is taken to lie on a continuous scale. Typically the distribution of the population along each one of these measurements will be a bell curve. Most people have average height for example. The interesting fact is that the more measurements you take, the less likely it is that you will find anyone who is simultaneously average along all the dimensions that you consider. All of us are abnormal if you consider enough personal attributes.

This brings us to the shell property of high dimensional spaces.

Let’s consider a normal (Gaussian) distribution in D-dimensions. In 1D it is obvious that all the probability bulk is in the middle, near zero. In 2D the peak is also in the middle. One might imagine that for any number of dimensions this would continue to hold, but this is false. The shell property of high dimensional spaces shows that the probability mass of a D-dimensional Gaussian distribution where D>>3 is all concentrated in a thin shell at a distance of sqrt(D) away from the origin, and the larger the value of D, the thinner that shell becomes. This is because the volume of the shell grows exponentially with D compared with the volume around the origin, and so with large D there is essentially zero probability that a point will end up near the center: Mr Average does not exist.

This can be seen on the following graph which plots a frequency histogram of the distance from the origin of Gaussian random vectors in D-dimensions. You can see that the distribution becomes more and more narrow.

From this graph you can see that if you have a high dimensional Gaussian random vector you can approximately normalize its length by simply dividing each element by sqrt(D), instead of actually dividing by the L2 norm.

Now lets consider the related property of points which are constrained to lie on the surface of a hypersphere. We might then ask how the properties of high dimensional spaces affect the distance between these points. If you plot the geometric distances between pairs of points on a hypersphere you find that the distances between them all tend to about sqrt(2) as the number of dimensions increases.

So with 100 dimensions any new point that you generate on this sphere will be most likely about sqrt(2) away from any other Gaussian distributed random point on the sphere.

Lastly let’s consider the cosine (dot product) distance between pairs of vectors going from the origin to random points on the surface. The cosine distance ranges from -1 where vectors point in opposite directions to 1 when they are exactly parallel. The distribution of cosine distances is shown below.

Here you can see that the cosine distance is distributed around zero and that this distribution gets narrower as the number of dimensions increases. What this means is that the random vectors are on average 90 degrees apart. This is consistent with the sqrt(2) geometric distance, which means that on average they are orthogonal. With higher numbers of dimensions the chances of generating a pair of vectors which share considerable extent along the same direction rapidly reduces.

These properties relate to the problem of matching of high dimensionality media descriptors in image, text or audio recognition. When you have a query descriptor vector and want to match into a database of many other descriptors, you typically compute the descriptor space distance. The shell property described here dictates that the distance from the query to most of the non-matching descriptors will be about the same. Only the closely matching descriptors will fall significantly within this shell. If you sort the distances from the query descriptor to the closest N candidate descriptor vectors, this distance graph will flatten out and in the limit become the shell distance. This can be used to establish the threshold bound for a matching process and discriminate between inliers and outliers.