This post is about fitting a curve through a set of points, a task called regression. It is also the classic machine learning problem of generalizing from a discrete training set: we have observations at specific places, and we want a system that predicts the likely observations at all places within the domain of interest. We use the given observations (a training set) to train a model, and then when we get more observations (a test set) we can evaluate how much error there is in our predictions.

The model has a set of parameters that can be adjusted to minimize the error on the training set. We then hope that these parameters also give a low error on the test set. The underlying structure of the model and its parameterization is typically something we make up. An example of a model for curve fitting is a polynomial, where the parameters are the coefficients of the polynomial terms.
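As a minimal sketch of this setup (the underlying quadratic, the noise level, and the sample positions below are all made-up assumptions), here is a polynomial fit by least squares:

```python
import numpy as np

# Hypothetical training data: noisy samples of an assumed underlying quadratic.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0.0, 0.05, x_train.shape)

# The model is a polynomial; its parameters are the coefficients,
# adjusted to minimize squared error on the training set.
coeffs = np.polyfit(x_train, y_train, deg=2)
model = np.poly1d(coeffs)

# Predict the likely observations at new places in the domain of interest.
x_test = np.array([0.25, 0.75])
print(model(x_test))
```

With ten points and little noise, the recovered coefficients land close to the ones that generated the data.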

It is well known that, with enough parameters, we can fit the training data with low or zero error yet end up with a large error on the test data: the over-fitting problem. If we think of the observations as being created by an unknown stationary process with added random noise, then over-fitting means forcing the curve through all the noisy points in the training data; when we run the test data we find that those fluctuations were irrelevant and should not have been accommodated in the fit. Choosing the right model complexity is therefore key; for polynomial fitting this is the degree of the polynomial, or the number of non-zero coefficients of a polynomial of arbitrary degree.
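The effect is easy to demonstrate. In this sketch (the sine-wave process, noise level, and sample counts are all assumptions for illustration), a degree-11 polynomial through 12 training points drives the training error toward zero while the test error blows up:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_samples(n):
    # Assumed unknown process: a sine wave plus Gaussian noise.
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, n)

x_train, y_train = noisy_samples(12)
x_test, y_test = noisy_samples(100)

results = {}
for degree in (1, 3, 11):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The degree-11 fit passes almost exactly through the noisy training points, which is precisely the over-fitting described above: the noise fluctuations get modeled as if they were signal.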

One may see this problem occurring in human thinking, when people seize onto a number of anomalies in, for example, the 9/11 WTC disaster, and extrapolate a conspiracy theory. In effect, they are trying to move to a higher-order model with more parameters. Noise in the data comes from uncertain facts about the event. Since it is a historical event one cannot easily re-examine evidence from the time or perform careful tests, and so outlandish evidence, which may just consist of incorrect observations, leads some people to apply a more complex model of the event than the conventional narrative. Choosing the simplest model amounts to an application of Occam's Razor, which tells us to avoid adding complexity unless it is necessary. Increased levels of paranoia reflect increased expectations of complex causes for the data. However, the facts supporting a more complex interpretation may nevertheless build up until moving to that model is warranted, particularly if we can establish that the observational noise is actually low.

Going back to curve fitting, this over-fitting problem arises because such fitting is an example of maximum likelihood estimation. If we instead use a full Bayesian formulation, which assigns probability distributions to the parameters without forcing a choice and also makes predictions based on these parameter distributions, then the model can naturally choose its own complexity.
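A minimal sketch of this idea is Bayesian linear regression over a polynomial basis. Here the prior precision `alpha`, the noise precision `beta`, the degree, and the sine-wave data are all assumptions chosen for illustration; the point is that we keep a Gaussian posterior over the coefficients rather than a single maximum-likelihood estimate, and the predictive variance folds in that parameter uncertainty:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, degree = 1.0, 25.0, 5   # assumed prior precision, noise precision

x_train = rng.uniform(0.0, 1.0, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 15)

Phi = np.vander(x_train, degree + 1)           # polynomial basis (design matrix)
S_inv = alpha * np.eye(degree + 1) + beta * Phi.T @ Phi
S = np.linalg.inv(S_inv)                       # posterior covariance of the weights
m = beta * S @ Phi.T @ y_train                 # posterior mean of the weights

# Predictive distribution at a new point: the weight distribution is
# averaged over rather than collapsed to a point estimate.
phi_new = np.vander(np.array([0.5]), degree + 1)
pred_mean = (phi_new @ m)[0]
pred_var = 1.0 / beta + (phi_new @ S @ phi_new.T)[0, 0]
print(pred_mean, pred_var)
```

Note that `pred_var` is always strictly larger than the observation-noise variance `1/beta`: the residual uncertainty about the parameters never fully disappears, which connects to the philosophical point below.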

If we know the signal-to-noise ratio, everything gets easier, because it means we know the noise model in advance. We can train our model until the error reaches the minimum implied by the known amount of noise, and we expect that when we check against the test set we will get an error rate matching our expectations from the noise model.
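One way to sketch this (the sine-wave process and the assumed-known noise level are again made up for illustration) is to grow the model only until the training error reaches the known noise floor, and stop there:

```python
import numpy as np

rng = np.random.default_rng(3)
noise_sigma = 0.1                                 # noise model assumed known in advance
x = rng.uniform(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_sigma, 40)

chosen = None
for degree in range(1, 15):
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    if mse <= noise_sigma ** 2:                   # reached the known noise floor: stop
        chosen = degree
        break
print("chosen degree:", chosen)
```

Stopping at the noise floor prevents the extra degrees from being spent fitting the noise, which is exactly the over-fitting failure mode from earlier.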

Noise in observations effectively hides the presence of a complex signal cause when the amount of data is limited. If the noise is high, we cannot observe all of the undulations of the underlying process to the extent that they remain below the noise floor, and we can only see a broad trend. We cannot tell from our finite training data if the cause can be simply modeled but has a large noise component, or if the cause has higher complexity and the noise is small.

I think that this is a fundamental philosophical limitation, and, to the extent that it is fixed by using a Bayesian approach, this is just because we have hidden the uncertainty away inside the PDFs over the parameter values. Such an approach is similar to quantum theory in physics, where all outcomes are possible unless we are forced to make an observation and choose the most likely values.

It is clear that humans vary a lot in how readily they will add parameters to their idea of what causes a particular situation given the same data; hence the variation between conspiratorial and non-conspiratorial thinkers. This distribution of mental predispositions reminds me of the exploration-versus-exploitation tradeoff in reinforcement learning or in natural systems, e.g., ant behavior: following the herd versus straying into uncharted territory.

As a last point, notice that there can be many different kinds of noise model. For example, we may assume that the observation noise is Gaussian and compare large or small amounts of Gaussian noise; this treats noise as a one-dimensional quantity that can only be larger or smaller. If the noise is Gaussian then we use least squares regression, because that is the solution to maximum likelihood estimation with a Gaussian noise source. However, the noise might appear to have a large variance because the data actually contains a lot of outliers: the probability density function of the noise source could be narrow with long tails, giving mostly correct samples in our training data plus a few bogus outliers, and potentially introducing more complexity into model selection. Noise models can have more degrees of freedom than is commonly assumed.
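The contrast shows up even in a tiny sketch. Below, the same linear data (slope 2, with one deliberately bogus point as an assumed outlier process) is fit two ways: least squares, the maximum likelihood answer under Gaussian noise, gets pulled well away from the true slope, while a median-based estimator (Theil-Sen, used here as one illustrative robust alternative) is barely affected:

```python
import numpy as np
from itertools import combinations

# Clean linear data y = 2x + 1, with one gross outlier injected.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[9] = 100.0                                    # one bogus measurement

slope_ls = np.polyfit(x, y, 1)[0]               # least squares slope (Gaussian MLE)

# Theil-Sen: the median of all pairwise slopes ignores the long tail.
pairwise = [(y[j] - y[i]) / (x[j] - x[i]) for i, j in combinations(range(10), 2)]
slope_ts = float(np.median(pairwise))

print(slope_ls, slope_ts)
```

Under a genuinely Gaussian noise source, least squares is the right choice; the point is that a long-tailed source silently violates that assumption, and the estimator's sensitivity to the violation is part of the model.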

It is clear from this that, when we think of the observations as consisting of a signal component added to a noise component, we should give equal importance to modeling both the signal process and the noise process. This means that choosing the parameterization of the noise is subject to the same over-fitting and model selection issues as exist for the signal. We need to estimate the noise PDF at the same time as the signal model and this PDF could be arbitrarily complex.

It is useful that nature often throws Gaussian noise at us, but this can be contaminated with measurements that are just plain wrong according to some outlier process, and these will significantly bias the results if they are not modeled. For example, consider finding the average age of pupils in a class. Most entries range from 12.4 to 13.3 years old, but one entry is 23. The ages should perhaps follow a uniform PDF, but the 23 could be a typo (a 2 instead of a 1), or whoever compiled the list could have included the teacher.
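A quick numerical sketch of that example (the individual ages below are invented; only the 12.4 to 13.3 range and the stray 23 come from the scenario above) shows how the single contaminated entry drags the mean outside the plausible pupil range while the median barely moves:

```python
# Hypothetical class register: plausible pupil ages plus one bogus entry.
ages = [12.4, 12.6, 12.8, 12.9, 13.0, 13.1, 13.2, 13.3, 23.0]

mean = sum(ages) / len(ages)
median = sorted(ages)[len(ages) // 2]

# The mean lands above every plausible pupil age; the median stays inside the range.
print(mean, median)
```

Dropping or down-weighting the 23 is only justified once we have a model for the outlier process (a typo, or the teacher on the list), which is the point: the noise process deserves modeling too.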