The Bias-Variance Trade-off

The bias-variance trade-off is a fundamental principle at the heart of statistics and machine learning.  You may have already seen its various faces, for example in this equation:

\text{MSE}\left(\hat{\theta}\right) = \left[\text{Bias}\left(\hat{\theta}\right)\right]^{2} + \text{Var}\left(\hat{\theta}\right)

or in linear regression analysis, where omitting relevant explanatory variables biases the remaining coefficients, while including correlated explanatory variables inflates the standard errors of the coefficients.

Those, however, were all concrete examples of a much broader phenomenon that I want to discuss here.

Intuition Behind Bias, Variance and Noise.

First, some intuition.

Recall that the aim of any statistical learning is as follows:
  • Assuming that the relationship between the x-s and y-s is perfectly described by some stochastic function y = f\left(x\right)+\epsilon, where \epsilon is random noise with zero mean and variance \sigma^{2}.
  • And given a sample/training set S_{1} = \{x_{1}, \text{...}, x_{n}\} and response values y_{i} associated with each point x_{i} from S_{1}.
  • Use clever algorithms and statistical methods to find a function \hat{f}\left(x\right) that is “the best possible approximation” of y = f\left(x\right)+\epsilon.

By “the best possible approximation” we mean that on average if we take any new sample point z that is in the population X but not in the original sample S_{1}, and plug it into \hat{f}, then the value predicted by our learned function \hat{y}_{z} = \hat{f}\left(z\right) will be as close as possible to the real value of y_{z} = f\left(z\right)+\epsilon.

Almost certainly, whatever \hat{f} the learning algorithm comes up with, it will differ from the actual f. While f itself is fixed, \hat{f} is an estimator of f, obtained by running the learning algorithm on the random sample S_{1}.  If we were to take a new random sample S_{2} from X and repeat the whole process using exactly the same learning algorithm, we would get a slightly different estimator \hat{f_{2}} of f. And for a third re-sample S_{3}, we would get yet another slightly different version \hat{f_{3}}, and so on.

Thus, \hat{f} is a random estimator of the non-random f, and it will be different each time, depending on the random sample S_{1}, S_{2}, S_{3} that was used as the training set.  And, just like any estimator, \hat{f} has a mean and variance.  The mean of \hat{f} is the “average” that all \hat{f_{i}}-s, learned from all different training sets S_{i}, are centered around. The variance of \hat{f} is how dispersed these \hat{f_{i}}‘s are around this mean.

Note that the mean of \hat{f} may or may not be f itself. It could well be that even though \hat{f} is meant to estimate f, the average of all \hat{f_{i}}‘s lies “slightly to the side” of f.  In such a case \hat{f} is called a biased estimator of f.

Also, the variance of \hat{f} could be high or low, i.e. \hat{f} could be widely dispersed around its mean or tightly concentrated around it.
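To make this concrete, here is a minimal simulation sketch of the resampling story above. Everything in it is an illustrative assumption rather than anything implied by the discussion: a sine-shaped truth f, Gaussian noise, a straight-line least-squares fit as the learning algorithm, and an arbitrary test point x_{0}. It repeatedly draws a fresh training set S_{i}, fits \hat{f_{i}}, and then looks at the mean and spread of the predictions \hat{f_{i}}\left(x_{0}\right):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The fixed, non-random truth (an illustrative choice, not from the post).
    return np.sin(2 * np.pi * x)

sigma = 0.3                      # standard deviation of the noise epsilon
n, n_resamples, x0 = 30, 2000, 0.25

preds = []
for _ in range(n_resamples):
    # Draw a fresh training set S_i and its noisy responses y_i.
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    # "Learning algorithm": fit a straight line by least squares.
    b1, b0 = np.polyfit(x, y, deg=1)
    preds.append(b0 + b1 * x0)   # this resample's prediction f_hat_i(x0)

preds = np.array(preds)
print("mean of f_hat(x0):    ", preds.mean())   # centre the f_hat_i scatter around
print("true f(x0):           ", f(x0))          # the gap between the two is the bias
print("variance of f_hat(x0):", preds.var())    # how dispersed the f_hat_i are
```

In a run like this, the straight-line model cannot follow the sine shape, so the mean prediction sits away from f\left(x_{0}\right) (bias), and the predictions also scatter from one resample to the next (variance).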

Here is a graphical representation of bias and variance in 2D:

[Figure: bias and variance in 2D]

So far, we have made the point that \hat{f}, the function that was fitted or learned from the sample set S_{1} and the corresponding response values \{y_{1},\text{...},y_{n}\}, will be slightly different from the actual f.

However, even if \hat{f} matched f perfectly (we are talking zero bias and zero variance here!), \hat{f}\left(x\right) would still not predict y = f\left(x\right)+\epsilon perfectly, because of the random \epsilon.

Thus, when predicting y, \hat{f} will likely be off for three reasons:
  • Firstly, \hat{f} is an estimator of f that was obtained on the basis of random S and therefore has variance, i.e. will be different from S_{i} to S_{j}.
  • Secondly, even the average of these \hat{f}-s may be off from the real f, due to bias.
  • Thirdly, even if we were able to learn f perfectly, there is a random noise component \epsilon in y = f\left(x\right)+\epsilon which makes it impossible to predict y perfectly even when \hat{f} matches f exactly.

Mathematics Behind Bias, Variance and Noise.

OK, so now let’s go over the same thinking as above but with proper maths.

We talked about \hat{f} being “on average a good approximation” of f + \epsilon.  There are many ways to define what this “on average a good approximation” statement could mean, but the most conventional way is to use the mean squared error (MSE):
\text{MSE}\left(\hat{f}\right) = \mathbb{E}\left[\left(y-\hat{f}\right)^{2}\right]

We would then make the following observations:

  • Since f is deterministic, \mathbb{E}\left[f\right] = f.
  • Given y=f+\epsilon and \mathbb{E}\left[\epsilon\right]=0, we have \mathbb{E}\left[y\right]=\mathbb{E}\left[f+\epsilon\right]=\mathbb{E}\left[f\right]=f.
  • Given that for any random variable X we have \text{Var}\left[X\right] = \mathbb{E}\left[X^{2}\right] - \mathbb{E}\left[X\right]^{2}, and therefore \mathbb{E}\left[X^{2}\right] = \text{Var}\left[X\right] + \mathbb{E}\left[X\right]^{2}, and given that \text{Var}\left[\epsilon\right]=\sigma^{2}, we have:
\text{Var}\left[y\right]=\mathbb{E}\left[\left(y-\mathbb{E}\left[y\right]\right)^{2}\right]=\mathbb{E}\left[\left(y-f\right)^{2}\right]=\mathbb{E}\left[\left(f+\epsilon-f\right)^{2}\right]=\mathbb{E}\left[\epsilon^{2}\right]=\text{Var}\left[\epsilon\right]+\mathbb{E}\left[\epsilon\right]^{2}=\sigma^{2}

Thus, since \epsilon and \hat{f} are independent, we can write

\begin{aligned}
\mathbb{E}\left[\left(y-\hat{f}\right)^{2}\right] &= \mathbb{E}\left[y^{2}+\hat{f}^{2}-2y\hat{f}\right] \\
&= \mathbb{E}\left[y^{2}\right]+\mathbb{E}\left[\hat{f}^{2}\right]-\mathbb{E}\left[2y\hat{f}\right] \\
&= \text{Var}\left[y\right]+\mathbb{E}\left[y\right]^{2}+\text{Var}\left[\hat{f}\right]+\mathbb{E}\left[\hat{f}\right]^{2}-2f\mathbb{E}\left[\hat{f}\right] \\
&= \text{Var}\left[y\right]+\text{Var}\left[\hat{f}\right]+\left(f^{2}-2f\mathbb{E}\left[\hat{f}\right]+\mathbb{E}\left[\hat{f}\right]^{2}\right) \\
&= \text{Var}\left[y\right]+\text{Var}\left[\hat{f}\right]+\left(f-\mathbb{E}\left[\hat{f}\right]\right)^{2} \\
&= \text{Var}\left[y\right]+\text{Var}\left[\hat{f}\right]+\mathbb{E}\left[f-\hat{f}\right]^{2} \\
&= \sigma^{2}+\text{Var}\left[\hat{f}\right]+\text{Bias}\left[\hat{f}\right]^{2}
\end{aligned}

The three terms represent:
  • the square of the bias of \hat{f} (the bias of \hat{f} is how far away the mean of \hat{f} is from f);
  • the variance of \hat{f} (i.e. how much \hat{f} moves around its mean);
  • the variance of the noise, \sigma^{2}.

In other words:

Prediction Error = Error due to Noise + Error due to Variance + Error due to Bias
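As a sanity check, this decomposition can be verified numerically at a single point. The sketch below reuses the same illustrative toy setup as before (a sine truth, Gaussian noise with \sigma = 0.3, and a straight-line fit – all assumptions made for the demonstration only) and compares the simulated \mathbb{E}\left[\left(y-\hat{f}\right)^{2}\right] at x_{0} with \sigma^{2}+\text{Var}\left[\hat{f}\right]+\text{Bias}\left[\hat{f}\right]^{2}:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)        # illustrative truth (an assumption)
sigma, n, trials, x0 = 0.3, 30, 5000, 0.25

preds, sq_errors = [], []
for _ in range(trials):
    # Fit f_hat on a fresh training sample...
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    b1, b0 = np.polyfit(x, y, 1)
    f_hat_x0 = b0 + b1 * x0
    # ...and test it against a fresh noisy response at x0.
    y_new = f(x0) + rng.normal(0, sigma)
    preds.append(f_hat_x0)
    sq_errors.append((y_new - f_hat_x0) ** 2)

preds = np.array(preds)
bias_sq = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print("simulated E[(y - f_hat)^2]:", np.mean(sq_errors))
print("sigma^2 + Var + Bias^2    :", sigma**2 + variance + bias_sq)
```

The two numbers should agree up to simulation error.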

Reducible and Irreducible Components.

Let’s look at each of these three error components separately.

We can’t do much about the error due to noise.  This error was introduced during the framing of the problem when we assumed that the true model has a stochastic term \epsilon, so we just have to live with it.  We call this the irreducible part of the prediction error.

We can however try to do something about the other two components of the error – ones due to bias and variance.

Concretely, we can try various learning algorithms to come up with an estimator \hat{f} of f that has as low a bias and variance as possible.  This will give us the lowest possible average prediction error on fresh data points x' (where x' is in X but not in S). We therefore call the error due to bias and variance the reducible part of the prediction error.

The Bias-Variance Trade-off

Well, as it turns out, by exploring different learning algorithms and functional forms of f we can improve our model more and more by reducing both bias and variance at the same time…   but only up to a certain point.  Once we reach that minimal point the prediction error cannot be reduced further. From here on any decrease in bias will necessarily come with a corresponding increase in variance and vice versa.

[Figure: the bias-variance trade-off]

This phenomenon is called the “bias-variance trade-off” and is a fundamental result that appears in many shapes and forms throughout statistical modelling.

Note that at this point the prediction error may still be well above the irreducible component, i.e. the prediction error is not just due to the \epsilon stochastic term but also due to some remaining variance and bias.  However, the remaining so-called reducible error cannot be reduced further due to the bias-variance trade-off.  (The term “reducible error” may start to seem misleading now, but bear in mind that this error could be reduced up until this point; it just cannot be reduced any further from here on.)

The bias-variance trade-off phenomenon places a hard limit on what can be achieved through statistical learning. No matter what learning algorithm we try for a particular problem, we will always eventually hit a barrier where all further gains in one component of prediction error (bias or variance) are necessarily offset by losses in the other (variance or bias).
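One way to see this barrier concretely is to sweep model complexity and watch the two error components move against each other. The sketch below is illustrative only: it assumes the same toy sine truth and noise level as the earlier sketches, uses polynomial degree as a stand-in for model complexity, and estimates bias², variance and total expected error over a grid of test points:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)        # illustrative truth
sigma, n, trials = 0.3, 40, 500
x_test = np.linspace(0.05, 0.95, 50)       # grid of fresh test points

for degree in [1, 3, 5, 7, 9]:             # model complexity, low to high
    preds = np.empty((trials, x_test.size))
    for t in range(trials):
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        coefs = np.polyfit(x, y, degree)   # fit a polynomial of this degree
        preds[t] = np.polyval(coefs, x_test)
    bias_sq = ((preds.mean(axis=0) - f(x_test)) ** 2).mean()
    variance = preds.var(axis=0).mean()
    print(f"degree {degree}:  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"total={bias_sq + variance + sigma**2:.4f}")
```

In a run like this, bias² falls and variance rises as the degree grows, and the total error typically traces out the U-shape described above, with its minimum at some intermediate complexity.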

True Bias and Variance Error Terms Can’t Be Measured in Reality

I should point out that in reality we often can’t actually calculate the real bias or variance error terms in \text{MSE}\left(\hat{f}\right), because we do not know the actual function f, only its estimate \hat{f} obtained on the basis of the random sample S.

What we can do, however, is either of the following.

If we want to study in theory how a particular model fits a particular type of theoretical population distribution, we can use our assumed theoretical model of the population
y=f\left(x_{i}\right)+\epsilon
and Monte-Carlo simulation to generate various members of that population. We can then see what models \hat{f} are produced by various fitting methods and, most of the time, calculate the bias and variance of these models. This approach allows us to answer theoretical questions where the population distribution is known or assumed and we are looking to understand how well a particular model approximates it, including the bias and variance that the model carries with it.

If, on the other hand, we want to measure how a particular learned model fits a real life population of which we only have a sample S and no knowledge of the real f, then we need to resort to re-sampling techniques such as cross-validation, jackknife or bootstrapping – all of which I have described in detail in this post here and here. These approaches allow us to estimate the bias or variance of the proposed model, rather than to calculate it. It would be impossible to calculate the real bias and variance because we don’t know the real distribution of the population and aren’t assuming anything about it, but instead are restricted to using the random sample S as an imperfect proxy for the population.
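For this second, sample-only setting, a minimal sketch of the idea might look as follows. It uses scikit-learn, and everything about the data is a placeholder (the X and y below are synthetic stand-ins for the one real sample S we would actually have): cross-validation estimates the out-of-sample MSE of a proposed model, and a simple bootstrap gauges how much the fitted model varies when the sample is perturbed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

rng = np.random.default_rng(3)

# Placeholder stand-in for the single sample S we actually have.
X = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 200)

model = LinearRegression()

# 10-fold cross-validation estimate of the out-of-sample MSE.
cv_mse = -cross_val_score(model, X, y, cv=10,
                          scoring="neg_mean_squared_error").mean()
print("cross-validation estimate of prediction MSE:", cv_mse)

# Simple bootstrap: refit on resampled versions of S to gauge how much
# the fitted model (here, its slope coefficient) varies across samples.
slopes = [LinearRegression().fit(*resample(X, y)).coef_[0] for _ in range(500)]
print("bootstrap standard deviation of the fitted slope:", np.std(slopes))
```

Both numbers are estimates built from S alone; neither requires knowing the true f.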

Bias vs. Variance and Model vs. Training Set

As it turns out, bias and variance of a model are closely related to how complex the model is or how much the model assumes about the underlying population distribution.

The error due to bias is the amount by which the model prediction \hat{y} on average differs from the true value y.  Thus, one way of thinking about bias is to view it as the error component that is there because of the model’s shortcomings.

For example, if the truth is non-linear and we try to use a linear model to approximate it, we are introducing a bias due to the linear model’s inability to fully capture the subtleties of the non-linear truth and to approximate it well, i.e. we have “underfit” the model. In other words, bias is introduced at the model selection stage and is due to the (usually simplifying) assumptions made by the model. High bias suggests that \hat{f} makes strong assumptions about the form of f, while low bias suggests weaker assumptions about the form of f.

Variance measures how dispersed the predictions are from one another, over different training sets, rather than whether they are accurate on average.

If the true relationship y=f+\epsilon is approximately linear with some noise and yet we chose a highly non-linear model \hat{f} because it fit the training set S really well and even explained some of the noise, then we are introducing variance – the non-linear \hat{f} will not generalize well to data outside of the sample S due to the model’s inability to be linear where it should be, i.e. we have “overfit” the model.

Therefore, high variance is typical of situations where a change in the training set S will greatly change the estimate \hat{f} of f, while low variance is typical of situations where \hat{f} is more resilient to changes in S.

Thus, bias is introduced through over-commitment to a particular type of model or functional form (e.g. linear) and thereby “underfitting” with respect to the data, whereas variance is introduced into the model through over-commitment to the data points in the training set and thereby “overfitting”.


Indeed, if we consider all the machine learning methods we have covered so far and where they sit on the bias-variance trade-off, we see the following landscape:

  • Low-bias algorithms: Decision Trees, k-Nearest Neighbors and Support Vector Machines.
  • High-bias algorithms: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
  • Low-variance algorithms: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
  • High-variance algorithms: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

There is clearly a pattern here. Parametric or linear methods tend to have a high bias and low variance, whereas non-parametric or non-linear methods tend to have a low bias but a high variance.

This observation is consistent with the point made earlier – bias occurs where the model is parametric, with strong assumptions about the functional form of f, whereas variance occurs where the model is non-parametric and driven strongly by the training set.

Controlling the Bias-Variance Trade-off

There is no escaping the bias-variance relationship.  Instead, there are methods for finding different balances along the bias-variance trade-off continuum – balances that may be appropriate to this or that particular problem at hand.

General Methods for Controlling the Bias-Variance Trade-off
  • Using fewer features increases bias but decreases variance. Earlier in this post I’ve given an intuitive explanation of why this is the case in general. There are also concrete proofs of this fact for specific model types. For example, introductory courses in linear regression often cover the proof that omitting relevant predictors from a linear regression model makes the remaining coefficient estimates biased, while including too many correlated predictors increases the variance of the estimates.
  • Dimensionality reduction also increases bias but decreases variance. The argument is identical to the case of using fewer features.
  • Larger training sets decrease variance.  The detailed proofs of this fact vary from model type to model type, but they are almost always due to the fact that the models are obtained by doing some kind of averaging over the training set and, by the law of large numbers, the more points we average over, the less variance our averages display.
  • Using ensemble learning or model averaging, such as bagging or boosting, also helps control the trade-off – bagging, for example, reduces variance by averaging many high-variance learners (see the sketch after this list).
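As promised, here is a hedged sketch of the bagging point. It uses scikit-learn's DecisionTreeRegressor and BaggingRegressor on the same illustrative sine-plus-noise setup as before (all data and parameter choices are assumptions for the demonstration), and compares how much a single tree's prediction at a fixed point varies across training sets with how much the bagged ensemble's prediction varies:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(5)
f = lambda x: np.sin(2 * np.pi * x)          # illustrative truth
sigma, n, trials = 0.3, 100, 200
x0 = np.array([[0.25]])                      # fixed test point

single_preds, bagged_preds = [], []
for _ in range(trials):
    # A fresh training set for each trial.
    X = rng.uniform(0, 1, (n, 1))
    y = f(X[:, 0]) + rng.normal(0, sigma, n)
    # A single fully grown tree: a low-bias, high-variance learner.
    tree = DecisionTreeRegressor().fit(X, y)
    # Bagging: average 50 trees fit on bootstrap resamples of the same set.
    bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50).fit(X, y)
    single_preds.append(tree.predict(x0)[0])
    bagged_preds.append(bag.predict(x0)[0])

print("variance of the single tree's prediction:", np.var(single_preds))
print("variance of the bagged prediction:       ", np.var(bagged_preds))
```

In a run like this the bagged prediction varies noticeably less across training sets, while its bias changes little – exactly the kind of rebalancing the list above describes.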


There are also methods that are specific to particular model types. For example, pruning can be used to decrease variance in decision trees, while LASSO and ridge regularization do the same for linear models.
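As a hedged illustration of the linear-model case, the sketch below fits scikit-learn's Ridge on hypothetical nearly-collinear data (both the data-generating choices and the alpha grid are assumptions) and shows the usual pattern: as the regularization strength alpha grows, the coefficient estimates vary less across bootstrap resamples but drift away from the true values, i.e. variance is traded for bias:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.utils import resample

rng = np.random.default_rng(6)

# Hypothetical data with two nearly collinear predictors; true coefficients are (1, 1).
n = 80
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)      # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=n)

for alpha in [0.01, 1.0, 10.0]:              # increasing regularization strength
    # Refit the ridge model on bootstrap resamples and track its coefficients.
    coefs = np.array([Ridge(alpha=alpha).fit(*resample(X, y)).coef_
                      for _ in range(500)])
    print(f"alpha={alpha:5.2f}  mean coefs={coefs.mean(axis=0).round(2)}  "
          f"coef variances={coefs.var(axis=0).round(3)}")
```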


All of these methods allow us to control variance or bias but regardless of the method, there comes a point after which all gains in one are necessarily offset by losses in the other.

