# Re-sampling Pt.1: Cross-Validation

Today and over the next three posts we will talk about re-sampling methods, which is a family of approaches to synthesizing multiple data samples from one original data set.

There are a number of reasons why you may want to do that and a number of ways in which you could do that. Specific re-sampling methods all differ in the way they generate new samples and subsequently in their computational complexity and bias-variance trade-off.  Because of this, specific re-sampling methods differ in their suitability for various specific purposes.

In this post I will talk about re-sampling in general as well as about one specific re-sampling approach – cross-validation – a method used in machine learning for estimating predictive abilities of models and selecting the “best” predictive models.  In the next post I will talk about the jackknife and bootstrapping – two re-sampling methods from the more “old-school” statistics background, both used for deriving parameter estimates (as well as their bias and standard errors) for learned models.  In the last post on re-sampling, we will talk about model averaging, which is actually a separate topic in itself but often relies heavily on re-sampling.

### Why Re-sampling?

Why would we want to synthesize many datasets from one given dataset?  There are many nuanced reasons, but at high intuitive level the thinking goes as follows.

Firstly whenever we are learning a predictive model, we want to be able to see how well it predicts. Ideally we want to test this predictive ability on a brand new dataset that is completely separate from the data set that we used for learning the model. Thus, we already need at least two sample datasets from the same population – one for learning the predictive model and a separate one for testing its predictive ability.

Secondly, instead of just using one dataset to train the model and one dataset to validate the model, we may wish to use multiple datasets to train the model multiple times and multiple datasets to validate the model many times.  This would allow us to average across many learned models and to average across many test results.

So there are a few reasons why we may need multiple sample sets from the same population in order to be better at building and testing models. In reality however, we don’t usually get to have as many datasets as we want. Often data is expensive or hard to get or rare and we just have to figure out creative ways to come up with multiple datasets despite these limitations. And this is where re-sampling helps.

Ok, so by now this all probably sound sensible enough but a bit vague.  It’s time to look at concrete re-sampling methods – cross-validation (this post), jackknife and bootstrap (next post) to illustrate more concretely how these aims are achieved in practice.  As they say, an example is worth a thousand definitions.

### Cross-validation in General

Cross-validation is a re-sampling method whereby subsets of the full dataset are held out for use as validating sets.

A model is fit to part of the original dataset, known as the training set and we then validate the model to see how well it predicts for the remaining part of the dataset, the validation set or test set.  This allows us to estimate test error rate of a proposed model and subsequently to compare multiple models in order to pick one with the lowest test error rate, i.e. to find the best predictive model.

The splitting into training and validation sets can be done just once – approach known as hold-out cross-validation.  Alternatively, the process can be repeated multiple times, each time splitting the set into two random subsets, one for training one for testing, and averaging the quality of the predictions across the multiple validation sets – this is what is done in leave-one-out cross-validation (LOOCV) and in k-fold cross-validation. All of these methods are described in detail further down.

Note that while the multiple repeat approaches such as LOOCV and k-fold cross-validation can become computationally taxing, thesedays due to cheap computing power this usually isn’t a serious concern.  Moreover, repeated re-sampling, re-training and re-testing can allow us to achieve more “averaged out” results and place us at a more advantageous point in the bias-variance trade-off continuum for a particular problem at hand.  For these reasons, it is often attractive to employ leave-one-out or k-fold cross-validation methods over the more primitive hold-out cross validation.

Below I discuss in detail these three cross-validation approaches and their associated computational complexities and bias-variance trade-offs.

### Comparison of Specific Cross-validation Methods

#### Method #1: Hold-Out / Validation Set Approach

##### How It Works

We take the entire data sample $S$ at our disposal and randomly split it into two disjointed sets – the training set $S_{train}$ and the test set or validation set $S_{v}$.

We then use $S_{train}$ to learn the model that we think tells us something about the underlying population $P$ that $S$, and subsequently $S_{train}$ and $S_{v}$, are samples of.

Having learned the model from $S_{train}$ , we can now use $S_{v}$ to validate how well the model performs on a dataset that is not $S_{train}$ but comes from the same population $P$ that $S_{train}$ came from.  The average validation error serves as an estimator of the true prediction error that the model will exhibit when applied to new data.

Opinions vary on the relative sizes of the two sets, but common consensus is that the training set should contain 70% of the data points, while the validation set should contain the remaining 30%.

Also, it goes without saying that the assignment of data points between the training and validation sets should be random.  If, for example, the full dataset is sorted in ascending order by a particular feature $X$ and we place the first 70% of records into the training bucket and the remaining 30% into the validation bucket then our model will be trained on “smaller” values of $X$ and validated against larger values of $X$ – hardly a good experimental design approach.

##### Computational Complexity: Low

Hold-out cross-validation algorithms are simple to implement and aren’t computationally intensive to run.  We split the dataset once, train the model on the training set once and validate on the validation set once.

##### Bias-Variance trade-off: High Bias, High Variance

Note that our training set is substantially smaller than it could have been had we used the entire dataset without sacrificing some of its members to be in the validation set.

This has the following consequences:

• Validation error, as an estimate of test error, has high bias. In intuitive terms, this is true because the model has been fit on fewer observations (for example only 70% of the full dataset $S$) and is generally a poorer fit that a model that would have been fit on the full dataset $S$. Therefore validation set error will be higher than the real test error that the full model (i.e. model trained on the full dataset $S$) would have achieved on the test set. In other words, the as an estimate of the real test error, the validation error over-estimates test error. Thus, the estimate of test error has bias – on average the test error estimated by this means will be higher than the real test error of the full model fit on full data set and tested against a test set.
• Validation error, as an estimate of test error, has high variance. Intuitively speaking, this is true because validation error is an estimate obtained from a random sub-sample of $S$ that has a relatively small number of data points (usually only 30% of datapoints of $S$) and, as we know, the smaller the sample size used in constructing an estimator, the higher the variance of the estimator.  As an estimate obtained from a smaller number of data points it has higher variance.
##### Conclusion

The hold-out validation method is easy to understand and to implement and it seems like a great first step in defining a proper build-test framework for our model learning.  It may also be good enough for some rapid rough-and-ready exploratory model prototyping.  But it suffers from two serious drawbacks.

Firstly, if we don’t have a huge dataset then we may not be able to “afford the luxury” of splitting the dataset into two smaller subsets that are still sufficiently large for training and testing.  This problem, by the way, gets especially severe when we have many data fields – see my post on the curse of dimensionality.

Secondly, because we only split once and only take one shot at training the model, the holdout estimate of the test error will be too dependent on any irregularities that the one-off split may have accidentally had in it.

#### Method #2: Leave-one-out Cross-validation

##### How It Works

In leave-one-out cross-validation (abbreviated as LOOCV) we repeatedly train and evaluate the model by leaving just one element out of our dataset S.

More concretely, for a sample dataset $S$ of size $n$, the algorithm is as follows:

• Temporarily remove from $S$ the 1-st element $s_{1}$ and train the model on the remaining dataset $S-\{s_{1}\}$.  Now use $s_{1}$ to test the resulting model and record the test error $E_{t_{1}}$
• Temporarily remove from $S$ the 2-nd element $s_{2}$ and train the model on the remaining dataset $S-\{s_{2}\}$.  Use $s_{2}$ to test the resulting model and record the test error $E_{t_{2}}$
• Repeat $n$ times, each time removing $s_{i}$ from $S$, training the model on $S-\{s_{i}\}$ and testing on $s_{i}$, recording the test error $E_{t_{i}}$
• Upon completion of these $n$ iterations, take the average of all the test errors $E_{t_{i}}$ for $1 \leq i \leq n$ and use that as the test error for the model.

So essentially, we repeat the hold-out cross-validation method $n$ times where on each iteration we pick the $i$-th element $s_{i}$ of $S$ and use $S-s_{i}$ as the training set and $s_{i}$ as the (singleton) validation / test set.  We then take the average across the test errors returned by each iteration and use that as the final estimator of the true test error.

##### Computational Complexity: High

We have to fit the model $n$ times, where each time the fitting of the model can be quite computationally heavy itself.  There are specific cases where there are shortcuts – eg. for OLS linear regression we can use a formula that reuses a lot results from run to run, but generally speaking, for most other cases there are no computational shortcuts.

##### Bias-variance Trade-off:  Low Bias, High Variance
• Validation error, as an estimate of test error, has low bias.  The training set $S_{train}$ is almost the same size $n-1$ as full set $S$, thus the model $M_{S_{train}}$ learned from $S_{train}$ has been fit on almost the same number of observations as the model $M_{S}$ that would have been fit on the entire $S$.  In this sense, $M_{S_{train}}$ is “not much less accurate” than $M_{S}$ and the validation set error produced by $M_{S_{train}}$ won’t be much higher than the real test error produced by $M_{S}$.  In other words the bias of the validation set error as an estimate of the test error is low-to-negligible – the validation set error may over-estimate test set error but not by much.
• Validation error, as an estimate of test error, has high variance. The high level reasoning goes as follows. Estimated test error is the mean of prediction errors for each of the $n$ cross-validation passes.  While the left out singletons are all independent, the models that are build on $n-1$ sample points overlap in all but one point and are therefore highly positively correlated.  Therefore the variance of the validation error (i.e. the variance of the mean of prediction errors for each of the $n$ cross-validation passes) will have as one of its terms the variance of the sum of these modeled outputs.  You will recall that the variance of the sum of ay two random variables $A$ and $B$ is: $\text{Var}\left(A+B\right) = \text{Var}\left(A\right)+\text{Var}\left(B\right)+2\text{Cov}\left(A,B\right)$ and the $2\text{Cov}\left(A,B\right)$ term is high if the two variables are highly positively correlated.

#### Method #3: k-fold Cross-validation

##### How It Works

In k-fold cross-validation we divide our total dataset $X$ into k subsets or folds, and the holdout method is repeated k times.

Each time, we temporarily remove one of the k folds and use the leftover subset of $S$ made up of the remaining k-1 folds as the training set, while using the the removed fold as the test / validation set.  Then, upon completion of these k iterations, we take the average error across all k trials.

Note that we are free to chose any k such that $1\leq k\leq n$ where $n$ is the size of $S$.

In fact, you may notice that LOOCV described earlier is just a special case of k-fold where k=n. At the same time though, the hold-out cross-validation is not 2-fold cross-validation.  Hold-out cross-validation requires fitting once (using the first subset for training) and validating once (using the 2nd subset as validation), while 2-fold cross-validation would require fitting twice – first use the first subset to train and second to validate, then swapping and using the second subset to train and the first to validate.

By varying k, one can control the level of computational complexity involved in k-fold cross -validation (the higher k, the more iterations the algorithm requires) as well as the variance of the estimated test error (the higher k, the more correlated the training sets used in each iteration, resulting in higher variance of test error estimator). Common practice is to set k=10 or k=5.

##### Computational Complexity: Medium-Low

We now have to re-fit the model k times (where usually k=10 or k=5) rather than n times (where n can be in the thousands or higher) that we had to re-fit it in LOOCV.  Thus, the computational complexity of k-fold cross-validation falls somewhere between that of LOOCV and hold-out validation set method.  In fact, given that n is likely to be much larger than k, the reduction in computational complexity is very substantial and we are much closer to hold-out validation set’s complexity than to LOOCV’s.

##### Bias-Variance Trade-off:  Medium-to-low Bias, Medium-to-low Variance
When it comes to the bias-variance trade-off, k-fold cross-validation offers a happy medium between the hold-out validation set and the LOOCV approaches.  More concretely, the validation set error obtained via k-fold cross-validation, when viewed as an estimator of the true test error, has the following nice properties:
• Bias is not zero, but still reasonably low.  This follows from the same reasoning that we used in understanding the validation set error bias for hold-out validation and LOOCV approaches and from the fact that the models are now trained on datasets that are of size n-k, which is not as large as n-1 but still considerably larger than 70%*n.
• Variance is medium.  This follows from the same reasoning that we used for LOOCV but now that the training sets differ by k points rather than by 1 point and are therefore not as highly positively correlated, but still are fairly correlated because the n-k data points are still the same between any two training sets.
##### Conclusion

The k-fold cross-validation method is slightly more computationally taxing than the hold-out method but produces a much better estimate of the test error.  At the same time, it is far less computationally intensive than the LOOCV method and the resulting test error estimator is more efficient (lower variance) albeit with a slight bias when compared to the LOOCV one. Here is an influential paper that compares LOOCV, k-fold and bootstrapping to conclude that under reasonable assumptions k-fold cross-validation with k=10 is the best practical approach for estimating test errors and evaluating models: https://pdfs.semanticscholar.org/0be0/d781305750b37acb35fa187febd8db67bfcc.pdf

### How Cross-validation Is Actually Used in Practice

In all of the discussions above we talked about the validation test error as being the main “output” of the cross-validation process.  We run a cross-validation, whether it is LOOCV or k-fold, and we then look at the validation set error as an estimator (with accompanying bias and variance) of the real test error that the model would produce “in the wild”.

In practice however, the envelope is pushed one step further.  Now that we can use cross-validation to estimate the test error that (for a given problem and a given dataset) a particular learned model gives, we can also use cross-validation to try learning different models from the training set (these different models could be either models of the same model type but with different parameters, or models of different model types altogether!), get these models’ estimated test errors and ultimately pick the one “best” model that gives us the lowest estimated test error.

Thus, the real value of cross-validation is not in evaluating the predictive ability of one particular model, but in allowing us to build and compare multiple candidate models in a consistent and meaningful way and to select the best model.

For this purpose, we often actually split our original sample dataset into three subsets, rather than two as I was using earlier.  We first remove a small part (usually 20%) to be our independent test set.  We then take the remaining 80% of the dataset and use it in cross-validation.  This would mean, for example, that if we do hold-out validation we would subsequently split it into another 70/30 training and validation set, or if we do k-fold we just divide these 80% into the k folds, etc.  Thus, in the 80% we have the training set (either the 70% of 80% in hold-out validation, or the k-1 folds in k-fold) for learning the model and the remaining independent cross-validation set (the remaining 30% or the 80% in hold out, or the 1 fold in k-fold) for estimating the test error.  We subsequently select the best model that cross-validation suggests to us based on the 80% of data.  We then (and this is the new step), run that model on the 20% and see what the real test error is, i.e. how well the model predicts not on the set that was used to train it and not on the set that was used to validate it, but on a completely fresh dataset (all from the same population as the previous two).

### Summary

As you can see there is a spectrum of cross-validation methods, all offering requiring various levels of computing power and offering various points along the bias-variance trade-off continuum. On one hand we have the primitive hold-out validation where the computational requirements are low but  as an estimator of test error the validation error has high bias and high variance.  On the other hand have LOOCV – a computationally intensive approach that produces a low bias but high variance estimate of the test error.  Somewhere in between lies k-fold CV which is computationally reasonable (re-fit k = 10 times) and gives medium-to-low bias and medium-to-high variance.

Here is an influential paper that compares CV (varying k) and Bootstrap, to conclude that 10-fold CV is generally the best: http://web.cs.iastate.edu/~jtian/cs573/Papers/Kohavi-IJCAI-95.pdf.

We’ll talk about bootstrap in the next post but for now at least the k-fold CV part of the paper may be of interest.

Despite all these differences in technicalities, costs and benefits, all cross-validation methods are more similar than they are different.  Firstly what they all have in common is that they split the available data-set to create multiple non-overlapping subsets and then use one part the data to train the model and the other part of the data to validate it.  Secondly, is that they all share the same purpose –  to estimate test error rate of a proposed model and is subsequently used to compare multiple models in order to pick one with the lowest test error rate, i.e. to find the best predictive model.

As we will see below, other re-sampling methods like bootstrap and jackknife are rather used to estimate variance and or bias of parameter estimates rather than the predictive accuracy of the model.