Likelihood Log

Econometrics of scale

Re-sampling Pt.2: Jackknife and Bootstrap

Suppose we have a sample of n data points and we want to estimate some parameter \theta.  We come up with \hat{\theta} – an estimator of \theta.  What do we know about \hat{\theta}?  How good an estimator of \theta is it?  Is it biased?  How efficient is it?

We could answer these questions if we knew the distribution of the population from which x_{1}, \text{...}, x_{n} came. More often than not, however, we don’t know anything about the distribution of the underlying population: all we have is a sample, and we want to figure out things about the population from it.

This is where re-sampling, such as the jackknife or the bootstrap, comes into play.


We generate multiple samples on the basis of our original data set and use these multiple samples to compute multiple “versions” of \hat{\theta}.  Having many versions of \hat{\theta} allows us to average them and estimate the expected value of \hat{\theta}, as well as its bias and variance.

Let’s see how each of the two methods does this.

The Jackknife
 
How it works:

We take our original sample S = \{x_{1},\text{...},x_{n}\} and generate n sub-samples S_{i} of size n - 1 by leaving out one observation at a time: 

\text{for } 1\leq i \leq n \text{, } S_{i} = \{x_{j} \in S \mid j\neq i\}

For each such sub-sample S_{i} we calculate its own estimator \hat{\theta}_{\left(i\right)} of \theta, using the same method that we used to compute the original estimator \hat{\theta} from the entire sample S.

We thus end up with n separate estimators \hat{\theta}_{\left(i\right)} for \theta.

The jackknife estimator of a parameter is defined as:
\bar{\theta}_{\left(\cdot\right)} = \frac{1}{n} \sum\limits_{i=1}^{n} \hat{\theta}_{\left(i\right)}
and the jackknife estimator of variance is defined as:
Var_{\left(jackknife\right)} = \frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_{\left(i\right)}-\bar{\theta}_{\left(\cdot\right)} \right)^{2}
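To make the recipe concrete, here is a minimal Python sketch of the procedure above; the helper name jackknife, the use of NumPy and the synthetic data are my own illustrative choices, not part of the method itself.

```python
import numpy as np

def jackknife(data, estimator):
    """Leave-one-out jackknife: returns the replicates, the jackknife
    mean and the jackknife variance estimate for `estimator`."""
    data = np.asarray(data)
    n = len(data)
    # theta_(i): the estimator recomputed on S with x_i left out
    replicates = np.array([estimator(np.delete(data, i)) for i in range(n)])
    theta_bar = replicates.mean()                          # jackknife mean
    var_jack = (n - 1) / n * np.sum((replicates - theta_bar) ** 2)
    return replicates, theta_bar, var_jack

# Example: jackknife the sample mean of some synthetic data
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=50)
reps, theta_bar, var_jack = jackknife(x, np.mean)
print(theta_bar, var_jack)   # var_jack approximates Var(sample mean) = s^2 / n
```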

Why it works:

OK, so far so good: we have come up with an alternative estimator of \theta and an estimator of the variance of \hat{\theta}.  But why did we need to do that?  We already had \hat{\theta}, so why did we need \bar{\theta}_{\left(\cdot\right)} – yet another estimator of \theta, and a rather exotic one at that?  And, while we didn’t previously have any estimator of the variance of \hat{\theta}, what makes us say that Var_{\left(jackknife\right)} is a decent estimator of the variance of \hat{\theta}?

The answer to the first question has to do with the fact that, while \hat{\theta} was indeed an estimator of \theta, we didn’t know whether it was biased or unbiased. However, now that in addition to \hat{\theta} we also have \bar{\theta}_{\left(\cdot\right)}, we can do some clever math and infer something about the bias of \hat{\theta}. We can then, if necessary, correct the bias and derive a (nearly) unbiased estimator of \theta.

More concretely, if \hat{\theta} is an unbiased estimator of \theta we have:
E\left(\hat{\theta}\right) = \theta
and, since each \hat{\theta}_{\left(i\right)} is the same estimator computed from a sample of size n - 1, for each i
E\left(\hat{\theta}_{\left(i\right)}\right) = \theta
which in turn implies that
E\left(\bar{\theta}_{\left(\cdot\right)}\right) = \theta
i.e. the jackknife estimator is also unbiased.

However, if \hat{\theta} is biased, so that
E\left(\hat{\theta}\right) = \theta + \frac{a}{n} + \frac{b}{n^{2}} + O\left(\frac{1}{n^{3}}\right)
then
E\left(\bar{\theta}_{\left(\cdot\right)} - \hat{\theta}\right) = \frac{a}{n\left(n - 1\right)} + O\left(\frac{1}{n^{3}}\right)
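In outline, the skipped step is this: each leave-one-out sample S_{i} contains n - 1 observations, so the same expansion applied to \hat{\theta}_{\left(i\right)} (and hence to the average \bar{\theta}_{\left(\cdot\right)}) gives
E\left(\hat{\theta}_{\left(i\right)}\right) = \theta + \frac{a}{n-1} + \frac{b}{\left(n-1\right)^{2}} + O\left(\frac{1}{n^{3}}\right)
and subtracting the expansion of E\left(\hat{\theta}\right) term by term leaves \frac{a}{n\left(n-1\right)} plus terms of order \frac{1}{n^{3}}.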

NB: I have skipped a fair amount of math in this last part and jumped straight to the result, but those who are interested in the detailed proof can find it here.

Thus, we can estimate the bias of \hat{\theta}, up to second order, as:
b_{jackknife} = \left(n - 1\right)\left(\bar{\theta}_{\left(\cdot\right)} - \hat{\theta}\right)
Subsequently, the bias-corrected jackknife estimator,
\hat{\theta}_{jackknife} = \hat{\theta} - b_{jackknife}
is an unbiased estimator of \theta, again up to second order.
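As a quick sanity check of the bias correction, here is a small Python sketch (again, the names and the synthetic data are illustrative). It uses the plug-in variance estimator, whose bias is exactly of the a/n form, so the jackknife correction recovers the familiar unbiased sample variance.

```python
import numpy as np

def jackknife_bias_corrected(data, estimator):
    """Return the original estimate, the jackknife bias estimate
    b_jackknife, and the bias-corrected estimate."""
    data = np.asarray(data)
    n = len(data)
    theta_hat = estimator(data)                   # original estimate on full sample
    replicates = np.array([estimator(np.delete(data, i)) for i in range(n)])
    theta_bar = replicates.mean()                 # jackknife mean
    bias = (n - 1) * (theta_bar - theta_hat)      # b_jackknife
    return theta_hat, bias, theta_hat - bias      # theta_jackknife = theta_hat - b_jackknife

# The plug-in variance (1/n) * sum((x - x_bar)^2) has bias exactly -sigma^2 / n,
# i.e. the a/n form above, so the jackknife correction yields the unbiased variance.
rng = np.random.default_rng(1)
x = rng.normal(size=30)
plug_in_var = lambda d: np.mean((d - d.mean()) ** 2)
theta_hat, bias, theta_corrected = jackknife_bias_corrected(x, plug_in_var)
print(theta_corrected, np.var(x, ddof=1))   # these coincide (up to floating point)
```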

Note that in the reasoning above we have assumed that the bias had a particular form:

E\left(\hat{\theta}\right) = \theta + \frac{a}{n} + \frac{b}{n^{2}} + O\left(\frac{1}{n^{3}}\right)

This may seem like a strange assumption, but in fact it is usually justified. It is a deep result in statistics that many biased estimators do in fact have bias of this particular form. I won’t cover the details here, but you can read up on it here.  If, however, we were dealing with one of the rare cases where the assumption isn’t justified, then we wouldn’t be able to use the jackknife bias correction meaningfully.

The answer to the second question is also quite mathematically involved (the details are here), but it essentially boils down to the fact that Var_{\left(jackknife\right)} is a consistent estimator of Var\left(\hat{\theta}\right).

Note that this holds most of the time but not always, as witnessed by a theorem by Efron: http://www.biostat.umn.edu/~johnh/pubh8422/notes/Jackknife_and_Bootstrap.pdf So again, in the rare cases where this assumption is not justified, we wouldn’t be able to use jackknifing properly.

Conclusion:

To recap, for a given sample x_{1},\text{...}, x_{n} and an estimator \hat{\theta} of \theta, the jackknife method allows us to:
  • test and, if necessary, correct \hat{\theta} for bias, rather than just blindly use the possibly biased \hat{\theta}
  • obtain a consistent estimator of variance of \hat{\theta}, rather than just… well not have any variance estimator at all.

Bootstrapping

How it works:

Recall that with the jackknife method, we took n sub-samples of S = \{x_{1},\text{...}, x_{n}\} by leaving out one element at a time and subsequently ended up with n pre-determined sub-samples, each of size \left(n-1\right).

With bootstrapping, we take a different approach.  We take samples of size n from S = \{x_{1},\text{...}, x_{n}\}, but we sample with replacement.

The exact algorithm is as follows (a short code sketch follows the list):
  • Observe a sample S = \{x_{1},\text{...}, x_{n}\}.
  • Compute \hat{\theta}\left(S\right) – a sample based estimator of some parameter \theta of the model.
  • For i = 1 up to s where s is the number of bootstrap samples being generated:
    • generate a bootstrap sample S_{i} = \{x_{i1},\text{...}, x_{in}\} by sampling with replacement from the original observed data set
    • compute \hat{\theta}_{i} = \hat{\theta}\left(S_{i}\right) in the same way that you calculated the original estimate \hat{\theta}.
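Here is a minimal Python sketch of this loop, assuming the estimator of interest is passed in as a function; the names bootstrap_replicates and s, and the synthetic data, are illustrative assumptions.

```python
import numpy as np

def bootstrap_replicates(data, estimator, s=2000, seed=None):
    """Generate s bootstrap replicates of `estimator` by resampling
    the observed data with replacement."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    reps = np.empty(s)
    for i in range(s):
        sample = rng.choice(data, size=n, replace=True)   # bootstrap sample S_i
        reps[i] = estimator(sample)                       # theta_hat_i
    return reps

# Example: bootstrap the median of some synthetic data
rng = np.random.default_rng(42)
x = rng.exponential(scale=3.0, size=100)
theta_hat = np.median(x)                                  # original estimate
reps = bootstrap_replicates(x, np.median, s=5000, seed=7)
```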

Firstly, note that in each sample S_{i} some members of S may be missing while others occur multiple times.  As it turns out, it can be shown that, on average, each such sample contains roughly two thirds of the distinct members of S and is missing the remaining third.
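More precisely, the chance that a given observation x_{j} is missing from a particular bootstrap sample is
\left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368
so, on average, a bootstrap sample contains about 63\% of the distinct members of S.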

Secondly, note that, while with jackknifing we could have at most n sub-samples of S, with bootstrapping, thanks to sampling with replacement, we can go on forever and generate an arbitrarily large number of samples.

Having a really large number of sub-samples S_{i} in turn allows us to generate many \hat{\theta}_{i} estimates of \theta – one \hat{\theta}_{i} for each S_{i}.  As we end up with many \hat{\theta}_{i}-s we are essentially approximating the distribution of these \hat{\theta}_{i}-s.

From here we can compute the sample mean and sample variance of \hat{\theta}_{i}-s:

\bar{\hat{\theta}} = \frac{1}{s} \sum\limits_{i=1}^{s} \hat{\theta}_{i}
Var_{\left(bootstrap\right)} = \frac{1}{s - 1} \sum\limits_{i=1}^{s}\left(\hat{\theta}_{i} - \bar{\hat{\theta}}\right)^{2}
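In code, continuing the bootstrap sketch above (where reps holds the \hat{\theta}_{i} values), these quantities are simply:

```python
import numpy as np

# `reps` is the array of bootstrap replicates theta_hat_i from the earlier sketch
theta_bar = reps.mean()          # bootstrap mean, (1/s) * sum of theta_hat_i
var_boot = reps.var(ddof=1)      # the 1/(s-1) sample variance formula above
se_boot = np.sqrt(var_boot)      # bootstrap standard error of theta_hat
```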

So now things look better.  We started out with just one estimator \hat{\theta} of \theta, without knowing its bias or variance. We now have two estimators of \theta – firstly the original \hat{\theta}, but also secondly the bootstrap estimator \bar{\hat{\theta}}. We also have an estimator of the variance of \hat{\theta}, given by Var_{\left(bootstrap\right)}.

Note that, just like in the case of the jackknife, the bootstrap allows us to test \hat{\theta} for bias.  The math is detailed here: https://www.otexts.org/1470 and here: http://www.ssc.wisc.edu/~xshi/econ715/Lecture_10_bootstrap.pdf

Also, bootstrap estimators of variance of \hat{\theta} have all sorts of good properties, as outlined here:  http://www.ssc.wisc.edu/~xshi/econ715/Lecture_10_bootstrap.pdf

Thus, given a sample S = \{x_{1}, \text{...}, x_{n}\} and an estimator \hat{\theta} of \theta, just like in the case of jackknifing, bootstrapping allows us to:
  • test and, if necessary, correct \hat{\theta} for bias, rather than just blindly use the possibly biased \hat{\theta}
  • obtain a consistent estimator of variance of \hat{\theta}, rather than just… well not have any variance estimator at all.
However, unlike the jackknife, the bootstrap also allows us to:
  • simulate the distribution of \hat{\theta} by generating an arbitrarily large number of \hat{\theta}_{i}-s, one for each of the arbitrarily many bootstrap samples.

Why it works:

When I first saw bootstrapping I was puzzled. How can one learn things about the distribution of a population simply by taking sub-samples of a sample again and again?  I mean, it literally does feel like lifting oneself up by one’s bootstraps – impossible.

It took me time to figure out that there is nothing magical about the bootstrap and that it was my original “novice” interpretation of it that confused me.  I also know that this is how pretty much everyone who sees the bootstrap for the first time reacts, which is why I think it is important to address that source of confusion here.

The first thing to note is that the bootstrap works asymptotically.  The larger the sample, the closer the empirical distribution is to the actual distribution; this is guaranteed by the Dvoretzky-Kiefer-Wolfowitz inequality.  Therefore, for a sufficiently large sample, the difference between the empirical and the actual distributions is negligible, and any conclusions we draw from the empirical distribution are equally applicable to the actual distribution.
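For reference, the Dvoretzky-Kiefer-Wolfowitz inequality bounds the distance between the empirical CDF \hat{F}_{n} and the true CDF F:
P\left(\sup_{x}\left|\hat{F}_{n}\left(x\right) - F\left(x\right)\right| > \varepsilon\right) \leq 2e^{-2n\varepsilon^{2}}
so the empirical distribution converges to the true one uniformly, and the probability of a sizeable discrepancy shrinks exponentially in n.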

The second thing to note is that, even if the sample size is small (which in reality happens often), a sample randomly chosen from the population will “look quite like the population” in the long run.  By this I mean that any one particular small sample may be off due to the skewness or imperfection of that particular sample, but on average, across many samplings, a random sample is a good representation of the population. Thus, the asymptotic property holds not only from the “large sample” perspective highlighted by the first point, but also from the “long run” perspective highlighted here.

Having made these two points, I will concede that they are a bit theoretical.  In reality we don’t usually get large samples, and we don’t get to repeat many small samples either.  Thus the two theoretical points are of limited use in a real-life situation where we have just one not-so-large sample and need to infer population characteristics from it.  In this case we can still use the bootstrap (especially if there are no other alternatives available), but we should bear in mind that our bootstrap estimates will likely inherit whatever imperfections the sample had as a representation of the bigger population.

romansmith
