# Model Evaluation Pt. 1 – LRT, AIC and BIC.

We Need a Universal Metric To Evaluate Models

In previous posts we looked at different ways to evaluate the goodness of fit of particular types of models, such as linear regression, trees or logistic regression. For example, in the post on linear regression we looked at $R^2$ and adjusted-$R^2$, and in the post on logistic regression we considered pseudo-$R^2$, Pearson’s $\chi^2$ test and the Hosmer-Lemeshow test.

These statistics often come as part of the standard output in statistical software whenever you fit a linear regression, a logistic regression, or whatever specific type of model you are experimenting with. They convey a great deal of meaningful information about how well your model has fitted the data, but at the same time they are rife with shortcomings, and in particular are usually meaningful only for one particular type of model. For example, $R^2$ only makes sense in the context of linear regression, and once you start comparing apples with oranges it stops telling you anything meaningful.

What would be really great is if there was a uniform and consistent way to compare all sorts of models against each other. And this is the topic I would like to explore in this post.

OK, So What Are The Options?

We need a method of measuring quality of fit that is transferable across all types of models – linear, binary, anything – you name it, our method should be able to consistently evaluate that model.

There are essentially two schools of thought here.

One school of thought comes from theoretical statistics and preaches the use of Likelihood Ratio Tests, AIC and BIC. You score your models with one of these statistics (each has its relative strengths and weaknesses) and the best model is the one with the best score on the chosen statistic.

The other school of thought comes from the more computationally intensive realm of data mining and machine learning. It preaches an approach whereby the data is split into two random subsets: the first, training, subset is used to fit the model, and the second, testing or cross-validation (CV), subset is used to check how well the newly trained model predicts. Models are evaluated and compared based on how well they perform on the testing subset.

In today’s post I will talk about the first approach – the one focusing on statistical metrics. I will first discuss some general theoretical concepts, then go through several popular statistical methods for measuring a model’s goodness of fit, and will conclude that two statistics, $AIC$ and $BIC$, are the best suited for this purpose.

So, let’s get started.  First a bit of theory.

A Few Definitions and Facts From Statistics

Likelihood Ratio

Suppose we have a sample $X = \{x_1,\dots,x_n\}$ and two models, A and B, that we have fitted based on the sample. At this stage we will not place any constraints on what these models can and cannot be; let’s just assume they are any two models for which a likelihood can be computed.

Likelihood ratio is defined as:

$\Lambda_{A,B}(x_1,...,x_n) = \frac{\text{likelihood of model A, given the sample }X}{\text{likelihood of model B, given the sample }X}$

The concept of likelihood ratio is of fundamental importance for a number of reasons. Two such reasons are the fact that it gives rise to deviance and the fact that it gives rise to the idea of a Likelihood Ratio Test.
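As a toy illustration (made-up data, and two fully specified Gaussian models rather than fitted ones), the ratio is best computed on the log scale:

```python
import math

def normal_loglik(xs, mu, sigma=1.0):
    """Sum of log densities of N(mu, sigma^2) over the sample."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

sample = [0.1, -0.2, 0.3]               # made-up data
ll_A = normal_loglik(sample, mu=0.0)    # model A: N(0, 1)
ll_B = normal_loglik(sample, mu=1.0)    # model B: N(1, 1)

# Likelihood ratio Lambda_{A,B}; exponentiating a difference of
# log likelihoods avoids numerical underflow for larger samples
Lambda = math.exp(ll_A - ll_B)
print(Lambda)                           # > 1: the sample favours model A
```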

Deviance

The deviance is defined by:

$D(y) = 2\log\frac{p(y\mid\hat \theta_s)}{p(y\mid\hat \theta_0)}$

where

• $p(y\mid\hat \theta_s)$ is the maximum value of the likelihood function for the saturated model, and
• $p(y\mid\hat \theta_0)$ is the maximum value of the likelihood function for the fitted model being assessed

Here the saturated model is the model with a parameter for every observation so that the data is fitted exactly.

And if you recall that $\log(A/B) = \log A - \log B$, we have another expression for deviance:

$D(y) =-2\Big( \log \big( p(y\mid\hat \theta_0)\big)-\log \big(p(y\mid\hat \theta_s)\big)\Big)$

Note that this is just $2\log\Lambda_{\theta_s,\theta_0}$, where $\Lambda$ is the likelihood ratio as defined above. In other words, deviance is twice the log of the likelihood ratio of the saturated model to the hypothesized model.

Secondly, you may often see deviance defined or referred to as just $-2\log \big( p(y\mid\hat \theta_0)\big)$. This is justified by noting that for saturated models of discrete data $\log \big(p(y\mid\hat \theta_s)\big) = 0$, because the fitted probabilities are all equal to 1. So even though this popular description of deviance is not strictly identical to the formal definition, it is consistent with it, and the shorthand is a justified convenience.

Lastly, if you stare at this definition (i.e. differences of logs) for long enough, you will become comfortable with the fact that the role deviance plays for generalized linear models is similar to the role $RSS$ plays for linear regression. In fact, for OLS linear regression the deviance is just equal to the $RSS$.
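A quick sketch of that last claim, under the simplifying assumption of a Gaussian model with known unit variance and made-up fitted values: twice the gap in log likelihood between the saturated model and the fitted model comes out as exactly the $RSS$:

```python
import math

def normal_loglik(ys, mus, sigma=1.0):
    """Gaussian log likelihood with one mean parameter per observation."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (y - m) ** 2 / (2 * sigma ** 2) for y, m in zip(ys, mus))

y      = [1.0, 2.1, 2.9, 4.2]   # observations (toy)
fitted = [1.1, 2.0, 3.0, 4.0]   # fitted means from some hypothetical model

ll_fit = normal_loglik(y, fitted)
ll_sat = normal_loglik(y, y)    # saturated model: mu_i = y_i, perfect fit

deviance = 2 * (ll_sat - ll_fit)
rss = sum((yi - mi) ** 2 for yi, mi in zip(y, fitted))
print(deviance, rss)            # identical when sigma = 1
```

With a general known variance the deviance would be $RSS/\sigma^2$; the constant terms of the log likelihood cancel between the saturated and fitted models.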

Nested Models, Wilks’ Theorem and the Likelihood Ratio Test (LRT)

We have already defined the concept of likelihood ratio, $\Lambda$. This was defined for a very general case, where models A and B are just any pair of models for which we can compute likelihood.

Now suppose we have a more specific case, where the two models are nested, i.e. the simpler model is a special case of the more complex model. This actually occurs quite often in statistical analysis, so it is a very important special case to consider. Well, as it turns out, in this special case of nested models we have the following important fact:

Wilks’ Theorem: If S and G are nested models (S being a special case of G) and several other regularity conditions are satisfied, then asymptotically, as the sample size $n \rightarrow \infty$, the statistic $-2\log\Lambda_{S,G}$ converges in distribution to a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters between the two models.

The proof of this theorem is quite complicated. Also, by saying “several other regularity conditions are satisfied” I have glossed over a lot of detail regarding necessary preconditions which, if violated, would invalidate the theorem. These preconditions are numerous and all rather technical – for example, that all relevant derivatives are non-zero, or that we are dealing with i.i.d. data, and so forth. If you wanted to do this really thoroughly you would need to check these preconditions one by one before invoking Wilks’ theorem, but for most analyses we are likely to do here, we can just assume that they hold.
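As a sketch of the theorem in use (toy data, variance known and equal to 1): testing the nested model “mean = 0” against the general model with a free mean leaves a difference of one free parameter, so the statistic is referred to $\chi^2_1$; for one degree of freedom the survival function can be written with `erfc`, so no external library is needed:

```python
import math

data = [0.5, 1.0, 1.5, 2.0]     # toy sample
n = len(data)
xbar = sum(data) / n

def normal_loglik(xs, mu):
    # unit variance assumed known throughout
    return sum(-0.5 * math.log(2 * math.pi) - (x - mu) ** 2 / 2 for x in xs)

ll_null = normal_loglik(data, 0.0)    # simpler model S: mu fixed at 0
ll_alt  = normal_loglik(data, xbar)   # general model G: mu at its MLE

lrt_stat = -2 * (ll_null - ll_alt)    # -2 log Lambda_{S,G}

# P(chi-squared with 1 df > x) = erfc(sqrt(x / 2)); valid for df = 1 only
p_value = math.erfc(math.sqrt(lrt_stat / 2))
print(lrt_stat, p_value)              # 6.25, ~0.012: reject mu = 0 at the 5% level
```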

Relationship between LRT and Deviance

Firstly, note that deviance is just $-2\log\Lambda_{m,\text{sat}}$, where $m$ is the model being assessed and sat is the saturated model.

Further, if you recall that $\log(A/B) = \log A - \log B$, the test statistic $-2\log\Lambda_{S,G}$ used in the LRT can be re-written as a difference of log likelihoods:

$-2\log\Lambda_{S,G} = -2\log\frac{\text{likelihood of S}}{\text{likelihood of G}} = -2\Big( \log \big( p(y\mid\hat \theta_S)\big)-\log \big(p(y\mid\hat \theta_G)\big)\Big)$

which is just the difference of deviances.

Ok, so that is how deviance and the likelihood ratio test statistic are related and similar. Now, how are they different? Deviance is a measure of imperfection or “badness” of a model compared to the “perfect” saturated model, just like $RSS$ was a measure of imperfection of an OLS fit compared to the perfect model where $RSS=0$. The LRT statistic, on the other hand, is a measure of “badness” or “goodness” of one nested model compared to another more general (and not necessarily saturated) model.
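The relationship can be checked numerically. A sketch under the same simplifying assumptions as before (Gaussian responses, unit variance, made-up fitted values): the LRT statistic for two nested fits equals the difference of their deviances, because the saturated term cancels:

```python
import math

def normal_loglik(ys, mus):
    # unit-variance Gaussian log likelihood
    return sum(-0.5 * math.log(2 * math.pi) - (y - m) ** 2 / 2
               for y, m in zip(ys, mus))

y        = [1.0, 2.1, 2.9, 4.2]
ybar     = sum(y) / len(y)
fitted_S = [ybar] * len(y)          # simpler nested model: intercept only
fitted_G = [1.1, 2.0, 3.0, 4.0]     # richer model's fitted values (toy)

ll_sat = normal_loglik(y, y)        # saturated model fits the data exactly
D_S = 2 * (ll_sat - normal_loglik(y, fitted_S))
D_G = 2 * (ll_sat - normal_loglik(y, fitted_G))

lrt_stat = -2 * (normal_loglik(y, fitted_S) - normal_loglik(y, fitted_G))
print(lrt_stat, D_S - D_G)          # the two quantities coincide
```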

Wald Test and Score Test – Quite Good But LRT Still Wins.

In addition to the LRT there are also the Wald test and the score test. I won’t go into great detail about these two methods. I will just say that they are asymptotically equivalent to the LRT but tend to be less reliable in smaller samples, and their main attraction is that they can be quicker to compute: the Wald test requires fitting only the unrestricted model, and the score test only the restricted one, rather than both. The fact that these tests are easier to compute used to be an important factor, but it matters less these days when computing power is cheap, so the LRT is the preferred method.

Summary So Far: LRT Pros and Cons

So far we have looked at LRT as one way of comparing goodness of fit of two models. There are many things that are great about LRT and that have made it such a popular method to compare models, for example:

• it is intuitive and easy to understand and it relates to deviance which is a fairly straightforward and natural way to describe goodness (or rather, “badness”) of fit,
• Wilks’ Theorem means that when looking at nested models we can use the $\chi^2$ distribution to precisely quantify the superiority, or lack thereof, of one model over another,
• LRT is consistent with $R^2$. In the case of linear regression, as long as the error terms $\epsilon$ can be assumed to be normally distributed (which, as discussed in one of our previous posts, is a reasonable assumption), the coefficient estimates obtained by OLS will in fact be equal to those obtained via maximum likelihood estimation (MLE). Moreover, as I mentioned earlier, the deviance of a linear regression model fitted via MLE equals the $RSS$ of the same model fitted via OLS, so statistics based on MLE, such as the LRT, are consistent with statistics based on $RSS$, such as $R^2$,
• LRT is more versatile than $R^2$. It can be used to compare any set of models that are nested and that have a notion of likelihood.

So what are the drawbacks of the LRT? Well, one obvious drawback is that Wilks’ theorem only holds for nested models, so the LRT can be used meaningfully only when comparing nested models.

There have been suggestions on how to use the LRT to compare non-nested models – see for example the discussion on page 157 of this paper – but once you lose nestedness and cannot invoke Wilks’ theorem, you no longer have the asymptotic $\chi^2$ distribution, and you have to start approximating and improvising, and things get clunky.

So we are back looking for an even more general approach – one that would allow us to compare non-nested models and, ideally, be consistent with the straightforward and intuitive methods we have looked at so far: $R^2$, $F$-tests and the LRT.

The Best Option: AIC

Enter $AIC$. Akaike’s Information Criterion (AIC) for a model is defined as:

$\mathrm{AIC} = 2k - 2\log{L}$

where:

• $L$ is the maximized likelihood using all available data for estimation and
• $k$ is the number of free parameters in the model.

Notice that the higher the likelihood $L$, the lower the $AIC$. Thus, when evaluating models by $AIC$, lower is better, and the model with the lowest $AIC$ of all candidates is the best.

The $2k$ term is often erroneously interpreted as some kind of penalty on the number of explanatory variables, i.e. something used to “regularise” the expression and discourage the use of unnecessary explanatory variables. This is actually not the case – the $2k$ term is there to correct for the bias that $-2\log{L}$ would otherwise have as an estimate of out-of-sample fit.

There are a number of good things about $AIC$. First of all, it is easy to compute – you just need the log likelihood, which you probably already have from doing MLE in the first place. Secondly, $AIC$ is consistent with the previous measures we used, such as $R^2$ or the LRT. Most importantly, $AIC$ can be used to compare models that are not necessarily nested. We can literally take any two models, nested or not, compare their $AIC$s and pick the one with the lowest $AIC$ as the better.
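A minimal sketch (same toy Gaussian setup as earlier, variance known and equal to 1): computing $AIC$ needs nothing beyond each model’s maximized log likelihood and its number of free parameters, and the models being compared need not be nested:

```python
import math

def normal_loglik(xs, mu):
    # unit-variance Gaussian log likelihood
    return sum(-0.5 * math.log(2 * math.pi) - (x - mu) ** 2 / 2 for x in xs)

def aic(loglik, k):
    """AIC = 2k - 2 log L."""
    return 2 * k - 2 * loglik

data = [0.5, 1.0, 1.5, 2.0]     # toy sample
xbar = sum(data) / len(data)

# model 1: mu fixed at 0  -> 0 free parameters
# model 2: mu estimated   -> 1 free parameter
aic_fixed = aic(normal_loglik(data, 0.0), k=0)
aic_free  = aic(normal_loglik(data, xbar), k=1)
print(aic_fixed, aic_free)      # lower AIC wins: here the free-mean model
```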

Another great thing about $AIC$ is that asymptotically, finding the model with the lowest $AIC$ is equivalent to finding the model that performs best under cross-validation. This is true for any model (Stone 1977), not just linear models. This is a big-picture win: not only is $AIC$ the best statistical method for evaluating and comparing models so far (it is the most general, in that it can compare even non-nested models, while remaining consistent with our other, less general favourites such as the LRT), but it is also consistent with the preferred method from the “evaluate models empirically by testing against sub-samples” school of thought.

The Best Option v2: BIC

At this point I should mention $BIC$ – another goodness of fit criterion for models and a close “competitor” of $AIC$ that is often cited as the best tool for this purpose.

The Bayesian Information Criterion (BIC) for a model is defined as:

$\mathrm{BIC} = {-2\log L + k \log(n)}$

where:

• $L$ is the maximized likelihood using all available data for estimation,
• $k$ is the number of free parameters in the model and
• $n$ is the number of observations.
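Computed side by side with $AIC$ (the log likelihoods below are made up for illustration), the only difference is the penalty: $2k$ versus $k\log(n)$, which is harsher whenever $n > e^2 \approx 7.4$. With a large $n$ the two criteria can disagree:

```python
import math

def aic(loglik, k):
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    return k * math.log(n) - 2 * loglik

# hypothetical maximized log likelihoods for a small and a large model
ll_small, k_small = -520.0, 3
ll_big,   k_big   = -513.0, 8
n = 1000

# AIC prefers the bigger model...
print(aic(ll_big, k_big) - aic(ll_small, k_small))         # negative: big model wins
# ...but BIC's log(n) penalty flips the choice
print(bic(ll_big, k_big, n) - bic(ll_small, k_small, n))   # positive: small model wins
```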

So what is the difference between $AIC$ and $BIC$ and which one should we use?

The arguments about $AIC$ vs $BIC$ and which is better are rooted in philosophy and pivot around subtle technical differences in how $AIC$ and $BIC$ are computed and interpreted.

Essentially, when the set of models you are evaluating and comparing contains a model that is a perfect and true representation of reality, then asymptotically, as $n\rightarrow\infty$, $BIC$ will select that one true model as “the best”, whereas $AIC$ may fail to converge to the true model, instead picking some other, imperfect model as its version of “the best”. However, when no true model exists at all (for example, if there is an infinite number of relevant explanatory variables, so we can only ever hope for a good approximation), or the true model exists but is not among the candidates we are comparing (for example, we just haven’t thought of it), then $AIC$ will do a better job than $BIC$ at finding, among our imperfect models, the one that is the “least imperfect” and the best at approximating and predicting in the long run.

I personally think there is no such thing as the perfect model. There will always be something that has not been included, something that has been measured incorrectly, or a functional relationship that has been (at least slightly) mis-specified. Even if a model that described reality perfectly existed, the chance of coming across it and including it in the set of candidate models you are evaluating by $BIC$ is very slim. Therefore, I believe that the trump card $BIC$ holds over $AIC$ – the ability to converge on the true model when it is present among the candidates – is redundant, and that the advantages of $AIC$ over $BIC$, such as consistency with leave-one-out cross-validation, make $AIC$ the metric of choice.

In this post I have referred twice to the fact that $AIC$ is consistent with something called leave-one-out cross-validation, and both times I have hailed this as a very good thing without actually explaining what leave-one-out cross-validation is or why consistency with it is so desirable. In the next post I will talk about this topic in more detail. Sample splitting and cross-validation methods (of which leave-one-out cross-validation is a specific example) are a whole separate school of thought, with their own good, bad and ugly. The good news is that in most cases they produce results that are consistent with statistical methods like the LRT, $AIC$ and $BIC$, giving us peace of mind that whichever road we take in evaluating our models, all roads will likely lead to the same (or at least a similar) best model.