Logistic Regression – Meaningful Coefficients

Why Logistic Regression

Today I will talk about logistic regression, a method that allows us to use some of the useful features of linear regression to fit models that predict binary outcomes: $1$ vs. $0$, “success” vs. “failure”, “result present” vs. “result absent”.

Just by itself linear regression is not adequate for this sort of thing as it doesn’t place any restrictions on the range of the response variable and we could end up with values far greater than $1$ or far below $0$. We could discard anything we know about linear regression and use an altogether different approach to build a model that would, on the basis of a group of explanatory variables, predict our response variable. Indeed there are a number of such approaches: trees, forests, SVMs, etc. However, linear regression is such a simple but powerful technique with so many great things about it – it would be really good if we could somehow harness its power to build binary predictive models. And logistic regression is the method that allows us to do just that.

What Logistic Regression Is and How It Works

We are trying to build a model that would predict a binary outcome.  In other words the model would accept as inputs a number of explanatory variables and would output the response variable $y$ that takes on values $1$ or $0$. How can we combine our explanatory variables into an expression that would be simple (ideally a linear combination), produce a binary outcome, and allow for an intuitive interpretation of various components and coefficients within that expression?

First of all let’s note that we can be flexible and, instead of requiring the outcome variable $y$ to strictly take on values $1$ or $0$, allow it to take any value between $1$ and $0$ and then use some kind of a threshold $b \in \left[0,1\right]$ such that whenever $y \geq b$ we interpret it as $1$ and whenever $y < b$ we interpret it as $0$.  A convenient way to think of it is that $y$ becomes a probability, so that if it is close to $1$ we predict $1$, and if close to $0$ we predict $0$.  In this case, a natural candidate for the threshold is $0.5$.

Secondly let’s see how we can convert a linear combination of explanatory variables into a $y \in \left[0,1\right]$ outcome.

To start, take a linear model based on our explanatory variables (with $x_0 = 1$ playing the role of the intercept term):

$z=\beta_{0}x_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}$

At this stage, depending on the values of the $\beta_{i}\text{'s}$ and $x_{i}\text{'s}$, this could produce any number, arbitrarily large or small, positive or negative, i.e. anywhere in the $\left(-\infty,\infty\right)$ range.

Now “squeeze” the $\left(-\infty,\infty\right)$ range into the $\left[0,1\right]$ range by means of a logistic function:

$F(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-\left(\beta_{0}x_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}\right)}}$

This logistic function is nice because it is a very neat way to take our linear combination of explanatory variables $z=\beta_{0}x_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}$ and to transform it into essentially a probability value between $0$ and $1$. Have a look at how the formula behaves:

• $p \rightarrow 0$ as $z \rightarrow -\infty$, i.e. when $z=\beta_{0}x_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}$ goes into far negatives.
• $p \rightarrow 1$ as $z \rightarrow \infty$, i.e. when $z=\beta_{0}x_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}$ goes into large positives.
• $p = 0.5$ exactly when $z = 0$, i.e. when $\beta_{0}x_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}$ crosses zero.
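The three bullet points above are easy to verify numerically. Here is a minimal sketch in Python (the function names are mine, for illustration only, not from any particular library):

```python
import math

def sigmoid(z):
    """Logistic function: squeezes (-inf, inf) into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(-20.0))  # essentially 0: the logit is far negative
print(sigmoid(20.0))   # essentially 1: the logit is far positive
print(sigmoid(0.0))    # exactly 0.5: the logit crosses zero

# Turning the probability into a binary prediction with the 0.5 threshold:
def predict(z):
    return 1 if sigmoid(z) >= 0.5 else 0

print(predict(-3.0), predict(3.0))  # 0 1
```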

As a result we have a model that takes all our explanatory variables and, by means of first a linear combination and then a logistic transformation, maps these to a response variable that lies between $0$ and $1$. We can then agree on a threshold, for example $0.5$ (although it doesn’t have to be $0.5$ and there are cases where the model is better off with a different threshold), and predict $0$ whenever the model gives an outcome below $0.5$, or predict $1$ whenever it gives an outcome of $0.5$ or above.

Ok so now we have some kind of a model that takes a bunch of explanatory variables in all shapes and forms and from that produces a probability, which can in turn be interpreted as a binary outcome.  At this stage I am not saying anything about whether this model is good or bad, whether it predicts well, whether it is easy to fit, or whether it has any nice statistical properties. All I can say for now is that it gets us from A to Z, and it does that via a path that is fairly straightforward, intuitive and easy to interpret (well, at least as straightforward and easy as this sort of thing can be).

Fitting the Model

Logistic regression models are fitted through the maximum likelihood estimation (MLE) method. This is a very popular way to fit models in general, not just logistic regression, for a very long list of reasons.

Often in statistics courses logistic regression is covered somewhere towards the end, after the larger part of the course has been spent on OLS linear regression.  As a result, a lot of students coming through the traditional path of statistical education have OLS fresh in their minds and may be tempted to just use OLS instead of MLE to fit a logistic regression model. This, however, would not work.  With logistic regression the concept of residual sum of squares (RSS) doesn’t really make sense – we can still compute something that would technically be $RSS$, but because our responses have been “squeezed” into the $\left[0,1\right]$ range, the residuals are not meaningful and it doesn’t make sense to try to minimize them or their sum of squares.  Thus, we use MLE instead.

The good news is that linear regression models too can be fit by means of MLE instead of OLS.  Furthermore, as long as the errors are modelled as independent and normally distributed, it turns out that fitting a linear regression model by MLE yields exactly the same coefficient estimates as fitting the model via the conventional OLS method.  In fact, while it is correct to say that using MLE to fit a linear regression model gives results consistent with the results of OLS, the statement would be a bit unfair. It would be more fair to say that it is OLS, the niche method for fitting linear regression models, that is consistent with MLE, the ubiquitous and versatile method for fitting all sorts of models including linear regression, logistic regression and others.
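As a rough sketch of what MLE does here: the Bernoulli log-likelihood of a logistic model is concave in the coefficients, so even plain gradient ascent recovers them. The data below is synthetic, with made-up “true” coefficients $(-1, 2)$; nothing here comes from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with made-up true coefficients: logit z = -1 + 2*x.
n = 5000
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(-1 + 2 * x)))
y = rng.binomial(1, p_true)

X = np.column_stack([np.ones(n), x])  # first column is x0 = 1, the intercept
beta = np.zeros(2)

# MLE by gradient ascent: the gradient of the Bernoulli log-likelihood
# with respect to beta is X^T (y - p_hat), and the likelihood is concave,
# so following the gradient uphill converges to the unique maximum.
for _ in range(10000):
    p_hat = 1 / (1 + np.exp(-X @ beta))
    beta += 0.5 * X.T @ (y - p_hat) / n

print(beta)  # estimates should land near the true values (-1, 2)
```

In practice one would use a library fitter (which typically uses Newton-type iterations rather than this slow gradient ascent), but the target being maximized is the same log-likelihood.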

Coefficient Interpretation

One of the reasons logistic regression is popular is the fact that regression coefficients have a very intuitive interpretation.

Logit Equals Log Odds

The linear component $z=\beta_{0}x_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}$ is commonly referred to as logit.  It can be interpreted as the log of odds (log-odds). Recall that the odds are defined as the ratio of the probability of success to the probability of failure:

$\text{odds}=\frac{P\left(\text{success}\right)}{P\left(\text{failure}\right)}$

Continuing with our previous idea that the sigmoid function $F$ corresponds to the probability that the outcome is $1$ or “success”, the odds can be expressed as:

$\text{odds} = \frac{F(z)}{1-F(z)} = \frac{\frac{1}{1 + e^{-z}}}{1-\frac{1}{1 + e^{-z}}} = e^z$

And subsequently, substituting the full expression for $z$, the log-odds are equal to the logit:

$\log e^{z} = z = \beta_{0}x_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}$.

So for any set of explanatory variables $\{x_1, \dots, x_n\}$, we can just plug them into the logit formula and quickly be able to say something like “the odds of success (say, of a passenger surviving) are $e^{z}$”.
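The identity above is easy to check numerically, and once we trust it, turning a logit into odds is a one-liner (the logit value $0.7$ below is a made-up example, not from any fitted model):

```python
import math

def F(z):
    """The logistic function from above."""
    return 1 / (1 + math.exp(-z))

# Check the identity odds = F(z) / (1 - F(z)) = e^z at a few points.
for z in (-2.0, 0.0, 1.5):
    assert abs(F(z) / (1 - F(z)) - math.exp(z)) < 1e-9

# Reading odds off a hypothetical logit value:
z = 0.7
print(math.exp(z))  # the odds of "success" are about 2.01, i.e. roughly 2:1
```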

Individual Coefficients Equal the Log of the Odds Ratio

Odds ratio (OR) is just that – a ratio of the odds.  We use it to quantify one explanatory variable’s effect on increasing or decreasing the odds (with all other explanatory variables adjusted for):

$\text{OR}=\frac{\text{odds}\left(x=1\right)}{\text{odds}\left(x=0\right)}$

A good summary and a crash course in odds ratios can be found here.

It turns out that each individual coefficient in the logistic regression model can be interpreted as the log of odds ratio (log odds ratio). This is easiest to demonstrate with a dummy explanatory variable. Recall that a dummy explanatory variable is an explanatory variable that is binary, taking only values $x=0$ and $x=1$, so for example something that represents gender (males are coded as $x=1$, females as $x=0$), exposure to treatment (“received treatment” is coded as $x=1$, “hasn’t received the treatment” is coded as $x=0$), and so on.

Suppose we have our usual logit:

$z=\beta_{0}x_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}$.

where $x_1$ is a dummy variable taking only values $x_1=0$ and $x_1=1$.  Then whenever $x_1=1$ the odds are:

$e^{z} = e^{\left(\beta_{0}x_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}\right)} = e^{\left(\beta_{0}x_{0}+\beta_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}\right)}$

whereas whenever $x_1=0$ the odds are:

$e^{z} = e^{\left(\beta_{0}x_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}\right)} = e^{\left(\beta_{0}x_{0}+0+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}\right)}$

Thus, the odds ratio is:

$\frac{e^{\left(\beta_{0}x_{0}+\beta_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}\right)}}{e^{\left(\beta_{0}x_{0}+0+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}\right)}} = e^{\beta_1}$

and the logarithm of the odds ratio is $\log e^{\beta_1} = \beta_1$.

From the argument above we see that the statement holds for any dummy explanatory variable, not just $x_1$, and for any number of explanatory variables. We could also extend this line of thought to show that for non-dummy variables (i.e. continuous numerical variables) the coefficient represents the log odds ratio for a unit change in the explanatory variable.
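We can verify this arithmetic directly. The coefficients and variable values below are purely illustrative, not from any fitted model:

```python
import math

# Hypothetical coefficients for a logit with a dummy variable x1.
b0, b1, b2 = 0.5, 0.8, -0.3   # b1 is the coefficient of the dummy x1
x0, x2 = 1.0, 2.0             # the other explanatory variables, held fixed

odds_x1_on = math.exp(b0 * x0 + b1 * 1 + b2 * x2)   # odds when x1 = 1
odds_x1_off = math.exp(b0 * x0 + b1 * 0 + b2 * x2)  # odds when x1 = 0

odds_ratio = odds_x1_on / odds_x1_off
print(odds_ratio, math.exp(b1))  # the two agree: OR = e^{beta_1}
```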

If you are like me and come from a pure mathematics and statistics background, you may find the concept of odds and odds ratios somewhat archaic – something they use when making bets in horse races. However, even though these concepts might be a bit exotic to you, they are very well ensconced in some of the applied disciplines, like epidemiology or genetics, that make very heavy use of binary classification. So for these fields it is nice to have something like logistic regression – on one hand it carries all the power of a proper statistical model, on the other hand it has a nice intuitive interpretation of the logit and the coefficients that is consistent with the concepts of odds and odds ratio that anyone can derive with pen and paper and a simple 2×2 table.

Evaluating the Model

$R^2$ – Not Right For Logistic Regression

There is a “pseudo-$R^2$” metric that often gets included in the standard output by statistical software when fitting a logistic regression model. There are actually several variants of it; you can see a comprehensive list here.

It would be nice to have something as simple to compute and easy to interpret as $R^2$ was for linear regression, but unfortunately none of these pseudo-$R^2$ metrics cut it. In the case of OLS linear regression, $R^2$ was based on the meaningful concept of $RSS$ and had all sorts of nice facts associated with it, such as the fact that (provided there is an intercept term) $TSS=ESS+RSS$, or the fact that $R^2 = r^2$ (where $r$ is the Pearson correlation coefficient). In the case of logistic regression we have none of these niceties. The pseudo-$R^2$ is something that has been computed through an artificial formula made to look like the real $R^2$, but unfortunately it lacks most of the properties that make the real $R^2$ useful.
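For concreteness, one common variant is McFadden’s pseudo-$R^2$, defined as $1 - \ln L_{\text{model}} / \ln L_{\text{null}}$. A sketch with made-up log-likelihood values:

```python
# McFadden's pseudo-R^2: 1 - lnL_model / lnL_null, where lnL_null is the
# log-likelihood of the intercept-only model.  The values are made up.
ll_model = -120.0  # log-likelihood of the fitted model
ll_null = -170.0   # log-likelihood of the intercept-only model

pseudo_r2 = 1 - ll_model / ll_null
print(pseudo_r2)  # about 0.294
```

Note that, unlike the real $R^2$, this number is not a proportion of explained variance; it is just a likelihood-based analogy.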

Metrics Specific to Logistic Regression – Pearson’s $\chi^2$-Test and Hosmer-Lemeshow Test

Secondly, continuing on the topic of odds and odds ratios and 2×2 tables from the previous section, there is Pearson’s $\chi^2$-test for categorical data that can be applied to our binary classification problem.  There is also the Hosmer-Lemeshow test.

The problem with these metrics is similar to one of the problems that we had with $R^2$ in the case of linear regression. You will recall that in linear regression, $R^2$ was a great measure of goodness of fit by many standards, but it had a few drawbacks, one being that it only allowed comparing linear regression models with other linear regression models, and not with non-linear-regression alternatives. Similarly here, both the Pearson $\chi^2$-test and the Hosmer-Lemeshow test let us compare various logistic regression models to each other and determine the better candidate, but may not be applicable to some of the other models that we may want to consider as alternatives to logistic regression.

General Model Evaluation and Goodness Of Fit Measures

And it is for this reason that it is best to use a very general goodness of fit measure, not a metric that applies to logistic regression models but not to other models. We have talked about model evaluation in another post.  And we have concluded that the best ways to evaluate and compare models are to either use the $AIC$ or $BIC$ statistics or to use cross-validation. We have noted that the preference between $AIC$ and $BIC$ is largely philosophical, and that leave-one-out cross-validation and $AIC$ are asymptotically equivalent. So it would be these three that would be the best methods for evaluating our logistic regression model – they would enable us to compare the model to all sorts of alternative models, not all of which have to be logistic regression models.
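As a sketch of how such a comparison works: $AIC = 2k - 2\ln L$, with lower values preferred. The candidate models and their log-likelihoods below are made up purely for illustration:

```python
# AIC = 2k - 2*lnL: lower is better.  Illustrative candidates only.
candidates = {
    "logistic, 3 predictors": (4, -118.0),  # (k = number of parameters, lnL)
    "logistic, 6 predictors": (7, -116.5),
}
aic = {name: 2 * k - 2 * ll for name, (k, ll) in candidates.items()}
for name, value in aic.items():
    print(name, value)

# The extra predictors barely improve lnL, so the smaller model wins on AIC.
best = min(aic, key=aic.get)
print(best)
```

The same formula applies to any model fitted by MLE, which is exactly why $AIC$ lets us compare logistic regression against non-logistic alternatives.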

Other Alternatives Compared

As mentioned earlier, logistic regression is used for fitting models where the outcome is a binary response variable. There are however other methods one could use to predict binary outcomes. Below are some examples of “competing” binary response models along with their strengths and weaknesses relative to logistic regression.

Probit Regression

We could use a method similar to logistic regression but where the “squeezing” function is not the logistic function but instead some other function with $\left(-\infty,\infty\right)$ domain and $\left[0,1\right]$ range. For this purpose, any cumulative distribution function (CDF) will do. A natural candidate would be the CDF of the normal distribution. This way we still have the linear relationship

$z=\beta_{0}x_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\text{...}+\beta_{n}x_{n}$

but this time it is squeezed into the $\left[0,1\right]$ range by means of the normal CDF.  The new sigmoid curve looks very similar to the old logistic sigmoid curve, only approaching its limits a little faster.

Thus we have a model similar to logistic regression that we can just as well fit by means of MLE and grade by means of $AIC$, $BIC$ or cross-validation. We call this model probit regression.
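A quick way to compare the two “squeezing” functions, using only the standard library (the normal CDF can be written via the error function):

```python
import math

def logistic(z):
    return 1 / (1 + math.exp(-z))

def normal_cdf(z):
    # Standard normal CDF via the error function, so no SciPy is needed.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Both "squeezing" functions map (-inf, inf) into (0, 1) and cross 0.5
# at z = 0; the normal CDF just approaches its limits a little faster.
for z in (-3, -1, 0, 1, 3):
    print(z, round(logistic(z), 4), round(normal_cdf(z), 4))
```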

On one hand probit regression appears to be more intuitive – instead of using the exotic-looking logistic function we use the very popular normal distribution CDF. The former usually gets taught at the very end of a first-year university calculus course only to never be seen again, while the latter is the well-known and ubiquitous normal distribution that everyone is comfortable with. So it makes sense to use probit rather than logit, right?

Not necessarily. Firstly, the fact that the “squeezing” function is the normal CDF does not actually add any nice properties to our model – it does not make the residuals normally distributed or introduce a normal distribution anywhere. We simply use the normal CDF as a means of squeezing the $\left(-\infty,\infty\right)$ range into $\left[0,1\right]$, without making any normality assumptions and without obtaining a normal distribution anywhere. On the other hand, with logistic regression we actually do have a very nice and straightforward interpretation of the logit (log odds) and the individual coefficients (log odds ratio) – and that is a major advantage over probit regression.