# Re-sampling Pt.1: Cross-Validation

Today and over the next three posts we will talk about re-sampling methods: a family of approaches for synthesizing multiple data samples from a single original data set.

There are a number of reasons why you may want to do that and a number of ways in which you could do it. Specific re-sampling methods differ in how they generate new samples and, consequently, in their computational complexity and their bias-variance trade-off. Because of this, they also differ in their suitability for particular purposes.
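To make this concrete, here is a minimal sketch of k-fold cross-validation, one of the re-sampling methods we will cover. It uses scikit-learn on toy data; the model and the data are purely illustrative assumptions, not taken from the posts that follow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy data: 100 observations, 3 explanatory variables (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# 5-fold cross-validation: each fold is held out once for evaluation
# while the model is fitted on the remaining four folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(scores.mean(), scores.std())
```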

# Re-sampling Pt.2: Jackknife and Bootstrap

Suppose we have a sample of $n$ data points and we want to estimate some parameter $\theta$.  We come up with $\hat{\theta}$ – an estimator of $\theta$.  What do we know about $\hat{\theta}$?  How good an estimator of $\theta$ is it?  Is it biased?  How efficient is it?
We could answer these questions if we knew the distribution of the population from which $x_{1}, \ldots, x_{n}$ came. More often than not, however, we don’t know anything about the distribution of the underlying population; all we have is a sample, and from it we want to figure out things about the population.
This is where re-sampling, such as the jackknife or the bootstrap, comes into play.
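As a taste of what is to come, here is a minimal sketch of the bootstrap idea: re-sample with replacement, re-compute the estimator on every re-sample and use the spread of the replicated estimates to gauge the variability of $\hat{\theta}$. The data and the choice of estimator (the sample median) are purely illustrative.

```python
import numpy as np

def bootstrap_se(x, estimator, n_boot=2000, seed=0):
    """Bootstrap estimate of the standard error of an estimator.

    Re-samples x with replacement n_boot times, re-computes the estimator
    on each re-sample and returns the standard deviation of the replicates.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    replicates = np.array(
        [estimator(rng.choice(x, size=n, replace=True)) for _ in range(n_boot)]
    )
    return replicates.std(ddof=1)

# Example: standard error of the sample median of a skewed sample.
x = np.random.default_rng(1).exponential(scale=2.0, size=50)
print(bootstrap_se(x, np.median))
```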

The bias-variance trade-off is a fundamental principle that is at the heart of statistics and machine learning.  You may have already seen its various faces, for example in this equation:

$\text{MSE}\left(\hat{\theta}\right) = \left[\text{Bias}\left(\hat{\theta}\right)\right]^{2} + \text{Var}\left(\hat{\theta}\right)$

or in linear regression analysis, where omitting relevant explanatory variables biased the remaining coefficients, while introducing correlated explanatory variables inflated the standard errors of the coefficients.
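For reference, the decomposition in that equation follows directly from expanding the definition of the mean squared error (this is the standard derivation, included here only for completeness):

$$
\begin{aligned}
\text{MSE}\left(\hat{\theta}\right) &= \mathbb{E}\left[\left(\hat{\theta} - \theta\right)^{2}\right] \\
&= \mathbb{E}\left[\left(\hat{\theta} - \mathbb{E}\left[\hat{\theta}\right] + \mathbb{E}\left[\hat{\theta}\right] - \theta\right)^{2}\right] \\
&= \underbrace{\mathbb{E}\left[\left(\hat{\theta} - \mathbb{E}\left[\hat{\theta}\right]\right)^{2}\right]}_{\text{Var}\left(\hat{\theta}\right)} + \underbrace{\left(\mathbb{E}\left[\hat{\theta}\right] - \theta\right)^{2}}_{\left[\text{Bias}\left(\hat{\theta}\right)\right]^{2}},
\end{aligned}
$$

where the cross term vanishes because $\mathbb{E}\left[\hat{\theta} - \mathbb{E}\left[\hat{\theta}\right]\right] = 0$.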

Those, however, were all concrete examples of a much broader phenomenon that I want to discuss here.

# Logistic Regression – Meaningful Coefficients

## Why Logistic Regression

Today I will talk about logistic regression, a method that allows us to use some of the useful features of linear regression to fit models that predict binary outcomes: $1$ vs. $0$, “success” vs. “failure”, “result present” vs. “result absent”.

By itself, linear regression is not adequate for this sort of thing, as it doesn’t place any restrictions on the range of the response variable and we could end up with predicted values far greater than $1$ or far below $0$. We could discard everything we know about linear regression and use an altogether different approach to build a model that would, on the basis of a group of explanatory variables, predict our response variable. Indeed there are a number of such approaches: trees, forests, SVMs, etc. However, linear regression is such a simple but powerful technique with so many great things about it – it would be really good if we could somehow harness its power to build binary predictive models. And logistic regression is the method that allows us to do just that.
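As a small preview, here is a minimal sketch of fitting a logistic regression in Python with statsmodels on simulated data (both the data and the package choice are illustrative assumptions, not prescribed by this post). The fitted coefficients live on the log-odds scale, which is exactly what makes them meaningful to interpret:

```python
import numpy as np
import statsmodels.api as sm

# Simulated binary-outcome data (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x)))  # logistic link
y = rng.binomial(1, p)

# Fit a logistic regression; coefficients are on the log-odds scale.
X = sm.add_constant(x)
fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)             # intercept and slope (log-odds)
print(np.exp(fit.params[1]))  # slope expressed as an odds ratio
```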

# Model Evaluation Pt. 1 – LRT, AIC and BIC

## We Need a Universal Metric To Evaluate Models

In previous posts we looked at different ways to evaluate the goodness of fit of particular types of models such as linear regression, trees or logistic regression. For example, in the post on linear regression we looked at $R^2$ and adjusted-$R^2$, and in the post on logistic regression we considered pseudo-$R^2$, Pearson’s $\chi^2$ test and the Hosmer-Lemeshow test.

These statistics often come as part of the standard output in statistical software whenever you fit a linear regression, a logistic regression or whatever specific type of model you are experimenting with. They convey a great deal of meaningful information about how well your model has fitted the data, but at the same time they are rife with shortcomings and, in particular, are usually meaningful only for one particular type of model. For example, $R^2$ only makes sense in the context of linear regression, and once you start comparing apples with oranges it stops telling you anything meaningful.

It would be really great if there were a uniform and consistent way to compare all sorts of models against each other. And this is the topic I would like to explore in this post.
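As a rough preview of where this is going, information criteria such as AIC and BIC boil down to penalised log-likelihoods that can be computed for any model fitted by maximum likelihood. Here is a minimal sketch (the log-likelihood values and parameter counts are made-up illustrative numbers):

```python
import numpy as np

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * np.log(n) - 2 * log_likelihood

# A richer model with a slightly better log-likelihood does not
# necessarily win once the complexity penalty is applied.
print(aic(-120.0, k=3), aic(-118.5, k=6))                 # smaller is better
print(bic(-120.0, k=3, n=100), bic(-118.5, k=6, n=100))
```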

# Linear Regression – How To Do It Properly Pt.3 – The Process

## Time To Summarise The Previous Two Posts

In the last two posts we talked about the maths and the theory behind linear regression. We covered the mathematics of fitting an OLS linear regression model and looked at the derivation of individual regressor coefficient estimates. We discussed the Gauss-Markov Theorem, the conditions that are necessary for the theorem to hold and, more generally, the conditions that must be satisfied for us to be able to use regression effectively and to place significant faith in the coefficient estimates. We looked at the sample size, hypothesized about the distribution of the theoretical error term $\epsilon$, and discussed omitted explanatory variables, unnecessary explanatory variables, proxy variables, non-linear regressors and other factors that may influence our model’s reliability, unbiasedness, efficiency and overall goodness of fit. Finally, we looked at objective ways of measuring this goodness of fit.

This has been a long series of posts with a lot of maths and a lot of ifs and buts, so today I would like to summarise all of it and attempt to come up with a simple step-by-step process that one can follow to get the most out of linear regression without getting burned by its many pitfalls and dangers.

# Linear Regression – How To Do It Properly Pt.2 – The Model

## Model Specification and Evaluation

In the last post we talked about the maths behind linear regression. We looked at how the model is fitted, how individual coefficient estimates are computed and what their properties, such as mean and variance, are. We also went over some important conditions that must be satisfied for linear regression to really be an effective and powerful tool for data analysis, and we made the point that unless all of these conditions are met, the OLS linear regression model loses most of its authority and other models often become better alternatives.

In today’s post I would like to continue looking at the mathematical due diligence that an analyst needs to do in order to make proper use of linear regression. Specifically, I would like to talk about specifying the model – selecting the explanatory variables that should be in the model, omitting the explanatory variables that should be left out and avoiding confusion between causality and correlation. I will also look at ways of evaluating and comparing linear regression models with each other and with other kinds of models.

# Linear Regression – How To Do It Properly Pt.1 – The Maths

Today I would like to talk about the mathematical concepts behind ordinary least squares (OLS) linear regression. We will look at the linear algebra used for fitting linear regression models and for estimating regression coefficients. We will also talk about theorems that make linear regression so powerful and we will investigate how, depending on which preconditions for which theorems are met, regression models can be meaningful or completely meaningless or anything in between.
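To anchor the discussion, here is a minimal sketch of the closed-form OLS estimate, $\hat{\beta} = \left(X^{T}X\right)^{-1}X^{T}y$, computed directly with NumPy on simulated data (the data are illustrative only):

```python
import numpy as np

# Simulated data (illustrative only): y = 2 + 3x + noise.
rng = np.random.default_rng(7)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.3, size=50)

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Closed-form OLS estimate: solve (X'X) beta = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to [2, 3]
```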

# Linear Regression – How To Do It Properly – Pt 0

Linear regression is dangerous. Very dangerous. Here is why…

Most introductory statistics courses taken by social science specialists, after covering some descriptive basics like skewness, kurtosis and Student’s t-distribution, finish off with one piece of “sort of advanced” statistical material – linear regression. Econometrics courses for economists focus almost exclusively on linear regression, with only a chapter or two dedicated to things like logistic regression or trees. More recently, the numerous data science and machine learning courses aimed at technologists again treat linear regression as the most important citizen of data science. All of this leads many in social sciences, economics, finance and technology to believe that data analysis is pretty much linear regression.