Re-sampling Pt.1: Cross-Validation

Today and over the next three posts we will talk about re-sampling methods – a family of approaches to synthesizing multiple data samples from one original dataset.

There are a number of reasons why you may want to do that and a number of ways in which you could do it. Specific re-sampling methods differ in the way they generate new samples and, consequently, in their computational complexity and their bias-variance trade-offs. Because of this, specific re-sampling methods differ in their suitability for various specific purposes.

Re-sampling Pt.2: Jackknife and Bootstrap

Suppose we have a sample of $n$ data points and we want to estimate some parameter $\theta$.  We come up with $\hat{\theta}$ – an estimator of $\theta$.  What do we know about $\hat{\theta}$?  How good an estimator of $\theta$ is it?  Is it biased?  How efficient is it?
We could answer these questions if we knew the distribution of the population from which $x_{1}, \ldots, x_{n}$ came. More often than not, however, we don’t know anything about the distribution of the underlying population: all we have is a sample, and from it we want to figure out things about the population.
This is where re-sampling methods, such as the jackknife and the bootstrap, come into play.
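To give a flavour of both methods before diving in, here is a minimal sketch: the bootstrap re-samples with replacement, the jackknife leaves out one observation at a time, and each yields a standard-error estimate for $\hat{\theta}$. The exponential sample and all parameter choices below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100)  # a made-up sample of n = 100 points

def bootstrap_se(sample, estimator, n_boot=2000, seed=1):
    """Standard error of `estimator` via resampling with replacement."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    stats = np.array([estimator(rng.choice(sample, size=n, replace=True))
                      for _ in range(n_boot)])
    return stats.std(ddof=1)

def jackknife_se(sample, estimator):
    """Standard error via leave-one-out re-samples."""
    n = len(sample)
    stats = np.array([estimator(np.delete(sample, i)) for i in range(n)])
    return np.sqrt((n - 1) / n * ((stats - stats.mean()) ** 2).sum())

# The bootstrap handles estimators like the median, where no simple
# standard-error formula exists
print(bootstrap_se(x, np.median))
print(jackknife_se(x, np.mean))
```

A sanity check on the sketch: for the sample mean, the jackknife standard error reduces algebraically to the familiar $s/\sqrt{n}$.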

The bias-variance trade-off is a fundamental principle at the heart of statistics and machine learning.  You may have already seen its various faces, for example in this equation:

$\text{MSE}\left(\hat{\theta}\right) = \left[\text{Bias}\left(\hat{\theta}\right)\right]^{2} + \text{Var}\left(\hat{\theta}\right)$

or in linear regression analysis, where omitting valid explanatory variables makes other coefficients biased, while introducing correlated explanatory variables inflates the standard errors of the coefficients.

Those, however, were all concrete examples of a much broader phenomenon that I want to discuss here.
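The decomposition above is also easy to verify numerically. In this sketch (my own illustration, not from the original discussion) the guinea pig is the biased maximum-likelihood estimator of a normal variance, which divides by $n$ instead of $n-1$:

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0            # population variance of N(0, 2^2)
n, trials = 10, 100_000

# Biased MLE of the variance (divides by n), recomputed over many samples
estimates = np.array([rng.normal(0, 2, n).var(ddof=0) for _ in range(trials)])

bias = estimates.mean() - true_var        # theory says (n-1)/n * 4 - 4 = -0.4
variance = estimates.var()
mse = ((estimates - true_var) ** 2).mean()

# MSE = Bias^2 + Var, up to floating-point noise
print(bias ** 2 + variance, mse)
```

Here the two sides of the decomposition agree exactly, and the estimated bias lands near the theoretical value of $-\sigma^2/n = -0.4$.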

Data Wrangling: From Data Sources to Datasets

What is Data Wrangling

Data wrangling is the process of converting a raw dataset into a more “cooked” dataset that is more amenable to automated data analysis.

In the previous posts we discussed how to get hold of data and covered publicly available datasets, as well as reading data from web services via APIs, web scraping and other methods. But even after you have obtained a dataset, the chances are it is still not in a format where you can just throw it into an R or Python dataframe or a SQL table and start analyzing away:
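To make this concrete, here is a minimal sketch of what wrangling might look like in pandas. The messy table, the column names and the cleaning steps are all made up for illustration:

```python
import pandas as pd

# A hypothetical "raw" extract: stray whitespace, inconsistent labels,
# numbers stored as strings, and missing values
raw = pd.DataFrame({
    "country": [" usa", "USA ", "U.K.", None],
    "gdp": ["21,400", "21,400", "3,100", "n/a"],
})

cooked = (
    raw
    .assign(
        country=raw["country"].str.strip().str.upper(),            # normalize labels
        gdp=pd.to_numeric(raw["gdp"].str.replace(",", ""),
                          errors="coerce"),                        # strings -> numbers
    )
    .dropna()           # drop rows that could not be salvaged
    .drop_duplicates()  # the two USA rows collapse into one
)
print(cooked)
```

The raw frame was unusable for analysis; the cooked one has one clean, typed row per country.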

Sourcing Data 101: Last Resort – Web Scraping

What is web scraping?

Continuing our discussion about sourcing data sets, today we will talk about web scraping.

Web scraping is the process of automated extraction of data from a web page by exploiting the structure of the HTML code underlying the page. Other definitions of web scraping can be found on Wikipedia (but of course!) and Webopedia.

In terms of getting a data set for our data analysis, web scraping is usually the “last resort” option to fall back on if all else has failed. Concretely, if you have had no luck finding a nice well-organized data set in a conventional format like CSV or JSON and if you have had no luck plugging into your desired data stream via some sort of a controlled API, you would then try web scraping.
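As a toy illustration of “exploiting the structure of the HTML”, here is a sketch that pulls a table out of a page using only Python’s standard-library `html.parser`. In practice you would fetch the page over HTTP first and would more likely reach for a dedicated library such as BeautifulSoup or Scrapy; the page snippet below is invented for the example:

```python
from html.parser import HTMLParser

# A tiny HTML fragment standing in for a downloaded page
PAGE = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Tea</td><td>2.50</td></tr>
  <tr><td>Coffee</td><td>3.00</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects the text of each <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:           # keep only text that sits inside a cell
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)
```

The scraper recovers the table as a list of rows, ready to be loaded into a dataframe.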

Sourcing Data 101: APIs and the Programmable Web

We Need APIs

Today we will continue our discussion of sourcing datasets for research and analysis.

Last time we talked about datasets that are available to the general public and can be freely downloaded and used for research, and we discussed various portals, repositories and tools for finding them. Today we will talk about extracting data from the web programmatically. Basically, instead of manually downloading a dataset from some site A onto the hard drive and then using some program B to analyze it, we would like program B to connect to site A, read the data automatically and analyze it on the go.
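As a sketch of what “program B connects to site A” looks like in practice: build a query URL, issue an HTTP request, and parse the JSON response. The endpoint, parameters and response below are all hypothetical placeholders; the live network call is commented out and replaced by a canned response so the snippet runs as-is:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint and query parameters -- substitute a real API
BASE = "https://api.example.com/v1/observations"
params = {"station": "KJFK", "from": "2024-01-01", "limit": 100}
url = f"{BASE}?{urlencode(params)}"

# with urlopen(url) as resp:                 # uncomment against a real endpoint
#     data = json.loads(resp.read())

# Canned response standing in for the network call
data = json.loads('{"observations": [{"temp": 3.4}, {"temp": 2.1}]}')
temps = [row["temp"] for row in data["observations"]]
print(url)
print(temps)
```

The parsed values land directly in Python objects, ready for analysis, with no intermediate file on disk.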

Sourcing Data 101: Publicly Available Datasets

The first step in getting started with any data analysis is to actually get hold of some meaningful data.

Fortunately, we live in the age of Big Data, where information is abundant in all its volume, velocity, variety and veracity.

The last 50 years have seen an unprecedented rise of video cameras, sound recorders, counters, logs, sensors and trackers, all accompanied by the ever-decreasing cost of the storage needed to keep all of this data. Suddenly both capturing and storing data is easy, and the challenge has shifted to actually making sense of the data.

Thus, as long as you know how to make sense of data and turn it into meaningful information, there is no shortage of data repositories out there to apply your skills to. A large number of these repositories are free and open to the public, and they should be your first port of call before investing in customized sampling and research or purchasing premium datasets from data service providers.

Today I would like to talk about large data sets that are available on the web, usually for free, for public use.

Logistic Regression – Meaningful Coefficients

Why Logistic Regression

Today I will talk about logistic regression, a method that allows us to use some of the useful features of linear regression to fit models that predict binary outcomes: $1$ vs. $0$, “success” vs. “failure”, “result present” vs. “result absent”.

By itself, linear regression is not adequate for this sort of thing, as it doesn’t place any restrictions on the range of the response variable, and we could end up with values far greater than $1$ or far below $0$. We could discard everything we know about linear regression and use an altogether different approach to build a model that predicts our response variable from a group of explanatory variables. Indeed there are a number of such approaches: trees, forests, SVMs, etc. However, linear regression is such a simple but powerful technique, with so many great things about it, that it would be really good if we could somehow harness its power to build binary predictive models. Logistic regression is the method that allows us to do just that.
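To make the idea concrete, here is a minimal from-scratch sketch: the linear predictor is passed through the sigmoid function, which squashes it into $(0, 1)$, and the coefficients are found by maximizing the log-likelihood with plain gradient ascent. The toy data and the “true” coefficients are invented for illustration (a real analysis would use a fitted routine from statsmodels or scikit-learn):

```python
import numpy as np

def sigmoid(z):
    """Map any real linear predictor into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: one explanatory variable, binary response generated
# from a logistic model with intercept -0.5 and slope 1.5
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = (rng.uniform(size=200) < sigmoid(1.5 * x - 0.5)).astype(float)

# Fit by gradient ascent on the average log-likelihood
X = np.column_stack([np.ones_like(x), x])
beta = np.zeros(2)
for _ in range(2000):
    p = sigmoid(X @ beta)                       # current predicted probabilities
    beta += 0.1 * X.T @ (y - p) / len(y)        # log-likelihood gradient step

print(beta)  # should land in the neighbourhood of (-0.5, 1.5)
```

Unlike raw linear regression, every fitted value `sigmoid(X @ beta)` is a valid probability, strictly between $0$ and $1$.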

Model Evaluation Pt. 1 – LRT, AIC and BIC

We Need a Universal Metric To Evaluate Models

In previous posts we looked at different ways to evaluate the goodness of fit of particular types of models, such as linear regression, trees or logistic regression. For example, in the post on linear regression we looked at $R^2$ and adjusted-$R^2$, and in the post on logistic regression we considered pseudo-$R^2$, Pearson’s $\chi^2$ test and the Hosmer-Lemeshow test.

These statistics often come as part of the standard output in statistical software whenever you fit a linear regression, a logistic regression, or whatever specific type of model you are experimenting with. They convey a great deal of meaningful information about how well your model fits the data, but at the same time they are rife with shortcomings; in particular, they are usually meaningful only for one particular type of model. For example, $R^2$ only makes sense in the context of linear regression, and once you start comparing apples with oranges it stops telling you anything meaningful.

What would be really great is if there was a uniform and consistent way to compare all sorts of models against each other. And this is the topic I would like to explore in this post.
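As a preview of the kind of uniform yardstick we are after: AIC and BIC can be computed for any model from its maximized log-likelihood, $\text{AIC} = 2k - 2\ln L$ and $\text{BIC} = k \ln n - 2\ln L$. Here is a sketch for least-squares models, where the Gaussian log-likelihood has a closed form in the residual sum of squares; the toy data comparing a linear and a quadratic fit is my own illustration:

```python
import numpy as np

def gaussian_aic_bic(rss, n, k):
    """AIC and BIC for a least-squares model with k parameters
    (coefficients plus the error variance), via the Gaussian
    maximized log-likelihood ln L = -n/2 * (ln(2*pi*rss/n) + 1)."""
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    aic = 2 * k - 2 * loglik
    bic = k * np.log(n) - 2 * loglik
    return aic, bic

# Toy comparison: does a quadratic term earn its extra parameter
# when the truth is linear?
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100)
y = 1 + 2 * x + rng.normal(0, 1, 100)

for degree in (1, 2):
    X = np.vander(x, degree + 1)                      # polynomial design matrix
    beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    aic, bic = gaussian_aic_bic(float(rss[0]), len(y), degree + 2)
    print(degree, round(aic, 1), round(bic, 1))
```

Because both criteria are built on the log-likelihood, the same two numbers can be compared across model families, not just within linear regression.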

Linear Regression – How To Do It Properly Pt.3 – The Process

Time To Summarise The Previous Two Posts

In the last two posts we talked about the maths and the theory behind linear regression. We covered the mathematics of fitting an OLS linear regression model and looked at the derivation of individual regressor coefficient estimates. We discussed the Gauss-Markov Theorem, the conditions that are necessary for the theorem to hold and, more generally, the conditions that must be satisfied in order for us to be able to use regression effectively and to place significant faith in the coefficient estimates. We looked at sample size, hypothesized about the distribution of the theoretical error term $\epsilon$, discussed omitted explanatory variables, unnecessary explanatory variables, proxy variables, non-linear regressors and other factors that may influence our model’s reliability, unbiasedness, efficiency and overall goodness of fit. Finally, we looked at objective ways of measuring this goodness of fit.

This has been a long series of posts, with a lot of maths and a lot of ifs and buts, and today I would like to summarise all of it and attempt to come up with a simple step-by-step process that one can follow to get the most out of linear regression without getting burned by its many pitfalls and dangers.
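As a taste of what one pass of such a process might look like in code, here is a bare-bones sketch: fit OLS, inspect the residuals, and compute $R^2$ and adjusted $R^2$. The simulated data, coefficients and sample size are invented for illustration:

```python
import numpy as np

# Simulated data with known coefficients: y = 3 + 1.5*x1 - 2*x2 + noise
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3 + 1.5 * x1 - 2 * x2 + rng.normal(0, 1, n)

# Step 1: fit OLS
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 2: inspect the residuals (they should look like mean-zero noise)
resid = y - X @ beta

# Step 3: measure goodness of fit
ss_res = (resid ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - X.shape[1])

print(beta.round(2), round(r2, 3), round(adj_r2, 3))
```

In a real analysis each step would expand considerably: residual plots, checks of the Gauss-Markov conditions, and the variable-selection questions discussed in the previous posts.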