# Re-sampling Pt.1: Cross-Validation

Today and over the next three posts we will talk about re-sampling methods: a family of approaches for synthesizing multiple data samples from one original data set.

There are a number of reasons why you might want to do that, and a number of ways in which you could. Specific re-sampling methods differ in how they generate new samples, and consequently in their computational complexity and bias-variance trade-off. Because of this, different re-sampling methods suit different purposes.
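As a first concrete example of a re-sampling method, here is a minimal k-fold cross-validation sketch in plain Python. The toy sample and the trivial "model" (predicting the mean of the training folds) are illustrative assumptions, not from the post; they just show the mechanics of splitting, fitting and scoring on held-out data.

```python
import random

def k_fold_cv(data, k=5, seed=0):
    """Estimate out-of-sample MSE of a mean predictor via k-fold CV."""
    data = data[:]                           # copy before shuffling
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k roughly equal folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        theta_hat = sum(train) / len(train)  # "fit": the training mean
        mse = sum((x - theta_hat) ** 2 for x in test) / len(test)
        scores.append(mse)                   # score on the held-out fold
    return sum(scores) / k                   # average held-out error

sample = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.5, 1.7, 2.6]
print(k_fold_cv(sample, k=5))
```

Each data point is held out exactly once, so the averaged score estimates how the model would do on data it has not seen.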

# Re-sampling Pt.2: Jackknife and Bootstrap

Suppose we have a sample of $n$ data points $x_{1}, \dots, x_{n}$ and we want to estimate some parameter $\theta$. We come up with $\hat{\theta}$, an estimator of $\theta$. What do we know about $\hat{\theta}$? How good an estimator of $\theta$ is it? Is it biased? How efficient is it?
We could answer these questions if we knew the distribution of the population from which $x_{1}, \dots, x_{n}$ came. More often than not, however, we don’t know anything about the distribution of the underlying population; all we have is a sample, and we want to use it to infer things about the population.
This is where re-sampling, such as the jackknife or the bootstrap, comes into play.
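To make this concrete, here is a small plain-Python sketch of both ideas. The sample values are made up, and the parameter $\theta$ is taken to be the population median as an illustrative choice: the bootstrap re-draws samples with replacement to estimate the standard error of $\hat{\theta}$, while the jackknife re-computes $\hat{\theta}$ leaving one observation out at a time.

```python
import random
import statistics

def bootstrap_se(sample, estimator, n_boot=2000, seed=0):
    """Bootstrap standard error: re-draw n points with replacement,
    re-compute the estimator each time, take the spread of the results."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = [
        estimator([rng.choice(sample) for _ in range(n)])
        for _ in range(n_boot)
    ]
    return statistics.stdev(replicates)

def jackknife_estimates(sample, estimator):
    """Jackknife: re-compute the estimator on each leave-one-out sample."""
    return [
        estimator(sample[:i] + sample[i + 1:])
        for i in range(len(sample))
    ]

x = [3.2, 4.1, 3.8, 5.0, 4.4, 3.9, 4.7, 4.0]   # illustrative sample
theta_hat = statistics.median(x)
print("estimate:", theta_hat)
print("bootstrap SE:", bootstrap_se(x, statistics.median))
print("jackknife replicates:", jackknife_estimates(x, statistics.median))
```

The spread of the bootstrap replicates stands in for the sampling distribution of $\hat{\theta}$, which is exactly the object we cannot observe directly when the population is unknown.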

# Data Wrangling: From Data Sources to Datasets

## What is Data Wrangling?

Data wrangling is the process of converting a raw data set into a more “cooked” one that is amenable to automated data analysis.
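A tiny sketch of what this looks like in practice. The raw records below are hypothetical, with the usual problems of a raw export: stray whitespace, inconsistent casing, mixed date formats and a missing value; the “cooked” output is a list of clean, typed rows.

```python
from datetime import datetime

# Hypothetical raw export: messy whitespace, casing, dates and gaps.
raw_rows = [
    " Alice ,2019-03-01, 34 ",
    "BOB,01/03/2019,",
    "carol , 2019-3-7 ,41",
]

def parse_date(s):
    """Try each date format observed in this (made-up) export."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    raise ValueError(f"unrecognised date: {s!r}")

def wrangle(rows):
    clean = []
    for row in rows:
        name, date, age = (field.strip() for field in row.split(","))
        clean.append({
            "name": name.title(),              # normalise casing
            "date": parse_date(date),          # normalise date formats
            "age": int(age) if age else None,  # missing value -> None
        })
    return clean

for record in wrangle(raw_rows):
    print(record)
```

Real wrangling jobs add steps like de-duplication, unit conversion and joining across sources, but the shape is the same: raw strings in, typed and consistent records out.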

In the previous posts we discussed how to get hold of data and covered publicly available datasets, as well as reading data from web services via APIs, web scraping and other methods. But even after you have obtained a dataset, the chances are it is still not in a format where you can just throw it into an R or Python dataframe or a SQL table and start analyzing away:

# Sourcing Data 101: Publicly Available Datasets

The first step in getting started with any data analysis is to actually get hold of some meaningful data.

Fortunately, we live in the age of Big Data, where information is abundant in all its volume, velocity, variety and veracity.

The last 50 years have seen an unprecedented rise in video cameras, sound recorders, counters, logs, sensors and trackers, accompanied by an ever-decreasing cost of the storage needed to keep all of this data. Suddenly both capturing and storing data is easy, and the challenge has shifted to actually making sense of the data.

Thus, as long as you know how to make sense of data and turn it into meaningful information, there is no shortage of data repositories out there to apply your skills to. A large number of these repositories are free and open to the public, and they should be your first port of call before investing in customized sampling and research or purchasing premium data sets from data service providers.

Today I would like to talk about large data sets that are available on the web, usually for free, for public use.

# NoSQL vs. SQL, Who Is Who?

NoSQL has been a very fashionable buzzword lately: everyone has heard of it, everyone knows it is the next big thing, and yet very few know what it really is. To most, NoSQL is a magical new technology that is just like SQL except friendly to parallel processing and therefore scalable to very large datasets, something to do with the cloud, Hadoop and MapReduce. That is partially true, but perhaps a clearer description is called for.

In fact, NoSQL is not one specific technology or paradigm. Instead, it is a loosely grouped collection of data storage and retrieval technologies that extend or altogether replace the traditional relational database paradigm in one form or another, but always in some way that makes them horizontally scalable, versatile and suitable for large and fast data flows. Sometimes the paradigm is relational, sometimes not; sometimes there is SQL, sometimes there is no room for it. NoSQL is a heterogeneous set of technologies, and the right interpretation of the acronym is “Not Only SQL”.

# KPCB Internet Trend Report 2014 Is Out Now

Kleiner Perkins (KPCB) is a venture capital firm that has, since its establishment in 1972, successfully invested in the incubation of AOL, Amazon.com, Citrix, Compaq, Electronic Arts, Google, Intuit, Juniper Networks, Netscape, Sun Microsystems and Symantec, among others; they are considered one of Silicon Valley’s top venture capital providers. Their long-awaited annual Internet Trends report has just been released, and it makes a fascinating read. The full report can be found here.

It is a 164-slide PowerPoint presentation, but well worth a read. For those who are too busy, here is a summary distilled to a few key points: