Likelihood Log

Econometrics of scale

Category Archive: Databases and Datastores

Data Wrangling: From Data Sources to Datasets

What is Data Wrangling

Data wrangling is the process of converting a raw dataset into a more “cooked” dataset that is more amenable to automated data analysis.
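As a rough illustration of what "cooking" a raw dataset can mean, here is a minimal sketch using only the Python standard library. The sample data, the column names and the two date formats are all hypothetical, made up for this example; real data will need its own rules.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export: stray whitespace, inconsistent date formats, a missing value.
RAW = """name , signup_date, spend
 Alice ,2014-03-01, 120.50
Bob,01/04/2014,
 carol,2014-03-15,80
"""

def parse_date(text):
    """Try the two date formats assumed to occur in RAW; return None if neither fits."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def wrangle(raw_text):
    """Turn the raw text into a list of uniformly typed records."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [column.strip() for column in next(reader)]
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = dict(zip(header, (cell.strip() for cell in row)))
        records.append({
            "name": values["name"].title(),
            "signup_date": parse_date(values["signup_date"]),
            "spend": float(values["spend"]) if values["spend"] else None,
        })
    return records

cooked = wrangle(RAW)
```

After this step every record has the same keys and types, with missing values marked explicitly as `None`, which is roughly the shape a dataframe or a SQL table expects.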

In the previous posts we discussed how to get hold of data and covered publicly available datasets, as well as reading data from web services via APIs, web scraping and other methods. But even after you have obtained a dataset, the chances are it is still not in a format where you can just throw it into an R or Python dataframe or a SQL table and start analyzing away:

(more…)


NoSQL vs. SQL, Who Is Who?

NoSQL is a very fashionable buzzword lately. Everyone has heard about it and everyone knows it to be the new big thing, yet very few know what it really is. To most, NoSQL is a magical new technology that is just like SQL except friendly to parallel processing and therefore scalable to very large datasets, with something to do with the cloud, Hadoop and MapReduce. That is partially true, but a clearer description is called for.

In fact, NoSQL is not one specific technology or paradigm. Instead, it is a loosely grouped collection of data storage and retrieval technologies that extend or altogether replace the traditional relational database paradigm in one form or another, but always in a way that makes them horizontally scalable, versatile and suitable for large and fast data flows. Sometimes the paradigm is relational, sometimes not; sometimes there is SQL, sometimes there is no room for it. NoSQL is a heterogeneous set of technologies, and the right interpretation of the name is “Not Only SQL”.
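To make the contrast concrete, here is a small sketch that stores the same kind of record in two ways. It is only an analogy: `sqlite3` stands in for a relational database, and a plain Python dict stands in for a key-value or document store. The table, keys and field names are hypothetical, and a real NoSQL system would spread its keys across many machines rather than hold them in one process.

```python
import json
import sqlite3

# Relational style: a fixed schema declared up front and queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users (id, name, city) VALUES (?, ?, ?)", (1, "Alice", "Oslo"))
relational_row = conn.execute(
    "SELECT name, city FROM users WHERE id = ?", (1,)
).fetchone()
conn.close()

# Key-value / document style: a key maps to a self-describing document,
# so each record can carry its own fields without a schema change.
# A plain dict is used here purely as a stand-in for such a store.
document_store = {}
document_store["user:1"] = json.dumps({"name": "Alice", "city": "Oslo"})
document_store["user:2"] = json.dumps({"name": "Bob", "tags": ["admin", "beta"]})
document = json.loads(document_store["user:2"])
```

The second record has no `city` and adds a `tags` list, something the table above could not hold without altering its schema. Lookups by key are also easy to partition across servers, which is one reason stores of this kind tend to scale horizontally.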

(more…)
