What is Data Wrangling?
Data wrangling is the process of converting a raw dataset into a “cooked” one that is more amenable to automated data analysis.
In the previous posts we discussed how to get hold of data: we covered publicly available datasets, as well as reading data from web services via APIs, web scraping and other methods. But even after you have obtained a dataset, the chances are it is still not in a format where you can simply load it into an R or Python dataframe or a SQL table and start analyzing.
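To make this concrete, here is a minimal sketch of the kind of cleanup wrangling involves, in plain Python. The raw records, field names and cleaning rules below are all hypothetical examples, not from any particular dataset: fields arrive with inconsistent whitespace, capitalization, spellings and missing values, and we normalize them into uniform records.

```python
# Hypothetical raw records: inconsistent spacing, casing, spellings,
# and a non-numeric placeholder for a missing age.
raw_rows = [
    "  Alice ,  34 , NY ",
    "BOB,n/a,ny",
    "Carol, 29 ,  New York",
]

def clean_row(line):
    """Split a raw CSV-like line and normalize each field."""
    name, age, city = [field.strip() for field in line.split(",")]
    return {
        "name": name.title(),                        # normalize capitalization
        "age": int(age) if age.isdigit() else None,  # coerce to int, else mark missing
        "city": {"ny": "New York"}.get(city.lower(), city.title()),  # unify spellings
    }

cleaned = [clean_row(r) for r in raw_rows]
# cleaned[1] → {"name": "Bob", "age": None, "city": "New York"}
```

Only after a pass like this can the records be loaded into a dataframe or a SQL table with consistent types and values in each column.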
NoSQL is a very fashionable buzzword lately: everyone has heard of it, everyone knows it to be the next big thing, and yet very few know what it really is. To most, NoSQL is a magical new technology that is just like SQL except friendly to parallel processing and therefore scalable to very large datasets, something to do with the cloud, Hadoop and MapReduce. That is partially true, but perhaps a clearer description is called for.
In fact, NoSQL is not one specific technology or paradigm. Instead, it is a loosely grouped collection of data storage and retrieval technologies that all extend or altogether replace the traditional relational database paradigm in one form or another, but always in some way that makes them horizontally scalable, versatile and suitable for large and fast data flows. Sometimes the paradigm is relational, sometimes not; sometimes there is SQL, sometimes there is no room for it. NoSQL is a heterogeneous set of technologies, and the right interpretation of the acronym is “Not Only SQL”.
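The contrast can be sketched in a few lines of Python using only the standard library. This is a toy illustration, not any real NoSQL product: the table, keys and field names are invented. The relational side uses a fixed schema queried with SQL; the document side keeps schemaless JSON records under keys, so each record can carry different fields.

```python
import json
import sqlite3

# Relational paradigm: a fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")
name = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]

# Document-style storage (as in many NoSQL stores): records keyed by id,
# with no fixed schema -- each document may have different fields.
store = {}
store["user:1"] = json.dumps({"name": "Ada", "tags": ["admin"]})
store["user:2"] = json.dumps({"name": "Bob", "signup": "2013-01-05"})
doc = json.loads(store["user:1"])
```

Real NoSQL systems add distribution, replication and query machinery on top of this basic idea, which is what makes them horizontally scalable.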