Data Wrangling: From Data Sources to Datasets

What is Data Wrangling?

Data wrangling is the process of converting a raw dataset into a more “cooked” dataset that is more amenable to automated data analysis.
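To make the raw-versus-cooked idea a bit more tangible, here is a minimal sketch using only the Python standard library. The sample text, the column names and the `wrangle` function are all made up for illustration; real datasets will need their own cleaning rules.

```python
import csv
import io

# Hypothetical raw export (an assumption for this sketch): stray whitespace,
# inconsistent capitalization, thousands separators and an "N/A" placeholder.
RAW = """ Name , Joined , Salary
 alice ,2015-03-01," 52,000 "
BOB,2016-11-15,N/A
"""

def wrangle(raw_text):
    """Return a list of dicts with normalized keys and typed values."""
    reader = csv.reader(io.StringIO(raw_text))
    # Normalize the header: trim whitespace and lowercase the column names.
    header = [h.strip().lower() for h in next(reader)]
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        row = dict(zip(header, (value.strip() for value in record)))
        # Make capitalization consistent.
        row["name"] = row["name"].title()
        # Turn "52,000" into the integer 52000; anything non-numeric becomes None.
        salary = row["salary"].replace(",", "")
        row["salary"] = int(salary) if salary.isdigit() else None
        rows.append(row)
    return rows

cooked = wrangle(RAW)
for row in cooked:
    print(row)
```

The "cooked" result is a list of uniform records with predictable keys and types, which is roughly the shape a dataframe or a SQL table expects.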

In the previous posts we discussed how to get hold of data and covered publicly available datasets, as well as reading data from web services via APIs, web scraping and other methods. But even after you have obtained a dataset, the chances are it is still not in a format where you can just throw it into an R or Python dataframe or a SQL table and start analyzing away:


Sourcing Data 101: Last Resort – Web Scraping

What is web scraping?

Continuing our discussion about sourcing data sets, today we will talk about web scraping.

Web scraping is the process of automated extraction of data from a web page by exploiting the structure of the HTML code underlying the page. Other definitions of web scraping can be found on Wikipedia (but of course!) and Webopedia.
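As a rough illustration of "exploiting the structure of the HTML code", the sketch below pulls the rows out of an HTML table using Python's built-in `html.parser` module. The page fragment and the `TableScraper` class are invented for this example; a real scraper would first download the page and would have to cope with much messier markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment (an assumption for this sketch).
PAGE = """
<html><body>
  <h1>City populations</h1>
  <table id="cities">
    <tr><th>City</th><th>Population</th></tr>
    <tr><td>Springfield</td><td>30,000</td></tr>
    <tr><td>Shelbyville</td><td>25,000</td></tr>
  </table>
</body></html>
"""

class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows
        self._row = None       # row currently being read, if any
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()

scraper = TableScraper()
scraper.feed(PAGE)
scraper.close()

header, *records = scraper.rows
print(header)
print(records)
```

The parser relies entirely on the `<tr>`, `<th>` and `<td>` tags to decide what counts as data, so a change in the page layout can easily break it. That fragility is a large part of why scraping tends to be the fallback option.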

In terms of getting a data set for our data analysis, web scraping is usually the “last resort” option to fall back on if all else has failed. Concretely, if you have had no luck finding a nice, well-organized data set in a conventional format like CSV or JSON, and no luck plugging into your desired data stream via some sort of controlled API, you would then try web scraping.


