What is Data Wrangling?
Data wrangling is the process of converting a raw dataset into a more “cooked” one that is more amenable to automated data analysis.
In previous posts we discussed how to get hold of data, covering publicly available datasets as well as reading data from web services via APIs, web scraping, and other methods. But even after you have obtained a dataset, chances are it is still not in a format where you can simply load it into an R or Python dataframe or a SQL table and start analyzing.
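To make this concrete, here is a minimal sketch of what wrangling typically involves, using only Python's standard library. The raw data, the column names, and the cleaning rules are all hypothetical, chosen to illustrate common problems: inconsistent casing, stray whitespace, missing values, and mixed date formats.

```python
import csv
import io

# Hypothetical raw export: inconsistent casing, stray whitespace,
# a missing value, and two different date formats.
raw = """name,city,signup_date
 Alice ,NEW YORK,2021-03-01
bob,  new york ,03/07/2021
Carol,Boston,
"""

def clean_row(row):
    name = row["name"].strip().title()
    city = row["city"].strip().title()
    date = row["signup_date"].strip()
    # Normalize MM/DD/YYYY to ISO YYYY-MM-DD; keep blanks as None.
    if "/" in date:
        m, d, y = date.split("/")
        date = f"{y}-{m.zfill(2)}-{d.zfill(2)}"
    return {"name": name, "city": city, "signup_date": date or None}

reader = csv.DictReader(io.StringIO(raw))
cleaned = [clean_row(r) for r in reader]
```

After this pass every name and city is consistently cased, all dates share one format, and missing values are explicit `None`s rather than empty strings, so downstream analysis code has far fewer special cases to handle.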
What is web scraping?
Continuing our discussion about sourcing data sets, today we will talk about web scraping.
Web scraping is the automated extraction of data from a web page by exploiting the structure of the HTML code underlying the page. Other definitions of web scraping can be found on Wikipedia (but of course!) and Webopedia.
In terms of getting a dataset for data analysis, web scraping is usually the “last resort” to fall back on when all else has failed. Concretely, if you have had no luck finding a well-organized dataset in a conventional format like CSV or JSON, and no luck plugging into your desired data stream via some sort of controlled API, you would then try web scraping.
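The phrase “exploiting the structure of the HTML” can be sketched with Python's built-in html.parser, with no third-party dependencies. The HTML fragment below is hypothetical; a real scraper would first fetch the page, for example with urllib or the requests library, and would adapt the tag/attribute logic to that page's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a fetched web page.
html = """
<table id="prices">
  <tr><td class="item">apples</td><td class="price">1.20</td></tr>
  <tr><td class="item">pears</td><td class="price">2.50</td></tr>
</table>
"""

class PriceScraper(HTMLParser):
    """Collects the text of every <td> cell; the table's regular
    two-column structure is what makes the extraction possible."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

scraper = PriceScraper()
scraper.feed(html)
# Pair alternating cells into (item, price) rows.
rows = list(zip(scraper.cells[::2], scraper.cells[1::2]))
```

In practice libraries like BeautifulSoup make this kind of structural navigation much more convenient, but the underlying idea is the same: the page's markup, not any official data feed, is what you are parsing.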
What is Python?
Python is a programming language that is often used in data science applications. It is not a data-science-specific tool but a versatile, general-purpose programming language widely used for building all sorts of applications – games, interactive websites, enterprise software, and so on. However, because Python is relatively easy to learn, is open source and free, and has amassed an impressive range of libraries geared towards number crunching, scientific computing, and machine learning, it has become the programming language of choice in the data science community.
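To give a small taste of that ease of use: even without any third-party libraries, the standard library handles common summary statistics in a few readable lines. The sample data here is made up purely for illustration.

```python
import statistics

measurements = [2.5, 3.1, 2.9, 3.4, 2.8]  # hypothetical sample

mean = statistics.mean(measurements)
stdev = statistics.stdev(measurements)  # sample standard deviation
```

Libraries like NumPy and pandas build on this readability while scaling the same kinds of operations to millions of rows.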