What is web scraping?
Continuing our discussion about sourcing data sets, today we will talk about web scraping.
Web scraping is the process of automated extraction of data from a web page by exploiting the structure of the HTML code underlying the page. Other definitions of web scraping can be found on Wikipedia (but of course!) and Webopedia.
In terms of getting a data set for our data analysis, web scraping is usually the “last resort” option to fall back on if all else has failed. Concretely, if you have had no luck finding a nice, well-organized data set in a conventional format like CSV or JSON, and no luck plugging into your desired data stream via some sort of controlled API, you would then try web scraping.
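To make "exploiting the structure of the HTML code" a bit more concrete, here is a minimal sketch using only Python's standard library. The HTML fragment, the `TableScraper` class and the `scrape_table` helper are made up for illustration; a real scraper would first download the page and would usually have to cope with much messier markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML first.
SAMPLE_HTML = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apples</td><td>1.20</td></tr>
  <tr><td>Pears</td><td>0.95</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # row currently being read, or None
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            # Trim stray whitespace once the whole row has been read.
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._row[-1] += data


def scrape_table(html):
    """Return the table cells found in `html` as a list of rows."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
```

The scraper never "understands" the page. It simply relies on the fact that the data sits inside `tr`, `th` and `td` tags, which is also why scrapers tend to break when a site changes its layout.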
We Need APIs
Today we will continue our discussion of sourcing datasets for research and analysis.
Last time we talked about datasets that are available to the general public and can be freely downloaded and used for research, and we discussed various portals, repositories and tools for finding them. Today we will talk about extracting data from the web programmatically. Basically, instead of manually downloading a dataset from some site A onto the hard drive and then using some program B to analyze it, we would like program B to connect to site A, read the data automatically and analyze it on the go.
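As a rough illustration of "program B connects to site A", here is a minimal Python sketch using only the standard library. It assumes site A serves its data as a JSON list of records. The function names, the `temp` field and the sample records are all hypothetical, and the sample is parsed offline, so no real site is contacted.

```python
import json
import statistics
from urllib.request import urlopen


def fetch_records(url, timeout=10):
    """Download a JSON document from `url` and return the parsed Python object."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def mean_of(records, field):
    """Average a numeric field across a list of dict records."""
    return statistics.mean(record[field] for record in records)


# Offline stand-in for what fetch_records() might return (made-up data).
sample_text = '[{"city": "A", "temp": 20.0}, {"city": "B", "temp": 24.0}]'
sample = json.loads(sample_text)
average = mean_of(sample, "temp")
```

With a real endpoint, the two steps would chain together as something like `mean_of(fetch_records(url), "temp")`, where `url` is whatever address the data provider documents. The download and the analysis then happen in one run, with no file saved to disk in between.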
What is Python?
Python is a programming language that is often used in data science applications. It is not just a data-science-specific tool, but is in fact a versatile all-purpose programming language that is widely used for building all sorts of applications – games, interactive websites, enterprise software, etc. However, because Python is relatively easy to learn, is free and open source, and has amassed an impressive range of libraries geared towards number crunching, scientific computing and machine learning, it has become the programming language of choice in the data science community.
All you need is a bit of web programming know-how (and I’m talking rather basic stuff), the Johnny-Five library that runs on Node.js and a simple open-source Arduino microcontroller!
Why is this big news? Because all of these technology components are simple, easy to get hold of and easy to learn. This allows almost anyone to get into robotics, play around and contribute.