What is web scraping?
Continuing our discussion about sourcing data sets, today we will talk about web scraping.
Web scraping is the automated extraction of data from a web page by exploiting the structure of the page's underlying HTML code. Other definitions of web scraping can be found on Wikipedia (but of course!) and Webopedia.
In terms of getting a data set for our data analysis, web scraping is usually the “last resort” option to fall back on if all else has failed. Concretely, if you have had no luck finding a nice, well-organized data set in a conventional format like CSV or JSON, and no luck plugging into your desired data stream via some sort of controlled API, you would then try web scraping.
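To make "exploiting the structure of the HTML" a bit more concrete, here is a minimal sketch in Python that uses only the standard library. The HTML snippet, the city names and the `TableScraper` class are made up for illustration; a real page would be fetched over HTTP and would almost certainly need more careful handling.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table cells found in `html` as a list of rows."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page content, invented for this example.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30000</td></tr>
  <tr><td>Shelbyville</td><td>25000</td></tr>
</table>
"""

table_rows = scrape_table(SAMPLE_HTML)
for row in table_rows:
    print(row)
```

The scraper knows nothing about cities or populations; it relies purely on the `<tr>`/`<td>` nesting, which is also why scrapers tend to break as soon as the page layout changes.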
We Need API-s
Today we will continue our discussion of sourcing datasets for research and analysis.
Last time we talked about datasets that are available to the general public and can be freely downloaded and used for research, and we discussed various portals, repositories and tools for finding them. Today we will talk about extracting data from the web programmatically. Instead of manually downloading a dataset from some site A onto the hard drive and then using some program B to analyze it, we would like program B to connect to site A, read the data automatically and analyze it on the go.
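The "program B reads from site A" idea can be sketched roughly as follows, assuming the site serves plain CSV over a URL. The function names, the column names and the demo data are all invented for this example; the `data:` URL simply stands in for a real `https://` address so that the sketch runs without a network connection.

```python
import csv
import io
from urllib.request import urlopen


def fetch_csv_rows(url):
    """Download CSV text from `url` and return it as a list of dicts."""
    with urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


def column_total(rows, column):
    """Sum a numeric column across all rows."""
    return sum(float(row[column]) for row in rows)


# Stand-in for "site A": a data: URL carrying made-up CSV content.
# In practice this would be the address of the real data source.
DEMO_URL = (
    "data:text/csv;charset=utf-8,"
    "city,population%0ASpringfield,30000%0AShelbyville,25000%0A"
)

csv_rows = fetch_csv_rows(DEMO_URL)
print(column_total(csv_rows, "population"))
```

Nothing is saved to disk here: the data goes straight from the response into the analysis step, which is the whole point of reading it programmatically.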
All you need is a bit of web programming know-how (and I’m talking rather basic stuff), the Johnny-Five library that runs on Node.js and a simple Arduino open-source microcontroller!
Why is this big news? Because all of these technology components are simple, easy to get hold of and easy to learn. This allows almost anyone to get into robotics, play around and contribute.
I have recently come across WebRTC (RTC stands for Real Time Communication) and found it to be a very neat piece of technology.
WebRTC is a suite of protocols, standards and APIs that allow real-time browser-to-browser communication on a peer-to-peer basis. Well, not quite exactly that if there are firewalls involved, but you get the point.
This doesn’t just mean instant chat, video messaging and file exchange, i.e. things that the likes of Skype already do well. It means a lot of other things, and it is this extension of the usual Skype-like functionality that is the really exciting part. Basically, we now have the ability to bring to life any kind of instant interaction between two web browsing experiences across the world: what I do in my browser while I surf the net determines what you see in your browser!