June 5, 2015
Sourcing Data 101: Last Resort – Web Scraping
Continuing our discussion about sourcing data sets, today we will talk about web scraping.
Web scraping is the process of automated extraction of data from a web page by exploiting the structure of the HTML code underlying the page. Other definitions of web scraping can be found on Wikipedia (but of course!) and Webopedia.
In terms of getting a data set for our data analysis, web scraping is usually the “last resort” option to fall back on if all else has failed. Concretely, if you have had no luck finding a nice well-organized data set in a conventional format like CSV or JSON and if you have had no luck plugging into your desired data stream via some sort of a controlled API, you would then try web scraping.
Web scraping is used in applications like price comparison websites (e.g. hotel or supermarket comparisons) and in other situations where the dataset is assembled from smaller subsets drawn from a large and heterogeneous universe of data sources. Web scraping also comes in handy when the information you are after is not available (maybe unintentionally, maybe deliberately) in a nice CSV file or via an accessible API, for example a list from LinkedIn with the details of all the .NET programmers in New Jersey. For some interesting use cases of web scraping, have a look at this beer analytics portal or this angel investor syndication tool.
How Web Scraping Actually Works: HTML, DOM and XPATH
Automated web scraping relies on the fact that the website or web application that you intend to “scrape” has some sort of a regular structure, so that you can describe very precisely to your web scraping tool what kind of data to extract from which part of the underlying HTML.
This is usually achieved by studying the HTML structure of your target page and figuring out a way to uniquely identify the HTML elements of the web page (more precisely, the DOM corresponding to the web page) that contain the data you are interested in. All modern browsers (including IE) have developer tools that allow users to view a web page’s underlying source code and zoom in on individual elements.
For example, in Chrome you can right-click on any part of a web page and a menu will pop up with an option to “inspect element”, which, if clicked, will open up a developer panel with the HTML code view of that part of the web page.
How you uniquely identify DOM elements and access them will depend on the particular technology that you use for web scraping (see below for more detail), but most web scraping tools or programming languages will at the very least allow you to access elements by id, by class, or by name. If all else fails, however, there is another very powerful method for uniquely identifying and accessing elements of a web page: XPATH.
XPATH is a language for addressing parts of an XML document. It lets a web scraper navigate through the HTML / XML of a web page to the part that contains the actual data to scrape. Think of XPATH expressions as being to XML documents what file path names are to file and directory systems. XPATH is quite expressive and is a whole subject in itself. For the purposes of simple web scraping there is no need to get into all of its details and capabilities; plenty of documentation is available to consult if and when a specific question or problem comes up. As a starting point, this W3Schools tutorial or this XML tutorial on Lynda.com will be more than enough.
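To make the file path analogy concrete, here is a minimal sketch using only Python's standard library. The page snippet, the ids and the beer prices are all made up for illustration. Note that `xml.etree.ElementTree` supports only a limited subset of XPATH and needs well-formed XML, which real-world HTML often is not, so treat this as a toy rather than a production scraper.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page fragment (hypothetical data, for illustration only).
PAGE = """
<html>
  <body>
    <div id="header"><h1>Beer prices</h1></div>
    <table id="prices">
      <tr><td class="name">Pale Ale</td><td class="price">4.50</td></tr>
      <tr><td class="name">Stout</td><td class="price">5.25</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# "Anywhere below the root, find the table whose id is 'prices', then its rows."
rows = root.findall(".//table[@id='prices']/tr")

# Within each row, pick the cells by their class attribute.
prices = {
    row.find("td[@class='name']").text: float(row.find("td[@class='price']").text)
    for row in rows
}

# A purely positional path, much like a directory path: second row, first cell.
second_name = root.find("./body/table/tr[2]/td[1]").text

print(prices)
print(second_name)
```

The attribute-based expression keeps working if extra elements are added around the table, whereas the positional path breaks as soon as the layout shifts, which is why paths anchored on an id or class are usually the safer choice.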
There are plenty of effective tools out there for web scraping but they essentially fall into two categories.
On one hand there are services in the form of aesthetically pleasing, user-friendly websites where you can “drag and drop” a web scraping solution together. These are very slick but can be quite limited in effectiveness.
On the other hand there are web scraping packages for programming languages like Python or R – these allow a fine degree of control and enable really sophisticated web scraping but at the same time require programming skills to get up and running.
With this second category, the best place to start is Python’s Beautiful Soup package, a great web scraping library that is lean and simple to understand but at the same time quite powerful. There is also Scrapy (also for Python), which is even more powerful than Beautiful Soup, but the programming can get quite involved.
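As a rough illustration of accessing elements by id, class and name with Beautiful Soup, here is a small sketch. It assumes the `bs4` package is installed, and the HTML snippet, ids, class names and values are invented for the example.

```python
from bs4 import BeautifulSoup

# A made-up page fragment (hypothetical data, for illustration only).
HTML = """
<html><body>
  <h1 id="title">Hotel deals</h1>
  <ul>
    <li class="hotel">Grand Plaza</li>
    <li class="hotel">Sea View</li>
  </ul>
  <form><input name="city" value="Newark"></form>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra parser package is needed.
soup = BeautifulSoup(HTML, "html.parser")

# Access by id: ids should be unique, so find() returns the single match.
title = soup.find(id="title").get_text()

# Access by class: class_ has a trailing underscore because "class" is a Python keyword.
hotels = [li.get_text() for li in soup.find_all(class_="hotel")]

# Access by the name attribute: passed via attrs, because find()'s own
# "name" argument means the tag name rather than the HTML attribute.
city = soup.find("input", attrs={"name": "city"})["value"]

print(title, hotels, city)
```

In a real scraper the HTML string would come from a downloaded page rather than a literal, but the lookups stay the same.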
Also here are a couple of guides to web scraping libraries in R:
Dynamically Generating Web Pages For Scraping
Lastly, if you are building a fully fledged automated web scraping solution, you are probably looking to scrape a number of web pages all in one go, rather than just target one static page. For example you may wish to interact with a web application, enter data into forms, run a query and then scrape the resulting page that is returned by the web application, and then iterate through this process dozens or hundreds of times building up a comprehensive data set.
The challenge now is not only how to scrape an available web page, but also how to automate the generation of a sequence of relevant pages in the first place.
For this purpose, a very useful piece of technology is Selenium WebDriver, a popular browser automation tool. Selenium is normally used for automating testing of web applications, but at its core it is an automation tool that drives the interaction with a website or a web application, so it can just as well be used for generating web pages for scraping. Selenium is built in Java, but has APIs that other programming languages can call. In Python, Selenium APIs are available via the selenium package. (If at this point you are unsure what an API is, please have a look at my other post that explains just that.)
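The "fill in a form, submit, grab the result, repeat" loop might look roughly like the sketch below. The function name `scrape_queries`, the URL, the field name `q` and the queries are all hypothetical. The loop takes the driver as an argument, so it only relies on a handful of WebDriver calls (`get`, `find_element`, `clear`, `send_keys`, `submit`, `page_source`).

```python
def scrape_queries(driver, form_url, field_name, queries):
    """Submit each query through the form at form_url and collect the resulting HTML."""
    pages = {}
    for query in queries:
        driver.get(form_url)  # load a fresh copy of the form for every query
        # "name" is the locator strategy that selenium exposes as By.NAME.
        box = driver.find_element("name", field_name)
        box.clear()
        box.send_keys(query)
        box.submit()  # submit the form that contains this field
        # Caveat: a slow page may need an explicit wait before its source is read.
        pages[query] = driver.page_source
    return pages


def run_with_chrome():
    """Example of real use; needs the selenium package and a Chrome install."""
    from selenium import webdriver  # imported here so the sketch loads without selenium

    driver = webdriver.Chrome()
    try:
        # Hypothetical URL, field name and queries.
        return scrape_queries(driver, "https://example.com/search", "q", ["python", "r"])
    finally:
        driver.quit()  # always close the browser, even if scraping fails
```

Each HTML string in the returned dictionary can then be handed to whatever parsing library you prefer to pull out the actual data.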
Here is a great walk-through guide on web scraping using Python’s Beautiful Soup and selenium packages. And if you need to brush up on Python or make sense of how to use Python’s packages, please read my other Python post here.
One last thing I should mention is that there are a number of legal issues associated with web scraping. In most cases web scraping itself is not illegal, but you may want to investigate whether, in your particular case, collecting the information at all (through web scraping or by any other means) breaches the law.
An interesting and comprehensive discussion of the legalities of web scraping can be found here, and there is also a thread on Quora here. And of course, here is a mandatory disclaimer from me: I am not a lawyer, so please do not take any of this as legal advice. Please do your own due diligence and speak with a proper expensive lawyer if you must.