May 1, 2015
Sourcing Data 101: Publicly Available Datasets
Fortunately, we live in the age of Big Data, where information is abundant in all its volume, velocity, volatility and veracity.
The last 50 years have seen an unprecedented rise of video cameras, sound recorders, counters, logs, sensors and trackers, all accompanied by an ever-decreasing cost of the storage needed to keep all of this data. Suddenly both capturing and storing data are easy, and the challenge has shifted to actually making sense of the data.
Thus, as long as you know how to make sense of data and turn it into meaningful information, there is no shortage of data repositories out there to apply your skills to. A large number of these repositories are free and open to the public, and they should be your first port of call before investing in customized sampling and research or purchasing premium data sets from data service providers.
Today I would like to talk about large data sets that are available on the web, usually for free, for public use.
Where To Start
The first port of call is to check out the following resources:
- http://datahub.io/ – an ambitious project to aggregate all datasets into one searchable repository,
- https://www.quandl.com/ – a repository focusing primarily on financial, economic and social data, containing over 20 million datasets from over 500 databases,
- http://www.re3data.org/ – a repository focusing predominantly on scientific data and scientific research.
Also, Google has an entire search engine dedicated to finding datasets for data scientists:
Leading Big Data players like Amazon, Google and Microsoft have their own repositories that contain hundreds of interesting datasets:
- Amazon’s data set repository: https://aws.amazon.com/datasets
- Google’s data set repository: http://www.google.com/publicdata/directory
- Microsoft’s data set repository: http://datamarket.azure.com/browse/data
In particular, check out “the largest ever dataset” (as of June 2015) that was recently added to the Azure Data Market:
If You Are Looking For Something Specific
More specifically, if you are looking for data about a concrete topic, your best bet is to try authoritative websites for the specific industry or topic that you are analyzing.
For example, if you are looking for econometric data, a good place to start is the websites of individual countries' governments (for example http://ukdataservice.ac.uk/), of international bodies such as the World Bank or the IMF, or of financial exchanges such as NYSE or CME. These almost always provide various time series in freely downloadable CSV files.
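Once you have downloaded such a CSV file, getting it into a usable shape takes only a few lines. The following is a minimal sketch in Python, assuming a file with two columns named date and value. Real files from these sources will use their own column names and layouts, and the numbers below are placeholders invented for illustration, not real statistics.

```python
import csv
import io
from datetime import date

# Placeholder CSV text standing in for a downloaded file. The column names
# ("date", "value") and the numbers are invented for illustration only.
SAMPLE_CSV = """date,value
2015-01-01,100.0
2015-02-01,101.5
2015-03-01,99.8
"""

def load_series(csv_text):
    """Parse 'date,value' rows into a list of (date, float) pairs sorted by date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(date.fromisoformat(r["date"]), float(r["value"])) for r in reader]
    return sorted(rows)

def pct_changes(series):
    """Percentage change between each pair of consecutive observations."""
    return [
        (later[0], 100.0 * (later[1] - earlier[1]) / earlier[1])
        for earlier, later in zip(series, series[1:])
    ]

series = load_series(SAMPLE_CSV)
for day, change in pct_changes(series):
    print(day.isoformat(), f"{change:+.2f}%")
```

To use it on a real download, read the file contents into a string (or swap io.StringIO for an open file handle) and adjust the column names to match the source.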
If You Are Just Window Shopping
If you do not yet know what dataset you are looking for, or if you are window shopping and simply want to know what interesting data sets are out there that may supplement the data you already have, then here are a few lists that have been floating around the Web and that reference a lot of publicly available datasets:
- List of publicly available datasets from KDNuggets: http://www.kdnuggets.com/2015/04/awesome-public-datasets-github.html
- A list of datasets that I have found on GitHub: https://github.com/caesar0301/awesome-public-datasets
- List of publicly available datasets from Revolution Analytics: http://mran.revolutionanalytics.com/documents/data
- An interesting list from rs.io: http://rs.io/100-interesting-data-sets-for-statistics/
- More datasets (from the Science Education Resource Center at Carleton College): http://serc.carleton.edu/NICHE/resources_data_analysis.html
- UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.html
And, if you really want to lose hours sifting through random links to curious (and often really random) datasets then knock yourself out on this Reddit thread.
Here you will come across a dataset of every email that Jeb Bush received while governor, some interesting Bitcoin data aggregations, a US civil unrest dataset, or a guide on how to extract all of your Facebook data into CSV.
Recommended Data Sets
Some of the more interesting datasets that I have personally come across and would recommend are as follows:
- Google N-grams. A collection of all fixed-size n-tuples of words extracted from the Google Books corpus. This dataset was produced by essentially passing a "sliding window n words wide" over the text of the books, which is a great way of storing text information in an analytics-friendly format. A great resource for text mining and bibliometrics.
- Common Crawl Corpus. Four times a year this nonprofit organization crawls the web and archives snapshots of the web as it is at that point. The archived dataset is freely available to the public and consists of 2 PB of data from 1.95 billion webpages as of March 2015.
- The 1000 Genomes Project. An international research undertaking to sequence the genomes of at least 1000 anonymous participants from different ethnic groups and to build a detailed catalogue of human genetic variation. In 2012 the sequencing of 1092 genomes was completed and the data was made available for the benefit of the scientific community.
- Freebase Metadata. A metadata dump of Freebase. Freebase is a freely available structured database of the world’s information, covering millions of topics in hundreds of categories, drawn from large open data sets like Wikipedia, MusicBrainz, and the SEC archives.
- Wikipedia XML. This data set contains a complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML as provided by the Wikimedia Foundation.
- Stanford Large Network Dataset Collection. A comprehensive collection of social network analysis graphs and data sets from Jure Leskovec at Stanford.
- The GDELT Project. The GDELT Project monitors the world's broadcast, print and web news and builds a real-time network diagram and a database of the people, locations, organizations, themes, sources, emotions and events that drive the world, with the goal of providing a free, open platform for computing on the entire world.
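To make the "sliding window n words wide" idea behind the Google N-grams entry concrete, here is a minimal sketch in Python. It is only an illustration of the general technique on a toy sentence, not the pipeline Google actually used. The function name ngrams and the simple whitespace tokenization are my own assumptions.

```python
from collections import Counter

def ngrams(text, n):
    """Return every n-gram (as a tuple of words) found by sliding a window n words wide."""
    # Naive tokenization: lowercase and split on whitespace (an assumption;
    # a real corpus would need proper handling of punctuation).
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = "the quick brown fox jumps over the quick brown dog"

# Count how often each 2-word window (bigram) occurs in the sample.
bigram_counts = Counter(ngrams(sample, 2))
for gram, count in bigram_counts.items():
    print(" ".join(gram), count)
```

Counting the windows this way turns free text into a table of phrases and frequencies, which is the analytics-friendly shape that makes the dataset useful for text mining.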