June 21, 2015
Data Wrangling: From Data Sources to Datasets
Data wrangling is the process of converting a raw dataset into a more “cooked” data set that will be more amenable to automated data analysis.
In the previous posts we discussed how to get hold of data and covered publicly available datasets, as well as reading data from web services via APIs, web scraping and other methods. But even after you have obtained a dataset, the chances are it is still not in a format where you can just throw it into an R or Python dataframe or a SQL table and start analyzing away:
- If you have a few disparate data sources or data tables you may want to join them together into one “mega” table,
- If you are dealing with time series you may want to sort by dates or change date formats,
- If you are dealing with text analysis, you may want to get rid of leading and trailing spaces or to break up your text into n-grams,
- If you are building mathematical models you may want to get rid of records that have Nulls or NAs, so that they do not mess up calculations or totals or averages,
- You may even create new calculated fields – for example logs and powers if you think you may have to explore non-linear relationships.
The list goes on and on – there is a myriad of small cosmetic fine tunings to do in order to convert the raw data into a more analysis-friendly format. The full extent will depend on what analysis you plan to do and how bad your initial raw data set is, but it will always be substantial.
More Important Than First Meets the Eye
Thus, data wrangling, while often overlooked and understated, is actually a crucial piece of the data analysis workflow. Not only is it crucial, but it is also very time consuming and almost always tedious and hated. In the industry literature it is often quoted that 80% of data scientists’ time is spent on data wrangling.
So why am I saying all of this?
Well firstly it is so that we as data scientists are aware of this “hidden cost” of any data science exercise. We often get excited at the start when we get a light bulb idea about some new research hypothesis to test or when we get out hands on a new data set. Also the end point can be exciting – you throw a bunch of regression fitting or machine learning algorithms at the data and see some nice looking models come out. But let’s just remember that between the start point and the end point there is the long, drudging and frustrating walk, in fact more often a crawl, of all the necessary data wrangling that “cooks” your raw data set into a format consumable by the analytics tools.
Secondly let’s also be aware that even though data wrangling is a guaranteed (even though somehow always unexpected) cost of data analysis, things do not have to be all that bad. In the recent years a large industry has grown up around alleviating the pain of data wrangling and freeing up the analysts’ time from it as much as possible. There is a wide range of solutions available, ranging from your old school ETL tools, to more modern and sleek data integration platforms and even all the way to platforms for outsourcing data wrangling to cost effective manual workers across the world. Let’s talk about these one by one.
Data Wrangling Solutions Overview
Classic ETL tools
There is always the classic ETL, reborn as:
or Pentaho’s open source ETL tool, Kettle:
Machine Work Outsourced Back to Humans
Have a look at CrowdFlower and Amazon’s Mechanical Turk. Here you essentially can farm out your data wrangling tasks to human workers across the world who would then perform these on their computers in their spare time, for a small fee.
These services are cost effective. At the same time there are quality control procedures in place to ensure that data wrangling is performed reliably. For example CrowdFlower have a system whereby a random “test task” is occasionally injected into each worker’s workload, the answer to the test task is known in advance and the system is able to test the worker’s accuracy against this known answer (of course the worker doesn’t actually know that it’s a test task). In the long run workers get scored on their reliability and those who perform poorly, won’t be kept working for the service for long. Each task is farmed out to several workers and the final answer for the task, i.e. the answer presented to the requester, is the average of several workers’ answers, weighted by their reliability scores.
Using a service like CrowdFlower or Mechanical Turk is effective for tasks that are not handled well by computers – tasks that require intuition or common sense judgment. For example imagine sifting through thousands of images of faces to classify whether each of the faces is sad or happy. In absence of good facial recognition software, something like Mechanical Turk is perfect – humans across the world classify images for a fraction of a cent per image. Once a significant portion of the dataset has been processed, with each face classified, this portion of the dataset can be used as a training set for a machine learning algorithm that would then learn how to classify photographs and be able to process the rest of the dataset automatically, without further need of human supervision.
End-to-end Data Integration Platforms
There is also a multitude of companies that provide end-to-end “data integration” platforms.
These solutions attempt to join all disparate data sources together into one seamless flow of uniformly formatted data. Most of these platforms are priced at a premium but also have value-add features like audit trail and validations that are especially valuable to users in corporate environments. Some of the well known platforms in this space are:
- Trifacta: http://www.trifacta.com/
- Paxata: http://www.paxata.com/
- Astera: http://www.astera.com/
- Informatica: https://www.informatica.com/
And here is an interesting company that starts at data wrangling and goes all the way into the realm of Internet of Things by integrating data, applications and even household items.
Data Wrangling Features of Analytics Tools
Lastly, if you are using a statistical or analytics tool for your data analysis, the chances are it already has a rich library of data wrangling functions and features.
For example if you are working with in Microsoft environment, please check out the previously mentioned SQL Server Integration Services and Azure Data Factory.
If you are working with R, here is an excellent list of R packages for data importing and data wrangling. And here is a more general “cheat sheet” related to data wrangling in R.
And here is a similar sort of cheat sheet in how to do data wrangling in Python, using NumPy, Pandas and Matplotlib.
As I write this, automated data wrangling is a fast growing field. It spans vast territories, from human workers toiling on manual micro-tasks in offshore locations, through to corporate grade ETL and data integration solutions in banks, and all the way through to futuristic platforms that connect toasters and smart watches into the Internet of Things. The more data science and machine learning are embraced in mainstream industry, the more ubiquitous data wrangling becomes.