May 19, 2015
Sourcing Data 101: API-s and Programmable Web.
Today we will continue our discussion of sourcing datasets for research and analysis.
Last time we talked about datasets that are available to the general public and can be freely downloaded and used for research, and we discussed various portals, repositories and tools that can be used to search for these datasets. Today we will talk about extracting data from the web programmatically. Instead of manually downloading a dataset from some site A onto the hard drive and then using some program B to analyze it, we would like program B to connect to site A, read the data automatically and analyze it on the go.
This sort of automated program-to-program communication is normally achieved by means of something called an API. The concept of an API is not really a data science “thing” as such and you may not have heard much about it before, but APIs are a really big deal in the realm of software engineering, systems integration and, more recently, web applications.
The idea of an API is a major concept that drives the interoperability and automation of today’s web, and as of 2013 there were over 10,000 APIs published by companies for open reference and use.
So, given that APIs are so important and given that we can use them to automate a lot of data extraction, let’s invest some time to understand this concept properly.
What is an API?
API stands for Application Programming Interface.
An API for an application A is basically a protocol, an accepted set of rules, whereby any other application B can communicate with A programmatically, access A’s resources, call A’s methods, perform CRUD operations (create, read, update, destroy) on A’s resources – all in a way that is controlled, automated and (hopefully) scalable.
Imagine you are a human user (just imagine) that accesses an application A and plays around with it. You press various buttons, drag and drop things, fill in pretty forms, click links. Whenever you perform any of these actions, program A receives your input and responds by running some sort of set of methods and producing an output. When I say “output” I mean it in a very general sense – it could be a computed number, it could also be a list of search results, it could be the action of displaying a profile page, or even returning a ‘404 not found’ message.
Now imagine you do not want to be sitting there in front of your computer screen dealing with A manually but instead would like to automate this process. You would like some other program B to be doing all of this on your behalf. More precisely, you would like B to get A to do whatever A would have done if you as a human had clicked that button or filled in that form or dragged-and-dropped that figure onto that grid.
Unfortunately, all those things that made your human user experience great – the graphics, the clicking of the links etc – all of these are completely useless as far as program B’s “user experience” goes. Program B does not care about the colors, the pretty forms, the nice clicking sounds – it needs something far less human friendly and a lot more machine friendly in order to be able to obtain from A whatever it needs to obtain. And this is exactly the utility that an API provides.
The sort of things that B can do to A via an API, or, maybe better phrased, the sort of services that B can receive from A via an API, are usually similar to what a human user would be able to obtain from A by playing around with A. The human user would do this by clicking various links, dragging-and-dropping, manually filling in forms, etc – all that user-friendly user experience stuff, whereas program B would do this by calling A’s API, i.e. talking to A through the API protocol in such a way that A would return to B the services that B requires.
Even though APIs can be used in any application-to-application communication, in this post I will focus on web applications as this is the dominant and the fastest growing use case for APIs.
The web is driven by the HTTP protocol and as a result most application makers choose to use the HTTP protocol as the underlying mechanism for their APIs.
I will try to avoid going too deep into the technicalities of the HTTP protocol here. I will just mention that whenever a client interacts with a server on the web – for example when you access a web page through your browser or when you book a ticket through an online booking app – it is the HTTP protocol that coordinates the client’s requests and the server’s responses. A client, for example a web browser, sends an HTTP request package to the server asking the server to do something (fetch a web page, look up something in a database, perform a calculation) and subsequently the server, after completing the requested action, sends an HTTP response with confirmation or with further information back to the client.
Thus, in order to obtain a particular service from program A, program B would need to send to A an HTTP request of a particular format and content and would subsequently expect from A an HTTP response of particular nature.
In essence the HTTP protocol is defined by the standard structure that HTTP request and HTTP response packages must follow in order to carry information in a controlled and interpretable way, as well as some conventions about the actions that need to take place on the client and server sides, based on the content of these HTTP packages.
An HTTP Client Request package has the following format:
- URL: a uniform resource locator identifying the site/page/object that you are attempting to access.
- HTTP Method: one of several available methods, determining what exactly you would like to do with the accessed object. The four most common HTTP methods are POST, GET, PUT and DELETE and you can roughly think of these as corresponding to respective requests by the client to create, read, update or destroy (commonly abbreviated and referred to as CRUD actions) the object identified by the URL. To be pedantic, the HTTP POST, GET, PUT and DELETE methods do not always correspond strictly one-to-one to the CRUD operations, but the loose association will do for now, for the purposes of the bigger picture explanation.
- List of Headers: a list of meta-information entries that specify all sorts of things like the time of sending the request, the size of the request packet, whether authentication is required, what format the body of the message is in, etc.
- Body: the actual information that the client wants to send to the server. This can be plain text or any one of the standard formats like XML or JSON. This message body can also be empty. For example when the client requests a web page from the server, the URL, method and header sections already contain everything that the client needs to communicate to the server in order to request the page, so there is no need to clog up the body section. (By contrast when the server responds to that request, the body section of the HTTP response packet will not be empty and will contain the HTML code for the requested web page).
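To make the four parts of a request a bit more tangible, here is a minimal sketch using Python's standard `urllib.request` module. The URL and the payload are made up for illustration, and nothing is actually sent over the network: the code only builds the request object and inspects its URL, method, headers and body.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and payload, used only for illustration.
url = "http://example.com/profiles"
payload = {"name": "John Smith"}

request = Request(
    url,                                           # URL
    data=json.dumps(payload).encode("utf-8"),      # body, as bytes
    headers={"Content-Type": "application/json"},  # headers
    method="POST",                                 # HTTP method
)

# Nothing has been sent yet; we are only looking at the parts of the request.
print(request.full_url)
print(request.get_method())
print(request.header_items())
print(request.data)
```

Passing this object to `urllib.request.urlopen` would be the step that actually sends it to a server.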
An HTTP Server Response package has the following format:
- Status Line: this contains a code like “200 OK” (means “I have processed your request, everything is OK, here is your requested data, see the message Body”) or “404 Not Found” or “503 Service Unavailable” or some other code from the list of a few dozen standard HTTP response codes.
- Headers: various meta-information, just like in the case of HTTP requests.
- Message Body (optional): this is where the actual information that was requested by the client and returned by the server resides. For example, this could be the HTML code for the web page that the client requested or a result of a computation. This message body could also be empty. For example if the HTTP request had a GET method asking the server for a resource, the server upon receiving the request tried to get the resource but something went wrong and that resource was not available, the response would be a “503 Service Unavailable” (in the status line section). In this case the actual body section of the response doesn’t need to have anything – there isn’t a web page to send back to the client anyway.
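The request and response structure can be seen end to end in the following sketch. It assumes a toy server (my stand-in for application A, not any real service) running on the local machine, sends it a GET request and then picks the response apart into status line, headers and body. The handler class name and the HTML content are made up for the example.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Toy stand-in for application A: answers every GET with a tiny page."""

    def do_GET(self):
        page = b"<html><body>Hello</body></html>"
        self.send_response(200)                         # status line
        self.send_header("Content-Type", "text/html")   # headers
        self.send_header("Content-Length", str(len(page)))
        self.end_headers()
        self.wfile.write(page)                          # message body

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Program B: send a GET request and read the three parts of the response.
conn = HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/")
response = conn.getresponse()
status, reason = response.status, response.reason
content_type = response.getheader("Content-Type")
body = response.read()
conn.close()

server.shutdown()
server.server_close()

print(status, reason)
print(content_type)
print(body.decode())
```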
Web API-s Based On HTTP
The purpose of going over the HTTP protocol and packet structures in detail was to demonstrate how much flexibility there is in the different combinations of method, header and body fields of the HTTP packages. Despite looking primitive at first sight, HTTP actually provides a very versatile vehicle for cross-application communication.
Earlier I said that:
an API for an application A is a protocol, an accepted set of rules, whereby any other application B can communicate with A programmatically, access A’s resources, call A’s methods, perform CRUD operations.
We can now rephrase this in a more specific way and say that:
an API for an application A is a convention that some other application B’s HTTP requests must follow and that A’s HTTP responses will honor in order to make B be able to use A’s services.
So far we have discussed what an API is in general and we have discussed how HTTP, the protocol underlying the web, is a suitable medium for implementing web API-s. However, when it comes to a specific API for a concrete application, it will be up to that application’s developers to agree on the exact protocol that defines that API, for example what exact header, method and body contents B’s HTTP request package must have in order to successfully elicit from A the required service.
Furthermore, once this convention is decided on, it will need to be documented and published somewhere, where the rest of the world can access it when they build other software components that they want to interact with the application via its APIs.
There are a couple of centralized repositories where applications publish their APIs. One major such repository is Programmable Web. Using applications A and B from the example above, this is where I would go if I were the owner of application A in order to publish A’s API so that other developers, building applications B, C, D, etc, can integrate them with my application A.
SOAP vs REST
Over the years there emerged a number of philosophies on how to use all possibilities and combinations allowed by HTTP requests and responses to build the “best” APIs.
Among these numerous design philosophies, the two dominant ones are SOAP and REST. I will not go into all the technicalities. There are plenty of other information sources on the very popular and important subject of SOAP vs REST, for example this 3 hour video tutorial. I will only try to quickly cover off some high level points.
SOAP basically works by (and this is a VERY crude simplification) sending HTTP requests and responses where the body of the message contains, in XML format, information that is necessary to make various method calls on the server side or to process returned values on the client side. SOAP is analogous to (and again this is a crude oversimplification) doing remote method calls and returning values remotely, all via HTTP, where both the method call and the returned value are stored in XML format and as per specific XML convention pre-agreed between the client and the server. Thus, it is the exact XML convention that determines the API.
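As a rough illustration of the “method call wrapped in XML” idea, here is a sketch that builds a SOAP-style envelope and then reads it back the way a server might. The envelope namespace is the SOAP 1.1 one. The application namespace, the `GetProfile` method and the `profile_id` parameter are all hypothetical, and a real SOAP service would define its own conventions for these (usually in a WSDL document).

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "http://example.com/profiles"                 # hypothetical application namespace


def build_call(method, **params):
    """Client side: wrap a method name and its arguments in a SOAP-style envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_call(xml_text):
    """Server side: recover the method name and arguments from the XML."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    call = body[0]  # the first child of Body is the method call
    method = call.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return method, params


message = build_call("GetProfile", profile_id=1)
print(message)
print(parse_call(message))
```

In a real exchange the XML string would travel in the body of an HTTP POST request, and the returned value would come back in a similar envelope in the response body.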
REST (Representational State Transfer) on the other hand relies less on the HTTP request / response body (more precisely the XML message therein) and more on the actual URL specified in the HTTP request, as well as the HTTP method and headers. REST provides a convention whereby each resource on the server (web page, blog post, database entry) is associated with a URL, and each CRUD operation that the client may wish to perform on that resource is identified by the combination of that URL and an HTTP method. Upon receiving an HTTP request, the server will perform the particular CRUD operation on the particular resource, as specified by the method and the URL of the received request packet.
For example if we have a server that contains user profiles, a REST-ful API will look as follows:
HTTP verb   Endpoint      Action
GET         /profiles     List existing profiles
POST        /profiles     Create a new profile
GET         /profiles/1   Get details for profile #1
GET         /profiles/2   Get details for profile #2
PUT         /profiles/1   Update profile #1
DELETE      /profiles/1   Destroy profile #1
Note how all the CRUD operations on all the resources are determined solely through the method and the URL parts of the HTTP request. Of course there is a bit more complexity than that, but you get the general idea.
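Here is a small sketch of how a server might act on nothing but the method and the URL, following the profiles example. It is deliberately simplified: there is no real HTTP involved, the profiles live in an in-memory dictionary with made-up contents, and the class name `ProfileAPI` and the choice of status codes (201 for a created profile, 204 for a deleted one, 405 for an unsupported method) are my own assumptions rather than something every REST-ful API will do.

```python
import re


class ProfileAPI:
    """Toy dispatcher: picks the CRUD action from the HTTP method and URL path."""

    def __init__(self):
        # Made-up starting data.
        self.profiles = {1: {"name": "Alice"}, 2: {"name": "Bob"}}

    def handle(self, method, path, body=None):
        """Return a (status_code, payload) pair for the given request."""
        if path == "/profiles":
            if method == "GET":      # list existing profiles
                return 200, list(self.profiles.values())
            if method == "POST":     # create a new profile
                new_id = max(self.profiles, default=0) + 1
                self.profiles[new_id] = body
                return 201, {"id": new_id}
            return 405, None         # method not allowed on this URL

        match = re.fullmatch(r"/profiles/(\d+)", path)
        if match is None or int(match.group(1)) not in self.profiles:
            return 404, None         # unknown URL or unknown profile
        profile_id = int(match.group(1))

        if method == "GET":          # get details for one profile
            return 200, self.profiles[profile_id]
        if method == "PUT":          # update one profile
            self.profiles[profile_id] = body
            return 200, body
        if method == "DELETE":       # destroy one profile
            del self.profiles[profile_id]
            return 204, None
        return 405, None


api = ProfileAPI()
print(api.handle("GET", "/profiles/1"))
print(api.handle("POST", "/profiles", {"name": "Carol"}))
print(api.handle("DELETE", "/profiles/1"))
print(api.handle("GET", "/profiles/1"))
```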
And here are some concrete examples of further complexity that is likely to accompany an actual real life REST-ful API:
- In case of POST or PUT methods there may be further conventions on how the exact information that the client wants to POST or PUT is described in the HTTP request body, so that the server can meaningfully interpret it and create the corresponding resources on its end.
- While it is probably OK to allow anyone to GET resources, we would likely want to limit who can POST, PUT or DELETE resources, thus we would need to look at adding extra steps to perform access permission checking.
- Upon receiving an HTTP request as per above, the server may reply with a status code other than the usual ‘200 OK’ (for example an error code), and there needs to be a corresponding protocol on the client end for handling such cases.
- REST-ful APIs often provide ways to search through a large volume of data by using a query string embedded in the URL. For example imagine having millions of profile records on the server and obviously not wanting to remember individual URLs for each profile. One way around this is to have our API allow the following HTTP request from the client to trigger a database query to retrieve all profiles with names John Smith:
HTTP verb Endpoint
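A query of this kind could look roughly like the sketch below. The host name and the parameter name `name` are assumptions made purely for illustration, since each API defines its own query parameters. The first half shows the client encoding the search criteria into the URL, the second half shows the server recovering them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Client side: encode the search criteria into the URL's query string.
base_url = "http://example.com/profiles"      # hypothetical endpoint
query = urlencode({"name": "John Smith"})     # hypothetical parameter name
url = f"{base_url}?{query}"
print("GET", url)

# Server side: recover the criteria from the URL of the incoming request.
parts = urlsplit(url)
criteria = parse_qs(parts.query)
print(parts.path, criteria)
```

The server would then use the recovered criteria to run the database query and return the matching profiles in the response body.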
Real-time Communication Via APIs
One peculiar feature of the HTTP protocol is that clients can send HTTP requests any time but servers can only respond with HTTP responses to requests that have been made and cannot just arbitrarily reach out to clients unless there is a specific HTTP request that they can respond to. In addition, HTTP requests have timeout limits.
These two features of the HTTP protocol can become an annoying issue in web API-s. Imagine for example that a client has requested the server to perform a very long computation and to come back with the result when ready, or that the client has asked the server to notify it when an event occurs, without any particular time limit on when the event must occur (for example an email client asking an email server to notify it when an email arrives on the server). How can this be done if the server can only come back to the client via an HTTP response that is linked to a concrete HTTP request that may have timed out a long time ago?
One approach is polling – the client keeps sending requests to the server at regular intervals. Each request either times out or comes back with a “not ready yet” answer, and the client keeps asking again until the task is complete and the server finally replies with the result. This is kind of like the client sitting in the back of the car and asking “Are we there yet?” every five seconds – you will get the response “Yes we are here now” in due course. You can make this polling as frequent as you like, thus achieving a high degree of real-time responsiveness, but the more frequent the polling is, the more bandwidth it wastes – all the requests apart from the very last one (i.e. the one that actually returns the result) are essentially a waste.
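The polling loop can be sketched as follows. To keep the example self-contained there is no real HTTP here: the hypothetical `SlowJob` class stands in for the server-side task, and each call to `status()` plays the role of one request-response round trip. The result value, duration and interval are arbitrary.

```python
import time


class SlowJob:
    """Stand-in for a long-running task on the server (no real HTTP involved)."""

    def __init__(self, duration):
        self.finish_at = time.monotonic() + duration

    def status(self):
        # What the server would answer to each "are we there yet?" request.
        if time.monotonic() >= self.finish_at:
            return {"done": True, "result": 42}
        return {"done": False}


def poll(job, interval, max_attempts=100):
    """Ask repeatedly until the job is done; return the result and the number of requests made."""
    for attempt in range(1, max_attempts + 1):
        reply = job.status()
        if reply["done"]:
            return reply["result"], attempt
        time.sleep(interval)  # wait before asking again
    raise TimeoutError("job did not finish in time")


result, attempts = poll(SlowJob(duration=0.2), interval=0.05)
print(f"got {result} after {attempts} requests")
```

Every request except the last one came back empty-handed, which is exactly the waste described above. Shrinking `interval` makes the client notice the result sooner at the price of more wasted requests.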
Another option is long polling – increasing the timeout limits of HTTP requests, so that all the requests that arrive at the server sit there for as long as possible until the result is ready and then the server responds to the HTTP request. This means that there is only one round of request-response and so the network bandwidth is not wasted. However we have a different problem now – the server must keep track of the HTTP request for a very long time. This is not really scalable – if you plan to have millions of clients bombard your server with these kinds of requests at the same time then your server will run out of memory quickly.
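The “request sits on the server until the result is ready” behaviour can be imitated with a `threading.Event`. This is only a sketch of the waiting logic, not a real HTTP server: the hypothetical `wait_for_result` method plays the role of one long-lived request, and its `timeout` argument plays the role of the HTTP timeout limit.

```python
import threading


class LongPollServer:
    """Holds each 'request' open until a result exists or the timeout expires."""

    def __init__(self):
        self._ready = threading.Event()
        self._result = None

    def finish(self, result):
        # Called by the server-side task once the work is done.
        self._result = result
        self._ready.set()

    def wait_for_result(self, timeout):
        # One long-lived request: block until the result is ready or we time out.
        if self._ready.wait(timeout):
            return self._result
        return None  # timed out; the client would have to ask again


server = LongPollServer()
# Simulate a task that completes 0.2 seconds from now.
threading.Timer(0.2, server.finish, args=["done"]).start()

first = server.wait_for_result(timeout=0.05)  # timeout too short: gives up early
second = server.wait_for_result(timeout=5)    # generous timeout: held until ready
print(first, second)
```

Note that every waiting request occupies a thread (or at least some bookkeeping) on the server for its whole lifetime, which is the scalability concern mentioned above.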
Yet another approach is to use webhooks – the client sends a request to the server, along with a callback URL where it (the client) can receive events. The server sets off performing whatever task it needs to do, taking however long it needs to take, and the original HTTP request happily expires whenever it is due to expire. When the server is done with the task, it sends an HTTP request to the callback URL, notifying the client that the task is done and providing whatever information is needed in the HTTP request body. This is perfectly “legal” from the HTTP protocol point of view – the client sent a request to the server, the server received the request and kicked off a task, the original request has expired, the task on the server has completed and the server, temporarily putting on the hat of a client, sends an HTTP request to the client (who has in turn temporarily put on the hat of a server) notifying it of this.
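Here is a sketch of the webhook idea, run entirely on the local machine. The client starts a small listener on a callback URL, and the “server” (here just a function in another thread) finishes its task and POSTs the result to that URL. The path `/task-finished`, the JSON field names and the stand-in task are all made up for the example, and the initial request in which the client hands over its callback URL is left out for brevity.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []                  # events delivered to the client's callback URL
delivered = threading.Event()  # set once a notification has arrived


class CallbackHandler(BaseHTTPRequestHandler):
    """The client, wearing the hat of a server, listening on its callback URL."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(204)  # acknowledge, nothing to send back
        self.end_headers()
        delivered.set()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def run_task_then_notify(host, port, path):
    """The server, wearing the hat of a client: do the work, then POST the result."""
    result = sum(range(1000))  # stand-in for the long-running task
    payload = json.dumps({"status": "done", "result": result})
    conn = HTTPConnection(host, port, timeout=5)
    conn.request("POST", path, body=payload,
                 headers={"Content-Type": "application/json"})
    conn.getresponse().read()
    conn.close()


# Port 0 asks the operating system for any free port.
listener = HTTPServer(("127.0.0.1", 0), CallbackHandler)
threading.Thread(target=listener.serve_forever, daemon=True).start()

worker = threading.Thread(
    target=run_task_then_notify,
    args=("127.0.0.1", listener.server_port, "/task-finished"),
)
worker.start()

delivered.wait(timeout=5)  # the client is free to do other things meanwhile
worker.join()
listener.shutdown()
listener.server_close()
print(received)
```

The price of this approach is that the client must be reachable at a URL of its own, which is easy for another web application but often not possible for a program running on a laptop behind a firewall.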
These are just some of the methods used in enabling two-way real-time communication between clients and servers. There are other methods and this is an area of very active research as the web becomes an even more interactive and real-time place. For another example see my previous post on WebRTC.