by Richard Warburton
9. Obtaining Data Part 2: Working with Web Data
Download the resources for this lesson here.
Extracting Files from the Web
In the lessons so far, we have seen numerous examples of files being loaded into Python and Pandas. These files have been either Excel or CSV files and have been stored locally on the PC we are working from.
Both pd.read_excel() and pd.read_csv() are quite versatile. We can pass a file name, and the function will look for the file in the working directory; we can pass a full path, and it will load the file from the specified location; or we can provide it with a URL.
Direct File Reads
Passing a URL to the function call may seem a little counterintuitive, but when a file is stored and made available for download on the web, we can pass its address to the respective read function and the file will be downloaded and loaded directly.
The following code will download the full fruit and veg price data used in Lesson 6 directly from the web.
# Import required packages
import pandas as pd

# Define URL and file name to import
file_url = 'https://assets.publishing.service.gov.uk/media/65201892244f8e000d8e7194/'
file_name = 'fruitvegprices-10oct23.csv'

# Import fruit and veg prices from the URL
fruit_veg_prices = pd.read_csv(file_url + file_name)
This process also works for files stored on intranets and SharePoint sites as well as other repositories such as GitHub.
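The same pattern works for Excel files. As a minimal sketch (the URL below is a placeholder, not a real file, and reading .xlsx files requires an engine such as openpyxl to be installed):
import pandas as pd

# Placeholder URL: pd.read_excel() also accepts a direct link to a workbook
report = pd.read_excel('https://example.com/data/report.xlsx')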
The read_html() Function
Occasionally we may also want to extract data from tables presented on websites. The markup used to render a table on a website means that, behind the scenes, the data is quite well structured; however, it is not in a format we can load directly into a dataframe. Fortunately, Pandas has an inbuilt function, pd.read_html(), that will read a web page and return the tables it finds as dataframes. The function returns a list, with one dataframe per table found.
We have already seen an example that utilised this process. In Lesson 7 we investigated the case of the England cricket player Alan Jones. The data for this was downloaded from Wikipedia and processed to produce a single data table for analysis. We will now explore the code to download the data.
Firstly, let’s use pd.read_html() to extract the tables it finds on the web page.
# Import required packages
import pandas as pd
# Define URL to parse for tables
wiki_url = 'https://en.wikipedia.org/wiki/List_of_England_Test_cricketers'
# Extract tables found in to list
wiki_data = pd.read_html(wiki_url)
We can use the following code to inspect the imported data in more detail.
# Determine how many tables were found
print(len(wiki_data), 'tables found on page:')

# Output the type and shape of each table found
for i in range(len(wiki_data)):
    print('Table at index', i, 'is of type:',
          type(wiki_data[i]), 'and has shape', wiki_data[i].shape)
This results in the output in figure 1.
Figure 1. Analysis of each of the 16 tables found on the Wikipedia page, showing the index, data type and shape of each respectively.
We can see that this has imported a list of dataframes. The tables at indices 1 to 11 all have the same number of columns, and inspecting them (see the supporting notebook) shows that they correspond to the data we are interested in.
Concatenating these and using several techniques from our earlier lessons, we can clean the data and coerce the values to the correct types. The code to implement this runs to several lines and has been omitted from the lesson text; it can be found in the supporting notebook and the accompanying lesson video. A rough sketch of the concatenation step is shown below.
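This sketch assumes (per the inspection above) that indices 1 to 11 hold the player tables; the variable name england_caps is illustrative, and the full cleaning code is in the notebook:
# Combine the tables that share the player-data structure (indices 1 to 11)
player_tables = wiki_data[1:12]
england_caps = pd.concat(player_tables, ignore_index=True)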
The resulting dataframe now contains clean data relating to all of England’s capped cricket players.
Figure 2. Head of the England cricket caps data frame displayed in the notebook, post import, concatenation, and cleaning.
Using APIs to Extract Data
What Is an API?
An Application Programming Interface (API) is a mechanism that allows two software components to speak to each other. They are used extensively in the cloud computing world and from a data perspective, provide a user (or a client) with a way to extract specific data from an online system (or server).
The client defines a request, passes this to the server, which then returns the requested data.
APIs have many uses and enable clients and servers to communicate effectively. For this lesson, we will only consider the case where a client is extracting data from a server.
Basic Terminology
There are some key definitions that we will need to understand to implement a basic API call.
REST API
If we are using an API to access data, it is likely we are using a REST API. REST stands for REpresentational State Transfer and defines several methods that a client can use to interact with a server. These methods include GET, which allows a client to retrieve data; DELETE, which allows a client to delete data; and PUT, which allows data to be added to or updated on the server. Many other methods can also be defined.
The client defines a request, usually in the form of a text string, and sends this to the server. The server then sends back a response. One of the main features of a REST API is statelessness: the server does not retain any client session state between requests, so each request must contain all the information needed to action it. The server provides a response to the client, and the client can then utilise this as they wish.
We will focus on GET commands in this lesson as we are interested in extracting data for analysis.
API Documentation
Every API is slightly different, and the details of how a client should construct and send requests are contained in its documentation. A well-documented API should describe the request syntax and provide worked examples of its use.
Authentication Tokens & API Keys
Very few APIs are open access. To control usage and to authenticate both the system and the user, access tokens and API keys are issued. These are integrated into the request and ensure that the server only returns data to an authorised client. Obtaining tokens and keys usually requires registration and, quite often, payment.
Details of how to obtain and integrate these into a request will be contained within the API documentation.
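As an illustrative sketch only (the key, header scheme, and URL below are hypothetical; a real API's documentation specifies the exact mechanism), a key is often supplied in a request header:
import requests

# Hypothetical key and endpoint, for illustration only
api_key = 'your-api-key-here'
headers = {'Authorization': 'Bearer ' + api_key}

response = requests.get('https://api.example.com/v1/data', headers=headers)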
Endpoint
An endpoint is the location to which a request for a specific service is sent for actioning. This is usually in the form of a Uniform Resource Locator (URL).
Parameters
The parameters of an API call allow the user to set criteria for the request. Conceptually, this is like filtering data to obtain the subset of interest.
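As a quick illustration (the endpoint below is hypothetical), parameters are encoded into the query string of the request URL, and the requests package builds that string for us:
import requests

# Hypothetical endpoint; the parameter is appended as ?parameter=temperature
response = requests.get('https://api.example.com/data',
                        params={'parameter': 'temperature'})
print(response.url)  # shows the full URL including the encoded query string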
JavaScript Object Notation (JSON)
JavaScript Object Notation (JSON) is an open data file format that uses human-readable text to transfer data consisting of attribute-value pairs. These are very similar to dictionaries in Python. When a response is received from an API call, it is quite common for the data to arrive in JSON format.
An example of a JSON document can be seen in figure 3. Fortunately, Python and Pandas have a number of built-in tools for processing such data effectively. (We will explore how the pd.DataFrame() function works with JSON data below. The pd.read_json() function and the json package offer further functionality as well.)
Figure 3. Example of a JSON in plain text format. Note the similarities with a Python dictionary.
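To make the dictionary analogy concrete, here is a small, made-up JSON-style payload converted straight into a dataframe (all values are illustrative):
import pandas as pd

# Made-up records in the attribute-value style of a JSON payload
records = [
    {'station': '3680', 'parameter': 'temperature', 'value': 11.2},
    {'station': '3680', 'parameter': 'flow', 'value': 0.85},
]

df = pd.DataFrame(records)
print(df)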
Response Codes
In addition to any requested data, an API will provide other information in its response. One of the key data points is the response code, which indicates how the API call has been actioned and whether there are any factors to be aware of. (You may already be familiar with Error 404, page not found.)
The codes are three-digit numbers and fall into five categories.
| Code | Code Category | Code Description |
|---|---|---|
| 1XX | Informational | Communicates information relating to the transfer of data. |
| 2XX | Successful | Indicates that the client’s request was accepted successfully. |
| 3XX | Redirection | Indicates that the client must take some additional action to complete their request. |
| 4XX | Client Error | Indicates an error that has occurred in the client’s request. |
| 5XX | Server Error | Indicates an error that has occurred on the server. |
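In the requests package, the code is exposed as response.status_code, and 4XX/5XX responses can be turned into exceptions. A brief sketch (the URL is illustrative):
import requests

response = requests.get('https://example.com/')  # illustrative URL
print(response.status_code)   # the three-digit code, e.g. 200
print(response.ok)            # True for codes below 400

# raise_for_status() raises an HTTPError for 4XX and 5XX responses
try:
    response.raise_for_status()
except requests.exceptions.HTTPError as err:
    print('Request failed:', err)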
A Simple API Call Example
To investigate how we can use APIs to obtain data, we will use a public API provided by the Environment Agency under an Open Government Licence. As this is public, we will not need to worry about implementing authorisation tokens and keys. The documentation for this API can be found here.
We will make a call to obtain the ‘temperature’ data for rivers. The following code does this:
# Import required packages
import requests

# Define the endpoint and parameters for the API call
end_point_URL_root = 'http://environment.data.gov.uk/flood-monitoring/id/measures/'
parameters = '?parameter=temperature'

# Get response from server
response = requests.get(end_point_URL_root + parameters)

# Output response code
print('The response code for this API call is:', response.status_code)
We can see by inspecting the response code that this has been successful as a 200 code has been returned. (See notebook).
By decoding the response body with response.json(), we can also see that it contains three keys, one of which ('items') holds the data we are after. We can convert this to a dataframe with the following code:
# Decode the JSON payload and build a dataframe from its 'items' key
response_json = response.json()
response_dataframe = pd.DataFrame(response_json['items'])
This yields the dataframe in figure 4.
Figure 4. Example of dataframe head returned from the API call. Note the full column range is not displayed in this image.
We can also expand the call to use more parameters by passing a dictionary to the params argument of requests.get(). This removes the need to append the query to the request URL. The following code will obtain the temperature and flow data for site 3680.
# Define dictionary of parameters to query
params_query = {'parameter': ['temperature', 'flow'], 'stationReference': [3680]}

# Get response from server using the params argument
response = requests.get(end_point_URL_root, params=params_query)
We can see from the response code and by inspecting the data returned that this has been successful.
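A quick way to perform that inspection, mirroring the earlier steps, might be as follows (the variable name site_data is illustrative):
# Decode the JSON payload and build a dataframe, as before
site_data = pd.DataFrame(response.json()['items'])

# Output the response code and the shape of the returned data
print(response.status_code, site_data.shape)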
The supporting notebook and video contain the output to confirm this. They also explore a more detailed processing of the data extracted from this API and outline the additional cleaning steps required to format the data in a tidy way.
Conclusion, Next Steps & Exercises
APIs provide a powerful method for obtaining data for use in analysis and data work. If your work integrates with a cloud service or third-party data provider, then it is highly likely that you will need to work with an API.
The example above shows that while we can link into APIs easily, there is often still cleansing work to complete before using the data for its intended purpose. While Excel has the capability to extract data from simple APIs, we have seen over the last few lessons that cleansing work is best completed in Python. When dealing with APIs, Python provides the tools to conduct both processes in a few lines of code. If we want the data in Excel for analysis, we should use Python to obtain it, clean it, and then push it to Excel.
The notebook supporting this lesson contains a deeper dive into the results outlined above. It also contains code demonstrating failures and issues in API calls. The supporting exercise set for this lesson explores building API calls into other publicly available data sources.