by Richard Warburton
10. Case Study: Bringing it All Together
Over the last few lessons, we have seen several tools and techniques for preparing and cleaning data. The intention has been to take unclean data from a multitude of differing sources and produce a clean, tidy data set that can be analysed in Excel.
In the wild, data is rarely clean. We need a range of tools and techniques to transform data into a usable format. The knowledge, tools, and best practices for cleaning data are agnostic to the end use, and having a wide range of techniques at our disposal is beneficial regardless of our field of expertise.
In this lesson, we will be bringing together the knowledge we have amassed and applying it to a real-life data problem.
The Problem:
The HS2 (High Speed 2) rail network is a national infrastructure programme currently being delivered in the United Kingdom. Building new rail infrastructure is a significant undertaking and involves numerous individual building sites. Each of these building sites needs to be considerate of the surrounding area and not cause too much disruption. One of the possible causes of disruption is noise. To monitor the problem, sound sensors are installed at each building site and hourly samples of noise levels are recorded.
HS2 Limited provides this data monthly via the open data portal data.gov.uk and makes it available for use under an Open Government Licence.
We will use this data to demonstrate how the tools and techniques we have learned can be used to clean and prepare data for analysis.
There are four questions we have been asked to investigate:
- What days of the week produce the highest noise levels?
- What weeks of the year produce the highest noise levels and how do they trend over time?
- What months produce the highest noise levels and how do they trend over time?
- How do noise levels vary throughout a 24 hour day?
The Data:
The data is split by month and covers April 2023 – August 2023 in zip files. Within each zip file, there is an individual Excel file outlining the noise levels for each area, and within each area's file there is a specific sheet for each building site location. Each sheet contains the various noise level measurements. The data starts on row 3 in each file, and there are two individual data points (Monitoring reference and Location) in rows 1 and 2.
Figure 1: Example of individual data file in Excel (.xlsx) format. We can see that there are data sheets for each location in the area along with a metadata tab in each file.
It is also worth noting that there are additional files for air quality in the zip file. These can be ignored for this analysis.
The data contains three measures for each hour of the 24 hour day.
We will use the equivalent continuous sound level with A-weighted frequency weighting, measured over a specific time period T (LpAeq,T). This is measured in decibels (a logarithmic scale) and is weighted to emphasise the frequencies specific to human hearing (c. 1 kHz – 4 kHz)¹. It is commonly used to measure the impact of sound on humans in an industrial setting².
The Metadata sheet in each file contains useful information for the data. For each location, there is an address and a grid reference as well as measure descriptions.
Footnotes and citations
Initial Investigations
Manual inspection provides us with an initial understanding and intuition of the data. However, it is necessary to start manipulating the data programmatically to get an in-depth feel for it.
Amalgamating Data
A few lines of code indicate that we have 70 noise files and approximately 1,400 sheets of data to amalgamate.
# Import required packages
import os
import pandas as pd
# Define data source path
pth = 'data/'
# List files in pth directory
all_files = os.listdir(pth)
# Build a list of all noise data files (i.e. ignore air quality files)
noise_files = [file for file in all_files if 'noise_data' in file.lower()]
# Initiate counters
file_count = 0
sheet_count = 0
# Initiate dictionary to store list of sheets per file
sheets_in_file = {}
# For each file
for file in noise_files:
    # Link to the Excel file
    xl = pd.ExcelFile(pth + file)
    # Get list of sheet names from the Excel file
    sheet_names = xl.sheet_names
    # Store list of sheet names per file in the dictionary
    sheets_in_file[file] = sheet_names
    # Increment counters
    file_count += 1
    sheet_count += len(sheet_names)
# Output analysis on screen
print('Files identified:', file_count)
print('Sheets identified:', sheet_count)
We could amalgamate these files manually in Excel. However, if we assume it takes 30 seconds to copy the data out of a sheet manually, it would take over 11 hours to amalgamate the data in 1,400 sheets (1,400 × 30 seconds ≈ 11.7 hours). Far from practical!
We can use the techniques outlined in Lesson 8 to import and amalgamate into a single file. The code to do this is below:
# Initiate list to store imported data
imported_data = []
# For each file
for file in noise_files:
    # Process file name to extract the area name (simplified here; the full
    # parsing of the file name is shown in the supporting lesson material)
    area_name = file.lower()
    # Get list of sheet names that are not Metadata or Sheet1
    sheet_names = [s for s in sheets_in_file[file] if s not in ['Metadata', 'Sheet1']]
    # For each data sheet in the file
    for sheet in sheet_names:
        # Read in the data (row 3 onwards, columns A to C)
        df = pd.read_excel(pth + file, sheet_name=sheet, skiprows=3, usecols='A:C')
        # Set new variables in the data to define location and area
        df['Location'] = sheet
        df['Area'] = area_name
        # Add imported data to list of all data
        imported_data.append(df)
We will walk through this in the supporting lesson video.
Missing Data:
A visual inspection of the files indicates that we are expecting some missing data (See Figure 2).
Figure 2. Example of additional data added below the main data table for a site location. Note the hidden rows in the Excel file, highlighting the gap between the main data and the erroneous additional data. This highlights one of the two missing data issues we need to fix.
The expected missing data falls into two categories: firstly, situations where the sensor data is missing and, secondly, situations where additional data has been appended to the bottom of the data file (with no time details).
We will need to decide how best to manage this missing data and what (if any) corrective action is needed. We will also need to determine if any other data is missing following the import.
Data to Parse
We can see that the data we are primarily interested in is contained within the sheets labelled with a monitoring reference. There is a one-to-one correspondence between the monitoring reference and the location of the sensor. The data on these sheets has a header row and starts in row 3. The data we are interested in for the analysis can be found in columns A and C.
There is some potentially useful information on the metadata sheet, and in the first two rows of the data sheets.
We can also see that in a handful of files (see hs2_noise_data_areanorth_north_warwickshire_April 2023.xlsx) there is a hidden 'Sheet1'. This contains no data and can be safely ignored.
Data in File Name
Finally, we can see that the name of the area is embedded in the file name.
Figure 3. Example of data file names. We can clearly see the prefix, the date and file suffix and the area name between them. We can also see that spaces have been used in place of underscores for some file names.
The file name format contains a standard prefix, then the area name, and then a standard suffix (consisting of the month, year and .xlsx file extension). Spaces and underscores are used interchangeably in the file names, and we will need to work around this when parsing data.
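As a sketch of how this parsing might look (the prefix and month-year suffix patterns below are inferred from the example file names, so they should be checked against the full file list):
import re
# Sketch of the area-name parsing. The prefix 'hs2_noise_data_area' and the
# '<Month> <Year>.xlsx' suffix are inferred from the example file names.
def extract_area(file_name):
    # Normalise case and treat spaces and underscores the same
    name = file_name.lower().replace(' ', '_')
    # Remove the standard prefix
    name = name.replace('hs2_noise_data_area', '')
    # Remove the trailing '<month>_<year>.xlsx' suffix
    return re.sub(r'_[a-z]+_\d{4}\.xlsx$', '', name)
# Example: returns 'north_north_warwickshire'
extract_area('hs2_noise_data_areanorth_north_warwickshire_April 2023.xlsx')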
Date Data
The problem we have been posed requires us to extract several specific values from the date time stamp of each measurement. We will need to separate the date and time, as well as extract new variables for Day of the Week, Week of the Year, and Month.
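As an illustration, once the data has been amalgamated these variables can be derived with the pandas datetime accessor along the following lines (here df stands for the amalgamated dataframe and 'Date' for the timestamp column; both names are illustrative):
# Parse the timestamp and derive the date-related variables
df['Date'] = pd.to_datetime(df['Date'])
df['Day'] = df['Date'].dt.day_name()
df['Week'] = df['Date'].dt.isocalendar().week
df['Month'] = df['Date'].dt.month_name()
df['Time'] = df['Date'].dt.time
df['Date'] = df['Date'].dt.date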
The Plan
We will follow the same three step process outlined in Lesson 2 to import, process, and export the data. Most of the cleansing tasks will sit in step 2. However, as there are some tasks that need to be completed for each file (i.e. parsing file names), a few cleansing tasks will be carried out during step 1.
The techniques used will primarily draw upon previous lessons. Where new processes are used, these will be discussed in the supporting video tutorial.
Data Required
In Lesson 1, we introduced Tidy Data and indicated that it can be used to give us a target state for clean data.
Reviewing our requirements, we need to build an amalgamated dataset in a Tidy format which contains the following variables:
- Area
- Location
- Date
- Time
- Day
- Week
- Month
- LpAeq,T
Each row of data in this table will be a specific observation for the location and time of day. Such a data set will enable us to complete the requested analysis.
We may also want to consider adding additional data points at this stage, as this could open up future analysis and head off requests we may yet receive (a crystal ball and the ability to predict the future are important data skills). In this case, such data may include the address and exact location of each monitoring site. This would make the end analysis more human readable; however, it is not critical to the ask. (We will explore this element as part of the exercise set for this lesson.)
The Proposed Process
Given the intuition we have developed about the data, from both manual and coded exploration, we can start to form a plan to import and clean the data. This is as follows:
Step 1: Import Data
- For each file and each sheet within the file:
  - Extract the data on the monitoring reference sheets (limited to row 3 onwards and columns A to C) to a dataframe.
  - Extract the monitoring reference for each sheet and add it as a variable to the respective dataframe.
  - Parse the file name for the Area and add it as a variable to the respective dataframe.
  - Add the resulting dataframe to a list held in local PC memory.
- Concatenate the list of dataframes for the c. 1,400 sheets into a single dataframe (a minimal sketch of this step follows).
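The concatenation step itself can be a single pandas call, reusing the imported_data list built in the earlier snippet (the combined name is illustrative):
# Combine the list of per-sheet dataframes into a single dataframe
combined = pd.concat(imported_data, ignore_index=True)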
Step 2: Clean Data
- Decide how to manage missing data and implement the fix. This follows the process outlined in Lesson 6 and there are two decisions to make:
  - Decide how to manage data where we have sensor information but no date time information.
  - Decide how to manage gaps in the sensor data.
- Parse the date time field to extract additional variables. This follows the process outlined in Lesson 8.
- Identify and resolve any additional issues identified during the cleaning phase.
With such a data cleansing task, we can form a plan for the known issues, but we will almost definitely uncover currently unknown issues. These will need to be identified and resolved as part of the cleaning process.
Step 3: Export Data to Excel
- Export the data (minus the indices) back to Excel for analysis, as sketched below.
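A minimal sketch of this export, assuming the cleaned dataframe is called combined (the output file name is illustrative):
# Export the cleaned data to Excel without the dataframe index
combined.to_excel('HS2_combined_noise_data.xlsx', index=False)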
Implementing the Plan
The code to implement the plan can be found in the supporting lesson material. The length of the code means a line-by-line description here is impractical. We will walk through this in the supporting lesson video.
It is however worth discussing three elements here.
Decisions Regarding Missing Data
Having imported and amalgamated the data, we can inspect it to see where values are missing and to check the data type of each variable.
Figure 4. Dataframe info output for the amalgamated data. We can see there is a difference between the total entries and the non-null counts for Date, Period and LpAeq,T, meaning we have missing values. We can also see that LpAeq,T is stored as an object rather than a float.
Inspecting the information for the dataframe (figure 4), we can see that there are 4,177 observations where we have sensor data but no time stamp. Inspecting the raw data, we believe this is human input error and that these data points have been copied in by accident. We would also struggle to estimate these values or retrospectively recover them from the sensors. We will delete these points.
We can also see that we have 59,539 data points where the sensor data is missing (c. 6% of the expected count of values). There are several reasons for this outlined in the source data files; mostly they are due to sensor malfunction. We are not able to recover this data, so we either need to ignore the issue or estimate the values. As we have a significant amount of data available to us, we will leave these values as missing and take no action. We may see some discontinuity in the data at a location level as a result; however, as the ask is to analyse at a top-line level only, we have enough data to draw meaningful conclusions.
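In code, the only corrective action required is therefore to drop the rows with no timestamp, which might look like this (assuming the timestamp sits in a 'Date' column of the combined dataframe):
# Drop observations that have a sensor reading but no timestamp
combined = combined.dropna(subset=['Date'])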
Attribute Names
During the import and processing of the data, we can see that there are five files where the variables in the data carry a different name to those in most of the other files. We are assuming this is a data input error and are overwriting the variable names with those found in the corresponding cells of the other data files.
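One way to standardise these during import, assuming the variable names seen in the other files ('Date', 'Period' and 'LpAeq,T', as per figure 4) and that df is the freshly read sheet:
# Overwrite inconsistent column names with the standard set used elsewhere
df.columns = ['Date', 'Period', 'LpAeq,T']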
Location Counts
With five months of data, we should expect up to five sets of data (or sheets of data) for each location. It is 'up to' five because a building site may not have been live for the full five months. A quick sanity check of this in the data shows that there is a single Location with 10 sets of data (HF-N1). Inspecting the raw data files (hs2_noise_data_areacentral_West_Northamptonshire_April 2023 and hs2_noise_data_areanorth_stratford_on_avon_April_2023), we can see that this is most likely a human input error involving the manual lookup of a code for 'Hall Farm' and 'Harp Farm'. As we are looking at the top line data in the analysis, we can ignore this issue.
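One way to run this sanity check, reusing the sheets_in_file dictionary built earlier:
from collections import Counter
# Count how many monthly sheets exist for each monitoring reference (location)
location_counts = Counter(
    sheet
    for sheets in sheets_in_file.values()
    for sheet in sheets
    if sheet not in ['Metadata', 'Sheet1']
)
# Flag any location appearing more often than the expected five months
print({loc: n for loc, n in location_counts.items() if n > 5})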
Outcome, Results and Analysis
Running the code (it takes a few minutes to execute fully) results in a data set of 1,027,368 observations.
The analysis requested can be quickly collated using Excel pivot tables to summarise the data. The summary analysis from this is outlined below and contained within the HS2_combined_noise_analysis.xlsx file in the lesson material.
Figure 5. Bar chart showing average noise levels (LpAeq,T) by day of the week. We observe a higher noise level during the work week.
Figure 6. Bar chart showing average noise levels (LpAeq,T) by week of the year. We observe higher noise levels during weeks 18-21 before a pronounced drop off in weeks 22-25.
Figure 7. Bar chart showing average noise levels (LpAeq,T) by month. We observe the same pattern as in figure 6, with higher noise levels in May, before dropping back in June.
Figure 8. Radar graph showing average noise levels by hour of the day. We observe increased noise levels during the working day (08:00 – 17:00) before dropping back outside of these hours.
From figure 6, we can see that we have a pronounced increase in noise levels during weeks 18-21. Correlating this with figure 7, we can see that this corresponds to an increased noise level during May. While figure 6 shows a drop off in weeks 22-24, figure 7 demonstrates that the noise levels for the other months are roughly comparable.
From figure 5, we can see that there is no strong observable evidence to suggest that one day of the work week is louder than the others. However, Tuesday to Thursday do show marginally higher noise readings. We can also see that the weekends are significantly quieter.
The analysis of noise levels during hours of the day, figure 8, shows an increased level during the work day (08:00 – 17:00). There are no pronounced spikes in this.
With observed data we cannot draw direct conclusions about causation; we can, however, form hypotheses. Perhaps there was specific work in May which created the increased noise levels, or an increased volume of work on the site. We may also want to consider statistical tests to see if the day of the work week is a significant factor in the noise levels.
Conclusion, Next Steps & Exercises
The focus of this course has been to look at how Python can help data analysts prepare data for analysis in Excel. This case study shows how real-life situations give rise to data that is messy and requires cleaning prior to analysis. Such situations are far from uncommon, and the knowledge and skills gained from these lessons will stand you in good stead for managing them.
However, the lessons are also intended to help you future-proof your career. While Excel is a powerful tool, it has its limitations. Consider this case study: the cleaned data already runs to 1,027,368 rows, close to the 1,048,576-row limit that Excel can manage in one sheet. Should an additional month of data be released, Excel would no longer be able to hold the raw data in a single sheet and we would need other solutions to analyse it. Again, this is where Python can help. The requested analysis can also be produced in Python, bypassing the need for Excel entirely.
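For instance, the day-of-week summary behind figure 5 could be reproduced with a short pandas snippet, assuming the cleaned combined dataframe and the column names used above:
# Ensure the measure is numeric (it is read in as an object, see figure 4)
combined['LpAeq,T'] = pd.to_numeric(combined['LpAeq,T'], errors='coerce')
# Average noise level by day of the week, mirroring the Excel pivot table
print(combined.groupby('Day')['LpAeq,T'].mean())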
Excel, however, remains the workhorse of office analysis. It is familiar to the majority of employees, and its use will remain a part of data analysis work for the foreseeable future. Learning to clean data with Python before analysing it in Excel provides you with a powerful set of tools to increase your range of output as an analyst.