by Richard Warburton
6. Managing Missing Values Part 1
Download the resources for this lesson here.
The Problem With Missing Data
It is estimated that when you work with data, 80% of your time is spent preparing the data for analysis. We have been exploring how to make this process simpler when analysing data in Excel by using various cleaning techniques implemented in Python.
One area we have not explored yet is missing data. Missing data could be a course in its own right! It is a vast subject with many different techniques to use in differing situations. Over the next two lessons, we will take our first steps in exploring this vast subject.
Missing data can occur in many ways. We may not capture a variable that is critical for the analysis. We may not have observations from an adequate proportion of the population we are investigating. Or we may be missing specific values for variables and observations, something we saw in the last lesson, where values were lost during the import process.
Missing data and the causes of missing data can be particularly problematic, and we need to have tools, techniques, and strategies for dealing with the issue when found.
For this introductory session, we will consider the situation where we have missing values only. We will assume that we have all the variables and observations required to complete the analysis.
Identifying Missing Data
When we work in Excel only, there are many ways to spot missing data. The first of these is quite simple, we may visibly spot gaps in the data. This however is limited as it will not flag values which are subtly inputted or formatted incorrectly. It is impractical when we have a large dataset. We may also see that formulas or analyses fail in the event of missing data, but again this is unreliable as a lot of formulas work around this issue.
In short, we need a more practical way to spot missing values across large data sets. Again, Python provides us with a practical solution here.
Pandas uses special values (either NaN or NA) to indicate where a value is missing. When we import data from Excel, if a cell is null, it will be assigned to one of these values. If an error occurs when we try to coerce data into the correct data type, then we can set Python to return such a value.
This means, that if we import data into Python and ensure that each series has the correct data type, then we can spot missing data by analysing the NaN or NA values.
The .info()
method provides us with an easy way to do this in Python.
An Example: Fruit and Veg Prices over Time
For the rest of this lesson, we will use the Fruit and Vegetable Wholesale Prices data file supplied by DEFRA under the open government licence. This has been modified to ensure we have missing values of interest. The data for analysis along with the code used here can be found in the supporting material for the lesson.
This data file provides us with weekly wholesale prices and typical weights for five fruit and veg items. We have a data point for each week in 2023 from the 13th January to the 22nd September (37 weeks of data for each fruit)
We will assume a simple analysis request. We have been asked to trend prices over time for the fruits and provide a mean weight for each based on the 2023 data.
Inspecting The Data & Finding Missing Values
Using the code below, we can import the data and inspect the resulting dataframe using the .info()
method.
# Import required packages
import Pandas as pd
# Import fruit and veg price data and assign to prices
prices = pd.read_Excel('2023 Fruit and Veg Prices.xlsx')
# Display prices information
prices.info()
The output of this code is displayed in figure 1.
Figure 1. Notebook output of prices information extracted with the .info() method. For each column in the dataframe, this provides a non-null entry count and the data type.
The difference between the 185 entries and the Non-Null count shows us where we have missing values. shows us that we are missing 7 sample weight values and 38 price points.
What To Do About Missing Data?
The analysis in figure 1 shows us that we have missing data. But what do we do about it?
The first step is always to understand what is missing and if possible, understand why. This information will then help us make an informed decision about how to progress.
Using the techniques presented in the last lesson we can quickly filter the data to show the observations where values are missing.
First, let’s consider the missing sample weight values. The .isna()
function is a useful tool in Pandas. It analyses a value and returns a Boolean value of True if it is NaN
or Na
. We can apply this across a series and use the resulting Boolean mask in a filter. The following code does this.
# Identify the average weight values with no values
null_weight_filter = pd.isna(prices['sample weight'])
# Display records in notebook
prices[null_weight_filter]
This yields the results in figure 2:
Figure 2. Notebook output of dataframe showing the seven observations with missing sample weights.
From this dataframe, we can see that we are missing a single weight for the apples, savoy cabbage and curly kale, and two weights beetroot and red cabbage. Out of 185 observations (37 for each observation variant), this is a low volume to be missing. There is no clear pattern to the missing data.
The following code will allow us to investigate the missing price observations.
# Identify the average weight values with no values
null_price_filter = pd.isna(prices['price'])
# Display records in notebook
prices[null_price_filter]
Running this code shows us that we are missing 38 values. One for apples and all 37 for savoy cabbage. We have not included the image of this analysis due to the size of the output. It is available in the notebook contained within the supporting lesson materials.
At this stage, it is useful to ask why. Why is this data missing? This is a useful question as the answer again may help us understand how to proceed. For example, we may identify an issue with data capture and can prevent future issues at source, or we may be able to identify the specific values that are missing.
For this example, we will assume that the data was not recorded and is not available to us. We now need to decide how to progress.
What Options Do We Have for Managing Missing Data?
We have identified the missing data and we have also identified some insight to help us decide how to progress. We now have four options:
- Do nothing
- Update the values
- Remove the datapoint
- Stop the analysis
As the processor of the data, it is down to us to decide which approach is the most appropriate to apply. This is heavily dependent on the data and the situation. Whatever course of action you choose, remember, it could introduce bias or inaccuracies into your work and you may need to justify your assumptions and action at a later stage.
Do Nothing
This is the simplest response to missing data. It requires no action or no code to implement, we continue with the data as it is.
This would be an advisable approach if we are confident that the missing value will have limited impact on the analysis requested.
Update The Values
If doing nothing is not an option, then updating the data is the next step to consider. This is the most involved of the options and careful consideration needs to be taken.
During our preliminary investigations, we may have gotten lucky and have identified the actual values that are missing, if this is the case, then we can update these in the data set.
We may also be able to make an inference about the missing values and estimate (or guesstimate) and inferred value. This could involve updating to a chosen fixed value; a calculated statistic such as a population mean; or another sensible method of choosing a value (such as repeating the last available data point). There are also many other techniques that can be implemented here.
If we choose to take this path, then we need to ensure that we do not unnecessarily bias the subsequent analysis and can justify our assumptions.
Remove The Data Point.
If ignoring the data point is not an option and we are not able to update the values, then we are left with the removal option.
This can be dangerous, as again it may skew the results and understanding why data is missing is vitally important. We will also need to decide if we need to make the consumers of the analysis aware of the removal of the data as it may impact their interpretation.
It is also important to understand the ask as the removal of certain data may prevent the analysis from being completed.
Stop the Analysis
If we are not able to implement one of the three steps above safely, and we cannot complete the analysis without the missing data, then we are left with no option other than to stop the analysis.
If we find ourselves in this situation, we must revisit how the data is captured and determine if capturing the missing data is possible.
Deciding The Best Way Forwards
Let’s consider this in the example above. We have three differing missing data points to deal with and therefore have three decisions to make:
- How to manage the missing weights
- How to manage the missing apple price
- How to manage the price missing against all savoy cabbage observations
How to Manage Missing Weights
When considering this, we need to remember that we have been asked to calculate the average weights for each fruit or vegetable.
The Law of Large Numbers states that as the sample size increases, the sample mean approaches the population mean. In short, one or two data points from 38 is unlikely to have a dramatic impact on the mean value.
As a result, we can safely do nothing and be confident that the missing data will not be of tangible impact to the analysis. As the number of missing values increases, we can have less confidence in such a statement.
How To Manage the Missing Apple Price
We are missing a single price point for the apples. We have been tasked with producing a trend analysis for the prices.
Ignoring the missing value would result in an obvious point of discontinuity in the trend line, possibly inviting unnecessary queries. Replacing the value with the prior week’s price point will enable us to complete the analysis and prevent such a point of discontinuity.
We would need to state such an action in the end analysis to ensure people are aware it has been implemented.
How To Manage the Missing Savoy Cabbage Price
As we are missing all price points for the savoy cabbage data, we have no data to infer the missing values from. In this situation, we will remove the savoy cabbage data and complete the analysis with the remaining data. We would have to follow up on such an issue and indicate in the final analysis that the data has been omitted.
Implementing The Planned Updates
These updates are straightforward and are easy to implement in Python
Updating the Missing Apple Price Point
We can see from the inspection of the data (see the missing price point dataframe in the supporting notebook) that observation with missing data is in index 2 and the prior value in index 1. This means that the following code will quickly update the missing data.
# Update price in index 2 to that of index 1
prices.loc[2,'price'] = prices.loc[1,'price']
By inspecting this value post update we can check that it has successfully implemented the update.
Removing Remaining Missing Price Points
Pandas has a built-in function called dropna()
. As you would expect, this drops observations containing NaN
from a data frame. By default, this will drop observations where a NaN
is found in any variable. We can adjust the behaviour by providing different settings to the input of the function. By limiting the function to the price variable, we can quickly carry out the update. The following code implements this.
# Drop the remaining NAN values from the price variable
prices = prices.dropna(subset='price')
By re-inspecting the information for the process dataframe, we can see that this has been successfully implemented.
Figure 3. Notebook output of prices information following the removal of the savoy cabbage observations.
Figure 3 shows us that we now have 148 entries and that all price values are non-null.
Conclusion, Next Steps & Exercises
Missing data will be a regular issue for anyone who works with data. It is not always practical to spot missing data in Excel and Python not only provides simpler ways to inspect large data sets, but it is also easier to update and manage missing data in Python. Once identified, there are only a handful of high-level options for managing missing data and it is the analyst’s responsibility to decide the best way forward. Whenever making updates to data, you will need to ensure that you caveat any analysis to make users aware and are able to justify and explain the decisions you have made.
An exercise set to reinforce the learning above can be found in the supporting lesson materials. We will continue exploring methods for managing missing data in the next lesson.