
by Richard Warburton

Lesson 7: Managing Missing Values Part 2

📂 Resources

Download the resources for this lesson here.

Recap: What Have We Covered So Far?


The previous lesson introduced some basic techniques for identifying and managing missing values. In this lesson we will build on these and look at some more advanced techniques for updating missing data.

Before we get on to that, however, we are now in a position to explore an important feature of Pandas.

Copies and Views of DataFrames


One of the techniques we have used for managing missing values is updating: we write code that overwrites the missing values with new ones in the dataframe. This exposes us to an important feature of Pandas, that of views and copies. This is best explained with an example.

The following code builds a simple dataframe and then takes two separate subsets of it using techniques discussed in the previous lesson.

# Import pandas (needed for all examples in this lesson)
import pandas as pd

# Define example data to use
example_data = {'Label':['A','B','C','D','E','F'],'Value':[1,2,3,4,5,6]}

# Generate example dataframe
example_df = pd.DataFrame(example_data)

# Define two subsets of the data, using different subsetting techniques
example_subset_1 = example_df.iloc[0:1,:]
example_subset_2 = example_df.loc[example_df['Label'] == 'D',:]

This yields the dataframes displayed in figure 1.


Figure 1. Notebook output of the main dataframe and the two subsets taken using differing techniques.

Using the following code, we can update the value in the first row and second column of both subsets.

# Update values in first row and second column of both subsets
example_subset_1.iloc[0,1] = 1000
example_subset_2.iloc[0,1] = 1000

Figure 2. Notebook output following the updates made to the two subsets. We can see the updates actioned in both subsets, and the one made to subset 1 has also been reflected in the main dataframe.

Re-inspecting the output (figure 2), we can see that the updates have been actioned in both subsets, but one has also carried through to the original dataframe. Why is this?

Well, one update (that applied to example_subset_2) is actioned upon a copy of the dataframe and the other (that applied to example_subset_1) upon a view. Conceptually, when a filter of the original data produces a view, we are effectively viewing the original data through a different window. This means that any updates made to the data will be reflected in the original.

Conversely, if a filtering operation produces a copy, then we are working on a replica of the original data and any updates will only be made to the copy. Depending on your task, either behaviour can be desirable or undesirable. Our task here involves updating values, so we will need to ensure that we are working on a view of the data.

Generally, whenever an array or list of labels or a boolean vector is involved in the indexing operation, the result will be a copy. With single indexing and slicing, views will be returned. See the Lesson Notebook for further examples.

Generally, views will be returned by:

  1. Taking Slices
  2. Single Label Indexing (for [], .loc[... , ...] and .iloc[... , ...])
  3. Indexing with a Boolean Array

And copies will be returned by:

  1. Indexing with a Boolean Expression
  2. Indexing with a list (for [], .loc[... , ...] and .iloc[... , ...])
  3. Forcing a copy with the .copy() method

Examples of these can be found in the supporting lesson materials, and a brief sketch is given below.
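As a quick illustration of these rules, the following sketch reuses the example_df built above. The variable names are for illustration only, and the exact behaviour can vary between Pandas versions, so treat this as a sketch of the rules listed above rather than a guarantee.

# Taking a slice: expected to return a view (updates write back to example_df)
view_slice = example_df.iloc[0:1, :]

# Indexing with a list of labels: expected to return a copy (updates stay local)
copy_list = example_df.loc[[0, 1], :]

# Forcing an independent copy, regardless of the indexing used
copy_forced = example_df.iloc[0:1, :].copy()

# Updating the forced copy leaves example_df unchanged
copy_forced.iloc[0, 1] = 2000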

Updating to a Population Average


Let’s revisit the data from the last lesson. If you recall, we decided not to update the missing weight values. Another course of action would have been to update them to a population average calculated from the sample.

In this example, this is not a totally straightforward calculation, because each weight value reflects a particular item and variety of fruit or vegetable. As a result, we can’t just calculate the mean over the whole data set and use that. We will need to calculate a mean for each individual variety and then use it to patch in the respective missing values.

Let’s start by importing the data from Excel and carrying out the same updates as the last lesson. We won’t cover the code for this here, however it is available in the Notebook in the supporting lesson material.

First, we need to calculate the mean sample weight for each item. The following line of code does this in Python.

# Calculate the mean sample weight by item
mean_weights = prices.groupby(['item'])['sample weight'].mean()

This yields the data displayed in figure 3.


Figure 3. Pandas series containing mean sample weights by item.

Having calculated these values, we now need to map them on to the missing values based on the item for each respective observation.

We do this in three steps:

  1. Create a dictionary mapping each item to its mean sample weight.
  2. Apply a lambda function to the view of observations with missing weights, looking up the replacement weight for each observation based on its item. This results in an array containing the updated values for each observation with missing data (importantly, in the same order).
  3. Overwrite the sample weight variable in the NaN-filtered view with the array of data created in step 2.

NOTE: The array created in step 2 will be of the same length, and have its observations in the same order, as the view of missing values. This allows us to assign the array directly to the sample weight variable in the view in step 3.

The following code does this; further explanation of how the code works can be found in the lesson video.
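In the code below, null_weight_filter refers to the boolean filter for missing sample weight values created in the previous lesson. As a reminder, it would have been defined along these lines (an assumption here, so check your own notebook):

# Assumed from the previous lesson: boolean filter flagging missing sample weights
null_weight_filter = prices['sample weight'].isnull()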

# STEP 1: Create a dictionary providing a key-value mapping from item to average weight
mean_weight_dict = dict(zip(mean_weights.index,mean_weights.values))

# STEP 2: Apply a lambda function to look up the new values via the step 1 dictionary
update_values = prices[null_weight_filter].apply(lambda x: mean_weight_dict[x['item']], axis=1)

# STEP 3: Assign the updated values to the NaN-filtered view
prices.loc[null_weight_filter,'sample weight'] = update_values.values

As we are updating a view of the data, this update carries back to the original data. We can confirm this by reinspecting the data for the indices that originally had missing values, see figure 4.


Figure 4. Notebook output of the indices within the prices dataframe that originally had missing sample weight values, confirming update success.
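As an aside, the same result can be sketched more concisely with the .transform() method, which broadcasts each group's mean back onto the rows of the original dataframe. This is an alternative approach rather than the one used in the lesson materials:

# Alternative sketch: fill missing sample weights with the mean weight for each item
prices['sample weight'] = prices['sample weight'].fillna(
    prices.groupby('item')['sample weight'].transform('mean')
)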

Pandas GroupBy Method


The code above introduces the pandas .groupby() method. This, along with its aggregate functions, is a powerful tool that allows vast data sets to be easily summarised (in a similar way to a pivot table in Excel).

We will not explore this in detail here as our focus for this course is cleaning data. However, there are two things to note about this method at this stage:

  1. If you would like to learn more about this method, then the Pandas GroupBy documentation provides details of the implementation.
  2. The .groupby() method supports several other aggregation functions, such as .sum(), .count(), .median(), .min() and .max() (a mode can also be obtained via .agg()). This allows us to easily update missing values to other statistics should we choose to do so; a brief sketch follows this list.
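For example, if we wanted to patch the missing weights with the median rather than the mean, the first step would change only slightly (a sketch reusing the prices dataframe from above; the variable names are for illustration only):

# Calculate the median sample weight by item, rather than the mean
median_weights = prices.groupby(['item'])['sample weight'].median()

# The dictionary/lambda pattern shown earlier would then use these values instead
median_weight_dict = dict(zip(median_weights.index, median_weights.values))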

Case Study: England Cricket Cap 696


Sport generates a vast wealth of data. In any sport, coaches, commentators, and spectators churn through reams of data, looking for insights that can help them see things in new ways or give them a competitive edge. The on-base percentage in baseball, the assists of a player in football (or soccer, for the global audience) or the batting and bowling averages of cricket players are just a few of numerous examples.

Cricket is a quintessentially English game and can be confusing to people at first. For centuries, it has produced data. Vast amounts of data. Games played, runs scored, overs and balls bowled, batting averages, bowling averages…the list goes on. It is in this vast archive, in the performances of England Test Match players (or capped players), that we find an interesting lesson in missing data.

Wikipedia provides our best source of data here, with a table of the main statistics for every England player, from #1 Tom Armitage in 1877 to #711 Josh Tongue in 2023 (correct at the time of writing).

Some simple code (which will be explained in a future lesson) extracts this data into Pandas, and by searching for missing values we see quite a few. The nuances of cricket mean that not every player will have every statistic: all players bat (some better than others), but only some also bowl.
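As a rough illustration only (the lesson's own extraction code is covered in a future lesson, and the exact URL and table position here are assumptions), Pandas can read HTML tables from a web page directly:

# Hypothetical sketch: read the tables from the Wikipedia list of England Test cricketers
import pandas as pd

tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_England_Test_cricketers')
players = tables[0]                # assumed position of the main statistics table
print(players.isnull().sum())      # count the missing values in each column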


Figure 5. Subset of data from the list of capped England players following import into Python. We see legitimate missing values in this selection.

There are a couple of points to note regarding figure 5:

  1. The columns presented are a subset of the total available in the data set. This is to aid presentation in the lesson text.
  2. We may expect some of the values to be integers (such as 'Runs Batting' or 'Balls Bowling'). However, NaN values are a special case of the float data type, and as a result it is not possible to store a standard integer column containing a missing value; a short sketch follows this list.
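A minimal sketch of this behaviour (the values are made up, and the nullable integer type at the end is an aside):

# A column of whole numbers with a missing value defaults to the float64 data type
runs = pd.Series([52, 113, None])
print(runs.dtype)       # float64

# Pandas does offer a nullable integer type if integers must be preserved
runs_int = pd.Series([52, 113, None], dtype='Int64')
print(runs_int.dtype)   # Int64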

There is, however, one player, Alan Jones (#696), who has no data whatsoever. How can this be?


Figure 6. Subset of the Capped England Players data showing Alan Jones (#696) has all his playing statistics missing.

With our data hat on, we have noticed that there are missing values and need to decide how to progress. The first step is to investigate why the data is missing.

Alan Jones played a single match for England against a combined ‘Rest of the World’ side in 1970. The original match had been scheduled against South Africa; however, as a stand against apartheid, the match and tour were cancelled (the International Cricket Council, or ICC, introduced a moratorium on South African tours in 1970). The match went ahead, Alan Jones played, but the ICC later revoked its Test match status. As a result, Alan Jones, who never played for England again, was never recognised as an England player. He never received his England cap.

In June 2020, 50 years after the match, the England and Wales Cricket Board (ECB) moved to right what they saw as a great injustice. They recognised him as a capped player and awarded him cap #696, the next available number in the long list of players. While the record books do not show Alan as a capped player, the history of England players does.

As a result of this, he has a legitimate place in the list (#696), but no statistics. Or to put this another way, the data has an observation for him, but all values are missing.

How should we manage this missing data? Well, that depends on what we have been asked to do with it. Without such an ask, all options remain on the table. However, what this little digression into the world of sport tells us is that missing values can hide important information and stories.

When dealing with data in the wild, always investigate missing values; otherwise, you may miss something important.

Conclusion, Next Steps & Exercises


Missing data is one of the numerous issues we will face when cleaning data. How to manage missing data is a vast subject, and we have only scratched the surface here. Over the last two lessons, we have seen differing examples of how we can manage missing data and implement the corrective actions in Python code. Several of these techniques are not easily implemented in Excel, which gives us another argument for why cleaning data in Python is advantageous.

An exercise set for this lesson is included in the lesson support materials.