by Richard Warburton
5. Manipulating and working with Pandas dataframes
Download the resources for this lesson here.
Introduction & Background
So far, we have looked at how we can use Python in conjunction with Excel to clean data. We have seen that by using the Pandas library, we can easily push data between Excel and Python.
A good starting place to learn more about Pandas is to look at how some of the familiar Excel manipulations can be implemented. There are many similarities between data tables in Excel and dataframes in Pandas. This gives us a tangible way to conceptualise the Pandas operations while we are learning about them. Once we are happy with the basics, we can then start to think about the enhanced features of Pandas that take the functionality considerably further than that of Excel.
In this article, we will explore Pandas dataframes in more detail, and compare some of the basic operations to those of Excel.
Anatomy Of a Dataframe
Let’s consider the same data presented in both Excel and Pandas. The data here is from the DEFRA website and provides the wholesale prices of various fruit and vegetables over time.
Figure 1. The Head of the fruit and veg prices dataframe with the key components labelled.
The data and code to explore this are contained within the resources panel for this lesson.
Indices
An index in Python refers to a position within an ordered item. The item could be a list, a string or a data structure such as a dataframe.
As dataframes are 2D objects, we need two indices to uniquely identify a value. This is analogous to Excel, where the row number and column letter are used to refer to a cell. In Pandas, we have a row index (1 in fig 1) and a column index (2 in fig 1). Together these form a unique reference for each item in the dataframe (3 in fig 1). This is analogous to a cell reference (such as “A1”) in Excel.
By default, Pandas uses integers (starting from 0) to index a dataframe. You can however choose to overwrite these. In Figure 1 we have used variable names for column indices and integers for rows. This is quite a common dataframe indexing approach.
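The default integer index and the ability to overwrite it can be sketched as follows, using a small hypothetical dataframe rather than the lesson’s DEFRA data:

```python
import pandas as pd

# A small illustrative dataframe (hypothetical data, not the DEFRA set)
prices = pd.DataFrame({"item": ["apples", "pears"], "price": [1.2, 0.9]})

# By default, Pandas assigns integer row indices starting from 0
print(prices.index.tolist())  # [0, 1]

# We can overwrite the row index, e.g. with 'A'-prefixed labels
prices.index = ["A0", "A1"]
print(prices.index.tolist())  # ['A0', 'A1']
```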
Series
A dataframe can be thought of as a grid of data, and an individual column of that grid is conceptually similar to a column within a data table in Excel. However, there are subtle differences.
An individual column of data in a dataframe is called a Series (4 in fig 1). This will inherit the same row indices as the overall dataframe. All values within a Pandas series must be of the same data type (we will cover these shortly). This is different to Excel, where a column of data can contain multiple different types of data.
Values
An individual value within a series (5 in fig 1) can be thought of as a cell in an Excel data table. It is a place where we can store a datapoint that is uniquely referenced by two indices.
As mentioned above, the data type or dtype specifies the type of data which the values must take within the series.
Other Useful Terminology
Axis
Often when performing a calculation in Pandas, we will need to specify whether it is carried out across the rows or the columns. A Pandas dataframe is a two-dimensional object, which keeps things simple: there are only two axes, where axis = 0 refers to rows and axis = 1 refers to columns. (Note that the axis value has more uses in NumPy, where arrays of N dimensions can be defined.)
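A minimal sketch of the axis argument in action, using a small made-up dataframe:

```python
import pandas as pd

# Hypothetical 2x2 numeric dataframe to illustrate the axis argument
df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

col_totals = df.sum(axis=0)  # collapse the rows: one total per column
row_totals = df.sum(axis=1)  # collapse the columns: one total per row

print(col_totals.tolist())  # [3, 30]
print(row_totals.tolist())  # [11, 22]
```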
Data Type (dtype)
We have already made a few references to data types. The data type specifies what type of values should be stored in the series. For example, we may want to store integers or whole numbers for a volume count; we would need a floating point decimal number to store a price point; and we will need a string data type to store the name of a fruit. By specifying and ensuring the type of data that should be in a series, we open a large library of methods and functions that can be applied to vast data sets with a few lines of code. To operate effectively, these methods and functions need to guarantee the type of data being inputted to them.
A full list of data types can be found in the Pandas Data Type Documentation. The most common we will work with are:
Type of Data | Value Example | String Alias Example (Python Code) |
---|---|---|
Integer | 1, 2, 3, ... | int64 or int32 |
Floats | 1.23, 4.566, 7.8999 | float64 or float32 |
Boolean | True or False | boolean |
Text | ‘The cat sat on the mat’ | object or string |
NOTE: The object dtype can effectively store differing values, but they will all be treated as strings. If you see this and are expecting a numeric value, something has gone wrong.
The alias reference in Python will often contain 32 or 64. This refers to the number of bits used to store each value, which limits the numbers that can be stored to specific ranges; there are cases where the lower memory usage of the 32 bit variants matters. As we are starting our Python journey and are unlikely to hit these limits, I would recommend using the 64 bit variants for now.
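A short sketch, with invented values, of how these dtypes can be set explicitly and then inspected via the dataframe's dtypes attribute:

```python
import pandas as pd

# Hypothetical dataframe mixing the common data types
df = pd.DataFrame({
    "volume": pd.array([1, 2, 3], dtype="Int64"),            # nullable integer
    "price": pd.array([1.23, 4.566, 7.8999], dtype="Float64"),
    "organic": pd.array([True, False, True], dtype="boolean"),
    "item": ["apples", "pears", "plums"],                    # defaults to object
})

# .dtypes lists the data type of each series in the dataframe
print(df.dtypes)
```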
Assumptions For This Lesson
This lesson will focus on the Python code needed to carry out the required operations. We will be assuming that you are familiar with the process of transferring data between Excel and Python which was outlined in Supercharge Your Microsoft Excel Data Cleaning with Python!
The data and code used in this lesson can be found in the supporting lesson material along with a set of exercises to support your learning.
Filtering, Indexing and Slicing
One of the first features of Excel we are introduced to is the ability to filter. This allows us to select rows of data from a data table that we are interested in. In addition to this, we can also hide rows and columns to present a subset of the data on the screen. This feature extends to pivot tables where analysis can be quickly subsetted by filtering the report and selecting what we want to analyse.
In short, we are often interested in subsets of our data and need to filter this data out based on some criteria.
The ability to subset data is something that is easily implemented in Pandas. We will introduce the most common ways of doing this here. The Indexing and Selecting Data section of the Pandas Documentation contains details of additional techniques and enhanced ways to use the methods below.
Simple Indexing
At its most simple level, indexing in Pandas allows us to quickly select columns from dataframes or items from data series.
Indexing via Square Brackets
We can use square brackets to get an item from a dataframe or series. For a dataframe, we can pass a column name, or a list of column names and we will get these returned. For example, the following code:
# Import required packages
import pandas as pd
# Import data from Excel
fruit_and_veg_prices = pd.read_excel('fruit_veg_prices.xlsx')
# Extract a single column from the dataframe and display in notebook
fruit_and_veg_prices["item"]
Results in the output displayed in figure 2:
Figure 2. Notebook output from filtering the fruit and veg price dataframe for a single column index.
NOTE: If an item in the list passed is not found in the column names, then an error will be returned.
If a single value is passed, then a series will be returned, as demonstrated in figure 2.
If two or more are passed, then a dataframe will be returned. The following code and the output in figure 3 demonstrate this.
# Extract multiple columns from the dataframe
fruit_and_veg_prices[["item","variety"]]
Figure 3. Example of filtering the fruit and veg dataframe for multiple column indices.
When using square brackets with a series, we will be referencing the index in the series. If the series is extracted from a dataframe, this will be the row index as well.
Indexing Via Attributes
It is also possible to access an individual column as an attribute of the dataframe. This provides a quick way to access individual series within a dataframe. For example, the following code will yield the same results displayed in figure 2.
# Extract the item series from the fruit_and_veg_prices dataframe
fruit_and_veg_prices.item
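A small self-contained sketch (with made-up data) showing that attribute access and square brackets return the same series, and a caveat worth knowing:

```python
import pandas as pd

# Hypothetical dataframe; column names are valid Python identifiers
df = pd.DataFrame({"category": ["fruit", "veg"], "price": [1.2, 0.9]})

# Attribute access returns the same series as square brackets
print(df.category.equals(df["category"]))  # True

# Caveat: attribute access only works when the column name is a valid
# Python identifier and does not clash with an existing dataframe
# attribute or method, so square brackets are the safer general option.
```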
Indexing With The .loc[... , ...] And .iloc[... , ...] Methods
When referencing in Excel, we use a combination of the row and column references to define the ranges of interest. (i.e. “A1” or “B1:B10”)
Pandas has a similar referencing system we can use via the .loc and .iloc methods. These both take a row and a column reference (or lists of references) and return a subset of the data. They work in a similar way; the main difference is that .loc works directly on the indices as they are named, while .iloc[... , ...] uses integer coordinates (starting from 0) instead of the defined index names.
The example below explores this in more detail. We have overwritten the index on the data frame so that the row indexes are all prefixed with an ‘A’.
By referencing the unique coordinates for a value, we can extract it. The following code would extract the value in the row indexed 'A0' and the column indexed 'category'.
By passing two strings for the indices, we extract a single value. The code:
# Extract the value from the column 'category' and row 'A0'
fruit_and_veg_prices.loc['A0','category']
Would extract 'fruit' from the fruit and veg dataframe.
By passing a list of row indices and a column as a string, we extract the data as a series. For example, the code:
# Extract the 'category' and 'item' values for row 'A0'
fruit_and_veg_prices.loc['A0',['category','item']]
Extracts the series shown in figure 4.
Figure 4. Notebook output demonstrating that a series is returned when a dataframe is filtered with .loc[... , ...] using a string and a list as indices.
By passing data in two lists, we extract a dataframe. This can be seen by running the following code:
# Extract the 'category' and 'item' values for rows 'A0', 'A1', & 'A100'
fruit_and_veg_prices.loc[['A0','A1','A100'],['category','item']]
Which outputs figure 5 in the notebook.
Figure 5. Notebook output demonstrating that a dataframe is returned when a dataframe is filtered with .loc[... , ...] using two lists as indices.
The .iloc[... , ...] method works in an identical way, but uses integer coordinates for the filter values. The following code will yield the same results displayed in figure 5.
# Extract the data in the 1st, 2nd and 101st rows and the first two columns
fruit_and_veg_prices.iloc[[0,1,100],[0,1]]
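The equivalence between label-based and position-based selection can be sketched with a small stand-in dataframe (hypothetical values, with 'A'-prefixed row labels mirroring the lesson's example):

```python
import pandas as pd

# Hypothetical dataframe with 'A'-prefixed row labels
df = pd.DataFrame(
    {"category": ["fruit", "fruit", "veg"],
     "item": ["apples", "pears", "leeks"]},
    index=["A0", "A1", "A2"],
)

# .loc uses the index labels; .iloc uses integer positions
by_label = df.loc[["A0", "A2"], ["category", "item"]]
by_position = df.iloc[[0, 2], [0, 1]]

print(by_label.equals(by_position))  # True
```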
Selecting Indices with Slices
Slices allow us to easily select ranges within a dataset and provide a shorthand notation for filtering data. There are various ways to input a slice, and the basic format is [start:stop], where start and stop are integer index values. This will then slice the data from index ‘start’ up to (but not including) index ‘stop’.
When passed to a dataframe in square brackets, it will filter the rows of the dataframe based on integer index position, even if your index is text.
For example, the following code:
# Select the first 4 rows of data
fruit_and_veg_prices[0:4]
Results in a slice of the first four rows of the data set (even though the index is text). This is displayed in figure 6.
Figure 6. Notebook output from a slice applied in square brackets to a dataframe. This selects the first four rows of data.
Slices can also be passed to the .iloc[... , ...] method, enabling subsets to be selected based on position. For example, the following code:
# Select the first four rows of data and the columns at positions 3 to 5
fruit_and_veg_prices.iloc[:4,3:6]
Filters the fruit and veg dataframe for the first four rows and for the columns at positions three to five (counting from zero). The notebook output of this is displayed in figure 7.
Figure 7. Notebook output showing results of slicing data using the .iloc[... , ...] method.
The notation used to input a slice varies the behaviour of the slice and the data returned. There are many interesting things you can do, such as reversing indices, taking every nth item, and combinations thereof. We won’t cover all the options here; however, the most common are outlined below.
Slice | Behaviour |
---|---|
[start:stop] | Indices from start to stop-1 |
[start:] | Indices from start to the end |
[:stop] | Indices from the start to stop-1 |
[:] | All indices |
Further details can be obtained in the Python documentation on slices.
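The slice behaviours in the table above can be sketched against a small made-up dataframe (note that with a text index, square-bracket slices still work by position):

```python
import pandas as pd

# Hypothetical dataframe with a text row index
df = pd.DataFrame({"n": range(6)}, index=["A0", "A1", "A2", "A3", "A4", "A5"])

print(df[0:4]["n"].tolist())  # [0, 1, 2, 3] - start to stop-1
print(df[4:]["n"].tolist())   # [4, 5]       - start to the end
print(df[:2]["n"].tolist())   # [0, 1]       - beginning to stop-1
print(df[::2]["n"].tolist())  # [0, 2, 4]    - every 2nd item
```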
Using Boolean Masks to Filter Data
We can use Boolean comparators to ask questions of our data. In Excel, the filter option allows us to carry out such queries on a single column. We will see how we can replicate this functionality here and explore more complicated applications in the exercises.
One of the advantages of ensuring that a series consists of a single data type is that we can use logical comparators to ask questions of the data. This is particularly useful when we have numerical data, as we can use greater than (>), less than (<) or equal to (==) to ask questions of the data.
We can do this with one line of code, and the resulting answer is a data series of Boolean values flagging True when the statement is true and False otherwise.
We can use the results of such a question to filter a dataframe and return only the items where the statement is True. The code below will determine where all price values are greater than or equal to 1.7 and then filter the fruit and veg prices dataframe for these observations.
# Determine where the price is >= 1.7
boolean_mask = fruit_and_veg_prices['price'] >= 1.7
# Filter the data for the rows where the mask value is True, and show in notebook
fruit_and_veg_prices[boolean_mask].head()
This results in the dataframe displayed in figure 8.
Figure 8. Notebook output resulting from using a Boolean mask to find all data points where the price is >= 1.7
You can filter with a Boolean mask using square brackets (see example above) or by passing the mask as an input to the .loc[... , ...] method. Examples of both are included in the notebook contained in the supporting lesson material.
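As a self-contained sketch (with invented prices rather than the lesson's dataset), both filtering styles can be compared directly:

```python
import pandas as pd

# Hypothetical prices; column names follow the lesson's dataframe
df = pd.DataFrame({"item": ["apples", "pears", "plums"],
                   "price": [1.9, 1.2, 2.4]})

mask = df["price"] >= 1.7

# Square brackets and .loc return the same filtered rows;
# .loc additionally lets us select columns at the same time
print(df[mask].equals(df.loc[mask]))   # True
print(df.loc[mask, "item"].tolist())   # ['apples', 'plums']
```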
Coercing Values to New Data Types
If you recall from earlier, a Pandas series must consist of a single data type and a dataframe can consist of many series, which may have many differing data types.
When importing data into Pandas, the read_excel() function will attempt to identify the data type of each column as best it can.
The most common import issue you will see is demonstrated by the following code and its output in figure 9.
# Import fruit orders dataset
fruit_orders = pd.read_excel('fruit_orders.xlsx')
# Display the data type of the Price column
print('Data type of Price is:', fruit_orders['Price'].dtype)
# Output fruit_orders to the screen
fruit_orders
Figure 9. Example of a numeric data type importing as an object from Excel.
The Pandas function to_numeric() provides an easy way to convert such a series to a numeric data type.
If we apply this function directly to the series, we will see that it returns an error. By default, the function is set to raise errors. We can override this by setting errors='coerce'. When this is done, the function attempts to resolve any issues and returns a ‘coerced’ value. Care needs to be taken when doing this to avoid introducing errors into the data.
If we do this with the example above using the following code:
# Coerce series to a numeric value and assign back to variable in dataframe
fruit_orders['Price'] = pd.to_numeric(fruit_orders['Price'],errors = 'coerce')
# Display the data type of the Price column
print('Data type of Price is:', fruit_orders['Price'].dtype)
We see that the data type has changed to float64 and that it has failed to coerce 'BANANAS' to a numeric value, returning NaN. See figure 10.
Figure 10. Results of coercing data to a numeric data type, note the introduction of a NaN value where the coercion has failed.
NaN stands for 'Not a Number' and is a special type of float value that helps Python deal with missing numerical values. (You will also see NA for other missing values.) We will only consider the coercion of objects to numeric here.
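The coercion behaviour and the resulting NaN can be sketched with a small invented series:

```python
import pandas as pd

# A series of strings where one value cannot be converted to a number
s = pd.to_numeric(pd.Series(["1.5", "BANANAS", "2.0"]), errors="coerce")

print(s.dtype)            # float64
print(s.isna().tolist())  # [False, True, False] - the failed coercion is NaN

# NaN values can then be located or removed before further analysis
print(s.dropna().tolist())  # [1.5, 2.0]
```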
There are other useful coercion functions and methods to be aware of. We will explore .astype() in the exercise set, and it is good to be aware of convert_dtypes() and to_datetime().
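A brief sketch of two of these, using a made-up dataframe: .astype() converts a series to a named dtype (raising an error if a value cannot be converted), and to_datetime() parses date strings into proper datetime values.

```python
import pandas as pd

# Hypothetical dataframe with numbers and dates stored as strings
df = pd.DataFrame({"volume": ["1", "2", "3"],
                   "date": ["2021-01-01", "2021-02-01", "2021-03-01"]})

# .astype() converts to a named dtype; unlike errors='coerce',
# it raises if any value cannot be converted
df["volume"] = df["volume"].astype("int64")

# to_datetime() parses the strings into datetime values
df["date"] = pd.to_datetime(df["date"])

print(df.dtypes)
```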
Moving Rows to Columns and Columns To Rows
In our first lesson, we introduced Tidy Data. This has three key principles.
- Each value is a cell.
- Each variable is a column.
- Each observation is a row.
We have explored several techniques for extracting and isolating values, which is the focus of our efforts to meet the first principle.
However, while we have seen the need for techniques to clean data to ensure the latter two, we are yet to explore implementation in Python.
Moving Columns to Rows
The Pandas melt function provides us with the capability to move columns to rows. As with all Pandas functions, its behaviour can be tweaked by varying the settings. We will consider a simple example here.
To use this function, we must determine which columns we wish to leave as columns (id_vars=) and which we wish to ‘unpivot’ into rows.
Consider the following simple example:
Figure 11. Example of an untidy dataset in Excel.
In this, we wish to keep ‘Country’ as a column, move the remaining columns into a single variable called ‘Year’, and maintain the population as a variable in a separate column. This puts the data in a tidy format.
The following code does this:
# Melt the UK_Pop data to be in a tidy format
tidy_uk_pop = pd.melt(UK_pop,
id_vars =['Country'],
var_name = 'Year',
value_vars=[1960,1961,1962],
value_name = 'Population')
In this code, id_vars specifies the variables we wish to keep as columns; value_vars, the columns we wish to melt into an individual variable with the name specified in var_name; the remaining values are then added against the respective observation in a variable with the name specified in value_name. This yields the output displayed in figure 12.
Figure 12. Tidy UK Population Data following the application of the Pandas melt function.
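A fully self-contained sketch of the same melt, using a small stand-in for the UK_pop dataframe (the population figures are illustrative only):

```python
import pandas as pd

# A small stand-in for the UK_pop dataframe (illustrative figures only)
UK_pop = pd.DataFrame({
    "Country": ["UK"],
    1960: [52.4],
    1961: [52.8],
    1962: [53.2],
})

# Melt the year columns into a single 'Year' variable,
# with the values collected into a 'Population' column
tidy_uk_pop = pd.melt(UK_pop,
                      id_vars=["Country"],
                      value_vars=[1960, 1961, 1962],
                      var_name="Year",
                      value_name="Population")

print(tidy_uk_pop)  # one row per Country/Year observation
```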
Moving Rows to Columns
The Pandas function pivot
can be used to move variables in a data set that have been recorded as rows. We can see an example of this in the 'World Population.xlsx' file where the measure column contains both population and gross domestic product (GDP) values.
Figure 13. Example of untidy data in Excel which requires rows moving to columns.
The results returned by this function are a little different to those seen before, as it will use a multi-index. We will demonstrate the base code and results here and the method to reset the multi-index and return data to variables is contained within the supporting notebook.
We need to assess our data and determine what data we want to see as rows, columns, and values.
We want to keep 'Country' and 'Year' as columns, move the unique items within the 'Measure' column to be their own respective variables, and then align the correct items from the 'Value' column to the correct variable and observation.
# Pivot the world_pop data to be in a tidy format
tidy_world_pop = pd.pivot(world_pop,
index =['Country','Year'],
columns =['Measure'],
values = ['Value'])
In this code, the inputs provided for the index argument will effectively remain as columns, but will be set as a multi-index in the resulting output; the variables provided for the columns input will be ‘pivoted’ to become columns; and the respective inputs against the values input will become the dataframe values. This yields the output displayed in figure 14.
Figure 14. World Population Data following the application of the Pandas pivot function. Note the multi index returned for the country and year variables.
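A self-contained sketch of the same pivot, using a small stand-in for the world_pop dataframe (illustrative values only), including one way to flatten the resulting multi-index back to ordinary columns:

```python
import pandas as pd

# A small stand-in for the world_pop dataframe (illustrative values only)
world_pop = pd.DataFrame({
    "Country": ["UK", "UK", "UK", "UK"],
    "Year": [1960, 1960, 1961, 1961],
    "Measure": ["Population", "GDP", "Population", "GDP"],
    "Value": [52.4, 73.2, 52.8, 76.1],
})

# Pivot the unique items in 'Measure' out into their own columns
tidy_world_pop = pd.pivot(world_pop,
                          index=["Country", "Year"],
                          columns=["Measure"],
                          values=["Value"])

# Flatten the multi-index back to ordinary columns
tidy_world_pop.columns = tidy_world_pop.columns.get_level_values("Measure")
tidy_world_pop = tidy_world_pop.reset_index()

print(tidy_world_pop)  # one row per Country/Year, one column per measure
```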
Conclusion, Next Steps & Exercises
Pandas is a powerful tool for manipulating data in Python. Its conceptual similarity to data tables in Excel makes it accessible to power users and analysts. The inbuilt methods and functions contained within Pandas provide users with an extensive toolset for manipulating and processing data. This dramatically expands the capabilities of Excel and again strengthens the argument for pushing Excel data into Python for cleaning tasks.
Knowledge and experience of Pandas is also important for progressing careers in Data Science and Machine Learning. Pandas is one of the tools that is regularly used to process and manipulate data and the skills learnt here are directly transferable to this use.