by Richard Warburton
Updated 17 November 2023
Strategies and Techniques for Data Cleaning in Excel and Python
Course Description
Clean data is data that does not require any transformation or updates prior to commencing analysis. If your data requires cleansing work prior to use, then you have messy data.
If you primarily work with Excel, working with messy data means you regularly find yourself carrying out tasks such as:
- Removing non printable characters such as line breaks.
- Removing leading and trailing spaces or extra spaces.
- Using the text to columns wizard to split data into separate columns.
- Populating blank cells or removing blank rows or columns.
- Identifying duplicate values or duplicate data.
- Using conditional formatting to highlight errors.
- Correcting the capitalisation of text.
- Using paste special to clear formats.
This list is far from exhaustive and messy data is commonplace. Every messy dataset encountered is messy in its own unique way and requires its own unique processes to clean. However, once all such issues are solved and we have clean data, we can progress to the interesting part of the job, the analysis.
In this course we'll work through the complete data-cleaning pipeline, from importing data, to cleaning and manipulating it, to exporting it for analysis.
We'll do this using both Excel and Python, and we'll see how Python can be used to speed up our data cleaning operations.
Figure 1. Basic data exploration in Python
We'll start by focusing on the Tidy Data Principles. These principles are a set of rules that help us to structure our data in a way that makes it easy to clean, manipulate and analyse.
We'll then move on to look at some of the most common data cleaning techniques in Excel, and we'll see how we can use Python to speed up our data cleaning operations.
We'll then move on to look at how we can use Python to clean and manipulate data using Pandas, one of the most popular data science libraries.
When you complete this course, you'll have the ability to take real-world data and clean it for further analysis in your own data science projects.