by Richard Warburton

Lesson

8. Obtaining Data Part 1: Working with Files

📂 Resources

Download the resources for this lesson here.

Introducing Separated Values Files


Since the mid-80s, Microsoft Excel has been the workhorse of offices. It is rare to find a business that does not utilise this powerful tool somewhere for analysis.

However, the .xlsx (or .xls) file format is proprietary to Microsoft. While many other software providers find ways of working with this data format, this is not always the case. There is a need for a standard data format which can transfer data between programs and systems that do not share the same native file types.

This means that, as part of the data cleaning process, we need to be familiar with other file types and be comfortable transforming them into a format that can be used for analysis.

Comma-Separated Values Files (CSV Files)

One such format is the comma-separated values, or CSV, file. This is a plain text file where each row of data (aka observation) occupies a row of the text file, and the data for the columns on that row are separated by a delimiter, in this case a comma. An example of CSV data can be seen in figure 1.

Figure 1. Example of a raw CSV file in notepad. Rows of data are entered on rows in the text file and columns are separated (delimited) with a comma on each row of data. In this example, we have a header of variable names in row 1.

Such data files may or may not contain a row of headers and are commonly produced when exporting data from SQL databases or other similarly structured data stores. Excel is very good at reading and loading single CSV files; in fact, you may have already worked with them without realising. However, there are some limitations which can be fiddly to work around in Excel.

Such limitations include spaces and gaps in files; free text values which may legitimately contain a comma; and CSV files containing more data points than can fit in Excel. We must also recognise that a CSV file is a text file. As a result, we will need to load the data into a program like Excel to perform any requested analysis.

Other Separated Variables (TSV, Other Delimiters)

The delimiter in such a file is not limited to a comma; in theory, it can be any value. Tab delimiters (i.e. cell data separated by a tab space), defined widths for cells, or a user-defined character sequence such as ‘%>%’ are all plausible.

For simplicity, we will only consider CSV files here, but be aware that other data formats exist and can be processed using very similar techniques to those we will outline.
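
For illustration, the pandas read function we will meet shortly handles these variants with small changes to a single call. Below is a minimal sketch; the file names are hypothetical.

# Import required packages
import pandas as pd

# Tab-separated values: pass the tab character as the delimiter
tsv_data = pd.read_csv('example_data.tsv', sep='\t')

# A user-defined, multi-character delimiter requires the Python parsing engine
custom_data = pd.read_csv('example_data.txt', sep='%>%', engine='python')

# Files with defined cell widths have a dedicated reader
fixed_data = pd.read_fwf('example_data.txt')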

Working With CSV Files in Python and Pandas


When working with a single, well-formatted CSV file, Excel is more than capable of loading the file directly for analysis. However, data is rarely provided in such a state.

Python and Pandas provide us with a set of highly customisable tools for processing and cleansing CSV data. As we progress on our data journey, we are increasingly likely to find raw data in CSV formats, and building a set of tools to process such data is a valuable addition to the data analyst’s armoury.

Once any cleansing steps have been completed, we can export the data back to Excel for analysis. We could also use many of the modelling and visualisation tools available in Python should we so wish.

Importing Using read_csv

Pandas offers us an out-of-the-box function to read CSV files. This is highly customisable and allows us to quickly import CSV data into Pandas and Python. In its simplest form, the function takes a file path and name and will import the data to a dataframe.

# Import required packages
import pandas as pd

# Read the phytoplankton CSV data and assign to a dataframe
plankton_data = pd.read_csv('PhytoplanktonResults060423.csv')

The pd.read_csv function has several useful settings to be aware of.

The skiprows= and skipfooter= settings allow us to ignore a defined number of rows at the start and end of a file respectively. These can be set as arguments in the function call when importing. They are useful if your CSV file has metadata, such as a key, in either of these places (something we can see in the supporting notebook for the phytoplankton data and in figure 2).

Figure 2. Dataframe output of the phytoplankton data displayed in a notebook. This shows that the key metadata has been imported in addition to the core data.

The revised code below will skip the key at the end of the data.

# Read the phytoplankton CSV data, skipping the last 8 rows
# NOTE: engine='python' is required for skipfooter to function correctly
plankton_data_trimmed = pd.read_csv('PhytoplanktonResults060423.csv',
                                    skipfooter=8, engine='python')

Inspecting the tail of the data (figure 3), we can see that the key has been omitted.

Figure 3. Dataframe tail output of the phytoplankton data displayed in a notebook, where skipfooter was set to 8 on import.

The header= setting allows us to control the variable names in the Pandas dataframe. By default, read_csv infers the header, assuming it is the first row in the data. If our data does not contain a header, we can set header=None, which results in default numbering, or additionally pass a list of names to use via the names= setting.
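
A minimal sketch of both options, assuming a hypothetical headerless file 'no_header.csv':

# No header row in the file: columns receive default numbering (0, 1, 2, ...)
unnamed = pd.read_csv('no_header.csv', header=None)

# Supply our own variable names instead
named = pd.read_csv('no_header.csv', header=None, names=['Label', 'Volume'])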

The final setting to draw your attention to is quotechar=. This is particularly useful when importing variables which contain free text. It is plausible that a block of free text could legitimately contain a comma as a punctuation mark. This setting defines the character (often a quotation mark) used to wrap such values, ensuring that any delimiters occurring between a pair of quote characters are ignored. We will explore how to use these settings in the supporting notebook and the exercise set for the lesson.

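As a minimal sketch, using an inline string in place of a file for illustration:

# Import a file-like wrapper for the example string
from io import StringIO

# Free text containing a comma, wrapped in quotation marks
raw = 'id,comment\n1,"Arrived late, but intact"\n2,"No issues"'

# The comma inside the quoted value is treated as text, not a delimiter
comments = pd.read_csv(StringIO(raw), quotechar='"')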

There are many other settings and configurations for this function. Overviews of these can be found in the pd.read_csv documentation.

Other Import Methods

There are other import methods and libraries which can be of use for importing text files (including CSV files).

One you will often see involves the with open() method. This allows a user to open a file and then process it line by line. It usually requires a few more lines of code, but it can result in a faster import process that implements specific cleaning tasks as each line is read. The general format for the code is as follows:

# Open the specified file for reading ('r')
with open('FILE_NAME.csv', 'r') as infile:
    for line in infile:
        """LINE BY LINE PROCESSING CODE HERE"""

We mention this method here as it is one you will see used on occasion. We will, however, remain focused on pd.read_csv() for this lesson.

Working With Multiple Files in Folders


We are now in a position where we can explore working with CSV files through an example. It is quite common to have a set of data distributed over multiple CSV files. To complete the analysis requested, we will often need to amalgamate the data into a single file.

Let’s consider a simple example. A manufacturing company runs three machines producing identical widgets. Every day, a data file is produced that stores the number of widgets manufactured and the average operating temperature of the machine for every 15-minute period of the working day. The company hypothesises that the efficiency of each machine is linked to its operating temperature and that there are temperature thresholds which, if exceeded, cause the machine to shut down for the rest of the day. They would like to amalgamate the data and see if there is evidence to support this hypothesis.

The data from these machines is contained within the ‘machine_data’ folder in the supporting lesson materials.

Introducing pd.concat(), a Useful Pandas Function

Core to this example is the ability to append multiple files together to create a single data set for analysis. The Pandas function pd.concat() provides us with this capability.

We can do this either by appending rows of data beneath the existing columns (i.e. concatenating along axis 0) or by adding new columns of data to existing rows (concatenating along axis 1). This behaviour is demonstrated in figure X.

The behaviour of the function can be controlled by passing different arguments into the function call.

By default, the function will append data as new rows and will not reset the indices of the dataframes. This may result in duplicate indices.

# Define two example dataframes
df1 = pd.DataFrame({'Label':['A','B','C'],'Volume':[10,20,30]})
df2 = pd.DataFrame({'Label':['A','B','C'],'Volume':[1,2,3]})

# Default concat call on a list of dataframes
pd.concat([df1,df2])

This yields the dataframe displayed in figure 4.

Figure 4. The result of concatenating two simple dataframes using the default pd.concat settings. Note the index values have been inherited from the respective dataframes.

Indices can easily be reset by setting ignore_index=True; this may, however, lose information. Another option is to pass a set of keys to identify which dataframe each row came from. This results in a multi-index.
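
Continuing with df1 and df2 from above, a brief sketch of both options:

# Reset the index on concatenation (0 to 5 in this case)
pd.concat([df1, df2], ignore_index=True)

# Label each source dataframe instead; this yields a multi-index
pd.concat([df1, df2], keys=['df1', 'df2'])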

By setting axis=1, we can append new columns of data on to each row. Pandas will align data by index, and care needs to be taken to ensure that the data aligns correctly. This process may introduce missing values.
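
For example, with the same two dataframes:

# Append df2's columns alongside df1's; rows are aligned by index
pd.concat([df1, df2], axis=1)

# A dataframe with a shorter index introduces missing values (NaN) on alignment
df3 = pd.DataFrame({'Weight': [5]})
pd.concat([df1, df3], axis=1)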

A similar process occurs where there are mismatched columns. A full exploration of these settings and techniques can be found in the supporting notebook. For further details of the varying settings, see the pd.concat() documentation.

The OS Package

To solve the problem posed in the example, we will also need to list all files contained within a given directory on a computer. The os package provides this capability in Python. It enables Python to interact with the operating system and find out specific information about files and settings. For this lesson, we will use a specific function within it.

The os.listdir() function returns a list of all files within a directory. This will enable us to access a folder of data files and loop through them each in succession.

# Import required packages
import os

# List all files in the directory
file_list = os.listdir('machine_data\\')

A sample of the output from this can be seen in figure 5.

Figure 5. First three filenames found in the machine_data folder contained within a list.

By default, Python will assume that the location of your script or notebook is the working directory and will take this as the root path when resolving folder names. You can override this by providing a full file path.
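
As a quick sketch (the full path shown is hypothetical):

# Check the current working directory
print(os.getcwd())

# Alternatively, supply a full path to the folder
file_list = os.listdir('C:\\Users\\analyst\\machine_data\\')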

Full details of the OS package can be found in the OS documentation.

Missing Data - Variables in File Names

Inspecting the individual files in the machine_data folder, we can see that there is an issue. Each file has the time, temperature and units produced (figure 1), but we are missing the machine and the date. These are essential data points for the requested analysis.

Looking in more detail at the data we have, we see that the missing data points can be extracted from the file name (figure 5). But how do we do this?

Each character in a Python string can be indexed by its place in the string. For example, in the string ‘abc’, ‘a’ is at the zeroth index, ‘b’ at the first and ‘c’ at the second.
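
A quick demonstration:

# Index individual characters of a string
s = 'abc'
s[0]   # 'a' - the zeroth index
s[2]   # 'c' - the second index
s[-1]  # 'c' - negative indices count from the end of the string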

This means that the same indexing and slicing methods we learnt in the last lesson can be used for strings and the following code will extract these data points for us.

# Set example file name
file = '20230201_MACHINE_A.csv'

# Extract the machine from the file name
machine = file[-5:-4]

# Extract the date from the file name (dd/mm/yyyy format)
date = file[6:8] + '/' + file[4:6] + '/' + file[:4]

These can then be added as variables in the imported dataframes.

Looping Through and Importing Files

We now have all the tools we need to solve this problem. The following code will loop through each of the files; import them into Python; extract missing key data from the file name; build a dataframe containing all the key data; and then append it to a list for storage.

# Initiate a list to hold each imported data file
imported_files = []

# For each file in the directory
for file in file_list:
    # Extract the machine from the file name
    machine = file[-5:-4]

    # Extract the date from the file name (dd/mm/yyyy format)
    date = file[6:8] + '/' + file[4:6] + '/' + file[:4]

    # Import the data file
    data_file = pd.read_csv('machine_data\\' + file)

    # Add date and machine to the respective data file (recall broadcasting!)
    data_file['date'] = date
    data_file['machine'] = machine

    # Append the imported data with machine and date labels to the list
    imported_files.append(data_file)

Once we have a list of all the imported data, we can use the concat function and the following code to build the overall data set needed for the analysis and export it back to Excel.

# Concatenate the list of dataframes into a single one along axis 0 (i.e. append rows)
machine_operating_data = pd.concat(imported_files, axis=0, ignore_index=True)

# Export the data to Excel
machine_operating_data.to_excel('machine_operating_data.xlsx', index=False)

Analysis Output

The focus of this lesson is the process to amalgamate the data. An example of the analysis output has been included in the supporting lesson material and from this we can see that there is evidence to suggest that the machines will shut down above 80 degrees and below 70 degrees Celsius. There is also some evidence to suggest that the optimal operating temperature is between 70 and 75 degrees Celsius, with a higher widget production count when the machines operate in this temperature range.

Conclusion, Next Steps & Exercises


This lesson has looked at varying ways we can use Python to amalgamate data distributed across multiple files and produce a single data set that can be used for analysis. We have also introduced CSV files, a file type that is commonly used to store data and information.

The tools and techniques outlined in this lesson can be adapted to several differing use cases and variants of this process can be explored in the supporting exercise set.

We have focused on CSV files in this lesson, and on situations where there are physical files stored to read. However, there are other sources of data that we may wish to access. In the next lesson, we will explore how to read lists from websites and read data from online APIs.