by Richard Warburton
4. Extracting Values From Free Text: Introduction to REGEX
Download the resources for this lesson here.
The Curse of Free Text
It’s a big problem! what makes data and information readable for humans, can make it very difficult for machines to process. Free text input is a perfect example of this!
A simple form for collecting information may have several input boxes for different data points. Some of these may be specific types, such as integers or times; some may limit your choices to a predefined list of values; however, on quite a regular basis, we will see free text.
From a human perspective, free text is brilliant! We can use clear language to articulate what it is we want to say and ensure all the required detail is there. However, from a machine point of view, it is difficult to process. We need to interpret the information and identify and extract the relevant values and variables.
Recall our previous lesson What Does Clean Data Look Like in Excel? where we introduced the tidy data principles. Every column should be a variable and every cell a value, burying this data in free text results in messy data and creates major problems for completing analysis.
Reviewing such blocks of free text to find relevant data points is a challenge and one you will regularly face when processing data. Natural Language Processing (NLP) tools are rapidly advancing, enabling some impressive and useful analysis, however you will often need to identify and extract specific data points from a block of free text.
A Common Problem: Extracting Values from Free Text
We will regularly need to extract information such as dates emails and phone numbers from free text.
Consider the following example of a feedback form:
Figure 1: Example of customer feedback form with useful data values added into free text.
We can see that there are three useful data points in the free text, a date, a phone number, and an email address. Extracting such information could be vital for many reasons. In this case we may want to follow up with the individual, and we may also want to analyse trends of a significant volume of responses.
Potential Solutions
As an analyst, if we get handed hundreds or even thousands of such free text data points, how can we approach extracting the data for analysis?
Manual Extraction
The manual extraction of such data is always an option that remains open. We are in total control of the extraction and can capture the data in a format that is usable for subsequent work. It is however time consuming. For a few data points, this may be the optimal option, however if you have several hundred or thousands of data points to parse, it becomes unrealistic. Not a recommended approach.
Excel Formulas
Excel has a large array of built in formulas, and these can be chained together to build a parsing tool with very specific capability.
The example in figure 2 shows us that it is possible to parse dates from free text, but there are limitations in the approach.
Figure 2: Example of an Excel formula extracting dates, highlighting both the functionality successfully working and limitations.
It works, however it does have its limitations:
-
It is an exceptionally complicated formula.
This is not something that you can just casually type in and it certainly not something that anyone else following you work will be able to read easily.
-
It works in some situations, however, can return problematic results in others.
Figure 2 shows us that written dates and inputs with multiple dates can yield responses that are problematic for analysts.
-
Outputs are in differing formats.
The process above only returns values in the format they are found (and as a text data type). Further processing would be required to coerce dates into a standard date format. This is not straight forward in Excel.
-
The formula only works for dates.
If we wanted to parse emails for example, a totally different formula would be needed. While some rework for a different application is a reasonable expectation, such a formula would be a totally different approach and solution.
It is worth noting that a similar approach can work in the Power Query environment (using the M language) However you will rapidly hit the same issues. Excel formulas offer a functional but problematic solution that can work in certain use cases. It has its limitations and is not easily transferable and adaptable. This is not a recommended solution for parsing values from free text.
Python and Regular Expressions
Python provides us with a very simple solution for such problems. It allows us to define a ‘pattern’ that we want to find in text and then finds all matches. Once found, we can then perform several operations, including, extract the matched text. The extracts will initially be returned in the found format. However, unlike the comparable situation with Excel, coercing the returned values to a new format is a much simpler task in Python.
These patterns are also called Regular Expressions (or REGEX) and are common to several programming languages and not just Python.
REGEX patterns are defined with text strings, with differing characters representing different types of matches. Multiple characters can then be brought together to define a pattern to match. The REGEX pattern can range from the simple (e.g. find al instances of the two letters ‘Ca’) or can be much more complicated and involve logical statements regarding content and length of match (e.g. find all three letter words starting with ‘Ca’ or ‘Do’.
Figure 3: REGEX code to extract differing date formats and return in the same format implemented in Python using the RE package.
Figure 3 demonstrates this process in action for parsing multiple dates from a free text value. This defines a REGEX pattern to match the differing date formats; extracts all matches; coerces values to a standard data format; then returns the found matches in a list.
This is a simple, repeatable, and highly adaptable solution. If you are presented with free text in Excel and have differing data points you wish to extract, then using Python and REGEX would be the recommended solution. To do this, we need to firstly understand what regular expressions are and how to build patterns.
What are Regular Expressions?
We have already explained that REGEX patterns are text strings that define a pattern to match. Each character within a REGEX expression performs a match and is either a literal character or metacharacter.
Literal Characters
Literal characters are the easiest to interpret as we are looking for an identical match to the pattern in the string. Such examples include:
REGEX | What Does It Do? |
---|---|
'C' | This will find matches anywhere there is a capital ‘C’.This will find matches in ‘Cat’, ‘Carphone’ and ‘Columbus’. It will not find a match in ‘accent’ due to the case sensitivity. |
‘CA’ | This will find matches anywhere the capitalised letters ‘CA’ are found. For example, this will find matches in ‘CA123’ but will not in ‘123456’ or ‘cat’ |
'123' | This will match anywhere where the numbers ‘123’ are found in succession. For example, it will find matches in ‘123456’, ‘912399’ and ‘654123’ but not ‘12234’ |
Examples of the code to implement these matches and the results can be found in the supporting lesson materials.
Meta Characters
Metacharacters are more abstract in their matching capabilities and allow us to deal with variants of input and apply logic into our matches. Examples include:
REGEX | What Does It Do? |
---|---|
'[a-z]' | This will match one lowercase character in the range of a-z. For example, this would match on each letter of ‘cat’ but not match anything in ‘CAT’ |
'[0-5]' | This returns a match when a digit in the range 0-5 is found. for example, this would match each digit of ‘1234’ and one match in ‘7819’ but find nothing in ‘789’. |
'[sz]' | This matches when one of the letters ‘s’ or ‘z’ is found. This will find matches in ‘s’, ‘z’, ‘zs’ but not ‘xwqx’. |
'*' | This matches the preceding character zero or more times. Such meta characters can be used to expand the matching capabilities of literal and meta characters. For example, consider the pattern ‘AB’_ would find a match for ‘AAB’, ‘B’ and ‘AB’ but would not match _‘AC’*. |
Examples of the code to implement these matches and the results can be found in the supporting lesson materials.
Simple REGEX Example: American and English Spelling
We can chain together strings of literal and meta characters to describe a pattern that we want to match. A very simple example can be demonstrated by parsing text for either the American or English spellings of words such as ‘apologize’ (or ‘apologise’).
Figure 4: Demonstrates how we can use the pattern ‘apologi[sz]e’ to find and replace either the American or English spellings with different text in Python.
Why does this code work? It literally matches the first part and the last part of the pattern ‘apologi’ and ‘e’ and will then match when a single character from the list of [sz] is found sandwiched between them. The function re.sub
is then used to find the first instance of a match and replace it with the given value (‘am sorry’ in this example).
Why does this code work? It literally matches the first part and the last part of the pattern ‘apologi’ and ‘e’ and will then match when a single character from the list of [sz] is found sandwiched between them. The function re.sub is then used to find the first instance of a match and replace it with the given value (‘am sorry’ in this example)
REGEX Character Details
There are a vast number of characters available for us to use within a REGEX pattern. As we will not be covering them all here, further details can be found within the documentation for the re package.
Complex REGEX Example: Extracting Emails From Text.
Let’s look at a more complex example and one which we are more likely to experience in a real-life scenario, the extraction of emails from free text.
The Structure of an Email
Before we start to explain how to build a pattern to match an email address, we should understand its structure. When ever we want to build a REGEX pattern to extract data from free text, understanding its structure is a good place to start.
road.runner@acme.com
This has a prefix (road.runner), a domain (acme) and a top-level domain or TLD (.com). The prefix and domain are separated by an ‘@’ symbol and the general format is:
prefix@domain.tld
There are strict rules as to what is permissible for each element of an email address, particularly the top level domains (see here for full details). For simplicity we will assume that:
Part | Aormat Assumption |
---|---|
Prefix | Must consist of characters 'a-z', numbers '0-9' and either ‘-‘, ‘.’ Or ‘_’ |
Domain | Must consist of characters 'a-z', numbers '0-9' and either ‘-‘, ‘.’ Or ‘_’ |
TLD | Consists of 2 to 6 characters 'a-z, which can include another ‘.’ (i.e. ‘.co.uk’) |
These are simplifying assumptions, and they do not perfectly describe what an email address should look like. However, these assumptions provide a simpler starting point for a REGEX explanation.
Given these simplified assumptions, the structure of an email address can be thought of as some letters, numbers and certain punctuation, followed by an ‘@’ symbol then of some letters, numbers and certain punctuation.
The Pattern
Let’s start by stating the pattern required, we can then work backwards and understand how each element of this works. For the problem of extracting emails from our free text we will need the following pattern.
email_pattern = [\w\.-]+@[\w\.-]+
The Characters
This uses some characters that we have not seen before.
REGEX | What Does It Do? |
---|---|
'\w' | Match a word character. This will match a single alpha numeric character (a-z, 0-9) in either upper or lower cases and underscores. |
'.' | A literal match for a ‘.’ (full stop).As the ‘.’ Character is a meta character in REGEX (It means match any character). We will need to escape this with a slash to perform a literal match. |
'+' | Match one or more of the preceding characters. NOTE: this differs to * by being one or more, not zero or more. |
'@' | Literal character, match ‘@’ |
How The Characters Form the Pattern
The first part of the pattern matches the prefix and ‘@’ symbol of the email address.
Figure 5: REGEX pattern breakdown demonstrating how the first part ([w.-]+@) matches the prefix and @ symbol of an email address.
It identifies one or more-word characters, dashes and full stops which are followed by an ‘@’ symbol.
The second part matches the domain and top level domain.
Figure 6: REGEX pattern breakdown demonstrating how latter part of the pattern ([w.-]+) matches the domain and top level domain.
This matches word characters, dashes, and full stops. When coupled with the first part of the expression, it will match our assumed simplified structure of an email address.
This pattern is based on our simplified assumptions and will not match every single permissible email address. It may also return some erroneous results, however it is more than adequate for most of the use case and enables us to take our first steps with more advanced REGEX patterns.
Examples of the code to implement these matches and the results can be found in the supporting lesson materials.
Approach
We will follow the three step process we outlined in Supercharge your Excel data cleaning with python. As Steps 1 and 3 (Importing and exporting data from Excel) have been covered extensively in this section, we will not cover in detail here. The strategy we will implement within step 2 will be.
- Define a function to: a) Use REGEX to identify all emails in a given free text string. b) Collapse the list of identified emails into a string, separating each email with a semi colon.
- Apply the function to the free text variable and generate a new one to contain responses.
Implementation
The following function completes the entirety of step 1 above.
# Define function to extract emails matches and return those found
def extract_email(str):
"""Function to find email pattern in a string (str) and return all matches """
# Define pattern to match
pat = '[\w\.-]+@[\w\.-]+'
#Find all matches (returns a list)
found_emails = re.findall(pat,str)
# join multiple list matches with a ';' to enable email use
return '; '.join(found_emails)
This takes an inputted string (str) and then finds all matches for the pattern (pat). This will return a list. It will be null if no values are found or will contain an item for every match returned by the re.findall
function call.
The .join()
method will collapse a list into a text string, separating each item with the preceding text (in this case a semi colon and space).
Having imported the data and assigned this Data Frame to the variable free_text_data. We can use the pandas .apply()
method to literally apply this function to all values in a data series and return the results. We can then assign these results to a new variable within the data frame. The following single line of code will do this across the 10,000 data points.
# Apply function to text column and create a new email column
free_text_data['Email'] = free_text_data['text'].apply(extract_email)
Exporting the data back to excel and reviewing the outputs shows us that we have successfully extracted the emails.
Figure 7: Processed data outputted to Excel, showing correctly extracted email addresses.
The notebook implementing this solution can be found in the supporting lesson material.
Conclusion & Next Steps
Extracting values from free text is a common ask. Unfortunately, Excel has no off the shelf solution to help us. Regular Expressions provide us with the ability to quickly spot patterns in free text and perform differing operations, such as extract the matched values. If we can define a pattern to search for, we can extract data from free text and use it in subsequent analysis. This can only be done in excel by using it in conjunction with Python.
Regular Expressions do take a little practice to understand. To assist with this, there is a set of exercises (and solutions) contained within the supporting lesson material.