Working with strings in R

So far you’ve seen that, in addition to numbers, we often deal with characters (text) in our datasets of interest. We often want to search or manipulate the character values in a dataset. R has a host of functions for operating on character data (FYI, in computer science a sequence of characters is called a string), and there is a great package called stringr that provides many additional, easy-to-use functions for handling strings. We will primarily rely on the stringr package, which simplifies string operations much like dplyr does for data manipulation/wrangling operations¹.

IMPORTANT NOTE: As you work through today’s lecture add your own code block for each of the techniques learned and test out your own example.

Load in `stringr`

The stringr package is part of the tidyverse package collection, so you’ve actually installed stringr already. Let’s load in the stringr package so that we can use its functionality later on in the lecture. We could load tidyverse and this would load in stringr along with a bunch of other packages or we could just load in stringr. Either will work.

library(tidyverse)

Search for patterns in strings

When working with strings we often need to locate some pattern. We might need to search a text file for a particular keyword (e.g. the name of a US state of interest) or sequence of letters (e.g. a biologist searching through DNA sequence for a particular gene).

We’ll learn some techniques for performing these types of operations.

Finding exact matches

A very basic operation that we will frequently perform is to search for an exact match in a string. We can do this using conditional statements to check for equality. You’ve actually already done this a few times when checking for state abbreviations in your datasets, but now we’ll learn the underlying concepts.

str_1 <- "Hello how are you"
str_1 == "Hello how are you"

## [1] TRUE

You see that we assigned “Hello how are you” to the string object called str_1 and then tested whether str_1 was equal to “Hello how are you”. The test, unsurprisingly, returned TRUE.

Note that R is CASE SENSITIVE and thus the following will return a FALSE

str_1 == "hello how are you"

## [1] FALSE

Ignoring case sensitivity

In some cases, we’ll want to ignore case sensitivity. We can do this by forcing all of the strings to lower-case by using the tolower function.

Let’s try out tolower

str_1 # print out str_1

## [1] "Hello how are you"

tolower(str_1) # print out str_1 in all lower-case

## [1] "hello how are you"

Now let’s do a comparison where we force str_1 to be all lower-case

tolower(str_1)  == "hello how are you"

## [1] TRUE

Since str_1 was mixed-case (i.e. had lower and upper-case letters), we forced it to lower-case. If you want to ignore case sensitivity, it is good practice to force all of the strings being compared to lower-case. The example below should highlight why you would do this.

str_1 <- "Hello how are you"
str_2 <- "heLLo How aRe yOU"

The following should return a FALSE

tolower(str_1)  == str_2

## [1] FALSE

The following should return a TRUE, since we converted everything to lower-case.

tolower(str_1)  == tolower(str_2)

## [1] TRUE
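
The same idea carries over to whole vectors, which is closer to what you would do with a column of state abbreviations. The snippet below is only a sketch; the states vector is made up for illustration.

```r
# Hypothetical vector of state abbreviations typed with inconsistent case
states <- c("NY", "ny", "Ny", "CA")

# Force both sides of the comparison to lower-case before testing equality
is_ny <- tolower(states) == tolower("NY")
is_ny
## [1]  TRUE  TRUE  TRUE FALSE

# Count how many entries refer to New York
sum(is_ny)
## [1] 3
```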

FYI, similar to tolower there is a toupper function and I’m sure you can guess what it does

my_tweet <- "this is how i write all my tweets"
toupper(my_tweet)

## [1] "THIS IS HOW I WRITE ALL MY TWEETS"

If I were writing a book I might want to convert the string to title format. You can do this with the str_to_title function

str_to_title(my_tweet)

## [1] "This Is How I Write All My Tweets"

Search for partial matches

Finding partial matches is frequently required when dealing with strings. For instance, we might want to search for a single character or a particular word in a longer string.

Match anywhere in the string

To see if our str_1 object below contains the pattern "ab" anywhere in the string we use the str_detect function in stringr. The syntax is str_detect(object to search, pattern to search for).

str_1 <- "abc bca cab aaab"

str_detect(str_1, "ab")

## [1] TRUE

str_detect(str_1, "xyz")

## [1] FALSE

You can include spaces in the pattern you want to look for. Thus, the following code should return TRUE.

str_detect(str_1, "a c")

## [1] TRUE

Whereas this code should return FALSE, since there is no place in str_1 where "ac" occurs

str_detect(str_1, "ac")

## [1] FALSE
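
Because str_detect returns TRUE/FALSE values, its output can be used to subset a string vector. Here is a small sketch of that idea; the sites vector is an invented example, not data from the lecture.

```r
library(stringr)

# Hypothetical vector of sampling site names
sites <- c("Hudson River", "Mohawk River", "Lake George")

# TRUE for each element that contains "River" anywhere in the string
has_river <- str_detect(sites, "River")
has_river
## [1]  TRUE  TRUE FALSE

# Keep only the elements where the pattern was found
river_sites <- sites[has_river]
river_sites
## [1] "Hudson River" "Mohawk River"
```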

Match the start of a string

To check if a string starts with a pattern you use the caret ^ character in front of the pattern

str_detect(str_1, "^ab")

## [1] TRUE

str_detect(str_1, "^ac")

## [1] FALSE

Match the end of a string

We can also check if a string ends with a particular pattern. To do this we put the dollar sign character $ at the end of the pattern.

str_detect(str_1, "aab$")

## [1] TRUE
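
The start and end checks can be combined when you are sorting through something like a folder of file names. The following is a sketch only; the files vector is hypothetical.

```r
library(stringr)

# Hypothetical file names
files <- c("data_2020.csv", "data_2021.txt", "notes_2020.csv")

# Which names start with "data"?
starts_data <- str_detect(files, "^data")
starts_data
## [1]  TRUE  TRUE FALSE

# Which names end with "csv"?
ends_csv <- str_detect(files, "csv$")
ends_csv
## [1]  TRUE FALSE  TRUE

# Keep the names that satisfy both conditions
data_csv_files <- files[starts_data & ends_csv]
data_csv_files
## [1] "data_2020.csv"
```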

Finding the position of a pattern in a string

Often we need not only to determine whether a pattern is found within a string, but also where that pattern sits within the string. To do this we use the str_locate function. Let’s see an example.

str_locate(str_1, "b a")

##      start end
## [1,]    11  13

You can see that the str_locate function returns the start and end position of the pattern within the string of interest.
If the pattern does not exist, str_locate will return a value of NA for the start and end positions.

str_locate(str_1, "ABC123")

##      start end
## [1,]    NA  NA

If a pattern exists more than once within the searched string, then str_locate will only return the positions of the first occurrence

str_3 <- "Ab then Ab"
str_locate(str_3, "Ab")

##      start end
## [1,]     1   2

To locate ALL of the occurrences, you can use the str_locate_all function

str_locate_all(str_3, "Ab")

## [[1]]
##      start end
## [1,]     1   2
## [2,]     9  10

Let’s run the code above again, but this time we’ll save the output to a data object

pattern_position <- str_locate_all(str_3, "Ab")

If you look in your Environment pane you can see that str_locate_all returns a list data object.

Since data frames are nice to work with (and we have lots of experience working with them), we can convert the output to a data frame using the as.data.frame function

pattern_position <- as.data.frame(pattern_position)

However, there is a reason why str_locate_all returns the output as a list as opposed to a data frame. In some situations, it is not possible to format the output as a data frame. The code below presents an example illustrating this point.

# Create a 5 element string vector
str_4 <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")

# Search for all instances of "Ab"
str_locate_all(str_4,"Ab")

## [[1]]
##      start end
## [1,]     1   2
## 
## [[2]]
##      start end
## 
## [[3]]
##      start end
## [1,]     1   2
## [2,]     9  10
## 
## [[4]]
##      start end
## 
## [[5]]
##      start end

The str_4 object is a five-element string vector. The 1st element in the vector is "Abc", the 2nd element is "Def " (note the trailing space), …

Thus, str_locate_all returns a result for EACH element of the str_4 vector. Since the pattern can be found a different number of times in each element of str_4 (e.g. the 3rd element has two instances of “Ab”, while the 2nd has none), the most convenient data structure for the output is a list.
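
One way to work with this list output is to pull out a single element with double square brackets, or to count the rows of each element. The snippet below is a sketch of that approach; the object names positions and n_matches are just illustrative choices.

```r
library(stringr)

str_4 <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")
positions <- str_locate_all(str_4, "Ab")

# Each list element is a matrix with one row per match;
# here we pull out the matrix for the 3rd element of str_4
third_element <- positions[[3]]
third_element

# Count the number of matches in every element of str_4
n_matches <- sapply(positions, nrow)
n_matches
## [1] 1 0 2 0 0
```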

When we have a string vector (i.e. a vector containing strings as its elements), we can locate the position of the element within the vector that contains a pattern of interest by using the str_which function.

Let’s take a look at an example below

# Create a 5 element string vector
str_4 <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")

# Find which elements of the vector contain "Ab"
str_which(str_4,"Ab")

## [1] 1 3

You see that str_which returns 1 and 3, which indicates that the pattern "Ab" is found in the 1st element (i.e. “Abc”) and the 3rd element (i.e. “Abc Def Ab”) of the str_4 vector.
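
The indices from str_which can be used to pull the matching elements out of the vector. As a sketch, the code below does this and compares the result with stringr's str_subset function, which returns the matching elements directly.

```r
library(stringr)

str_4 <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")

# Indices of the elements that contain "Ab"
idx <- str_which(str_4, "Ab")

# Use the indices to extract the matching elements
matches <- str_4[idx]
matches
## [1] "Abc"        "Abc Def Ab"

# str_subset gives the same elements in a single step
str_subset(str_4, "Ab")
```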

Determine the length of a string

In addition to locating patterns in strings, we often need to determine the length of a string. We can do this using the str_length function

string_cheese <- "I like mozzarella, cheddar, and feta"
str_length(string_cheese)

## [1] 36

The above tells us that string_cheese is 36 characters long (including spaces, punctuation, etc.)

If we have a string vector then str_length outputs the length of each element in the vector

box_of_cheese <- c("cheddar", "mozzarella", "feta", "sharp cheddar")
str_length(box_of_cheese)

## [1]  7 10  4 13

Determine the frequency of a pattern

We can also count how many times (i.e. frequency) a particular pattern occurs within a string using the str_count function

On a string

str_ens215 <- "In this class we are learning data analysis in R"
str_count(str_ens215, "ar")

## [1] 2

On a string vector

str_count(box_of_cheese, "ar")

## [1] 1 1 0 2
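
str_count also accepts a vector of patterns, which is handy for something like tallying the bases in a DNA sequence. This is a sketch only; the dna sequence below is made up for illustration.

```r
library(stringr)

# Made-up DNA sequence (illustration only)
dna <- "ATGCGATACGCTTGA"
bases <- c("A", "C", "G", "T")

# Count each base; str_count checks dna against every pattern in bases
base_counts <- str_count(dna, bases)
names(base_counts) <- bases
base_counts
## A C G T 
## 4 3 4 4 

# The counts should add up to the length of the sequence
sum(base_counts) == str_length(dna)
## [1] TRUE
```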

Modifying (mutating) strings

Managing string lengths

In many instances you will want to modify the length of a string, either by adding characters to it (“padding”) or by removing characters from it. These operations are often used to ensure that a set of strings has consistent formatting. Below we’ll see how (and why) we do this.

Padding strings

Imagine you have a list of days in numeric format. For days of the month, you could represent the 2nd day with either a 2 or 02. The 02 format is often useful when you are naming files/objects since they will sort better in a file folder (e.g. when file names are sorted as text, 10 would come before 2, which you would not want for dates, whereas 02 correctly comes before 10). This is just one example of where you would want to pad strings.

days_vec <- c("1","2","3","10","11","5","24") # create a vector of days

str_pad(days_vec, width = 2, side = "left", pad = "0")

## [1] "01" "02" "03" "10" "11" "05" "24"

The width controls the length to pad until, thus the “10”, “11”, and “24” were not padded since they are already of length 2. The side controls where the padding occurs (either left or right of string) and the pad specifies the character to use in the padding.
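
To see why padding helps with ordering, here is a small sketch that sorts a vector of day strings before and after padding. The days vector is a made-up example.

```r
library(stringr)

# Hypothetical day-of-month values stored as strings
days <- c("2", "10", "1")

# Sorted as text, "10" lands before "2"
sort(days)
## [1] "1"  "10" "2"

# After padding to width 2 the text order matches the numeric order
padded <- str_pad(days, width = 2, side = "left", pad = "0")
sorted_padded <- sort(padded)
sorted_padded
## [1] "01" "02" "10"
```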

Try out the str_pad function on your own

# Your code here

Shortening strings

There are also cases where you will want to shorten the length of strings.

Trim whitespace

One case where you may want to shorten strings is when there is whitespace on either or both ends of a string. To do this you use the str_trim function.

spaced_out <- c("Jan ", "Feb ", "Mar ", "Apr ", "May ")
str_trim(spaced_out, side = "right")

## [1] "Jan" "Feb" "Mar" "Apr" "May"

You can trim from the left, right or both sides by setting side equal to “left”, “right” or “both”
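
As a quick sketch of how the side argument changes the result, the code below trims a made-up vector (messy_months) that has whitespace on different sides.

```r
library(stringr)

# Hypothetical vector with whitespace on the left, right, or both sides
messy_months <- c("  Jan", "Feb  ", "  Mar  ")

# Trimming only the left side leaves trailing whitespace in place
left_trimmed <- str_trim(messy_months, side = "left")
left_trimmed
## [1] "Jan"   "Feb  " "Mar  "

# Trimming both sides removes all leading and trailing whitespace
clean_months <- str_trim(messy_months, side = "both")
clean_months
## [1] "Jan" "Feb" "Mar"
```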

Truncate strings

str_years <- c("2000","2001","2002","2020","2050")
str_trunc(str_years, 2, side = "left", ellipsis = "")

## [1] "00" "01" "02" "20" "50"

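In this example, str_trunc shortens each string to a width of 2. Setting side = "left" removes characters from the left side (keeping the last two digits of each year), and ellipsis = "" stops str_trunc from inserting its default "..." marker where characters were removed. Check out the help file on str_trunc to learn more about each of the function parameters and the options they can take.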

Joining strings

If you want to join strings you can use the str_c function. Below are some examples

Joining multiple strings

s1 <- "The time of collection was"
s2 <- "10:25 AM EST"
s3 <- "on Jan 25th"

str_c(s1,s2,s3, sep = " ") # combine strings and use a space as separator between the strings

## [1] "The time of collection was 10:25 AM EST on Jan 25th"

Collapse a string vector

You can also use str_c to collapse a string vector into a single string

str_c(str_years,collapse = ", ") # collapse and use ", " to separate the items

## [1] "2000, 2001, 2002, 2020, 2050"
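
str_c also works element by element when one of its inputs is a vector, which is useful for building things like file names. The sketch below assumes a hypothetical naming scheme (the "rain_" prefix and ".csv" extension are invented for illustration).

```r
library(stringr)

str_years <- c("2000", "2001", "2002", "2020", "2050")

# The single strings "rain_" and ".csv" are recycled for every year
file_names <- str_c("rain_", str_years, ".csv")
file_names
## [1] "rain_2000.csv" "rain_2001.csv" "rain_2002.csv" "rain_2020.csv"
## [5] "rain_2050.csv"
```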

Replace elements of a string

There are many situations where you want to find a specific pattern and replace it with another string (e.g. replace a word with an abbreviation, replace spaces with commas,…). The str_replace_all function allows you to do this.

Let’s replace the spaces in the following string vector with dashes

str_dates <- c("Jan 2010", "Jul 2018", "Nov 2020")
str_replace_all(str_dates," ","-")

## [1] "Jan-2010" "Jul-2018" "Nov-2020"
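
stringr also has a str_replace function, which replaces only the first match in each string. The sketch below contrasts the two on a made-up string that contains more than one space.

```r
library(stringr)

# Hypothetical date string with two spaces
date_text <- "2020 01 15"

# str_replace changes only the first space
first_only <- str_replace(date_text, " ", "-")
first_only
## [1] "2020-01 15"

# str_replace_all changes every space
all_spaces <- str_replace_all(date_text, " ", "-")
all_spaces
## [1] "2020-01-15"
```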

Extracting parts of a string

Extract by indices

Just as we often want to extract parts of a data frame object, there are many situations where we need to extract parts of a string.

We can specify the indices of a string object that we would like to extract by using the str_sub function

course_title <- "ENS-215: Exploring Environmental Data"
str_sub(course_title, start = 1, end = 7) # get elements 1 through 7

## [1] "ENS-215"

str_sub(course_title, start = 10, end = 18) # get elements 10 through 18

## [1] "Exploring"

If you don’t specify the start position then str_sub will extract from the first index to your specified end. If you don’t specify the end position then str_sub will extract from the specified start to the end of the string.

Sometimes you want to extract from a string, but would like to start extracting from the end. To do this you can use a negative index

str_sub(course_title, -4) # get the last four elements of the string

## [1] "Data"
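
str_sub works on string vectors too, so positive and negative indices can be used to pull apart a whole column of values at once. The following sketch reuses the month-year strings from the replacement example; the names month_part and year_part are illustrative.

```r
library(stringr)

str_dates <- c("Jan 2010", "Jul 2018", "Nov 2020")

# Characters 1 through 3 hold the month abbreviation
month_part <- str_sub(str_dates, start = 1, end = 3)
month_part
## [1] "Jan" "Jul" "Nov"

# The last four characters hold the year; convert them to numbers
year_part <- as.numeric(str_sub(str_dates, -4))
year_part
## [1] 2010 2018 2020
```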

Split strings based on a specified character

There are many situations where you might need to split a string and there are a few approaches available (depending on the exact situation). In our class, a likely situation might be where you have a single data frame column (or a vector) that has start and end times (these might indicate start and end times of an experiment, an environmental event such as a storm, …)

storm_times <- c("10:15-14:20", "18:00-22:40", "15:00-21:30") # vector with start and end times of recent rainstorms
storm_times

## [1] "10:15-14:20" "18:00-22:40" "15:00-21:30"

We might want to split this into two columns, one with the start time and the other with the end time. We specify the character that marks where we would like to make the split (in this case the split character is the -). To do this we use the str_split_fixed function.

a <- str_split_fixed(storm_times, "-", 2) # split storm_times on the dash "-" and output the results to 2 columns
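
str_split_fixed returns a character matrix with one row per input string. One possible next step, sketched below, is to turn that matrix into a data frame with named columns. The names split_times, storm_df, start, and end are illustrative choices.

```r
library(stringr)

storm_times <- c("10:15-14:20", "18:00-22:40", "15:00-21:30")

# Character matrix: column 1 holds start times, column 2 holds end times
split_times <- str_split_fixed(storm_times, "-", 2)

# Build a data frame with descriptive column names
storm_df <- data.frame(start = split_times[, 1],
                       end = split_times[, 2],
                       stringsAsFactors = FALSE)

storm_df$start
## [1] "10:15" "18:00" "15:00"

storm_df$end
## [1] "14:20" "22:40" "21:30"
```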

Other string operations

There are tons more string operations that you can do in R. Today we just covered a few of the more common operations that you may need to do in your data analysis in the environmental sciences.

For more info to reinforce what you learned here and to see additional topics you can check out your textbook R4DS Chapter 14 and your stringr cheatsheet.

¹ Some of today’s lecture material was inspired by material from the ES218 course by Manny Gimond.