You’ve seen thus far, that in addition to numbers, we often deal with characters (text) in our datasets of interest. We often want to search or perform some kind of manipulation of the character values in our dataset. R has a host of tools/functions for dealing with operations on character data (FYI, in computer science a set of characters is called a string) and there is a great package called stringr
that provides many additional and easy to implement functions for handling strings. We will primarily rely on the stringr
package, which simplifies string operations, much like dplyr
does for data manipulation/wrangling operations1.
IMPORTANT NOTE: As you work through today’s lecture add your own code block for each of the techniques learned and test out your own example.
stringr
The stringr
package is part of the tidyverse
package collection, so you’ve actually installed stringr
already. Let’s load in the stringr
package so that we can use its functionality later on in the lecture. We could load tidyverse
and this would load in stringr
along with a bunch of other packages or we could just load in stringr
. Either will work.
library(tidyverse)
When working with strings we often need to locate some pattern. We might need to search a text file for a particular keyword (e.g. the name of a US state of interest) or sequence of letters (e.g. a biologist searching through DNA sequence for a particular gene).
We’ll learn some techniques for performing these types of operations.
A very basic operation that we will frequently perform is to search for an exact match in a string. We can do this using conditional statements to check for equality. You’ve actually already done this a few times when checking for state abbreviations in your datasets, but now we’ll learn the underlying concepts.
str_1 <- "Hello how are you"
str_1 == "Hello how are you"
## [1] TRUE
You see that we assigned “Hello how are you” to the string object called str_1
and then we tested to see if str_1
was equal to “Hello how are you”. The test, unsurprisingly returned TRUE
.
Note that R is CASE SENSITIVE and thus the following will return a FALSE
str_1 == "hello how are you"
## [1] FALSE
In some cases, we’ll want to ignore case sensitivity. We can do this by forcing all of the strings to lower-case by using the tolower
function.
Let’s try out tolower
str_1 # print out str_1
## [1] "Hello how are you"
tolower(str_1) # print out str_1 in all lower-case
## [1] "hello how are you"
Now let’s to a comparison where we force str_1
to be all lower-case
tolower(str_1) == "hello how are you"
## [1] TRUE
Since str_1
was mixed-case (i.e. had lower and upper-case letters), we forced it to lower-case. If you are interested in ignoring case sensitivity it is actually good practice force all of the strings being compared to lower-case. The example below should highlight why you would do this.
str_1 <- "Hello how are you"
str_2 <- "heLLo How aRe yOU"
The following should return a FALSE
tolower(str_1) == str_2
## [1] FALSE
The following should return a TRUE
, since we converted everything to lower-case.
tolower(str_1) == tolower(str_2)
## [1] TRUE
FYI, similar to tolower
there is a toupper
function and I’m sure you can guess what it does
my_tweet <- "this is how i write all my tweets"
toupper(my_tweet)
## [1] "THIS IS HOW I WRITE ALL MY TWEETS"
If I were writing a book I might want to convert the string to title format. You can do this with the str_to_title
function
str_to_title(my_tweet)
## [1] "This Is How I Write All My Tweets"
Finding partial matches is frequently required when dealing with strings. For instance, we might want to search for a single character or a particular word in a longer string.
To see if our str_1
object below contains the pattern "ab"
anywhere in the string we use the str_detect
function in stringr
. The syntax is str_detect(object to search, pattern to search for)
.
str_1 <- "abc bca cab aaab"
str_detect(str_1, "ab")
## [1] TRUE
str_detect(str_1, "xyz")
## [1] FALSE
You can include spaces in the pattern you want to look for. Thus, the following code should return TRUE
.
str_detect(str_1, "a c")
## [1] TRUE
Whereas this code should return FALSE
, since there is no place in str_1
where "ac"
occurs
str_detect(str_1, "ac")
## [1] FALSE
To check if a string starts with a pattern you use the carat ^
character in front of the pattern
str_detect(str_1, "^ab")
## [1] TRUE
str_detect(str_1, "^ac")
## [1] FALSE
We can also check if a string ends with a particular pattern. To do this we put the dollar sign character $
at the end of the pattern.
str_detect(str_1, "aab$")
## [1] TRUE
Often times we need to not just determine if a pattern is found within a string, but also where the particular pattern sits within that string. To do this we use the str_locate
function. Let’s see an example.
str_locate(str_1, "b a")
## start end
## [1,] 11 13
You can see that the str_locate
function returns the start and end position of the pattern within the string of interest.
If the pattern does not exist, the str_locate
will return a value of NA
for the start and end positions.
str_locate(str_1, "ABC123")
## start end
## [1,] NA NA
If a pattern exists more than once within the searched string, then str_locate
will only return the positions of the first occurrence
str_3 <- "Ab then Ab"
str_locate(str_3, "Ab")
## start end
## [1,] 1 2
To locate ALL of the occurrences, you can use the str_locate_all
function
str_locate_all(str_3, "Ab")
## [[1]]
## start end
## [1,] 1 2
## [2,] 9 10
Let’s run the code above again, but this time we’ll save the output to data object
pattern_position <- str_locate_all(str_3, "Ab")
If you look in your Environment pane you can see that str_locate
returns a list
data object.
Since we data frames are nice to work with (and we have lots of experience working with them) we can convert the output to a data frame using the as.data.frame
function
pattern_position <- as.data.frame(pattern_position)
However, there is a reason why str_locate_all
returns the output as a list
as opposed to a data frame
. In some situations, it is not possible to format the output as a data frame
. The code below presents an example illustrating this point.
# Create a 5 element string vector
str_4 <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")
# Search for all instances of "Ab"
str_locate_all(str_4,"Ab")
## [[1]]
## start end
## [1,] 1 2
##
## [[2]]
## start end
##
## [[3]]
## start end
## [1,] 1 2
## [2,] 9 10
##
## [[4]]
## start end
##
## [[5]]
## start end
The str_4
object is a five element string vector. The 1st element in the vector is "Abc"
, the 2nd element is "Def"
, …
Thus, str_locate_all
returns a result for EACH element of the str_4
vector. Since the pattern can be found multiple times in a given element of str_4
(e.g. the 3rd element has two instances of “Ab”), then the most convenient data structure for the output is a list
.
When we have a string vector
(i.e. a vector containing strings as its elements), we can locate the position of the element within the vector that contains a pattern of interest by using the str_which
function.
Let’s take a look at an example below
# Create a 5 element string vector
str_4 <- c("Abc", "Def ", "Abc Def Ab", " bc ", "ef ")
# Search for all instances of "Ab"
str_which(str_4,"Ab")
## [1] 1 3
You see that str_which
returns 1 and 3, which indicates that the pattern "Ab"
is found in the 1st element (i.e “Abc”) and the 3rd element (i.e. “Abc Def Ab”) of the str_4
vector
In addition to locating patterns in strings, we often need to determine the length of a string. We can do this using the str_length
function
string_cheese <- "I like mozzarella, cheddar, and feta"
str_length(string_cheese)
## [1] 36
The above tells us that string_cheese
is 36 characters long (including spaces, punctuation, etc.)
If we have a string vector
then str_length
outputs the length of each element in the vector
box_of_cheese <- c("cheddar", "mozzarella", "feta", "sharp cheddar")
str_length(box_of_cheese)
## [1] 7 10 4 13
We can also count how many times (i.e. frequency) a particular pattern occurs within a string using the str_count
function
On a string
str_ens215 <- "In this class we are learning data analysis in R"
str_count(str_ens215, "ar")
## [1] 2
On a string vector
str_count(box_of_cheese, "ar")
## [1] 1 1 0 2
In many instances you will want to modify the length of a string, either by adding to the string “padding”, or by removing from the string. These operations are often used to ensure that a list or set of strings have consistent formatting. Below we’ll see how (and why) we do this.
Imagine you have a list of days in numeric format. In the case of days of the month you could represent the 2nd day with either a 2
or 02
. The 02
format is often useful when you are naming files/objects since they will sort better in a file folder (e.g. while files are ordered 2
would come before 10
which in the case of dates you would not want, whereas 02
would not). This is just one example of where you would want to pad the strings.
days_vec <- c("1","2","3","10","11","5","24") # create a list of days
str_pad(days_vec, width = 2, side = "left", pad = "0")
## [1] "01" "02" "03" "10" "11" "05" "24"
The width
controls the length to pad until, thus the “10”, “11”, and “24” were not padded since they are already of length 2. The side
controls where the padding occurs (either left or right of string) and the pad
specifies the character to use in the padding.
Try out the str_pad
function on your own
# Your code here
There are also cases where you will want to shorten the length of strings.
One case where you may want to shorten strings is when there is whitespace on either or both ends of a string. To do this you use the str_trim
function.
spaced_out <- c("Jan ", "Feb ", "Mar ", "Apr ", "May ")
str_trim(spaced_out, side = "right")
## [1] "Jan" "Feb" "Mar" "Apr" "May"
You can trim from the left, right or both sides by setting side
equal to “left”, “right” or “both”
str_years <- c("2000","2001","2002","2020","2050")
str_trunc(str_years, 2, side = "left", ellipsis = "")
## [1] "00" "01" "02" "20" "50"
Check out the help file on str_trunc
to learn what each of the function parameters is doing and what options they can take.
If you want to join strings you can use the str_c
function. Below are some examples
s1 <- "The time of collection was"
s2 <- "10:25 AM EST"
s3 <- "on Jan 25th"
str_c(s1,s2,s3, sep = " ") # combine strings and use a space as separator between the strings
## [1] "The time of collection was 10:25 AM EST on Jan 25th"
You can also use str_c
to collapse a string vector into a single string
str_c(str_years,collapse = ", ") # collapse and use ", " to separate the items
## [1] "2000, 2001, 2002, 2020, 2050"
There are many situations where you want to find a specific pattern and replace it with another string (e.g. replace a word with an abbreviation, replace spaces with commas,…). The str_replace_all
function allows you to do this.
Let’s replace the spaces in the following vector string with dashes
str_dates <- c("Jan 2010", "Jul 2018", "Nov 2020")
str_replace_all(str_dates," ","-")
## [1] "Jan-2010" "Jul-2018" "Nov-2020"
Just as we often want to extract parts of a data frame object, there are similarly many situations were we need to extract parts of a string.
We can specify the indices of a string object that we would like to extract by using the str_sub
function
course_title <- "ENS-215: Exploring Environmental Data"
str_sub(course_title, start = 1, end = 7) # get elements 1 through 7
## [1] "ENS-215"
str_sub(course_title, start = 10, end = 18) # get elements 10 through 18
## [1] "Exploring"
If you don’t specify the start
position then str_sub
will extract from the first index to your specified end. If you don’t specify the end
position then str_subt
will extract from the specified start
to the end of the string.
Sometimes you want to extract from a string, but would like to start extracting from the end. To do this you can use a negative index
str_sub(course_title, -4) # get the last four elements of the string
## [1] "Data"
There are many situations where you might need to split a string and there are a few approaches available (depending on the exact situation). In our class, a likely situation might be where you have a single data frame column (or a vector) that has start and end times (these might indicate start and end times of an experiment, an environmental event such as a storm, …)
storm_times <- c("10:15-14:20", "18:00-22:40", "15:00-21:30") # vector with start and end times of recent rainstorms
storm_times
## [1] "10:15-14:20" "18:00-22:40" "15:00-21:30"
We might want to split this into two columns, one that has the start time and the other having the end time. We will specify the character that indicates the location we would like to make the split (in this case the split character is the -
). To do this we use the str_split_fixed
funtion.
a<- str_split_fixed(storm_times,"-", 2) # split storm_times based on the dash "-" and output results to 2 columns
There are tons more string operations that you can do in R. Today we just covered a few of the more common operations that you may need to do in your data analysis in the environmental sciences.
For more info to reinforce what you learned here and to see additional topics you can check out your textbook R4DS Chapter 14 and your stringr
cheatsheet.
Some of today’s lecture material was inspired by material from the ES218 course by Manny Gimond↩︎