Let’s now jump into some data manipulation using the functionality in the dplyr
package. This package is going to increase your data handling skills by a factor of 100 or more1!
You’ll see how with dplyr
data manipulation operations that may have required long, complex, and difficult to decipher code can often be reduced to a single line that is easy to code and easy to read!
As you work through the code, pay close attention and make sure that you understand what is happening in each line of code. Check help files, ask your neighbor, and of course ask me if you have any questions or don’t fully understand something.
Note: The dplyr cheatsheet will be very helpful when working through today’s lesson
dplyr
packageWe’ll actually load in tidyverse
which contains dplyr
along with many other packages.
library(tidyverse)
gapminder
packageWe’ll also load in the the gapminder
package. This package has a great dataset to use when learning and showcasing the functionality of dplyr
.
The gapminder
dataset contains data on life expectancy, GDP per capita, and populations by country (and through time). This is a great dataset to look at for thinking about global development. As global development is inherently tied to environmental aspects (e.g. water resources, mineral resources, ecology) this dataset should be of interest to anyone studying the environment, and of particular interest to those focused on policy.
If you haven’t installed the gapminder
package yet, you should go to your package window and do this. Once you have the package let’s load it in.
library(gapminder)
Let’s print out some of the data and take a look.
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
Let’s also get a summary()
to learn a bit more
summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
You should take a minute to learn more about the dataset by typing ?gapminder
in your CONSOLE.
Also take a minute to look at the data by using the View()
function. Do this in your CONSOLE.
You can see that there is a lot of data here, a number of variables, and many different ways we might want to filter, group, select,… the data. dplyr
is going to come to the rescue here.
Before we proceed, let’s assign the gapminder data to a new object (just so you aren’t worrying about messing up the original data)
gap_data <- gapminder
filter()
to subset rows of dataNo longer do we need to use complex logical statements as our indices when trying to subset data along rows (recall the complex statements that you would put into the []
when subsetting). filter()
makes this operation painless.
The filter()
function takes your logical comparison and selects all of the rows for which the test is TRUE
.
filter(gap_data, gdpPercap > 5000)
## # A tibble: 692 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 2007 76.4 3600523 5937.
## 2 Algeria Africa 1982 61.4 20033753 5745.
## 3 Algeria Africa 1987 65.8 23254956 5681.
## 4 Algeria Africa 1992 67.7 26298373 5023.
## 5 Algeria Africa 2002 71.0 31287142 5288.
## 6 Algeria Africa 2007 72.3 33333216 6223.
## 7 Angola Africa 1967 36.0 5247469 5523.
## 8 Angola Africa 1972 37.9 5894858 5473.
## 9 Argentina Americas 1952 62.5 17876956 5911.
## 10 Argentina Americas 1957 64.4 19610538 6857.
## # ... with 682 more rows
We just filtered the gap_data so that only the rows where the gdpPercap was greater than 5000 were selected. That was super easy!
We can add even more conditions to filter on.
filter(gap_data, gdpPercap > 5000, year == 2007)
## # A tibble: 76 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 2007 76.4 3600523 5937.
## 2 Algeria Africa 2007 72.3 33333216 6223.
## 3 Argentina Americas 2007 75.3 40301927 12779.
## 4 Australia Oceania 2007 81.2 20434176 34435.
## 5 Austria Europe 2007 79.8 8199783 36126.
## 6 Bahrain Asia 2007 75.6 708573 29796.
## 7 Belgium Europe 2007 79.4 10392226 33693.
## 8 Bosnia and Herzegovina Europe 2007 74.9 4552198 7446.
## 9 Botswana Africa 2007 50.7 1639131 12570.
## 10 Brazil Americas 2007 72.4 190010647 9066.
## # ... with 66 more rows
We now selected all of the rows where the per capita GDP > 5000 AND the year is 2007.
You see that the syntax is filter(dataset, criteria 1, criteria 2,..., criteria n)
Note that filter()
returns the output you want but it does not modify the original dataset. This is good, since we likely want to keep using our original dataset later on in our code.
Furthermore, you’ll see that since we didn’t assign the output from filter()
to a data object, we haven’t created any new data objects yet. We can easily assign the output to a new object, the same way we do all other object assignments
data_2007_highGDP <- filter(gap_data, gdpPercap > 5000, year == 2007)
However, before you go ahead and make tons of new data frames to store filtered versions of your data, you should realize that generally we can pass the filtered data directly to another function (e.g. a plotting function, mean, sum,…) which may eliminate the need to creating tons of new data frames. This will help to keep our environment clean (something you should care about as environmental scientists).
Here’s how you would filter based on a variable that has text values
filter(gap_data, gdpPercap > 5000, continent == "Asia")
## # A tibble: 133 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Bahrain Asia 1952 50.9 120447 9867.
## 2 Bahrain Asia 1957 53.8 138655 11636.
## 3 Bahrain Asia 1962 56.9 171863 12753.
## 4 Bahrain Asia 1967 59.9 202182 14805.
## 5 Bahrain Asia 1972 63.3 230800 18269.
## 6 Bahrain Asia 1977 65.6 297410 19340.
## 7 Bahrain Asia 1982 69.1 377967 19211.
## 8 Bahrain Asia 1987 70.8 454612 18524.
## 9 Bahrain Asia 1992 72.6 529491 19036.
## 10 Bahrain Asia 1997 73.9 598561 20292.
## # ... with 123 more rows
Now I want you to try a few more filter operations on the gap_data
. Take your time and get comfortable with the basics here.
# Your code here
filter()
and in
If you want to filter based on several values for a given variable, you could use the OR |
operator in your criteria. For example,
filter(gap_data, country == "Canada" | country == "China")
## # A tibble: 24 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Canada Americas 1952 68.8 14785584 11367.
## 2 Canada Americas 1957 70.0 17010154 12490.
## 3 Canada Americas 1962 71.3 18985849 13462.
## 4 Canada Americas 1967 72.1 20819767 16077.
## 5 Canada Americas 1972 72.9 22284500 18971.
## 6 Canada Americas 1977 74.2 23796400 22091.
## 7 Canada Americas 1982 75.8 25201900 22899.
## 8 Canada Americas 1987 76.9 26549700 26627.
## 9 Canada Americas 1992 78.0 28523502 26343.
## 10 Canada Americas 1997 78.6 30305843 28955.
## # ... with 14 more rows
This will select the rows for the both Canada and China. What I did above works, but it is cumbersome and somewhat difficult to read the code.
We can achieve the same result and greatly simplify our code
filter(gap_data, country %in% c("Canada", "China"))
## # A tibble: 24 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Canada Americas 1952 68.8 14785584 11367.
## 2 Canada Americas 1957 70.0 17010154 12490.
## 3 Canada Americas 1962 71.3 18985849 13462.
## 4 Canada Americas 1967 72.1 20819767 16077.
## 5 Canada Americas 1972 72.9 22284500 18971.
## 6 Canada Americas 1977 74.2 23796400 22091.
## 7 Canada Americas 1982 75.8 25201900 22899.
## 8 Canada Americas 1987 76.9 26549700 26627.
## 9 Canada Americas 1992 78.0 28523502 26343.
## 10 Canada Americas 1997 78.6 30305843 28955.
## # ... with 14 more rows
You can imagine if you have a very long vector of items, the %in%
approach is much easier than writing a bunch of OR |
statements.
Try filtering the data to get only rows with Canada, China, France, India, and Argentina. You should save this set of countries to its own vector and then use this vector in your filter()
statement. This will make your code much, much easier to read then defining the vector in your filter()
directly after the %in%
(as I did above).
# Your code here
filter()
is pretty nice right?Remember those days before dplyr
when we wanted to select rows from a data frame according to criteria.
In base R
gap_data[gap_data$year == 2007 & gap_data$gdpPercap > 5000, ] # the old way in base R
## # A tibble: 76 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 2007 76.4 3600523 5937.
## 2 Algeria Africa 2007 72.3 33333216 6223.
## 3 Argentina Americas 2007 75.3 40301927 12779.
## 4 Australia Oceania 2007 81.2 20434176 34435.
## 5 Austria Europe 2007 79.8 8199783 36126.
## 6 Bahrain Asia 2007 75.6 708573 29796.
## 7 Belgium Europe 2007 79.4 10392226 33693.
## 8 Bosnia and Herzegovina Europe 2007 74.9 4552198 7446.
## 9 Botswana Africa 2007 50.7 1639131 12570.
## 10 Brazil Americas 2007 72.4 190010647 9066.
## # ... with 66 more rows
With dplyr
filter(gap_data, year == 2007, gdpPercap > 5000) # with dplyr
## # A tibble: 76 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 2007 76.4 3600523 5937.
## 2 Algeria Africa 2007 72.3 33333216 6223.
## 3 Argentina Americas 2007 75.3 40301927 12779.
## 4 Australia Oceania 2007 81.2 20434176 34435.
## 5 Austria Europe 2007 79.8 8199783 36126.
## 6 Bahrain Asia 2007 75.6 708573 29796.
## 7 Belgium Europe 2007 79.4 10392226 33693.
## 8 Bosnia and Herzegovina Europe 2007 74.9 4552198 7446.
## 9 Botswana Africa 2007 50.7 1639131 12570.
## 10 Brazil Americas 2007 72.4 190010647 9066.
## # ... with 66 more rows
Both of the above code blocks do the exact same thing: select the rows in gap_data
where the year is 2007 AND the per capita GDP > 5000.
The dplyr
method is so much easier to code and read.
arrange()
Let’s use the arrange()
function to sort our dataset.
We’ll save to sorted data to another object, so that you can take a look at it easier.
gap_sorted <- arrange(gap_data, gdpPercap)
View()
the sorted data (remember to type do this in the console) and confirm that everything worked.
arrange()
sorts data in ascending order (smallest to largest). To sort data in descending order use the desc()
function
gap_sorted <- arrange(gap_data, desc(gdpPercap))
Take a look and make sure everything looks good. Were you surprised to see the country that had the highest GDP (and the year when this occurred)? As a interesting aside, discuss with your neighbor why you think this might be the case.
Test out a few more sorts on the gap_data
using the arrange()
function
# Your code here
arrange()
by multiple variablesYou can sort data according to multiple variables using the arrange()
function. Below we’ll sort the gap_data
by continent first and then by life expectancy. This will give us a sorted dataset that will have the life expectancies sorted within each continent
gap_sorted <- arrange(gap_data, continent, lifeExp)
That worked well, but you’ll see that we didn’t consider the year in our sorting. Let’s now sort by continent, then year, then life expectancy
# your code here
Take a look at the data. Does it look different than when we sorted without the year. Does this particular organization seem more useful?
Repeat your code block directly above, but sort the years in descending order so that we can see the most recent data at the top of the data frame.
# Your code here
Before we go further, I want to introduce the pipe operator. The pipe operator will make your code much easier to write and read. It allows you to carry out a sequence of operations without having to nest the operations within one another.
%>%
%>%
is
The pipe %>%
operator allows you to 1. Take output of one function and pipe it as input to the next function 2. String together many pipes to create a single chain of operations
Let’s take a look at an example
filter(gap_data, year == 2007, continent == "Asia") %>% arrange(gdpPercap)
## # A tibble: 33 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Myanmar Asia 2007 62.1 47761980 944
## 2 Afghanistan Asia 2007 43.8 31889923 975.
## 3 Nepal Asia 2007 63.8 28901790 1091.
## 4 Bangladesh Asia 2007 64.1 150448339 1391.
## 5 Korea, Dem. Rep. Asia 2007 67.3 23301725 1593.
## 6 Cambodia Asia 2007 59.7 14131858 1714.
## 7 Yemen, Rep. Asia 2007 62.7 22211743 2281.
## 8 Vietnam Asia 2007 74.2 85262356 2442.
## 9 India Asia 2007 64.7 1110396331 2452.
## 10 Pakistan Asia 2007 65.5 169270617 2606.
## # ... with 23 more rows
The above code first filtered the data to select only the observations for year 2007 and in Asia. Then it sent this filtered dataset arrange()
where it was sorted by per capita GDP.
Notice, how I did not specify the data frame to use in arrange()
. This is because the pipe operator passes the output from the previous operation so this output is necessarily the input to the arrange()
function.
We could have piped the gap_data
to the filter()
function to make it even easier to read.
gap_data %>% filter(year == 2007, continent == "Asia") %>% arrange(gdpPercap)
## # A tibble: 33 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Myanmar Asia 2007 62.1 47761980 944
## 2 Afghanistan Asia 2007 43.8 31889923 975.
## 3 Nepal Asia 2007 63.8 28901790 1091.
## 4 Bangladesh Asia 2007 64.1 150448339 1391.
## 5 Korea, Dem. Rep. Asia 2007 67.3 23301725 1593.
## 6 Cambodia Asia 2007 59.7 14131858 1714.
## 7 Yemen, Rep. Asia 2007 62.7 22211743 2281.
## 8 Vietnam Asia 2007 74.2 85262356 2442.
## 9 India Asia 2007 64.7 1110396331 2452.
## 10 Pakistan Asia 2007 65.5 169270617 2606.
## # ... with 23 more rows
The code below does the exact same thing but without the pipe operator. See how much easier it is to read the code that used pipe. With the piped code we can read the sequence of operation from left to right, whereas in the nested code below we had to read from inside out (which is much more difficult from a human perspective).
arrange(filter(gap_data, year == 2007, continent == "Asia"), gdpPercap) # without using pipe operator
## # A tibble: 33 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Myanmar Asia 2007 62.1 47761980 944
## 2 Afghanistan Asia 2007 43.8 31889923 975.
## 3 Nepal Asia 2007 63.8 28901790 1091.
## 4 Bangladesh Asia 2007 64.1 150448339 1391.
## 5 Korea, Dem. Rep. Asia 2007 67.3 23301725 1593.
## 6 Cambodia Asia 2007 59.7 14131858 1714.
## 7 Yemen, Rep. Asia 2007 62.7 22211743 2281.
## 8 Vietnam Asia 2007 74.2 85262356 2442.
## 9 India Asia 2007 64.7 1110396331 2452.
## 10 Pakistan Asia 2007 65.5 169270617 2606.
## # ... with 23 more rows
FYI, the pipe operator is actually in the magrittr
package, but this package is part of tidyverse
so it loads in everytime we load tidyverse
.
select()
columnsWe can select a subset of columns from a data frame using the select()
function. This is extremely useful when you load in a data frame that might have tens or hundreds of columns and you are only interested.
Let’s give it a test with our gap_data
. We’ll select the country, year, lifeExp, and gdpPercap columns
select(gap_data, country, year, lifeExp, gdpPercap)
## # A tibble: 1,704 x 4
## country year lifeExp gdpPercap
## <fct> <int> <dbl> <dbl>
## 1 Afghanistan 1952 28.8 779.
## 2 Afghanistan 1957 30.3 821.
## 3 Afghanistan 1962 32.0 853.
## 4 Afghanistan 1967 34.0 836.
## 5 Afghanistan 1972 36.1 740.
## 6 Afghanistan 1977 38.4 786.
## 7 Afghanistan 1982 39.9 978.
## 8 Afghanistan 1987 40.8 852.
## 9 Afghanistan 1992 41.7 649.
## 10 Afghanistan 1997 41.8 635.
## # ... with 1,694 more rows
You can also rename the columns when you select them. Let’s give that a try and rename the gdpPercap
variable to GDP_percap
select(gap_data, country, year, lifeExp, GDP_percap = gdpPercap)
## # A tibble: 1,704 x 4
## country year lifeExp GDP_percap
## <fct> <int> <dbl> <dbl>
## 1 Afghanistan 1952 28.8 779.
## 2 Afghanistan 1957 30.3 821.
## 3 Afghanistan 1962 32.0 853.
## 4 Afghanistan 1967 34.0 836.
## 5 Afghanistan 1972 36.1 740.
## 6 Afghanistan 1977 38.4 786.
## 7 Afghanistan 1982 39.9 978.
## 8 Afghanistan 1987 40.8 852.
## 9 Afghanistan 1992 41.7 649.
## 10 Afghanistan 1997 41.8 635.
## # ... with 1,694 more rows
This is helpful when the original dataset has column names that we don’t like for some reason (e.g. not meaningful, too long,…)
Try selecting and renaming some variables
# Your code here
If you want to select most of the columns, then it is easier to specify which ones you don’t want to keep as opposed to which ones you want to keep.
Imagine we want all of the columns in gap_data
except for pop
(population) and continent
. It is less typing to specify the ones we don’t want. We can do this as follows
gap_data %>% select(-pop,-continent)
## # A tibble: 1,704 x 4
## country year lifeExp gdpPercap
## <fct> <int> <dbl> <dbl>
## 1 Afghanistan 1952 28.8 779.
## 2 Afghanistan 1957 30.3 821.
## 3 Afghanistan 1962 32.0 853.
## 4 Afghanistan 1967 34.0 836.
## 5 Afghanistan 1972 36.1 740.
## 6 Afghanistan 1977 38.4 786.
## 7 Afghanistan 1982 39.9 978.
## 8 Afghanistan 1987 40.8 852.
## 9 Afghanistan 1992 41.7 649.
## 10 Afghanistan 1997 41.8 635.
## # ... with 1,694 more rows
Using the gap_data
do the following:
Think about what the above data can tell you
Think about what the above data can tell you
Think about what the above data can tell you
Once you finish these exercises, keep practicing on your own. Come up with some questions/ideas you would like to investigate and use the pipe and dplyr
tools to help answer them
# Your code blocks here
Based on my analysis you can expect an average improvement by a factor of 128.93 \(\pm\) 2.45.↩︎