Let’s now jump into some data manipulation using the functionality in the dplyr package. This package is going to dramatically increase your data handling skills. You’ll see how, with dplyr, data manipulation operations that may have required long, complex, and difficult-to-decipher code can often be reduced to a single line that is easy to write and easy to read!
As you work through the code, pay close attention and make sure that you understand what is happening in each line of code. Check help files, ask your neighbor, and of course ask me if you have any questions or don’t fully understand something.
Note: The dplyr cheatsheet will be very helpful when working through today’s lesson.
dplyr package

We’ll actually load in tidyverse, which contains dplyr along with many other packages.
library(tidyverse)
gapminder package

We’ll also load in the gapminder package. This package has a great dataset to use when learning and showcasing the functionality of dplyr.
The gapminder dataset contains data on life expectancy, GDP per capita, and populations by country (and through time). This is a great dataset to look at for thinking about global development. As global development is inherently tied to environmental aspects (e.g. water resources, mineral resources, ecology), this dataset should be of interest to anyone studying the environment, and of particular interest to those focused on policy.
If you haven’t installed the gapminder package yet, you should go to your package window and do this. Once you have the package, let’s load it in.
library(gapminder)
Let’s print out some of the data and take a look.
gapminder
Let’s also get a summary() to learn a bit more.
summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
You should take a minute to learn more about the dataset by typing ?gapminder in your CONSOLE.

Also take a minute to look at the data using the View() function. Do this in your CONSOLE.
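If you haven’t used these before, the calls look like this (both run in the console, not in your script):

?gapminder      # open the help page for the dataset
View(gapminder) # open the data in a spreadsheet-style viewer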
You can see that there is a lot of data here, a number of variables, and many different ways we might want to filter, group, select, … the data. dplyr is going to come to the rescue here.
Before we proceed, let’s assign the gapminder data to a new object (just so you don’t have to worry about messing up the original data).
gap_data <- gapminder
filter() to subset rows of data

No longer do we need to use complex logical statements as our indices when trying to subset data along rows (recall the complex statements that you would put into the [] when subsetting). filter() makes this operation painless.

The filter() function takes your logical comparison and selects all of the rows for which the test is TRUE.
filter(gap_data, gdpPercap > 5000)
We just filtered the gap_data so that only the rows where the gdpPercap was greater than 5000 were selected. That was super easy!
We can add even more conditions to filter on.
filter(gap_data, gdpPercap > 5000, year == 2007)
We now selected all of the rows where the per capita GDP > 5000 AND the year is 2007.
You see that the syntax is
filter(dataset, criteria 1, criteria 2,..., criteria n)
Note that filter() returns the output you want but it does not modify the original dataset. This is good, since we likely want to keep using our original dataset later on in our code. Furthermore, you’ll see that since we didn’t assign the output from filter() to a data object, we haven’t created any new data objects yet. We can easily assign the output to a new object, the same way we do all other object assignments
data_2007_highGDP <- filter(gap_data, gdpPercap > 5000, year == 2007)
However, before you go ahead and make tons of new data frames to store filtered versions of your data, you should realize that we can generally pass the filtered data directly to another function (e.g. a plotting function, mean(), sum(), …), which may eliminate the need to create tons of new data frames. This will help to keep our environment clean (something you should care about as environmental scientists).
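For example, here’s a minimal sketch of that idea: computing the mean life expectancy for 2007 directly from the filtered data, without saving an intermediate data frame (the choice of mean() and lifeExp here is just for illustration).

mean(filter(gap_data, year == 2007)$lifeExp) # mean 2007 life expectancy, no new object created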
Here’s how you would filter based on a variable that has text values
filter(gap_data, gdpPercap > 5000, continent == "Asia")
Now I want you to try a few more filter operations on the gap_data. Take your time and get comfortable with the basics here.
# Your code here
filter() and %in%
If you want to filter based on several values for a given variable, you could use the OR | operator in your criteria. For example,
filter(gap_data, country == "Canada" | country == "China")
This will select the rows for both Canada and China. What I did above works, but it is cumbersome and the code is somewhat difficult to read.
We can achieve the same result and greatly simplify our code
filter(gap_data, country %in% c("Canada", "China"))
You can imagine that if you have a very long vector of items, the %in% approach is much easier than writing a bunch of OR | statements.
Try filtering the data to get only rows for Canada, China, France, India, and Argentina. You should save this set of countries to its own vector and then use that vector in your filter() statement. This will make your code much, much easier to read than defining the vector directly after the %in% inside your filter() (as I did above).
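As a hint, here is the same save-a-vector-then-%in% pattern applied to continents (just a sketch; swap in your own vector of countries):

target_continents <- c("Africa", "Oceania")        # example vector (hypothetical choice of values)
filter(gap_data, continent %in% target_continents) # keep only rows for those continents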
# Your code here
filter() is pretty nice, right?

Remember those days before dplyr, when we wanted to select rows from a data frame according to criteria?

In base R
gap_data[gap_data$year == 2007 & gap_data$gdpPercap > 5000, ] # the old way in base R
With dplyr
filter(gap_data, year == 2007, gdpPercap > 5000) # with dplyr
Both of the above code blocks do the exact same thing: select the rows in gap_data where the year is 2007 AND the per capita GDP > 5000. The dplyr method is so much easier to code and read.
arrange()

Let’s use the arrange() function to sort our dataset.
We’ll save the sorted data to another object, so that you can take a look at it more easily.
gap_sorted <- arrange(gap_data, gdpPercap)
View() the sorted data (remember to do this in the console) and confirm that everything worked.
arrange() sorts data in ascending order (smallest to largest). To sort data in descending order, use the desc() function
gap_sorted <- arrange(gap_data, desc(gdpPercap))
Take a look and make sure everything looks good. Were you surprised to see the country that had the highest per capita GDP (and the year when this occurred)? As an interesting aside, discuss with your neighbor why you think this might be the case.
Test out a few more sorts on the gap_data using the arrange() function
# Your code here
arrange() by multiple variables

You can sort data according to multiple variables using the arrange() function. Below we’ll sort the gap_data by continent first and then by life expectancy. This will give us a sorted dataset that has the life expectancies sorted within each continent
gap_sorted <- arrange(gap_data, continent, lifeExp)
That worked well, but you’ll see that we didn’t consider the year in our sorting. Let’s now sort by continent, then year, then life expectancy
# Your code here
Take a look at the data. Does it look different than when we sorted without the year? Does this particular organization seem more useful?
Repeat your code block directly above, but sort the years in descending order so that we can see the most recent data at the top of the data frame.
# Your code here
Before we go further, I want to introduce the pipe operator. The pipe operator will make your code much easier to write and read. It allows you to carry out a sequence of operations without having to nest the operations within one another.
%>% (you can also use |> as the pipe operator)

The pipe operator %>% allows you to:

1. Take the output of one function and pipe it as input to the next function
2. String together many pipes to create a single chain of operations
Let’s take a look at an example
filter(gap_data, year == 2007, continent == "Asia") %>%
arrange(gdpPercap)
The above code first filtered the data to select only the observations for year 2007 in Asia. Then it sent this filtered dataset to arrange(), where it was sorted by per capita GDP.

Notice how I did not specify the data frame to use in arrange(). This is because the pipe operator passes the output from the previous operation, so that output is necessarily the input to the arrange() function.
We could have piped the gap_data to the filter() function to make it even easier to read.
gap_data %>%
filter(year == 2007, continent == "Asia") %>%
arrange(gdpPercap)
The code below does the exact same thing but without the pipe operator. See how much easier it is to read the code that used the pipe. With the piped code we can read the sequence of operations from left to right, whereas in the nested code below we have to read from the inside out (which is much more difficult from a human perspective).
arrange(filter(gap_data, year == 2007, continent == "Asia"), gdpPercap) # without using pipe operator
FYI, the %>% pipe operator actually comes from the magrittr package, but this package is part of the tidyverse so it loads in every time we load tidyverse.
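As an aside, the native |> pipe (available in R 4.1 and later) works the same way for this kind of chain, and the pipe itself doesn’t require loading any package:

gap_data |>
  filter(year == 2007, continent == "Asia") |>
  arrange(gdpPercap)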
select() columns

We can select a subset of columns from a data frame using the select() function. This is extremely useful when you load in a data frame that might have tens or hundreds of columns and you are only interested in a few of them.
Let’s give it a test with our gap_data. We’ll select the country, year, lifeExp, and gdpPercap columns
select(gap_data, country, year, lifeExp, gdpPercap)
You can also rename the columns when you select them. Let’s give that a try and rename the gdpPercap variable to GDP_percap
select(gap_data, country, year, lifeExp, GDP_percap = gdpPercap)
This is helpful when the original dataset has column names that we don’t like for some reason (e.g. not meaningful, too long,…)
Try selecting and renaming some variables
# Your code here
If you want to select most of the columns, then it is easier to specify which ones you don’t want to keep as opposed to which ones you want to keep.
Imagine we want all of the columns in gap_data except for pop (population) and continent. It is less typing to specify the ones we don’t want. We can do this as follows
gap_data %>%
select(-pop,-continent)
Using the gap_data, do the following exercises. For each one, think about what the resulting data can tell you.
Once you finish these exercises, keep practicing on your own. Come up with some questions/ideas you would like to investigate, and use the pipe and dplyr tools to help answer them.
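For instance, here is one possible chain, just as a sketch of the kind of question you might explore (which countries in Africa had the highest life expectancy in 2007?):

gap_data %>%
  filter(year == 2007, continent == "Africa") %>% # keep only 2007 observations in Africa
  select(country, lifeExp, gdpPercap) %>%         # keep a few columns of interest
  arrange(desc(lifeExp))                          # sort highest life expectancy first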
# Your code blocks here