Let's now jump into some data manipulation using the functionality in the dplyr package. This package is going to dramatically increase your data handling skills.

You'll see how, with dplyr, data manipulation operations that may have required long, complex, and difficult-to-decipher code can often be reduced to a single line that is easy to write and easy to read!
As you work through the code, pay close attention and make sure that you understand what is happening in each line of code. Check help files, ask your neighbor, and of course ask me if you have any questions or don’t fully understand something.
Note: The dplyr cheatsheet will be very helpful when working through today's lesson.
## dplyr package

We'll load in tidyverse, which contains dplyr along with many other packages.

```r
library(tidyverse)
```

## gapminder package

We'll also load in the gapminder package. This package has a great dataset to use when learning and showcasing the functionality of dplyr.
The gapminder dataset contains data on life expectancy, GDP per capita, and populations by country (and through time). This is a great dataset to look at for thinking about global development. As global development is inherently tied to environmental aspects (e.g. water resources, mineral resources, ecology) this dataset should be of interest to anyone studying the environment, and of particular interest to those focused on policy.
If you haven’t installed the gapminder package yet, you should go to your package window and do this. Once you have the package let’s load it in.
```r
library(gapminder)
```

Let's print out some of the data and take a look.

```r
gapminder
```

Let's also get a summary() to learn a bit more.
```r
summary(gapminder)
```

```
      country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85
 Australia  :  12                  Max.   :2007   Max.   :82.60
 (Other)    :1632                                              
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
```
You should take a minute to learn more about the dataset by typing ?gapminder in your CONSOLE.
Also take a minute to look at the data by using the View() function. Do this in your CONSOLE.
You can see that there is a lot of data here, a number of variables, and many different ways we might want to filter, group, select,… the data. dplyr is going to come to the rescue here.
Before we proceed, let's assign the gapminder data to a new object (just so you aren't worrying about messing up the original data).

```r
gap_data <- gapminder
```

## filter() to subset rows of data

No longer do we need to use complex logical statements as our indices when trying to subset data along rows (recall the complex statements that you would put into the [] when subsetting). filter() makes this operation painless.
The filter() function takes your logical comparison and selects all of the rows for which the test is TRUE.
```r
filter(gap_data, gdpPercap > 5000)
```

We just filtered the gap_data so that only the rows where the gdpPercap was greater than 5000 were selected. That was super easy!
We can add even more conditions to filter on.
```r
filter(gap_data, gdpPercap > 5000, year == 2007)
```

We now selected all of the rows where the per capita GDP > 5000 AND the year is 2007.

You see that the syntax is filter(dataset, criteria 1, criteria 2, ..., criteria n).
Note that filter() returns the output you want but it does not modify the original dataset. This is good, since we likely want to keep using our original dataset later on in our code.
Furthermore, you'll see that since we didn't assign the output from filter() to a data object, we haven't created any new data objects yet. We can easily assign the output to a new object, the same way we do all other object assignments.

```r
data_2007_highGDP <- filter(gap_data, gdpPercap > 5000, year == 2007)
```

However, before you go ahead and make tons of new data frames to store filtered versions of your data, you should realize that we can generally pass the filtered data directly to another function (e.g. a plotting function, mean(), sum(), ...), which may eliminate the need to create tons of new data frames. This will help to keep our environment clean (something you should care about as environmental scientists).
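For example, here is a minimal sketch (assuming dplyr and gapminder are loaded, and gap_data has been assigned as above) that passes the filtered data straight into mean() without ever saving an intermediate data frame:

```r
library(dplyr)
library(gapminder)

gap_data <- gapminder

# Pass the filtered rows directly to mean(): no intermediate object needed.
# Here we compute the mean life expectancy across all countries in 2007.
mean(filter(gap_data, year == 2007)$lifeExp)
```

The filtered data frame exists only for the duration of the call, so your environment stays clean.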
Here's how you would filter based on a variable that has text values.

```r
filter(gap_data, gdpPercap > 5000, continent == "Asia")
```

Now I want you to try a few more filter operations on the gap_data. Take your time and get comfortable with the basics here.
```r
# Your code here
```

## filter() and %in%

If you want to filter based on several values for a given variable, you could use the OR operator | in your criteria. For example,

```r
filter(gap_data, country == "Canada" | country == "China")
```

This will select the rows for both Canada and China. What I did above works, but it is cumbersome and the code is somewhat difficult to read.
We can achieve the same result and greatly simplify our code.

```r
filter(gap_data, country %in% c("Canada", "China"))
```

You can imagine that if you have a very long vector of items, the %in% approach is much easier than writing a bunch of OR | statements.

Try filtering the data to get only rows with Canada, China, France, India, and Argentina. You should save this set of countries to its own vector and then use this vector in your filter() statement. This will make your code much, much easier to read than defining the vector directly after the %in% in your filter() (as I did above).
```r
# Your code here
```

## filter() is pretty nice, right?

Remember those days before dplyr, when we wanted to select rows from a data frame according to criteria?

In base R:

```r
gap_data[gap_data$year == 2007 & gap_data$gdpPercap > 5000, ] # the old way in base R
```

With dplyr:

```r
filter(gap_data, year == 2007, gdpPercap > 5000) # with dplyr
```

Both of the above code blocks do the exact same thing: select the rows in gap_data where the year is 2007 AND the per capita GDP > 5000.

The dplyr method is so much easier to code and read.
## arrange()

Let's use the arrange() function to sort our dataset.

We'll save the sorted data to another object so that you can take a look at it more easily.

```r
gap_sorted <- arrange(gap_data, gdpPercap)
```

View() the sorted data (remember to do this in the console) and confirm that everything worked.
arrange() sorts data in ascending order (smallest to largest). To sort data in descending order, use the desc() function.

```r
gap_sorted <- arrange(gap_data, desc(gdpPercap))
```

Take a look and make sure everything looks good. Were you surprised to see the country that had the highest per capita GDP (and the year when this occurred)? As an interesting aside, discuss with your neighbor why you think this might be the case.
Test out a few more sorts on the gap_data using the arrange() function.

```r
# Your code here
```

## arrange() by multiple variables

You can sort data according to multiple variables using the arrange() function. Below we'll sort the gap_data by continent first and then by life expectancy. This will give us a sorted dataset with the life expectancies sorted within each continent.

```r
gap_sorted <- arrange(gap_data, continent, lifeExp)
```

That worked well, but you'll see that we didn't consider the year in our sorting. Let's now sort by continent, then year, then life expectancy.
```r
# Your code here
```

Take a look at the data. Does it look different than when we sorted without the year? Does this particular organization seem more useful?

Repeat your code block directly above, but sort the years in descending order so that we can see the most recent data at the top of the data frame.

```r
# Your code here
```

Before we go further, I want to introduce the pipe operator. The pipe operator will make your code much easier to write and read. It allows you to carry out a sequence of operations without having to nest the operations within one another.
## The pipe operator %>% (you can also use |> as the pipe operator)

The pipe operator %>% allows you to:

1. Take the output of one function and pipe it in as input to the next function
2. String together many pipes to create a single chain of operations
Let’s take a look at an example
```r
filter(gap_data, year == 2007, continent == "Asia") %>%
  arrange(gdpPercap)
```

The above code first filtered the data to select only the observations from year 2007 in Asia. Then it sent this filtered dataset to arrange(), where it was sorted by per capita GDP.

Notice how I did not specify the data frame to use in arrange(). This is because the pipe operator passes along the output from the previous operation, so that output automatically becomes the input to the arrange() function.
We could have piped the gap_data to the filter() function to make it even easier to read.

```r
gap_data %>%
  filter(year == 2007, continent == "Asia") %>%
  arrange(gdpPercap)
```

The code below does the exact same thing but without the pipe operator. See how much easier it is to read the code that used the pipe. With the piped code we can read the sequence of operations from left to right, whereas in the nested code below we have to read from the inside out (which is much more difficult for a human).

```r
arrange(filter(gap_data, year == 2007, continent == "Asia"), gdpPercap) # without using pipe operator
```

FYI, the pipe operator %>% actually comes from the magrittr package, but this package is part of tidyverse, so it loads in every time we load tidyverse.
## select() columns

We can select a subset of columns from a data frame using the select() function. This is extremely useful when you load in a data frame that might have tens or hundreds of columns and you are only interested in a few of them.
Let's give it a test with our gap_data. We'll select the country, year, lifeExp, and gdpPercap columns.

```r
select(gap_data, country, year, lifeExp, gdpPercap)
```

You can also rename the columns when you select them. Let's give that a try and rename the gdpPercap variable to GDP_percap.

```r
select(gap_data, country, year, lifeExp, GDP_percap = gdpPercap)
```

This is helpful when the original dataset has column names that we don't like for some reason (e.g. not meaningful, too long, ...).
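As an aside, if you only want to change a column name without dropping any columns, dplyr also provides rename(). A small sketch (assuming dplyr and gapminder are loaded):

```r
library(dplyr)
library(gapminder)

# rename() changes the name but keeps all of the columns,
# unlike select(), which keeps only the columns you list
gap_renamed <- rename(gapminder, GDP_percap = gdpPercap)
names(gap_renamed)
```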
Try selecting and renaming some variables.

```r
# Your code here
```

If you want to select most of the columns, then it is easier to specify which ones you don't want to keep as opposed to which ones you want to keep.
Imagine we want all of the columns in gap_data except for pop (population) and continent. It is less typing to specify the ones we don't want. We can do this as follows.

```r
gap_data %>%
  select(-pop, -continent)
```

## mutate()

We often want to create new variables (columns) in a dataset, where the new variable is a function of existing variables. For instance, if we have a column with precipitation data in inches, we might want to create a new column that has the same precipitation data in centimeters. In this case we would simply multiply our precipitation in inches by 2.54 (the number of centimeters per inch) to get the new, desired column.
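Here is a quick sketch of that inches-to-centimeters idea (the precipitation values below are made up purely for illustration):

```r
library(dplyr)

# Hypothetical precipitation data, in inches (values are made up)
precip_data <- tibble(month = c("Jan", "Feb", "Mar"),
                      precip_in = c(3.1, 2.4, 4.0))

# 1 inch = 2.54 cm, so multiply the existing column to create the new one
mutate(precip_data, precip_cm = precip_in * 2.54)
```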
Let's create a new column in our gap_data dataset that has the total GDP (i.e. per capita GDP multiplied by the population).

```r
mutate(gap_data,
       tot_gdp = gdpPercap * pop)
```

We calculated this new variable. However, remember that if we want to save this information we need to assign it to a data object. Let's save the data with this new column to a new object.

```r
my_gap <- mutate(gap_data,
                 tot_gdp = gdpPercap * pop)
```

Take a look at your my_gap data to confirm that you've added this new variable.
Now create a new column called pop_mill that has the population in millions (e.g. 1,000,000 should appear as 1), and make sure to add this variable to your data (use the dplyr functions covered above to help you here).

```r
# Your code here
```

Make sure you understand exactly what is going on in the code above. If you have any questions, discuss with your neighbor or with me before moving ahead.
Using the gap_data, do the following exercises. For each one, think about what the resulting data can tell you.
Once you finish these exercises, keep practicing on your own. Come up with some questions/ideas you would like to investigate and use the pipe and the dplyr tools to help answer them.

```r
# Your code blocks here
```
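For instance, here is one possible chain (a sketch, assuming dplyr and gapminder are loaded) that strings together filter(), mutate(), arrange(), and select() to rank the countries of the Americas by total GDP in 2007:

```r
library(dplyr)
library(gapminder)

gapminder %>%
  filter(year == 2007, continent == "Americas") %>%   # keep 2007, Americas only
  mutate(tot_gdp = gdpPercap * pop) %>%               # total GDP per country
  arrange(desc(tot_gdp)) %>%                          # largest economies first
  select(country, tot_gdp)                            # keep just the two columns
```

Try building similar chains for your own questions; the verbs combine in whatever order your question requires.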