Introduction to Data Manipulation

Let’s now jump into some data manipulation using the functionality in the dplyr package. This package is going to increase your data handling skills by a factor of 100 or more¹!

You’ll see how with dplyr data manipulation operations that may have required long, complex, and difficult to decipher code can often be reduced to a single line that is easy to code and easy to read!

As you work through the code, pay close attention and make sure that you understand what is happening in each line of code. Check help files, ask your neighbor, and of course ask me if you have any questions or don’t fully understand something.

Note: The dplyr cheatsheet will be very helpful when working through today’s lesson

Load in the `dplyr` package

We’ll actually load in tidyverse which contains dplyr along with many other packages.

library(tidyverse)

Load in the `gapminder` package

We’ll also load in the the gapminder package. This package has a great dataset to use when learning and showcasing the functionality of dplyr.

The gapminder dataset contains data on life expectancy, GDP per capita, and populations by country (and through time). This is a great dataset to look at for thinking about global development. As global development is inherently tied to environmental aspects (e.g. water resources, mineral resources, ecology) this dataset should be of interest to anyone studying the environment, and of particular interest to those focused on policy.

If you haven’t installed the gapminder package yet, you should go to your package window and do this. Once you have the package let’s load it in.

library(gapminder)

Look at the Gapminder dataset

Let’s print out some of the data and take a look.

gapminder

## # A tibble: 1,704 x 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

Let’s also get a summary() to learn a bit more

summary(gapminder)

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
##

You should take a minute to learn more about the dataset by typing ?gapminder in your CONSOLE.

Also take a minute to look at the data by using the View() function. Do this in your CONSOLE.

You can see that there is a lot of data here, a number of variables, and many different ways we might want to filter, group, select,… the data. dplyr is going to come to the rescue here.

Before we proceed, let’s assign the gapminder data to a new object (just so you aren’t worrying about messing up the original data)

gap_data <- gapminder

Using `filter()` to subset rows of data

No longer do we need to use complex logical statements as our indices when trying to subset data along rows (recall the complex statements that you would put into the [] when subsetting). filter() makes this operation painless.

The filter() function takes your logical comparison and selects all of the rows for which the test is TRUE.

filter(gap_data, gdpPercap > 5000)

## # A tibble: 692 x 6
##    country   continent  year lifeExp      pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Albania   Europe     2007    76.4  3600523     5937.
##  2 Algeria   Africa     1982    61.4 20033753     5745.
##  3 Algeria   Africa     1987    65.8 23254956     5681.
##  4 Algeria   Africa     1992    67.7 26298373     5023.
##  5 Algeria   Africa     2002    71.0 31287142     5288.
##  6 Algeria   Africa     2007    72.3 33333216     6223.
##  7 Angola    Africa     1967    36.0  5247469     5523.
##  8 Angola    Africa     1972    37.9  5894858     5473.
##  9 Argentina Americas   1952    62.5 17876956     5911.
## 10 Argentina Americas   1957    64.4 19610538     6857.
## # ... with 682 more rows

We just filtered the gap_data so that only the rows where the gdpPercap was greater than 5000 were selected. That was super easy!

We can add even more conditions to filter on.

filter(gap_data, gdpPercap > 5000, year == 2007)

## # A tibble: 76 x 6
##    country                continent  year lifeExp       pop gdpPercap
##    <fct>                  <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Albania                Europe     2007    76.4   3600523     5937.
##  2 Algeria                Africa     2007    72.3  33333216     6223.
##  3 Argentina              Americas   2007    75.3  40301927    12779.
##  4 Australia              Oceania    2007    81.2  20434176    34435.
##  5 Austria                Europe     2007    79.8   8199783    36126.
##  6 Bahrain                Asia       2007    75.6    708573    29796.
##  7 Belgium                Europe     2007    79.4  10392226    33693.
##  8 Bosnia and Herzegovina Europe     2007    74.9   4552198     7446.
##  9 Botswana               Africa     2007    50.7   1639131    12570.
## 10 Brazil                 Americas   2007    72.4 190010647     9066.
## # ... with 66 more rows

We now selected all of the rows where the per capita GDP > 5000 AND the year is 2007.

You see that the syntax is filter(dataset, criteria 1, criteria 2,..., criteria n)

Note that filter() returns the output you want but it does not modify the original dataset. This is good, since we likely want to keep using our original dataset later on in our code.

Furthermore, you’ll see that since we didn’t assign the output from filter() to a data object, we haven’t created any new data objects yet. We can easily assign the output to a new object, the same way we do all other object assignments

data_2007_highGDP <- filter(gap_data, gdpPercap > 5000, year == 2007)

However, before you go ahead and make tons of new data frames to store filtered versions of your data, you should realize that generally we can pass the filtered data directly to another function (e.g. a plotting function, mean, sum,…) which may eliminate the need to creating tons of new data frames. This will help to keep our environment clean (something you should care about as environmental scientists).

Here’s how you would filter based on a variable that has text values

filter(gap_data, gdpPercap > 5000, continent == "Asia")

## # A tibble: 133 x 6
##    country continent  year lifeExp    pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>  <int>     <dbl>
##  1 Bahrain Asia       1952    50.9 120447     9867.
##  2 Bahrain Asia       1957    53.8 138655    11636.
##  3 Bahrain Asia       1962    56.9 171863    12753.
##  4 Bahrain Asia       1967    59.9 202182    14805.
##  5 Bahrain Asia       1972    63.3 230800    18269.
##  6 Bahrain Asia       1977    65.6 297410    19340.
##  7 Bahrain Asia       1982    69.1 377967    19211.
##  8 Bahrain Asia       1987    70.8 454612    18524.
##  9 Bahrain Asia       1992    72.6 529491    19036.
## 10 Bahrain Asia       1997    73.9 598561    20292.
## # ... with 123 more rows

Now I want you to try a few more filter operations on the gap_data. Take your time and get comfortable with the basics here.

# Your code here

`filter()` and `in`

If you want to filter based on several values for a given variable, you could use the OR | operator in your criteria. For example,

filter(gap_data, country == "Canada" | country == "China")

## # A tibble: 24 x 6
##    country continent  year lifeExp      pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Canada  Americas   1952    68.8 14785584    11367.
##  2 Canada  Americas   1957    70.0 17010154    12490.
##  3 Canada  Americas   1962    71.3 18985849    13462.
##  4 Canada  Americas   1967    72.1 20819767    16077.
##  5 Canada  Americas   1972    72.9 22284500    18971.
##  6 Canada  Americas   1977    74.2 23796400    22091.
##  7 Canada  Americas   1982    75.8 25201900    22899.
##  8 Canada  Americas   1987    76.9 26549700    26627.
##  9 Canada  Americas   1992    78.0 28523502    26343.
## 10 Canada  Americas   1997    78.6 30305843    28955.
## # ... with 14 more rows

This will select the rows for the both Canada and China. What I did above works, but it is cumbersome and somewhat difficult to read the code.

We can achieve the same result and greatly simplify our code

filter(gap_data, country %in% c("Canada", "China"))

## # A tibble: 24 x 6
##    country continent  year lifeExp      pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Canada  Americas   1952    68.8 14785584    11367.
##  2 Canada  Americas   1957    70.0 17010154    12490.
##  3 Canada  Americas   1962    71.3 18985849    13462.
##  4 Canada  Americas   1967    72.1 20819767    16077.
##  5 Canada  Americas   1972    72.9 22284500    18971.
##  6 Canada  Americas   1977    74.2 23796400    22091.
##  7 Canada  Americas   1982    75.8 25201900    22899.
##  8 Canada  Americas   1987    76.9 26549700    26627.
##  9 Canada  Americas   1992    78.0 28523502    26343.
## 10 Canada  Americas   1997    78.6 30305843    28955.
## # ... with 14 more rows

You can imagine if you have a very long vector of items, the %in% approach is much easier than writing a bunch of OR | statements.

Try filtering the data to get only rows with Canada, China, France, India, and Argentina. You should save this set of countries to its own vector and then use this vector in your filter() statement. This will make your code much, much easier to read then defining the vector in your filter() directly after the %in% (as I did above).

# Your code here

`filter()` is pretty nice right?

Remember those days before dplyr when we wanted to select rows from a data frame according to criteria.

In base R

gap_data[gap_data$year == 2007 & gap_data$gdpPercap > 5000, ] # the old way in base R

## # A tibble: 76 x 6
##    country                continent  year lifeExp       pop gdpPercap
##    <fct>                  <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Albania                Europe     2007    76.4   3600523     5937.
##  2 Algeria                Africa     2007    72.3  33333216     6223.
##  3 Argentina              Americas   2007    75.3  40301927    12779.
##  4 Australia              Oceania    2007    81.2  20434176    34435.
##  5 Austria                Europe     2007    79.8   8199783    36126.
##  6 Bahrain                Asia       2007    75.6    708573    29796.
##  7 Belgium                Europe     2007    79.4  10392226    33693.
##  8 Bosnia and Herzegovina Europe     2007    74.9   4552198     7446.
##  9 Botswana               Africa     2007    50.7   1639131    12570.
## 10 Brazil                 Americas   2007    72.4 190010647     9066.
## # ... with 66 more rows

With dplyr

filter(gap_data, year == 2007, gdpPercap > 5000) # with dplyr

## # A tibble: 76 x 6
##    country                continent  year lifeExp       pop gdpPercap
##    <fct>                  <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Albania                Europe     2007    76.4   3600523     5937.
##  2 Algeria                Africa     2007    72.3  33333216     6223.
##  3 Argentina              Americas   2007    75.3  40301927    12779.
##  4 Australia              Oceania    2007    81.2  20434176    34435.
##  5 Austria                Europe     2007    79.8   8199783    36126.
##  6 Bahrain                Asia       2007    75.6    708573    29796.
##  7 Belgium                Europe     2007    79.4  10392226    33693.
##  8 Bosnia and Herzegovina Europe     2007    74.9   4552198     7446.
##  9 Botswana               Africa     2007    50.7   1639131    12570.
## 10 Brazil                 Americas   2007    72.4 190010647     9066.
## # ... with 66 more rows

Both of the above code blocks do the exact same thing: select the rows in gap_data where the year is 2007 AND the per capita GDP > 5000.

The dplyr method is so much easier to code and read.

Sort data with `arrange()`

Let’s use the arrange() function to sort our dataset.

We’ll save to sorted data to another object, so that you can take a look at it easier.

gap_sorted <- arrange(gap_data, gdpPercap)

View() the sorted data (remember to type do this in the console) and confirm that everything worked.

arrange() sorts data in ascending order (smallest to largest). To sort data in descending order use the desc() function

gap_sorted <- arrange(gap_data, desc(gdpPercap))

Take a look and make sure everything looks good. Were you surprised to see the country that had the highest GDP (and the year when this occurred)? As a interesting aside, discuss with your neighbor why you think this might be the case.

Test out a few more sorts on the gap_data using the arrange() function

# Your code here

`arrange()` by multiple variables

You can sort data according to multiple variables using the arrange() function. Below we’ll sort the gap_data by continent first and then by life expectancy. This will give us a sorted dataset that will have the life expectancies sorted within each continent

gap_sorted <- arrange(gap_data, continent, lifeExp)

That worked well, but you’ll see that we didn’t consider the year in our sorting. Let’s now sort by continent, then year, then life expectancy

# your code here

Take a look at the data. Does it look different than when we sorted without the year. Does this particular organization seem more useful?

Repeat your code block directly above, but sort the years in descending order so that we can see the most recent data at the top of the data frame.

# Your code here

Piping

Before we go further, I want to introduce the pipe operator. The pipe operator will make your code much easier to write and read. It allows you to carry out a sequence of operations without having to nest the operations within one another.

The pipe operator is %>%
Prounced “then” when reading code in “English”
Shortcut to insert %>% is
- PC: CTRL + SHIFT + M
- MAC: COMMAND + SHIFT + M

The pipe %>% operator allows you to 1. Take output of one function and pipe it as input to the next function 2. String together many pipes to create a single chain of operations

Let’s take a look at an example

filter(gap_data, year == 2007, continent == "Asia") %>% arrange(gdpPercap)

## # A tibble: 33 x 6
##    country          continent  year lifeExp        pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Myanmar          Asia       2007    62.1   47761980      944 
##  2 Afghanistan      Asia       2007    43.8   31889923      975.
##  3 Nepal            Asia       2007    63.8   28901790     1091.
##  4 Bangladesh       Asia       2007    64.1  150448339     1391.
##  5 Korea, Dem. Rep. Asia       2007    67.3   23301725     1593.
##  6 Cambodia         Asia       2007    59.7   14131858     1714.
##  7 Yemen, Rep.      Asia       2007    62.7   22211743     2281.
##  8 Vietnam          Asia       2007    74.2   85262356     2442.
##  9 India            Asia       2007    64.7 1110396331     2452.
## 10 Pakistan         Asia       2007    65.5  169270617     2606.
## # ... with 23 more rows

The above code first filtered the data to select only the observations for year 2007 and in Asia. Then it sent this filtered dataset arrange() where it was sorted by per capita GDP.

Notice, how I did not specify the data frame to use in arrange(). This is because the pipe operator passes the output from the previous operation so this output is necessarily the input to the arrange() function.

We could have piped the gap_data to the filter() function to make it even easier to read.

gap_data %>% filter(year == 2007, continent == "Asia") %>% arrange(gdpPercap)

## # A tibble: 33 x 6
##    country          continent  year lifeExp        pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Myanmar          Asia       2007    62.1   47761980      944 
##  2 Afghanistan      Asia       2007    43.8   31889923      975.
##  3 Nepal            Asia       2007    63.8   28901790     1091.
##  4 Bangladesh       Asia       2007    64.1  150448339     1391.
##  5 Korea, Dem. Rep. Asia       2007    67.3   23301725     1593.
##  6 Cambodia         Asia       2007    59.7   14131858     1714.
##  7 Yemen, Rep.      Asia       2007    62.7   22211743     2281.
##  8 Vietnam          Asia       2007    74.2   85262356     2442.
##  9 India            Asia       2007    64.7 1110396331     2452.
## 10 Pakistan         Asia       2007    65.5  169270617     2606.
## # ... with 23 more rows

The code below does the exact same thing but without the pipe operator. See how much easier it is to read the code that used pipe. With the piped code we can read the sequence of operation from left to right, whereas in the nested code below we had to read from inside out (which is much more difficult from a human perspective).

arrange(filter(gap_data, year == 2007, continent == "Asia"), gdpPercap) # without using pipe operator

## # A tibble: 33 x 6
##    country          continent  year lifeExp        pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Myanmar          Asia       2007    62.1   47761980      944 
##  2 Afghanistan      Asia       2007    43.8   31889923      975.
##  3 Nepal            Asia       2007    63.8   28901790     1091.
##  4 Bangladesh       Asia       2007    64.1  150448339     1391.
##  5 Korea, Dem. Rep. Asia       2007    67.3   23301725     1593.
##  6 Cambodia         Asia       2007    59.7   14131858     1714.
##  7 Yemen, Rep.      Asia       2007    62.7   22211743     2281.
##  8 Vietnam          Asia       2007    74.2   85262356     2442.
##  9 India            Asia       2007    64.7 1110396331     2452.
## 10 Pakistan         Asia       2007    65.5  169270617     2606.
## # ... with 23 more rows

FYI, the pipe operator is actually in the magrittr package, but this package is part of tidyverse so it loads in everytime we load tidyverse.

`select()` columns

We can select a subset of columns from a data frame using the select() function. This is extremely useful when you load in a data frame that might have tens or hundreds of columns and you are only interested.

Let’s give it a test with our gap_data. We’ll select the country, year, lifeExp, and gdpPercap columns

select(gap_data, country, year, lifeExp, gdpPercap)

## # A tibble: 1,704 x 4
##    country      year lifeExp gdpPercap
##    <fct>       <int>   <dbl>     <dbl>
##  1 Afghanistan  1952    28.8      779.
##  2 Afghanistan  1957    30.3      821.
##  3 Afghanistan  1962    32.0      853.
##  4 Afghanistan  1967    34.0      836.
##  5 Afghanistan  1972    36.1      740.
##  6 Afghanistan  1977    38.4      786.
##  7 Afghanistan  1982    39.9      978.
##  8 Afghanistan  1987    40.8      852.
##  9 Afghanistan  1992    41.7      649.
## 10 Afghanistan  1997    41.8      635.
## # ... with 1,694 more rows

You can also rename the columns when you select them. Let’s give that a try and rename the gdpPercap variable to GDP_percap

select(gap_data, country, year, lifeExp, GDP_percap = gdpPercap)

## # A tibble: 1,704 x 4
##    country      year lifeExp GDP_percap
##    <fct>       <int>   <dbl>      <dbl>
##  1 Afghanistan  1952    28.8       779.
##  2 Afghanistan  1957    30.3       821.
##  3 Afghanistan  1962    32.0       853.
##  4 Afghanistan  1967    34.0       836.
##  5 Afghanistan  1972    36.1       740.
##  6 Afghanistan  1977    38.4       786.
##  7 Afghanistan  1982    39.9       978.
##  8 Afghanistan  1987    40.8       852.
##  9 Afghanistan  1992    41.7       649.
## 10 Afghanistan  1997    41.8       635.
## # ... with 1,694 more rows

This is helpful when the original dataset has column names that we don’t like for some reason (e.g. not meaningful, too long,…)

Try selecting and renaming some variables

# Your code here

If you want to select most of the columns, then it is easier to specify which ones you don’t want to keep as opposed to which ones you want to keep.

Imagine we want all of the columns in gap_data except for pop (population) and continent. It is less typing to specify the ones we don’t want. We can do this as follows

gap_data %>% select(-pop,-continent)

## # A tibble: 1,704 x 4
##    country      year lifeExp gdpPercap
##    <fct>       <int>   <dbl>     <dbl>
##  1 Afghanistan  1952    28.8      779.
##  2 Afghanistan  1957    30.3      821.
##  3 Afghanistan  1962    32.0      853.
##  4 Afghanistan  1967    34.0      836.
##  5 Afghanistan  1972    36.1      740.
##  6 Afghanistan  1977    38.4      786.
##  7 Afghanistan  1982    39.9      978.
##  8 Afghanistan  1987    40.8      852.
##  9 Afghanistan  1992    41.7      649.
## 10 Afghanistan  1997    41.8      635.
## # ... with 1,694 more rows

Exercises

Using the gap_data do the following:

Create a new data frame that meets the following conditions
- Only has entries with years after 1980
- Only has entries in Asia and Africa
- Data is sorted by continent and then by life expectancy

Think about what the above data can tell you

Create a new data frame that meets the following conditions
- Has the continent column removed
- Only has entries for year 2007
- Only has entries for life expectancies less than 60
- Only has entries with with a gdp per capita that is greater than 10000

Think about what the above data can tell you

Create a new data frame that meets the following conditions
- Has entries with life expectancies greater than the mean life expectancy of all the data
- Has entries with populations greater than 20 million

Think about what the above data can tell you

Which country had the lowest life expectancy in 1977?
Which country in Europe had the lowest life expectancy in 2007?
What were the top countries by total GDP (note that this is different from per capita GDP) in 1952?

Once you finish these exercises, keep practicing on your own. Come up with some questions/ideas you would like to investigate and use the pipe and dplyr tools to help answer them

# Your code blocks here

Based on my analysis you can expect an average improvement by a factor of 128.93 \(\pm\) 2.45.↩︎

Introduction to Data Manipulation

ENS-215

16-Jan-2023

Load in the `dplyr` package

Load in the `gapminder` package

Look at the Gapminder dataset

Using `filter()` to subset rows of data

`filter()` and `in`

`filter()` is pretty nice right?

Sort data with `arrange()`

`arrange()` by multiple variables

Piping

`select()` columns

Exercises

Introduction to Data Manipulation

ENS-215

16-Jan-2023

Load in the dplyr package

Load in the gapminder package

Look at the Gapminder dataset

Using filter() to subset rows of data

filter() and in

filter() is pretty nice right?

Sort data with arrange()

arrange() by multiple variables

Piping

select() columns

Exercises

Load in the `dplyr` package

Load in the `gapminder` package

Using `filter()` to subset rows of data

`filter()` and `in`

`filter()` is pretty nice right?

Sort data with `arrange()`

`arrange()` by multiple variables

`select()` columns