Let's now jump into some data manipulation using the functionality in the dplyr package. This package is going to dramatically increase your data handling skills.

You'll see how, with dplyr, data manipulation operations that may have required long, complex, and difficult-to-decipher code can often be reduced to a single line that is easy to write and easy to read!
As you work through the code, pay close attention and make sure that you understand what is happening in each line of code. Check help files, ask your neighbor, and of course ask me if you have any questions or don’t fully understand something.
Note: The dplyr cheatsheet will be very helpful when working through today's lesson.
## dplyr package

We'll load in tidyverse, which contains dplyr along with many other packages.

```r
library(tidyverse)
```

## gapminder package

We'll also load in the gapminder package. This package has a great dataset to use when learning and showcasing the functionality of dplyr.
The gapminder dataset contains data on life expectancy, GDP per capita, and populations by country (and through time). This is a great dataset to look at for thinking about global development. As global development is inherently tied to environmental aspects (e.g. water resources, mineral resources, ecology) this dataset should be of interest to anyone studying the environment, and of particular interest to those focused on policy.
If you haven’t installed the gapminder package yet, you should go to your package window and do this. Once you have the package let’s load it in.
```r
library(gapminder)
```

Let's print out some of the data and take a look.

```r
gapminder
```

Let's also get a summary() to learn a bit more.
```r
summary(gapminder)
```

```
      country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85
 Australia  :  12                  Max.   :2007   Max.   :82.60
 (Other)    :1632                                              
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
```
You should take a minute to learn more about the dataset by typing ?gapminder in your CONSOLE.
Also take a minute to look at the data by using the View() function. Do this in your CONSOLE.
You can see that there is a lot of data here, a number of variables, and many different ways we might want to filter, group, select,… the data. dplyr is going to come to the rescue here.
Before we proceed, let's assign the gapminder data to a new object (just so you aren't worrying about messing up the original data).

```r
gap_data <- gapminder
```

## filter() to subset rows of data

No longer do we need to use complex logical statements as our indices when trying to subset data along rows (recall the complex statements that you would put into the [] when subsetting). filter() makes this operation painless.
The filter() function takes your logical comparison and selects all of the rows for which the test is TRUE.
```r
filter(gap_data, gdpPercap > 5000)
```

We just filtered the gap_data so that only the rows where the gdpPercap was greater than 5000 were selected. That was super easy!
We can add even more conditions to filter on.
```r
filter(gap_data, gdpPercap > 5000, year == 2007)
```

We now selected all of the rows where the per capita GDP > 5000 AND the year is 2007.

You see that the syntax is filter(dataset, criteria 1, criteria 2, ..., criteria n).
Note that filter() returns the output you want but it does not modify the original dataset. This is good, since we likely want to keep using our original dataset later on in our code.
Furthermore, you'll see that since we didn't assign the output from filter() to a data object, we haven't created any new data objects yet. We can easily assign the output to a new object, the same way we do all other object assignments.

```r
data_2007_highGDP <- filter(gap_data, gdpPercap > 5000, year == 2007)
```

However, before you go ahead and make tons of new data frames to store filtered versions of your data, you should realize that we can generally pass the filtered data directly to another function (e.g. a plotting function, mean(), sum(), ...), which may eliminate the need to create tons of new data frames. This will help to keep our environment clean (something you should care about as environmental scientists).
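For example, here is a minimal sketch (assuming dplyr and gapminder are loaded, and gap_data has been assigned as above) that passes the filtered data straight into mean() without ever saving an intermediate data frame:

```r
library(dplyr)
library(gapminder)

gap_data <- gapminder

# Pass the filtered rows directly to mean(): no intermediate object needed.
# Here we compute the mean life expectancy across all countries in 2007.
mean(filter(gap_data, year == 2007)$lifeExp)
```

The filtered data frame exists only for the duration of the call, so your environment stays clean.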
Here's how you would filter based on a variable that has text values.

```r
filter(gap_data, gdpPercap > 5000, continent == "Asia")
```

Now I want you to try a few more filter operations on the gap_data. Take your time and get comfortable with the basics here.
```r
# Your code here
```

## filter() and %in%

If you want to filter based on several values for a given variable, you could use the OR operator | in your criteria. For example,

```r
filter(gap_data, country == "Canada" | country == "China")
```

This will select the rows for both Canada and China. What I did above works, but it is cumbersome and the code is somewhat difficult to read.
We can achieve the same result and greatly simplify our code.

```r
filter(gap_data, country %in% c("Canada", "China"))
```

You can imagine that if you have a very long vector of items, the %in% approach is much easier than writing a bunch of OR | statements.

Try filtering the data to get only rows with Canada, China, France, India, and Argentina. You should save this set of countries to its own vector and then use this vector in your filter() statement. This will make your code much, much easier to read than defining the vector directly after the %in% in your filter() (as I did above).
```r
# Your code here
```

## filter() is pretty nice, right?

Remember those days before dplyr, when we wanted to select rows from a data frame according to criteria?

In base R:

```r
gap_data[gap_data$year == 2007 & gap_data$gdpPercap > 5000, ] # the old way in base R
```

With dplyr:

```r
filter(gap_data, year == 2007, gdpPercap > 5000) # with dplyr
```

Both of the above code blocks do the exact same thing: select the rows in gap_data where the year is 2007 AND the per capita GDP > 5000.

The dplyr method is so much easier to code and read.
## arrange()

Let's use the arrange() function to sort our dataset.

We'll save the sorted data to another object so that you can take a look at it more easily.

```r
gap_sorted <- arrange(gap_data, gdpPercap)
```

View() the sorted data (remember to do this in the console) and confirm that everything worked.
arrange() sorts data in ascending order (smallest to largest). To sort data in descending order, use the desc() function.

```r
gap_sorted <- arrange(gap_data, desc(gdpPercap))
```

Take a look and make sure everything looks good. Were you surprised to see the country that had the highest per capita GDP (and the year when this occurred)? As an interesting aside, discuss with your neighbor why you think this might be the case.
Test out a few more sorts on the gap_data using the arrange() function.

```r
# Your code here
```

## arrange() by multiple variables

You can sort data according to multiple variables using the arrange() function. Below we'll sort the gap_data by continent first and then by life expectancy. This will give us a sorted dataset with the life expectancies sorted within each continent.

```r
gap_sorted <- arrange(gap_data, continent, lifeExp)
```

That worked well, but you'll see that we didn't consider the year in our sorting. Let's now sort by continent, then year, then life expectancy.
```r
# Your code here
```

Take a look at the data. Does it look different than when we sorted without the year? Does this particular organization seem more useful?

Repeat your code block directly above, but sort the years in descending order so that we can see the most recent data at the top of the data frame.

```r
# Your code here
```

Before we go further, I want to introduce the pipe operator. The pipe operator will make your code much easier to write and read. It allows you to carry out a sequence of operations without having to nest the operations within one another.
## The pipe operator %>% (you can also use |> as the pipe operator)

The pipe operator %>% allows you to:

1. Take the output of one function and pipe it in as input to the next function
2. String together many pipes to create a single chain of operations
Let’s take a look at an example
```r
filter(gap_data, year == 2007, continent == "Asia") %>%
  arrange(gdpPercap)
```

The above code first filtered the data to select only the observations from year 2007 in Asia. Then it sent this filtered dataset to arrange(), where it was sorted by per capita GDP.

Notice how I did not specify the data frame to use in arrange(). This is because the pipe operator passes along the output from the previous operation, so that output automatically becomes the input to the arrange() function.
We could have piped the gap_data to the filter() function to make it even easier to read.

```r
gap_data %>%
  filter(year == 2007, continent == "Asia") %>%
  arrange(gdpPercap)
```

The code below does the exact same thing but without the pipe operator. See how much easier it is to read the code that used the pipe. With the piped code we can read the sequence of operations from left to right, whereas in the nested code below we have to read from the inside out (which is much more difficult for a human).

```r
arrange(filter(gap_data, year == 2007, continent == "Asia"), gdpPercap) # without using pipe operator
```

FYI, the pipe operator %>% actually comes from the magrittr package, but this package is part of tidyverse, so it loads in every time we load tidyverse.
## select() columns

We can select a subset of columns from a data frame using the select() function. This is extremely useful when you load in a data frame that might have tens or hundreds of columns and you are only interested in a few of them.
Let's give it a test with our gap_data. We'll select the country, year, lifeExp, and gdpPercap columns.

```r
select(gap_data, country, year, lifeExp, gdpPercap)
```

You can also rename the columns when you select them. Let's give that a try and rename the gdpPercap variable to GDP_percap.

```r
select(gap_data, country, year, lifeExp, GDP_percap = gdpPercap)
```

This is helpful when the original dataset has column names that we don't like for some reason (e.g. not meaningful, too long, ...).
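As an aside, if you only want to change a column name without dropping any columns, dplyr also provides rename(). A small sketch (assuming dplyr and gapminder are loaded):

```r
library(dplyr)
library(gapminder)

# rename() changes the name but keeps all of the columns,
# unlike select(), which keeps only the columns you list
gap_renamed <- rename(gapminder, GDP_percap = gdpPercap)
names(gap_renamed)
```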
Try selecting and renaming some variables.

```r
# Your code here
```

If you want to select most of the columns, then it is easier to specify which ones you don't want to keep as opposed to which ones you want to keep.
Imagine we want all of the columns in gap_data except for pop (population) and continent. It is less typing to specify the ones we don't want. We can do this as follows.

```r
gap_data %>%
  select(-pop, -continent)
```

## mutate()

We often want to create new variables (columns) in a dataset, where the new variable is a function of existing variables. For instance, if we have a column with precipitation data in inches, we might want to create a new column that has the same precipitation data in centimeters. In this case we would simply multiply our precipitation in inches by 2.54 (the number of centimeters per inch) to get the new, desired column.
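Here is a quick sketch of that inches-to-centimeters idea (the precipitation values below are made up purely for illustration):

```r
library(dplyr)

# Hypothetical precipitation data, in inches (values are made up)
precip_data <- tibble(month = c("Jan", "Feb", "Mar"),
                      precip_in = c(3.1, 2.4, 4.0))

# 1 inch = 2.54 cm, so multiply the existing column to create the new one
mutate(precip_data, precip_cm = precip_in * 2.54)
```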
Let's create a new column in our gap_data dataset that has the total GDP (i.e. per capita GDP multiplied by the population).

```r
mutate(gap_data,
       tot_gdp = gdpPercap * pop)
```

We calculated this new variable. However, remember that if we want to save this information we need to assign it to a data object. Let's save the data with this new column to a new object.

```r
my_gap <- mutate(gap_data,
                 tot_gdp = gdpPercap * pop)
```

Take a look at your my_gap data to confirm that you've added this new variable.
Now create a new column called pop_mill that has the population in millions (e.g. 1,000,000 should appear as 1), and make sure to add this variable to your data (use the dplyr functions covered above to help you here).

```r
# Your code here
```

Make sure you understand exactly what is going on in the code above. If you have any questions, discuss with your neighbor or with me before moving ahead.
Using the gap_data, do the following exercises. For each one, think about what the resulting data can tell you.
Once you finish these exercises, keep practicing on your own. Come up with some questions/ideas you would like to investigate and use the pipe and the dplyr tools to help answer them.

```r
# Your code blocks here
```
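For instance, here is one possible chain (a sketch, assuming dplyr and gapminder are loaded) that strings together filter(), mutate(), arrange(), and select() to rank the countries of the Americas by total GDP in 2007:

```r
library(dplyr)
library(gapminder)

gapminder %>%
  filter(year == 2007, continent == "Americas") %>%   # keep 2007, Americas only
  mutate(tot_gdp = gdpPercap * pop) %>%               # total GDP per country
  arrange(desc(tot_gdp)) %>%                          # largest economies first
  select(country, tot_gdp)                            # keep just the two columns
```

Try building similar chains for your own questions; the verbs combine in whatever order your question requires.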