Introduction to Data Manipulation: group_by() and summarize()

Author

Mason Stahl (ENS-215)

Published

January 20, 2026


We’ve been working with the dplyr package for the past few classes and we’ve seen just how powerful it is for manipulating and transforming data. So far you’ve used the dplyr functions summarized in the table below.

dplyr function    Description
filter()          Subset by row values
arrange()         Sort rows by column values
select()          Subset columns
mutate()          Add new columns

Today we are going to introduce a few more dplyr functions that will help you with data manipulation and data analysis. In particular, we are going to see how the functions group_by() and summarize() allow you to gain rapid insight into your data. Then we’ll work on a bunch of exercises to reinforce and test the concepts that you’ve learned.

Note: Before starting make sure to clear your Environment so that we can get rid of any data objects from last class. To do this you can go to your Environment tab and click the icon that looks like a broom.

Load in the dplyr package and the gapminder package

We’ll load in tidyverse which contains dplyr (as well as many other packages). We will also load in the gapminder library so we can continue to work with the gapminder data we’ve been using in the past few lectures.

Code
library(tidyverse)
library(gapminder)


We’ll create our own copy of the gapminder data that we’ll use in the upcoming sections.

Code
my_gap <- gapminder


Revisiting mutate() and learning some additional applications/approaches

We often want to create new variables (columns) in a dataset, where the new variable is a function of existing variables. For instance if we have a column with precipitation data in inches, we might want to create a new column that has the same precipitation data in centimeters. In this case we would simply multiply our precipitation in inches by 2.54 (number of cm per inch) to get the new, desired column.
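As a minimal sketch of the precipitation example just described, the code below uses a small made-up table; the object name precip_toy, its column names, and its values are hypothetical and are not part of the course data.

```r
library(dplyr)

# Hypothetical toy data: monthly precipitation in inches (values are made up)
precip_toy <- tibble(
  month     = c("Jan", "Feb", "Mar"),
  precip_in = c(1, 2, 4)
)

# Add a column with the same precipitation in centimeters (2.54 cm per inch)
precip_toy <- mutate(precip_toy,
                     precip_cm = precip_in * 2.54)

precip_toy
```

The original precip_in column is kept, and the new precip_cm column is added alongside it.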

Let’s create a new column in our my_gap dataset that has the total GDP (i.e. per capita GDP multiplied by the population)

Code
mutate(my_gap, 
       tot_gdp = gdpPercap * pop)


We calculated this new variable. However, remember if we want to save this information we need to assign it to a data object. Let’s save this new column to our my_gap data

Code
my_gap <- mutate(my_gap, 
                 tot_gdp = gdpPercap * pop) 

Take a look at your my_gap data to confirm that you’ve added this new variable.

  • In 1952 which country had the largest total GDP? (Rely on the dplyr functions to help you here)
  • In 2007 which country had the largest total GDP?


Ok, now create a new column called pop_mill that has the population in millions (e.g. 1,000,000 should appear as 1) and make sure to add this variable to your my_gap data.

Code
# Your code here 

Make sure you understand exactly what is going on in the code above. If you have any questions, discuss with your neighbor or me before moving ahead.


Vector functions with mutate()

We can also apply functions to the data when creating a new column (variable). We can perform just about any mathematical operation (you’ve already seen multiplication when creating a new variable) – for a list of additional operations check out your dplyr cheatsheet.

For instance, we might want to create a new column that has the log10 of the population data. In this case we can simply employ the log10 function in our mutate() operation

Code
my_gap <- mutate(my_gap, 
                 log10_pop = log10(pop))


Try out the mutate() function to create a new variable gdp_percap_ratio, where you divide all of the per capita GDP values by the maximum per capita GDP observed. This will allow you to see how a given observation compares to the maximum observed.

Code
# Your code here

Take a look at the results and think about what you are observing.


Conditional statements with if_else()

We often want to create a new variable using mutate() where the values are based on some conditional statement. For example, we might want to create a categorical variable where countries are labeled “lower-income” or “higher-income” based on their per capita GDP.

We can use the if_else() function with mutate() to do these types of operations. First let’s create a new variable income_status and assign it to my_gap.

Code
my_gap <- mutate(my_gap, 
                 income_status = if_else(gdpPercap > 7500, "higher-income","lower-income")) 

Take a look at my_gap and make sure you understand what we did here before moving on.


Now try creating your own variable using the mutate() and if_else() functions and add this variable to your my_gap data

Code
# Your code here


Conditional statements with case_when()

The if_else() function is great when you have just two cases that you would like to assign (e.g. “lower-income” and “higher-income”). However, there are instances where we would like to assign values based on more than two cases. In these instances we can use the case_when() function.

Let’s change our income_status variable to cover three cases, “low income”, “middle income”, and “high income”.

Code
my_gap <- mutate(my_gap, 
                 income_status = case_when(gdpPercap > 7500 ~ "high income", 
                                           gdpPercap > 3500 & gdpPercap <= 7500 ~ "middle income", 
                                           gdpPercap <= 3500 ~ "low income") )
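One detail worth knowing: case_when() checks its conditions in order and uses the first one that is TRUE for each row. The sketch below, on a hypothetical toy table (gdp_toy and its values are made up), relies on that ordering to shorten the middle condition and uses a TRUE catch-all for the remaining rows.

```r
library(dplyr)

# Hypothetical toy data: per capita GDP values chosen to fall in each band
gdp_toy <- tibble(gdpPercap = c(500, 3500, 5000, 7500, 20000))

gdp_toy <- mutate(gdp_toy,
                  income_status = case_when(
                    gdpPercap > 7500 ~ "high income",
                    gdpPercap > 3500 ~ "middle income", # only reached when not > 7500
                    TRUE             ~ "low income"     # catch-all for everything else
                  ))

gdp_toy
```

Note that the catch-all also labels rows with a missing gdpPercap as “low income”, whereas spelling out every condition explicitly leaves those rows as NA. Which behavior you want depends on your data.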


Now create a variable life_exp_status where:
  • “high life exp” if life expectancy is > 72
  • “med life exp” if life expectancy is <= 72 and > 65
  • “low life exp” otherwise

Code
# Your code here 


Rename variables with rename()

We often want to rename columns (variables) in a dataset. Often, we’ll load in data that has a column name that we don’t like for one reason or another (too long, not descriptive, includes spaces or odd characters,…). We can use the rename() function to do this.

Let’s rename the columns in our my_gap data so that they are all in a consistent format/style. For this example, let’s have only lower case letters in our column names and let’s indicate spaces between words with an underscore _. This means that we’ll need to rename our lifeExp and gdpPercap columns, and the other columns can remain unchanged.

Code
my_gap <- rename(my_gap, 
                 life_exp = lifeExp, 
                 gdp_per_cap = gdpPercap)


Quick aside regarding the select() function

Remember how we used the select() function to keep only the variables we wanted? Well, there is some additional functionality that you can use with select() that you will now appreciate.

Imagine we have a dataset with lots of variables and we only want to select variables whose names meet some criteria. We can use some helper functions with select() to perform these operations.

Imagine we just wanted the year, country, and any columns containing gdp information. Since our columns with gdp information, all have “gdp” somewhere in the name, we can use the contains() function with select()

Code
select(my_gap, year, country, contains("gdp"))

Take a look at the output and make sure you understand what is going on. Also take a look at your dplyr cheatsheet and you’ll see some other functions that you can use with select().
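As a rough illustration of two other helpers, starts_with() and ends_with(), here is a sketch on a hypothetical toy table; helpers_toy and its column names and values are made up for this example.

```r
library(dplyr)

# Hypothetical toy data with made-up column names and values
helpers_toy <- tibble(
  site        = c("A", "B"),
  temp_min    = c(1, 2),
  temp_max    = c(10, 12),
  precip_mean = c(3, 4),
  precip_max  = c(8, 9)
)

# Keep site plus every column whose name starts with "temp"
temp_cols <- select(helpers_toy, site, starts_with("temp"))

# Keep site plus every column whose name ends with "max"
max_cols <- select(helpers_toy, site, ends_with("max"))

names(temp_cols)
names(max_cols)
```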

Try testing out some of these functions that you can use with select(). While we don’t have very many columns in our current dataset, you can imagine these select functions will become more and more useful as the number of variables grows.

Code
# Your code here


Select n rows by a variable ranking with top_n()

We are often interested in selecting rows (observations) based on their rank. For instance, we might want to just get the top 10 observations by life expectancy. We can use the top_n() function to do this.

Code
top_n(my_gap, 10, life_exp) # top 10 observations (rows) by life expectancy
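A side note: in recent versions of dplyr (1.0.0 and later), top_n() is marked as superseded by slice_max() and slice_min(). top_n() still works, so the rest of this lesson keeps using it. The sketch below compares the two on a hypothetical toy table (scores_toy and its values are made up).

```r
library(dplyr)

# Hypothetical toy data (values are made up)
scores_toy <- tibble(
  name  = c("a", "b", "c", "d", "e"),
  score = c(3, 7, 5, 1, 9)
)

# top_n() keeps the 2 highest-scoring rows, in their original row order
top_old <- top_n(scores_toy, 2, score)

# slice_max() keeps the same 2 rows, returned from highest to lowest score
top_new <- slice_max(scores_toy, score, n = 2)

top_old
top_new
```

Both calls select the same rows; only the order of the output differs, which matters if you want a ranked list without a separate arrange() step.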


What were the top 10 countries by total GDP in 1952? Make sure to output the list in descending order by total GDP. You’ll probably want to use top_n() in addition to other dplyr functions.

Code
# Your code here


What were the top 10 countries by total GDP in 2007? Make sure to output the list in descending order by total GDP. You’ll probably want to use top_n() in addition to other dplyr functions.

Code
# Your code here

Did the top 10 countries change much between 1952 and 2007?


summarize()

When analyzing a dataset, we are often interested in generating a table with statistics that summarize that data. As the name suggests the summarize() function helps us do just that.

Let’s compute the average life expectancy and per capita GDP in our my_gap data. Before doing this, let’s filter our data so we are just looking at year 2007.

Code
my_gap_2007 <- filter(my_gap, year == 2007)


Now, let’s use the summarize() function. The basic syntax is summarize(dataset, variable_name_1 = statistic, variable_name_2 = statistic,...).

Note: both the American English spelling summarize() and British English spelling summarise() will work.

Code
summarize(my_gap_2007, 
          avg_life = mean(life_exp), 
          avg_gdp_per_cap = mean(gdp_per_cap) )
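To illustrate that summarize() is not limited to mean(), here is a minimal sketch on a hypothetical toy table (life_toy and its values are made up) that combines several summary functions in one call.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
life_toy <- tibble(life_exp = c(50, 60, 70, 80))

summary_toy <- summarize(life_toy,
                         n_obs       = n(),              # number of rows
                         median_life = median(life_exp),
                         min_life    = min(life_exp),
                         max_life    = max(life_exp))

summary_toy
```

Each argument becomes one column in a one-row summary table.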

You can use a ton of other summary statistics functions (see your dplyr cheatsheet).

Create a few more summary tables using your my_gap data (note you may want to filter your data first as we did with year 2007). Try to test out some of the additional summary statistics functions from the dplyr cheatsheet.

Code
# Your code here 


Did you learn anything interesting? If so, feel free to share what you found with the class or your neighbor.


group_by() and summarize()

As you’ve seen, the summarize() function is really powerful. However, when we first group our data and then summarize we can often do so much more. Let’s see just how powerful summarize() is when we’ve first employed the group_by() function.

The group_by() function will create a “grouped” copy of a table and subsequent dplyr operations will manipulate each group separately and then the results will be combined.
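A small sketch may help here. It assumes a hypothetical toy table (toy and its values are made up) and shows that grouping does not change the data itself; it only records which rows belong together, so that a later summarize() returns one row per group.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  region = c("north", "north", "south", "south", "south"),
  value  = c(1, 3, 10, 20, 30)
)

toy_grouped <- group_by(toy, region)

nrow(toy_grouped)        # same number of rows as toy; the data are unchanged
group_vars(toy_grouped)  # the grouping variable(s)
n_groups(toy_grouped)    # number of distinct groups

# summarize() now computes the statistic separately for each region
region_means <- summarize(toy_grouped, mean_value = mean(value))
region_means
```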


Let’s try out an example to help make this clearer. You want to determine the minimum, mean, and maximum life expectancy observed on each continent in the year 2007. So first let’s group our my_gap_2007 data by continent

Code
my_gap_2007 <- group_by(my_gap_2007, continent) # group the data by continent


Now, let’s apply the summarize() function to our “grouped” dataset

Code
summarize(my_gap_2007, 
          min_life = min(life_exp), 
          mean_life = mean(life_exp), 
          max_life = max(life_exp))


Look at that! We’ve now got a summary table telling us the minimum, average, and maximum life expectancies observed on each continent in the year 2007! We did this with just a few lines of code! That really beats writing a for loop to loop over each continent and compute the statistics.


You can even group by multiple variables. This is often incredibly useful. For instance, we might want to see how the life expectancy statistics by continent have changed over time. In this case we would group by continent and year before applying the summarize() function.

Code
my_gap <- group_by(my_gap, continent, year) # group by continent then year
Code
summarize(my_gap, 
          min_life = min(life_exp), 
          mean_life = mean(life_exp), 
          max_life = max(life_exp))
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
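The message above is informational, not an error. When you summarize data grouped by more than one variable, summarize() by default drops only the last grouping variable, so the result is still grouped by the remaining one. The sketch below, on a hypothetical toy table (toy and its values are made up), shows the default and the .groups = "drop" option, which returns an ungrouped table.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  continent = c("X", "X", "X", "Y"),
  year      = c(2000, 2000, 2005, 2000),
  life_exp  = c(50, 60, 70, 80)
)

by_cont_year <- group_by(toy, continent, year)

# Default: the result stays grouped by continent (and the message is printed)
default_out <- summarize(by_cont_year, mean_life = mean(life_exp))
group_vars(default_out)

# .groups = "drop": the result is ungrouped (and no message is printed)
dropped_out <- summarize(by_cont_year, mean_life = mean(life_exp), .groups = "drop")
group_vars(dropped_out)
```

The summary values are the same either way; only the grouping of the returned table differs.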


Pretty cool, right?

FYI, if you want to ungroup a dataset that you’ve grouped, you can use the ungroup() function. You can of course always regroup the data if you want.

Also remember that there are tons of statistics functions that you can use with summarize(). Take a look at the Summary Functions section of your dplyr cheatsheet for more info.

Code
my_gap <- ungroup(my_gap) # ungroup the my_gap data


group_by() with other dplyr functions

While group_by() is often used along with summarize(), you can use group_by() with other dplyr functions as well.


We can grab the top three per capita GDPs for each of the years in which data were collected. To do this we’ll need to group the data by year and then apply the top_n() function. I’m going to use the pipe operator to do all of this in a single chain of operations.

Code
my_gap %>% 
  group_by(year) %>% 
  top_n(3, gdp_per_cap)


You can see that this worked, but the data was sorted in alphabetical order by country. It would be more useful to have the data sorted by year. Let’s modify the code above to sort the data too.

Code
my_gap %>% 
  group_by(year) %>% 
  top_n(3, gdp_per_cap) %>% 
  arrange(year)
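group_by() also pairs well with mutate(). In that case the number of rows stays the same, but any summary function inside mutate() is computed within each group. A minimal sketch on a hypothetical toy table (toy and its values are made up):

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  year  = c(2000, 2000, 2005, 2005),
  name  = c("a", "b", "a", "b"),
  value = c(10, 30, 20, 60)
)

toy_share <- toy %>%
  group_by(year) %>%
  mutate(share_of_year = value / sum(value)) %>% # sum() is computed within each year
  ungroup()                                      # remove the grouping when done

toy_share
```

Here each value is divided by the total for its own year rather than by the total of the whole column.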


Take a look at the summary table. Did you find anything interesting?


You’ve seen how group_by() can be powerfully combined with dplyr functions, in particular the summarize() function. Now try out several interesting things of your own. Think of a few interesting questions that you’d like to answer and answer them below.

Code
# Your code here


Exercises with gapminder data

At this point in the term we’ve established a pretty solid toolkit for programming in R and doing some data wrangling and analysis. Below are some exercises that will allow you to test out your skills, with a specific focus on the dplyr tools that you are now familiar with. Remember you can use the pipe %>% operator to easily string together many operations.
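As a reminder of why the pipe helps, the sketch below writes the same sequence of steps twice on a hypothetical toy table (toy and its values are made up): once as nested function calls and once as a pipeline.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  group = c("a", "a", "b", "b"),
  year  = c(2000, 2005, 2000, 2005),
  value = c(1, 2, 3, 4)
)

# Nested calls: you have to read from the inside out
nested <- arrange(summarize(group_by(filter(toy, year == 2005), group),
                            total = sum(value)),
                  desc(total))

# The same steps with the pipe: read top to bottom, one step per line
piped <- toy %>%
  filter(year == 2005) %>%
  group_by(group) %>%
  summarize(total = sum(value)) %>%
  arrange(desc(total))

piped
```

Both versions return the same table. The piped version is usually easier to build and test one step at a time.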


  1. Using only the data for year 2007, create a table that reports the following for each continent in the my_gap dataset
    1. The minimum, maximum and mean per capita GDP
    2. The minimum, maximum and mean life expectancy
    3. The number of countries in that continent
Code
# Your code here


  1. Create a table that reports the following for each country in the my_gap dataset
    1. The ratio of the total GDP between the most recent (last recorded year) and the earliest (first recorded year)
    2. The ratio of the per capita GDP between the most recent (last recorded year) and the earliest (first recorded year)
    3. The change in life expectancy during the period of record (i.e. between the last recorded year and the first recorded year)

Save the table to a new object called gap_summary_table and sort this table in descending order by the change in life expectancy. Did you notice anything interesting? Think about what might explain the observed changes.

Next sort this same table in ascending order based on the change in life expectancy. Do you observe anything interesting?

Once you’ve done the above, try sorting the table according to the ratio of per capita GDP.

Hint: Look at your dplyr cheatsheet in the “Summary Functions” section for some functions that will be useful for the exercise above
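For instance, first() and last() return the first and last value within each group. The sketch below shows them on a hypothetical toy table (toy, its column names, and its values are made up and unrelated to gapminder); adapting the idea to the exercise is left to you.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  site  = c("a", "a", "a", "b", "b"),
  year  = c(2000, 2005, 2010, 2000, 2010),
  value = c(2, 3, 8, 10, 5)
)

first_last_toy <- toy %>%
  arrange(site, year) %>%                 # first() and last() depend on row order
  group_by(site) %>%
  summarize(first_value = first(value),   # value in the earliest year
            last_value  = last(value),    # value in the latest year
            n_years     = n())

first_last_toy
```

Because first() and last() depend on row order, it is safest to sort by year before summarizing.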

Code
# Your code here


Exercises: NOAA precipitation data

Let’s take a look at some precipitation data from the National Oceanic and Atmospheric Administration (NOAA) to gain insight into annual and seasonal variability in precipitation for a US state of your choosing.

The dataset that we will use has monthly precipitation data from 1895 through 2024 for each US state. We are going to generate a summary of annual precipitation for each year on record for your state of interest. For example, if you choose New York, you will create a dataframe that has the total annual precipitation for New York for each year from 1895 through 2024. Without the tools from the dplyr package, you would have to do this with a loop and base R indexing of the dataframe, which is cumbersome, requires more code, is harder to read, and is more prone to coding errors. Thankfully we’ve learned how to use dplyr.


First we’ll load in the dataset from last week

Code
precip_data <- read_csv("https://stahlm.github.io/ENS_215/Data/noaa_cag_state_precipitation.csv")
Rows: 76538 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): STATE, MONTH
dbl (3): Date, Value, YEAR

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
precip_data <- precip_data %>% 
  select(YEAR, MONTH, STATE, Precip_inches = Value) %>% 
  filter(YEAR < 2025) %>% 
  mutate(MONTH = as.numeric(MONTH))


Take a few minutes to look at the dataset to familiarize yourself with it (e.g. view it, use the summary() function,…).

Now let’s select a state of interest. Remember that if we weren’t using dplyr we would have to select the state of interest using code like what is shown directly below.

Code
state_2_get <- "New York" # State I want to select

state_data <- precip_data[precip_data$STATE == state_2_get, ] # get the rows with desired state, and get all columns


That code works, but it is cumbersome to write, easy to make an error, and sort of difficult to decipher. We can do much better now.

Use dplyr to create a state_data object that has all of the data for your state of choosing.

Code
# Your code here


Ok, now let’s proceed to the exercises below which will allow you to demonstrate/test your dplyr skills and will highlight just how powerful the dplyr package is.

Some of these exercises will be challenging, so make sure to consult your dplyr cheatsheet and discuss with your classmates and me.

Some helpful advice: Remember to think step-by-step. Test each step as you go. Even for an experienced programmer/scientist it is often necessary to break the task into smaller chunks. Piping with %>% can make your task easier by allowing you to combine many individual steps together. If you finish all of the exercises you should go back and take more time to examine your results and think about the environmental significance/implications.


Now use the tools from dplyr to do the following:

  1. For your chosen state create a table that has
    1. The monthly average precipitation (i.e. for January you would take the average of all the January data, for Feb…)
    2. Minimum observed precipitation for a given month
    3. Maximum observed precipitation for a given month
Code
# Your code here


  1. For your chosen state, create a table that has
    1. total annual precipitation for each year on record
Code
# Your code here


  1. Using the dataset precip_data, which has data for all of the states, create a single table that has
    1. The total annual precipitation for each year, for each state (e.g. total annual precip for all years on record for Alabama, same for Arkansas, … , same for New York, …)
Code
# Your code here


  1. Using your table from the exercise above, create another table that has
    1. The same statistics, but just for the wettest state in each year (i.e. the one with the highest total precip for that year)
    2. Sort this table by year (in ascending order)
Code
# Your code here


  1. Using the dataset precip_data, which has data for all of the states, create a table that has
    1. Average annual precipitation for each state (i.e. a single value for each state. For instance NY might have averaged 45 inches of rain annually from 1895 through 2024, CA might have averaged…)
    2. Minimum annual precipitation recorded for each state (i.e. a single value for each state)
    3. Maximum annual precipitation recorded for each state (i.e. a single value for each state)
    4. Sort the table by the average annual precipitation (descending order)

Hint: There are many ways to do this, however it might help to make an “intermediate” table of stats that you then operate on to get your final table (this isn’t necessary, but might help)

Code
# Your code here


  1. Challenge: For your chosen state do the following
    1. Add a new variable (column) called season to your state_data that has a “flag” (categorical value) that takes the value “season-1” if the month is March through August and takes the value “season-2” otherwise
    2. Now using your state_data create a table that reports, for each year of record, the ratio of the precipitation in season-1 to the total precipitation in that same year.

This table will allow us to see how the seasonal distribution of precipitation may have changed (or not) over the years on record.

Code
# Your code here