group_by() and summarize()
We’ve been working with the dplyr package for the past few classes, and we’ve seen just how powerful it is for manipulating and transforming data. So far you’ve used the dplyr functions summarized in the table below.
dplyr function | Description |
---|---|
filter() | Subset by row values |
arrange() | Sort rows by column values |
select() | Subset columns |
mutate() | Add new columns |
rename() | Rename columns |
top_n() | Select and order the top n entries according to a column (variable) |
summarize() | Summarize columns |
Today we are going to introduce a few more dplyr functions that will help you with data manipulation and data analysis. In particular, we are going to see how the functions group_by() and summarize() allow you to gain rapid insight into your data. Then we’ll work on a bunch of exercises to reinforce and test the concepts that you’ve learned.
Note: Before starting make sure to clear your Environment so that we can get rid of any data objects from last class. To do this you can go to your Environment tab and click the icon that looks like a broom.
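If you prefer the console, the broom icon has a code equivalent. The lines below are a minimal sketch, assuming you are working in the global environment and really do want to remove every object in it.

```r
# Remove all objects from the current (global) environment.
# This cannot be undone, so save anything you still need first.
rm(list = ls())

# List what is left; an empty result means the environment is clear
ls()
```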
dplyr package and the gapminder package
We’ll load in tidyverse, which contains dplyr (as well as many other packages). We will also load in the gapminder library so we can continue to work with the gapminder data we’ve been using in the past few lectures.
library(tidyverse)
library(readr) # we'll use this package later in the lecture to load in files from our class website
library(gapminder)
We’ll create our own copy of the gapminder data that we’ll use in the upcoming sections.
my_gap <- gapminder
summarize() refresher
If you got to the summarize() section of last lecture, then work through this section below as a refresher; otherwise this will be new.
When analyzing a dataset, we are often interested in generating a table with statistics that summarize that data. As the name suggests, the summarize() function helps us do just that.
Let’s compute the average life expectancy and per capita GDP in our my_gap data. Before doing this, let’s filter our data so we are just looking at the year 2007.
my_gap_2007 <- filter(my_gap, year == 2007)
Now, let’s use the summarize() function. The basic syntax is
summarize(dataset, variable_name_1 = statistic, variable_name_2 = statistic, ...)
Note: both the American English spelling summarize() and the British English spelling summarise() will work. If you want to make your code a bit classier, I suggest you use the British English.
summarize(my_gap_2007,
          avg_life = mean(lifeExp),
          avg_gdp_per_cap = mean(gdpPercap))
You can use a ton of other summary statistics functions (see your
dplyr
cheatsheet).
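As a starting point, here is a small sketch of a few other summary functions applied to the 2007 data. The object names gap_2007 and more_stats and the output column names are illustrative choices, not part of the lecture code.

```r
library(dplyr)
library(gapminder)

# Keep only the 2007 observations (same filter as used earlier)
gap_2007 <- filter(gapminder, year == 2007)

# A few more summary functions: median(), sd(), min(), max(), and n()
more_stats <- summarize(gap_2007,
                        median_life = median(lifeExp),  # middle value
                        sd_life     = sd(lifeExp),      # spread around the mean
                        min_pop     = min(pop),         # smallest population
                        max_pop     = max(pop),         # largest population
                        n_countries = n())              # number of rows (one per country)
more_stats
```

Because the data has not been grouped, the result is a single row with one column per statistic.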
Create a few more summary tables using your my_gap
data
(note you may want to filter your data first as we did with year 2007).
Try to test out some of the additional summary statistics functions from
the dplyr
cheatsheet.
# Your code here
Did you learn anything interesting? If so, feel free to share what you found with the class or your neighbor.
group_by() and summarize()
As you’ve seen, the summarize()
function is really
powerful. However, when we first group our data and then summarize we
can often do so much more. Let’s see just how powerful
summarize()
is when we’ve first employed the
group_by()
function.
The group_by() function creates a “grouped” copy of a table. Subsequent dplyr operations then act on each group separately, and the results are combined into a single table.
Let’s try out an example to help make this clearer. You want to
determine the minimum, mean, and maximum life expectancy observed on
each continent in the year 2007. So first let’s group our
my_gap_2007
data by continent
my_gap_2007 <- group_by(my_gap_2007, continent) # group the data by continent
Now, let’s apply the summarize()
function to our
“grouped” dataset
summarize(my_gap_2007,
          min_life = min(lifeExp),
          mean_life = mean(lifeExp),
          max_life = max(lifeExp))
Look at that! We’ve now got a summary table telling us the minimum, average, and maximum life expectancies observed on each continent in the year 2007, and we did it with just a few lines of code. That really beats writing a for loop to iterate over each continent and compute the statistics.
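For comparison, here is a rough sketch of what the loop-based approach might look like next to the dplyr version. The object names (gap_2007, loop_stats, dplyr_stats) are made up for this illustration, and the loop is just one of several ways it could be written in base R.

```r
library(dplyr)
library(gapminder)

gap_2007 <- filter(gapminder, year == 2007)

# Loop version: build one row of statistics per continent by hand
continents <- levels(droplevels(gap_2007$continent))
loop_stats <- data.frame(continent = continents,
                         min_life  = NA_real_,
                         mean_life = NA_real_,
                         max_life  = NA_real_)

for (i in seq_along(continents)) {
  life <- gap_2007$lifeExp[gap_2007$continent == continents[i]]
  loop_stats$min_life[i]  <- min(life)
  loop_stats$mean_life[i] <- mean(life)
  loop_stats$max_life[i]  <- max(life)
}

# dplyr version: the same statistics in one short pipeline
dplyr_stats <- gap_2007 %>%
  group_by(continent) %>%
  summarize(min_life  = min(lifeExp),
            mean_life = mean(lifeExp),
            max_life  = max(lifeExp))

# Check that the two approaches agree
all.equal(loop_stats$mean_life, dplyr_stats$mean_life)
```

The loop needs a pre-allocated results table, an index variable, and manual subsetting, all of which group_by() and summarize() handle for you.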
You can even group by multiple variables. This is often incredibly
useful. For instance, we might want to see how the life expectancy
statistics by continent have changed over time. In this case we would
group by continent and year before applying the summarize()
function.
my_gap <- group_by(my_gap, continent, year) # group by continent then year
summarize(my_gap,
          min_life = min(lifeExp),
          mean_life = mean(lifeExp),
          max_life = max(lifeExp))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
Pretty cool, right?
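The message printed with the output above is informational, not an error. When you summarize data grouped by several variables, summarize() by default drops only the last grouping variable (here year), so the result is still grouped by continent. The sketch below, which assumes dplyr 1.0.0 or later, uses the .groups argument to return an ungrouped table instead. The object name life_by_cont_year is illustrative.

```r
library(dplyr)
library(gapminder)

life_by_cont_year <- gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_life = mean(lifeExp),
            .groups = "drop")   # return an ungrouped table and silence the message

# group_vars() lists the grouping columns; an empty result means "not grouped"
group_vars(life_by_cont_year)
```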
FYI, if you want to ungroup a dataset that you’ve grouped, you can use the ungroup() function. You can of course always regroup the data if you want.
Also remember that there are tons of statistics functions that you can use with summarize(). Take a look at the Summary Functions section of your dplyr cheatsheet for more info.
my_gap <- ungroup(my_gap) # ungroup the my_gap data
group_by() with other dplyr functions
While group_by() is often used along with summarize(), you can use group_by() with other dplyr functions as well.
We can grab the top three per capita GDPs for each of the years that data was collected. To do this we’ll need to group the data by year and then apply the top_n() function. I’m going to use the pipe operator to do all of this in a single chained statement.
my_gap %>%
  group_by(year) %>%
  top_n(3, gdpPercap)
You can see that this worked, but the data was sorted in alphabetical order by country. It would be more useful to have the data sorted by year. Let’s modify the code above to sort the data too.
my_gap %>%
  group_by(year) %>%
  top_n(3, gdpPercap) %>%
  arrange(year)
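A side note: since dplyr 1.0.0, top_n() has been superseded by slice_max() and slice_min(). top_n() still works, but you may see the newer functions in other people’s code. Here is a sketch of the same idea with slice_max(); the object name top3_by_year is illustrative.

```r
library(dplyr)
library(gapminder)

top3_by_year <- gapminder %>%
  group_by(year) %>%
  slice_max(gdpPercap, n = 3) %>%     # three largest gdpPercap values within each year
  ungroup() %>%
  arrange(year, desc(gdpPercap))      # sort by year, largest GDP first within a year

top3_by_year
```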
Take a look at the summary table. Did you find anything interesting?
You’ve seen how group_by() can be powerfully combined with other dplyr functions, in particular the summarize() function. Now try out several interesting things of your own. Think of a few interesting questions that you’d like to answer and answer them below.
# Your code here
gapminder data
At this point in the term we’ve established a pretty solid toolkit for programming in R and doing some data wrangling and analysis. Below are some exercises that will allow you to test out your skills, with a specific focus on the dplyr tools that you are now familiar with. Remember you can use the pipe operator %>% to easily string together many operations.
my_gap
dataset
# Your code here
my_gap
dataset
Save the table to a new object called gap_summary_table
and sort this table in descending order by the change in life
expectancy. Did you notice anything interesting? Think about what might
explain the observed changes.
Next sort this same table in ascending order based on life expectancy. Do you observe anything interesting?
Once you’ve done the above, try sorting the table according to the ratio of per capita GDP
Hint: Look at your dplyr
cheatsheet in
the “Summary Functions” section for some functions that will be useful
for the exercise above
# Your code here
Let’s take a look at some precipitation data from the National Oceanic and Atmospheric Administration (NOAA) to gain insight into annual and seasonal variability in precipitation for a US state of your choosing.
The dataset that we will use has monthly precipitation data from 1895
through 2017 for each US state. We are going to generate a summary of
annual precipitation for each year on record for your state of interest.
For example if you choose NY, you will create a dataframe that has the
total annual precipitation for NY for each year from 1895 through 2017.
If you didn’t have the tools from the dplyr package, you would have to do this using a loop and indexing the dataframe with base R (cumbersome, requires more code, harder to read, and more prone to coding errors). Thankfully we’ve learned how to use dplyr.
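To make the idea of an annual summary concrete, here is a minimal sketch on a tiny made-up table. The values and the state code "XX" are invented for illustration and are not NOAA data. The column names mirror those in the class dataset (state_cd, Year, Month, Precip_inches), and the output column name annual_precip is an arbitrary choice.

```r
library(dplyr)

# Made-up toy values for illustration only (not real NOAA data)
toy_precip <- data.frame(
  state_cd      = "XX",
  Year          = c(2000, 2000, 2000, 2001, 2001, 2001),
  Month         = c(1, 2, 3, 1, 2, 3),
  Precip_inches = c(1, 2, 3, 2, 2, 2)
)

# One row per year, holding the sum of that year's monthly values
annual_totals <- toy_precip %>%
  group_by(Year) %>%
  summarize(annual_precip = sum(Precip_inches))

annual_totals
```

The same pattern should carry over to the real data once you have selected the rows for your state.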
First we’ll load in the dataset from last week.
precip_data <- read_csv("https://stahlm.github.io/ENS_215/Data/NOAA_State_Precip_LabData.csv")
## Rows: 70848 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): state_cd
## dbl (3): Year, Month, Precip_inches
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Take a few minutes to look at the dataset to familiarize yourself
with it (e.g. view it, use the summary()
function,…).
Now let’s select a state of interest. Remember that if we weren’t
using dplyr
we would have to select the state of interest
using code like what is shown directly below.
state_2_get <- "NY" # Abbreviation code for state I want to select
state_data <- precip_data[precip_data$state_cd == state_2_get, ] # get the rows with desired state, and get all columns
That code works, but it is cumbersome to write, error-prone, and somewhat difficult to decipher. We can do much better now.
Use dplyr
to create a state_data
object
that has all of the data for your state of choosing.
# Your code here
Ok, now let’s proceed to the exercises below which will allow you to
demonstrate/test your dplyr
skills and will highlight just
how powerful the dplyr
package is.
Some of these exercises will be challenging, so make sure to consult
your dplyr
cheatsheet, discuss with your classmates and
me.
Some helpful advice: Remember to think step-by-step.
Test each step as you go. Even for an experienced programmer/scientist
it is often necessary to break the task into smaller chunks. Piping with
%>%
can make your task easier by allowing you to combine
many individual steps together. If you finish all of the exercises you
should go back and take more time to examine your results and think
about the environmental significance/implications.
Now use the tools from dplyr
to do the following:
# Your code here
# Your code here
precip_data
, which has data for all
of the states, create a single table that has
# Your code here
# Your code here
precip_data
, which has data for all
of the states, create a table that has
Hint: There are many ways to do this, however it might help to make an “intermediate” table of stats that you then operate on to get your final table (this isn’t necessary, but might help)
# Your code here
Add a column season to your state_data that has a “flag” (categorical value) that takes the value “season-1” if the month is March through August and takes the value “season-2” otherwise.
Using state_data, create a table that reports, for each year of record, the ratio of precipitation in season-1 to the total precipitation in that same year.
This table will allow us to see how the seasonal distribution of precipitation may have changed (or not) over the years on record.
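If you get stuck, the sketch below shows one possible pattern on a tiny made-up table; the numbers are invented and are not NOAA data. It assumes months are stored as the numbers 1 through 12 in a column named Month, as in the class dataset. The output column name season_1_frac is an arbitrary choice, and there are other equally valid ways to do this.

```r
library(dplyr)

# Made-up toy values for illustration only (not real NOAA data)
toy_precip <- data.frame(
  Year          = c(2000, 2000, 2000, 2000),
  Month         = c(1, 4, 7, 11),
  Precip_inches = c(2, 3, 1, 4)
)

season_ratio <- toy_precip %>%
  # Flag March through August (months 3 to 8) as "season-1", everything else as "season-2"
  mutate(season = if_else(Month %in% 3:8, "season-1", "season-2")) %>%
  group_by(Year) %>%
  # Sum only the season-1 rows, then divide by the total for the year
  summarize(season_1_frac = sum(Precip_inches[season == "season-1"]) / sum(Precip_inches))

season_ratio
```

In the toy table, season-1 accounts for 4 of the 10 inches recorded in 2000, so the ratio is 0.4.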
# Your code here