Last week you got an introduction to the dplyr
package, which allows us to transform our data in an efficient, easy to read and code manner. With dplyr
we can perform many of the most common data manipulation/transformation operations with the functions available in the package. Below is a table with the dplyr
functions we’ve learned thus far. Today we will learn some more functions in dplyr
that will greatly improve your capabilities with data manipulation and analysis.
dplyr function |
Description |
---|---|
filter() |
Subset by row values |
arrange() |
Sort rows by column values |
select() |
Subset columns |
dplyr
package and the gapminder
packageWe’ll load in tidyverse
which contains dplyr
(as well as many other packages). We will also load in the gapminder
library so we can continue to work with the gapminder
data the we used last week.
library(tidyverse)
library(gapminder)
Let’s assign the gapminder data to our own data frame that we’ll call my_gap
my_gap <- gapminder
dplyr
verbs and concepts we learned last weekfilter()
We can select the rows (observations) in a data set according to criteria that we specify. Remember the syntax is filter(dataset, criteria 1, criteria 2, ...)
filter(my_gap,
lifeExp > 65,
gdpPercap < 5000,
year == 2007)
## # A tibble: 18 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Bolivia Americas 2007 65.6 9119152 3822.
## 2 China Asia 2007 73.0 1318683096 4959.
## 3 Comoros Africa 2007 65.2 710960 986.
## 4 Honduras Americas 2007 70.2 7483763 3548.
## 5 Indonesia Asia 2007 70.6 223547000 3541.
## 6 Jordan Asia 2007 72.5 6053193 4519.
## 7 Korea, Dem. Rep. Asia 2007 67.3 23301725 1593.
## 8 Mongolia Asia 2007 66.8 2874127 3096.
## 9 Morocco Africa 2007 71.2 33757175 3820.
## 10 Nicaragua Americas 2007 72.9 5675356 2749.
## 11 Pakistan Asia 2007 65.5 169270617 2606.
## 12 Paraguay Americas 2007 71.8 6667147 4173.
## 13 Philippines Asia 2007 71.7 91077287 3190.
## 14 Sao Tome and Principe Africa 2007 65.5 199579 1598.
## 15 Sri Lanka Asia 2007 72.4 20378239 3970.
## 16 Syria Asia 2007 74.1 19314747 4185.
## 17 Vietnam Asia 2007 74.2 85262356 2442.
## 18 West Bank and Gaza Asia 2007 73.4 4018332 3025.
Try out your own filter operation. Think of something interesting to try. If you found something cool, share it with the class.
# Your code here
Take a look at your notes from last class and make sure you understand the %in%
operator and how you can use it along with filter()
arrange()
Arrange allows us to sort data by variable(s) of interest. Let’s sort my_gap
by year (descending order) and then by population (ascending order). Remember the default sort order is ascending (smallest to largest) and to sort in descending order we need to use the desc()
function.
arrange(my_gap, desc(year), pop)
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Sao Tome and Principe Africa 2007 65.5 199579 1598.
## 2 Iceland Europe 2007 81.8 301931 36181.
## 3 Djibouti Africa 2007 54.8 496374 2082.
## 4 Equatorial Guinea Africa 2007 51.6 551201 12154.
## 5 Montenegro Europe 2007 74.5 684736 9254.
## 6 Bahrain Asia 2007 75.6 708573 29796.
## 7 Comoros Africa 2007 65.2 710960 986.
## 8 Reunion Africa 2007 76.4 798094 7670.
## 9 Trinidad and Tobago Americas 2007 69.8 1056608 18009.
## 10 Swaziland Africa 2007 39.6 1133066 4513.
## # ... with 1,694 more rows
Practice with arrange()
. Think of something interesting to try. If you found something cool, share it with the class.
# Your code here
select()
The select()
function allows us to grab only the variables (columns) we want out of a data set.
You can specify the ones you want
select(my_gap, country, year, lifeExp)
## # A tibble: 1,704 x 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ... with 1,694 more rows
If you want you can rename the columns you select
select(my_gap, country, year, Life_in_years = lifeExp)
## # A tibble: 1,704 x 3
## country year Life_in_years
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ... with 1,694 more rows
Or the ones you don’t want by putting -
before the variable
select(my_gap, -pop, -gdpPercap)
## # A tibble: 1,704 x 4
## country continent year lifeExp
## <fct> <fct> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8
## 2 Afghanistan Asia 1957 30.3
## 3 Afghanistan Asia 1962 32.0
## 4 Afghanistan Asia 1967 34.0
## 5 Afghanistan Asia 1972 36.1
## 6 Afghanistan Asia 1977 38.4
## 7 Afghanistan Asia 1982 39.9
## 8 Afghanistan Asia 1987 40.8
## 9 Afghanistan Asia 1992 41.7
## 10 Afghanistan Asia 1997 41.8
## # ... with 1,694 more rows
Give select()
some more practice
# Your code here
%>%
We can pass the output from one function to be the input to another function by using the “pipe” command %>%
Let’s filter our my_gap
data and then pipe it to the arrange()
to sort the filtered data. Piping is super useful when you want to perform a sequence of operations.
filter(my_gap, year == 1952) %>% arrange(pop)
## # A tibble: 142 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Sao Tome and Principe Africa 1952 46.5 60011 880.
## 2 Djibouti Africa 1952 34.8 63149 2670.
## 3 Bahrain Asia 1952 50.9 120447 9867.
## 4 Iceland Europe 1952 72.5 147962 7268.
## 5 Comoros Africa 1952 40.7 153936 1103.
## 6 Kuwait Asia 1952 55.6 160000 108382.
## 7 Equatorial Guinea Africa 1952 34.5 216964 376.
## 8 Reunion Africa 1952 52.7 257700 2719.
## 9 Gambia Africa 1952 30 284320 485.
## 10 Swaziland Africa 1952 41.4 290243 1148.
## # ... with 132 more rows
Notice how I didn’t specify the input data in the arrange()
function. This is because we piped the output from the previous function, so the input to arrange has already been specified.
Make sure you understand what is going on with the pipe operation. If you don’t, then ask your neighbor or me.
Test out the pipe operator on your own
# Your code here
Let’s bring everything above together.
Using a single line of code perform the following sequence of operations on the my_gap
data:
Take a look at your results. Do you see anything interesting/note-worthy?
dplyr
Ok, so we’ve gotten some more practice with the dplyr
functions that we saw last week. Now, let’s learn some more tools that dplyr
has to offer.
Just in case you’ve saved modifications/changes to your my_gap
data, let’s recreate a fresh copy from the original gapminder
data before moving ahead.
my_gap <- gapminder
mutate()
We often want to create new variables (columns) in a dataset, where the new variable is a function of exisiting variables. For instance if we have a column with precipitation data in inches, we might want to create a new column that has the same precipitation data in centimeters. In this case we would simply multiply our precipitation in inches by 2.54 (number of cm per inch) to get the new, desired column.
Let’s create a new column in our my_gap
dataset that has the total GDP (i.e. per capita GDP multiplied by the population)
mutate(my_gap,
tot_gdp = gdpPercap * pop)
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap tot_gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
## 7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
## 8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
## 9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
## 10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
## # ... with 1,694 more rows
We calculated this new variable. However, remember if we want to save this information we need to assign it to a data object. Let’s save this new column to our my_gap
data
my_gap <- mutate(my_gap,
tot_gdp = gdpPercap * pop)
Take at look at your my_gap
data to confirm that you’ve added this new variable.
dplyr
functions to help you here)Ok now create a new column called pop_mill
that has the population in millions (e.g. 1,000,000 should appear as 1) and assign make sure to add this variable to your my_gap
data
# Your code here
Make sure you understand exactly what is going on in the code above. If you have any questions, discuss with your neighbor or me before moving ahead.
mutate()
We can also apply functions to the data when creating a new column (variable). We can perform just about any mathematical operation (you’ve already seen multiplication when creating a new variable) – for a list of additional operations check out your dplyr
cheatsheet.
For instance, we might want to create a new column that has the log10 of the population data. In this case we can simply employ the log10
function in our mutate()
operation
my_gap <- mutate(my_gap,
log10_pop = log10(pop))
Try out the mutate()
function to create a new variable gdp_percap_ratio
, where you divide all of the per capita GDP values, by the maximum per capita GDP observed. This will allow you to see how a given observation compares to the maximum observed.
# Your code here
Take a look at the results and think about what you are observing.
if_else()
We often want to create a new variable using mutate()
where the values are based on some conditional statement. For example, we might want to create a categorical variable where countries are labeled “lower-income” or “higher-income” based on their per capita GDP.
We can use the if_else()
function with mutate()
to do these types of operations. First let’s create a new variable income_status
and assign it to my_gap
.
my_gap <- mutate(my_gap,
income_status = if_else(gdpPercap > 7500, "higher-income","lower-income"))
Take a look at my_gap
and make sure you understand what we did here before moving on.
Now try creating your own variable using the mutate()
and if_else()
functions and add this variable to your my_gap
data
# Your code here
case_when()
The if_else()
function is great when you have just two cases that you would like to assign (e.g. “lower-income” and “higher-income”). However, there are instances where we would like to assign values based on more than two cases. In this instances we can use the case_when()
function.
Let’s change our income_status
variable to cover three cases, “low income”, “middle income”, and “high income”.
my_gap <- mutate(my_gap,
income_status = case_when(gdpPercap > 7500 ~ "high income",
gdpPercap > 3500 & gdpPercap <= 7500 ~ "middle income",
gdpPercap <= 3500 ~ "low income") )
Now create a variable life_exp_status
where:
“high life exp” if life expectancy is > 72
“med life exp” if life exp is <= 72 and > 65
“low life exp” otherwise.
# Your code here
rename()
We often want to rename columns (variables) in a dataset. Often, we’ll load in data that has a column name that we don’t like for one reason or another (too long, not descriptive, includes spaces or odd characters,…). We can use the rename()
function to do this.
Let’s rename the columns in our my_gap
data so that they are all in a consistent format/style. For this example let’s have only lower case letters in our column names and lets indicate spaces between words with and underscore _
. This means that we’ll need to rename our lifeExp
and gdpPercap
column and the other columns can remain unchanged.
my_gap <- rename(my_gap,
life_exp = lifeExp,
gdp_per_cap = gdpPercap)
select()
functionRemember how we used the select function to keep only the variable we wanted? We’ll there is some additional functionality that you can use with select()
that you will now appreciate.
Imagine we have a dataset with lots of variables and we only wanted to select variable using some criteria of their name. We can use some helper functions with select()
to perform these operations.
Imagine we just wanted the year, country, and any columns containing gdp information. Since our columns with gdp information, all have “gdp” somewhere in the name, we can use the contains()
function with select()
select(my_gap, year, country, contains("gdp"))
## # A tibble: 1,704 x 4
## year country gdp_per_cap tot_gdp
## <int> <fct> <dbl> <dbl>
## 1 1952 Afghanistan 779. 6567086330.
## 2 1957 Afghanistan 821. 7585448670.
## 3 1962 Afghanistan 853. 8758855797.
## 4 1967 Afghanistan 836. 9648014150.
## 5 1972 Afghanistan 740. 9678553274.
## 6 1977 Afghanistan 786. 11697659231.
## 7 1982 Afghanistan 978. 12598563401.
## 8 1987 Afghanistan 852. 11820990309.
## 9 1992 Afghanistan 649. 10595901589.
## 10 1997 Afghanistan 635. 14121995875.
## # ... with 1,694 more rows
Take a look at the output and make sure you understand what is going on. Also take a look at your dplyr
cheatsheet and you’ll see some other functions that you can use with select()
.
Try testing out some of these functions that you can use with select()
. While we don’t have very many columns in our current dataset, you can imagine these select functions will become more and more useful as the number of variables grows.
# Your code here
top_n()
We are often interested in the selecting rows (observations) based on their rank. For instance, we might want to just get the top 10 observations by life expectancy. We can use the top_n()
function to do this.
top_n(my_gap, 10, life_exp) # top 10 countries by life expectancy
## # A tibble: 10 x 9
## country continent year life_exp pop gdp_per_cap tot_gdp log10_pop
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Australia Oceania 2007 81.2 2.04e7 34435. 7.04e11 7.31
## 2 Hong Kong, Chi~ Asia 2002 81.5 6.76e6 30209. 2.04e11 6.83
## 3 Hong Kong, Chi~ Asia 2007 82.2 6.98e6 39725. 2.77e11 6.84
## 4 Iceland Europe 2007 81.8 3.02e5 36181. 1.09e10 5.48
## 5 Israel Asia 2007 80.7 6.43e6 25523. 1.64e11 6.81
## 6 Japan Asia 2002 82 1.27e8 28605. 3.63e12 8.10
## 7 Japan Asia 2007 82.6 1.27e8 31656. 4.04e12 8.11
## 8 Spain Europe 2007 80.9 4.04e7 28821. 1.17e12 7.61
## 9 Sweden Europe 2007 80.9 9.03e6 33860. 3.06e11 6.96
## 10 Switzerland Europe 2007 81.7 7.55e6 37506. 2.83e11 6.88
## # ... with 1 more variable: income_status <chr>
What were the top 10 countries by total GDP in 1952. Make sure to output the list in descending order by total GDP. You’ll probably want to use top_n()
in addition other dplyr
function.
# Your code here
What were the top 10 countries by total GDP in 2007. Make sure to output the list in descending order by total GDP. You’ll probably want to use top_n()
in addition other dplyr
function.
# Your code here
Did the top 10 countries change much between 1952 and 2007?
summarize()
When analyzing a dataset, we are often interested in generating a table with statistics that summarize that data. As the name suggests the summarize()
function helps us do just that.
Let’s compute average life expectancies and per capita GDP on our gap_data
. Before doing this, let’s filter our data so we are just looking at year 2007.
my_gap_2007 <- filter(my_gap, year == 2007)
Now, let’s use the summarize()
function. The basic syntax is the summarize(dataset, variable_name_1 = statistic, variable_name_2 = statistic,...)
.
Note: both the American English spelling summarize()
and British English spelling summarise()
will work.
summarize(my_gap_2007,
avg_life = mean(life_exp),
avg_gdp_per_cap = mean(gdp_per_cap) )
## # A tibble: 1 x 2
## avg_life avg_gdp_per_cap
## <dbl> <dbl>
## 1 67.0 11680.
You can use a ton of other summary statistics functions (see your dplyr
cheatsheet).
Create a few more summary tables using your my_gap
data (note you may want to filter your data first as we did with year 2007).
# Your code here
Did you learn anything interesting? If so, feel free to share what you found with the class.
If you finish early, spend some time exploring the gapminder dataset while applying the new dplyr
tools you’ve learned. Formulate some questions and use what you’ve learned to try and answer/explore them.
# your code here