Introduction to Data Manipulation (continued)

Last week you got an introduction to the dplyr package, which allows us to transform our data in an efficient, easy to read and code manner. With dplyr we can perform many of the most common data manipulation/transformation operations with the functions available in the package. Below is a table with the dplyr functions we’ve learned thus far. Today we will learn some more functions in dplyr that will greatly improve your capabilities with data manipulation and analysis.

`dplyr` function	Description
`filter()`	Subset by row values
`arrange()`	Sort rows by column values
`select()`	Subset columns

Load in the `dplyr` package and the `gapminder` package

We’ll load in tidyverse which contains dplyr (as well as many other packages). We will also load in the gapminder library so we can continue to work with the gapminder data the we used last week.

library(tidyverse)
library(gapminder)

Let’s assign the gapminder data to our own data frame that we’ll call my_gap

my_gap <- gapminder

Quick refresher `dplyr` verbs and concepts we learned last week

`filter()`

We can select the rows (observations) in a data set according to criteria that we specify. Remember the syntax is filter(dataset, criteria 1, criteria 2, ...)

filter(my_gap, 
       lifeExp > 65, 
       gdpPercap < 5000, 
       year == 2007)

## # A tibble: 18 x 6
##    country               continent  year lifeExp        pop gdpPercap
##    <fct>                 <fct>     <int>   <dbl>      <int>     <dbl>
##  1 Bolivia               Americas   2007    65.6    9119152     3822.
##  2 China                 Asia       2007    73.0 1318683096     4959.
##  3 Comoros               Africa     2007    65.2     710960      986.
##  4 Honduras              Americas   2007    70.2    7483763     3548.
##  5 Indonesia             Asia       2007    70.6  223547000     3541.
##  6 Jordan                Asia       2007    72.5    6053193     4519.
##  7 Korea, Dem. Rep.      Asia       2007    67.3   23301725     1593.
##  8 Mongolia              Asia       2007    66.8    2874127     3096.
##  9 Morocco               Africa     2007    71.2   33757175     3820.
## 10 Nicaragua             Americas   2007    72.9    5675356     2749.
## 11 Pakistan              Asia       2007    65.5  169270617     2606.
## 12 Paraguay              Americas   2007    71.8    6667147     4173.
## 13 Philippines           Asia       2007    71.7   91077287     3190.
## 14 Sao Tome and Principe Africa     2007    65.5     199579     1598.
## 15 Sri Lanka             Asia       2007    72.4   20378239     3970.
## 16 Syria                 Asia       2007    74.1   19314747     4185.
## 17 Vietnam               Asia       2007    74.2   85262356     2442.
## 18 West Bank and Gaza    Asia       2007    73.4    4018332     3025.

Try out your own filter operation. Think of something interesting to try. If you found something cool, share it with the class.

# Your code here

Take a look at your notes from last class and make sure you understand the %in% operator and how you can use it along with filter()

`arrange()`

Arrange allows us to sort data by variable(s) of interest. Let’s sort my_gap by year (descending order) and then by population (ascending order). Remember the default sort order is ascending (smallest to largest) and to sort in descending order we need to use the desc() function.

arrange(my_gap, desc(year), pop)

## # A tibble: 1,704 x 6
##    country               continent  year lifeExp     pop gdpPercap
##    <fct>                 <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Sao Tome and Principe Africa     2007    65.5  199579     1598.
##  2 Iceland               Europe     2007    81.8  301931    36181.
##  3 Djibouti              Africa     2007    54.8  496374     2082.
##  4 Equatorial Guinea     Africa     2007    51.6  551201    12154.
##  5 Montenegro            Europe     2007    74.5  684736     9254.
##  6 Bahrain               Asia       2007    75.6  708573    29796.
##  7 Comoros               Africa     2007    65.2  710960      986.
##  8 Reunion               Africa     2007    76.4  798094     7670.
##  9 Trinidad and Tobago   Americas   2007    69.8 1056608    18009.
## 10 Swaziland             Africa     2007    39.6 1133066     4513.
## # ... with 1,694 more rows

Practice with arrange(). Think of something interesting to try. If you found something cool, share it with the class.

# Your code here

`select()`

The select() function allows us to grab only the variables (columns) we want out of a data set.

You can specify the ones you want

select(my_gap, country, year, lifeExp)

## # A tibble: 1,704 x 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  1952    28.8
##  2 Afghanistan  1957    30.3
##  3 Afghanistan  1962    32.0
##  4 Afghanistan  1967    34.0
##  5 Afghanistan  1972    36.1
##  6 Afghanistan  1977    38.4
##  7 Afghanistan  1982    39.9
##  8 Afghanistan  1987    40.8
##  9 Afghanistan  1992    41.7
## 10 Afghanistan  1997    41.8
## # ... with 1,694 more rows

If you want you can rename the columns you select

select(my_gap, country, year, Life_in_years = lifeExp)

## # A tibble: 1,704 x 3
##    country      year Life_in_years
##    <fct>       <int>         <dbl>
##  1 Afghanistan  1952          28.8
##  2 Afghanistan  1957          30.3
##  3 Afghanistan  1962          32.0
##  4 Afghanistan  1967          34.0
##  5 Afghanistan  1972          36.1
##  6 Afghanistan  1977          38.4
##  7 Afghanistan  1982          39.9
##  8 Afghanistan  1987          40.8
##  9 Afghanistan  1992          41.7
## 10 Afghanistan  1997          41.8
## # ... with 1,694 more rows

Or the ones you don’t want by putting - before the variable

select(my_gap, -pop, -gdpPercap)

## # A tibble: 1,704 x 4
##    country     continent  year lifeExp
##    <fct>       <fct>     <int>   <dbl>
##  1 Afghanistan Asia       1952    28.8
##  2 Afghanistan Asia       1957    30.3
##  3 Afghanistan Asia       1962    32.0
##  4 Afghanistan Asia       1967    34.0
##  5 Afghanistan Asia       1972    36.1
##  6 Afghanistan Asia       1977    38.4
##  7 Afghanistan Asia       1982    39.9
##  8 Afghanistan Asia       1987    40.8
##  9 Afghanistan Asia       1992    41.7
## 10 Afghanistan Asia       1997    41.8
## # ... with 1,694 more rows

Give select() some more practice

# Your code here

Piping with `%>%`

We can pass the output from one function to be the input to another function by using the “pipe” command %>%

Let’s filter our my_gap data and then pipe it to the arrange() to sort the filtered data. Piping is super useful when you want to perform a sequence of operations.

filter(my_gap, year == 1952) %>% arrange(pop)

## # A tibble: 142 x 6
##    country               continent  year lifeExp    pop gdpPercap
##    <fct>                 <fct>     <int>   <dbl>  <int>     <dbl>
##  1 Sao Tome and Principe Africa     1952    46.5  60011      880.
##  2 Djibouti              Africa     1952    34.8  63149     2670.
##  3 Bahrain               Asia       1952    50.9 120447     9867.
##  4 Iceland               Europe     1952    72.5 147962     7268.
##  5 Comoros               Africa     1952    40.7 153936     1103.
##  6 Kuwait                Asia       1952    55.6 160000   108382.
##  7 Equatorial Guinea     Africa     1952    34.5 216964      376.
##  8 Reunion               Africa     1952    52.7 257700     2719.
##  9 Gambia                Africa     1952    30   284320      485.
## 10 Swaziland             Africa     1952    41.4 290243     1148.
## # ... with 132 more rows

Notice how I didn’t specify the input data in the arrange() function. This is because we piped the output from the previous function, so the input to arrange has already been specified.

Make sure you understand what is going on with the pipe operation. If you don’t, then ask your neighbor or me.

Test out the pipe operator on your own

# Your code here

Exercise

Let’s bring everything above together.

Using a single line of code perform the following sequence of operations on the my_gap data:

Remove the continent column
Keep only the rows for the years 1952 and 2007
Keep only the rows that have countries with per capita GDP > 10000
Sort the data by year in ascending order and then by per capita GDP in descending order

Take a look at your results. Do you see anything interesting/note-worthy?

More `dplyr`

Ok, so we’ve gotten some more practice with the dplyr functions that we saw last week. Now, let’s learn some more tools that dplyr has to offer.

Just in case you’ve saved modifications/changes to your my_gap data, let’s recreate a fresh copy from the original gapminder data before moving ahead.

my_gap <- gapminder

Create new variables with `mutate()`

We often want to create new variables (columns) in a dataset, where the new variable is a function of exisiting variables. For instance if we have a column with precipitation data in inches, we might want to create a new column that has the same precipitation data in centimeters. In this case we would simply multiply our precipitation in inches by 2.54 (number of cm per inch) to get the new, desired column.

Let’s create a new column in our my_gap dataset that has the total GDP (i.e. per capita GDP multiplied by the population)

mutate(my_gap, 
       tot_gdp = gdpPercap * pop)

## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap      tot_gdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # ... with 1,694 more rows

We calculated this new variable. However, remember if we want to save this information we need to assign it to a data object. Let’s save this new column to our my_gap data

my_gap <- mutate(my_gap, 
                 tot_gdp = gdpPercap * pop)

Take at look at your my_gap data to confirm that you’ve added this new variable.

In 1952 which country had the largest total GDP? (Rely on the dplyr functions to help you here)
In 2007 which country had the largest total GDP?

Ok now create a new column called pop_mill that has the population in millions (e.g. 1,000,000 should appear as 1) and assign make sure to add this variable to your my_gap data

# Your code here

Make sure you understand exactly what is going on in the code above. If you have any questions, discuss with your neighbor or me before moving ahead.

Vector functions with `mutate()`

We can also apply functions to the data when creating a new column (variable). We can perform just about any mathematical operation (you’ve already seen multiplication when creating a new variable) – for a list of additional operations check out your dplyr cheatsheet.

For instance, we might want to create a new column that has the log₁₀ of the population data. In this case we can simply employ the log10 function in our mutate() operation

my_gap <- mutate(my_gap, 
                 log10_pop = log10(pop))

Try out the mutate() function to create a new variable gdp_percap_ratio, where you divide all of the per capita GDP values, by the maximum per capita GDP observed. This will allow you to see how a given observation compares to the maximum observed.

# Your code here

Take a look at the results and think about what you are observing.

Conditional statements with `if_else()`

We often want to create a new variable using mutate() where the values are based on some conditional statement. For example, we might want to create a categorical variable where countries are labeled “lower-income” or “higher-income” based on their per capita GDP.

We can use the if_else() function with mutate() to do these types of operations. First let’s create a new variable income_status and assign it to my_gap.

my_gap <- mutate(my_gap, 
                 income_status = if_else(gdpPercap > 7500, "higher-income","lower-income"))

Take a look at my_gap and make sure you understand what we did here before moving on.

Now try creating your own variable using the mutate() and if_else() functions and add this variable to your my_gap data

# Your code here

Conditional statements with `case_when()`

The if_else() function is great when you have just two cases that you would like to assign (e.g. “lower-income” and “higher-income”). However, there are instances where we would like to assign values based on more than two cases. In this instances we can use the case_when() function.

Let’s change our income_status variable to cover three cases, “low income”, “middle income”, and “high income”.

my_gap <- mutate(my_gap, 
                 income_status = case_when(gdpPercap > 7500 ~ "high income", 
                                           gdpPercap > 3500 & gdpPercap <= 7500 ~ "middle income", 
                                           gdpPercap <= 3500 ~ "low income") )

Now create a variable life_exp_status where:
“high life exp” if life expectancy is > 72
“med life exp” if life exp is <= 72 and > 65
“low life exp” otherwise.

# Your code here

Rename variables with `rename()`

We often want to rename columns (variables) in a dataset. Often, we’ll load in data that has a column name that we don’t like for one reason or another (too long, not descriptive, includes spaces or odd characters,…). We can use the rename() function to do this.

Let’s rename the columns in our my_gap data so that they are all in a consistent format/style. For this example let’s have only lower case letters in our column names and lets indicate spaces between words with and underscore _. This means that we’ll need to rename our lifeExp and gdpPercap column and the other columns can remain unchanged.

my_gap <- rename(my_gap, 
                 life_exp = lifeExp, 
                 gdp_per_cap = gdpPercap)

Quick aside regarding the `select()` function

Remember how we used the select function to keep only the variable we wanted? We’ll there is some additional functionality that you can use with select() that you will now appreciate.

Imagine we have a dataset with lots of variables and we only wanted to select variable using some criteria of their name. We can use some helper functions with select() to perform these operations.

Imagine we just wanted the year, country, and any columns containing gdp information. Since our columns with gdp information, all have “gdp” somewhere in the name, we can use the contains() function with select()

select(my_gap, year, country, contains("gdp"))

## # A tibble: 1,704 x 4
##     year country     gdp_per_cap      tot_gdp
##    <int> <fct>             <dbl>        <dbl>
##  1  1952 Afghanistan        779.  6567086330.
##  2  1957 Afghanistan        821.  7585448670.
##  3  1962 Afghanistan        853.  8758855797.
##  4  1967 Afghanistan        836.  9648014150.
##  5  1972 Afghanistan        740.  9678553274.
##  6  1977 Afghanistan        786. 11697659231.
##  7  1982 Afghanistan        978. 12598563401.
##  8  1987 Afghanistan        852. 11820990309.
##  9  1992 Afghanistan        649. 10595901589.
## 10  1997 Afghanistan        635. 14121995875.
## # ... with 1,694 more rows

Take a look at the output and make sure you understand what is going on. Also take a look at your dplyr cheatsheet and you’ll see some other functions that you can use with select().

Try testing out some of these functions that you can use with select(). While we don’t have very many columns in our current dataset, you can imagine these select functions will become more and more useful as the number of variables grows.

# Your code here

Select n rows by a variable ranking `top_n()`

We are often interested in the selecting rows (observations) based on their rank. For instance, we might want to just get the top 10 observations by life expectancy. We can use the top_n() function to do this.

top_n(my_gap, 10, life_exp) # top 10 countries by life expectancy

## # A tibble: 10 x 9
##    country         continent  year life_exp    pop gdp_per_cap tot_gdp log10_pop
##    <fct>           <fct>     <int>    <dbl>  <int>       <dbl>   <dbl>     <dbl>
##  1 Australia       Oceania    2007     81.2 2.04e7      34435. 7.04e11      7.31
##  2 Hong Kong, Chi~ Asia       2002     81.5 6.76e6      30209. 2.04e11      6.83
##  3 Hong Kong, Chi~ Asia       2007     82.2 6.98e6      39725. 2.77e11      6.84
##  4 Iceland         Europe     2007     81.8 3.02e5      36181. 1.09e10      5.48
##  5 Israel          Asia       2007     80.7 6.43e6      25523. 1.64e11      6.81
##  6 Japan           Asia       2002     82   1.27e8      28605. 3.63e12      8.10
##  7 Japan           Asia       2007     82.6 1.27e8      31656. 4.04e12      8.11
##  8 Spain           Europe     2007     80.9 4.04e7      28821. 1.17e12      7.61
##  9 Sweden          Europe     2007     80.9 9.03e6      33860. 3.06e11      6.96
## 10 Switzerland     Europe     2007     81.7 7.55e6      37506. 2.83e11      6.88
## # ... with 1 more variable: income_status <chr>

What were the top 10 countries by total GDP in 1952. Make sure to output the list in descending order by total GDP. You’ll probably want to use top_n() in addition other dplyr function.

# Your code here

What were the top 10 countries by total GDP in 2007. Make sure to output the list in descending order by total GDP. You’ll probably want to use top_n() in addition other dplyr function.

# Your code here

Did the top 10 countries change much between 1952 and 2007?

`summarize()`

When analyzing a dataset, we are often interested in generating a table with statistics that summarize that data. As the name suggests the summarize() function helps us do just that.

Let’s compute average life expectancies and per capita GDP on our gap_data. Before doing this, let’s filter our data so we are just looking at year 2007.

my_gap_2007 <- filter(my_gap, year == 2007)

Now, let’s use the summarize() function. The basic syntax is the summarize(dataset, variable_name_1 = statistic, variable_name_2 = statistic,...).

Note: both the American English spelling summarize() and British English spelling summarise() will work.

summarize(my_gap_2007, 
          avg_life = mean(life_exp), 
          avg_gdp_per_cap = mean(gdp_per_cap) )

## # A tibble: 1 x 2
##   avg_life avg_gdp_per_cap
##      <dbl>           <dbl>
## 1     67.0          11680.

You can use a ton of other summary statistics functions (see your dplyr cheatsheet).

Create a few more summary tables using your my_gap data (note you may want to filter your data first as we did with year 2007).

# Your code here

Did you learn anything interesting? If so, feel free to share what you found with the class.

Exercises

If you finish early, spend some time exploring the gapminder dataset while applying the new dplyr tools you’ve learned. Formulate some questions and use what you’ve learned to try and answer/explore them.

# your code here

Introduction to Data Manipulation (continued)

ENS-215

18-Jan-2023

Load in the `dplyr` package and the `gapminder` package

Quick refresher `dplyr` verbs and concepts we learned last week

`filter()`

`arrange()`

`select()`

Piping with `%>%`

Exercise

More `dplyr`

Create new variables with `mutate()`

Vector functions with `mutate()`

Conditional statements with `if_else()`

Conditional statements with `case_when()`

Rename variables with `rename()`

Quick aside regarding the `select()` function

Select n rows by a variable ranking `top_n()`

`summarize()`

Exercises

Introduction to Data Manipulation (continued)

ENS-215

18-Jan-2023

Load in the dplyr package and the gapminder package

Quick refresher dplyr verbs and concepts we learned last week

filter()

arrange()

select()

Piping with %>%

Exercise

More dplyr

Create new variables with mutate()

Vector functions with mutate()

Conditional statements with if_else()

Conditional statements with case_when()

Rename variables with rename()

Quick aside regarding the select() function

Select n rows by a variable ranking top_n()

summarize()

Exercises

Load in the `dplyr` package and the `gapminder` package

Quick refresher `dplyr` verbs and concepts we learned last week

`filter()`

`arrange()`

`select()`

Piping with `%>%`

More `dplyr`

Create new variables with `mutate()`

Vector functions with `mutate()`

Conditional statements with `if_else()`

Conditional statements with `case_when()`

Rename variables with `rename()`

Quick aside regarding the `select()` function

Select n rows by a variable ranking `top_n()`

`summarize()`