Introduction to Data Manipulation (continued)

Last week you got an introduction to the dplyr package, which lets us transform data with code that is efficient and easy to read and write. The functions in dplyr cover many of the most common data manipulation and transformation operations. Below is a table of the dplyr functions we’ve learned so far. Today we will learn more dplyr functions that will greatly expand what you can do with data manipulation and analysis.

`dplyr` function	Description
`filter()`	Subset by row values
`arrange()`	Sort rows by column values
`select()`	Subset columns

Load in the `dplyr` package and the `gapminder` package

We’ll load tidyverse, which contains dplyr (as well as many other packages). We will also load the gapminder package so we can continue to work with the gapminder data that we used last week.

library(tidyverse)
library(gapminder)

Let’s assign the gapminder data to our own data frame that we’ll call my_gap

my_gap <- gapminder

Quick refresher on the `dplyr` verbs and concepts we learned last week

`filter()`

We can select the rows (observations) in a data set according to criteria that we specify. Remember the syntax is filter(dataset, criterion 1, criterion 2, ...). A row is kept only if it meets all of the criteria.

filter(my_gap, 
       lifeExp > 65, 
       gdpPercap < 5000, 
       year == 2007)

Try out your own filter operation. Think of something interesting to try. If you found something cool, share it with the class.

# Your code here

Take a look at your notes from last class and make sure you understand the %in% operator and how you can use it along with filter()
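
As a reminder of how %in% pairs with filter(), here is a minimal sketch. The four countries and the object names nordic and nordic_or are only an illustration; any vector of values works the same way.

```r
library(dplyr)
library(gapminder)

my_gap <- gapminder

# %in% is TRUE for each row whose country matches any value in the vector
nordic <- filter(my_gap,
                 country %in% c("Denmark", "Finland", "Norway", "Sweden"),
                 year == 2007)

# The same filter written with | (or) is longer and easier to get wrong
nordic_or <- filter(my_gap,
                    country == "Denmark" | country == "Finland" |
                      country == "Norway" | country == "Sweden",
                    year == 2007)

nordic
```

Both versions keep the same rows; %in% simply scales better as the list of values grows.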

`arrange()`

Arrange allows us to sort data by variable(s) of interest. Let’s sort my_gap by year (descending order) and then by population (ascending order). Remember the default sort order is ascending (smallest to largest) and to sort in descending order we need to use the desc() function.

arrange(my_gap, desc(year), pop)

Practice with arrange(). Think of something interesting to try. If you found something cool, share it with the class.

# Your code here

`select()`

The select() function allows us to grab only the variables (columns) we want out of a data set.

You can specify the ones you want

select(my_gap, country, year, lifeExp)

If you want you can rename the columns you select

select(my_gap, country, year, Life_in_years = lifeExp)

Or you can specify the ones you don’t want by putting - before the variable name

select(my_gap, -pop, -gdpPercap)

Give select() some more practice

# Your code here

Piping with `%>%`

We can pass the output of one function to another function as its input by using the “pipe” operator %>%

Let’s filter our my_gap data and then pipe the result to arrange() to sort the filtered data. Piping is super useful when you want to perform a sequence of operations.

filter(my_gap, year == 1952) %>% 
  arrange(pop)

Notice how I didn’t specify the input data in the arrange() function. This is because we piped the output from the previous function, so the input to arrange has already been specified.
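
If it helps, you can compare a piped sequence with the equivalent nested call. This is just a sketch; the object names piped and nested are made up for the comparison.

```r
library(dplyr)
library(gapminder)

my_gap <- gapminder

# Piped version: the result of each step becomes the first argument of the next
piped <- my_gap %>%
  filter(year == 1952) %>%
  arrange(pop)

# Equivalent nested version: has to be read from the inside out
nested <- arrange(filter(my_gap, year == 1952), pop)

# Compare the two results
all.equal(piped, nested)
```

The piped version reads in the same order that the operations happen, which is the main reason to prefer it for longer sequences.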

Make sure you understand what is going on with the pipe operation. If you don’t, then ask your neighbor or me.

Test out the pipe operator on your own

# Your code here

Exercise

Let’s bring everything above together.

Using a single chain of piped operations, perform the following sequence of operations on the my_gap data:

Remove the continent column
Keep only the rows for the years 1952 and 2007
Keep only the rows that have countries with per capita GDP > 10000
Sort the data by year in ascending order and then by per capita GDP in descending order

Take a look at your results. Do you see anything interesting or noteworthy?

More `dplyr`

Ok, so we’ve gotten some more practice with the dplyr functions that we saw last week. Now, let’s learn some more tools that dplyr has to offer.

Just in case you’ve saved modifications/changes to your my_gap data, let’s recreate a fresh copy from the original gapminder data before moving ahead.

my_gap <- gapminder

Create new variables with `mutate()`

We often want to create new variables (columns) in a dataset, where the new variable is a function of existing variables. For instance, if we have a column with precipitation data in inches, we might want to create a new column that has the same precipitation data in centimeters. In this case we would simply multiply our precipitation in inches by 2.54 (the number of cm per inch) to get the new, desired column.
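
The precipitation example could look something like the sketch below. The precip data frame and its column names are hypothetical, made up only to show the conversion.

```r
library(dplyr)

# Hypothetical precipitation data, in inches
precip <- tibble(
  station   = c("A", "B", "C"),
  precip_in = c(1, 2.5, 0)
)

# New column = existing column * 2.54 (cm per inch)
precip <- mutate(precip, precip_cm = precip_in * 2.54)

precip
```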

Let’s create a new column in our my_gap dataset that has the total GDP (i.e. per capita GDP multiplied by the population)

mutate(my_gap, 
       tot_gdp = gdpPercap * pop)

We calculated this new variable. However, remember if we want to save this information we need to assign it to a data object. Let’s save this new column to our my_gap data

my_gap <- mutate(my_gap, 
                 tot_gdp = gdpPercap * pop)

Take a look at your my_gap data to confirm that you’ve added this new variable.

In 1952 which country had the largest total GDP? (Rely on the dplyr functions to help you here)
In 2007 which country had the largest total GDP?

Ok, now create a new column called pop_mill that has the population in millions (e.g. 1,000,000 should appear as 1) and make sure to add this variable to your my_gap data

# Your code here

Make sure you understand exactly what is going on in the code above. If you have any questions, discuss with your neighbor or me before moving ahead.

Vector functions with `mutate()`

We can also apply functions to the data when creating a new column (variable). We can perform just about any mathematical operation (you’ve already seen multiplication when creating a new variable) – for a list of additional operations check out your dplyr cheatsheet.

For instance, we might want to create a new column that has the log₁₀ of the population data. In this case we can simply employ the log10 function in our mutate() operation

my_gap <- mutate(my_gap, 
                 log10_pop = log10(pop))
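
It can help to distinguish two kinds of functions inside mutate(): those that return one value per row, and those that return a single value that R then reuses for every row. The sketch below shows both; gap_2007, sqrt_pop and life_exp_diff are names chosen just for this illustration.

```r
library(dplyr)
library(gapminder)

gap_2007 <- filter(gapminder, year == 2007)

gap_2007 <- mutate(gap_2007,
                   # sqrt() is vectorized: one output value per input row
                   sqrt_pop = sqrt(pop),
                   # mean() returns a single number, which is recycled across all rows
                   life_exp_diff = lifeExp - mean(lifeExp))

select(gap_2007, country, pop, sqrt_pop, lifeExp, life_exp_diff)
```

A positive life_exp_diff means a country was above the 2007 average life expectancy, and a negative value means it was below.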

Try out the mutate() function to create a new variable gdp_percap_ratio, where you divide each per capita GDP value by the maximum per capita GDP observed. This will allow you to see how a given observation compares to the maximum observed.

# Your code here

Take a look at the results and think about what you are observing.

Conditional statements with `if_else()`

We often want to create a new variable using mutate() where the values are based on some conditional statement. For example, we might want to create a categorical variable where countries are labeled “lower-income” or “higher-income” based on their per capita GDP.

We can use the if_else() function with mutate() to do these types of operations. First let’s create a new variable income_status and assign it to my_gap.

my_gap <- mutate(my_gap, 
                 income_status = if_else(gdpPercap > 7500, "higher-income","lower-income"))

Take a look at my_gap and make sure you understand what we did here before moving on.
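
To see what if_else() does on its own, you can try it on a small vector outside of mutate(). The values in gdp below are hypothetical and picked to include the threshold itself and a missing value.

```r
library(dplyr)

# Hypothetical per capita GDP values, including one missing value
gdp <- c(500, 7500, 12000, NA)

# Arguments: condition, value if TRUE, value if FALSE
if_else(gdp > 7500, "higher-income", "lower-income")

# The optional missing argument sets the label used when the condition is NA
status <- if_else(gdp > 7500, "higher-income", "lower-income",
                  missing = "unknown")
status
```

Note that a value of exactly 7500 is labeled “lower-income” because the condition uses > rather than >=.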

Now try creating your own variable using the mutate() and if_else() functions and add this variable to your my_gap data

# Your code here

Conditional statements with `case_when()`

The if_else() function is great when you have just two cases that you would like to assign (e.g. “lower-income” and “higher-income”). However, there are instances where we would like to assign values based on more than two cases. In these instances we can use the case_when() function.

Let’s change our income_status variable to cover three cases, “low income”, “middle income”, and “high income”.

my_gap <- mutate(my_gap, 
                 income_status = case_when(gdpPercap > 7500 ~ "high income", 
                                           gdpPercap > 3500 & gdpPercap <= 7500 ~ "middle income", 
                                           gdpPercap <= 3500 ~ "low income") )
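
One detail worth knowing: case_when() checks its conditions from top to bottom and uses the first one that is TRUE. The sketch below, on a hypothetical gdp vector, shows that this lets later conditions be shorter, and that TRUE can serve as a catch-all “otherwise” case.

```r
library(dplyr)

# Hypothetical per capita GDP values
gdp <- c(900, 3500, 5000, 7500, 20000)

# Fully spelled-out conditions, as in the code above
explicit <- case_when(gdp > 7500 ~ "high income",
                      gdp > 3500 & gdp <= 7500 ~ "middle income",
                      gdp <= 3500 ~ "low income")

# The first TRUE condition wins, so anything reaching the second line
# is already known to be <= 7500; TRUE catches everything left over
shorter <- case_when(gdp > 7500 ~ "high income",
                     gdp > 3500 ~ "middle income",
                     TRUE ~ "low income")

identical(explicit, shorter)
```

One difference to keep in mind: with the TRUE catch-all, a missing (NA) value would also receive the fallback label, whereas the fully spelled-out version would return NA for it.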

Now create a variable life_exp_status where:
“high life exp” if life expectancy is > 72
“med life exp” if life exp is <= 72 and > 65
“low life exp” otherwise.

# Your code here

Rename variables with `rename()`

We often want to rename columns (variables) in a dataset. Often, we’ll load in data that has a column name that we don’t like for one reason or another (too long, not descriptive, includes spaces or odd characters,…). We can use the rename() function to do this.

Let’s rename the columns in our my_gap data so that they are all in a consistent format/style. For this example let’s have only lower case letters in our column names and let’s indicate spaces between words with an underscore _. This means that we’ll need to rename our lifeExp and gdpPercap columns, and the other columns can remain unchanged. The syntax is rename(dataset, new_name = old_name).

my_gap <- rename(my_gap, 
                 life_exp = lifeExp, 
                 gdp_per_cap = gdpPercap)
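
Column names with spaces or odd characters need backticks when you refer to them. Here is a small sketch with a made-up data frame (raw and its column names are hypothetical):

```r
library(dplyr)

# Hypothetical data frame with awkward column names
raw <- tibble(`Site Name` = c("A", "B"),
              `Temp (C)`  = c(12.1, 15.3))

# rename() takes new_name = old_name; backticks wrap names that contain spaces
clean <- rename(raw,
                site_name = `Site Name`,
                temp_c    = `Temp (C)`)

names(clean)
```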

Quick aside regarding the `select()` function

Remember how we used the select() function to keep only the variables we wanted? Well, there is some additional functionality that you can use with select() that you will now appreciate.

Imagine we have a dataset with lots of variables and we only want to select variables based on some feature of their names. We can use some helper functions with select() to perform these operations.

Imagine we just wanted the year, country, and any columns containing gdp information. Since our columns with gdp information all have “gdp” somewhere in the name, we can use the contains() function with select()

select(my_gap, year, country, contains("gdp"))

Take a look at the output and make sure you understand what is going on. Also take a look at your dplyr cheatsheet and you’ll see some other functions that you can use with select().
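
Two of the other helpers, starts_with() and ends_with(), work the same way as contains(). The sketch below uses a made-up data frame called wide, since these helpers are easier to appreciate when several columns share a naming pattern.

```r
library(dplyr)

# Hypothetical data frame where column names follow a pattern
wide <- tibble(site = "A",
               temp_min = 1, temp_max = 9,
               precip_min = 0, precip_max = 4)

# Keep site plus every column whose name begins with "temp"
temp_cols <- select(wide, site, starts_with("temp"))
names(temp_cols)

# Keep every column whose name ends with "_max"
max_cols <- select(wide, ends_with("_max"))
names(max_cols)
```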

Try testing out some of these functions that you can use with select(). While we don’t have very many columns in our current dataset, you can imagine these select functions will become more and more useful as the number of variables grows.

# Your code here

Select n rows by a variable ranking `top_n()`

We are often interested in selecting rows (observations) based on their rank. For instance, we might want to just get the top 10 observations by life expectancy. We can use the top_n() function to do this.

top_n(my_gap, 10, life_exp) # top 10 rows (country-year observations) by life expectancy
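
Two things to be aware of with top_n(): it does not sort its output, and in dplyr 1.0.0 and later it has been superseded by slice_max() (and slice_min() for the smallest values). The sketch below compares the two; it rebuilds my_gap from gapminder so it runs on its own, and top_rows and top_rows_sorted are names made up for the comparison.

```r
library(dplyr)
library(gapminder)

my_gap <- rename(gapminder, life_exp = lifeExp)

# top_n() keeps the 10 rows with the largest life_exp,
# leaving them in their original row order
top_rows <- top_n(my_gap, 10, life_exp)

# slice_max() keeps the same rows and returns them
# ordered from largest to smallest life_exp
top_rows_sorted <- slice_max(my_gap, life_exp, n = 10)

top_rows_sorted
```

Either function works for this class; if you use top_n(), pipe the result to arrange() when you need the rows in ranked order.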

What were the top 10 countries by total GDP in 1952? Make sure to output the list in descending order by total GDP. You’ll probably want to use top_n() in addition to other dplyr functions.

# Your code here

What were the top 10 countries by total GDP in 2007? Make sure to output the list in descending order by total GDP. You’ll probably want to use top_n() in addition to other dplyr functions.

# Your code here

Did the top 10 countries change much between 1952 and 2007?

`summarize()`

When analyzing a dataset, we are often interested in generating a table with statistics that summarize that data. As the name suggests the summarize() function helps us do just that.

Let’s compute the average life expectancy and per capita GDP in our my_gap data. Before doing this, let’s filter our data so we are just looking at the year 2007.

my_gap_2007 <- filter(my_gap, year == 2007)

Now, let’s use the summarize() function. The basic syntax is summarize(dataset, variable_name_1 = statistic, variable_name_2 = statistic, ...).

Note: both the American English spelling summarize() and British English spelling summarise() will work.

summarize(my_gap_2007, 
          avg_life = mean(life_exp), 
          avg_gdp_per_cap = mean(gdp_per_cap) )
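
As a sketch of a few other summary functions, the example below counts rows with n() and computes a median, a standard deviation, and a maximum. To keep it self-contained it starts from the original gapminder data, so it uses the original column names (lifeExp, gdpPercap); gap_2007 and summary_2007 are names made up for this example.

```r
library(dplyr)
library(gapminder)

gap_2007 <- filter(gapminder, year == 2007)

summary_2007 <- summarize(gap_2007,
                          n_countries     = n(),             # number of rows
                          median_life     = median(lifeExp), # middle value
                          sd_life         = sd(lifeExp),     # spread around the mean
                          max_gdp_per_cap = max(gdpPercap))  # largest value

summary_2007
```

The result is a one-row table with one column per statistic.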

You can use a ton of other summary statistics functions (see your dplyr cheatsheet).

Create a few more summary tables using your my_gap data (note you may want to filter your data first as we did with year 2007).

# Your code here

Did you learn anything interesting? If so, feel free to share what you found with the class.

Exercises

If you finish early, spend some time exploring the gapminder dataset while applying the new dplyr tools you’ve learned. Formulate some questions and use what you’ve learned to try and answer/explore them.

# your code here

Introduction to Data Manipulation (continued)

ENS-215

20-Jan-2025
