You’ve now come to see that in many cases datasets are not in the ideal format/structure for analysis. You have learned some key concepts in cleaning (tidying) and reshaping data so that it is in the ideal format for computation and analysis. The tools you learned were largely in the dplyr and tidyr packages.
It is often the case that the data you need will be distributed among a number of different datasets. Once you have the datasets in good shape for analysis, you will often need to merge/join them together.
Today we will learn some important concepts related to merging/joining datasets, and we will learn how to implement these concepts using tools from the dplyr package.
Let’s first load in the packages that we’ll need for today’s analysis.
library(tidyverse)
Often you will have multiple files containing data for the same project/study. In these cases you will want to combine the multiple datasets into a single dataset.
Now, let’s learn how we do these types of operations.
You may have multiple related datasets that have the same variables (columns). Dealing with a single, combined dataset is much easier, so in this case you want to combine rows (observations).
You’ve already seen the product of this type of operation – the precipitation data used in Lab 2 was originally 50 individual files (one for each US state) that I then combined into a single large dataset. Working with a single dataset is much easier than if we had worked with 50 individual ones. This combined dataset had the exact same columns (variables) as each of the individual datasets, though it had many more rows (observations).
bind_rows
The bind_rows function takes two datasets that have the same columns (variables) and simply stacks them, one on top of the other, to create a larger dataset.
Let’s load in two data files that each contain NOAA precipitation data. The first file has monthly precip for Massachusetts from 2015-2017 and the other file has the same but for New York.
precip_MA <- read_csv("https://stahlm.github.io/ENS_215/Data/precip_MA.csv")
precip_NY <- read_csv("https://stahlm.github.io/ENS_215/Data/precip_NY.csv")
Take a look at each dataset to confirm that they have the same variables (columns).
Ok, now it will be much easier to work with these datasets in R if they are in a single data frame. We’ll do this with bind_rows.
precip_combined <- bind_rows(precip_MA, precip_NY)
Let’s keep precip_combined as is for now, since we will want to use it later on in today’s exercises.
This stacked the datasets one on top of the other. Take a look to confirm all looks well.
You can bind more than two datasets at a time if you want. Simply add any additional datasets to your bind_rows() function call.
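As a quick sketch (using made-up data frames, not today’s NOAA files), binding three data frames at once looks like this:

```r
library(dplyr)

# Three small data frames with identical columns (hypothetical data)
df_a <- tibble(site = c("A", "A"), value = c(1, 2))
df_b <- tibble(site = c("B", "B"), value = c(3, 4))
df_c <- tibble(site = c("C", "C"), value = c(5, 6))

# bind_rows() accepts any number of data frames and stacks them in order
df_all <- bind_rows(df_a, df_b, df_c)

df_all  # 6 rows, 2 columns
```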
bind_rows() is smart and will bind two datasets even if the columns are not in the same order. It will intelligently bind the rows from the same column (variable) together.
Test this out. Switch the ordering of the columns in one of the data frames (hint select()
can help you do this) and then bind the two data frames to verify that the binding still works correctly.
# Your code here
Since bind_rows() binds rows from the same column (variable), it can also deal with binding data frames that might only have some columns in common. Try deleting (or adding) a column to one of the data frames and then bind it to the other data frame. You’ll see that missing data is dealt with by filling with NA.
# Your code here
intersect
Image source: Wikipedia.org
The intersect() function keeps the rows that appear in both of the datasets under consideration.
There are many situations where we need to identify the rows in common between two (or more) datasets. A very common example is where you have two datasets containing names (e.g. states, elements, species) and you want to find the names that are in both datasets. An intersect operation allows you to do this.
Let’s consider another example. You would like to perform analysis examining how water quality in a stream affects the health of fish. The USGS has collected water quality samples periodically on this stream. The EPA has collected fish tissue samples on this same stream, though their efforts were not coordinated so they didn’t always sample on the same day. You would like to only analyze data for which there were both fish and water samples collected on the same day.
Let’s load in tables of sampling dates from each of the agencies (EPA and USGS).
dates_EPA <- read_csv("https://stahlm.github.io/ENS_215/Data/Sampling_dates_EPA_fish_tissue.csv")
dates_USGS <- read_csv("https://stahlm.github.io/ENS_215/Data/Sampling_dates_USGS_water_quality.csv")
Take a look at each of the data frames. You’ll see that they each have a Year, Month, and Day column.
To determine which dates they have in common we can use intersect.
dates_common <- intersect(dates_EPA, dates_USGS)
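To make the behavior concrete, here is a minimal sketch with made-up dates (not the EPA/USGS files):

```r
library(dplyr)

# Two small tables of sampling dates (hypothetical values)
dates_a <- tibble(Year = 2020, Month = c(1, 2, 3), Day = c(5, 10, 15))
dates_b <- tibble(Year = 2020, Month = c(2, 3, 4), Day = c(10, 20, 25))

# intersect() keeps only the rows that appear in BOTH tables
dates_both <- intersect(dates_a, dates_b)

dates_both  # one row: Year 2020, Month 2, Day 10
```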
setdiff
The difference between sets can be obtained using the setdiff() function. The setdiff() function returns the rows that appear in set A but not in set B, where A is the first set specified in setdiff() and B is the 2nd set specified.
Let’s continue using our USGS and EPA sampling dates to test out the setdiff() function. First let’s determine the dates where there are only USGS (water quality) samples.
dates_only_USGS <- setdiff(dates_USGS, dates_EPA)
Now you should determine the days when there are only EPA (fish tissue) samples.
# Your code here
union of sets
The union of sets finds the rows that appear in set A or set B (as illustrated in the Venn diagram above). Duplicates are removed – thus any row that appears in both A and B will only appear ONCE in the union.
Let’s take the union of the USGS and EPA sampling dates. This will tell us how many unique dates samples were collected on.
dates_USGS_EPA <- union(dates_USGS, dates_EPA)
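A minimal sketch with made-up dates (not the agency files) shows the duplicate removal:

```r
library(dplyr)

# Two small tables that share one row (hypothetical values)
dates_a <- tibble(Year = 2020, Month = c(1, 2), Day = c(5, 10))
dates_b <- tibble(Year = 2020, Month = c(2, 3), Day = c(10, 20))

# union() keeps rows from either table; the shared row appears only once
dates_either <- union(dates_a, dates_b)

nrow(dates_either)  # 3, not 4 -- the duplicated row is kept just once
```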
How many rows are in the dates_USGS_EPA data frame? How does this compare to the number of rows in dates_USGS and dates_EPA?
bind_cols
Bind columns is a very straightforward function. It simply pastes two data frames side-by-side, thus making a single, larger data frame. While bind_cols() is useful, you should be careful when using it as it does not do anything to ensure that observations (rows) are properly matched – thus it is up to you to make sure that the ordering of the rows makes sense before binding two data frames with bind_cols().
Let’s try out a quick example.
First let’s create two new data frames
years_USGS <- select(dates_USGS, Year)
months_days_USGS <- select(dates_USGS, Month, Day)
Take a look at both of these data frames. Once it makes sense to you, then let’s use bind_cols() to join the data. While this is a bit of an artificial example, it nonetheless highlights how you can use bind_cols().
bind_example <- bind_cols(years_USGS, months_days_USGS)
head(bind_example)
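To see why the row-order warning matters, here is a small sketch (toy data, not the USGS dates): shuffling one data frame before binding produces no error, just silently misaligned observations.

```r
library(dplyr)

sites <- tibble(site = c("A", "B", "C"))
flows <- tibble(flow = c(10, 20, 30))  # flows for sites A, B, C, in order

good_bind <- bind_cols(sites, flows)         # rows correctly matched
bad_bind  <- bind_cols(sites, flows[3:1, ])  # one table reversed first

good_bind$flow[1]  # 10 -- site A paired with its own flow
bad_bind$flow[1]   # 30 -- site A silently paired with site C's flow
```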
joins
Joins allow you to combine the columns of two datasets, matching the rows by a key. The key is a column (variable) or set of columns that contains identifying information for each row (observation). For example, the key might be a column with a location name (e.g. US state) if we wanted to join two datasets with data for each US state. The key can also be a set of columns – for example we might want to join two datasets and match the rows by “Year”, “Month”, and “Day”, which are contained in three columns in each dataset.
These concepts will make much more sense as you work through the examples in the following sections.
See Chapter 13 (in particular 13.4) of R4DS and Chapter 3.7 of ModernDive for a nice discussion of joins.
The diagrams below help to graphically illustrate how columns from two datasets are joined - matching up the rows according to the key variable.
Image source: R4DS
The key variable(s) (the numbers in the colored cells) are matched between the two datasets to be joined and then the columns are merged with the matching rows retained and aligned.
Image source: R4DS
There are two types of joins - inner joins and outer joins.
An inner join will match observations when their keys are equal, and the output data frame contains columns from both datasets for only the observations common to both datasets. Thus unmatched rows are not included in the output of an inner join.
An outer join differs in that it keeps observations that appear in at least one of the tables. There are three types of outer joins:

- A left_join keeps all observations in x
- A right_join keeps all observations in y
- A full_join keeps all observations in x and y

Where x and y are the 1st and 2nd datasets respectively specified in the join function.
Let’s load in NOAA temperature data that we would like to join with our NOAA precipitation data. The NOAA temperature dataset has monthly temperature data for all of the states in the US for every year from 1895 through 2017.
temperature_data <- read_csv("https://stahlm.github.io/ENS_215/Data/NOAA_State_Temp_Lab_Data.csv")
Let’s remove the year 2016 data for New York from the temperature dataset. We are doing this, since it will help to illustrate how unmatched observations are dealt with during joins.
temperature_data <- filter(temperature_data, !(Year == 2016 & state_cd == "NY")) # remove data for NY in year 2016
inner_join
Image source: R4DS
An inner_join (shown in the diagram above) is the simplest type of join. Let’s join our precipitation and temperature data so that we have precip and temp data for each observation.
First let’s take a look at the first few rows of each dataset. I’ve highlighted the key variables (columns) in yellow.
head(precip_combined)
head(temperature_data)
Year | Month | Precip_inches | state_cd |
---|---|---|---|
2015 | 1 | 4.03 | MA |
2015 | 2 | 2.99 | MA |
2015 | 3 | 2.84 | MA |
2015 | 4 | 2.55 | MA |
2015 | 5 | 0.96 | MA |
2015 | 6 | 6.57 | MA |
Year | Month | Avg_Temp_F | state_cd |
---|---|---|---|
1895 | 1 | 43.1 | AL |
1895 | 2 | 37.4 | AL |
1895 | 3 | 54.5 | AL |
1895 | 4 | 63.4 | AL |
1895 | 5 | 69.5 | AL |
1895 | 6 | 77.5 | AL |
Now let’s join the temperature and precipitation data using an inner_join().
climate_data <- inner_join(precip_combined, temperature_data)
## Joining, by = c("Year", "Month", "state_cd")
In the message above you see that inner_join() joined the datasets by the “Year”, “Month”, and “state_cd” columns (i.e. it used the values in these columns to match rows (observations)). You can explicitly specify the columns you would like to join by, simply by passing by = c("Column name 1", "Column name 2", ...) as an argument to the inner_join function (or any of the other join functions). However, the join functions in dplyr are smart and will join by all of the common columns (variables) unless you specify otherwise.
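For instance, with small made-up tables (hypothetical values, not the NOAA data) the explicit by = version looks like this; omitting by here would give the same result, since Year and Month are the only common columns:

```r
library(dplyr)

precip_toy <- tibble(Year = 2015, Month = 1:3, Precip = c(4.0, 3.0, 2.8))
temp_toy   <- tibble(Year = 2015, Month = 2:4, Temp   = c(15, 29, 45))

# Explicitly naming the key columns for the join
climate_toy <- inner_join(precip_toy, temp_toy, by = c("Year", "Month"))

climate_toy  # only Months 2 and 3 appear in both tables
```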
Year | Month | Precip_inches | state_cd | Avg_Temp_F |
---|---|---|---|---|
2015 | 1 | 4.03 | MA | 21.8 |
2015 | 2 | 2.99 | MA | 14.6 |
2015 | 3 | 2.84 | MA | 29.3 |
2015 | 4 | 2.55 | MA | 45.5 |
2015 | 5 | 0.96 | MA | 61.8 |
2015 | 6 | 6.57 | MA | 63.6 |
Ok now, take a closer look at the joined dataset climate_data and make sure you understand what is going on.

- Does climate_data have observations for Massachusetts? Does this make sense to you?
- Does climate_data have observations for New York? Does this make sense to you?

left_join
Image source: R4DS
Now let’s perform a left_join to add variables (columns) from temperature_data to precip_combined. The left_join will only add observations from temperature_data that match observations in precip_combined. In this case, the observations are matched according to the Year, Month, and state_cd columns in each table.
climate_data <- left_join(precip_combined, temperature_data)
## Joining, by = c("Year", "Month", "state_cd")
Which observations in the joined dataset are missing data (i.e. have NAs)? Does this make sense?

The left_join is the most commonly used outer join operation and should be your default outer join choice unless you have a compelling reason to use another join. However, there are other joins available and we’ll introduce these below so that you are aware of them.
Let’s use left_join to join a dataset containing information about the regions the states are in to our dataset climate_data.
First you should load in the dataset with the region data.
regions_data <- read_csv("https://stahlm.github.io/ENS_215/Data/state_regions.csv")
Take a look at the data to make sure everything looks ok.
Now, you should join this data to your climate_data so that you have regions associated with each observation in climate_data. Note you may have to make some slight modifications to one of the datasets prior to joining them.
Once you’ve joined the data take a look and make sure you understand exactly what happened.
right_join
Image source: R4DS
The right_join is conceptually similar to left_join, except it matches in the opposite direction.
Test out right_join on your precip_combined and temperature_data. Examine the joined dataset and make sure you understand what happened. Try the join twice (switching the order of the datasets in the function calls) and make sure it is clear what is happening.
Is one of your right_joins equivalent to the left_join you performed earlier on these datasets?
# Your code here
full_join
Image source: R4DS
The full_join function joins the two datasets, keeping all values and all rows.
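Sketching this with small made-up tables (hypothetical values) shows how unmatched rows are kept and their missing variables filled with NA:

```r
library(dplyr)

precip_toy <- tibble(Month = c(1, 2), Precip = c(4.0, 3.0))
temp_toy   <- tibble(Month = c(2, 3), Temp   = c(15, 29))

# full_join() keeps every row from both tables; variables with no
# match in the other table are filled with NA
full_toy <- full_join(precip_toy, temp_toy, by = "Month")

full_toy  # Month 1 has NA Temp; Month 3 has NA Precip
```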
Test out full_join on your precip_combined and temperature_data. Examine the joined dataset and make sure you understand what happened.
# Your code here
Does switching the order of the datasets in the full_join() function call make a difference in terms of the content of your joined dataset?

Load in the gapminder data and the dams data from Lab 3. Then add the population and per capita GDP to the dams data (using the 2007 gapminder data). Once you’ve joined these datasets, try some additional exploratory analysis of the enhanced dams data that now has population and per capita GDP data.
The dams data and gapminder data are here
library(gapminder)
my_gap <- gapminder
dams_data <- read_csv("https://stahlm.github.io/ENS_215/Data/Dams_FAO_SelectCols_LabData.csv")