Bivariate data is data that has observations with two quantitative variables associated with each observation. For example if we have a dataset where the temperature and precipitation are measured for each observation (e.g. US state) then we have a bivariate dataset. This differs from univariate data where we only have one quantitative variable for each observation (e.g. just precipitation).
In many cases (as you have already seen) you will have datasets with many more than two quantitative variables for each observation (e.g. the BGS Groundwater Chemistry dataset for Bangladesh). In this case our dataset is multivariate, however we when we consider two quantitative variables at a time (e.g. arsenic and iron) then we are nonetheless performing bivariate analysis.
While we’ve been dealing with bivariate data and performing bivariate analysis for much of the term, we haven’t yet explicitly covered a number of the more technical details and concepts. In today’s and the upcoming lectures this week we will:
Let’s load in the packages that we’ll use today
library(tidyverse)
library(lubridate)
library(ggExtra)
Scatter plots, which we’ve been using throughout the term, are a commonly used to visualize bivariate data. While at this point in the term you are very familiar with this type of graphic, there are a few additional concepts and techniques that are worth pointing in our discussion of bivariate data. In particular, when using a scatter plot to display bivariate data you should consider whether one of the variables is dependent on the other. When there is a dependent variable it is generally plotted in the y-axis, while the independent variable is on the x-axis. For instance imagine you had a dataset of recording the heights and ages of trees. Conceptually, the height of the tree depends on the tree’s age – as the tree gets older it has had more time to grow and will increase in height – while the age does not depend on the height. Thus you would plot age on the x-axis and height on the y-axis. In cases, where there is not a clear dependence then the choice of what axis to plot each variable on will depend on preference, aesthetics, and the goal of the figure.
Let’s load in the BGS Bangladesh groundwater chemistry data to use in the analysis to follow.
bangladesh_gw <- read_csv("https://stahlm.github.io/ENS_215/Data/NationalSurveyData_DPHE_BGS_LabData.csv") %>%
drop_na()
Let’s investigate if how well depth relates to the latitude of the location where the well is situated. We are interested in this because it could potentially provide information that we can use to make inferences about the geology. Typically well are installed at a depth where water can be extracted at sufficient quantity and quality. If the geology allows for sufficient quantities of water to be extracted at shallow depth, then you would prefer to install a shallow well (all other things being equal) since it is cheaper to install shallow wells.
Since there is not a clear dependent variable in this case, we could put either variable on either axis. However, conceptually it is easier to interpret this plot with latitude on the y-axis, since latitude varies as you move north to south and people tend to conceptualize north to south as the vertical direction on a page.
fig_1 <- bangladesh_gw %>%
ggplot(aes(x = WELL_DEPTH_m, y = LAT_DEG)) +
geom_point(alpha = 0.3) +
theme_classic()
fig_1
When there are many points (especially many overlapping ones) marginal plots can be very helpful in providing additional information. Below we’ll generate marginal histograms that display the distributions of each of the variables.
Note: sometimes the marginal plots don’t show up in your notebook
(you won’t have any issues when you knit however). If this happens, type
fig_2
in your console and the graphic will appear in your
plotting pane.
To create marginal plots, we can use the ggMarginal
function from the ggExtra
package. I’ve specified
type = "histogram"
below. You can also specify
type = "density"
or type = boxplot
. Note, that
we pass fig_1
(our scatter plot) to the
ggMarginal()
function – this tells the function to add the
marginal plots to our fig_1
.
fig_2 <- ggMarginal(fig_1, type = "histogram", fill = "gray")
fig_2
Another way to show the density of points in each region of the graph
is to use the geom_hex()
function. This function breaks up
the plot area containing data into hexagonal sub-regions. The
sub-regions are color-coded according to the density of points in that
area. This achieves something similar to creating a scatter plot where
the points are semi-transparent (using the alpha setting in
geom_point()
) – though geom_hex()
can help to
reduce the “noise” in the graphic by reducing the number of points
displayed.
bangladesh_gw %>%
ggplot(aes(x = WELL_DEPTH_m, y = LAT_DEG)) +
geom_hex() +
theme_classic()
You an change the palette used to fill the hexagons by using the
scale_fill_
palette functions in ggplot. For example to use
the viridis palette you would simply do the following.
bangladesh_gw %>%
ggplot(aes(x = WELL_DEPTH_m, y = LAT_DEG)) +
geom_hex() +
scale_fill_viridis_c() +
theme_classic()
A common type of bivariate data is time-series data, where a numeric variable is recorded over time. We’ve dealt with time-series data many times throughout the term, though we’ll go into some additional tools/concepts below.
First let’s load in some streamflow data from the USGS streamgage at Schoharie Creek (01351500).
flow <- read_csv("https://stahlm.github.io/ENS_215/Data/USGS_streamflow_01351500.csv") %>%
drop_na() %>%
filter(Year >= 1940 & Year < 2025) %>% # select years 1940 through 2024
mutate(Date = make_date(Year, Month, Day)) # create a Date column that has the dates as an R date object
Let’s create a standard time-series plot of the streamflow data, where time is on the x-axis and flow is on the y-axis. A time-series plot of streamflow is referred to as a hydrograph.
flow %>%
ggplot(aes(x = Date, y = flow_cfs)) +
geom_line() +
theme_classic() +
labs(title = "USGS Gage 01351500",
y = "Flow (cfs)",
x = "")
Since streamflows span a large range of values, we it can often be helpful to display the data with a log10 scale y-axis.
flow %>%
ggplot(aes(x = Date, y = flow_cfs)) +
geom_line() +
theme_classic() +
scale_y_log10() +
labs(title = "USGS Gage 01351500",
y = "Flow (cfs)",
x = "")
decade
variable to facet by.Ok, so these are typical time-series, plots which you are relatively familiar with from our work throughout the term. Now let’s take a look at a few other ways of visualizing time-series data.
Raster plots take time-series data convert and plot year on one axis and the day of the year (or month) on the other axis. The pixels in the plot have the variable value (in this case flow) mapped to the fill. These plots are called raster plots, as a raster graphic is simply a rectangular image where the pixels are colored according to some variable. Digital images like the ones you take on your phone are called raster graphics.
Raster plots are an excellent way of displaying streamflow data. When streamflow data is presented in this way we call it a raster hydrograph.
Let’s use the geom_tile()
function to create a raster
hydrograph for the streamflow data at Schoharie Creek. We’ll plot the
day of the year on the x-axis, the year on the y-axis, and we will fill
the pixels with a color corresponding the the log10(flow). We
are using the log10(flow) to prevent the color mapping from
being heavily skewed to the highest values (recall that streamflow is
often log-normally distributed).
midpoint2use <- median(log10(flow$flow_cfs))
flow %>%
mutate(DOY = yday(Date)) %>%
ggplot(aes(x = DOY, y = Year,
fill = log10(flow_cfs))) +
geom_tile() +
scale_fill_gradient2(low = "red", mid = "green", high = "blue",
midpoint = midpoint2use) +
theme_classic() +
labs(title = "Schoharie Creek Flows",
subtitle = "USGS Gage 01351500",
y = "Year",
x = "Day of year",
fill = "log(flow)")
With a raster hydrograph it is much easier to identify seasonal patterns as each year is stacked (in rows) on top of one another and their seasons line up in the vertical direction – thus we can easily see seasonal patterns that are present across years. Furthermore, we can easily identify long-term trends by seeing how the intensity of the colors change as you look up and down the figure.
Another benefit of raster hydrographs is that we can observe if there are shifts in flow timing. For instance, if over time, high flows began to occur later in the year then we would see the blue part of the image shift to the right as the years increases.
With a raster hydrograph you can also more easily identify anomalous (e.g. particularly wet or dry years) much easier than in a standard hydrograph. On the raster hydrograph employing the color-scheme above these years will appear as a more red row (dry years) or more blue row (wet year).
Based in the raster hydrograph for Schoharie Creek:
You can also use a raster plot to visualize other types of data. In a similar vein, monthly precipitation data lends itself to being nicely represented in raster format.
Let’s load in the NOAA precipitation data that we’ve used in the past.
precip_data <- read_csv("https://stahlm.github.io/ENS_215/Data/noaa_cag_state_precipitation.csv")
precip_data <- precip_data %>%
rename(Precip_inches = Value)
Take a quick look and re-familiarize yourself with the data.
Now, let’s create a raster graphic of monthly precipitation for California. We’ll filter the data so that we have years 1960 onwards (otherwise we’ll have too many years and the graphic will be difficult to read).
Also note that in the code below we are filling the pixels according to the percentile ranking of each precipitation observation. This prevents the color-scale from being skewed by outlier data. We could also fill the pixels according to the precipitation in inches (as opposed to the ranking) though a bit more tweaking of the color scheme would be required.
Take a look at the code below and make sure you understand what we are doing. Also note, that we’ve put year on the x-axis and month on the y-axis (you are free to switch them as neither orientation is more correct).
precip_data %>%
filter(STATE == "California", YEAR >= 1960, YEAR < 2025) %>%
ggplot(aes(y = factor(MONTH), x = YEAR)) +
geom_tile(aes(fill = percent_rank(Precip_inches)), color = "black") +
scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0.5) +
coord_equal() +
theme_classic() +
labs(y = "Month",
fill = "Precip rank")
In many time-series data there are seasonal patterns (e.g. temperature, precipitation, streamflow,…). Displaying the data as a standard time-series plot where many years of data are shown as a single line can sometimes make it difficult to clearly identify and characterize these seasonal patterns.
Let’s load in some data of atmospheric CO2 concentrations that have been recorded at Mauna Loa, Hawaii since the late-1950s. The dataset has monthly average concentrations.
mauna_loa <- read_csv("https://stahlm.github.io/ENS_215/Data/Mauna_loa_CO2_data.csv", skip = 2)
Below I’m plotting a time-series of the monthly data. This graphic may look familiar to you – the data here is famous and often appears in discussions of climate change.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
You mostl likely noticed above, that there is a both a long-term trend and a seasonal oscillation in CO2 concentrations. Since the x-axis spans many decades it is difficult to identify what these seasonal trends are (e.g. does CO2 peak annually in summer? spring? fall? winter?).
To more easily visualize the seasonal patterns we can create a seasonal plot. In this graphic we will have month (e.g. Jan through Dec) on the x-axis and CO2 concentration on the y-axis. Each year is plotted as its own line. We color code the lines by year to allow for them to be visually differentiated.
In the plot below we are just plotting years 2010 onwards.
mauna_loa %>%
filter(Year >= 2010) %>%
ggplot(aes(x = Month, y = CO2_ppm, group = Year, color = Year)) +
geom_line(linewidth = 1) +
scale_color_gradient(low = "blue", high = "red") +
theme_classic() +
labs(title = expression("Atmospheric CO"[2]),
subtitle = "Measured at Mauna Loa, Hawaii",
x = "Month",
y = expression("CO"[2]* " (ppm)"),
caption = "Data source: NOAA/ESRL") +
scale_x_continuous(breaks = seq(1:12))
You can see how much easier it is to identify the seasonal trends in this graphic as opposed to the standard time-series plot. You can also identify longer-term trends in the data.
Based on the graphic above:
Now remake the above graphic, but this time include all of the years post 1958.
Create a monthly mean flow raster plot of the streamflow data in
the flow
data frame (similar to the graphic you made for
precipitation)
Create a seasonal plot for the monthly mean flow using the
flow
data. Try making the graphic with all of the years.
Also try making the graphic with just a subset of years.