Bivariate and time-series data

Bivariate data: concepts/intro

Bivariate data is data that has observations with two quantitative variables associated with each observation. For example if we have a dataset where the temperature and precipitation are measured for each observation (e.g. US state) then we have a bivariate dataset. This differs from univariate data where we only have one quantitative variable for each observation (e.g. just precipitation).

In many cases (as you have already seen) you will have datasets with many more than two quantitative variables for each observation (e.g. the BGS Groundwater Chemistry dataset for Bangladesh). In this case our dataset is multivariate, however we when we consider two quantitative variables at a time (e.g. arsenic and iron) then we are nonetheless performing bivariate analysis.

While we’ve been dealing with bivariate data and performing bivariate analysis for much of the term, we haven’t yet explicitly covered a number of the more technical details and concepts. In today’s and the upcoming lectures this week we will:

Refresh our memory on some basics of bivariate analysis
Introduce some additional ways of visualizing bivariate data
Begin learning about fitting data (data models)

Let’s load in the packages that we’ll use today

library(tidyverse)
library(lubridate)
library(ggExtra)

Scatter plots: additional concepts

Scatter plots, which we’ve been using throughout the term, are a commonly used to visualize bivariate data. While at this point in the term you are very familiar with this type of graphic, there are a few additional concepts and techniques that are worth pointing in our discussion of bivariate data. In particular, when using a scatter plot to display bivariate data you should consider whether one of the variables is dependent on the other. When there is a dependent variable it is generally plotted in the y-axis, while the independent variable is on the x-axis. For instance imagine you had a dataset of recording the heights and ages of trees. Conceptually, the height of the tree depends on the tree’s age – as the tree gets older it has had more time to grow and will increase in height – while the age does not depend on the height. Thus you would plot age on the x-axis and height on the y-axis. In cases, where there is not a clear dependence then the choice of what axis to plot each variable on will depend on preference, aesthetics, and the goal of the figure.

Let’s load in the BGS Bangladesh groundwater chemistry data to use in the analysis to follow.

bangladesh_gw <- read_csv("https://stahlm.github.io/ENS_215/Data/NationalSurveyData_DPHE_BGS_LabData.csv") %>% 
  drop_na()

Let’s investigate if how well depth relates to the latitude of the location where the well is situated. We are interested in this because it could potentially provide information that we can use to make inferences about the geology. Typically well are installed at a depth where water can be extracted at sufficient quantity and quality. If the geology allows for sufficient quantities of water to be extracted at shallow depth, then you would prefer to install a shallow well (all other things being equal) since it is cheaper to install shallow wells.

Since there is not a clear dependent variable in this case, we could put either variable on either axis. However, conceptually it is easier to interpret this plot with latitude on the y-axis, since latitude varies as you move north to south and people tend to conceptualize north to south as the vertical direction on a page.

fig_1 <- bangladesh_gw %>% 
  ggplot(aes(x = WELL_DEPTH_m, y = LAT_DEG)) +
  geom_point(alpha = 0.3) +
  theme_classic()

fig_1

Do you observe any patterns in well depths as you move from north to south in Bangladesh? What might these patterns suggest?

When there are many points (especially many overlapping ones) marginal plots can be very helpful in providing additional information. Below we’ll generate marginal histograms that display the distributions of each of the variables.

Note: sometimes the marginal plots don’t show up in your notebook (you won’t have any issues when you knit however). If this happens, type fig_2 in your console and the graphic will appear in your plotting pane.

To create marginal plots, we can use the ggMarginal function from the ggExtra package. I’ve specified type = "histogram" below. You can also specify type = "density" or type = boxplot. Note, that we pass fig_1 (our scatter plot) to the ggMarginal() function – this tells the function to add the marginal plots to our fig_1.

fig_2 <- ggMarginal(fig_1, type = "histogram", fill = "gray")
fig_2

Over what depths are most of the wells?
Are most of the wells confined to a narrow band of latitudes or are they relatively well distributed from north to south?
Remake the graphic above, but this time have marginal plots as density curves or boxplots.

Another way to show the density of points in each region of the graph is to use the geom_hex() function. This function breaks up the plot area containing data into hexagonal sub-regions. The sub-regions are color-coded according to the density of points in that area. This achieves something similar to creating a scatter plot where the points are semi-transparent (using the alpha setting in geom_point()) – though geom_hex() can help to reduce the “noise” in the graphic by reducing the number of points displayed.

bangladesh_gw %>% 
  ggplot(aes(x = WELL_DEPTH_m, y = LAT_DEG)) +
  geom_hex() + 
  theme_classic()

You an change the palette used to fill the hexagons by using the scale_fill_ palette functions in ggplot. For example to use the viridis palette you would simply do the following.

bangladesh_gw %>% 
  ggplot(aes(x = WELL_DEPTH_m, y = LAT_DEG)) +
  geom_hex() + 
  
  scale_fill_viridis_c() +
  
  theme_classic()

Time-series plots

A common type of bivariate data is time-series data, where a numeric variable is recorded over time. We’ve dealt with time-series data many times throughout the term, though we’ll go into some additional tools/concepts below.

First let’s load in some streamflow data from the USGS streamgage at Schoharie Creek (01351500).

flow <- read_csv("https://stahlm.github.io/ENS_215/Data/USGS_streamflow_01351500.csv") %>% 
  drop_na() %>% 
  filter(Year >= 1940 & Year < 2025) %>%  # select years 1940 through 2024
  mutate(Date = make_date(Year, Month, Day)) # create a Date column that has the dates as an R date object

Let’s create a standard time-series plot of the streamflow data, where time is on the x-axis and flow is on the y-axis. A time-series plot of streamflow is referred to as a hydrograph.

flow %>% 
  ggplot(aes(x = Date, y = flow_cfs)) +
  geom_line() +
  theme_classic() +
  labs(title = "USGS Gage 01351500", 
       y = "Flow (cfs)",
       x = "")

Since streamflows span a large range of values, we it can often be helpful to display the data with a log₁₀ scale y-axis.

flow %>% 
  ggplot(aes(x = Date, y = flow_cfs)) +
  geom_line() +
  theme_classic() +
  scale_y_log10() +
  labs(title = "USGS Gage 01351500", 
       y = "Flow (cfs)",
       x = "")

Exercise: Create a faceted time-series plot of Schoharie Creek streamflow. Facet the plot by decade. Note that you will need to create a decade variable to facet by.

Ok, so these are typical time-series, plots which you are relatively familiar with from our work throughout the term. Now let’s take a look at a few other ways of visualizing time-series data.

Raster plots

Raster plots take time-series data convert and plot year on one axis and the day of the year (or month) on the other axis. The pixels in the plot have the variable value (in this case flow) mapped to the fill. These plots are called raster plots, as a raster graphic is simply a rectangular image where the pixels are colored according to some variable. Digital images like the ones you take on your phone are called raster graphics.

Raster plots are an excellent way of displaying streamflow data. When streamflow data is presented in this way we call it a raster hydrograph.

Let’s use the geom_tile() function to create a raster hydrograph for the streamflow data at Schoharie Creek. We’ll plot the day of the year on the x-axis, the year on the y-axis, and we will fill the pixels with a color corresponding the the log₁₀(flow). We are using the log₁₀(flow) to prevent the color mapping from being heavily skewed to the highest values (recall that streamflow is often log-normally distributed).

midpoint2use <- median(log10(flow$flow_cfs))

flow %>% 
  mutate(DOY = yday(Date)) %>% 
  
  ggplot(aes(x = DOY, y = Year, 
             fill = log10(flow_cfs))) +
  
  geom_tile() +
  
  scale_fill_gradient2(low = "red", mid = "green", high = "blue", 
                       midpoint = midpoint2use) +
  theme_classic() +
  labs(title = "Schoharie Creek Flows",
       subtitle = "USGS Gage 01351500", 
       y = "Year",
       x = "Day of year",
       fill = "log(flow)")

With a raster hydrograph it is much easier to identify seasonal patterns as each year is stacked (in rows) on top of one another and their seasons line up in the vertical direction – thus we can easily see seasonal patterns that are present across years. Furthermore, we can easily identify long-term trends by seeing how the intensity of the colors change as you look up and down the figure.

Another benefit of raster hydrographs is that we can observe if there are shifts in flow timing. For instance, if over time, high flows began to occur later in the year then we would see the blue part of the image shift to the right as the years increases.

With a raster hydrograph you can also more easily identify anomalous (e.g. particularly wet or dry years) much easier than in a standard hydrograph. On the raster hydrograph employing the color-scheme above these years will appear as a more red row (dry years) or more blue row (wet year).

Based in the raster hydrograph for Schoharie Creek:

Compare your raster hydrograph to the standard hydrograph – do you seen any benefits to the raster version?
Is there any seasonal signal in the flow (higher flow vs. lower flow periods)? If so, what part of the year has higher flows and what part lower flows?
Do any years stand out as particularly high flow or particularly low flow?
Do there appear to be any long-term trends in the data?

You can also use a raster plot to visualize other types of data. In a similar vein, monthly precipitation data lends itself to being nicely represented in raster format.

Let’s load in the NOAA precipitation data that we’ve used in the past.

precip_data <- read_csv("https://stahlm.github.io/ENS_215/Data/noaa_cag_state_precipitation.csv")

precip_data <- precip_data %>% 
  rename(Precip_inches = Value)

Take a quick look and re-familiarize yourself with the data.

Now, let’s create a raster graphic of monthly precipitation for California. We’ll filter the data so that we have years 1960 onwards (otherwise we’ll have too many years and the graphic will be difficult to read).

Also note that in the code below we are filling the pixels according to the percentile ranking of each precipitation observation. This prevents the color-scale from being skewed by outlier data. We could also fill the pixels according to the precipitation in inches (as opposed to the ranking) though a bit more tweaking of the color scheme would be required.

Take a look at the code below and make sure you understand what we are doing. Also note, that we’ve put year on the x-axis and month on the y-axis (you are free to switch them as neither orientation is more correct).

precip_data %>% 
  filter(STATE == "California", YEAR >= 1960, YEAR < 2025) %>% 
  
  ggplot(aes(y = factor(MONTH), x = YEAR)) + 
  geom_tile(aes(fill = percent_rank(Precip_inches)), color = "black") + 
  
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0.5) + 
  
  coord_equal() +
  theme_classic() +
  labs(y = "Month",
       fill = "Precip rank")

Do the precipitation data in CA display any seasonality? If so describe the seasonality.
Do you see any long-term trends in precipitation. For instance does it appear to be getting wetter or drier in CA?
Remake the graphic above for another state of interest. Examine the data and respond to the same questions above.

Seasonal plot

In many time-series data there are seasonal patterns (e.g. temperature, precipitation, streamflow,…). Displaying the data as a standard time-series plot where many years of data are shown as a single line can sometimes make it difficult to clearly identify and characterize these seasonal patterns.

Let’s load in some data of atmospheric CO₂ concentrations that have been recorded at Mauna Loa, Hawaii since the late-1950s. The dataset has monthly average concentrations.

mauna_loa <- read_csv("https://stahlm.github.io/ENS_215/Data/Mauna_loa_CO2_data.csv", skip = 2)

Below I’m plotting a time-series of the monthly data. This graphic may look familiar to you – the data here is famous and often appears in discussions of climate change.

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Do you observe any long-term trends?
Does there appear to be a seasonal signal in the data?

You mostl likely noticed above, that there is a both a long-term trend and a seasonal oscillation in CO₂ concentrations. Since the x-axis spans many decades it is difficult to identify what these seasonal trends are (e.g. does CO₂ peak annually in summer? spring? fall? winter?).

To more easily visualize the seasonal patterns we can create a seasonal plot. In this graphic we will have month (e.g. Jan through Dec) on the x-axis and CO₂ concentration on the y-axis. Each year is plotted as its own line. We color code the lines by year to allow for them to be visually differentiated.

In the plot below we are just plotting years 2010 onwards.

mauna_loa %>% 
  filter(Year >= 2010) %>% 
  ggplot(aes(x = Month, y = CO2_ppm, group = Year, color = Year)) +
  geom_line(linewidth = 1) +
  scale_color_gradient(low = "blue", high = "red") +
  
  theme_classic() +
  labs(title = expression("Atmospheric CO"[2]),
       subtitle = "Measured at Mauna Loa, Hawaii",
       x = "Month",
       y = expression("CO"[2]* " (ppm)"),
       caption = "Data source: NOAA/ESRL") +
  scale_x_continuous(breaks = seq(1:12))

You can see how much easier it is to identify the seasonal trends in this graphic as opposed to the standard time-series plot. You can also identify longer-term trends in the data.

Based on the graphic above:

Describe the seasonal patterns in CO₂ concentrations. Any ideas what might be responsible for these patterns? Discuss with you neighbors.
Do you seen any long-term trends in the data?

Now remake the above graphic, but this time include all of the years post 1958.

What do you observe?

Exercises

Create a monthly mean flow raster plot of the streamflow data in the flow data frame (similar to the graphic you made for precipitation)
Create a seasonal plot for the monthly mean flow using the flow data. Try making the graphic with all of the years. Also try making the graphic with just a subset of years.