We’ve already learned how to use a number of different graphical and statistical techniques to characterize the distribution of data and to compare distributions between two groups. Some of the concepts we’ve learned include:
Today we will continue down this path, introducing some more advanced tools for comparing and interpreting data distributions. In particular we will learn how to construct and interpret quantile-quantile (Q-Q) plots
Let’s load in the packages we’ll use today.
library(tidyverse)
library(stats)
Today’s lecture has a lot of information about Q-Q plots. If you are interested in some more information you can find additional helpful resources at the following website from the US National Institute of Standards and Technology (NIST)
A quantile-quantile plot (Q-Q plot) takes the quantile data from two different datasets, pairs them based on their f-value and then plots the paired quantile values. A Q-Q plot allows you to compare the distributions between two datasets. For instance, if you wanted to compare whether daily temperatures in NYC have a similar distribution as daily temperatures in Boston, then to create a Q-Q plot you would do the following:
Below is an example that illustrates the underlying concepts behind a Q-Q plot. In the example I’ve generated two sets of data (Batch A and Batch B) and they are drawn from different underlying distributions. First we construct individual a quantile plots – one for Batch A and one for Batch B. On each plot the points are labeled with their corresponding f-value. To properly compare the distributions of the Batch A and Batch B, we then construct a Q-Q plot, by pairing the quantile values (e.g. 10th percentile in A with 10th percentile in B and so on) and then generating a graphic.
If the data in the two batches come from the same distribution then they will plot very close to the 45\(^\circ\) line.
Q-Q plots are not the same thing as a scatter plot. A scatter plot is used to display the relationship between paired varaiables (e.g. daily temperature and precipitation where the data are paired for a given location). A scatter plot helps us to examine the relationship between the paired variables (i.e. if one of the variables is a function of the other).
A Q-Q plot is intended to examine if the data have similar distributions – it is not intended to examine if the variables are correlated.
To help illustrate this concept, let’s examine the example below. Here we are creating two vectors – each of which contains 5000 randomly drawn values from a Normal distribution with a mean = 0 and a standard deviation = 1. Thus vec_a
and vec_b
should have very, very similar distributions since they are both drawn from the same theoretical distribution.
vec_a <- rnorm(5000, mean = 0, sd = 1)
vec_b <- rnorm(5000, mean = 0, sd = 1)
However, if we were to pair the values in vec_a
and vec_b
(e.g. 1st value in a with 1st value in b, 2nd value in a with 2nd value in b, …) there is no reason to believe that they should be correlated. That is, if the 1st value in a is high, there is no reason to believe that the 1st value in b should also be relatively high.
Let’s create a scatter plot where we pair a and b to help illustrate this point.
Based on the scatter plot it is clear that the values in vec_a
and vec_b
are not correlated with one another. While they are not correlated we know that they have very similar distributions, as they were both drawn from the same theoretical distribution – the density plots on the margin of the graphic further illustrate the similarity of the sample distributions.
Now let’s create a Q-Q plot with these two datasets to highlight that they come from the same distribution.
q_v <- seq(0,1, by = 0.05) # vector of f-values that we will use to specify quantiles to compute in code below
ggplot() +
geom_point(aes(x = quantile(vec_a, probs = q_v), y = quantile(vec_b, probs = q_v)), size = 2) +
geom_abline(slope = 1, intercept = 0) +
theme_classic() +
labs(title = "Q-Q plot",
x = "A",
y = "B") +
coord_equal() # sets the axis scales equal, which makes Q-Q plot easier to read
You can see that the data fall close to the 45\(^\circ\) line (i.e. line with slope = 1, and intercept = 0), indicating that the data in both datasets (A and B) are drawn from the same distribution.
As we have already discussed, a Q-Q plot allows you to examine how the distributions of data compare between two datasets. This question comes up all the time in science. For instance you may want to know of the distribution of daily temperatures in NYC from 1900-1960 differs from data for 1961-2018. Or you may want to know if the distribution of groundwater arsenic concentrations in Bangladesh is similar to the distribution found in Vietnam. A Q-Q plot allows you to graphically investigate these types of questions.
The Q-Q plot goes beyond simply allowing you to assess if the distributions are similar or different – it also allows you to identify how they differ:
As we saw earlier, when two sets of data have the same underlying distribution, the points on the Q-Q plot will fall on or very close to the 45\(^\circ\) line (i.e. line with slope = 1, and intercept = 0).
When the data have similar distributions, with the only difference being an additive shift i.e. \(Group_B = Group_A + shift\), then the Q-Q plot will plot parallel to the 45\(^\circ\) line, but not on it.
The spread and shape of the distributions of A and B are similar, however it is clear that B has an additive shift of approximately 2. You can see this in the Q-Q plot, as the quantile values for group B are simply the quantile values of group A with an upward shift of two units.
When the points in the Q-Q plot form a line that is NOT parallel to the 45\(^\circ\) line, then we say that there is a multiplicative shift. This means that \(Group_B = Group_A * factor\)
In the example above the distributions of group A and B only differ by a multiplicative factor, with the values in Group B being approximately twice that of Group A i.e. \(Group_B = 2 * Group_A\).
In many situations you will encounter a relationship between distributions that has both a mulitiplicative shift and an additive shift
If the two distributions differ by something other than a linear transformation, then the Q-Q plot will not form a straight line. The example below shows a Q-Q plot for two datasets that are drawn from different distributions (Normal and Exponential distributions respectively)
Let’s load in some data that we will use to create a Q-Q plot. We’ll load in the NOAA monthly precipitation data that we’ve used in the the past.
precip_data <- read_csv("https://stahlm.github.io/ENS_215/Data/NOAA_State_Precip_LabData.csv")
Let’s create a new data frame that just has data for New York (NY) and Rhode Island (RI).
precip_NY_RI <- precip_data %>%
filter(state_cd %in% c("NY", "RI"))
These states are located reasonably close to one another and it would be interesting and useful to know if they have monthly precipitation that is similarly distributed. Let’s create overlaying density curves as a first pass examination before we move on and create a Q-Q plot.
precip_NY_RI %>%
ggplot(aes(x = Precip_inches, fill = state_cd)) +
geom_density(alpha = 0.4) +
theme_classic()
Let’s also create a summary table to compare some statistics between the two groups (NY and RI)
precip_NY_RI %>%
group_by(state_cd) %>%
summarise_at(vars(Precip_inches), funs(mean, sd, p25 = quantile(.,probs = 0.25), p50 = median, p75= quantile(.,probs = 0.75)))
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
Now let’s create a Q-Q plot to compare the distribution of monthly precipitation data between NY and RI.
p <- seq(0, 1, by = 0.05) # f-values to use in our quantile calculations
precip_RI <- filter(precip_NY_RI, state_cd == "RI") # Rhode Island data
precip_NY <- filter(precip_NY_RI, state_cd == "NY") # NY data
q_RI <- quantile(precip_RI$Precip_inches, probs = p) # RI quantile data
q_NY <- quantile(precip_NY$Precip_inches, probs = p) # NY quantile data
ggplot() +
geom_point(aes(x = q_RI, y= q_NY)) +
geom_abline(slope = 1, intercept = 0) +
theme_classic() +
xlim(0, 16) +
ylim(0,16) +
coord_equal() +
labs(title = "Precipitation Q-Q plot",
subtitle = "New York and Rhode Island",
x = "Rhode Island Precipitation (inches)",
y = "New York Precipitation (inches)")
Ok, now let’s investigate the shifts in the distributions between RI and NY that you should have observed in the Q-Q plot you generated above.
To tease out the the shifts, we’ll take a trial and error approach. The multiplicative shift we can be teased out by multiplying either the q_NY
or the q_RI
data by a factor – once the Q-Q plot forms a line parallel to the 45\(^\circ\) line, then we have identified the correct shift. Then once we have the multiplicative shift, we can tease out the additive shift by adding or subtracting from either the q_NY
or the q_RI
data.
Let’s operate just on the q_NY
data (we could also just operate on q_RI
if we wanted).
Let’s start with a multiplicative shift shift_mult
= 3. We’ll keep the additive shift shift_add
= 0, until we have the multiplicative shift squared away.
Look in the geom_point()
function below and note that we are applying the shifts. Run the code and see if you get the line to be parallel to the to the 45\(^\circ\) line. If not, then adjust the shift_mult
and run the code again until you get a value that you are satisfied with. Once you’ve got the shift_mult
settled, then adjust the shift_add
to get the Q-Q plot to fall as close to the 45\(^\circ\) line as possible.
shift_mult <- 3
shift_add <- 0
p <- seq(0, 1, by = 0.05) # f-values to use in our quantile calculations
precip_RI <- filter(precip_NY_RI, state_cd == "RI") # Rhode Island data
precip_NY <- filter(precip_NY_RI, state_cd == "NY") # NY data
q_RI <- quantile(precip_RI$Precip_inches, probs = p) # RI quantile data
q_NY <- quantile(precip_NY$Precip_inches, probs = p) # NY quantile data
ggplot() +
geom_point(aes(x = q_RI, y= q_NY * shift_mult + shift_add)) +
geom_abline(slope = 1, intercept = 0) +
theme_classic() +
xlim(0, 16) +
ylim(0,16) +
coord_equal() +
labs(title = "Precipitation Q-Q plot",
subtitle = "New York and Rhode Island",
x = "Rhode Island Precipitation (inches)",
y = "New York Precipitation (inches)")
Choose two other US states and create a Q-Q plot comparing their monthly precipitation distributions.
Choose a US state of interest and create a Q-Q plot comparing the distribution of monthly precipitation between two time periods. How do the distributions between the two time periods compare? Note that this type of exercise is useful if you are interested in assessing a shift in climate. Discuss the results with your neighbor and with me.
Load in the British Geological Survey arsenic data from Bangladesh and create a Q-Q plot comparing the distribution of groundwater arsenic between the Dhaka and Khulna divisions. When creating your Q-Q plot, you will want to convert the axes to log scale using scale_x_log10
and scale_y_log10
(this will make the graphic easier to read). Also make sure to add a 45\(^\circ\) line to aid in the interpretation of the Q-Q plot.
Code to load in the data
bangladesh_gw <- read_csv("https://stahlm.github.io/ENS_215/Data/NationalSurveyData_DPHE_BGS_LabData.csv")
division_1 <- bangladesh_gw %>%
filter(DIVISION == "Khulna")
division_2 <- bangladesh_gw %>%
filter(DIVISION == "Dhaka")
# Your code here