At this point in the term, you've built up a solid foundation in programming, data wrangling, and data visualization.
Today and in upcoming lectures you will learn some foundational concepts and technical aspects related to the interpretation and statistical components of exploratory data analysis. These tools and concepts will allow you to extract greater information and understanding from your data and to make more rigorous and defensible statements related to your analysis.
We will begin this stage of the class with a refresher/introduction to basic statistics and a deeper dive into univariate data.
Univariate data consists of observations of a single quantitative variable. For instance, when we examine a set of temperature data we are dealing with univariate data (i.e., the single characteristic in this case is temperature). Often we will break univariate data into categories (e.g. temperatures before year 1950 and temperatures after 1950) to see how the statistics and distributions of these groups differ.
When we compare two quantitative variables (e.g. temperature and precipitation) together we are dealing with bivariate data, which we will dive into in greater detail in upcoming lectures. However, for today we will just focus on univariate data.
Before we move on, let's load in a univariate dataset that we can work with in today's lecture. We'll load in the NOAA monthly precipitation dataset that we've worked with previously.
library(tidyverse)
precip_data <- read_csv("https://stahlm.github.io/ENS_215/Data/NOAA_State_Precip_LabData.csv")
Take a quick look at the data to re-familiarize yourself with it.
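For instance, you can quickly preview the data with head() and glimpse():

# Preview the first few rows of the precipitation dataset
head(precip_data)

# Examine the column names and data types
glimpse(precip_data)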
Now let’s create a new dataset that just has the precipitation data for NY.
ny_precip <- filter(precip_data, state_cd == "NY")
Nominal variables are categorical data that have no inherent ordering/ranking. For example: rock type, land-cover class, or the name of a monitoring site.
Ordinal data is similar to nominal data except that there is an inherent ordering/ranking to the data. For example: qualitative ratings such as low, medium, and high.
Measurements can be described as either continuous or discrete.
Continuous variables can take any value within a range (i.e., all possible intermediate values). For example, temperature, concentration, and flow are all continuous variables.
Discrete variables can only take a certain set of values. For example, the number of individuals is reported as an integer and can only take integer values (e.g. the number of individuals could be 500 people or 501 people, but not 500.02524 people).
A population (in statistics) refers to the whole group of objects or values that exist. Typically it is not possible to measure all possible objects, and thus a sample, which is a subset of the population, is measured.
For instance, imagine you are a fisheries biologist and you would like to know the average mass of salmon in a river. It is not feasible to capture and weigh every salmon (i.e. the population), so instead you collect measurements on a subset of the population (i.e. a sample) and make inferences based on this sample.
In the vast majority of cases where you deal with data, you will be
dealing with samples and not populations. In later lectures we will
discuss how we can mathematically evaluate our degree of confidence for
an estimate derived from a sample.
The mean is simply the sum of all of the values in a sample divided by the sample size (number of values). The mean of a population is typically denoted by the Greek letter \(\mu\) (pronounced "mu") and the sample mean is denoted by \(\bar x\) (pronounced "x bar").
\(\mu = \frac{\sum_{i=1}^n x_i}{n}\)
As you are already well aware and have done hundreds of times by now,
the mean is computed in R using the mean()
function.
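For example, to compute the mean monthly precipitation (in inches) for NY:

# Mean monthly precipitation for NY
mean(ny_precip$Precip_inches)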
Exercise

Add a new column, time_period, to the ny_precip data frame, whose values are "Pre-1950" if the year is before 1950 and "Post-1950" if the year is 1950 or later. Then compute the mean monthly precipitation for each time period.
# Your code here
Quantiles are an important way of describing a dataset. The \(f\) quantile, \(q(f)\), of a set of data is the value in a dataset such that a fraction \(f\) of the data are less than or equal to that value (Cleveland, 1993)1. For example, the 0.25 quantile (also referred to as the 25th percentile) is the value in a given dataset where 25% of the data are less than or equal to that value.
To compute a quantile \(q(f)\) for a given dataset, you first need to compute the percentile ranking of each value in the dataset. Once the data have been ordered from smallest to largest, the percentile \(f_i\) for the ith value, \(x_i\), is computed as follows:
\(f_i = \frac{i-0.5}{n}\)
\(q(f)\) is then the value of \(x\) that corresponds to \(f\).
It is worth noting that there are other methods (algorithms) for computing quantiles; they are all generally similar, though they may differ in some of the finer details (e.g., how tied values are treated).
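To make the calculation concrete, here is a minimal sketch that applies the \(f_i = \frac{i-0.5}{n}\) formula to a small made-up vector (not the precipitation data):

# A small made-up example vector
x <- c(4.2, 1.8, 3.1, 2.6, 5.0)

# Order the data from smallest to largest
x_sorted <- sort(x)

# Percentile ranking f_i = (i - 0.5)/n for the ith ordered value
n <- length(x_sorted)
f <- (seq_len(n) - 0.5) / n

# Pair each ordered value with its percentile ranking
data.frame(x = x_sorted, f = f)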
You are all likely familiar with quantiles, as you have dealt with medians before. The median of a dataset is the 0.50 quantile (50th percentile). The median, like the mean, is a measure of location. However, unlike the mean, the median is not strongly influenced by outlier values.
You can use the quantile()
function from the
stats
package to compute the value at specified quantiles
for a given dataset. In the example below, we compute the
5th, 25th, 50th, 75th, and
95th percentiles for monthly rainfall in NY.
quantile(ny_precip$Precip_inches, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))
## 5% 25% 50% 75% 95%
## 1.6200 2.5275 3.2500 4.1200 5.5725
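Since the median is simply the 0.50 quantile, the median() function returns the same value as the 50th percentile shown above:

# Median (0.50 quantile) of NY monthly precipitation
median(ny_precip$Precip_inches)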
Exercise
Add a new column, frac, to the ny_precip data frame, which has the fraction \(f\) (i.e., the percentile ranking) for each precipitation value in the dataset. Note: you can compute it relatively easily using the equation shown above, or you can rely on a function in dplyr that will do it for you.

The spread (or variability) of a variable is an important property to consider when analyzing a dataset. Among other things, the spread of a dataset gives us an idea of the range of possible values a variable can take, as well as how confident we can be when making statements regarding the data when we have only a limited sample.
The variance is the average of the squared deviations from the mean for all of the observations. The population variance is denoted by \(\sigma^2\) and the sample variance is denoted by \(s^2\).

\(\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2\)

The sample variance is computed in the same spirit, using the sample mean and dividing by \(n-1\) instead of \(n\):

\(s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2\)
The standard deviation, \(\sigma\) (population) and \(s\) (sample), is simply the square root of the variance. The standard deviation has the same units as the measurements.
So essentially, the variance and standard deviation tell us, on average, how much the sample values deviate from the sample mean. A sample with a variance of zero implies that all of the measured values are equal to the mean (i.e., there is no sample variation). As the variance grows, it indicates that there is greater deviation from the mean (i.e., the samples are not tightly clustered about the mean).
In the graphic below I’ve created two samples (A and B) that have the
exact same mean (mean = 5.0), though the variances differ. The variance
of Group A = 0.01, and the variance of Group B = 0.56.
You can easily compute the variance using the var()
function and the standard deviation using the sd()
function.
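For example, for the NY precipitation data:

# Sample variance of NY monthly precipitation (inches^2)
var(ny_precip$Precip_inches)

# Sample standard deviation of NY monthly precipitation (inches)
sd(ny_precip$Precip_inches)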
Exercise
Has the variance (and standard deviation) changed over time? Conceptually, what does this mean?
The range is the difference between the largest and smallest value in a dataset. One thing to note is that the range tends to increase as your sample size increases (due to the fact that you are more likely to eventually sample extreme values as your sample size grows).
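As a quick sketch, the range of the NY monthly precipitation data can be computed as follows (note that the range() function returns the smallest and largest values, so we take their difference):

# Range as the difference between the largest and smallest values
max(ny_precip$Precip_inches) - min(ny_precip$Precip_inches)

# Equivalent, using range(), which returns c(min, max)
diff(range(ny_precip$Precip_inches))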
The interquartile range (also denoted IQR) is the difference between the 75th percentile and 25th percentile values in a dataset. The IQR is useful in that it tells us the range in which we can expect to find the middle 50% of the sample.
You can compute the interquartile range either by using the quantile() function and taking the difference between the 75th and 25th percentiles, or by using the IQR() function.
quantile(ny_precip$Precip_inches, probs = 0.75) - quantile(ny_precip$Precip_inches, probs = 0.25) # using quantile
## 75%
## 1.5925
IQR(ny_precip$Precip_inches) # using IQR function
## [1] 1.5925
The coefficient of variation is a measure of spread and is defined as the ratio of the standard deviation to the mean.
\(CV =\frac{\sigma}{\mu}\)
The CV is widely used in chemistry and engineering (to characterize the precision/repeatability of a process or measurement).
The CV is particularly useful when assessing how meaningful the degree of variation (i.e., \(\sigma\)) is relative to the central value (i.e., \(\mu\)) of a given dataset. For instance, consider two different manufacturing processes, where process A makes a part that should be 100 m in length and process B makes a part that should be 0.01 m in length. As in any manufacturing process, there is not perfect precision, and thus there is some variation in the length of the parts made within each process. Below are some statistics regarding the manufacture of 5000 parts from both Process A and B.
Process A has a mean of 100.00 m and a standard deviation of 0.100
m.
Process B has a mean of 0.010 m and a standard deviation of 0.001 m.
As you can see the standard deviation is much lower in process B as compared to A. However, this does not tell the whole story. Let’s take a look at the coefficient of variation for the two processes.
Process A has a \(CV = \frac{0.100\, m}{100.00\, m} = 0.001\)
Process B has a \(CV = \frac{0.001\, m}{0.010\, m} = 0.1\)
Thus, we can see that since the mean is also much smaller in process B, the standard deviation of B is a much greater proportion of the mean part length. The CV helps to highlight that Process B has much greater variation in the part sizes relative to the average part size. Thus, the manufacturer would see that the Process B parts are not particularly well manufactured since the variation amounts to 10% of the mean size.
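There is no dedicated base R function for the CV, but it is straightforward to compute as the ratio of the standard deviation to the mean. A minimal sketch for the NY precipitation data:

# Coefficient of variation (dimensionless) for NY monthly precipitation
sd(ny_precip$Precip_inches) / mean(ny_precip$Precip_inches)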
Important note: The CV should only be calculated for data measured on scales that have a meaningful zero value (i.e., a ratio scale). For instance, you would not want to compute CVs on temperature data measured in C or F, since their zero values do not actually represent a true physical minimum value; thus the computed CV would be dependent on the scale (units) used. The example I presented above makes sense since length does indeed have a true physical minimum of zero, and thus our CV is independent of the units used (i.e., we could have used units of feet and we would get the same answer for the CV).
The shape of a distribution describes, in part, how symmetric (or asymmetric) the data distribution is, as well as which portions of the distribution have the greatest density.
The modes, which are the most common values, are frequently used to describe a distribution. The graphic below shows the distribution of a unimodal sample - one that has a single peak in its distribution.
The distribution below has two peaks and thus is described as bimodal.
The skewness of a distribution describes its symmetry (or asymmetry). The graphic below shows a symmetric distribution.
Asymmetric distributions are described as skewed.
A right-skewed distribution has a long-tail to the
right. These distributions are positively skewed, as they have extreme
positive values that occur with low probability.
A left-skewed distribution has a long-tail to the
left. These distributions are negatively skewed, as they have extreme
negative values that occur with low probability.
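A quick way to examine the modality and skewness of a sample is to plot its histogram (or density). For example, a minimal sketch for the NY monthly precipitation data:

# Histogram of NY monthly precipitation to inspect the shape of the distribution
ny_precip %>%
  ggplot(aes(x = Precip_inches)) +
  geom_histogram(bins = 30, fill = "grey", color = "black") +
  labs(x = "Precip (inches)", y = "Count") +
  theme_classic()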
You've already been visualizing and interpreting univariate distributions; however, previously you were more focused on the programming and generating visuals as opposed to rigorously describing the data. Now that you have a basic framework and language to begin characterizing data distributions, we will further expand your understanding by going into the technical details and approaches to interpreting data distributions.
Box plots provide a nice visual summary of a sample’s distribution. The box plot does not show all of the data points, but rather displays some key summary statistics that describe the distribution of the data.
The graphic below shows an example box plot, with the features of the graphic nicely annotated.
Image source: towardsdatascience.com
The geom_boxplot()
function from the
ggplot2
package generates a box plot with the features
shown above.
ny_precip %>%
ggplot(aes(x = state_cd, y = Precip_inches)) +
geom_boxplot(fill = "grey") +
labs(y = "Precip (inches)",
x = ""
) +
theme_classic()
The box plot is helpful in that it succinctly displays the summary statistics related to the distribution of a given variable. However, it does not display the data points themselves (except for the outliers). While this summary view is often all we want, we sometimes would like to see all of the data in addition to the box plot.
To do this we can simply add a geom_jitter()
layer on top
of our box plot.
ny_precip %>%
ggplot(aes(x = state_cd, y = Precip_inches)) +
geom_boxplot(fill = "grey") +
geom_jitter(alpha = 0.1, width = 0.15, height = 0) +
labs(y = "Precip (inches)",
x = ""
) +
theme_classic()
Note that we specified height = 0 in the geom_jitter() function to prevent jittering in the y-direction (since this would alter the y-values, which are the variable of interest).
Box plots (and other graphics showing a variable's distribution) are particularly useful when comparing univariate distributions between groups (e.g. precipitation pre-1950 and post-1950).
In past labs we have used box plots to visualize distributions; however, we hadn't gone into the technical details of comparing distributions. Let's look at the pre-1950 and post-1950 precipitation in NY and examine how the distributions compare with one another.
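Note that the code below uses the time_period column created in the earlier exercise. If you haven't added it yet, a minimal sketch is shown here (assuming the year column in the dataset is named Year; adjust the name if your column differs):

# Add a Pre-1950/Post-1950 grouping column (assumes a column named Year)
ny_precip <- ny_precip %>%
  mutate(time_period = if_else(Year < 1950, "Pre-1950", "Post-1950"))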
ny_precip %>%
ggplot(aes(x = time_period, y = Precip_inches)) +
geom_boxplot(fill = "grey", notch = TRUE) +
labs(y = "Precip (inches)",
x = ""
) +
theme_classic()
Notice how we specified notch = TRUE
in the
geom_boxplot()
function. The notches around the median
values display confidence intervals around the median. Since we
can’t know the “true” median based on any given sample, these confidence
intervals essentially tell us the range of plausible values that the
“true” median is likely to fall within.
If the notches overlap between two groups then we do not have statistical confidence that the medians are truly different between the groups (i.e., the difference may simply be due to chance given the limited sample sizes). If the notches do not overlap then we can have statistical confidence that the medians differ.
We will cover confidence intervals in later lectures, though I am introducing them now so that you are aware of the concept and can start to employ it in your project/lab work if you want.
Exercise
Create a box plot comparing the precipitation distribution between months (just using the NY data).
Like we did with NY, use a box plot to compare the pre- and post-1950 precipitation for another state of your choosing. Examine the distributions and discuss how they compare and how they differ between time periods.
Choose a subset of 4 states from the full precipitation dataset
precip_data
. Using this subset of states generate a box
plot that compares the precipitation between the states. Think about how
the precipitation distribution varies across the states and discuss
these features with your classmates.
Practice with the statistics functions. Using the full
precipitation dataset precip_data
you should
Your graphic will look similar to the one below
Visualizing Data (William S. Cleveland, 1993)