In today’s lecture we are going to learn how to make use of a number of R packages that allow you to directly query and access data from some notable environmental databases. These data access packages are incredibly powerful as they allow you to streamline your workflow by combining the data identification and acquisition steps directly with your data analysis. They also provide a highly efficient and reproducible approach to data identification and acquisition - allowing you to easily adapt or expand your research question with minimal additional effort.
There are many packages that have been developed for accessing data through R, including a wide range of packages for accessing environmental datasets. Today we will look at a few notable packages and in the coming weeks we will explore some additional ones. The packages below highlight a number of notable research-quality datasets that are likely to be useful in many of your class projects, theses, and future research/work.
Let’s load in a few packages that we will use today. It is likely that many of you may not yet have installed the leaflet package. If so, you should go ahead and do so before proceeding. FYI, the leaflet package allows you to make interactive maps. You will learn how to do this in upcoming lectures; however, a few of the packages used today automatically generate leaflet maps, so you will need this package installed for those maps to be generated.
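If you need to install any of the packages used today, a one-time call to install.packages() will take care of it (shown below for leaflet; the same approach works for the other packages).
# One-time installation (only needed if the package is not already installed)
install.packages("leaflet")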
library(tidyverse)
library(lubridate)
library(leaflet)
In much of our environmental research we need to use climate/meteorological data. While there are a host of organizations/agencies that provide freely available research-quality datasets, locating and accessing the data can often be time-consuming and difficult. Furthermore, once you’ve located the data (generally through the website of the agency/organization collecting the data) you typically need to download the file (e.g., csv, Excel), load it into R, and reformat the data for your analysis. The R packages below allow you to quickly locate the data of interest and directly load it into R in a format that is ready to use - typically all within a single function call!
The Global Surface Summary of the Day - GSOD is a dataset from the National Oceanic and Atmospheric Administration (NOAA) that provides daily average values for a range of meteorological variables (e.g., air temperature, precipitation, wind speed, barometric pressure) at thousands of locations around the globe. These daily average data are computed from hourly observations at meteorological stations (i.e., they are actual observations and not model/simulation outputs) and the data go back upwards of 90 years at some locations. FYI, the hourly data is available through the Integrated Surface Hourly (ISH) dataset from NOAA, which we will learn how to access later in today’s lecture.
We will learn how to use the GSODR
package
to directly access this data.
First let’s load in the package (FYI, you will need to install the package first if you have not installed it before)
library(GSODR)
The package comes with a dataset that has all of the available meteorological stations and their location information/details (e.g., lat/long, country, state, period of data coverage). We will load in this data below.
load(system.file("extdata", "isd_history.rda", package = "GSODR"))
Take a look at the isd_history
data frame to familiarize
yourself with what it contains. You’ll see that you could use this
information to easily identify sites of interest to your particular
research question. Below you can see a map of all of the available
meteorological stations in the GSOD database.
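For example, here is a quick sketch of how you might use dplyr to search the station list by name - say, to find the Windhoek station we use below. I am assuming the column names (NAME, STNID, LAT, LON, BEGIN, END) match the current version of the package, so check names(isd_history) if the code errors.
# Sketch: search the GSOD station list by name (station names are stored in uppercase)
isd_history %>%
  filter(str_detect(NAME, "WINDHOEK")) %>%
  select(STNID, NAME, LAT, LON, BEGIN, END)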
If you identified a station(s) of interest you can download the data
using the get_GSOD()
function. Let’s try this out for
Windhoek, which is the capital of the country of Namibia (located in SW
Africa). To download data for a given station, you need to supply the
station ID (STNID) and the years of interest. You can locate the STNID
in the isd_history
data frame.
met_df <- get_GSOD(years = 2023:2024, station = "681100-99999")
Take a look at the met_df
data frame. The description of
the variables can be
found here.
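A quick way to see all of the variables (and their types) in the downloaded data is glimpse():
# Inspect the structure of the GSOD data we just downloaded
glimpse(met_df)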
Let’s plot a time-series of the daily minimum (blue line) and maximum (red line) air temperatures from 2023-2024 at the selected met station.
met_df %>%
  ggplot() +
  geom_line(aes(x = YEARMODA, y = MIN), color = "blue") +
  geom_line(aes(x = YEARMODA, y = MAX), color = "red") +
  labs(title = met_df$NAME[1],
       x = "Date",
       y = "Temperature (C)",
       caption = paste("Data source GSOD station", met_df$STNID[1])) +
  theme_classic()
Oftentimes you have a study location/area of interest and you would like to identify any/all meteorological stations that exist near your site. The GSODR package has a really helpful function, nearest_stations(), that does this. You need to supply the lat/long of the point to query and the radial distance (in kilometers) from your search point. The function will then return the station IDs (STNID) for all stations that fall within your search radius.
Let’s test this out by searching for all of the stations within 25 km of Schenectady.
schdy_met_stations <- nearest_stations(LAT = 42.81,
LON = -73.94,
distance = 25)
schdy_met_stations
## [1] "999999-04782" "744994-04741" "999999-04741" "744994-99999" "999999-14735"
## [6] "725180-14735"
We can see that there are 6 meteorological stations within 25 km of Schenectady.
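Since nearest_stations() returns just the station IDs here, you can cross-reference them against the isd_history data frame to pull up the station names and locations. A minimal sketch (again assuming the isd_history column names noted earlier):
# Sketch: look up the details for the stations returned by nearest_stations()
isd_history %>%
  filter(STNID %in% schdy_met_stations) %>%
  select(STNID, NAME, LAT, LON, BEGIN, END)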
In the section above, we learned how to use the GSODR
package to retrieve daily meteorological data from thousands of
meteorological stations around the globe. Now we will learn how to
access the hourly data from NOAA’s Integrated Surface Database using the
worldmet
package. Note that this package is very similar to
GSODR
and in fact they access data from the same set of
stations (with GSODR
providing daily summaries and
worldmet
providing access to the hourly data).
Let’s load in the worldmet
package. If you haven’t yet
installed worldmet
you will need to go ahead and do so
first.
library(worldmet)
The worldmet package is pretty straightforward to use - you simply need to specify the station you would like to access and the year(s) you would like to download the data for.
We can use the getMeta()
function to obtain a data frame
containing information (e.g., station ID, name, lat/long,…) about each
of the stations that have data available for download.
worldmet_site_df <- getMeta(plot = F)
Take a look at the data frame you just obtained and you will see that
the code
column contains the station ID that is unique to
each site. The data frame also has a bunch of additional useful
information about each site, including the time period over which data
is available.
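For example, you could search the station list for sites by name. A sketch, assuming the getMeta() output uses the column names station, code, latitude, and longitude (check names(worldmet_site_df) on your version):
# Sketch: find ISD stations with "ALBANY" in their name
worldmet_site_df %>%
  filter(str_detect(toupper(station), "ALBANY")) %>%
  select(code, station, latitude, longitude)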
Note that in the function call above I set the plot argument to FALSE (F). If you set it to TRUE (T), then an interactive map will be generated showing the locations of all of the sites. To have the map show up when you knit your document you will also need to set the returnMap argument to TRUE. Let’s make a map of all of the available sites.
getMeta(plot = T, returnMap = T)
Ok, now we can go ahead and use the importNOAA() function to download some hourly meteorological data. This function requires two arguments. The code argument is where you specify the station ID. The year argument is where you specify the year(s) you would like to download. To specify multiple years you would set the year argument as follows: year = 2010:2022, which would download hourly data from 2010 through 2022.
Let’s go ahead and download hourly data for Albany International Airport (station code: 725180-14735) for 2024 through the most recent reported measurement (~ 1 or 2 days prior to today).
worldmet_data <- importNOAA(code = "725180-14735",
year = 2024:2025)
Descriptions of all the variables and their units are available here.
Important: All data are reported in the Greenwich Mean Time (GMT) time zone. This means that you may want to convert the time data to the station’s local time zone.
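Since we loaded lubridate earlier, one approach (a minimal sketch) is to add a column with the timestamps converted to the station’s local time zone using with_tz(); for the Albany station that is the America/New_York time zone.
# Sketch: add a column with the timestamps converted from GMT/UTC to local time
worldmet_data <- worldmet_data %>%
  mutate(date_local = with_tz(date, tzone = "America/New_York"))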
Let’s quickly plot some of the data to see how it looks.
Hourly temperature data
worldmet_data %>%
ggplot(aes(x = date, y = air_temp)) +
geom_line() +
theme_bw()
Hourly relative humidity data
worldmet_data %>%
ggplot(aes(x = date, y = RH)) +
geom_line() +
theme_bw()
The dataRetrieval package

The USGS National Water Information System (NWIS) is an incredible database containing physical and chemical data on the streams/rivers, lakes, and groundwater for nearly 2 million locations across the US and US territories. From NWIS you can download data related to thousands of different parameters including streamflow, groundwater levels, and a wide range of water chemistry and physical conditions.
This data can be queried and retrieved using the dataRetrieval package that was developed and is maintained by the USGS. Given the richness and volume of the data there are many functions and features within this package, and some knowledge of how the available data is organized is particularly helpful when using dataRetrieval. Today we’ll learn how to download daily streamflow data and also learn the basics of downloading water quality data. For those interested in learning more, I am happy to provide additional guidance, and you can also explore the excellent background and tutorial articles that have been put together by the developers of the package.
When querying the NWIS database to find and/or download available data you generally need to specify a parameter (i.e., what was being measured) and a location/region of interest.
If you already know the USGS site number (the unique numeric identifier the USGS assigns to sampling sites) and you know that the parameter of interest was measured at the site you can proceed to the data acquisition step. However, many times we don’t know what/if available sites with the desired data exist and we need to first identify those sites. Once you’ve identified the sites with the desired data, you can then directly acquire the data for the site(s) of interest.
Let’s work through an example where we locate sites in Schenectady county that have mean daily streamflow and/or gauge height (i.e., stream water level) data. Note that we could look for other parameters (e.g., turbidity, concentration of chloride, pH,…) if we were interested in those measurements.
First let’s load in the dataRetrieval
package. If you
have never used this package before you will need to install it before
proceeding.
library(dataRetrieval)
We are going to query the NWIS database for sites within Schenectady
county (countyCd = "36093"
) that have streamflow (00060)
and/or gauge height (i.e., water level) (00065) data.
Note that every US county has a FIPS county code, which is a unique numeric code that the US has assigned to each county. You can find a list of all the county codes here.
You can also find a list of all of the parameter codes used by the USGS here.
You’ll also note that we specified service = "dv", which means we are querying for sites that have daily values reported.
sites_schdy <- whatNWISsites(countyCd = "36093",
parameterCd = c("00060","00065"),
service = "dv"
)
## GET: https://waterservices.usgs.gov/nwis/site/?countyCd=36093&parameterCd=00060,00065&format=mapper&hasDataTypeCd=dv
sites_schdy
You can see that there are 8 sites that meet the search criteria.
You’ll also note that under the site_tp_cd
column you see
ST, which stands for stream. If you queried for a parameter
(e.g., water temperature) that could also apply to other site types,
then you would likely see other site_tp_cd
values listed
too (e.g., GW).
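If you want a quick tally of the site types returned by a query, you can count the values in the site_tp_cd column:
# Count the number of sites of each type returned by the query
sites_schdy %>%
  count(site_tp_cd)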
Now let’s get information about the available data (e.g., the number of observations, the length of the data record).
sites_what_data <- whatNWISdata(siteNumber = sites_schdy$site_no,
service = "dv",
parameterCd = c("00060","00065"),
statCd = "00003")
## GET: https://waterservices.usgs.gov/nwis/site/?seriesCatalogOutput=true&sites=01351500,01351450,01354230,01354330,01354500,01356190&format=rdb
sites_what_data
In the above function call you see that we set the argument
statCd = "00003"
. This specifies the statistic to download.
The code 00003
means to obtain the average value. We need
to specify this since we are downloading daily values and since most
sites take sub-hourly readings we need to tell the function what type of
daily value to return (e.g., minimum measurement, maximum, average,…).
If you want to obtain a different statistic than the average you can
simply change the statistic code.
You can find a list of USGS statistic codes here.
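For example, if you wanted the daily maximum rather than the daily mean, you could swap in the corresponding statistic code. A sketch (00001 is the USGS statistic code for the daily maximum):
# Sketch: query for daily *maximum* values instead of daily means
sites_max_data <- whatNWISdata(siteNumber = sites_schdy$site_no,
                               service = "dv",
                               parameterCd = c("00060", "00065"),
                               statCd = "00001")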
We can see that we have information about the data records for the sites with available gauge height (parm_cd = 00065) and streamflow (parm_cd = 00060). Below is a map of these sites.
Based on this information you might choose to acquire the data for some or all of these sites. Since there are only a few sites, let’s go ahead and download the water gauge height and streamflow data for all of these sites.
df_stream_data <- readNWISdv(siteNumbers = sites_what_data$site_no,
parameterCd = c("00060","00065"),
statCd = "00003")
## GET: https://waterservices.usgs.gov/nwis/dv/?site=01351450,01351500,01354230,01354330,01354500,01356190&format=waterml,1.1&ParameterCd=00060,00065&StatCd=00003&startDT=1851-01-01
When you download the data it will have column names that refer to
parameter codes and other data descriptors that are not particularly
human readable. Let’s use the renameNWISColumns()
function,
which renames the columns with more human readable names.
df_stream_data <- df_stream_data %>%
renameNWISColumns()
Now let’s plot a time-series of the gauge heights for each of the sites.
df_stream_data %>%
filter(!is.na(GH)) %>% # remove NA gauge heights
ggplot(aes(x = Date, y = GH)) +
geom_line() +
facet_wrap(~ site_no, scales = "free")
Now let’s plot a time-series of the streamflow for each of the sites.
df_stream_data %>%
filter(!is.na(Flow)) %>%
ggplot(aes(x = Date, y = Flow)) +
geom_line() +
facet_wrap(~ site_no, scales = "free")
The NWIS database has over 4.4 million water quality measurements
across hundreds of thousands of locations throughout the US and US
territories. We can use the dataRetrieval
package to query
and access this data. Furthermore, the dataRetrieval
package also allows us to access water quality data from the
Environmental Protection Agency (EPA) and the US Department of
Agriculture (USDA). As noted earlier, the dataRetrieval
package and understanding NWIS data can take a bit of time to learn,
though we’ll work through an example to highlight just how powerful
dataRetrieval
is for obtaining research grade water quality
measurements.
The key first step is to specify what water quality parameter(s) you are interested in obtaining. You can find the list of the available USGS parameter codes here.
You can also obtain a data frame of the USGS, EPA, and USDA parameter
codes directly in R, by calling parameterCdFile
. Let’s do
that now.
parm_cd_df <- parameterCdFile
Take a look at this data frame. You’ll see that there are more than 24,000 parameter codes! Each parameter code is assigned to one of 16 different groups (e.g., physical, microbiological, nutrient,…), which makes it easier to locate parameters by their group/type. You’ll also see that the full name and the units associated with each parameter are reported in the data frame.
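With that many codes, a text search of the parameter names is often the quickest way to find the code you need. A sketch, assuming the column names parameter_cd, parameter_group_nm, parameter_nm, and parameter_units (check names(parm_cd_df) on your version):
# Sketch: search the parameter table for nitrate-related codes
parm_cd_df %>%
  filter(str_detect(tolower(parameter_nm), "nitrate")) %>%
  select(parameter_cd, parameter_group_nm, parameter_nm, parameter_units)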
Let’s go ahead and obtain data for Nitrate, water, filtered, milligrams per liter as nitrogen, which is parameter code 00618.
Since there are likely tens or hundreds of thousands of nitrate measurements in the NWIS database, we are going to limit our query by location and by date.
Let’s use the whatWQPdata()
function to find out what
data is available for nitrate (see the parameterCd
argument) in New York state (see the stateCd
argument). We
will limit our query to sites that have at least some measurements that
were collected after 1-Jan-2015 (see the startDate
argument). Note that the sites can have data collected before that date,
however they must have at least one measurement after that date in order
to satisfy our search criteria.
Note that we are filtering the data so that we only keep groundwater sites (i.e., wells and springs).
pCode <- c("00618")
NY_NO3_sites <- whatWQPdata(stateCd = "NY",
parameterCd = pCode,
startDate = "2015-01-01"
) %>%
filter(ResolvedMonitoringLocationTypeName %in% c("Well", "Spring"))
## NEWS: Data does not include USGS data newer than March 11, 2024. More details:
## https://doi-usgs.github.io/dataRetrieval/articles/Status.html
## GET: https://www.waterqualitydata.us/data/Station/search?statecode=US%3A36&pCode=00618&startDateLo=01-01-2015&mimeType=geojson
Take a look at the NY_NO3_sites data frame. You’ll see that this data frame has information on all of the available data based on the above query, with each row representing a site with data. The last three columns of this data frame are particularly useful - they contain the dates of the first (begin_date) and last (end_date) measurements and the total number of measurements for that site (count_nu). You’ll see that some sites have multiple measurements, indicating that samples were collected on a number of different dates/times.
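If you only wanted sites with a reasonably long record, you could filter on those columns before downloading any data. A sketch using the count_nu column described above (if your version of dataRetrieval reports the measurement count under a different name, adjust accordingly); the threshold of 5 measurements is arbitrary:
# Sketch: keep only sites with at least 5 nitrate measurements on record
NY_NO3_sites_long_record <- NY_NO3_sites %>%
  filter(count_nu >= 5)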
Now, we will use the readWQPqw()
function to acquire the
data for the sites we have identified.
To do this we need to specify the site(s) to obtain. We do this by providing the siteNumbers argument with a vector of site numbers.
We also need to specify the time period for the measurements we would
like to obtain. We will limit ourselves to only downloading measurements
that were taken from 1-Jan-2015 onward (see the startDate
argument).
We also need to specify the parameter to download. Since we are
interested in nitrate data we set the parameterCd
argument
to be equal to the nitrate parameter code of interest (i.e., 00618).
NY_NO3_recent_data <- readWQPqw(siteNumbers = NY_NO3_sites$MonitoringLocationIdentifier,
parameterCd = pCode,
startDate = "2015-01-01"
)
## POST: https://www.waterqualitydata.us/data/Result/search
## NEWS: Data does not include USGS data newer than March 11, 2024. More details:
## https://doi-usgs.github.io/dataRetrieval/articles/Status.html
Now we’ve downloaded the data - take a look at the data frame. You’ll
see the measurement values in the ResultMeasureValue
column.
Recall that some sites have multiple measurements (since measurements were taken on a number of different dates/times). Let’s summarize our data so that we can get the maximum nitrate concentration observed at each of the sites.
To do this we are going to group by
MonitoringLocationIdentifier
(i.e., group by the different
sampling sites/locations) and then summarize the nitrate measurements to
take the maximum value observed at each site.
NY_NO3_recent_data_summary <- NY_NO3_recent_data %>%
filter(!is.na(ResultMeasureValue)) %>% # remove rows with NA values for the measurement of interest
group_by(MonitoringLocationIdentifier) %>%
summarize(NO3_max = max(ResultMeasureValue, na.rm = T)) %>%
left_join(attr(NY_NO3_recent_data, "siteInfo")) # joining our water quality data with the data frame that contains site information
## Joining with `by = join_by(MonitoringLocationIdentifier)`
Take a look at this data frame to make sure you understand the steps above and what we have done.
Below is a map showing the data that we just summarized.
We will learn how to make these types of maps in upcoming lectures.
Continue to experiment with the packages that we have learned how to use today. Think about how you might apply these packages to your final projects. Some possible things to explore include:
Try obtaining daily or hourly meteorological data for site(s) that you are interested in.
Compare meteorological conditions between sites of interest.
Apply some of the time-series visualization/analysis approaches to the meteorological data.
Use the dataRetrieval package to obtain data for additional parameters and sites from the NWIS database. Spend some time exploring this data.