At this point in the term, you’ve learned quite a bit of material and it’s good to stop and look back at what we’ve covered. In today’s lecture we will refresh our memory of the concepts and tools that we’ve seen up to now. In the sections below you’ll work through exercises that cover many of these key topics.

Let’s load in the libraries we’ll use in this lesson.

library(tidyverse)


R basics

At the beginning of the term we learned some of the basics of programming in R. You are continually using many of these concepts so they should be relatively fresh. However, below are some questions/exercises that should help to reinforce these concepts.

  1. Name the common data types used in R

  2. Name the common data structures used in R

  3. Create a vector called by_fives that goes from 0 to 50 in increments of 5
    1. Divide every element of this vector by 2 and save it to a new vector
    2. Multiply (element-wise) by_fives by your new vector created in the above step.
  4. Create a vector of factors, with a collection of values “Low”, “Med”, “High”, “Very High”. Have at least 7 values in your vector.
    1. You should make your factor have ordered levels.
  5. Load in the data frame using the code below
    1. Examine the first few rows of the data frame using the head() function and get a quick summary using the summary() function. Also examine the structure using the str() function
    2. Access all rows in the 4th column using bracket [] notation
    3. Access the 3rd column using the $ notation.
earthquake_data <- read_csv("https://stahlm.github.io/ENS_215/Data/Rocky_Mtn_Arsenal_Earthquakes.csv", skip = 2)
  6. Create a vector that goes from 1 to 10. Append this vector to a vector where the value 20 is repeated 5 times and save this new vector to an object new_vec. Test the following conditions (element-wise) on your new_vec vector
    1. If the elements are GREATER THAN 15
    2. If the elements are LESS THAN OR EQUAL TO 10
    3. If the elements are LESS THAN 5 or if they are GREATER THAN 10
    4. If the elements are GREATER THAN 2 and LESS THAN 6
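
As a reminder of how element-wise operations behave, here is a minimal sketch. The vectors `depths` and `sizes` and their values are made up for illustration and are not the ones the exercises ask for.

```r
# Hypothetical vectors (made-up values, not the ones from the exercises)
depths <- c(seq(from = 2, to = 10, by = 2), rep(30, times = 3))

half_depths <- depths / 2     # arithmetic is applied to every element
depths * half_depths          # element-wise multiplication of two vectors

depths > 8                    # returns one TRUE/FALSE per element
depths < 4 | depths > 10      # | combines conditions with OR
depths >= 4 & depths <= 8     # & combines conditions with AND

# An ordered factor lets you compare levels with < and >
sizes <- factor(c("S", "L", "M", "S"),
                levels = c("S", "M", "L"),
                ordered = TRUE)
sizes < "L"
```

Note that `|` and `&` work element by element, which is what you want when testing a whole vector.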


Markdown basics

We learned to use Markdown to nicely format our R Notebooks. The following exercises will refresh your memory on some of these formatting options. You can refer to the Notebooks posted on our class website and/or your R Markdown Cheatsheet.

  1. Create a few section and sub-section headers using #s
  2. Create bold and italic font
  3. Create superscript (e.g. X^(2.75)) and subscript (e.g. X_(i+1)) text
  4. Make a bulleted list
  5. Make a numbered list that has items
    1. and sub-items
  6. Insert a web-link to our class site
  7. Insert a footnote in your text
  8. Insert the image from this link (https://stahlm.github.io/ENS_215/Lectures/Images/1000px-Anscombes_quartet_3.png)


Conditional programming

As you’ve learned, conditional programming allows us to execute code when specified conditions are met. We learned how to do this using if, if/else, and if/else-if/else statements.
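
A minimal sketch of the if/else-if/else pattern is shown here. The variable `ph` and the cutoff values are hypothetical and chosen only for illustration.

```r
# Hypothetical measurement and cutoffs (illustrative values only)
ph <- 6.2

if (ph < 6.5) {
  msg <- "Acidic"
} else if (ph <= 8.5) {
  msg <- "Within typical range"
} else {
  msg <- "Basic"
}

print(msg)
```

Because the conditions are checked in order, the second branch does not need to repeat `ph >= 6.5`; it is only reached when the first condition is FALSE.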

  1. Create an if statement that checks to see if a number you have guessed between 1 and 100 is larger or smaller than a randomly generated number between 1 and 100. The code below gives you a start. You will need to complete the code and add your if statement. Your if statement should print out a message informing you of the result.
rand_number <- runif(1, min = 1, max = 100) # generate a random number between 1 and 100
my_guess <-    # your guess goes here


  2. Create an if/else-if/else statement that tells you how well you guessed
    • When your guess is off by 50 or more then print “Your guess is way off”
    • When your guess is off by 20 or more but less than 50 print “Your guess is OK”
    • When your guess is off by 10 or more but less than 20 print “Good guess”
    • When your guess is off by less than 10 print “Excellent guess”

Create your if/else-if/else statement in a well-thought-out and efficient manner. Think about the styling of your code and the quality of your implementation.


Loops

We learned that we can repeat a section of code when specified conditions are met by using loops. This allows us to perform repeated operations without having to copy and paste code (which is a very bad practice and very inefficient).
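
As a quick refresher on the two loop types, here is a minimal sketch. The vector `levels_m` and the `threshold` value are made up for illustration and are not the Hudson data.

```r
# Hypothetical daily water levels (made-up values, not the Hudson data)
levels_m <- c(1.2, 0.8, 2.5, 3.1, 0.9, 4.0)
threshold <- 2

# for loop: visit every element and count how many exceed the threshold
n_above <- 0
for (i in seq_along(levels_m)) {
  if (levels_m[i] > threshold) {
    n_above <- n_above + 1
  }
}
print(n_above)

# while loop: keep stepping forward until the first value above the threshold
i <- 1
while (i <= length(levels_m) && levels_m[i] <= threshold) {
  i <- i + 1
}
print(paste("First value above the threshold is at position", i))
```

Using `seq_along()` and `length()` instead of hard-coded indices keeps the loops working if the vector is swapped for one of a different length.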

Let’s load in some daily streamflow data for the Hudson River (measured near Waterford, NY) for years 2013-2016. Note that the dataset is complete (i.e. there are no missing days and no missing data).

Hudson_flow <- read_csv("https://stahlm.github.io/ENS_215/Data/Hudson_01335754_review_class.csv")


  1. Create a for loop to find the length (in days) of the longest period of consecutive days where the flow was less than 2,500 cfs. FYI, I’ve determined that the answer is 30 days (let me know if you get something different).

Note: Take some time to think about how to do this. Also write your code in an intelligent manner so that it is flexible (i.e. it would run without modification if you were to load in a different but identically formatted dataset).

# Your code here


  2. Create a while loop that loops through the Hudson_flow data until it reaches the maximum flow recorded in the dataset, at which point the loop stops. You should add a print() statement after the loop that reports the date of the maximum flow. FYI, I get the following answer:
## [1] "Max flow occurs on 2014-4-16"


Data wrangling

Basics

We learned tons of ways to wrangle data using the dplyr package. Let’s refresh our skills with these tools (you’ll likely be pretty fresh with these concepts since you have been using them heavily).

To practice your skills you should use the Hudson_flow data. Don’t overwrite your Hudson_flow dataset when making modifications. If you happen to do this by accident, you can simply reload in the data.

  1. Use filter() to select only the rows with flows > 7500 cfs
  2. Use filter() to select only the rows with: 2,500 < flows < 12,000 cfs
  3. Use filter() to select only the rows with months Nov, Dec, Jan, Feb (you should use %in% in your filter operation)
  4. Sort the data in ascending order by the flows. Also try sorting the data in descending order by the flows.
  5. Select only the rows with data for year 2014 and then sort this data in ascending order by flows. You should use the pipe operator %>% to allow you to do this in a single line of code
  6. Remove the day column using the select() function
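
The verbs above can be chained with the pipe, as in this minimal sketch. The data frame `rain` and its columns (`year`, `month`, `mm`) are hypothetical and made up for illustration; they are not the Hudson_flow columns.

```r
library(dplyr)

# Hypothetical monthly rainfall table (made-up values)
rain <- data.frame(
  year  = c(2020, 2020, 2020, 2021, 2021, 2021),
  month = c(1, 6, 12, 1, 6, 12),
  mm    = c(80, 20, 95, 60, 35, 110)
)

# Keep December and January rows, sort from wettest to driest,
# and drop the year column. The result is saved to a new object
# so the original table is not overwritten.
winter_rain <- rain %>%
  filter(month %in% c(12, 1)) %>%
  arrange(desc(mm)) %>%
  select(-year)

print(winter_rain)
```

Saving the result under a new name follows the advice above about not overwriting the original dataset.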


Additional dplyr

Let’s practice some additional (and more advanced) data wrangling skills.

  1. Add a new column (variable) to Hudson_flow with a categorical variable that categorizes flow into “Low flow” and “High flow” based on the following conditions
    • if flow < 7,000 cfs then “Low flow”
    • if flow >= 7,000 cfs then “High flow”

You will want to use mutate() and if_else() to accomplish the above. Make sure to reassign the Hudson_flow object so that you carry this variable with you in the later analysis

  2. Use group_by() and summarize() to accomplish the following tasks
    1. Create a table that reports the minimum, mean, and maximum flow by month
    2. Create a table that reports the minimum, mean, and maximum flow by month for each year
  3. Using precipitation data for all states, summarize the minimum, mean, and maximum for the pre-1950 and post-1950 periods. Create a dot plot of the results
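
The mutate()/if_else() and group_by()/summarize() patterns can be sketched as follows. The data frame `rain`, its columns, and the 70 mm cutoff are hypothetical and made up for illustration.

```r
library(dplyr)

# Hypothetical monthly rainfall table (made-up values)
rain <- data.frame(
  year  = c(2020, 2020, 2020, 2021, 2021, 2021),
  month = c(1, 6, 12, 1, 6, 12),
  mm    = c(80, 20, 95, 60, 35, 110)
)

# Add a two-category variable based on a cutoff (70 mm is arbitrary here)
rain <- rain %>%
  mutate(wetness = if_else(mm < 70, "Dry", "Wet"))

# One row per year with the minimum, mean, and maximum rainfall
rain_summary <- rain %>%
  group_by(year) %>%
  summarize(
    min_mm  = min(mm),
    mean_mm = mean(mm),
    max_mm  = max(mm)
  )

print(rain_summary)
```

To summarize by more than one variable, list them all inside group_by(), e.g. `group_by(year, month)`.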


Data visualization

This topic is very recent so there is not much need to refresh your memory, but you should still do some exercises to reinforce the concepts.

Let’s generate some graphics using the tools we’ve learned in the ggplot2 package. We’ll use the Hudson_flow data in the exercises below.

  1. Create a graphic with flow vs. month, where flow is the y variable and month the x. You should use geom_jitter() (Hint: you may want to convert your x variable to a factor).
    1. Add a geom_point layer with the mean flow for each month (i.e. twelve points). Make these points blue squares.
    2. Use the theme_classic()
    3. Add axis labels, a title, and a caption
    4. Set the alpha of the points to 0.5
    5. Use comma formatting for the y-axis (e.g. 10,000)
    6. Make the tick labels on the x-axis the abbreviation for each month (e.g. Jan, Feb, Mar, …)
    7. Make any other modifications that you think improve the graphic

Note: You will need to load in the scales package for the comma formatting

library(scales)
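
A minimal sketch of how these layers and scales fit together is shown here. The data frame `visits`, its columns, and the labels are hypothetical and made up for illustration; the monthly means are computed with base R's aggregate(), which is just one of several ways to do it.

```r
library(ggplot2)
library(scales)

# Hypothetical data: 5 made-up visitor counts for each of 12 months
set.seed(1)
visits <- data.frame(
  month = factor(rep(1:12, each = 5)),
  count = round(runif(60, min = 5000, max = 20000))
)

# Mean count for each month (one row per month)
monthly_mean <- aggregate(count ~ month, data = visits, FUN = mean)

p <- ggplot(visits, aes(x = month, y = count)) +
  geom_jitter(width = 0.2, alpha = 0.5) +            # raw values, semi-transparent
  geom_point(data = monthly_mean,                    # second layer with its own data
             shape = 15, color = "blue", size = 3) +
  scale_x_discrete(labels = month.abb) +             # month.abb is built into R
  scale_y_continuous(labels = comma) +               # comma() comes from scales
  labs(x = "Month", y = "Visitors",
       title = "Visitors by month",
       caption = "Made-up data for illustration") +
  theme_classic()

p
```

Passing a separate `data` argument to geom_point() lets the summary layer use the twelve monthly means while the jitter layer uses every observation.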


Clean and tidy data

We just did this last class so we won’t review this topic today, though you should look back at the past few lectures if you need a refresher.