Exploring Environmental Data

Loops (quick refresher)

for (i in 1:10){
  # square the current value of i and print the result
  j <- i^2
  print(j)
}

## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100

In the loop above, i takes the values 1 through 10 in turn, increasing by 1 on each iteration. Each iteration squares i and prints the result.

Loops (quick refresher)

city_list <- c("Schenectady", "New York", "Boston", "Chicago", "Miami")

for (i_city in city_list){
  print(i_city)
}

## [1] "Schenectady"
## [1] "New York"
## [1] "Boston"
## [1] "Chicago"
## [1] "Miami"

In the loop above, the object i_city takes each value in city_list in turn.

On the first iteration of the loop, i_city takes the first value in city_list (i.e. "Schenectady").
On the second iteration, i_city takes the second value in city_list (i.e. "New York").
On the third iteration …

Loops (quick refresher)

Using a different approach, I can create a loop that does exactly the same thing as the loop on the previous slide:

city_list <- c("Schenectady", "New York", "Boston", "Chicago", "Miami")

for (i_city in 1:5){
  print(city_list[i_city])
}

## [1] "Schenectady"
## [1] "New York"
## [1] "Boston"
## [1] "Chicago"
## [1] "Miami"

Notice how in this loop i_city goes from 1 to 5, increasing by 1 on each iteration.

Thus I can use i_city as the index of the city_list element that I would like to access on each iteration.
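
One caveat with hard-coding 1:5 is that the loop would print NA or skip cities if city_list ever changed length. A minimal sketch of a common alternative, which is a suggested variation and not part of the original slide, uses the base R function seq_along() to generate the indices from the vector itself:

```r
city_list <- c("Schenectady", "New York", "Boston", "Chicago", "Miami")

# seq_along(city_list) yields 1, 2, ..., length(city_list),
# so the loop stays correct if cities are added or removed
for (i_city in seq_along(city_list)){
  print(city_list[i_city])
}
```

For this five-element vector the loop visits the same indices as 1:5, so the printed cities are unchanged.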

Introduction to Data Manipulation

Recall the data science workflow I showed on the first day of class.
Image source: R4DS

Now that we’ve established foundational skills in R programming, we are going to move into the data manipulation (transform) stage of the workflow.

Introduction to Data Manipulation

Why is this so important?

Datasets are often large, containing tens or hundreds of variables and tens of thousands of observations. When conducting our analysis, we often need to:

Create new variables and get rid of other variables
Select data based on certain criteria
Organize, group, and sort data
Summarize data
Join two or more related datasets

In many cases these operations are central to our analysis of interest.
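
As a rough illustration of the operations listed above, here is a minimal base R sketch. The data frames air and states, their column names, and all values are made up for this example and do not come from the course data.

```r
# Hypothetical air-quality readings (made-up values for illustration)
air <- data.frame(
  city   = c("Boston", "Boston", "Chicago", "Chicago", "Miami", "Miami"),
  temp_f = c(50, 59, 41, 68, 77, 86),
  pm25   = c(8, 12, 15, 9, 6, 7)
)

# Hypothetical lookup table relating each city to its state
states <- data.frame(
  city  = c("Boston", "Chicago", "Miami"),
  state = c("MA", "IL", "FL")
)

# Create a new variable, then get rid of one we no longer need
air$temp_c <- (air$temp_f - 32) * 5 / 9
air$temp_f <- NULL

# Select data based on a criterion
warm <- air[air$temp_c > 12, ]

# Sort the data by a variable
air_sorted <- air[order(air$pm25), ]

# Group and summarize: mean pm25 for each city
pm25_by_city <- aggregate(pm25 ~ city, data = air, FUN = mean)

# Join two related datasets on their shared city column
air_joined <- merge(air, states, by = "city")
```

Each step works, but the syntax differs from one operation to the next, which is part of what makes longer base R manipulation code harder to read.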

Introduction to Data Manipulation

At this point we’ve learned some basics of data manipulation using the functionality available in base R.

Today we are going to begin taking these skills to the next level.

We’ll do this by using an amazing package called dplyr.

Introduction to Data Manipulation

The dplyr package is included in the tidyverse collection of packages, so you should already have it installed on your computer.

With dplyr we’ll be able to manipulate data using the functions included in the package. This will make doing the types of data manipulation we’ve done thus far much, much easier (both to code and to read/understand).

We will also gain a whole new set of data manipulation operations.

With dplyr we will be able to seamlessly deal with large and complex data.

Introduction to Data Manipulation

The dplyr package gives us a grammar of data manipulation. It provides verbs (functions) for many common data manipulation tasks, and these verbs act on our subjects (datasets).

We’ll learn a number of key dplyr verbs, including:

filter() to select cases based on their values
mutate() to add new variables that are functions of existing variables
arrange() to reorder the cases
select() to select variables based on their names
summarize() to condense multiple values to a single value
group_by() to group a dataset by one or more variables
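
To give a feel for how these verbs read, here is a minimal sketch applying each one to a small data frame. The data frame air, its column names, and its values are hypothetical and made up for illustration. The sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical air-quality readings (made-up values for illustration)
air <- data.frame(
  city   = c("Boston", "Boston", "Chicago", "Chicago", "Miami", "Miami"),
  temp_f = c(50, 59, 41, 68, 77, 86),
  pm25   = c(8, 12, 15, 9, 6, 7)
)

# mutate(): add a Celsius column computed from temp_f
air_c <- mutate(air, temp_c = (temp_f - 32) * 5 / 9)

# filter(): keep only the rows where temp_c exceeds 12
warm <- filter(air_c, temp_c > 12)

# select(): keep only the named columns
slim <- select(air_c, city, temp_c, pm25)

# arrange(): sort the rows by pm25, lowest first
sorted <- arrange(slim, pm25)

# group_by() + summarize(): condense pm25 to one mean per city
by_city <- summarize(group_by(slim, city), mean_pm25 = mean(pm25))
```

Every verb takes a data frame as its first argument and returns a new data frame, leaving the original air object unchanged.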