Overview

Today’s lab will consist of two parts. First we will make sure that you are properly set up with all of the software and accounts that you need for this class. Next you will work with some fuel economy data from the Environmental Protection Agency (EPA) to learn more about fuel efficiency in automobiles – which is an important from an environmental/climate perspective given that vehicles are responsible for 51% of CO2 emissions for a typical U.S. household1.

As a reminder I want you to talk to your neighbors and discuss the material and your thoughts. You’ll learn from each other and have much more fun this way too.

Step 0: Lab guidelines and formatting

I will go over the guidelines for lab report expectations including the content, structure, and formatting of your reports. Remember that you can find additional details on lab guidelines and goals on our class website.

Step 1: Analysis of fuel efficiency in automobiles

Automobiles are a significant contributor to both CO2 emissions and air pollution. Understanding fuel efficiency in vehicles and the key factors that affect the efficiency are important from both an environmental and economic perspective. Today you will use EPA vehicle fuel economy data to learn more about this issue and explore some interesting questions.

Fuel Economy Analysis

Set up for analysis

We are going to load in the tidyverse package. If you haven’t installed it yet, you will first need to do that (it only needs to be installed once and once installed can be loaded in anytime you want). To install tidyverse go to the Packages tab and click the install button. In the window that pops up, type tidyverse and click Install.

library(tidyverse)

The data is conveniently part of one of the packages included with tidyverse, so we can easily load in the data, which is called mpg.

We’ll assign the mpg data to a new object that we’ll call fuel_data. However, before you run the code block below, you should first learn a bit more about the mpg dataset.

To do that type ?mpg in your Console (we’ll use the console, because we only want to get this info once, and not every single time we run our Notebook).

fuel_data <- mpg  # assigning mpg to a new object


Use the View() function to see the dataset in a spreadsheet-style viewer.

This can be really helpful when you are trying to get familiar with a dataset that you’ve loaded in.

Let’s type View(fuel_data) in the Console (not in our Notebook) since we just want to do this once (and not every single time we run our Notebook).


Use the str() function to examine the structure of the fuel_data dataset

# Your code here
  • How many variables are there?
  • How many observations?
  • What does chr, int, and num mean in the output above?


Look at your Base R Cheatsheet to find a function that allows you to examine the first six rows of the fuel_data dataset. Apply this function to your fuel_data object and it will print these results to your notebook.

# Your code here
  • What do the disp, cyl and cty columns represent? (Hint: type ?mpg into your console to get more info on this dataset)
  • What website was the data obtained from?


How many rows and columns are there in the fuel_data dataset? There is a function that will give you the dimensions of an object (i.e. number of rows and columns). Look at your Base R Cheatsheet to find this function.

# Your code here
  • Number of rows?
  • Number of columns?
  • Use the approach that you employed to get the dimensions and assign the value for the number of rows to a variable n_rows and number of columns to n_cols (Hint: you can find functions for just the number of rows and just number of columns respectively on your Base R Cheatsheet)


Use the summary() function to get summary statistics on the fuel_data dataset

# Your code here
  • For highway driving how many miles per gallon does
    • The most fuel efficient vehicle get?
    • The least fuel efficient vehicle get?
  • What is the average miles per gallon for city driving?


Create a new variable to store the ratio of the hwy to cty fuel economy

To create a new variable you can use the mutate() function (which is included in the dplyr package that loads in with tidyverse). You can get more info on mutate() by typing ?mutate() to your console or by Googling it. FYI, we are going to learn a lot more about dplyr later in the term`

You can also create a new variable by using the $ notation to assign it. The $ allows us to access (or create) a new variable in a data frame. For instance fuel_data$new_variable_name <- expression that defines the variable

I’ve shown you how to do it both ways in the code below. You can pick one (they do the same thing) and uncomment it so that it will run.

## Use mutate() to create the new variable (hwy2cty) and add it to the fuel_data object
# fuel_data <- mutate(fuel_data, hwy2cty = hwy/cty)

## Use the $ notation to create the new variable (hwy2cty) and add it to the fuel_data object
# fuel_data$hwy2cty <- fuel_data$hwy/fuel_data$cty
  • Examine this new variable
    • Is there any vehicle that gets better gas milage in the city as compared to the highway?
    • Is there any vehicle that has twice as good gas milage on the highway relative to the city?
    • What is the largest ratio between highway and city gas milage?
  • Look at the code above that we used to create the new variable hwy2cty and spend a few minutes discussing with your neighbor(s) what the code is doing.


Examine factors influencing fuel economy: Displacement

Let’s see how the size of a vehicle’s engine (displacement) influences the fuel economy

Use ggplot to create a scatter plot of hwy vs displ. Remove the # and replace the ... with the appropriate values.

# ggplot(data = ...) + geom_point(aes(x = ... , y = ...))
  • How does displacement influence the fuel economy?
  • Do any points appear to be outliers from the observed relationship? Any thoughts on what might explain these outliers?


Examine factors influencing fuel economy: Displacement and vehicle class

Let’s add another variable to our analysis to try and further explain what influences fuel economy. We’ll now color the points by their vehicle class.

# ggplot(data = ...) + geom_point(aes(x = ... , y = ... , color = ... ))
  • What new information did this provide us with?
    • Can we say more about factors affecting fuel economy?
    • Can we explain some of the outliers?


Add a new variable to fuel_data

Below I’ve added a new variable region to fuel_data using the mutate() function. This step is relatively advanced, so don’t worry if it looks well beyond what we’ve covered so far. However, I want you to take a look at the code and try to decipher what I did here. I tried to use variable names that are logical and if you see a function that you don’t understand, try looking at the help file and/or Googling it.

us_makes <- c("chevrolet","dodge","ford","jeep",
              "lincoln","mercury","pontiac") # list of U.S. manufacturers

fuel_data <- mutate(fuel_data, region = if_else(manufacturer %in% us_makes,"US","Foreign") )
  • What is the variable region telling us?
  • Can you briefly explain the logic and steps taken to create the variable region?

You may find region to be another interesting variable to examine in the next section.


Additional analysis of your choosing

With any remaining time you should perform further exploratory analysis of the fuel_data dataset. You should discuss ideas with your classmates and you can also ask me for guidance (but try to brainstorm some ideas before asking). I will be moving about the classroom and checking in with everyone, providing guidance and suggestions, and hearing what you’ve learned about the dataset being analyzed.

Note that this section of the lab is important and should not be given only a cursory work through. An important learning goal for this term is for you to develop your own independent research skill. Thus, all of our labs this term will have a large, independent component where you are expected to apply what you’ve learned to ask novel and interesting questions and furthermore to try and go beyond what you’ve learned in class.

When trying to get started with something new, remember the copy/paste/tweak approach. This approach can help give you a good starting point for your work.


Step 2 (optional): Work with an expanded EPA fuel efficiency dataset

If you are interested in further exploring EPA fuel efficiency data you can download the full EPA dataset here. The description of each of the variables in the dataset is available here.

This dataset has information on thousands of vehicles and their fuel/energy efficiency (along with dozens of other related variables) for models from 1984-2022. Notably this dataset contains information on many electric and hybrid vehicles.

Unlike the mpg dataset you used in Part 1, the full EPA dataset is quite a bit “messier” (e.g. variables have missing data in some rows) and quite a bit more complex (upwards of 80 variables are reported for a given vehicle). However, the larger number of observations (i.e., vehicles) and variables (i.e., vehicle attributes) allows for a much richer dataset.

If you decide to work on this analysis let me know and I can help guide you on how to load in and get started with the data.

Step 3: Submit the lab

Your lab is due prior to the start of next week’s lab. Once you are finished and satisfied with your work you should knit it to an html file and submit both your html and Rmd file to Nexus.

To knit a file you can go to the menu bar at the top of your notebook and click the dropdown that currently says preview and select the knit to html option. This will knit your document, which runs all of your code and generates a nice report in html format (the file is saved in your current working directory).

An even easier way to knit your file is to go to the header at the top of your document and change html_notebook to html_document and then save your file. You will then see that the Preview option in the menu bar will have changed to Knit. Click Knit and your report will be knit.

Before you leave lab today make sure you know how to knit your document

Make sure your file is properly named BEFORE you submit it The correct naming structure is LabName_YourLastName

Make sure to replace the author and date in the header with your info.