Code
library(tidyverse)We’ve been loading data into R in nearly every class, but we haven’t yet focused on the mechanics of doing this (I’ve been supplying the data and code to load it in). Now you’ll learn how to load in your own data.
Most of the data you load in will either be in comma separated values (.csv) or Excel format (.xls or .xlsx). Excel files and csv files are similar with an Excel file basically a “fancier” version of a csv file that allows you to have “markup” - such as colors, font formatting, cell borders - in addition to your data.
You can directly load in a csv file or Excel file from your computer (local file) or from a weblink (we’ve been loading data from my github site in most classes).
csv fileLet’s start by loading in a .csv file.
Go to our class website and right-click the link below in today’s notebook. Select “Save link as…” and then save this file as “co2_mm_mlo.csv” in a sub-folder “Data” within your “Lectures” folder for this class. Make sure you save it as a csv file
Now navigate to the location where you saved the file and open it in Excel or a basic text editor (e.g. Notepad. Don’t use Microsoft Word). You’ll see how the data opens nicely in Excel or a text editor (you’ll notice that in the text editor there are commas separating the values).
Ok, so you’ve downloaded the data and save it to your computer. Now you’ll want to load it into R so you can start to use the data in your analysis.
We’ll need to use the read_csv() function which is in the readr package. Let’s load in tidyverse which will load in readr (along with the many other packages in the tidyverse collection).
library(tidyverse)Now to load in the data all we need to do is use the read_csv() function and point it to the file we want to load in. Remember we should always use relative paths. Since your current working directory is your “Lectures” folder we can start the file path at our “Lectures” folder. You should first check in your console that your working directory is in fact your “Lectures” folder - you do this by typing getwd(). If your working directory is not your “Lectures” folder then you should make sure to open you “Lectures” project (or ask me for help).
Before we load in the data we need to deal with the fact that there were many rows of with metadata (i.e., rows that have info about the data but aren’t the data itself).
We will need to skip over these rows when we load in the data. As you can see in the image above, we have 40 rows with metadata. Row 41 has the column headers (which we will want to keep) and rows 42 onward have the data.
Thus, we will want to skip over the first 40 rows. To do so, we simply need to specify the number of rows to skip using skip = 40 within our read_csv function call.
CO2_data <- read_csv(file = "./Data/co2_mm_mlo.csv", skip = 40)Your data should have been loaded into R and saved to the object CO2_data. As you can see R is smart and treats the first row loaded in as the column headers (variable names). Note that the ./ in your file path indicates that you want to begin your path in your current folder.
So now you know how to load in a csv file!
However, sometimes things are a bit less neat. For example you may have a file header at the top of your file. This head might have useful data (e.g. info about the data source) but you would like to skip the head when loading your data into R. This is easy to do.
xls or xlsx fileLoading in an Excel xls or xlsx file is essentially the same as loading in a csv file. However, we’ll first need to load in the readxl package which has the necessary functions.
library(readxl)Let’s first save another copy of the co2_mm_mlo.csv data, this time saving it as an xlsx file that we’ll call “Mauna_Loa_excel_example”. You should move your data so that it isn’t in the top-left corner of the Excel file (we’ll do this to highlight some of the functionality of the read_xlsx() function).
You can see that in the image above, the top-left corner of my data is cell C8. The bottom-right (which you can’t see in the image) is in cell D679. The read_xlsx() function will always look in the left uppermost cell (i.e. A1) first unless you specify otherwise. Since our data starts elsewhere we’ll need to tell the function where to look, however this is very easy to do by simply specifying the range.
Make sure to specify the range that applies to your specific Excel file.
Mauna_loa_another_example <- read_xlsx("./Data/Mauna_Loa_excel_example.xlsx", range = "C8:D679")There are lots of additional functionality that can be employed with the read_xlsx() function - including loading in data from different sheets (tabs) of the same file. A quick Google search will show you nice examples demonstrating these additional features.