When working with data you will commonly need to deal with dates. We’ve already dealt with dates in this class; however, for the most part I’ve done some preliminary data manipulation to get the dates into a format that is easy for you to work with (e.g. separate columns of year, month, and day).
While this approach made the data easier to deal with for learning purposes, it also imposed some limitations. Furthermore, most datasets you encounter will not have had this preliminary manipulation done.
Given the importance of dates in data analysis, we will learn some techniques for handling dates in R. One of the challenges posed by dates is that you will often encounter them in a wide variety of formats, including those that mix both text and numbers. For instance consider the many ways that you could represent February 20th, 2019
The above list is by no means exhaustive. Furthermore, we could add the time of day to the date as well, which would add even more formats to the list.
To allow for dates to be properly interpreted by R, we need to convert the date from its initial format into an R date object. R has built-in functionality to do this, though the `lubridate` package provides additional functionality and ease of handling for dates in R.
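For comparison, here is a minimal sketch of the built-in route using base R's `as.Date()`. It needs an explicit format string whenever the text is not already in year-month-day order. The example strings are made up for illustration.

```r
# Base R: as.Date() parses text, optionally guided by a format string
iso_date <- as.Date("2019-02-20")                       # year-month-day is the default
us_date  <- as.Date("02/20/2019", format = "%m/%d/%Y")  # month/day/year needs a format

class(iso_date)       # "Date"
iso_date == us_date   # TRUE: both represent February 20th, 2019
```

Having to spell out the format for every variation is the main inconvenience that the `lubridate` parsing functions remove.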
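Let’s first load in the packages `lubridate` and `tidyverse`. You will already have `lubridate` on your computer since it is included in the tidyverse packages. However, in tidyverse versions before 2.0.0, `library(tidyverse)` only loads in the core tidyverse packages, which did not include `lubridate`, so you’ll still need to type `library(lubridate)` to load it. (From tidyverse 2.0.0 onward `lubridate` is one of the core packages, and the extra call is harmless.)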
library(tidyverse)
library(lubridate)
Date/time info can be stored as a date, time within a day (e.g. 15:45 EST) or date-time which has both the date and time information.
When creating dates or date-times, you will typically be starting from one of the following
Let’s first take a look at dealing with dates in string format
dates_string <- c("2019-2-20", "2019-2-21", "2019-2-22")
Let’s check the class of `dates_string`
class(dates_string)
## [1] "character"
You can see that it is a character vector at the moment. To get this into a date format we’ll rely on the `ymd()` function in `lubridate`. The `ymd()` function converts text in year-month-day format into a date object.
dates_object <- ymd(dates_string)
class(dates_object)
## [1] "Date"
You can see that R now recognizes the data as a date object (the date has been converted from character to a Date object)
Note that `lubridate` is smart and will easily convert dates that use other separators (e.g. `/` or `.` instead of `-`).
dates_string <- c("2019/2/20", "2019/2/21", "2019/2/22")
class(dates_string)
## [1] "character"
dates_object <- ymd(dates_string)
class(dates_object)
## [1] "Date"
Now you try: create a character vector of dates that uses `.` to separate the year, month, and day (i.e. year.month.day). Also “pad” the months and days with zeros so that they are always two digits long (e.g. `02` for February instead of `2`). Then use the `ymd()` function to convert your character vector to a vector of dates.
# Your code here
So we’ve seen that we can convert dates in year-month-day format using the `ymd()` function. Lubridate has additional functions to parse other date configurations.
Function | Date format |
---|---|
`ymd()` | year/month/day |
`ydm()` | year/day/month |
`mdy()` | month/day/year |
`myd()` | month/year/day |
`dmy()` | day/month/year |
`dym()` | day/year/month |
These lubridate functions are smart and will properly parse date strings that present the date in different styles – the key thing to ensure is that you specify the function that corresponds to the ordering of the date components.
ymd("2018 Mar 31")
## [1] "2018-03-31"
mdy("July 4th 2016")
## [1] "2016-07-04"
dmy("10-Feb-2020")
## [1] "2020-02-10"
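To see why the ordering matters, here is a small sketch of what happens when the function does not match the text. The date string is made up for illustration, and `suppressWarnings()` is only there to keep the output quiet.

```r
library(lubridate)

# Day-month-year text parsed with the matching function
right_order <- dmy("31-03-2018")

# The same text handed to ymd(): the pieces cannot be read as a valid
# year-month-day date, so lubridate returns NA (and normally warns)
wrong_order <- suppressWarnings(ymd("31-03-2018"))

right_order          # "2018-03-31"
is.na(wrong_order)   # TRUE
```

An `NA` after parsing is usually a sign that the wrong ordering function was chosen, so it is worth checking for missing values right after converting a column.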
# Your code here
When your data is in date-time format (i.e. has both the date and time specified), you can parse it into an R date-time object by appending `_hms` to the applicable lubridate function (e.g. `ymd_hms()`). Let’s illustrate this with an example.
my_datetimes_string <- c("2019/02/22 10:30:00", "2019/10/15 13:45:10")
my_datetimes_string
## [1] "2019/02/22 10:30:00" "2019/10/15 13:45:10"
my_datetimes <- ymd_hms(my_datetimes_string)
my_datetimes
## [1] "2019-02-22 10:30:00 UTC" "2019-10-15 13:45:10 UTC"
You can see that the strings were properly parsed into date-times, though note that the time-zone defaults to “UTC”. You can specify the time-zone of interest as follows
my_datetimes <- ymd_hms(my_datetimes_string, tz = "EST")
my_datetimes
## [1] "2019-02-22 10:30:00 EST" "2019-10-15 13:45:10 EST"
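One thing to be aware of is that `"EST"` is a fixed offset of five hours behind UTC for the whole year, which is why the October date-time above is also labeled EST. The sketch below, which assumes your R installation has the standard time-zone database, compares it with the region-based name `"America/New_York"` that follows daylight saving time.

```r
library(lubridate)

# Same clock reading, two different time-zone definitions
oct_fixed  <- ymd_hms("2019-10-15 13:45:10", tz = "EST")
oct_region <- ymd_hms("2019-10-15 13:45:10", tz = "America/New_York")

oct_fixed    # "2019-10-15 13:45:10 EST"
oct_region   # "2019-10-15 13:45:10 EDT"

# The two objects are different instants, one hour apart
difftime(oct_fixed, oct_region, units = "hours")   # Time difference of 1 hours

# with_tz() displays an instant on another time zone's clock
with_tz(oct_region, tzone = "UTC")   # "2019-10-15 17:45:10 UTC"
```

If your data were recorded on local clock time in the eastern US, the region-based name is probably the better choice; the fixed offset is appropriate for instruments that never switch to daylight time.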
Let’s also take a quick look at the class of the `my_datetimes` object
class(my_datetimes)
## [1] "POSIXct" "POSIXt"
We can see that the class is “POSIXct”, which is the R class for date-time objects
You can also create date-time objects with the `_h` and `_hm` suffixes, where the date-time objects have only the hours, or the hours and minutes, specified respectively
ymd_h("2019 Feb 3rd 15", tz = "EST")
## [1] "2019-02-03 15:00:00 EST"
ymd_hm("2019 Feb 3rd 10:45", tz = "EST")
## [1] "2019-02-03 10:45:00 EST"
You will commonly come across data that has date information stored across columns, where each column holds part of the date (or date-time) information. For example, you may have three columns with the year in one column, the month in another, and the day in yet another column. While there are certain cases when this is helpful, oftentimes you would like to have a single column that contains all of the date information. This is particularly useful when creating time-series graphics.
The `lubridate` package has the functions `make_date()` and `make_datetime()`, which do this operation.
First, let’s load in USGS daily streamflow data for the Hudson River (USGS gage 01335754 above Lock 1 near Waterford, NY). This data has the date information stored across columns.
flow <- read_csv("https://stahlm.github.io/ENS_215/Data/USGS_gage_01335754.csv") %>%
drop_na()
Take a look at the dataset and you’ll see that there is a `Year`, `Month`, and `Day` column. Let’s create a new column called `Date` that has all of the date information (stored as a date object) in a single column. We’ll use `make_date()` to do this.
flow <- flow %>%
mutate(Date = make_date(Year, Month, Day))
Now we’ve got the date information in a single column. If you also had hour, minute, and seconds information then you could have used the `make_datetime()` function.
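As a rough sketch of the date-time case, the example below builds a small made-up table (`readings` and its column names are hypothetical, not part of the streamflow data) and combines five columns into one date-time column.

```r
library(tidyverse)
library(lubridate)

# Hypothetical readings with the date-time spread across five columns
readings <- tibble(
  Year = c(2019, 2019), Month = c(2, 10), Day = c(22, 15),
  Hour = c(10, 13), Minute = c(30, 45)
)

# make_datetime() takes year, month, day, hour, minute (and optionally seconds)
readings <- readings %>%
  mutate(DateTime = make_datetime(Year, Month, Day, Hour, Minute))

readings$DateTime   # "2019-02-22 10:30:00 UTC" "2019-10-15 13:45:00 UTC"
```

Any component you leave out takes its default value (zero for the time parts), and the time zone is UTC unless you supply the `tz` argument.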
Since we now have a single column that contains all of the date information we can get rid of the `Year`, `Month`, and `Day` columns as they are now redundant.
flow <- flow %>%
select(Date, flow_cfs)
Now we have a more compact dataframe that still contains all of the original information
head(flow)
Having a single column with the date information is incredibly useful as we can now create a time-series plot of the daily streamflow data
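A minimal sketch of such a plot is shown here. To keep it self-contained it uses a small made-up data frame (`flow_demo`, with synthetic values) in place of the real `flow` data; with the real data you would pass `flow` to `ggplot()` instead.

```r
library(tidyverse)
library(lubridate)

# Made-up stand-in for the flow data: 60 days of synthetic values
flow_demo <- tibble(
  Date = ymd("2019-01-01") + days(0:59),
  flow_cfs = 5000 + 1000 * sin((0:59) / 10)
)

# With a Date column on the x-axis, ggplot spaces and labels the axis as time
flow_plot <- ggplot(flow_demo, aes(x = Date, y = flow_cfs)) +
  geom_line() +
  labs(x = "Date", y = "Streamflow (cfs)")

flow_plot
```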
FYI, you can get the current date or current date-time by using the `today()` and `now()` functions from the lubridate package.
today()
## [1] "2019-02-24"
now()
## [1] "2019-02-24 12:03:01 EST"
There are many situations where you will need to extract the individual components from a date. The `lubridate` package has a number of functions that perform these operations.
To demonstrate this, let’s create a date-time object and then extract its components
First we’ll create the date-time object using the current date and time
datetime_test <- now(tzone = "EST") # current date-time, shown on the EST clock
datetime_test
## [1] "2019-02-24 12:03:02 EST"
Now let’s extract the components. The functions `year()` and `month()` will extract the year and month respectively. When extracting days we can use a number of functions: `mday()` will get the day of the month, `yday()` will get the day of the year (i.e. the day number counting from January 1st), and `wday()` will get the day of the week.
Let’s test out these functions on our `datetime_test` object
year(datetime_test) # get the year
## [1] 2019
month(datetime_test) # get the month
## [1] 2
mday(datetime_test) # get the day of month
## [1] 24
yday(datetime_test) # get the day of year
## [1] 55
wday(datetime_test) # get the day of week
## [1] 1
hour(datetime_test) # get the hour
## [1] 12
minute(datetime_test) # get the minute
## [1] 3
second(datetime_test) # get the second
## [1] 2
decimal_date(datetime_test) # get the date in decimal year format. For instance if you are halfway through 2019 then the date would be 2019.50
## [1] 2019.149
For months and days, you can have the functions output the results as a text string containing the abbreviation for the month or day. To do this we simply specify `label = TRUE` in the `wday()` or `month()` function
month(datetime_test, label = TRUE)
## [1] Feb
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
wday(datetime_test, label = TRUE)
## [1] Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
Extracting the date-time components is often extremely helpful. For instance, if you want to filter a dataset by year, you can use the `year()` function to get the year values from a single date variable (column) without the need to create a column with the year information.
For example, consider the `flow` data that we’ve loaded in today. We have a single column with all of the date information. To filter the dataframe by year we can use the `year()` function in our filter statement.
flow %>%
filter(year(Date) == 2017)
Use the `make_datetime()` function to create a single column `DateTime` in the `flow` dataset that has all of the date and time info stored as a date-time object. You can set the values to 12, 00, and 00 respectively for hours, minutes, and seconds.
flow <- flow %>%
  mutate(DateTime = make_datetime(year(Date), month(Date), mday(Date), 12, 00, 00))
Create a string that prints “Today is Fri, Feb 22 and the time is 15:30”, but with the information corresponding to the present date and time. You will need to use the `paste()` function to concatenate strings (we learned this earlier in the term).
Using the `flow` data, create a summary table with the mean flow for each month (i.e. twelve rows, one for each month).
Using the `flow` data, create time-series plots of flow vs. time for each year from 2010 onwards. You should use `facet_wrap()` to create all of the plots at once. On the x-axis you should have the day of the year.
There are many situations where you will need to perform mathematical operations (e.g. addition, subtraction, division, …) on dates. Using the base R functionality along with the `lubridate` package you can carry out these types of operations.
Before moving forward it is helpful to be aware of the three ways time spans are represented in R. These are nicely defined in R4DS and are reproduced below:
- durations, which represent an exact number of seconds.
- periods, which represent human units like weeks and months.
- intervals, which represent a starting and ending point.
Durations are stored as seconds, since this is the only unit of time with a consistent length. For instance, you could not use units of years to store durations as years can have different lengths (e.g. a leap-year vs. a regular year).
Durations are important to use when you are representing physical processes. For instance, imagine that you have a temperature sensor in the ocean and it is powered by a battery. If the battery life remaining was reported in years, there would be some ambiguity (as years can have different lengths), whereas reporting the life remaining in seconds leaves no ambiguity.
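You can create durations using `lubridate` with functions such as `dyears()` and `dweeks()`. The syntax is straightforward: simply prepend `d` to the (plural) time unit of interest to get the function name. When creating durations, the lubridate functions operate with the following settings: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 365 days in a year. Note that the 365-day year applies to the lubridate version used for the output shown here; newer lubridate releases use 365.25 days, so `dyears(1)` prints `31557600s` there.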
Let’s create some durations to highlight how they work
dyears(1) # duration of 1 year
## [1] "31536000s (~52.14 weeks)"
dyears(2) # duration of 2 years
## [1] "63072000s (~2 years)"
dweeks(75) # duration of 75 weeks
## [1] "45360000s (~1.44 years)"
ddays(10) # duration of 10 days
## [1] "864000s (~1.43 weeks)"
dminutes(15) # duration of 15 minutes
## [1] "900s (~15 minutes)"
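If you want to confirm what a duration holds, you can convert it to a plain number of seconds with `as.numeric()`. The short sketch below does this; no output is shown for the last line because the length of a duration year has differed between lubridate releases.

```r
library(lubridate)

# Durations are stored as a number of seconds
as.numeric(ddays(1))    # 86400  (24 * 60 * 60)
as.numeric(dweeks(1))   # 604800 (7 * 86400)

# Number of days in one duration-year for the lubridate version you have installed
days_per_dyear <- as.numeric(dyears(1)) / as.numeric(ddays(1))
days_per_dyear
```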
You can add, multiply, and divide durations
dyears(1) + dweeks(10) + ddays(25)
## [1] "39744000s (~1.26 years)"
5 * dyears(1)
## [1] "157680000s (~5 years)"
dyears(20)/10
## [1] "63072000s (~2 years)"
You can also add durations to, or subtract them from, dates
right_now <- now() # date-time now
right_now
## [1] "2019-02-24 12:03:02 EST"
right_now - dyears(10) # date-time 10 duration years prior to now
## [1] "2009-02-26 12:03:02 EST"
You’ll notice that the result is not today’s date, ten years ago. This is because durations are not calendar (“human”) units, but are instead an exact number of seconds (i.e. 31,536,000 seconds for a year duration). Thus using durations may give results that are unexpected.
Our result above is the date and time that is exactly 10 × 31,536,000 seconds (3,650 days) before the moment the code was run. The ten calendar years in between include two leap days (2012 and 2016), so the result lands two days later than the same calendar date ten years ago.
Durations are useful for determining the time elapsed between two date-times. For example, you might want to know exactly how much time has elapsed since Union College was established. First let’s subtract the date when Union College was established from the current date-time (we’ll assume the time of establishment was 12 noon)
Union_time <- now() - ymd_hm("1795, Feb 25th 12:00")
Union_time
## Time difference of 81813.21 days
Subtracting two dates gives you a `difftime` object in R. This object records time spans using seconds, minutes, hours, days, or weeks (depending on the situation) and thus can create ambiguity. To remove ambiguity we can convert the difftime object to a duration (which always uses seconds) using the `as.duration()` function
as.duration(Union_time)
## [1] "7068661382.75484s (~223.99 years)"
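If you would rather keep working with the difftime object, one option is to ask for the units explicitly instead of relying on whatever R picked. A small sketch with two made-up dates:

```r
library(lubridate)

# Subtracting two dates gives a difftime; R chose "days" here
span <- ymd("2019-02-24") - ymd("2019-01-01")
span                                # Time difference of 54 days

# Request specific units rather than relying on the automatic choice
as.numeric(span, units = "hours")   # 1296

# Or convert to a duration, which is always in seconds
as.duration(span)                   # "4665600s (~7.71 weeks)"
```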
While durations are very useful when dealing with physical processes where you care about the true time elapsed, there are many situations where we are interested in dealing with calendar (“human”) times.
In these situations we can use periods in lubridate. Periods do not have a fixed length in seconds (whereas durations do), and instead they work with “human” time-spans such as calendar days, months, years.
The functions for periods are similar to those for durations, with the only difference being that we drop the `d` at the start of the function name. Thus, for example, the functions `years()`, `months()`, and `days()` represent periods of years, months, and days respectively.
Let’s demonstrate how periods in lubridate work.
right_now - years(10)
## [1] "2009-02-24 12:03:02 EST"
You can see that subtracting a period of 10 years from the `right_now` date-time gives us a result that is exactly 10 calendar years prior to today (i.e. the exact same time and date, but 10 years earlier).
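The difference between the two kinds of time span is easy to see around a leap year. The sketch below uses a fixed start date (chosen for illustration) so the result does not depend on when you run it.

```r
library(lubridate)

start <- ymd("2020-01-01")   # 2020 is a leap year (366 days)

# Duration: exactly 365 * 86400 seconds later, one day short of a full leap year
after_duration <- start + ddays(365)

# Period: the same calendar date one year later
after_period <- start + years(1)

after_duration   # "2020-12-31"
after_period     # "2021-01-01"
```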
Just like with durations you can add, subtract, multiply, and divide periods.
months(2) + days(5) # addition
## [1] "2m 5d 0H 0M 0S"
5 * (months(2) + days(5)) # addition and multiplication
## [1] "10m 25d 0H 0M 0S"
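One caveat with month periods is that adding a month can land on a date that does not exist. A brief sketch of this, along with lubridate's `%m+%` operator, which rolls back to the last valid day of the month:

```r
library(lubridate)

jan31 <- ymd("2019-01-31")

# Plain period arithmetic: February 31st does not exist, so the result is NA
plain_add <- jan31 + months(1)

# %m+% rolls back to the last day of the target month instead
rolled_add <- jan31 %m+% months(1)

plain_add    # NA
rolled_add   # "2019-02-28"
```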
In many cases you will want to output a date-time object as a character string, for instance to annotate a graphic or include in the text of a report. To do this you can use the `as.character()` function to convert the date-time object to a character string. You can specify the format of the output using the `format =` argument and one or more of the many format codes listed below.
Format codes | Description | Example |
---|---|---|
`%a` | Weekday name (abbreviated) | Sun, Tue, Tue |
`%A` | Full weekday name | Sunday, Tuesday, Tuesday |
`%m` | Month as decimal number | 06, 07, 08 |
`%b` | Month name (abbreviated) | Jun, Jul, Aug |
`%B` | Full month name | June, July, August |
`%c` | Date and time, locale-specific | Tue Aug 12 18:01:59 2014 |
`%d` | Day of the month as decimal number | 23, 30, 12 |
`%H` | Hours as decimal number (00 to 23) | 03, 14, 18 |
`%I` | Hours as decimal number (01 to 12) | 03, 02, 06 |
`%p` | AM/PM indicator in the locale | AM, PM, PM |
`%j` | Day of year as decimal number | 174, 211, 224 |
`%M` | Minute as decimal number (00 to 59) | 45, 23, 01 |
`%S` | Second as decimal number | 23, 00, 59 |
`%U` | Week of the year starting on the first Sunday | 25, 30, 32 |
`%W` | Week of the year starting on the first Monday | 24, 30, 32 |
`%w` | Weekday as decimal number (Sunday = 0) | 0, 2, 2 |
`%x` | Date (locale-specific) | 8/12/2014 |
`%X` | Time (locale-specific) | 2:23:00 PM, 6:01:59 PM |
`%Y` | 4-digit year | 2013, 2013, 2014 |
`%y` | 2-digit year | 13, 13, 14 |
`%Z` | Abbreviated time zone | EST, MST |
`%z` | Time zone offset from UTC | -0500, -0700 |
Let’s test this out.
right_now <- now()
as.character(right_now, format = "%c") # Date and time, locale-specific
## [1] "Sun Feb 24 12:03:02 2019"
as.character(right_now, format = "%B") # Full month name
## [1] "February"
as.character(right_now, format = "%A %B %d") # Full weekday name, full month name, day of month
## [1] "Sunday February 24"
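The base R `format()` function accepts the same format codes. It appears that newer R releases (4.3 and later, as far as I can tell) changed `as.character()` so that it no longer honors a `format` argument for dates and date-times, so `format()` is probably the safer call if your output does not match the examples above. A short sketch with a fixed date-time so the results are reproducible:

```r
library(lubridate)

# A fixed date-time (chosen for illustration) instead of now()
stamp <- ymd_hms("2019-02-24 12:03:02", tz = "EST")

format(stamp, "%Y-%m-%d %H:%M")   # "2019-02-24 12:03"
format(stamp, "%j")               # "055" (day of year, zero-padded)
```

Codes that produce names, such as `%A` or `%B`, depend on your locale, which is why they are left out of this sketch.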
Today covered only part of the `lubridate` package. Look at your `lubridate` cheat sheet and practice with some additional functions that seem useful to you but we didn’t cover today. A nice place to start might be the intervals functionality.
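As a possible starting point, here is a brief sketch of intervals. The start and end dates are hypothetical and chosen only for illustration.

```r
library(lubridate)

# An interval is anchored to a specific start and end point
term <- interval(ymd("2019-01-07"), ymd("2019-03-15"))   # hypothetical term dates

# Does a given date fall inside the interval?
in_term <- ymd("2019-02-24") %within% term

# Length of the interval in days
term_days <- time_length(term, unit = "day")

in_term     # TRUE
term_days   # 67
```

Because an interval knows its actual start and end, its length can be computed exactly, which is not possible for a period such as `months(2)` on its own.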