Writing functions in R

library(tidyverse)

You have been using functions in R since the very first day of this class. These functions have been functions from base R or from one of many packages that we’ve used. Functions are great as they allow you to automate tasks that you commonly need to do.

Think about one of the most basic functions mean() – this allows you to take the mean of a dataset without having to compute the sum and then divide that sum by the number of values in the dataset. More complex functions such as lm(), which computes a linear model, allow you to perform operations that take many lines of code in just a single function call.

While, base R and the thousands of available R packages provide a wide variety of functions, many times you will find yourself needing a function that does not exist. You can write your own R functions to automate tasks that you frequently find yourself performing. Writing a function is a much, much better approach than simply copying and pasting code to repeat tasks. The benefits of function writing over the copy-and-paste approach is nicely summarized in the R4DS (excerpt below):

You can give a function an evocative name that makes your code easier to understand.

As requirements change, you only need to update code in one place, instead of many.

You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).

So if you find yourself copying and pasting code to perform the same task repeatedly, you should stop and write a function.

Writing functions

Function basics

Before we move ahead, let’s have a brief refresher on some concepts and terminology related to functions. Functions, are passed arguments (i.e. inputs), the body of the function then perform operation(s) on these arguments and then the function returns outputs. When you use a function it is referred to as calling the function.

A function is written in R as follows

function_name <- function(argument_1, argument_2) {
  
  # code performing functions operation(s)
  # code performing functions operation(s)
  # code performing functions operation(s)...
  
}

You can see that on the first line of the code above, the function function_name is defined. By assigning function(argument_1, argument_2) to function_name we are declaring a new function called function_name, which takes two arguments argument_1 and argument_2. Note that you can write functions that have as many (of as few) arguments as you want/need.

Inside the curly brackets {} is the body of the function – this is the code that is executed when the function is called.

In R, a function will return(i.e. output) the result of the last line in the body of the function.

To illustrate how we create a function in R, let’s create a very simple function that takes a number as input and then returns the cube of that number.

cube_it <- function(cube_me){
  cube_me^3 # take the cube of the input cube_me
}

Notice how we gave the function a descriptive name, in this case we called it cube_it. We also gave the argument (input) an informative name, cube_me. While we can technically name our function and its arguments whatever we like, it is smart to give them informative names that will help us (and others) understand the purpose of the function.

Let’s give our function a quick test to confirm that it works as we expect.

cube_it(5)

## [1] 125

Arguments are matched by name or position in the argument list. In the above example with the function cube_it(), we didn’t refer to the argument name when calling the function. The argument is thus matched by position, which in the cube_it() example does not leave any room for confusion since there is only one argument. Note however, that we can call the function as follows

cube_it(cube_me = 5)

## [1] 125

Referring to the arguments by name when you have only one argument is not really necessary, however when you have mulitiple arguments this can be very helpful. Oftentimes you may not recall the position of the arguments in the function definition and thus you do not want to rely on positional matching as you may mismatch arguments.

Let’s highlight this point with a basic example where we create a function that takes one number and raises it to a specified power:

pow_it <- function(base_val, exp_val){
  base_val^exp_val
}

Recall that unnamed arguments are matched in the order in which they are defined in the function declaration. If I do not use the argument names in my function call, then they will be matched by position – this could lead us to generate a result that doesn’t match our intentions. To prevent this from occuring we can specify the names of the arguments when we calll the function.

Let’s raise two to the third power \(2^3\)

pow_it(base_val = 2, exp_val = 3)

## [1] 8

Since we are naming the arguments, the position does not matter – the name takes precedence.

pow_it(exp_val = 3, base_val = 2)

## [1] 8

We could have also called the function and relied on positional matching

pow_it(2, 3)

## [1] 8

Though this latter function call is less clear (from a human perspective) and we could have easily forgotten and called pow(3, 2) which would have squared three \(3^2\), which would not have been our intention.

Exercise

Create a function that converts temperatures from degrees Fahrenheit to Celsius. Make sure to give descriptive names to your function and the function argument(s).
1. Once you’ve created your function, test it out to make sure it is working properly.
2. Try passing a vector of values to your function.

FYI,
\(Celsius = (Fahrenheit-32)*\frac{5}{9}\)

Create a simple function of your own.

Variables and scoping

Argument defaults

When calling a function, if you do not pass a value to an argument that is used in the body of the function, an error will result. Let’s illustrate this point using our pow_it() function. Recall that pow_it() takes two arguments base_val and exp_val. Run the following code and see what happens.

pow_it(base_val = 3)

You’ve now seen that the code above throws the following error, telling us that we are missing an argument

Error in pow_it(3) : argument "exp_val" is missing, with no default

While in many cases we want the code to “break” when it is used incorrectly, there are nonetheless situations where we would like the code to be “smart” and supply reasonable default arguments that prevent an error from occuring.

We can specify default values for arguments when we declare a function. Let’s redeclare pow_it() – this time supplying a reasonable default for the exp_val argument

pow_it <- function(base_val, exp_val = 1){
  base_val^exp_val 
}

In the function declaration above, we’ve supplied exp_val a default value of 1. If the user supplies a value for exp_val in their function call, then the user supplied value will be used, otherwise the default value will be applied. Let’s illustrate this below.

pow_it(base_val = 3)

## [1] 3

You can see that a default of 1 was used for exp_val

pow_it(base_val = 3, exp_val = 3)

## [1] 27

However, when we supply a value for exp_val in our function call, this value is used.

Variable scoping

In programming, scoping refers to the rules used to look up values associated with a variable.

A function first looks within the function itself (i.e. its local environment) to identify the values associated with the variables in use. If it finds the values there, then it stops looking, otherwise it expands its search to the next level of the environment. If a value is never found, then an error is thrown.

Let’s illustrate this with an example.

add_something <- function(my_number){
  add_number <- 1
  my_number + add_number 
}

Notice that the variable add_number does not appear in our Global Environment (see the Environment pane in the top right of your RStudio window). This is because, variables defined within functions are locally defined and thus are only accessible within that function.

R will first look for the variable add_number within the function – since it finds a value there, it stops looking and uses that value for add_number.

add_something(my_number = 5)

## [1] 6

To illustrate that R first looks inside the function and then stops if it finds the variable, let’s create a variable add_number outside of our function. This variable will like in our Global Environment (see the Environment pane in the top right of your RStudio window).

add_number <- 5
add_something(my_number = 5)

## [1] 6

You see that add_something() still uses the locally defined value for add_number when executing the function.

However, if we hadn’t assigned a value to add_number within our function, R would look outside of the function for this value. Let’s illustrate this with an example.

First, we will redefined the add_something() function, though this time we will not supply a value for add_number within the function

add_something <- function(my_number){
  my_number + add_number 
}

Let’s assign add_number a value outside of our function. You’ll see that this variable is found in the Global Environment

add_number <- 10

Now, let’s run add_something()

add_something(my_number = 5)

## [1] 15

First R looked for a value for add_number locally (i.e. within the function), since it did not find a value there it then looked outside of the function and found a value in the global environment, which it then used in the function evaluation.

Returning multiple outputs

Functions in R return the last evaluated expression. In many cases we would like to return more than one value from a function. To return multiple outputs we can have the function return a vector c() or list list() that contains the set of values/objects we want to return.

Let’s illustrate how we do this with a simple example. We will create a function that computes the minimum, maximum, and mean of a dataset.

get_stats_bad <- function(input_data){
  min(input_data)
  max(input_data)
  mean(input_data)
}

Now, let’s test this function on an example dataset

my_data <- tibble(x = runif(1000, min = 0, max = 50))  #generate a vector of 1000 random values between 0 and 50

get_stats_bad(my_data$x)

## [1] 25.1221

Since the last statement evaluated in get_stats_bad() is the mean, then that is the only value that is returned by our function. However, we would like to have the min, max, and mean returned.

Let’s define a new function get_stats_good() that returns all of the computed statistics

get_stats_good <- function(input_data){
  stat_min <- min(input_data)
  stat_max <- max(input_data)
  stat_mean <- mean(input_data)
  
  c(stat_min, stat_max, stat_mean)
}

You can see in the function declaration above, the last evaluated statement is a vector containing all of the values we would like to return. Let’s test out this function to demonstrate how it works.

get_stats_good(my_data$x)

## [1]  0.1010344 49.9893055 25.1220975

You can see that the function performed as desired – returning all three of the computed statistics (min, max, mean) in a single vector.

When you have outputs of the same type, then a vector works great for returning multiple values, however, when you want to return dissimilar data objects (e.g. a figure object and a data frame) a vector will not work. In these cases you can use a list() object, which can contain different data structures within it.

Let’s illustrate this with an updated version of our get_stats_good() function. We will compute the same statistics as before, though now we will also generate a histogram to plot the datasets distribution. The list() at the end of the function returns the statistics along with the fig_hist figure object.

get_stats_good <- function(input_data){
  stat_min <- min(input_data)
  stat_max <- max(input_data)
  stat_mean <- mean(input_data)
  
  fig_hist <- ggplot() +
    geom_histogram(aes(x = input_data)) +
    theme_classic()
  
  list(stat_min, stat_max, stat_mean, fig_hist)
}

Let’s run the function to demonstrate how it works.

get_stats_good(my_data$x)

## [[1]]
## [1] 0.1010344
## 
## [[2]]
## [1] 49.98931
## 
## [[3]]
## [1] 25.1221
## 
## [[4]]

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise

Create a function that take in NOAA state-wide temperature data and generates for a single, user specified state the following
1. Table of monthly average temperatures (i.e. table with 12 rows, one for each month)
2. Table of annual average temperatures (i.e. table with a row for each year in the data)
3. Figure with a plot of time-series plot of the annual average temperatures.

The data for this exercise is here

state_temps <- read_csv("https://stahlm.github.io/ENS_215/Data/NOAA_State_Temp_Lab_Data.csv")

You can call your function state_climate_summary(state_temps, state_to_select). Where the argument state_temps is the data frame with the temperature data and the argument state_to_select is the abbreviation for the state of interest.

An example function call would look like state_climate_summary(state_temps, "MA")

Below is an example of what your output should look like

state_climate_summary(state_temps, "MA")

## [[1]]
## # A tibble: 12 x 4
##    Month month_mean month_max month_min
##    <int>      <dbl>     <dbl>     <dbl>
##  1     1       24.3      33.8      14.7
##  2     2       25.4      34.2      12.8
##  3     3       34.2      44.1      25.7
##  4     4       45.0      50.6      39.9
##  5     5       56.0      61.8      47.7
##  6     6       64.5      68.6      58.4
##  7     7       70.0      75        65.7
##  8     8       67.9      73.5      61.4
##  9     9       60.8      66.3      56.1
## 10    10       50.2      57.4      43.4
## 11    11       39.6      45.8      32.5
## 12    12       28.6      41.8      15.8
## 
## [[2]]
## # A tibble: 124 x 2
##     Year annual_mean
##    <int>       <dbl>
##  1  1895        45.9
##  2  1896        46.0
##  3  1897        46.4
##  4  1898        47.2
##  5  1899        46.4
##  6  1900        47.0
##  7  1901        45.6
##  8  1902        46.0
##  9  1903        45.9
## 10  1904        43.3
## # ... with 114 more rows
## 
## [[3]]

Writing functions in R

ENS-215

6-Mar-2019

Writing functions

Function basics

Exercise

Variables and scoping

Argument defaults

Variable scoping

Returning multiple outputs

Exercise