As you’ve already learned in today’s lecture, one of the best tools you have for interpreting and understanding your data is your own eyes! A visual representation of your data can reveal patterns and interesting features that would be difficult or impossible to identify by looking at a data table.
Plots are not only the most effective way for you to understand your data, they are also the most effective way for you to convey your message to an audience.
For a more in-depth look into the motivations and goals of data visualization, refer back to yesterday’s lecture slides.
Before jumping into the ggplot2 package, let’s become familiar with some of the most common types of graphics that you will make and encounter in your work.
IMPORTANT NOTE: Don’t get hung up on how “pretty” (or not pretty) the graphics look in today’s lecture. We are just going over the basics and later this week we will cover how to make attractive, presentation-quality graphics.
Scatter plots are useful for displaying the relationship between two variables (e.g. height vs. weight). If one variable is the independent (controlling) variable, it is generally plotted along the x-axis, while the dependent variable is plotted on the y-axis. For example, if you were looking at how precipitation affects streamflow, you would plot precipitation on the x-axis and streamflow on the y-axis. A third variable can be displayed by scaling or color coding the plotted symbols.
Line graphs are useful for displaying the relationship between two variables, where each x-value corresponds to a single y-value. Line graphs are typically used to display time-series data (e.g. streamflow vs. time, temperature vs. time) where there is a single measurement at each time point.
Column and bar charts are useful for displaying the number of items or values in different classes/groups, where the height of the bar represents the value. For instance, you could display the water usage by US state using a bar chart, where each state would have its own bar and the height of the bar would be proportional to the amount of water used in that state.
Histograms are similar to bar charts, but they are used to display the frequency distribution of the data. For example, you could display temperature data for a location using a histogram to understand how the data is distributed.
Rose diagrams are similar to histograms but are used for displaying data that has a directional component (e.g. wind).
Pie charts are useful for displaying proportions of a whole. For instance, you might display the land cover (e.g. forested, developed, grasslands, …) of a given region using a pie chart. Stacked bar charts can also be used to display this type of information, and they are generally easier to read and more compact than pie charts.
Ternary diagrams are used to show proportions when three components sum to 100%. They are commonly used to display grain size information.
Box plots, also called box-and-whisker plots, are useful for displaying the statistical distribution of a variable, often split by category.
Contour and surface plots are used to display three variables. Typically the x and y variables are the spatial position, and the color (or contoured) variable is some value that varies in space (e.g. concentration).
Maps are excellent at displaying spatial data. There are tons of different styles and types of maps and this is a whole subject on its own. We will cover some basic mapping techniques in R later in the term.
The ggplot2 package
The ggplot2 package is designed around the idea that a graphic can be decomposed into its fundamental parts, and thus we can build graphics much like a sentence, by combining these parts according to grammatical rules.
IMPORTANT CLARIFICATION: the package is called ggplot2; however, when you create a figure you use the ggplot() function (note that there is no 2 in the function call). Also note that the ggplot2 library is loaded when you load tidyverse. If you want to load ggplot2 by itself, you can simply type library(ggplot2).
Recall that when constructing a graphic in ggplot2, the three essential components are:
data: dataset containing the mapped variables
geom: geometric object that the data is mapped to (e.g. points, lines, bars, …)
aes: aesthetic attributes of the geometric object. The aesthetics control how the data variables are mapped to the geometric objects (e.g. x/y position, size, shape, color, …)
The basic template for creating a graphic in ggplot2 is
ggplot(data = DATASET) + GEOM_FUNCTION(mapping = aes(MAPPINGS))
GEOM_FUNCTION: the geom function you want to use (e.g. geom_point())
MAPPINGS: the aesthetic mappings for the geom (e.g. x = gdpPercap, y = lifeExp, color = continent)
You can also make the aesthetic mappings global (i.e. they will apply to all of the geoms associated with that ggplot() call) by defining the aes() right in the ggplot() call. For instance:
ggplot(data = DATASET, mapping = aes(MAPPINGS)) + GEOM_FUNCTION()
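As a concrete sketch of the two templates, the snippet below fills them in with the gapminder data used later in this lecture. The object names p_local and p_global are only illustrative. Both calls should draw the same scatter plot; the difference is where the mapping is stored.

```r
library(ggplot2)
library(gapminder)

# Template 1: the mapping is given to the geom, so it applies to that geom only
p_local <- ggplot(data = gapminder) +
  geom_point(mapping = aes(x = gdpPercap, y = lifeExp))

# Template 2: the mapping is given to ggplot(), so every geom added later inherits it
p_global <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

p_local
p_global
```

With a single geom the two forms are interchangeable; the global form starts to pay off once several geoms share the same x and y.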
ggplot2 geometries
Let’s first become familiar with some of the most common geometries that you will encounter in the ggplot2 package.
Some of these geoms you are already familiar with, and others you will likely be seeing for the first time. The goal in today’s lecture is to learn about the geoms that are available and to become aware of how they look and what they do. In upcoming lectures we are going to delve more deeply into the different types of data that exist and the data visualization methods that are most effective with these data types.
Let’s first load the packages we need for today’s work. We will load tidyverse (which includes ggplot2) and gapminder so that we have a nice dataset to work with.
library(tidyverse)
library(gapminder)
my_gap <- gapminder # assign the gapminder data to our own data object called my_gap
In the geom sections below, you should try creating your own graphic for each of the geoms. You can copy/paste/tweak the provided code to create your own versions.
geom_point()
geom_point() creates a scatter plot. Scatter plots require x and y data. Other aesthetics can be specified but are not necessary.
We’ll create a scatter plot of life expectancy vs. per capita GDP
ggplot(data = my_gap, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()
I want you to notice a few things in the above code. First, you’ll see that we typed data = my_gap. You can actually omit the data =, since the ggplot() function assumes that its first input is the dataset.
Similarly, you can omit the mapping =, since ggplot() assumes that its second input is the mapping defined by aes(), and thus specifying mapping = is not required.
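To make this concrete, here is a minimal sketch comparing the fully named call with the shorter positional one. The names p_named and p_short are illustrative only; the two objects should describe the same plot.

```r
library(ggplot2)
library(gapminder)

my_gap <- gapminder

# Fully named arguments, as in the lecture code
p_named <- ggplot(data = my_gap, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# Same plot relying on argument position: first the data, then the mapping
p_short <- ggplot(my_gap, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

p_short
```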
Also note that we defined the aesthetic within the ggplot() function call as opposed to within the geom_point() call. Either works fine here. However, the code above makes the aes() global, and thus it will apply to all of the geoms that follow the ggplot() call. This is very useful when you plot multiple geoms on the same graphic (you’ll do this soon).
In the above code we only specified two aesthetic mappings (i.e. the x and y positions). We can specify additional mappings if we want. This will allow us to represent additional dimensions (variables) of the data on the same graphic. Since we have already specified the x and y positions (spatial mapping), we have to rely on other features, such as color, symbol size, or symbol shape, to convey additional dimensions (variables). Note that the concept of adding additional mappings applies to geoms in general and not just to geom_point().
Let’s represent the continent through the symbol colors.
ggplot(data = my_gap, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) + geom_point()
Notice how color = continent is inside of the aes() function. This is because we are mapping data from the continent variable to the color of whatever geom is used (which in this case is geom_point()).
If we wanted to specify a single color for the points (i.e. not based on any of the data variables), then we would make that specification OUTSIDE of the aes() function. This is because in that case we are NOT mapping data to the color, but are instead simply specifying a single fixed color. For example:
ggplot(data = my_gap, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point(color = "blue")
Make sure you understand the difference between the two graphics above and how/why their code differs.
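A common slip that illustrates the difference is putting a fixed color inside aes(). The sketch below (object names p_mapped and p_fixed are illustrative) contrasts the two placements. In the first plot, "blue" is treated as a data value, so the points are drawn in a default palette color and a legend appears; in the second, the points really are blue.

```r
library(ggplot2)
library(gapminder)

# Constant inside aes(): "blue" is treated as data (a variable with one value),
# so the points get a default palette color and a legend labelled "blue"
p_mapped <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = "blue"))

# Constant outside aes(): the points are drawn in blue, with no legend
p_fixed <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "blue")

p_mapped
p_fixed
```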
FYI, you can also pipe (%>%) your data right to ggplot(). I recommend doing this: it makes your code easier to read, and when you pipe your data, the variable names from your dataset will autocomplete when you type them (which is really helpful). I also hit “Enter” to add a line break after every + in my plotting code, which makes the code much easier to read and edit.
For instance, the following code does the exact same thing as above.
my_gap %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(color = "blue")
geom_line()
Line graphs are useful for displaying the relationship between two variables, where the points are connected by a line. These plots work with data that has a logical ordering and progression along the x variable. They are frequently used to represent time-series data.
Let’s create a line plot of per capita GDP vs. time using just the data for China
china_data <- filter(my_gap,
country == "China") # filter rows with data for China
china_data %>%
ggplot(aes(x = year, y = gdpPercap)) +
geom_line()
We could have avoided creating the new data object china_data by simply piping our filtered data directly to the ggplot() function call. This is often a good idea, since it allows us to avoid creating tons of additional data objects (which can clutter the environment and increase the possibility of mistakenly using the wrong data object later).
my_gap %>%
filter(country == "China") %>%
ggplot(aes(x = year, y = gdpPercap)) +
geom_line()
geom_boxplot()
We can summarize the distribution of the values of a variable using a box-plot.
Let’s use a box-plot to examine the distribution of life expectancy data within Europe
euro_data <- filter(my_gap,
continent == "Europe") # get data just for Europe
euro_data %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()
We can (and often do) plot multiple box-plots on the same graphic. This allows us to compare distributions of a variable across categories. Let’s see how the distributions of life expectancy vary between continents
my_gap %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_boxplot()
geom_violin()
Another nice way to display a variable’s distribution is with a violin plot. Let’s look at the distribution of life expectancies by continent
my_gap %>%
ggplot(aes(x = continent, y = lifeExp)) +
geom_violin()
Side note: violin plots get their name from their shape. Each “violin” is a density estimate of the data mirrored about a central axis, so the width at a given value reflects how many observations fall near that value.
geom_histogram()
Histograms are another type of graphic that is useful for visualizing the distribution of a single variable.
Let’s examine the distribution of life expectancies in the gapminder data
my_gap %>% ggplot(aes(x = lifeExp)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are a number of additional features of histograms that we will learn later in the term when we discuss univariate data in greater depth.
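The message printed with the histogram above is ggplot2 noting that it fell back on its default of 30 bins. As a minimal sketch, the bin width can be set explicitly with the binwidth argument; the value of 2 used here is an arbitrary choice for illustration, not a recommendation.

```r
library(ggplot2)
library(gapminder)

# binwidth is in the units of the x variable, so each bar spans 2 years
# of life expectancy (2 is an arbitrary choice for illustration)
p_hist <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_histogram(binwidth = 2)

p_hist
```

Trying a few different widths is usually worthwhile, since the apparent shape of the distribution can change with the binning.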
geom_bar()
Bar plots are great for displaying counts within a categorical variable, for instance the number of countries in each continent. Let’s see how many countries are recorded for each continent in the gapminder data for year 2007.
We’ll first filter the data to include just data for year 2007. Then we’ll make the bar plot
data_2007 <- filter(my_gap,
year == 2007) # get data for year 2007
data_2007 %>%
ggplot(aes(continent)) +
geom_bar()
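To see the numbers behind the bars, the same counts can be computed with dplyr. This is a minimal sketch; continent_counts is an illustrative name. Because the 2007 data has one row per country, counting rows per continent is the same as counting countries.

```r
library(dplyr)
library(gapminder)

data_2007 <- filter(gapminder, year == 2007)

# One row per country in 2007, so counting rows per continent counts countries;
# these counts are the bar heights that geom_bar() draws
continent_counts <- data_2007 %>%
  count(continent)

continent_counts
```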
geom_jitter()
In some situations, a scatter plot can suffer from what is called overplotting. This is where many of the data points are hidden because they are buried under other data points. This commonly occurs when one of the plotted variables only takes on a discrete set of values (e.g. year or continent in the gapminder data). Let’s illustrate this issue by creating a scatter plot of life expectancy vs. year.
my_gap %>%
ggplot(aes(x = year, y = lifeExp)) +
geom_point()
You can clearly see the issue with the above graphic. Namely, year takes on discrete values, and as a result the points are clustered in vertical lines with many points overplotting. It is not possible to tell how many points there are at a given life expectancy for a given year (e.g. in 1952, is there 1 country or are there 20 countries with a life expectancy of 40 yrs?).
To solve this problem we can jitter our points. Jittering the points adds a little random shift to the data so that the points do not all sit on top of one another. See the example below.
my_gap %>% ggplot(aes(x = year, y = lifeExp)) +
geom_jitter(height = 0)
By jittering the points, we can see all of the points much better, and we don’t lose information as we did when the points were overplotting. Notice that I specified height = 0 in the geom_jitter() function call. This tells it to jitter in the horizontal direction only, not in the vertical direction. This is good because it does not introduce noise into the life expectancy data, which could affect our interpretation of the results.
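The amount of horizontal jitter can also be controlled with the width argument of geom_jitter(). In the sketch below, the value width = 1 is an assumption chosen for this dataset: the gapminder years are 5 apart, so shifting points by at most 1 year in either direction keeps the years visually separate. The jitter is random, so the plot will look slightly different on each run.

```r
library(ggplot2)
library(gapminder)

# width = 1 shifts each point by up to 1 year left or right;
# height = 0 leaves the life expectancy values untouched
p_jitter <- ggplot(gapminder, aes(x = year, y = lifeExp)) +
  geom_jitter(width = 1, height = 0)

p_jitter
```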
Geometries can be layered. To layer geometries, you simply add additional geom functions to your ggplot() call. For instance, you may want to add points to a line plot. Let’s revisit our earlier line plot of GDP per capita vs. time for China.
china_data %>%
ggplot(aes(x = year, y = gdpPercap)) +
geom_line()
Now let’s add points representing the same data on top of the line
china_data %>% ggplot(aes(x = year, y = gdpPercap)) +
geom_line() +
geom_point()
We simply added geom_point() to our code and the points were added. As you can see, we did not need to specify the aesthetic mappings for our geom_point() function call. This is because we set our aes() in the ggplot() function, and thus it is a global aesthetic. Therefore any additional geoms that we add to this ggplot() function call will inherit the same aesthetic.
If we needed to change the aesthetic for one of the geoms, then we could specify the desired aes() within that geom’s function, and the global aesthetic would be overridden for just that geom.
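As a minimal sketch of a geom-specific aesthetic, the code below keeps x and y global but maps color inside geom_point() only. Coloring the points by lifeExp is just an illustrative choice, and p_layered is an illustrative name.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

china_data <- filter(gapminder, country == "China")

# x and y are global; color is mapped only inside geom_point(),
# so the points are colored by life expectancy while the line is not
p_layered <- china_data %>%
  ggplot(aes(x = year, y = gdpPercap)) +
  geom_line() +
  geom_point(aes(color = lifeExp))

p_layered
```

The mapping given inside geom_point() is combined with the global one, so the points still use the global x and y.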
We’ve already seen how position (e.g. x and y location) and color are used as aesthetics. Now we’ll introduce a few more.
size: controls the size of the geometry
shape: controls the shape of the geometry
fill: controls the fill color of a shape (only applies to some shape options)
alpha: controls the transparency of the geometry
We can map a variable to these aesthetic properties, in which case we define it within the aes() function. We can also set a single value for these aesthetics, in which case it is defined within the geom function but OUTSIDE of the aes().
Let’s see some examples.
size
Let’s create a scatter plot of life expectancy and GDP per capita (just for year 2007) and map the population to the size of the points.
data_2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp, size = pop)) +
geom_point()
If you want to set a single uniform size, you do this outside of the aes() function and within the geom function that you want the size to apply to.
data_2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(size = 2.5)
alpha
We can map a variable to the alpha aesthetic. Let’s differentiate the data points by continent using the alpha aesthetic.
data_2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp, alpha = continent)) +
geom_point()
Warning: Using alpha for a discrete variable is not advised.
You can see that each continent has a different level of transparency. I would probably use another aesthetic to represent the continents (e.g. color) but this nonetheless highlights how to use this feature.
A more common application of the alpha aesthetic is to help with overplotting. In this case we will not map a variable to alpha but instead set a single value outside of the aes() function. Since in this case it is not a mapping, we need to specify the alpha within the geom we want it to apply to.
my_gap %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(alpha = 0.2)
By making the points semi-transparent we can see points that plot on top of one another. Thus, areas that appear darker have more points and lighter areas have fewer points. This helps us to identify areas where the data is clustered.
shape
We can map a variable to the shape aesthetic. Let’s map the continent variable to the shape aesthetic.
my_gap %>%
ggplot(aes(x = gdpPercap, y = lifeExp, shape = continent)) +
geom_point()
Or we can specify a single shape to apply across all points.
Each available shape is identified by a number (0 to 25). We pass the number of the shape we want to the shape argument.
my_gap %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point(shape = 24)
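If you want to see what each shape number looks like, one option is to draw them all yourself. The sketch below is one possible way to do this; the grid layout, the red fill, and the names shapes and p_shapes are arbitrary choices for illustration.

```r
library(ggplot2)

# One row per shape code, laid out on a grid
shapes <- data.frame(
  shape = 0:25,
  x = (0:25) %% 9,
  y = -((0:25) %/% 9)
)

# scale_shape_identity() tells ggplot2 to use the numbers as shape codes directly;
# the red fill only shows up in shapes that have a separate fill
p_shapes <- ggplot(shapes, aes(x = x, y = y)) +
  geom_point(aes(shape = shape), size = 5, fill = "red") +
  geom_text(aes(label = shape), nudge_y = -0.35) +
  scale_shape_identity() +
  theme_void()

p_shapes
```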
fill
The fill aesthetic specifies the fill color of a symbol. Fill only applies to symbols that allow you to specify the color and fill independently. For these symbols, fill determines the fill color and color determines the color of the symbol border.
We can map a variable to the fill aesthetic. However, only shapes 21 to 25 allow you to specify the fill. We’ll use shape 24.
Let’s map the continent to the fill aesthetic.
my_gap %>%
ggplot(aes(x = gdpPercap, y = lifeExp, fill = continent)) +
geom_point(shape = 24)
We can also adjust the color aesthetic together with the fill. Let’s recreate our figure above, but set the symbol borders to all be gray by using the color aesthetic.
my_gap %>%
ggplot(aes(x = gdpPercap, y = lifeExp, fill = continent)) +
geom_point(shape = 24, color = "gray")
We could have also mapped population to the fill aesthetic. In this case the fill is represented by a continuous color gradient, since unlike continent, the population variable does not take discrete values.
my_gap %>%
ggplot(aes(x = gdpPercap, y = lifeExp, fill = pop)) +
geom_point(shape = 24)
This works, but the example doesn’t look so great. The issue is that many of the countries are at the low end of the population scale, which is heavily skewed by a few very large countries (e.g. China, India).
We could assign the fill according to the population ranking (relative to the other countries), as this gives a uniform (non-skewed) range.
my_gap %>%
ggplot(aes(x = gdpPercap, y = lifeExp, fill = percent_rank(pop))) +
geom_point(shape = 24)
Looks a bit better. We will learn how to improve this even more in upcoming lectures.
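To see why ranking helps, here is a small sketch of what percent_rank() does to skewed values. The three numbers are made up for illustration: two small values and one very large one.

```r
library(dplyr)

# Three made-up populations, one far larger than the others
pop <- c(10, 20, 1000)

# percent_rank() depends only on the ordering, not on the size of the gaps,
# so the result is evenly spaced between 0 and 1: 0, 0.5, 1
ranks <- percent_rank(pop)
ranks
```

The huge gap between 20 and 1000 disappears after ranking, which is why the fill colors spread more evenly across the countries.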
Note: You will need to use your dplyr tools in addition to ggplot.
Rely on dplyr and try to do this in the most elegant way possible (can you do it in just three lines of code?).
geoms. You can also explore additional geoms that you might want to try out by visiting this site