Last updated on 2024-05-12
Estimated time: 80 minutes
Overview
Questions
- How can I create publication-quality graphics in R?
Objectives
- To be able to use ggplot2 to generate publication-quality graphics.
- To apply geometry, aesthetic, and statistics layers to a ggplot plot.
- To manipulate the aesthetics of a plot using different colors, shapes, and lines.
- To improve data visualization through transforming scales and paneling by group.
- To save a plot created with ggplot to disk.
Plotting our data is one of the best ways to quickly explore it and the various relationships between variables.
There are three main plotting systems in R: the base plotting system, the lattice package, and the ggplot2 package.
Today we’ll be learning about the ggplot2 package, because it is the most effective for creating publication-quality graphics.
ggplot2 is built on the grammar of graphics, the idea that any plot can be built from the same set of components: a data set, mapping aesthetics, and graphical layers:
- Data sets are the data that you, the user, provide.
- Mapping aesthetics are what connect the data to the graphics. They tell ggplot2 how to use your data to affect how the graph looks, such as changing what is plotted on the X or Y axis, or the size or color of different data points.
- Layers are the actual graphical output from ggplot2. Layers determine what kinds of plot are shown (scatterplot, histogram, etc.), the coordinate system used (rectangular, polar, others), and other important aspects of the plot. The idea of layers of graphics may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.
Let’s start off building an example using the gapminder data from earlier. The most basic function is ggplot(), which lets R know that we’re creating a new plot. Any of the arguments we give the ggplot() function are the global options for the plot: they apply to all layers on the plot.
R
library("ggplot2")
ggplot(data = gapminder)
Here we called ggplot() and told it what data we want to show on our figure. This is not enough information for ggplot() to actually draw anything. It only creates a blank slate for other elements to be added to.
Now we’re going to add in the mapping aesthetics using the aes() function. aes() tells ggplot() how variables in the data map to aesthetic properties of the figure, such as which columns of the data should be used for the x and y locations.
R
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
Here we told ggplot() we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis. Notice that we didn’t need to explicitly pass aes() these columns (e.g. x = gapminder[, "gdpPercap"]); this is because ggplot() is smart enough to know to look in the data for that column!
The final part of making our plot is to tell ggplot() how we want to visually represent the data. We do this by adding a new layer to the plot using one of the geom functions.
R
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
Here we used geom_point(), which tells ggplot() we want to visually represent the relationship between x and y as a scatterplot of points.
Challenge 1
Modify the example so that the figure shows how life expectancy has changed over time:
R
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
Hint: the gapminder dataset has a column called “year”, which should appear on the x-axis.
Here is one possible solution:
R
ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  geom_point()
Challenge 2
In the previous examples and challenge we’ve used the aes() function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color. Modify the code from the previous challenge to color the points by the “continent” column. What trends do you see in the data? Are they what you expected?
The solution presented below adds color=continent to the call of the aes() function. The general trend seems to indicate an increased life expectancy over the years. On continents with stronger economies we find a longer life expectancy.
R
ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, color=continent)) +
  geom_point()
Layers
Using a scatterplot probably isn’t the best for visualizing change over time. Instead, let’s tell ggplot() to visualize the data as a line plot:
R
ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, color=continent)) +
  geom_line()
Instead of adding a geom_point() layer, we’ve added a geom_line() layer.
However, the result doesn’t look quite as we might have expected: it seems to be jumping around a lot in each continent. Let’s try to separate the data by country, plotting one line for each country:
R
ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line()
We’ve added the group aesthetic, which tells ggplot() to draw a line for each country.
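To make the effect of the group aesthetic concrete, here is a minimal sketch on a small made-up data frame. The name toy and its values are invented for illustration and are not part of gapminder. It uses layer_data() from ggplot2 to inspect how many groups (and therefore separate lines) each plot ends up with.

```r
library(ggplot2)

# Made-up data (hypothetical stand-in for gapminder): two countries, three years
toy <- data.frame(
  year    = rep(c(1990, 2000, 2010), times = 2),
  lifeExp = c(60, 65, 70, 72, 75, 78),
  country = rep(c("A", "B"), each = 3)
)

# Without a group aesthetic, all six observations belong to one line
p_single <- ggplot(data = toy, mapping = aes(x = year, y = lifeExp)) +
  geom_line()

# With group = country, each country gets its own line
p_grouped <- ggplot(data = toy, mapping = aes(x = year, y = lifeExp, group = country)) +
  geom_line()

# Number of distinct groups computed for the line layer of each plot
length(unique(layer_data(p_single)$group))
length(unique(layer_data(p_grouped)$group))
```

The first count should be one and the second should match the number of countries, which is why the ungrouped plot appears to jump between countries within each year.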
But what if we want to visualize both lines and points on the plot? We can add another layer to the plot:
R
ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line() +
  geom_point()
It’s important to note that each layer is drawn on top of the previous layer. In this example, the points have been drawn on top of the lines. Here’s a demonstration:
R
ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
  geom_line(mapping = aes(color=continent)) +
  geom_point()
In this example, the aesthetic mapping of color has been moved from the global plot options in ggplot() to the geom_line() layer so it no longer applies to the points. Now we can clearly see that the points are drawn on top of the lines.
Tip: Setting an aesthetic to a value instead of a mapping
So far, we’ve seen how to use an aesthetic (such as color) as a mapping to a variable in the data. For example, when we use geom_line(mapping = aes(color=continent)), ggplot will give a different color to each continent. But what if we want to change the color of all lines to blue? You may think that geom_line(mapping = aes(color="blue")) should work, but it doesn’t, because inside aes() the string "blue" is treated as a data value rather than as a color name. Since we don’t want to create a mapping to a specific variable, we can move the color specification outside of the aes() function, like this: geom_line(color="blue").
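The difference between setting and mapping can be checked directly. The sketch below is only an illustration: toy is a made-up data frame, and p_set and p_map are names introduced here. It compares the colors ggplot2 computes for the line layer in each case, using layer_data().

```r
library(ggplot2)

# Made-up data (hypothetical stand-in for gapminder)
toy <- data.frame(
  year    = rep(c(1990, 2000, 2010), times = 2),
  lifeExp = c(60, 65, 70, 72, 75, 78),
  country = rep(c("A", "B"), each = 3)
)

base_plot <- ggplot(data = toy, mapping = aes(x = year, y = lifeExp, group = country))

# Setting: the value is used as-is, so every line is drawn in blue
p_set <- base_plot + geom_line(color = "blue")

# Mapping a constant: "blue" becomes a one-level data variable, so the color
# comes from the default palette and a legend labelled "blue" is added
p_map <- base_plot + geom_line(mapping = aes(color = "blue"))

# Colors actually computed for the line layer of each plot
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

Printing p_map shows lines that are not blue together with a legend entry named "blue", which is usually the sign that a value was placed inside aes() by mistake.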
Challenge 3
Switch the order of the point and line layers from the previous example. What happened?
The lines now get drawn over the points!
R
ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
  geom_point() +
  geom_line(mapping = aes(color=continent))
Transformations and statistics
ggplot2 also makes it easy to overlay statistical models over the data. To demonstrate we’ll go back to our first example:
R
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
Currently it’s hard to see the relationship between the points due to some strong outliers in GDP per capita. We can change the scale of units on the x axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic. We can also modify the transparency of the points, using the alpha argument, which is especially helpful when you have a large amount of data which is very clustered.
R
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  scale_x_log10()
The scale_x_log10() function applied a log transformation to the x-axis scale of the plot, so that each factor of 10 is evenly spaced from left to right. For example, a GDP per capita of 1,000 is the same horizontal distance away from a value of 10,000 as the 10,000 value is from 100,000. This helps to visualize the spread of the data along the x-axis.
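The even spacing can be verified with a short base R calculation. The variable names gdp and pos are introduced here purely for illustration.

```r
# GDP per capita values that each differ by a factor of 10
gdp <- c(1000, 10000, 100000)

# Their positions on a log10 axis
pos <- log10(gdp)
pos

# Distance between neighbouring values on the log10 axis
diff(pos)
```

Both distances are equal on the log scale, whereas on the untransformed axis the second gap (90,000) would be ten times wider than the first (9,000).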
Tip Reminder: Setting an aesthetic to a value instead of a mapping
Notice that we used geom_point(alpha = 0.5). As the previous tip mentioned, using a setting outside of the aes() function will cause this value to be used for all points, which is what we want in this case. But just like any other aesthetic setting, alpha can also be mapped to a variable in the data. For example, we can give a different transparency to each continent with geom_point(mapping = aes(alpha = continent)).
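As a rough sketch of the mapped version, the example below uses a small made-up data frame (toy, with invented values) rather than the full gapminder data. Note that ggplot2 may print a warning that using alpha for a discrete variable is not advised; the plot is still drawn.

```r
library(ggplot2)

# Made-up data (hypothetical values, not taken from gapminder)
toy <- data.frame(
  gdpPercap = c(500, 1200, 8000, 15000, 30000, 42000),
  lifeExp   = c(48, 55, 66, 71, 77, 80),
  continent = rep(c("Africa", "Asia", "Europe"), each = 2)
)

# alpha is inside aes(), so transparency varies with the continent column
p_alpha <- ggplot(data = toy, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(mapping = aes(alpha = continent), size = 3) +
  scale_x_log10()

p_alpha
```

Because alpha is now a mapping, a legend is added that shows which transparency level corresponds to which continent.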
We can fit a simple relationship to the data by adding another layer, geom_smooth():
R
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  geom_smooth(method="lm")
OUTPUT
`geom_smooth()` using formula = 'y ~ x'
We can make the line thicker by setting the linewidth aesthetic in the geom_smooth() layer. (Before ggplot2 3.4.0 this was done with size; using size for lines is now deprecated and produces a warning asking you to use linewidth instead.)
R
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  geom_smooth(method="lm", linewidth=1.5)
OUTPUT
`geom_smooth()` using formula = 'y ~ x'
There are two ways an aesthetic can be specified. Here we set the linewidth aesthetic by passing it as an argument to geom_smooth(). Previously in the lesson we’ve used the aes() function to define a mapping between data variables and their visual representation.
Challenge 4a
Modify the color and size of the points on the point layer in the previous example.
Hint: do not use the aes() function.
Here is a possible solution. Notice that the color argument is supplied outside of the aes() function. This means that it applies to all data points on the graph and is not related to a specific variable.
R
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(size=3, color="orange") +
  scale_x_log10() +
  geom_smooth(method="lm", linewidth=1.5)
OUTPUT
`geom_smooth()` using formula = 'y ~ x'
Challenge 4b
Modify your solution to Challenge 4a so that the points are now a different shape and are colored by continent with new trendlines. Hint: The color argument can be used inside the aesthetic.
Here is a possible solution. Notice that supplying the color argument inside the aes() function enables you to connect it to a certain variable. The shape argument, as you can see, modifies all data points the same way (it is outside the aes() call), while the color argument, which is placed inside the aes() call, modifies a point’s color based on its continent value.
R
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(size=3, shape=17) +
  scale_x_log10() +
  geom_smooth(method="lm", linewidth=1.5)
OUTPUT
`geom_smooth()` using formula = 'y ~ x'
Multi-panel figures
Earlier we visualized the change in life expectancy over time across all countries in one plot. Alternatively, we can split this out over multiple panels by adding a layer of facet panels.
Tip
We start by making a subset of data including only countries located in the Americas. This includes 25 countries, which will begin to clutter the figure. Note that we apply a “theme” definition to rotate the x-axis labels to maintain readability. Nearly everything in ggplot2 is customizable.
R
americas <- gapminder[gapminder$continent == "Americas",]
ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap( ~ country) +
  theme(axis.text.x = element_text(angle = 45))
The facet_wrap() layer took a “formula” as its argument, denoted by the tilde (~). This tells R to draw a panel for each unique value in the country column of the americas data frame.
Modifying text
To clean this figure up for a publication we need to change some of the text elements. The x-axis is too cluttered, and the y axis should read “Life expectancy”, rather than the column name in the data frame.
We can do this by adding a couple of different layers. The theme layer controls the axis text, and overall text size. Labels for the axes, plot title and any legend can be set using the labs() function. Legend titles are set using the same names we used in the aes() specification. Thus below the color legend title is set using color = "Continent", while the title of a fill legend would be set using fill = "MyTitle".
R
ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
  geom_line() +
  facet_wrap( ~ country) +
  labs(
    x = "Year",              # x axis title
    y = "Life expectancy",   # y axis title
    title = "Figure 1",      # main title of figure
    color = "Continent"      # title of legend
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
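The fill case mentioned above is not used in the lesson’s figure, so here is a minimal sketch of it. The data frame toy and its values are made up for illustration, and geom_col() is used only because bars have a fill aesthetic.

```r
library(ggplot2)

# Made-up summary data (hypothetical values, not computed from gapminder)
toy <- data.frame(
  continent = c("Africa", "Asia", "Europe"),
  lifeExp   = c(55, 70, 78)
)

# The legend belongs to the fill aesthetic, so its title is set with fill =
p_fill <- ggplot(data = toy, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
  geom_col() +
  labs(
    x = "Continent",         # x axis title
    y = "Life expectancy",   # y axis title
    fill = "MyTitle"         # title of the fill legend
  )

p_fill
```

If color = "MyTitle" were used here instead, the legend title would stay as the column name, because no color aesthetic is mapped in this plot.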
Exporting the plot
The ggsave() function allows you to export a plot created with ggplot. You can specify the dimension and resolution of your plot by adjusting the appropriate arguments (width, height and dpi) to create high quality graphics for publication. In order to save the plot from above, we first assign it to a variable lifeExp_plot, then tell ggsave() to save that plot in png format to a directory called results. (Make sure you have a results/ folder in your working directory.)
R
lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
  geom_line() +
  facet_wrap( ~ country) +
  labs(
    x = "Year",              # x axis title
    y = "Life expectancy",   # y axis title
    title = "Figure 1",      # main title of figure
    color = "Continent"      # title of legend
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggsave(filename = "results/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")
There are two nice things about ggsave(). First, it defaults to the last plot, so if you omit the plot argument it will automatically save the last plot you created with ggplot(). Secondly, it tries to determine the format you want to save your plot in from the file extension you provide for the filename (for example .png or .pdf). If you need to, you can specify the format explicitly in the device argument.
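Both ways of choosing the format can be sketched as follows. This is only an illustration under a few assumptions: the plot is built from made-up data, the file names are hypothetical, and the files are written to a temporary directory rather than results/ so the example does not depend on your working directory.

```r
library(ggplot2)

# Made-up data and plot so the example runs on its own
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))
toy_plot <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point()

# Temporary output directory (an assumption made for this sketch)
out_dir <- tempdir()

# 1. Format inferred from the ".pdf" file extension
pdf_file <- file.path(out_dir, "toy_plot.pdf")
ggsave(filename = pdf_file, plot = toy_plot,
       width = 12, height = 10, units = "cm")

# 2. Format given explicitly through the device argument,
#    here for a file name that has no extension to infer it from
explicit_file <- file.path(out_dir, "toy_plot_copy")
ggsave(filename = explicit_file, plot = toy_plot, device = "pdf",
       width = 12, height = 10, units = "cm")

# Check that both files were written
file.exists(c(pdf_file, explicit_file))
```

In practice, relying on the file extension is usually enough; the device argument is mainly useful when the file name cannot carry the extension you need.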
This is a taste of what you can do with ggplot2. RStudio provides a really useful cheatsheet of the different layers available, and more extensive documentation is available on the ggplot2 website. All RStudio cheat sheets are available from the RStudio website. Finally, if you have no idea how to change something, a quick Google search will usually send you to a relevant question and answer on Stack Overflow with reusable code to modify!
Challenge 5
Generate boxplots to compare life expectancy between the different continents during the available years.
Advanced:
- Rename y axis as Life Expectancy.
- Remove x axis labels.
Here is a possible solution. xlab() and ylab() set labels for the x and y axes, respectively. The axis title, text and ticks are attributes of the theme and must be modified within a theme() call.
R
ggplot(data = gapminder, mapping = aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  facet_wrap(~year) +
  ylab("Life Expectancy") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())
Key Points
- Use ggplot2 to create plots.
- Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.