Data analysis using R: Creating Publication-Quality Graphics (2024)

Overview

Teaching: 60 min
Exercises: 20 min

Questions

  • How can I create and save publication-quality graphics in R?

Objectives

  • To be able to use ggplot2 to generate publication-quality graphics.

  • To understand the basic grammar of graphics, including the aesthetics and geometry layers, adding statistics, transforming scales, and coloring or panelling by groups.

  • To understand how to save plots in a variety of formats

  • To be able to find extensions for ggplot2 to produce custom graphics

Plotting our data is one of the best ways to quickly explore it and the various relationships between variables.

There are three main plotting systems in R: the base plotting system, the lattice package, and the ggplot2 package.

Today we’ll be learning about the ggplot2 package, which is part of the tidyverse. It is the most effective for creating publication-quality graphics. There are many extension packages for ggplot2, which make it easy to produce specialised types of graph, such as survival plots, geographic maps and ROC curves.

ggplot2 is built on the grammar of graphics, the idea that any plot can be expressed from the same set of components: a data set, a coordinate system, and a set of geoms (the visual representation of data points).

The key to understanding ggplot2 is thinking about a figure in layers. This idea may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.

Let’s start off with an example, using our gapminder data:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point()

So the first thing we do is call the ggplot function. This function lets R know that we’re creating a new plot, and any of the arguments we give the ggplot function are the global options for the plot: they apply to all layers on the plot.

We’ve passed in two arguments to ggplot. First, we tell ggplot what data we want to show on our figure, in this example the gapminder data we read in earlier. For the second argument we passed in the aes function, which tells ggplot how variables in the data map to aesthetic properties of the figure, in this case the x and y locations. Here we told ggplot we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis.

By itself, the call to ggplot isn’t enough to draw a figure:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))

We need to tell ggplot how we want to visually represent the data, which we do by adding a new geom layer. In our example, we used geom_point, which tells ggplot we want to visually represent the relationship between x and y as a scatter-plot of points:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + geom_point()

Combining dplyr and ggplot2

As ggplot2 is part of the tidyverse, we can use it with pipes. As we will see later in the episode, this will be particularly useful if we need to modify the data before plotting it.

We can repeat the above plot, using a pipe, as follows:

gapminder %>% ggplot(aes(x = gdpPercap, y = lifeExp)) + geom_point()

Note that the ggplot2 commands are joined by the + symbol and not the %>% symbol. It may help to remember that we add layers to our plot.

Challenge 1

Modify the example so that the figure shows how life expectancy has changed over time. Note that using points to show this data isn’t the most effective way of presenting it; we will look at other ways of showing the data shortly.

Hint: the gapminder dataset has a column called “year”, which should appear on the x-axis.

Solution to challenge 1

Here is one possible solution:

gapminder %>% ggplot(aes(x = year, y = lifeExp)) + geom_point()

Challenge 2

In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color. Modify the code from the previous challenge to color the points by the “continent” column. What trends do you see in the data? Are they what you expected?

Solution to challenge 2

gapminder %>% ggplot(aes(x = year, y = lifeExp, color = continent)) + geom_point()

Layers

Using a scatter-plot probably isn’t the best for visualizing change over time. Instead, let’s tell ggplot to visualize the data as a line plot. If we replace geom_point() with geom_line(), we obtain:
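The code for this step is not shown here. A minimal sketch of the replacement, assuming the colour mapping from the Challenge 2 solution is kept, and assuming the gapminder package from CRAN as a stand-in for the data read in earlier:

```r
library(dplyr)
library(ggplot2)
library(gapminder)  # assumption: CRAN copy of the data used in this episode

# Same aesthetics as the Challenge 2 solution, with geom_line() in place of geom_point()
p_line <- gapminder %>%
  ggplot(aes(x = year, y = lifeExp, color = continent)) +
  geom_line()

p_line
```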

This probably isn’t what you were expecting. We need to modify the aesthetic to tell ggplot that each country’s data should be a separate line. By default, geom_line() joins all our observations together, sorting them in order of the variable we’re plotting on the x axis. To generate a separate line for each country, we use the group aesthetic:

gapminder %>% ggplot(aes(x = year, y = lifeExp, group = country, color = continent)) + geom_line()

But what if we want to visualize both lines and points on the plot? We can add another layer to the plot:

gapminder %>% ggplot(aes(x = year, y = lifeExp, group = country, color = continent)) + geom_line() + geom_point()

At the moment the aesthetic we defined applies to all of the plot layers; both the points and the lines are coloured according to their continent. We can apply an aesthetic to certain layers of the plot by supplying them with their own aesthetic. For example, if we remove the color option, we aren’t mapping any aspect of the data to the colour property of any part of the graph, so all the points and lines have the same colour:

gapminder %>% ggplot(aes(x = year, y = lifeExp, group = country)) + geom_line() + geom_point()

If we apply the aesthetic aes(colour=continent) to geom_line(), the (lack of) mapping of colour is overridden by the new aesthetic. The points’ colours are unchanged:

gapminder %>% ggplot(aes(x = year, y = lifeExp, group = country)) + geom_line(aes(colour=continent)) + geom_point()

What if we want to print our points in a colour other than the default black? Aesthetics map data to a property of the graph. If we want to change the colour of all our points, we are not using the data to specify the colour, so we specify the colour directly in the geom:

gapminder %>% ggplot(aes(x = year, y = lifeExp, group = country)) + geom_line(aes(colour = continent)) + geom_point(colour = "red")

It’s important to note that each layer is drawn on top of the previous layer. In this example, the points have been drawn on top of the lines. If we swap the order of our geom_line() and geom_point(), the points appear behind the lines:

gapminder %>% ggplot(aes(x = year, y = lifeExp, group = country)) + geom_point(colour = "red") + geom_line(aes(colour = continent))

Tip: Transparency

If you have a lot of data or many layers, it can be useful to make some of them (semi-)transparent. You can do this by setting the alpha property to a value between 0 (fully transparent) and 1 (fully opaque).
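As a rough illustration of the tip, alpha can be set directly in a geom, in the same way as a fixed colour. The values 0.3 and 0.5 are arbitrary choices for this sketch, and the gapminder package from CRAN is assumed as a stand-in for the data read in earlier:

```r
library(dplyr)
library(ggplot2)
library(gapminder)  # assumption: CRAN copy of the data used in this episode

p_alpha <- gapminder %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(colour = continent), alpha = 0.3) +  # faint lines
  geom_point(colour = "red", alpha = 0.5)            # half-transparent points

p_alpha
```

Because alpha is given outside aes(), it is a fixed setting for the whole layer and is not mapped from the data.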

Multi-panel figures

There’s still a lot going on in this graph. It may be clearer if we plotted a separate graph for each continent. We can split the plot into multiple panels by adding a layer of facet panels:

gapminder %>% ggplot(aes(x = year, y = lifeExp, group = country)) + geom_line() + facet_wrap("continent")

We have removed colour=continent from the aesthetic since colouring each line by continent conveys no additional information. Note that, in this form, the variable we are faceting by needs to be placed in quotes.
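The quoted form is not the only way to name the faceting variable. As a sketch (again assuming the gapminder package from CRAN for the data), facet_wrap() also accepts a one-sided formula or a vars() specification, and all three calls below should produce the same panels:

```r
library(dplyr)
library(ggplot2)
library(gapminder)  # assumption: CRAN copy of the data used in this episode

base_plot <- gapminder %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line()

p_quoted  <- base_plot + facet_wrap("continent")      # character string, as in the lesson
p_formula <- base_plot + facet_wrap(~ continent)      # one-sided formula
p_vars    <- base_plot + facet_wrap(vars(continent))  # vars() specification
```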

More on faceting

It’s also possible to facet by one or two variables on a grid, using the facet_grid() function. For example, we could plot GDP per capita’s relationship to life expectancy for each combination of continent and year using the following code:

gapminder %>% ggplot(aes(x=lifeExp, y=gdpPercap)) + geom_point(size=0.3) + facet_grid(continent ~ year)

This uses R’s formula notation to specify how we want to arrange the grid; see ?facet_grid for more details.

Challenge 3

In this challenge you will explore how each country’s GDP per capita has changed with time.

Try two different approaches to visualising this data:

  • Plot all the data on a single graph, colouring each country’s data by continent
  • Facet the data by continent.

Solution to challenge 3

  • Plot all the data on a single graph, colouring each country’s data by continent
gapminder %>% ggplot(aes(x = year, y = gdpPercap, group = country, colour = continent)) + geom_line()

  • Facet the data by continent.
gapminder %>% ggplot(aes(x = year, y = gdpPercap, group = country)) + geom_line() + facet_wrap("continent")

This representation of the data is arguably clearer. Neither graph is ideal though; the huge range of GDPs per capita makes it difficult to show the data on the same graph. We will look at transforming the scales of our axes shortly.

Another approach is to allow each facet to have its own scale on the y axis. This can be done by passing the scales = "free_y" option to facet_wrap(). This can be useful in some circumstances. It does, however, make it very difficult to compare data in different continents, and is arguably misleading.
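A minimal sketch of the free-scale variant described above, assuming the gapminder package from CRAN as a stand-in for the data read in earlier:

```r
library(dplyr)
library(ggplot2)
library(gapminder)  # assumption: CRAN copy of the data used in this episode

p_free <- gapminder %>%
  ggplot(aes(x = year, y = gdpPercap, group = country)) +
  geom_line() +
  facet_wrap("continent", scales = "free_y")  # each panel gets its own y axis

p_free
```

The x axis stays shared across panels; only the y axis is allowed to vary.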

Aside: Interactively exploring graphs

There are some outlying data points in the solution to challenge 3. You might be wondering which country these belong to. Unfortunately there isn’t an easy way of doing this neatly in ggplot2.

It is possible to use an additional library called plotly to convert a ggplot graph to a plotly graph. This is interactive, and will let you zoom in on areas of interest and hover over points to see a tooltip containing their values.

To do this, make your graph, and then run library(plotly) to load the plotly library and ggplotly() to convert the most recent ggplot graph to a plotly graph.

(If you don’t have the plotly library installed on your machine, you can install it with install.packages("plotly").)
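Put together, the steps might look like the following sketch. It assumes plotly is installed and uses the gapminder package from CRAN for the data; the plot is passed to ggplotly() explicitly here, which should be equivalent to converting the most recent plot:

```r
library(dplyr)
library(ggplot2)
library(plotly)
library(gapminder)  # assumption: CRAN copy of the data used in this episode

p <- gapminder %>%
  ggplot(aes(x = year, y = gdpPercap, group = country)) +
  geom_line() +
  facet_wrap("continent")

# Convert the static ggplot into an interactive plotly graph
p_interactive <- ggplotly(p)
p_interactive
```

In RStudio the result opens in the Viewer pane rather than the Plots pane.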

A pure ggplot2 approach is to use geom_text() to label each data point with the country (this uses the label aesthetic to select which variable in the data to use as the label):

gapminder %>% ggplot(aes(x = year, y = gdpPercap, group = country, label = country)) + geom_line() + geom_text() + facet_wrap("continent")

The output from this clearly isn’t suitable for publication, but it may be sufficient if you just need to produce something for your own use.

The labels overlap each other, and a label is plotted for every data point. You can deal with the latter issue by creating a new variable which only contains the label for one point per group (i.e. per country), and only for the groups you wish to label. You can do this using mutate and ifelse as described at the end of the previous episode.
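One way this could be sketched is shown below. The column name country_label and the 40,000 threshold are hypothetical choices made for illustration, and the gapminder package from CRAN is assumed for the data:

```r
library(dplyr)
library(ggplot2)
library(gapminder)  # assumption: CRAN copy of the data used in this episode

labelled <- gapminder %>%
  group_by(country) %>%
  mutate(
    # Label only the final year of each country, and only countries whose
    # GDP per capita ever exceeded 40,000 (arbitrary threshold for this sketch)
    country_label = ifelse(
      year == max(year) & max(gdpPercap) > 40000,
      as.character(country),
      NA_character_
    )
  ) %>%
  ungroup()

p_labelled <- labelled %>%
  ggplot(aes(x = year, y = gdpPercap, group = country)) +
  geom_line() +
  geom_text(aes(label = country_label), na.rm = TRUE, hjust = 1) +
  facet_wrap("continent")

p_labelled
```

Rows where country_label is NA draw no text, so each selected country is labelled once, at the right-hand end of its line.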

Pre-processing data

When we want to start sub-setting and mutating the data before plotting, the usefulness of “piped” data-analysis becomes apparent; we can perform our data transformations and then send the result to ggplot2 without making an intermediate data set.

For example, if we wanted to produce a version of the graph in challenge 3, but only for countries in the Americas, we could use:

gapminder %>% filter(continent == "Americas") %>% ggplot(aes(x = year, y = gdpPercap, group = country)) + geom_line() + facet_wrap("continent")

Challenge 4

Rather than plotting the life expectancy of each country over time, make a plot showing the average life expectancy in each continent over time.

Hint - Challenge 3 of the previous episode may be useful. This can then be piped into a ggplot command.

Solution to challenge 4

gapminder %>% group_by(continent, year) %>% summarise(mean_lifeExp = mean(lifeExp)) %>% ggplot(aes(x = year, y=mean_lifeExp, colour = continent)) + geom_line()
`summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.

Transformations

ggplot2 also makes it easy to transform axes, to better show our data. To demonstrate we’ll go back to our first example:

gapminder %>% ggplot(aes(x = gdpPercap, y = lifeExp)) + geom_point()

Currently it’s hard to see the relationship between the points due to some strong outliers in GDP per capita. We can change the scale of units on the x axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic. We also modify the transparency of the points, using the alpha argument, which is especially helpful when you have a large amount of data which is very clustered.

gapminder %>% ggplot(aes(x = gdpPercap, y = lifeExp)) + geom_point(alpha = 0.5) + scale_x_log10()

The scale_x_log10 function applied a transformation to the values of the gdpPercap column before rendering them on the plot, so that each multiple of 10 now only corresponds to an increase of 1 on the transformed scale, e.g. a GDP per capita of 1,000 is now at position 3 on the x axis, a value of 10,000 corresponds to position 4, and so on (the axis labels still show the original values). This makes it easier to visualize the spread of data on the x-axis. If we want to plot the y-axis on a log scale we can use the scale_y_log10 function.
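A quick check of this arithmetic with base R’s log10() function; the three values are just illustrative GDP figures:

```r
# Positions of some GDP per capita values on a log10 scale
gdp_values <- c(1000, 10000, 100000)
positions <- log10(gdp_values)
positions
```

Each tenfold increase in GDP per capita moves a point by the same distance along the axis, which is why the clustered low values spread out.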

Challenge 5

Modify the faceted plot you produced in challenge 3 to show GDP per capita on a log scale.

Solution to challenge 5

We can add the scale_y_log10() to our plotting command:

gapminder %>% ggplot(aes(x = year, y = gdpPercap, group = country)) + geom_line() + facet_wrap("continent") + scale_y_log10()

Although this makes it easier to visualise all of the data on a single plot, it makes the inequality in GDP per capita between the different continents much less obvious.

If we plot the data with a linear scale the inequality is more obvious, but this masks the individual trajectories of many countries’ GDPs. Decisions about how best to plot data are beyond the scope of this course. Research IT offers a course, Introduction to data visualisation and analysis, which covers this topic in much more detail.

Plotting 1D data

In the examples so far we’ve plotted one variable against another. Often we wish to plot a single variable. We can plot counts using geom_bar(). For example, to plot the number of countries in the gapminder data that are in each continent we can use:

gapminder %>% filter(year == 2007) %>% ggplot(aes(x=continent)) + geom_bar()

We filter to a single year of data to avoid counting each country multiple times.

We often wish to explore the distribution of a continuous variable. We can do this using a histogram (geom_histogram()), or a density plot (geom_density()).

For example, to produce a histogram of GDPs per capita for countries in Europe in 2007:

gapminder %>% filter(year == 2007, continent == "Europe") %>% ggplot(aes(x=gdpPercap)) + geom_histogram(bins = 10)

We can specify the number of bins (bins = ), or the width of a bin (binwidth = ).
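For comparison with the bins = 10 example, here is a sketch that fixes the bin width instead. The width of 5,000 is an arbitrary choice for illustration, and the gapminder package from CRAN is assumed for the data:

```r
library(dplyr)
library(ggplot2)
library(gapminder)  # assumption: CRAN copy of the data used in this episode

p_binwidth <- gapminder %>%
  filter(year == 2007, continent == "Europe") %>%
  ggplot(aes(x = gdpPercap)) +
  geom_histogram(binwidth = 5000)  # each bar spans 5,000 units of GDP per capita

p_binwidth
```

Setting binwidth keeps the bars comparable between plots of different data, whereas bins adapts the width to the range of the data.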

We can plot a density plot using geom_density(). This is a smoothed version of a histogram.

gapminder %>% filter(year == 2007, continent == "Europe") %>% ggplot(aes(x = gdpPercap)) + geom_density()

By default the density estimate is drawn in outline (i.e. it isn’t filled in). We can use the fill attribute to fill it in; this can be passed in the aesthetic (e.g. aes(x = gdpPercap, fill = ...)) to fill according to the data, or directly to geom_density(). The colour attribute controls the outline of the shape. For example:

gapminder %>% filter(year == 2007, continent == "Europe") %>% ggplot(aes(x = gdpPercap)) + geom_density(fill = "red", colour="blue")

Challenge 6

In this challenge, we’ll extend the plot above to compare the distributions of GDP per capita in Europe and Africa over time. As the challenge is quite long, it’s broken down into sections. Please try each section before looking at the answer.

a. We’ll start off by plotting the data for a single year, before extending the plot for multiple years. Using the code above as a starting point, write some code to return a tibble containing the data for Europe and Africa in 2007. Hint: the %in% operator may be useful.

Solution a

gapminder %>% filter(year == 2007) %>% filter(continent %in% c("Europe", "Africa"))
# A tibble: 82 × 6
   country                 year      pop continent lifeExp gdpPercap
   <chr>                  <dbl>    <dbl> <chr>       <dbl>     <dbl>
 1 Albania                 2007  3600523 Europe       76.4     5937.
 2 Algeria                 2007 33333216 Africa       72.3     6223.
 3 Angola                  2007 12420476 Africa       42.7     4797.
 4 Austria                 2007  8199783 Europe       79.8    36126.
 5 Belgium                 2007 10392226 Europe       79.4    33693.
 6 Benin                   2007  8078314 Africa       56.7     1441.
 7 Bosnia and Herzegovina  2007  4552198 Europe       74.9     7446.
 8 Botswana                2007  1639131 Africa       50.7    12570.
 9 Bulgaria                2007  7322858 Europe       73.0    10681.
10 Burkina Faso            2007 14326203 Africa       52.3     1217.
# … with 72 more rows

This returns a tibble, which we can then pipe into ggplot.

b. Pipe the results of part a into ggplot, to make a density plot of GDP per capita, setting the fill colour by continent (i.e. each continent has its own density estimate).

Solution b

gapminder %>% filter(year == 2007) %>% filter(continent %in% c("Europe", "Africa")) %>% ggplot(aes(x = gdpPercap, fill = continent)) + geom_density()

c. This looks OK, but the continents’ density estimates overlay each other. Use the alpha = option to make each density estimate semi-transparent.

Solution c

gapminder %>% filter(year == 2007) %>% filter(continent %in% c("Europe", "Africa")) %>% ggplot(aes(x = gdpPercap, fill = continent)) + geom_density(alpha = 0.5)

d. Let’s take a look at how the relative GDPs per capita have changed over time. We can use facet_wrap() to do this. Modify your code to produce a separate graph for each year.

Solution d

gapminder %>% filter(continent %in% c("Europe", "Africa")) %>% ggplot(aes(x = gdpPercap, fill = continent)) + geom_density(alpha = 0.5) + facet_wrap("year")

Note that you need to remove the filter(year == 2007) line from the code.

Modifying text

To clean this figure up for a publication we need to change some of the text elements. For example the axis labels should be “human readable” rather than the variable name from the data-set. We may also wish to change the text size, etc.

We can do this by adding a couple of different layers. The theme layer controls the axis text, and overall text size. Labels for the axes, plot title and any legend can be set using the labs function. Legend titles are set using the same names we used in the aes specification; since we used the fill property to colour by continent we use fill = "Continent" in the labs() function.

gapminder %>%
  filter(continent %in% c("Europe", "Africa")) %>%
  ggplot(aes(x = gdpPercap, fill = continent)) +
  geom_density(alpha = 0.5) +
  facet_wrap("year") +
  labs(
    x = "GDP per capita", # x axis title
    y = "Density",        # y axis title
    title = "Figure 1",   # main title of figure
    fill = "Continent"    # title of legend
  )
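The theme layer is mentioned but not used in the example. A minimal sketch of how it might be added follows; the sizes 14 and 8 are arbitrary choices for illustration, and the gapminder package from CRAN is assumed for the data:

```r
library(dplyr)
library(ggplot2)
library(gapminder)  # assumption: CRAN copy of the data used in this episode

p_theme <- gapminder %>%
  filter(continent %in% c("Europe", "Africa")) %>%
  ggplot(aes(x = gdpPercap, fill = continent)) +
  geom_density(alpha = 0.5) +
  facet_wrap("year") +
  labs(x = "GDP per capita", y = "Density", fill = "Continent") +
  theme(
    text = element_text(size = 14),      # overall text size
    axis.text = element_text(size = 8)   # smaller tick labels on both axes
  )

p_theme
```

Settings for more specific elements, such as axis.text, take precedence over the general text setting.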

RStudio provides a really useful cheat sheet of the different layers available, and more extensive documentation is available on the ggplot2 website.

Saving plots

Having produced a plot, we can save it, or copy it to the clipboard using the “Export” command at the top of RStudio’s plot window.

It’s a better idea to save your plots as part of your scripts; this way if you modify your analysis code, you know the plot will reflect the results of the code. If you manually save the plot, you have to remember to do this after changing the script.

We can save the most recently produced ggplot using the ggsave() function:

ggsave("plots/myplot.png")
# Can also set the size of plot
ggsave("plots/myplot.pdf", width = 20, height = 20, units = "cm")

The help for the ggsave() function lists the image formats that are available, as well as the options for setting the resolution and size of the saved image.
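As a sketch of those options, the example below passes the plot to ggsave() explicitly through its plot argument and sets the resolution with dpi. The file is written to a temporary directory for this illustration only (in a real script you would use a path such as the plots/ folder above), and the gapminder package from CRAN is assumed for the data:

```r
library(dplyr)
library(ggplot2)
library(gapminder)  # assumption: CRAN copy of the data used in this episode

p <- gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  scale_x_log10()

# Hypothetical output location, chosen so the sketch runs anywhere
out_file <- file.path(tempdir(), "myplot.png")

# Save a specific plot object at 300 dots per inch
ggsave(out_file, plot = p, width = 20, height = 15, units = "cm", dpi = 300)
```

Naming the plot explicitly avoids depending on which plot happened to be drawn most recently.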

ggplot themes and extensions

ggplot is very flexible, and its capabilities can be extended.

The theme of a plot affects the background, axes etc. The ggplot2 themes package contains many useful (and not so useful) themes we can apply to our data. The cowplot package makes it easy to plot sub-plots, and to overlay plots within plots.

The ggplot2 extensions page lists R packages that can extend its capabilities. If you have a specialised plotting need (for example plotting ROC curves, survival data, or time series) there are packages that will allow you to make these plots with minimal effort. The top 50 ggplot2 visualisations page provides examples (with full code) of almost any type of graph you might want to make.

As an example of how easy it can be to extend ggplot, we will use the ggridges package to produce a stacked density plot, to better visualise the previous figure:

library(ggridges)
gapminder %>% filter(continent %in% c("Europe", "Africa")) %>% ggplot(aes(x = gdpPercap, y = factor(year), fill = continent)) + geom_density_ridges(alpha = 0.5)
Picking joint bandwidth of 1850

Data Visualization - A practical Introduction is an online book which covers good practice in data visualisation, using R and ggplot2 to illustrate this.

Key Points

  • Use ggplot2 to create plots.

  • We can feed the output of a dplyr pipe into ggplot2 to pre-process data

  • Plots are built up using layers: aesthetics, geometry, statistics, scale transformation, and grouping.

Author: Kerri Lueilwitz