This Workshop covers basic plotting in R and essential visual communication principles.
We will talk about:
The workshop is part of a three day workshop in R including sessions on data handling and basic and advanced statistical modelling and assumes knowledge of these sections.
If you work through this on your own, you may want to start with this intro on the basics in R, and this [workshop on data handling(http://fdschneider.de/r_datahandling/).
The examples and code snippets require that you install the following packages:
install.packages("ggplot2", "readxl", "data.table", "RColorBrewer", "colorspace", "Shiny")
For the following examples, I will use the datasets from wednesday: LUI_factors
, voeller
and flowerdata
.
voeller <- read.csv("data/voeller.csv", encoding = "UTF-8")
flowerdata <- read.csv("data/flowerdata.csv", encoding = "UTF-8")
LUI_factors <- read.csv("data/LUI_factors.csv", encoding = "UTF-8")
We will refer to this dataset in the exercises:
Data from: Resource-mediated indirect effects of grassland management on arthropod diversity (2014) Simons NK, Gossner MM, Lewinsohn TM, Boch S, Lange M, Müller J, Pašalić E, Socher SA, Türke M, Fischer M, Weisser WW; DOI: https://doi.org/10.5061/dryad.f3b77
library(readxl)
arthropods08 <- read_xlsx("data/Plant_herb_carn/01_Arthropods_2008.xlsx")
arthropods09 <- read_xlsx("data/Plant_herb_carn/01_Arthropods_2009.xlsx")
plantcover08 <- read_xlsx("data/Plant_herb_carn/02_Plant_cover_2008.xlsx")
plantcover09 <- read_xlsx("data/Plant_herb_carn/02_Plant_cover_2009.xlsx")
landuse <- read_xlsx("data/Plant_herb_carn/03_Landuse_2007to2009.xlsx")
semdata08 <- read_xlsx("data/Plant_herb_carn/04_SEMData_2008.xlsx")
semdata09 <- read_xlsx("data/Plant_herb_carn/04_SEMData_2009.xlsx")
library(data.table)
dd_full <- rbindlist(list(semdata08, semdata09), fill = TRUE)
dd_full <- transform(dd_full,
PlotID = as.factor(PlotID),
Region = as.factor(Region),
Lifestock = as.factor(Lifestock)
)
The presentation slides can be found here!
R graphics by Florian Schneider is licensed under a Creative Commons Attribution 4.0 International License.
Visual:
Communication:
Graphics often not standalone, but accompanying text.
In the first place, we show data. Visuals are adding high-resolution and detail, were text can only transport summary and statistics.
Advantages of graphics:
Example: This two-dimensional mapping of multi-dimensional trait-space illustrates the global spectrum of plant form and function (Diaz et al. 2016). The lower panels illustrate subsets of the same data (b: woody plants, c: plant classes) .
Example: besides the spatial context, maps contain information about area, shape, and neighborhood of a data point. It is easy to read where values are high or low because of spatial context and where and at which level to look for explanatory variables.
(Source: E.R. Tufte “The Visual Display of Quantitative Information”)
Example: Anscombes famous quartet of x-y datasets. Each dataset has exactly the same stats regarding means, coefficient of correlation, model fit etc., but visually, they tell a totally different story.
Good data analyses are one thing. But the complex results in science can not be transported by statistics alone. In research papers, figures should do a good part of the job. But very often, they do a bad job.
As a fundamental critique to the practice of graphical communication Edward Tufte published this series of books, where he elaborated on why graphics fail at being informative and – more importantly – fail at being truthful about the underlying pattern in the data.
In this workshop, in addition to the technical aspects, I want to discuss three big challenges of visual communication in Data Science:
Note that we are not talking about plots for reviewing your data (like Q-plot or analysis of residuals)! those have been covered in the statistics lecture by Caterina.
Many available R packages facilitate plotting into publishable output. Particularly, ggplot2 is a great tool to design graphics since it is build around the logic of plot design, separating content (i.e. the data) from aesthetics (i.e. the design and choice of elements).
However here is a word of caution: Just like other software, like Excel or SASS, plotting packages come with the risk of ridding oneself of the responsability of thinking about plot elements. Indeed, one intention of those software packages is creating an own style that readers would recognise and make them use their software.
I am using base R for my examples because I believe, first comes understanding at the low level of plotting grammar, before we can start using powerful tools.
Learn design principles! Not usage of tools.
After learning about the basics, we’ll talk about ggplot2 as well.
In base R, plotting commands are either high-level, which means they create a new plot, or low-level, which means they add something to an existing plot. Additionally, on the highest level, there are the graphical parameters which define the plotting canvas, like margins and multi-panel arrangements. Finally, there are multiple graphics devices, virtual devices in which the plot is created. The default device is a window on your screen, but you also can open a file device (pdf or jpg) and put your plot there.
Examples for high level plotting functions are:
plot()
plot(Plant_SpeciesRichness ~ Herbivore_SpeciesRichness, data = dd_full)
Note the syntax here! the plot()
function either takes
plot(flowerdata$LUI, flowerdata$inflorescences_qcm, ...)
plot(inflorescences_qcm ~ LUI, data = flowerdata, ...)
plot(subset(flowerdata, select = c(LUI, inflorescences_qcm)), ...)
All calls produce exactly the same output!
plot(..., type = "l")
plot(co2)
In this case, the object provided to plot()
is a timeseries, which automatically is plotted as a line. Other data must be specified to plot as lines.
plot(x = 1:12, y = rnorm(12,5,1.2), type = "l", ylim = c(0,10) )
x <- z <- seq(0,1,length = 50)
f <- function(x) 1.2 + 0.4*x - 2.1*x^2 + 1.1*x^3
plot(f, ylim = c(0,2), xlim = c(0,2))
Note that R identifies the object class of the first argument and decides which kind of plot to apply. These switches for different cases are called ‘methods’. If providing a function, it calls the plot function curve()
.
boxplot()
plot(Plant_biomass ~ Lifestock, data = dd_full)
The data provided are a categorical explanatory and a continuous response variable, thus calling for the boxplot()
function.
barplot( tapply(dd_full$Plant_biomass, dd_full$Lifestock, mean),
xlab = "Lifestock", ylab = "Plant biomass")
hist()
hist(voeller$plant_height, breaks = 24)
persp()
x <- y <- seq(0,1,length = 25)
dd <- expand.grid(data.frame(x,y))
dd$z <- with(dd, 0.1 + 0.4*x - 1.5*x^2 + 1.15*x^3 + 0.2*y - 0.4*y^2 + 0.3*y^3 + 0.2*x*y - 0.23 *x^2*y^2 + 0.33 *x^3*y^3)
persp(x,y,z = matrix(dd$z, nrow = 25, byrow = FALSE), ticktype = "detailed", zlim = c(0,0.6), shade = 0.4)
The typical high-level plot arguments include
xlim
and ylim
, xaxs
and yaxs
), boxtype (bty
) and background (bg
), fix aspect ratio (asp
)log = "xy"
), axis label orientation (las
), font size (cex.axis
), specify tick mark labels (labels
), tick mark position (xaxp
and yaxp
) , tick length (tck
)ylab
and xlab
), plot titles (main
and sub
)cex
), point character (pch
), colors (col
and fg
)lty
) and line width (lwd
), colors (col
)look up ?par
and plot-specific help files (e.g. ?boxplot
).
In contrast, low level plotting commands are adding something to an existing plot
points()
or symbols()
)plot(inflorescences_qcm ~ LUI, data = flowerdata, log = "xy")
## Warning in xy.coords(x, y, xlabel, ylabel, log): 142 y values <= 0 omitted
## from logarithmic plot
points(inflorescences_qcm ~ LUI, data = subset(flowerdata, plant_genus == "Stellaria"),
pch = 20, col = "red", cex = 2
)
lines()
, abline()
)plot(n_inflorescences ~ plant_height, data = voeller)
model <- lm(n_inflorescences ~ plant_height, data = voeller )
abline(model) # adds a model line
lines(c(0,100), c(800,800), lwd = 2)
axis()
)plot(n_inflorescences ~ plant_height, data = voeller, xlim = c(0,100))
axis(4)
legend()
)plot(n_inflorescences ~ plant_height, data = voeller, xlim = c(0,100),
pch = 20, col =c("black", "red", "blue")[as.integer(voeller$eventID)] )
legend(3.5,800, c("HAI", "SCH", "ALB"), pch = c(20),
col = c("black", "red", "blue"))
text()
and mtext()
)plot(n_inflorescences ~ plant_height, data = voeller, xlim = c(0,100),
pch = 20, col =c("black", "red", "blue")[as.integer(voeller$eventID)] )
mtext("a)", adj = 0, line = 0.3)
Tipp: The parameter xpd = TRUE
allows to plot outside of plotting region.
plot(n_inflorescences ~ plant_height, data = voeller, xlim = c(0,100),
pch = 20, col =c("black", "red", "blue")[as.integer(voeller$eventID)] )
arrows(105,0, 105, 600, xpd = TRUE)
You can look up the options in ?par
and the specific low-level plotting functions!
The graphical parameters are ususally defined before a high-level plot is called using the function par()
. They set the scene for the plot(s) that follow and are valid until the graphics device is terminated using dev.off()
.
Important graphical parameters are:
mar
and outer oma
, specified in bottom, left, top, right order)family
,font
or font.lab
specify font, e.g. for axis labels)cex
, cex.lab
etc.)new = TRUE
the existing frame is not cleaned)Note that many parameters specific to the plot’s high- and low-level elements can be specfied here generally, to apply to all subsequent plots, e.g. axis type or point character. Calling ?par
gives a full list.
The most common settings here are the outer and inner margins.
For reducing space around single plot panels, setting mar
with a vector in order bottom, left, top, right will do.
par(mar = c(4,4,4,6))
plot(n_inflorescences ~ plant_height, data = voeller, xlim = c(0,100),
pch = 20, main = "Inflorescences" )
The ggplot2 package is based on a different logic of graphic composition and therefore many base functions are completely replaced by own ggplot-specific calls.
Its advantages:
Its disadvantages:
The basic principle is that you define a plot object that contains the specifications for which data to plot how and using which scheme.
The central function is ggplot()
which specifies how the data map to aestethics, that is for instance x and y coordinates. Additional calls are appended using +
to specify which elements to show eventually. Those calles can be
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.3
dat <- data.frame(xval = 1:4, yval = c(2,5,6,9), group = c("A","B","A","B"))
ggplot(dat, aes(x = xval, y = yval)) +
geom_point()
Linking the calls with +
also allows you to build your plot successively, e.g. when plotting for different purposes.
p <- ggplot(dat, aes(x = xval, y = yval)) +
geom_point()
p + theme_classic()
I will for this seminar focus on basic R, but occasionally will present a simple ggplot2 alternative. The scatterplot example from above could be produced in ggplot2 using
ggplot(voeller, aes(x = plant_height, y = n_inflorescences, colour = eventID)) +
geom_point() + xlim(0,100) + theme_classic()
## Warning: Removed 3461 rows containing missing values (geom_point).
It is slightly shorter in code. Especially, mapping a third dimension of information to a color code, like in this case the region, is more intuitive and includes the plotting of a legend. The axis is altered by calling the function xlim()
or ylim()
. The theme can be changed to a lighter classic look.
That said, this seminar cannot cover all features of ggplot2 but I will try to encourage its use. Please check out the book and package documentation by Hadley Wickham, which is quite extensive: http://ggplot2.tidyverse.org/index.html
Also, the book R Graphics Cookbook by Winston Chang has lots of recipes for plotting in ggplot.
For this exercise, we’ll use the grassland flower trait data of Voeller et al. and plot reproductive allocation ratio in response to plant height:
voeller <- read.csv("data/voeller.csv")
Try to reproduce as many features of this graph as possible (advanced users may use ggplot2).
Hints:
par(las = 1)
par(bty = "l")
par(mar = c(4,4,1,1))
xlim()
and xaxp()
ylab =
and xlab =
lm()
and abline()
Solution:
We’ll discuss this in the subsequent section step-by-step.
par(mar = c(4,4,1,1), las = 1, bty = "l")
plot(n_inflorescences ~ plant_height, data = voeller,
xaxp = c(0,100,4), xlim = c(0,100),
ylab = "number of inflorescences [m²]",
xlab = "plant height [cm]",
pch = 20)
abline(lm(n_inflorescences ~ plant_height, data = voeller))
Noise in graphics are distortions, distractions, anything that puts obstacles in the way of the observer.
In descriptive statistics and modeling, researchers know the principle of Occam’s razor: To describe a thing, use as few elements as possible and as many as necessary.
It is rarely the case that a graph shows too many data. If a graph appears cluttered, this is mostly because it contains to many elements of data or uses a noisy design. The message delivered from a blurry, dense cloud of data points is also that a large amount of data was available to draw this conclussion. Another message might be the huge complexity of the analysed system (e.g. networks). Thus, reducing data density is not a good choice.
A bad Example:
With the choice of color and plot style, resulting in heavy overplotting it is difficult to extract any information about the size and price of diamonds, depending on their quality.
A good Example: combine summary and raw-data when showing richess in data (panel E) or illustrate network complexity by density of lines.
(Source: Brose et al. 2016 Biological Reviews)
Reduce elements of graphs that do not contribute information.
In this graph, the entire legend and filling is redundant with the x-axis labels. The number of observations could be placed on top of the bars.
(Source: E.R. Tufte “The Visual Display of Quantitative Information”)
Improved version:
Edward Tufte suggests that we should maximize the ‘data-ink ratio’:
\[ \textrm{data-ink ratio} = \frac{\textrm{ink that transports information}}{\textrm{total ink of the graph}} \]
Reduction of noise and chartjunk is key for producing statistical graphics for publication. The following elements are to consider:
Particularly important for scatter plots are the point formatting arguments pch
and col
which can take a single value or a vector of values, which allows you to specify the shape and color of each single data point.
plot(Herbivore_SpeciesRichness ~ Predator_SpeciesRichness, data = dd_full,
pch = c(18,20,17)[Region],
col = c( "black", "red","blue")[Region]
)
Note that the subsetting of the shape or color vector with a factorial variable, e.g. c(18,20,17)[Region]
, will yield a vector that addresses each individual data point.
table of point characters:
Tips:
pch
can be characters: pch = c("A","B")
18
or points 20
) rather than open ones (crosses 3
or 4
or circles 1
)In most cases, the best choice is a filled point mark (number 20) which can easily be colour coded.
In plots with high data density, the message you want to deliver is not the precise position of individual points, but the distribution and density of points, i.e. the pattern. Solid data points are not well suited to deliver this information.
A simple solution to this could be to use opaque data points by specifying color with function rgb()
including a low alpha value.
par(las = 1, bty = "l")
plot(reproductive_allocation_ratio ~ plant_height, data = voeller,
pch = 19, col = rgb(0,0,0, alpha = 0.1)
)
Of course, if it is mostly about density, you may plot a density heatmap, which is easy in ggplot:
ggplot(voeller, aes(x = plant_height, y = reproductive_allocation_ratio)) +
stat_bin2d(bins = 50) +
scale_fill_gradient(low = "white", high = "black", limits = c(0,25)) +
geom_density_2d(colour = "black", n = 200) +
xlab("") +
theme_classic()
## Warning: Removed 622 rows containing non-finite values (stat_bin2d).
## Warning: Removed 622 rows containing non-finite values (stat_density2d).
Basically, every plot element can be colored. For scientific graphs, color should only be used when it encodes meaningful information! A rule of thum would be: don’t use color, if the difference can be made by shape or by clear labelling.
R’s default color palette is somewhat cruel.
In principle, all plot commands have an argument for color for points, lines or areas. You can specify a color palette by providing a vector with the color names.
As shown above, we apply the squared brackets for assigning each factor level in column eventID an own color value.
For continuous variables this requires binning of the vector into discrete variables, before assigning colors. However, adding a legend for this is not trivial.
plot(Herbivore_SpeciesRichness ~ Predator_SpeciesRichness, data = dd_full,
pch = 19,
col = gray.colors(25)[cut(Mowing, 25)]
)
The simplest form of color difference is black and white, or halftone (i.e. grayscale), which is always safe for print and on screen.
There are four basic ways of defining colors.
rainbow()
, heat.colors()
, terrain.colors()
, topo.colors()
However the first three options arguably leave you with an overwhelming choice. The latter is designed to produce contingent color tables, rather than divergent palettes that are strong in contrast.
You can create your own color palettes using the function colorRampPalette()
pal <- colorRampPalette(c("springgreen", "darkgoldenrod1", "lightsalmon"))
pal(8)
## [1] "#00FF7F" "#48EB5F" "#91D73F" "#DAC31F" "#FFB51E" "#FFAE3C" "#FFA75B"
## [8] "#FFA07A"
It is not recommended to pick arbitrary or subjectively aesthetic colors for a scientific plot. Rather you should rely on professionally designed color palettes for this purpose.
One of the best collection of colors is provided in the RColorBrewer package.
library(RColorBrewer)
plot_palette(brewer.pal(8, "Dark2"))
You should constrain your use to the many colorblind friendly options offered by this package:
display.brewer.all(colorblindFriendly = TRUE)
There is also the very convenient package ‘colorspace’ which lets you chose your color palette in a graphical interface:
library(colorspace)
pal <- choose_palette()
will return you a function that can be called to produce n
steps of the choosen color palette:
The palettes produced by this tool maximize contrast even for colorblind viewers.
par(las = 1, bty = "l")
plot(n_inflorescences ~ plant_height, data = voeller,
pch = 19, xlim = c(0,100),
col = c( "#E88797", "#BA92E6", "#00B8C6")[eventID]
)
Finally, ggplot2 specifies color within the aes() argument of the ggplot() call, i.e. a third level of information mapped to color. It comes with tools for color ramps defined as scale functions.
ggplot(voeller, aes(x = plant_height, y = n_inflorescences, colour = eventID)) +
geom_point() +
scale_color_brewer(palette = "Dark2") +
xlim(0,100) + theme_classic()
## Warning: Removed 3461 rows containing missing values (geom_point).
scale_color_brewer()
applies the palettes of the RColorBrewer packagescale_color_manual()
allows you to specify own colorsscale_color_hue()
alters the shade/hue of your selected color paletteFor continuous variables, we use scale_color_gradient()
:
ggplot(voeller, aes(x = plant_height, y = n_inflorescences, colour = latitude)) +
geom_point() +
scale_colour_gradient(low = "salmon", high = "red3") +
xlim(0,100) + theme_classic()
## Warning: Removed 3461 rows containing missing values (geom_point).
Note that ggplot gracefully produces an exact legend!
By default, R applies regular intervals and aligns the tick labels with the axis. It also extends the axes into a box frame.
Set y-axis labels horizontal to be less irritating to read. Use box frame with reason.
Chose axis dimensions to produce a balanced data range. Reduce steps of axis with reason.
Don’t stress the levels of tick marks too much. Use as little steps as necessary to allow the reader to still locate the data points. Especially, if data are normalized to abstract axes, estimating the exact quantity will be impossible anyway.
Add meaningful axis labels.
In R, the plot area is usually just a blank rectangle in R, while ggplot by default plots a grey background area.
# offset grid on grey (ggplot base style)
ggplot(d1, aes(x = x,y = y) ) + geom_point()
Apart from the fact that a grey area is not justified really (it adds too much ink and no information), for a plot that is about the distribution and pattern of your data, this grid adds little information and should be omitted.
The grid lines might be a useful feature wherever the datapoints are identified individually.
Box plots are noisy by default. The dashed lines of the whiskers, the horizontal end of whiskers, are unnecessary specifications.
boxplot(Plant_SpeciesRichness ~ Region, data = dd_full)
They can be simplified by using their internal pars
argument. See ?boxplot
and ?bxp
for further information.
boxplot(Plant_SpeciesRichness ~ Region, data = dd_full,
ylim = c(0,75), las = 1,
pars = list(boxwex = 0.3, staplewex = 0,
whiskwex = 0, whisklty = 1, outpch = 20, outcex = 0.8)
)
The ggplot boxplot applies a similar style.
ggplot(dd_full, aes(x = Region, y = Plant_SpeciesRichness)) +
geom_boxplot() + ylim(0,75)
Data-related science in biology, meteorology, geography, social sciences strive for objectivity and use data to build theories and widen our understanding of the world. The data are measured using standardised methods, processed analysed in transparent statistical applications (code) and published as peer-reviewed papers and reports.
Graphics are a central element of the publication, key for delivering additional information and convincing the reader of the findings of the study. The truthfulness of the graphic is as important as maintaining objectivity in writing and statistics.
To be truthful, the relative sizes and lengths of data shown in your figure must precisely correspond to the quantities in the underlying data.
An intended exaggeration and over-stating facts is fraud!
Example: In this example, the visual impression of the value arises from the volume of the barrel, but the values define the height. Perspective distortion adds to the impression that the price per barrel crude oil must have exploded by multiple orders of magnitude, not just factor 6.
(Source: E.R. Tufte “The Visual Display of Quantitative Information”)
But sometimes, distortions and optical illusions are created because of a lack of awareness: Particularly risky are 3D plots to visualise 1D data.
Example from MS Excel:
But unintentional exaggeration can happen in R, too. For instance, if plotting values as radii of symbols. The area of each data point is supposed to scale with the value, but here it is the radius that scales with the z-Dimension ‘population size’.
## Warning: package 'gapminder' was built under R version 3.4.3
Also, when values scale exponentially, like growth rates, body size effects etc., the plot must reflect this clearly. The lack of logartithmic scaling of axes purports a risk of biasing the visual impression.
This is of particular importance when plotting data that are best described using a generalized model with a non-standard error structure.
Edward Tufte calls this ‘Graphical Integrity’ and he defines the lie-factor of a graph as the ratio
\[ \textrm{lie factor} = \frac{\textrm{size of graphical effect}}{\textrm{size of effect in data}} \]
In the following sections, we will discuss properties in R plotting that are relevant for matching your graphical effects with the effect size in the data.
For a simple addition of linear models to a plain scatterplot, you can use the abline()
function.
model <- lm(Plant_SpeciesRichness ~ Herbivore_SpeciesRichness, data = dd_full)
S_new <- seq(0, 60, length = 51) # generates new x-values
plot(Plant_SpeciesRichness ~ Herbivore_SpeciesRichness, data = dd_full, pch = 20)
abline(model)
To add lines of a non-linear model or to trace the linear model in log-log space, we can apply the function predict()
to simulate a new vector of predicted values for a equidistant gradient along our x-axis.
model <- lm(Plant_SpeciesRichness ~ Herbivore_SpeciesRichness, data = dd_full)
S_new <- seq(0, 60, length = 51) # generates new x-values
plant_pred <- predict(model, # simulate y-values
newdata = list(
Herbivore_SpeciesRichness = S_new
), interval = "confidence"
)
The predict function takes the model object and then applies the model equation to a set of new explanatory values provided as a list in newdata
. Take care that there is one vector for each explanatory variable of the model (here: grazing
and fenced
) and that they are of equal length. See ?predict.lm
for further information on the function method.
Then, you can add the model line to your plot including lower and upper confidence bands.
plot(Plant_SpeciesRichness ~ Herbivore_SpeciesRichness, data = dd_full, pch = 20)
lines(S_new, plant_pred[,"fit"])
lines(S_new, plant_pred[,"lwr"], lty = 3)
lines(S_new, plant_pred[,"upr"], lty = 3)
Predict also has methods for other model functions, like nonparametric polynomials or glms and works just as well. Not that this even works if you change your plotting plane into log-log space.
model <- glm(total_biomass ~ plant_height, data = voeller, family = quasipoisson)
summary(model)
##
## Call:
## glm(formula = total_biomass ~ plant_height, family = quasipoisson,
## data = voeller)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.2572 -2.5005 -0.8708 1.3318 11.0920
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.476239 0.303382 -1.57 0.117
## plant_height 0.024266 0.001957 12.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 10.75206)
##
## Null deviance: 6222.0 on 439 degrees of freedom
## Residual deviance: 4197.5 on 438 degrees of freedom
## (3306 observations deleted due to missingness)
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
x_new <- seq(0, 200, length = 51) # generates new x-values
plant_pred <- predict(model, newdata = list(plant_height = x_new), type="response", se.fit = TRUE)
plot( total_biomass ~ plant_height, data = voeller, pch = 20)
lines(x_new, plant_pred$fit)
lines(x_new, plant_pred$fit + 1.96*plant_pred$se.fit, lty = 3)
lines(x_new, plant_pred$fit - 1.96*plant_pred$se.fit, lty = 3)
This works, regardless of the mapping of the plot in logarithmic scales:
plot( total_biomass ~ plant_height, data = voeller, pch = 20, log = "y")
lines(x_new, plant_pred$fit)
lines(x_new, plant_pred$fit + 1.96*plant_pred$se.fit, lty = 3)
lines(x_new, plant_pred$fit - 1.96*plant_pred$se.fit, lty = 3)
Note that the package ggplot2
provides some tools to generate confidence intervals with ease.
library(ggplot2)
ggplot(dd_full, aes(x = Herbivore_SpeciesRichness, y = Plant_SpeciesRichness)) +
geom_point() +
geom_smooth(method='lm',formula=y~x) +
theme_classic()
The base barplot does not have a native option to add error bars. The arrows()
function can be used for that:
Design an own graph, using your own data or data from the course, with at least one of the following features to add a layer of information:
Above all, show the data! Maximize data-ink ratio and minimize the lie-factor!
Figures are rarely simple and composed from only one panel. In most cases, panels are ordered and supposed to show different layers of the same phenomenon, like correlations between multiple factors.
Co-variation of multivariate data can be shown in multi-panel plots. More useful for data exploration.
Positive or negative correlation between response values along the same predictor value can be shown in stacks:
(Source: Schneider et al. 2012 J.Anim.Ecol.)
(Source: Jochum et al. 2012 Phil. Trans. R.Soc. B)
Ideal for showing correlative effects in parameter space are multiples. Once the x-y-coordinate system is clearly laid out, the other data can be shown in smaller layout or without the noise of labels and legends.
(Source: E.R. Tufte “The Visual Display of Quantitative Information”)
The transfer achieved when reading multiples is extraordinary, as the example of sea ice extend in the arctic shows. It shows seasonal variation along y and yearly decline on x.
(Source: NASA)
This combination of Micro pattern and Macro pattern allows for the reader to zoom in and out and acheive a contextual view of things, a understanding of events across time.
Other arrangements are possible to illustrate relation between different plots. E.g. a central schematic figure which is explained by statistical plots.
(Source: Schneider et al. 2012 Ecol Lett)
By default, a single plot is plotted to the device, and it is overwritten, or a second page is prompted if another high-level plotting function is called.
The option par(mfrow = c(2,1))
allows with little effort to create rectangular multi-panel plots. But these are all of equal height and width.
par(mfrow = c(2,3))
layout.show(n = 6)
This allows a simple composition of multiples using a for loop.
par(mfrow = c(3,4), las = 1)
for(i in levels(flowerdata$plant_genus)[c(2,4,7,10,11,12,14,17:20,22)]) {
plot(inflorescences_qcm ~ LUI,
data = subset(flowerdata, plant_genus == i),
pch = 19,
xlim = c(0.001,1), ylim = c(1,10000),
main = i,
log = "xy"
)
}
The ggplot2 package knows a concept of ‘facets’ which split a dataset by a factor to plot multiples of its subsets.
ggplot(subset(flowerdata, as.integer(plant_genus) %in% c(2,4,7,10,11,12,14,17:20,22)), aes(y = inflorescences_qcm, x = LUI)) +
geom_point() +
scale_x_log10() + scale_y_log10() +
facet_wrap( ~ plant_genus, nrow = 3)
More advanced is the layout()
function, which uses a matrix grid to define the space occupied by the different plots.
(m <- matrix(c(1,1,1,2,3,6,4,5,6), byrow = TRUE, ncol = 3))
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 2 3 6
## [3,] 4 5 6
layout(m)
layout.show(n = 6)
This is very flexible, since further parameters can be used to define width and height of the columns.
layout(m , height = c(1,2,2), width = c(2,2,1))
layout.show(n = 6)
In combination this allows for highly complex plotting layouts.
m <- matrix(rep(0, 36), byrow = TRUE, ncol = 6)
m[1:5,1:5] <- 1
m[6,1:2] <- 2
m[6,3:4] <- 3
m[6,5:6] <- 4
m[1:2,4:5] <- 5
m[1:3,6] <- 6
m
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 1 1 5 5 6
## [2,] 1 1 1 5 5 6
## [3,] 1 1 1 1 1 6
## [4,] 1 1 1 1 1 0
## [5,] 1 1 1 1 1 0
## [6,] 2 2 3 3 4 4
layout(m , height = c(1,1,1,1,1,3), width = c(1,2,1,2,1,2))
layout.show(n = 6)
par(mar = c(4,4,1,3))
plot(Plant_SpeciesRichness ~ Herbivore_SpeciesRichness, data = dd_full, col = pal(3)[Region], pch = 20, cex = 2)
mtext("a)", adj = 0, line = -1.5)
par(mar = c(5,4,1,1))
boxplot(Plant_SpeciesRichness ~ I(Grazing > 0), data = dd_full)
mtext("b)", adj = 0, line = -1.5)
boxplot(Plant_SpeciesRichness ~ I(Fertilization > 30), data = dd_full)
mtext("c)", adj = 0, line = -1.5)
boxplot(Plant_SpeciesRichness ~ I(Mowing > 0), data = dd_full)
mtext("d)", adj = 0, line = -1.5)
par(mar = c(4,4,2,4))
hist(dd_full$Plant_SpeciesRichness, main = NA, col = "black", breaks = 24, xlab = NA)
par(mar = c(1,1,1,1))
contour(unique(dd$x),unique(dd$y), matrix(dd$z, nrow = 25, byrow = FALSE))
mtext("e)", adj = 0, line = -1.5)
Note that the first plot spans over matrix cells that are later hidden by another figure.
By default, R plots figures into the screen graphics device (windows in Windows, X11 in Ubuntu and Mac), which pops up in a new window. Via buttons, this picture can be exported to files. Alternatively, you can print plots directly into a file, or a virtual ‘device’. When producing figures for production, i.e. files for submission or to embed in presentations, it is important to chose the right file format.
The journal requirements often ask for these things
there are different graphics devices and they can be sorted into raster formats or vector formats:
Raster formats:
bmp()
for bitmap files (no compression)png()
for gif and png files (selective color)jpeg()
for jpg (pixel compression)tiff()
for tiff files (best quality/compression trade-off)Vector formats:
pdf()
for pdf plottingpostscript()
for .eps filesMostly, vector format is what journals want, because its scalable without loss and the file size is small, unless your data points are several thousands. Use raster formats for plots with many (>1000) datapoints or for presentations on screen, because they are more reliable. Use png()
for black-and-white plots in high-resolution!
Raster plots scale with the resolution that you specify. You can either give a precise pixel dimensions by providing width
and height
, or define a target print dimension in centimeters, by setting units = "cm"
and define width
, height
as well as resolutinon res
. A minimal resolution for printing is 300 dots per inch. For screen output, e.g. on websites, 72 dpi is sufficient.
jpeg("filename.jpg",
width = 460, height = 720, #in pixels
antialias = "none" # no smoothing of pixelation
family = "Times" # for journals that want serif
)
plot(...)
dev.off()
pdf is the standard format for paper submission because it is lossless. The file will grow very big if your plot contains many data points. You can chose whether you print your plot on a regular A4 paper format, or cropped to the given plot dimensions. The useDingbats argument is important for making your plot independent of Windows-specific fonts.
pdf("filename.pdf",
width = 7, height = 6, #in inches: 1in = 2.54cm
paper = "special", # or "a4"
useDingbats = FALSE, # avoids weird symbols
family = "Times" # for journals that want serif
)
plot(...)
dev.off()
Note that multiple high-level plot calls in a pdf device create multi-page pdfs!
Saving ggplot plots is slightly different. A simple way is to wrap your plot in the print()
function for writing it to a pdf or jpeg device.
pdf("filename.pdf",
width = 7, height = 6, #in inches: 1in = 2.54cm
paper = "special", # or "a4"
useDingbats = FALSE, # avoids weird symbols
family = "Times" # for journals that want serif
)
print(ggplot(...) + geom_point() + theme_classic())
dev.off()
The other option is the function ggsave()
which is called just following a plot call.
ggplot(...) + geom_point() + theme_classic()
ggsave("filename.pdf", width = 20, height = 20, units = "cm")
RStudio offers a feature for interactive plotting which is a great way to explore complex datasets or to visualize model behaviour depending on parameters.
#
# This is a Shiny web application. You can run the application by clicking
# the 'Run App' button above.
#
# Find out more about building applications with Shiny here:
#
# http://shiny.rstudio.com/
#
library(shiny)
# Define UI for application that draws a histogram
ui <- fluidPage(
# Application title
titlePanel("Biodiversity Exploratories : Land Use Intensity on Plots"),
# Sidebar with a slider input for number of bins
sidebarLayout(
sidebarPanel(
selectInput(inputId = "exploratory",
label = "Exploratory region:",
choices = c("ALB", "HAI", "SCH")
),
sliderInput("year",
"Year of observation:",
min = 2006,
max = 2016,
value = 2014)
),
# Show a plot of the generated distribution
mainPanel(
plotOutput("distPlot")
)
)
)
# Define server logic required to draw a histogram
server <- function(input, output) {
output$distPlot <- renderPlot({
dd <- subset(LUI_factors, Year == input$year & Exploratory == input$exploratory )
plot(LUI ~ EP_PlotID,
data = droplevels(dd),
ylim = c(0,1)
)
})
}
# Run the application
shinyApp(ui = ui, server = server)
The package animation offers the possibility to create animated gifs or videos from your plot, e.g. to create a visual of changin parameters on a model.
Finish your own graph and produce a print output in pdf and jpg.