This workshop covers basic and advanced data handling in R.
We will talk about how to read data into R, check and correct data types, and transform columns, and more advanced topics: how to sort, merge, aggregate and reshape data tables.
The commands and examples are tested for R > v3.0! Please run the following command to ensure that you have all required packages installed:
install.packages(c("getPass", "RCurl", "XML", "plyr", "dplyr", "tidyr", "ggplot2", "reshape2", "data.table"))
Also, a suite of online scripts and data will be required. Please run this block to make them available:
source("https://gist.githubusercontent.com/fdschneider/9f5d044d5091976741ad07dab5439a77/raw/faed4eca5ed142797e3c9431029d02c8d46d39dc/accessBexis.R")
This lesson assumes basic knowledge in R, like assigning objects, calling functions, and looking up help (via `?` and manpages). If you don’t feel comfortable with these topics, please walk through this basic script before the workshop: http://fdschneider.de/r_basics/
The presentation script and exercise can be found at http://fdschneider.de/r_datahandling/!
Please report difficulties or errors on my Github page: https://github.com/fdschneider/r_datahandling/issues.
R data handling script by Florian D. Schneider is licensed under a Creative Commons Attribution 4.0 International License.
Usually, the dataset you want to use was collected in an experimental environment in the field, or in a laboratory. Either via direct input or via a paper notebook, the data ended up in some spreadsheet software, mostly MS Excel or LibreOffice Calc.
Of course, R will be able to read in data tables of any structure. Any data structure can be transferred with a couple of lines of code into any form you require. But you can simplify your life if you decide on a raw data structure that is easy to handle in R. To achieve this, there are a couple of rules of thumb:
Beware of `;` separators between columns and decimal commas, which your spreadsheet software will produce if your system language is German!
Ideally, your raw-data file will have precisely one header row (not two rows, no empty rows!), that contains column names.
The column names should not contain spaces or special characters; use underscores (`_`) instead. I recommend that you use English column names from the start. That makes it easier to publish or share your data.
Keep in mind that there are different types of values which come with different implications for computation (and file size):
Some columns may contain comments or warnings, specific explanations etc. In R these are considered character strings (e.g. “Data missing because of bad weather conditions.”) and are usually not used for data analysis directly.
R can read data tables of many file formats, but the most basic file format is the pure text file (.csv, .txt), where rows are separated by line breaks and columns are separated by a ‘separator’, which is usually a comma (hence .csv = comma separated values).
We cover two-dimensional spreadsheet files here, which in R are handled in ‘data.frame’ objects. More complex file formats like relational databases are beyond this tutorial.
To open a file in R you will need one of the following functions.
rawdata <- read.table("rawdata.txt") # whitespace-separated, no header by default
rawdata <- read.csv2("rawdata.csv")  # ';' as separator, ',' as decimal (German locale)
rawdata <- read.csv("rawdata.csv")   # ',' as separator, '.' as decimal
rawdata <- read.delim("rawdata.tab") # tab-separated
These are alternative functions that do mostly the same, but have different default assumptions about the structure of your text file. For instance, `read.csv2()` assumes `;` as field separator and comma `,` as decimal separator, as for files generated in Office with German language settings. The function parameters allow you to set the separator (`sep`) and decimal delimiter (`dec`), to skip leading lines (`skip`), or to expect a header (`header`). The argument `na.strings` takes a vector of character strings that are to be taken as missing values. So, if your file does not read well into R, it’s likely you have to meddle with these parameters (see `?read.table` for further information).
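For illustration, a call for a hypothetical file with German-locale formatting, one leading title line, and '-' marking missing values could look like this (file name and layout are assumptions):

rawdata <- read.table("rawdata.csv", sep = ";", dec = ",", skip = 1,
                      header = TRUE, na.strings = c("NA", "-", "")) # treat 'NA', '-' and empty fields as missing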
For other file formats you might require additional packages. Most notable are:

- the `readxl` package for files created with MS Excel: provides the functions `read_xls()` and `read_xlsx()`.
- the `foreign` package, which allows you to read data tables exported from other statistical software like SPSS, SAS, Octave (see this comment).

Importing data probably does not work right away. The R environment typically assumes your home directory as the working environment. Since you refer to a file outside of R, you need to tell R which place you are referring to by specifying the path to that file. There are two ways of doing this.
d1 <- read.table("C:/Windows/Users/florian.schneider/Documents/projects/stats/r_datahandling/data/12807/12807.txt") # option 1: the full, absolute path
This can be very long and will not work if you switch to a different computer or share the file with colleagues.
setwd("G:/uni/projects/stats/r_datahandling/") # option 2: set the working directory once ...
d1 <- read.table("data/12807/12807.txt", header = TRUE, sep = "\t") # ... then use paths relative to it
Advanced Note: The simplest way to manage working directories and relative paths is to use RStudio projects. For each project, RStudio sets the working directory to the project root directory. If you open the project on a different computer or move it around on your computer, R will still find the relative paths.
Instead of reading files from your local computer, you may want to refer to files published online, e.g. on Github or figshare. In principle, you can open any text-based file (csv or txt) that is accessible online as a download. Just enter the download link (usually what you get when right-clicking on the link and select “copy link”) as source path.
traits <- read.table("https://datadryad.org/bitstream/handle/10255/dryad.76638/ArthropodSpeciesTraits.txt?sequence=1",
sep = "\t", header = TRUE) # original publication DOI: 10.1038/sdata.2015.13 metadata: https://www.nature.com/articles/sdata201513/tables/2
Advanced Tip: Very often, data packages are compiled into compressed zip-archives. In this case, R can download and extract files by name. This little script allows you to do this. The example accesses data from this publication from the Biodiversity Exploratories:
Völler, Eva; Bossdorf, Oliver; Prati, Daniel; Auge, Harald (2017): Quantitative genetic traits of eight grassland species from three regions of Germany (Schwäbische Alb, Hainich-Dün, Schorfheide Chorin). PANGAEA, https://doi.org/10.1594/PANGAEA.871061
temp <- tempfile(fileext = ".zip")
download.file("https://doi.pangaea.de/10.1594/PANGAEA.871061?format=zip", temp, method = "auto", quiet = TRUE, mode="wb")
unzip(temp, files = "datasets/Voeller_F-rubra.tab", exdir = "data", junkpaths = TRUE ) # parameters give relative path of file within the archive and where to extract to
unlink(temp)
rm(temp)
F_rubra <- read.delim("data/Voeller_F-rubra.tab", skip = 19, encoding = "UTF-8")
BExIS (https://www.bexis.uni-jena.de/) is the central data management platform for the Biodiversity Exploratories. You are supposed to upload your own project data here and you have access to descriptive parameters of the plots and many other project datasets.
An R script by Dennis Heimann and Andreas Ostrowski provides the function `read.service()` (original version as BExIS ID 6080, modified by myself with an interactive password prompt from this link).
Note that you need BExIS access credentials to call this function, as well as permission to download the file from the data owner (request management is done via BExIS). The function takes a BExIS ID and your personal credentials (`user` and `pswd`), as well as the typical parameters required for `read.table()` (i.e. `sep`, `dec`).
LUI_factors <- read.service(19266, dec = ",", user = "yourname", pswd = "yourpassword")
Please enter password in TK window (Alt+Tab)
A couple of shared core datasets that might be useful for your analysis:
For some of these, you may need to ask for permission by the data owner. By downloading files from BExIS you declare consent with the data sharing agreement of the Exploratories!
If you work with large amounts of data, your raw data might be split across separate files or spreadsheet pages. To load them in a single call, you may use a loop. In the following example, we read all files from a specific directory (called “data”):
files <- dir("data", pattern = "\\.csv$") # list all csv files in the directory
filenames <- sub("\\.csv$", "", files) # strip the file extension to get object names
for (i in seq_along(files)) assign(filenames[i], read.csv(file.path("data", files[i]))) # prepend the path, since dir() returns bare file names
This requires, however, that the data are of the same structure, i.e. they use the same delimiters and separators etc.!
Applied to the example above, an extended version of the code will read all files that are included in the zip archive:
temp <- tempfile(fileext = ".zip")
download.file("https://doi.pangaea.de/10.1594/PANGAEA.871061?format=zip", temp, method = "auto", quiet = TRUE, mode="wb")
unzip(temp, exdir = "data") # without a 'files' argument, all files of the archive are extracted to 'data'
unlink(temp)
rm(temp)
library(data.table)
Attaching package: 'data.table'
The following objects are masked from 'package:dplyr':
between, first, last
files <- dir("data/datasets")
filenames <- sub( ".tab", "", files)
voeller <- list() # creating a list object that will take all our individual data tables
for (i in 1:length(files)) {
file <- paste0("data/datasets/",files[i])
temp <- readLines(file) # temporary content of file
skipLine <- which(temp == "*/") #this is a dirty solution to finding the end of the comment section in each file
voeller[[i]] <- read.delim(file, skip = skipLine, encoding = "UTF-8") # read and store each object in the list, while omitting any non-data
}
voeller <- rbindlist(voeller, fill = TRUE )
voeller
Just to complete this section, we cover writing data files that have been created in R to your disk. Any data.frame object can be written into a text file using `write.csv()` or `write.table()` (analogous to `read.csv()` and `read.table()`).
write.csv(LUI_factors, file = "data/LUI_factors.csv", fileEncoding = "UTF-8", row.names = FALSE )
write.csv(voeller, file = "data/voeller.csv", fileEncoding = "UTF-8", row.names = FALSE )
By default the function writes row names as well. In our case they are only numeric and don’t contain information, so we skip them.
Saving output after certain steps of computation can be useful to avoid repeating computation-heavy steps during your data processing. Just save the compiled data, means, or intermediate statistics in script 1 and build on top of this in script 2.
A note on file encoding: there are many standards for text file encodings, i.e. the mappings of bits and bytes to actual letters. The UTF-8 standard is the most universal and widely accepted encoding. If you want to publish your data on a repository, it is recommended to choose this encoding. If you want to learn more about encodings, start with this Wikipedia article and check out `?file`.
Now you have the content of the file available in R as an object, more specifically as a ‘data.frame’.
Dataframes are two-dimensional tabular structures. They can easily be accessed by column and row. Usually the columns are named by a column header, and the rows are numbered.
In most cases, tables can be interpreted as observations (the rows) on a fixed number of variables (the columns). Each observation contains a set of defined metrics, one for each variable of interest. In ecological data, variables can usually be further distinguished into response variables and descriptive explanatory variables or cofactors. Explanatory variables also comprise information about the spatio-temporal context of the measurement (e.g. the plot or year).
Technically, each column is a vector of the same length (= number of rows), while each vector can have a different content type, i.e. numbers, factors, characters. This is the main difference from matrices, which are homogeneous in content. In contrast, list objects may have different vector lengths for each entry.
The most direct way to access a column in a data frame is the `$` operator:
hist(LUI_factors$TotalGrazing)
We cover alternative ways of accessing data further below.
You can have a look at how the data are structured using the functions `str()` or `head()`:
str(LUI_factors) # returns the type of data for each column of the table
'data.frame': 1650 obs. of 8 variables:
$ Year : int 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
$ Exploratory : Factor w/ 3 levels "ALB","HAI","SCH": 1 1 1 1 1 1 1 1 1 1 ...
$ EP_PlotID : Factor w/ 150 levels "AEG1","AEG10",..: 1 12 23 34 45 47 21 22 24 25 ...
$ isVIP : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 1 1 1 1 ...
$ isMIP : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 1 1 2 2 ...
$ TotalGrazing : num 0 0 0 111 114 ...
$ TotalMowing : int 2 3 2 1 1 1 0 1 0 0 ...
$ TotalFertilization: num 50 196.9 64.6 50 50 ...
This shows you the overall dimensions of the table, as well as the column name, type of values and number of factor levels for each column. You see that R recognised numerical, integer and factorial columns. We’ll discuss below how and whether this is a correct interpretation.
head(LUI_factors)
This returns the first 6 rows of a dataframe by default, and is ideal for scanning the structure of the data if the tables are not too wide.
For a more detailed summary, you may call
summary(LUI_factors)
Year Exploratory EP_PlotID isVIP isMIP
Min. :2006 ALB:550 AEG1 : 11 no :1353 no :825
1st Qu.:2008 HAI:550 AEG10 : 11 yes: 297 yes:825
Median :2011 SCH:550 AEG11 : 11
Mean :2011 AEG12 : 11
3rd Qu.:2014 AEG13 : 11
Max. :2016 AEG14 : 11
(Other):1584
TotalGrazing TotalMowing TotalFertilization
Min. : 0.0 Min. :0.000 Min. : 0.00
1st Qu.: 0.0 1st Qu.:0.000 1st Qu.: 0.00
Median : 60.0 Median :1.000 Median : 0.00
Mean : 133.8 Mean :1.023 Mean : 29.87
3rd Qu.: 172.1 3rd Qu.:2.000 3rd Qu.: 47.23
Max. :1644.2 Max. :4.000 Max. :433.00
If applied to a data.frame, this function produces basic stats for each column, like means and quantiles of numerical values, and frequencies of factor levels. Obviously, these statistics are not always meaningful, e.g. in case of Year.
Further convenient functions for data.frames are

- `names()`: gives you a vector of the column names
- `dim()`: gives you the number of rows and columns of your data.frame

When reading data from a file, R ‘guesses’ the data type from the values it finds within a column. Often, this is not precisely what you want and you need to correct data types. Also, sometimes R needs further input to interpret variable values correctly.
A quick recap of data types in R: there are different types of data that R can handle. The following objects have different specifications.
x <- c(3.2323, 4.5723, 3.8356, 8.2301, 4.4285)
str(x)
num [1:5] 3.23 4.57 3.84 8.23 4.43
v <- 1:6
str(v)
int [1:6] 1 2 3 4 5 6
b <- c(FALSE, TRUE, FALSE, FALSE)
str(b)
logi [1:4] FALSE TRUE FALSE FALSE
n <- c("Agriphila_tristella", "Agriphila_tristella", "Idaea_aversata", "Agriphila_straminella", "Chiasmia_clathrata")
str(n)
chr [1:5] "Agriphila_tristella" "Agriphila_tristella" ...
f <- as.factor(c("A", "A", "B", "C", "A", "B", "B", "C", "C"))
str(f)
Factor w/ 3 levels "A","B","C": 1 1 2 3 1 2 2 3 3
This becomes very important e.g. for statistical modelling, since the types prompt different kinds of analysis or plotting. The types also require different amounts of space in the computer memory,1 which is useful to understand for modelling purposes and computationally intensive tasks.
data type | usage | bits per value
---|---|---
“double” | for non-integer, floating point numbers, percentages etc. | 64
“integer” | for count data, for factorial data | 32
“logical” | for boolean TRUE/FALSE values, for binary data (e.g. survival) | 1 (in principle)
“character” | for strings of characters, written information, non-repeating words | 8 per character, i.e. up to many thousand
“factor” | categorical values, like experimental blocks, treatments, etc.; technically not a type, but a class, but never mind (stored as integer numbers, encoded as factor levels) | 32
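You can inspect these footprints with `object.size()` (a minimal sketch; the reported sizes include some object overhead and vary by platform):

object.size(1:100000) # integer vector, roughly 4 bytes (32 bits) per value
object.size(as.numeric(1:100000)) # double vector, roughly 8 bytes (64 bits) per value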
Advanced R: The package `readr` provides improved functions for guessing the column types in heterogeneous data. It does a much better job interpreting logical and numerical values, or date and currency vectors. The output is a ‘tibble’, a modern variant of the data.frame.
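A minimal sketch, assuming a csv file "rawdata.csv" exists in the working directory:

library(readr)
rawdata <- read_csv("rawdata.csv") # guesses and reports column types on import
spec(rawdata) # inspect the guessed column specification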
It is quite common that integer numbers are interpreted as integer, but it would be more useful to consider them categorical, e.g. in the case of years. The order of the years is not important in the first place (seasons are highly variable), and if the time series misses a year, the gap should not be interpreted as twice the distance between data points.
str(LUI_factors$Year)
int [1:1650] 2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
LUI_factors$Year <- as.factor(LUI_factors$Year)
str(LUI_factors$Year)
Factor w/ 11 levels "2006","2007",..: 1 1 1 1 1 1 1 1 1 1 ...
Note that this is indeed a re-assignment (using the assignment arrow `<-`) that overwrites the old variable ‘Year’ with the new, corrected vector! We will use this throughout the next couple of examples.
Alternatively, you might want years to be encoded as dates. Date and time handling in R is a tricky task and not covered in depth here. But for a start, check out the manual for the `as.Date()` function, as well as the format coding described under `?strptime`.
str(d1$date)
Factor w/ 24 levels "7/22/2009 12:00:00 AM",..: 6 9 20 23 6 9 20 23 9 20 ...
d1$date <- as.Date(d1$date, format = "%m/%d/%Y %I:%M:%S %p")
str(d1$date)
Date[1:1647], format: "2009-07-27" "2009-07-30" "2009-08-05" "2009-08-08" "2009-07-27" ...
In factorial/categorical variables, the factor levels must be coherent, otherwise R will interpret them as two or more independent levels. This is a dangerous error that often goes unnoticed in big datasets. Start by checking the levels of your factorial variables.
levels(d1$plant_genus) # shows the vector of levels found in your data
[1] "Achemilla" "Achillea" "Agrimonia" "Alchemilla"
[5] "Anthriscus" "Anthyllis" "Arctium" "Asperula"
[9] "Bellis" "Betonica" "Campanula" "Carduus"
[13] "Carlina" "Carum" "Centaurea" "Centaurium"
[17] "Cerastium" "Cichorium" "Cirsium" "Colchicum"
[21] "Convolvulus" "Crepis" "Daucus" "Dipsacus"
[25] "Euphorbia" "Euphrasia" "Galium" "Genista"
[29] "Geranium" "Helianthemum" "Heracleum" "Hieracium"
[33] "Hypericum" "Knautia" "Lamium" "Lathyrus"
[37] "Leontodon" "Leucanthemum" "Linum" "Lotus"
[41] "Lychnis" "Matricaria" "Medicago" "Melilotus"
[45] "Minuartia" "Myosotis" "Odontites" "Ononis"
[49] "Origanum" "Pastinaca" "Pimpinella" "Plantago"
[53] "Polygonum" "Potentillla" "Prunella" "Ranunculus"
[57] "Rhinanthus" "Rumex" "Scabiosa" "Senecio"
[61] "Silene" "Stachys" "Tanacetum" "Taraxacum"
[65] "Teucrium" "Thymus" "Trapogon" "Trifolium"
[69] "Vicia"
subset(d1, plant_genus %in% c("Alchemilla", "Achemilla"))
Obviously, the genus “Achemilla” does not exist. It’s a typo of ‘Alchemilla’. What we need is to reunite the factor level ‘Achemilla’ with ‘Alchemilla’. The graceful solution requires `revalue()` from package `plyr`:
require(plyr)
d1$plant_genus <- revalue(d1$plant_genus, replace = c(Achemilla = "Alchemilla"))
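A base-R alternative without plyr achieves the same by renaming the level directly (assigning a duplicate level name merges the two levels):

levels(d1$plant_genus)[levels(d1$plant_genus) == "Achemilla"] <- "Alchemilla" # merge the typo level into the correct one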
Advanced tip: such errors sometimes are not just typos but the result of lacking standardization. Especially with taxonomic names, you will need to figure out synonyms. The package `taxize` helps to look up taxon names on online web services and map them to their accepted taxon name.
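A hypothetical lookup might look like this (assuming the taxize package is installed and the Global Names Resolver web service is reachable; the interface may differ between package versions):

library(taxize)
gnr_resolve("Achemilla vulgaris") # returns close matches, here the accepted name 'Alchemilla vulgaris'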
By default, the factor levels are ordered alphabetically. E.g.
head(traits$Stratum_use_short)
[1] t h h u u u
Levels: g h s t u w
Here, `g` stands for ground-dweller, `h` for herb layer, `s` for soil-dweller, `t` for shrub & tree layer, `u` for unspecific, and `w` for water-bound. For a meaningful interpretation of the vertical stratum, an ecologically informed re-assignment of the order might be useful.
traits$Stratum_use_short <- factor(traits$Stratum_use_short, levels = c("s", "g", "h", "t", "w", "u"))
head(traits$Stratum_use_short)
[1] t h h u u u
Levels: s g h t w u
Any factor levels of the original vector or dataframe will be preserved if you use `[]` or `subset()`. If you want to eliminate unused factor levels, you should apply the function `droplevels()`, e.g. if you want to plot figures from the data that are based on the factor levels:
f
[1] A A B C A B B C C
Levels: A B C
f[f != "A"]
[1] B C B B C C
Levels: A B C
droplevels(f[f != "A"])
[1] B C B B C C
Levels: B C
Most common is the case where logical values are encoded in user-defined terms, like ‘yes’/‘no’ or ‘1’/‘0’. Only if levels are encoded in the file as one of c(“T”, “TRUE”, “True”, “true”) or c(“F”, “FALSE”, “False”, “false”) does R assume they are logical. Pure integer vectors of 1 and 0 can also be converted into TRUE and FALSE of a logical vector. In any other case, you need to translate your own encoding: first replace the labels, then convert the vector to a logical vector.
str(LUI_factors$isVIP)
Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 1 1 1 1 ...
LUI_factors$isVIP <- factor(LUI_factors$isVIP, levels = c("yes", "no"), labels = c(TRUE, FALSE)) # 'levels' lists the old labels in order, 'labels' gives the new labels
LUI_factors$isVIP <- as.logical(LUI_factors$isVIP) # as.logical() is able to interpret the character labels "TRUE" and "FALSE"
str(LUI_factors$isVIP)
logi [1:1650] TRUE TRUE TRUE TRUE TRUE TRUE ...
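As a one-step alternative to the two-step conversion above, you can compare the original factor against the label that should map to TRUE:

LUI_factors$isVIP <- LUI_factors$isVIP == "yes" # TRUE wherever the original label was "yes"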
Sometimes downstream code or a function expects a column to have a certain name, or you want to fix a typo in the original column header, or the data you loaded have cryptic column names that you want to replace with something meaningful.
The function `names()` allows you to access and edit the column names of a data frame.
names(voeller)
[1] "Event" "Latitude" "Longitude"
[4] "Plot" "DOY..day." "Plant.h..cm."
[7] "Veg.biom.tot..g." "Panicle...." "Biom.dm..g."
[10] "Repro.allocation.r" "Caps...." "Umbel...."
[13] "Inflores...."
names(voeller) <- c("eventID", "latitude", "longitude", "EP_PlotID", "day_of_year", "plant_height", "total_biomass", "panicle", "biomass_dry", "reproductive_allocation_ratio", "n_capsules", "n_umbel", "n_inflorescences")
voeller
You may want to keep your data table tidy and strip all columns that are not needed for your analysis. There are a couple of ways of deleting unwanted columns (negative subsetting).
# replacing the original table with one that is reduced by the unwanted column (negative subsetting)
traits <- subset(traits, select = c(-Endophagous_lifestyle))
# eliminating the column by setting it to value NULL
traits$Stratum_use <- NULL
Alternatively, select the columns you want to keep (positive subsetting):
traits_tidy <- subset(traits, select = c(SpeciesID, Body_Size, Feeding_guild_short, Stratum_use_short))
traits_tidy
This obviously can also be used to re-order columns in your data frame.
Sometimes, some entries in your dataset are flagged as invalid or flawed, or are just empty rows. In those cases, delete them before you start doing any analysis.
d1 <- subset(d1, plant_species != "") # deletes any rows with empty entries in column plant_species
In the traits data, a `*` in column `Remark` flags species that occur in grasslands neither obligatorily nor facultatively. Let’s say we want to skip those.
traits <- subset(traits, Remark != "*")
We go into more depth on selecting rows and columns in the next section of the workshop.
After fixing your data, you should not forget to save your edits and make a note in the README file. You should also keep the entire script of this editing process for reasons of reproducibility.
write.csv(LUI_factors, file = "data/LUI_factors.csv", fileEncoding = "UTF-8", row.names = FALSE )
write.csv(voeller, file = "data/voeller.csv", fileEncoding = "UTF-8", row.names = FALSE )
These are the data you will rely on from now on.
Before you do any calculations with your raw-data, you should make sure that the R-object correctly reflects the structure of your data. It is recommended to keep a script of these steps that allows you to re-run the corrections any time you change your data.
We are using data from the open data section of the Biodiversity Exploratories (https://www.bexis.uni-jena.de/PublicData/BrowsePublicData.aspx).
Exercise: read the example datasets into R and fix them. The steps are numbered in the solution below: (1) read the data files, (2) convert the date columns to class Date, (3) harmonize the factor levels of the exploratory columns, (4) fix typos and synonyms in the plant genus names, and (5) write the corrected data to a directory of processed files.
Advanced:
Solution:
d1 <- read.table("data/12807/12807.txt", sep = "\t", header = TRUE )
d2 <- read.table("data/13426/13426.txt", sep = "\t", header = TRUE )
d3 <- read.table("data/12967/12967.txt", sep = "\t", header = TRUE )
#2.
d1$date <- as.Date(d1$date, format = "%m/%d/%Y %I:%M:%S %p")
d2$date <- as.Date(d2$date, format = "%m/%d/%Y %I:%M:%S %p")
#3.
d1$exploratory <- factor(d1$exploratory, levels = c("Schorfheide", "Hainich", "Alb"), labels = c("SCH","HAI", "ALB"))
# Note that in the 2009 dataset, the Schorfheide is missing originally. This adds a third level that wasn't there before!
d2$exploratory <- factor(d2$exploratory, levels = c("Schorfheide", "Hainich", "SchwaebischeAlb"), labels = c("SCH","HAI", "ALB"))
# note the difference in nomenclature of Schwaebische Alb.
#4.
d1$plant_genus <- revalue(d1$plant_genus, replace = c(Achemilla = "Alchemilla"))
# any other synonyms found?
#5.
if(!dir.exists("data/processed/")) dir.create("data/processed/")
# note that any sub-folder you are writing to must exist! This line checks if it does and otherwise creates it.
write.csv(d1, file = "data/processed/flowers2009.csv", row.names = FALSE)
write.csv(d2, file = "data/processed/flowers2008.csv", row.names = FALSE)
write.csv(d3, file = "data/processed/diversity2008.csv", row.names = FALSE)
# you might want to add a README file in "data/processed" that explains the origin and processing for these files.
Now, before you start analysing your data, you may have to modify some of the original raw data into derived statistics. This section covers operations that affect single columns in your data.frame.
levels(traits$Stratum_use_short)
[1] "s" "g" "h" "t" "w" "u"
For many data analyses, you want to classify continuous variables, e.g. into high and low categories. You can do that by using fixed thresholds and the `cut()` function.
hist(LUI_factors$TotalGrazing)
LUI_factors$Grazing_categorical <- cut(LUI_factors$TotalGrazing, breaks = c(0,100,1000,10000), labels = c("low", "normal", "high"))
plot(LUI_factors$Grazing_categorical, main = "number of plots with grazing pressure ...")
Very often in ecology, you will have to normalize measured values to make different variables comparable. If you need to apply a normalization (i.e. all variation falls between 0 and 1) to an entire column, this is as simple as dividing the vector by its maximal value.
LUI_factors$Grazing_standardized <- LUI_factors$TotalGrazing / max(LUI_factors$TotalGrazing)
hist(LUI_factors$Grazing_standardized)
An alternative option is to standardize values around 0 while scaling all variation to a standard deviation of 1 (this is often done before multivariate analysis). For this task there is the function `scale()`; by hand, the calculation looks like this:
LUI_factors$Grazing_standardized <- (LUI_factors$TotalGrazing - mean(LUI_factors$TotalGrazing)) / sd(LUI_factors$TotalGrazing)
hist(LUI_factors$Grazing_standardized)
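The same can be achieved with `scale()`, which returns a one-column matrix, hence the wrapping in `as.numeric()`:

LUI_factors$Grazing_standardized <- as.numeric(scale(LUI_factors$TotalGrazing)) # center to mean 0, scale to sd 1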
However, you often want a standardization applied within groups of some other variable, for instance if you need measured values to be comparable between regions. For this, it is necessary to create a normalization vector that depends on the grouping variable. In base R, this is rather tedious. The key function here is `tapply()`, which applies a function to the data grouped by an index.
tapply(LUI_factors$TotalGrazing, LUI_factors$Exploratory, max)
ALB HAI SCH
1225.597 1644.167 1428.614
max_vals <- tapply(LUI_factors$TotalGrazing, LUI_factors$Exploratory, max)[LUI_factors$Exploratory] # indexing the named vector of group maxima by the grouping variable repeats the right maximum for every row
LUI_factors$Grazing_standardized <- LUI_factors$TotalGrazing / max_vals
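A more compact base-R alternative is `ave()`, which returns the group statistic aligned with each row of the data:

LUI_factors$Grazing_standardized <- LUI_factors$TotalGrazing /
  ave(LUI_factors$TotalGrazing, LUI_factors$Exploratory, FUN = max) # divide each value by its region's maximum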
Functions that help in adding new columns to your data.frame are `transform()` and its advanced version `mutate()` from the package plyr. They allow you to specify multiple new columns or alter existing columns in one call.
LUI_factors <- transform(LUI_factors,
  Grazing_standardized = TotalGrazing / max(TotalGrazing),
  Fertilization_standardized = TotalFertilization / max(TotalFertilization),
  Mowing_standardized = TotalMowing / max(TotalMowing)
)
Note that with transform() you can’t use the new columns in computations of other new columns; you would need a second call of transform(). The `mutate()` function adds this capacity while also being much faster (which may become important with big datasets):
library(plyr)
LUI_factors <- mutate(LUI_factors,
  Grazing_standardized = TotalGrazing / max(TotalGrazing),
  Fertilization_standardized = TotalFertilization / max(TotalFertilization),
  Mowing_standardized = TotalMowing / max(TotalMowing),
  LUI = (Grazing_standardized + Fertilization_standardized + Mowing_standardized) / 3
)
Again, it is important to save these steps of preparation for your work. But don’t touch your raw data!!! Instead, you are now working with processed data that you want to store separately. You could change their filename or put them in a separate directory.
write.csv(LUI_factors, file = "data/processed/LUI_factors.csv", fileEncoding = "UTF-8", row.names = FALSE )
write.csv(voeller, file = "data/processed/voeller.csv", fileEncoding = "UTF-8", row.names = FALSE )
When it comes to analysis, you need to use the content of your data.frame for your further calculations. Most beginners’ guides describe `attach()` and `detach()` to access data within a table.
attach(traits)
plot(Dispersal_ability ~ Body_Size)
detach(traits)
This clutters the R environment with a lot of invisible objects. People often forget to detach data frames, and then objects are difficult to handle and to remove. Also, if you modify those objects, the changes will not affect the master table, which remains in the state it was in before it was attached. Using `attach()` is considered bad practice!
Rather, you should access content directly. The easiest way is the `$` selector.
lm(d3$heteroptera_simpson ~ d3$coleoptera_simpson)
Or, if you need to access the data many times, it is more convenient to extract what you need into a shorthand object.
y <- LUI_factors$Year
grazing <- LUI_factors$TotalGrazing
lm(grazing ~ y)
Best practice!: Keep the table structure of your data object intact and tell the functions where to find the columns using the `data` parameter, which is available in many statistical and plotting functions:
lm(TotalMowing ~ TotalFertilization + Exploratory, data = LUI_factors)
Advanced: You could also use the function `with()`, which creates a temporary environment where the content of your data table is available:
with(d3, lm(heteroptera_richness ~ coleoptera_richness))
The `with()` function is very powerful since it allows you to execute any other function, even very complex calls, on a data.frame.
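For instance, a grouped summary evaluated inside the data.frame (using columns of the LUI data from above):

with(LUI_factors, tapply(TotalGrazing, Exploratory, mean)) # mean grazing intensity per region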
A very powerful tool are the square brackets `[]`, which allow you to select columns by name or by position and let you create a subset of columns.
head(LUI_factors[,c("Year","EP_PlotID", "TotalGrazing")])
In complex datasets, you often want to exclude parts of the data before you go on with your analysis, or plot only a subset based on a condition, e.g. to show the data of only one treatment and not the other.
The square brackets `[]` can also be used to select rows of a vector or a data frame.
To inspect particular rows of data, e.g. to check a replicate that you suspect was methodologically flawed or an extreme outlier, the simplest way is to do
LUI_factors[23:25,]
Or you might want to exclude a sequence of rows (negative subsetting).
data1 <- rawdata[-c(42:64),]
Thus, the first selector within `[]` defines the rows, the second defines the columns! This allows us to create very compact data extracts.
s1 <- d1[d1$plant_species %in% c("Plantago lanceolata") , c("inflorescences_qcm","date")]
head(s1)
s2 <- d1[d1$EP_PlotID %in% c("HEG24"), c("plant_species","inflorescences_qcm", "date")]
head(s2)
In most cases, the rows you want to exclude are loosely mixed into your dataset and you want to select rows by a condition. Conditionals are based on ‘logical expressions’, a concept common to all programming languages:
- `A == B`: A equals B
- `A != B`: A is unequal to B
- `A < B`, `A > B`: A is less than B, A is greater than B
- `A <= B`, `A >= B`: A is less than or equal to B, A is greater than or equal to B
- `A %in% B`: the value in A is contained in vector B

Thus, if A and B are numerical values, the expression returns a logical value: `TRUE` or `FALSE`. If the logical operator compares a value and a vector, it returns a logical vector!
1:12 > 6
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[12] TRUE
f <- rep(c("A","B", "C"), each = 3); f
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"
f == "A"
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
6 %in% 1:5
[1] FALSE
c("SCH", "HAI", "ALB") %in% c("HAI")
[1] FALSE TRUE FALSE
These logical vectors can be used to select rows from a data table.
# create an example data.frame
d <- data.frame(plot = f, y = c(1,1,1.4,1.2,1.3,1.2,1.4,1.2,1.1))
d
d[d$plot == "C",]
d[d$y >= 1.3,]
An alternative method is to use the `subset()` function.
subset(d, plot == "C")
subset(d, y >= 1.3)
We continue to work with the data of the Land Use Intensity index:
1. Read the data from BExIS (or from the data package) and compute the LUI.
2. Create a subset of all VIP plots.
3. Extract a list of EP_PlotIDs (i.e. a vector of all IDs).
4. Create a subset of all plots with high grazing (i.e. more than the median value).
5. Eliminate all data from before 2008.
6. Summarize/plot the amount of fertilization in each of the three regions across all years.
#1. Read data from BExIS (or from the data package) and compute LUI.
LUI_factors <- read.service(19266, dec = ",", user = "yourname", pswd = "yourpassword")
LUI_factors <- mutate(LUI_factors,
Grazing_standardized = LUI_factors$TotalGrazing / max(LUI_factors$TotalGrazing),
Fertilization_standardized = LUI_factors$TotalFertilization / max(LUI_factors$TotalFertilization),
Mowing_standardized = LUI_factors$TotalMowing / max(LUI_factors$TotalMowing),
LUI = (Grazing_standardized + Fertilization_standardized + Mowing_standardized) / 3
)
#2. Create a subset of all VIP plots
vip <- subset(LUI_factors, isVIP)
#3. extract a list of EP_PlotIDs (i.e. a vector of all IDs)
vip$EP_PlotID <- droplevels(vip$EP_PlotID)
levels(vip$EP_PlotID)
# or
levels(LUI_factors$EP_PlotID[LUI_factors$isVIP, drop = TRUE])
#4. create a subset of all plots with high grazing (i.e. more than median value)
highgrazing <- subset(LUI_factors, TotalGrazing >= median(LUI_factors$TotalGrazing))
#5. eliminate all data from before 2008.
LUI_after2007 <- subset(LUI_factors, Year %in% as.character(2008:2016) )
#6. summarize/plot amount of fertilization in each of the three regions across all years
par(mfrow = c(1,3))
plot(TotalFertilization ~ Year, ylim = c(0,420), data = subset(LUI_factors, Exploratory == "HAI") )
plot(TotalFertilization ~ Year, ylim = c(0,420), data = subset(LUI_factors, Exploratory == "ALB") )
plot(TotalFertilization ~ Year, ylim = c(0,420), data = subset(LUI_factors, Exploratory == "SCH") )
A noteworthy package for this task is the tidyr package by Hadley Wickham, which provides a powerful toolset to handle data more easily. It inherits functionality from the earlier reshape2 package, complements the dplyr package, and offers tools to reformat data frames in complex ways.
To order the rows of your data.frame by a certain column, or a hierarchy of columns, in base R you have to use the square brackets with a sorting condition (note that the old row numbers are preserved as row names).
head(LUI_factors[order(LUI_factors$EP_PlotID, LUI_factors$Year),], 10)
The dplyr package provides the `arrange()` function, which is slightly more intuitive.
head(arrange(LUI_factors, EP_PlotID, desc(Year)), 10)
Note that you can apply descending ordering by wrapping the variable name in `desc()`.
If two data.frames have exactly the same variables in the same order, it is easy to combine them using `rbind()`. Similarly, `cbind()` allows you to combine data with the same number of rows.
d
e <- data.frame(ID = 1:9, year = rep(2008:2010, times = 3)); e
ed <- cbind(e, d)
rbind(d, c("C", 1.4))
Note that you must make sure yourself that the data are aligned, i.e. that they have the same length and meaning per column or row!
If you want to bind data.frames with different column content, you may want to use the `rbindlist()` function from the package data.table. It fills columns that are not present in all data.frames with NAs.
library(data.table)
A_elatius <- read.delim("data/datasets/Voeller_A-elatius.tab", skip = 19, encoding = "UTF-8")
B_hordeaceaus <- read.delim("data/datasets/Voeller_B-hordeaceus.tab", skip = 20, encoding = "UTF-8")
rbindlist(list(A_elatius, B_hordeaceaus), fill = TRUE )
An important function when working with multiple data.frames is `merge()`. It allows you to combine two tables with different columns by matching the rows by a common variable, i.e. one that occurs both in table `x` and in table `y`. The rows in the shorter table will be repeated to match the rows in the longer one.
ed
p <- data.frame(id = c("A", "B", "C"), value = c(425,754,542)); p
edp <- merge(ed, p, by.x = "plot", by.y = "id"); edp
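By default, `merge()` keeps only rows that find a match in both tables. With `all.x = TRUE`, unmatched rows of the first table are kept and filled with NA (shown with a hypothetical lookup table `p2` that lacks plot "C"):

p2 <- data.frame(id = c("A", "B"), value2 = c(1.1, 2.2))
merge(ed, p2, by.x = "plot", by.y = "id", all.x = TRUE) # rows for plot "C" are kept, value2 becomes NA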
You might want to reduce a long-table dataset to a summarized version that pools multiple observations into aggregated statistics. An example could be observations from multiple sampling events that you want to sum into a single value.
edp
by(edp[,c("y", "value")], edp$plot, colMeans)
edp$plot: A
y value
1.133333 425.000000
--------------------------------------------------------
edp$plot: B
y value
1.233333 754.000000
--------------------------------------------------------
edp$plot: C
y value
1.233333 542.000000
A little trick is necessary to compile this into a nice summary table.
do.call( rbind ,by(edp[,c("y", "value")], edp$plot, colMeans))
y value
A 1.133333 425
B 1.233333 754
C 1.233333 542
Advanced note: The function `do.call()` is very handy to forward a list of arguments to a function. In this case, the function `rbind()` takes the homogeneous list of results and concatenates them row-wise. You still would have to add any co-variates that you eliminated before calling `by()`.
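A base-R alternative that keeps the grouping variable as a regular column is `aggregate()` with its formula interface:

aggregate(cbind(y, value) ~ plot, data = edp, FUN = mean) # group means of y and value per plot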
This task can be very tedious in base R with more complex and non-numeric data. The plyr package, with its `ddply()` function, is designed for this purpose.
library(plyr)
ddply(edp, .(plot), summarize, y = mean(y), value = mean(value))
Here, the logic is: split the data.frame by the grouping variables given in `.()`, then apply the function `fun` with the subsequent arguments. In this case (and in most cases you will encounter) this is the very powerful function `summarize()`. It applies any summary statistics or factor operations to the dataframe and returns them in a table. It works similarly to the function `mutate()` that we discussed above.
It is very common that you need to transfer data from a long-table format (an observation table) to a matrix or wide-table format, for example if you want to show all plot replicates grouped by year.
edp
reshape(edp,
timevar = c("plot"), #the variable to group for (i.e. the new columns)
v.names = c("value"), # the variable to report, could be multiple columns
idvar = c("year"), # the variable to group by (i.e. the new rows)
direction = "wide",
drop = c("y","ID") ) # this must be dropped from the output, because it doesn't match the new table
A better function to perform this is the `dcast()` function from the reshape2 package, which uses a formula syntax (y ~ x means: “show y depending on x”).
library(reshape2)
Attaching package: 'reshape2'
The following objects are masked from 'package:data.table':
dcast, melt
The following object is masked from 'package:tidyr':
smiths
dcast(edp, plot ~ year, value.var= "value")
The most advanced package for this is tidyr, which provides the function `spread()`:
library(tidyr)
spread(subset(edp, select = c(-y, -ID)), # the input here has to be a clean table with only key, value and grouping variable
key = year,
value = value ) # note that the tidyr functions know how to handle variables without putting them in quotes!
The opposite case may be that you find multiple measurements grouped per taxon or individual, as is often the case for trait data. In this case you want to convert the wide table format into a long table. The base function `stack()` produces a simple two-column output.
stack(traits, select = c("Body_Size", "Dispersal_ability", "Feeding_guild_short", "Feeding_mode", "Feeding_specialization", "Feeding_plant_part", "Stratum_use_short"))
Warning in stack.data.frame(traits, select = c("Body_Size",
"Dispersal_ability", : non-vector columns will be ignored
This drops all non-target variables (like the taxon names). A better choice is the `reshape()` function, but it needs quite a lot of arguments to get it right:
reshape(
subset(traits, select = c(-Suborder, -Order, -Author, -Feeding_guild, -Feeding_tissue)), #eliminating the unwanted columns from the output
idvar = "SpeciesID", # name of the column to iterate
ids = traits$species, # vector of the sorting variable
varying = list(value = c("Body_Size", "Dispersal_ability","Feeding_guild_short",
"Feeding_mode", "Feeding_specialization", "Feeding_plant_part", "Stratum_use_short")), # list of variable to stack
times = c("Body_Size", "Dispersal_ability","Feeding_guild_short", "Feeding_mode",
"Feeding_specialization", "Feeding_plant_part", "Stratum_use_short"), # label of the stacked variable in output
timevar = "trait", # column name of the stacked variable label
direction = "long") # parameter for conversion from wide to long
The `melt()` function from the reshape2 package is the more powerful and efficient option:
library(reshape2)
melt(traits, id.vars = c("Order", "SpeciesID"),
measure.vars = c("Body_Size", "Dispersal_ability", "Feeding_guild_short", "Feeding_mode",
"Feeding_specialization", "Feeding_plant_part", "Stratum_use_short"))
Warning: attributes are not identical across measure variables; they will
be dropped
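For completeness, a sketch of the tidyr counterpart `gather()`, with the same column choice as in the melt() call above (non-identical column types are coerced here, too):

library(tidyr)
gather(traits, key = "trait", value = "value",
       Body_Size, Dispersal_ability, Feeding_guild_short, Feeding_mode,
       Feeding_specialization, Feeding_plant_part, Stratum_use_short)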
Exercise (the numbers correspond to the comments in the solution below):
1. Merge the land-use levels of dataset LUI_factors (LUI or total factor levels) to the flower data tables.
2. Combine the flower datasets of 2008 and 2009 (using `rbind()`) and add a column for ‘Year’.
3. Write the combined data to a csv file.
4. Summarize the inflorescence counts of the repeated survey events (per species, plot and year).

Advanced:
5. Reshape the flower dataset to a wide-format table (containing the species as variables and the plots as observations).

Solution:
#1. merge the land-use levels of dataset LUI_factors (LUI or total factor levels) to the data table.
flowerdata2008 <- merge(d2, subset(LUI_factors, Year == "2008", select = c("EP_PlotID", "LUI")), by = "EP_PlotID")
flowerdata2009 <- merge(d1, subset(LUI_factors, Year == "2009", select = c("EP_PlotID", "LUI")), by = "EP_PlotID")
#2. combine the flower datasets of 2008 and 2009
flowerdata <- rbind(flowerdata2009,
subset(flowerdata2008, select = names(flowerdata2009))
)
# to add a column for year, we reformat the date column
flowerdata <- mutate(flowerdata, Year = as.factor(format(date, "%Y")))
#alternatively, you can specify a vector manually.
# 3. Write to a csv file
write.csv(flowerdata, file = "data/processed/flowerdata.csv", row.names = FALSE)
# 4. summarize the inflorescence counts of the repeated survey events (per species and plot)
d1_sums <- ddply(d1, # the original data set
.(EP_PlotID, plant_species), # the columns to use as index
summarize, # the function to apply, see ?summarize
inflorescences_qcm = sum(inflorescences_qcm), # the statistic to apply
exploratory = first(exploratory) # a column to maintain (since it maps 1:1 onto EP_PlotID, we only need to keep the first value)
)
d1_sums
# 5. reshape the flower dataset to a wide-format table (containing the species as variables and the plots as observations)
# using base function reshape()
d1_wide <- reshape(d1_sums, timevar = "plant_species", v.names = c("inflorescences_qcm"), idvar = c("EP_PlotID", "exploratory"), direction = "wide")
d1_wide
# using spread() of package tidyr
d1_wide <- spread(d1_sums, plant_species, inflorescences_qcm)
d1_wide
I already said that documentation of your data is important. That should include how the data were produced methodologically (e.g. your field set-up and method of observation), and also how they are stored in the data format (e.g. column names, units, encoding definitions and nomenclature standards).
As part of the Biodiversity Exploratories project, you are supposed to publish your data on BExIS. When uploading, you are asked for all this information.
This is very time consuming and adds extra stress to your schedule if you only start to assemble all this information at the end of your project/PhD.
Also, BExIS does not yet fulfil the criteria of a gold Open Access standard. Journals and funding agencies increasingly require that the data be published and accessible to the readers of your paper. Most journals today point to public repositories like Dryad or Figshare, which provide your data with a permanent address on the internet, a Digital Object Identifier (DOI). Those repositories also allow you to choose from a set of open access licenses that protect your authorship rights over the data (e.g. Creative Commons Attribution 4.0 International).
These repositories have a more general structure for data documentation/metadata that might not capture all the ecological detail necessary to interpret the data. It might be necessary to add extra metadata files.
Therefore, you should think in terms of publication and archiving from the beginning!
That means:
As soon as you’re ready to share your data (the sooner the better!)
We learned today how to read data into R, check and correct data types and factor levels, transform columns, and select rows and columns; and more advanced: how to sort, merge, aggregate and reshape data tables.
A piece of good advice: it is important to keep documentation of your process of data analysis by using R scripts with inline comments and README files! This makes your data fit for archiving and for publication from the start.
The central take-home message: use Excel and other spreadsheet software to enter your raw data into the computer, but R is the tool for everything else: arranging data, cleaning data, filtering data, producing aggregate statistics, etc.
The following parts of this workshop will continue with data analysis, statistics and plotting in R.
1. Objects that contain data of mixed types will be coerced into the more flexible type, e.g. a vector `c("a", 1)` will be of type `"character"`.↩