ampvis2: The bread and butter of our amplicon analyses: amp_heatmap!

(This post is part of a series of ongoing posts about selected ampvis2 features)

Now that the ampvis2 R package has its own paper on bioRxiv, it is a good time to also write a small blog post about our beloved heatmap, which the Albertsen Lab and the Center for Microbial Communities use in practically every amplicon analysis.

Why not stacked bar charts?

Getting a simple overview of the microbial community composition of environmental samples can be done in different ways. However, when visualisation of both composition and abundance is desired, very few options exist. For visualisation at higher taxonomic levels (mostly phylum level), a common choice is the stacked bar chart, which can show the relative abundances of all taxa in each sample (example) for comparison. Sadly, stacked bar charts have somehow found their way into microbial ecology even though they are best suited to much simpler data and to illustrating gradual changes between values, not the large compositional differences typical of microbial community data. One reason is that there are only about a dozen different colors that are easily distinguishable by the human eye; not nearly enough to cover even a fraction of the diversity in microbial community data. Another reason is that it can be very difficult, if not impossible, to read the exact value of each constituent bar (the abundance of a taxon in a sample), as it is shown as an interval: you first have to judge the lower and upper limits, calculate the difference, and only then do you get the value, and even then only approximately. Have a look at this example plot showing the species-level composition of a range of samples obtained by 16S amplicon sequencing:


This almost makes you reconsider why you would visualise anything at all. If the purpose of graphical visualisation is not to simplify and communicate an important message that would otherwise require pages of words to convey, then there is no reason to do it. For example, there are probably 7-8 different shades of green. Which one is E. coli? I can’t tell, but maybe you can. In any case, there should not be any doubt. When it comes to reading the relative abundances of the taxa in the samples, “around X %” is fine for answering most research questions (there are biases anyway), but when you can’t tell the taxa apart, the plot becomes almost useless.

Rethinking the heatmap

Enough hate on stacked bar charts; these are some of the main reasons why we had to come up with something different, without compromising either precision or simplicity, while still being pleasant to the eye. The result is amp_heatmap, which has several features beyond those of other software tools that can generate heatmaps, demonstrated in the following. The example data is from the activated sludge of two wastewater treatment plants (WWTPs) in Aalborg and is included in the ampvis2 R package.

A heatmap is basically a grid of colored tiles with samples on the x-axis and taxa on the y-axis, where the abundance (relative or absolute) of each taxon in each sample is indicated by coloring the tiles with a gradient of usually 2 or 3 colors. The colors can then be used to identify patterns between the samples, and the exact abundance values can be written on top of each tile. Here is an overview of the 10 most abundant genera across all 16 samples in the dataset and the corresponding R code to generate the plot; by default the abundances are shown relative to the total in each sample:

amp_heatmap(
  data = data,
  tax_aggregate = "Genus"
)


Simple, right? We have all the information we need about composition and abundance in one plot. All OTUs are aggregated to a taxonomic level of choice, and the order of the taxa on the y-axis is automatically sorted by the cumulative abundance of each taxon across all samples. amp_heatmap particularly stands out in the way it can incorporate sample metadata to dynamically group samples based on one or more metadata variables. These variables can contain any relevant information about the samples (for example when and where the samples were taken, or physical parameters like pH and temperature), and amp_heatmap then calculates the mean (or median, minimum, or maximum) abundance of each taxon in each group of samples as defined by the variable. For example, we can show the yearly average for each WWTP by setting the group_by argument (figure A below). A different way to show the same information is to facet on one of the variables instead by setting the facet_by argument (figure B below).


amp_heatmap(
  data = data,
  tax_aggregate = "Genus",
  group_by = c("Plant", "Year")
)

amp_heatmap(
  data = data,
  tax_aggregate = "Genus",
  group_by = "Plant",
  facet_by = "Year"
)


We can come a long way with the group_by and facet_by features alone, but we may also want to look at a different taxonomic level (by setting tax_aggregate), for example family level, add more taxonomic information (tax_add), or show more of the most abundant taxa (tax_show). Furthermore, we can hide the values (plot_values = FALSE) if we are more interested in patterns between the groups than in the exact values, or even adjust the color scale (plot_colorscale) and the colors themselves (color_vector):

amp_heatmap(
  data = data,
  group_by = "Plant",
  facet_by = "Period",
  tax_show = 20,
  tax_aggregate = "Family",
  tax_add = "Phylum",
  plot_values = FALSE,
  plot_colorscale = "sqrt",
  color_vector = c("royalblue4", "whitesmoke", "darkred") # the rest of the gradient was lost from the original post; the last two colors are just an example
)


The most abundant taxa in a single sample (in this case the sample "16SAMP-3941") can be compared to all other samples by using the sort_by argument. This is also useful for checking control samples. As mentioned, by default the read counts are normalised, but this can of course be disabled by setting normalise = FALSE:

amp_heatmap(
  data = data,
  normalise = FALSE,
  tax_aggregate = "Genus",
  sort_by = "16SAMP-3941"
)


Last but not least, we primarily analyse samples from wastewater treatment systems, so we have made a nice way to show the known biological functions of genus-level taxa next to the heatmap, according to the MiDAS field guide database. Furthermore, when using group_by to group samples, the x-axis is ordered alphabetically, so in the case of seasonal periods it makes more sense to order it chronologically (by setting order_x_by):

amp_heatmap(
  data = data,
  group_by = "Period",
  tax_aggregate = "Genus",
  tax_show = 20,
  plot_functions = TRUE,
  functions = c("FIL"), # vector truncated in the original post; add more MiDAS function groups as needed
  order_x_by = c("Spring", "Summer", "Fall", "Winter"), # chronological order; vector truncated in the original post
  rel_widths = c(0.6, 0.4)
)


I hope this will convince you to try out amp_heatmap from the ampvis2 R package. Only a selection of features has been demonstrated here; there are even more, so have a look at the amp_heatmap online documentation. We have been using it for quite a few years and it has stood the test of time in many of our analyses, papers, and posters. It gives a nice, informative overview of any microbial community data obtained by amplicon sequencing, while still being eye-candy. In fact, it can even be used for transcriptomics as well.


What about pie-charts? I love pie-charts!

We certainly don’t, and definitely not for microbial community data. Let Max Roser explain why:

ampvis2: A guide to ordination and how to use amp_ordinate in R

(This post is part of a series of ongoing posts about selected ampvis2 features)

A common way for microbial ecologists to compare the microbial communities of different samples is by using ordination methods. These methods seek to highlight differences between samples based on their microbial community composition. By plotting samples as points in an x/y-plane, the ecological differences between samples can then be interpreted simply by the distance between the points (see figure 1). The term “ordination” in fact stems from the German word “Ordnung”, meaning “order”, so ordination is essentially a way of ordering samples based on similarity.

Figure 1: Principal Components Analysis (Hellinger transformation applied) of 8 samples from the activated sludge of two Danish wastewater treatment plants Aalborg East and Aalborg West. Obtained by 16S rRNA amplicon sequencing on the Illumina platform. The sample points are shaped by which seasonal period the samples were taken in 2012.

However, there are numerous types of ordination methods and it is not clear to many researchers which method to use when, and why. This is not surprising, as some rely on highly complex mathematics, but there are a few key differences that are important to know when comparing the microbial communities using ordination.

Assumptions and terminology

An underlying assumption in ordination analysis is that the microbial community composition at a given site is highly dependent on the surrounding environmental and ecological conditions. The microbes present at a given site are therefore termed the response variables, as they depend on other factors (the surrounding environment). It follows that the microorganisms present at a particular site (which for a researcher means: in a single sample) are also likely to be present at different sites, as long as the environmental and ecological conditions are highly similar. When the environmental conditions differ, including physical and chemical conditions as well as the presence of other species, so does the overall microbial community composition. These differences in environmental conditions over time or space are termed environmental gradients, and each species is expected to be gradually less abundant the more the conditions deviate from the optimal set of conditions for that species to thrive (niche theory). Ordination methods mainly differ in how they model the abundance distribution of the response variables (i.e. the microbial communities) along the environmental gradient, commonly as either a unimodal or a linear distribution.

Figure 2: The main structure of two-dimensional data can often be described in one dimension by relative positions on a straight line. Adapted from Pearson (1901).

Distances, differences, and dimensionality reduction

When using ordination methods to visualise the differences between samples with respect to their microbial community composition, differences and distances are one and the same thing. The distances are defined by a mathematical formula that represents the differences between the samples as a distance or (dis)similarity coefficient, which is used to calculate a symmetrical distance matrix containing a coefficient for each pair of samples. This multidimensional matrix is anything but user-friendly to look at and difficult to visualise, and this is where ordination really gets to shine. The matrix is then subjected to, let’s keep it simple, dimensionality reduction, where the matrix is in a sense “compressed” so that only the most significant differences are extracted and expressed on new, synthetic axes, usually plotted in a normal x/y scatter plot. The principle is similar to that of linear regression, where drawing a straight line through points in two dimensions can represent the overall structure of the data in only one dimension (see figure 2). The difference is that here there are as many dimensions as there are samples. Yes, hard to imagine.
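The distance-matrix-then-compress workflow described above can be sketched with base R alone (this is a minimal illustration, not ampvis2 itself): dist() builds the symmetrical distance matrix (Euclidean here) and cmdscale() performs classical metric scaling, the method behind PCoA. Mock count data stands in for an OTU table.

```r
# mock abundance table: 5 samples (rows) x 20 OTUs (columns)
set.seed(42)
abund <- matrix(rpois(5 * 20, lambda = 10), nrow = 5,
                dimnames = list(paste0("sample", 1:5), paste0("OTU", 1:20)))

d <- dist(abund)          # symmetric distance matrix: one coefficient per pair of samples
ord <- cmdscale(d, k = 2) # "compress" to the two most significant synthetic axes

# each sample is now a point in an ordinary x/y plane
plot(ord, type = "n", xlab = "Axis 1", ylab = "Axis 2")
text(ord, labels = rownames(abund))
```

Samples that end up close together in the resulting plane have similar mock communities, exactly the interpretation described above.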

Six common ordination methods used within microbial ecology

In general, there are six methods in particular that are relevant for microbial ecology, see the table below. There are several ways to categorise them; below they are categorised by whether the distance matrix is calculated implicitly or explicitly, that is, whether or not the user has to manually calculate a distance matrix based on a suitable distance measure before performing the ordination.

                           | Unconstrained (exploratory)                | Constrained (explanatory)
Implicit distance measure  | Principal Components Analysis (PCA)        | Redundancy Analysis (RDA)
                           | Correspondence Analysis (CA)               | Canonical Correspondence Analysis (CCA)
Explicit distance measure  | Principal Coordinates Analysis (PCoA)      |
                           | non-metric Multidimensional Scaling (nMDS) |

Principal Components Analysis (PCA) vs. Correspondence Analysis (CA)

The distance measure incorporated in Principal Components Analysis (PCA) and Redundancy Analysis (RDA) is the Euclidean distance, which is calculated by the Pythagorean theorem. Correspondence Analysis (CA) (a.k.a. weighted or reciprocal averaging) and Canonical Correspondence Analysis (CCA) are instead based on the Pearson chi-squared measure.

A common way of distinguishing PCA+RDA from CA+CCA is that the former two are best applied to data showing a linear distribution of the species abundances along the environmental gradient, while the latter two are best applied to a unimodal distribution.

In theory, this means that CA and CCA should be the most ideal for ecological data, because the chi-squared measure better handles the fact that many species are present in only a few samples, so that pairs of samples share many absences (double-zeros), which is common in ecological data and fits niche theory well. PCA (and its constrained version RDA) is widely used, perhaps due to its simplicity (it simply highlights numerical variance), but within ecology it underestimates ecological differences due to these double-zeros, and the distances are dominated by the most abundant species. This is considered by some to be a flaw of PCA when used within ecology, but with appropriate data transformation, for example the Hellinger transformation (see Legendre 2001), PCA can prove quite useful for answering questions involving only the most abundant species and their numerical differences between samples. CA (and its constrained version CCA), on the other hand, is very sensitive to the less abundant and often unique species in the samples, which are numerous in microbial communities. Again, a data transformation like the Hellinger transformation can cope with this problem. CA then becomes a powerful tool to tell which species are characteristic of, or correspond to, each sample or group of samples, as we will see in the examples below.
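The Hellinger transformation mentioned above is simple enough to sketch in a couple of lines of base R: convert the counts to relative abundances per sample (row), then take the square root. vegan's decostand(x, method = "hellinger") does the same thing.

```r
# mock count table: 3 samples (rows) x 3 OTUs (columns)
counts <- matrix(c(100,  0, 1,
                    80, 10, 0,
                     0, 90, 5), nrow = 3, byrow = TRUE)

# Hellinger: square root of the per-sample relative abundances
hellinger <- sqrt(counts / rowSums(counts))
```

After the transformation, Euclidean distances between the rows correspond to Hellinger distances between the samples, which is what makes PCA behave sensibly on community data.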

One can think of it like CA (and CCA) is a more qualitative method, while PCA (and RDA) is a more quantitative method, and thus they can supplement each other nicely and reveal different aspects of the data.

Principal Coordinates Analysis (PCoA) and non-metric Multidimensional Scaling (nMDS)

The advantage of Principal Coordinates Analysis (PCoA, a.k.a. metric Multidimensional Scaling or mMDS) and non-metric Multidimensional Scaling (nMDS) is that the user has the freedom to choose the distance measure explicitly. There are well over 50 different measures to choose from, but luckily only a handful are relevant for most analyses (see GUSTA ME). Common measures include the Bray-Curtis dissimilarity index, which has been widely used for decades, the Pearson chi-squared measure, Jaccard, Chord, and lately UniFrac, which incorporates phylogeny. PCoA is very similar to PCA, RDA, CA, and CCA in that they are all based on eigenanalysis: each of the resulting axes is an eigenvector associated with an eigenvalue, and all axes are orthogonal to each other. This means that each axis reveals unique information about the inertia in the data, and exactly how much is indicated by the eigenvalue. When plotting the ordination result in an x/y scatter plot, the axis with the largest eigenvalue is plotted as the first axis and the one with the second largest as the second axis. How much each axis contributes to the total inertia in the data is often indicated on the axis labels as the fraction of the eigenvalue of that particular axis over the sum of all eigenvalues. Often the remaining axes beyond the two plotted can reveal a large amount of information as well, but in the best case most of the inertia is expressed by the first two. A simple bar plot of all the axes and their eigenvalues, called a scree plot, can be used to get an overview of each axis’ contribution to the total inertia and to confirm that the axes plotted are the most significant ones.
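The fraction-of-total-inertia axis labels and the scree plot described above can be sketched with base R's prcomp (a plain PCA; the built-in USArrests dataset stands in for an OTU table here):

```r
pca <- prcomp(USArrests, scale. = TRUE)

eig  <- pca$sdev^2      # one eigenvalue per axis, in decreasing order
frac <- eig / sum(eig)  # each axis' contribution to the total inertia

# a simple scree plot: one bar per axis
barplot(frac, names.arg = paste0("PC", seq_along(frac)),
        ylab = "Fraction of total inertia")
```

The values in frac are exactly what ordination software prints on the axis labels, e.g. "PC1 (62%)".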

In contrast to the other five methods, non-metric Multidimensional Scaling (nMDS) is not an eigenanalysis-based method but a rank-based approach, which means that the magnitudes of the differences are lost (hence the term non-metric) and only the ranked relationships between the samples are preserved.

This makes nMDS particularly robust and independent of any underlying environmental gradient.

It can even deal with missing values and different types of data at the same time. As opposed to the eigenanalysis-based methods, nMDS is an iterative procedure, and the resulting ordination does not have a single unique solution. Instead of eigenvalues, nMDS uses a stress value to indicate the “success” of the ordination. The number of axes obtained by nMDS is chosen by the user a priori, and a number of iterations (20 is usually enough) are run to best fit the data to the chosen number of axes. Each solution is used as the starting point for the next iteration, a stress value is calculated after each iteration, and after a number of iterations it reaches a stable value. A stress value of 0.20 or more is considered unacceptable, 0.1 is acceptable, and 0.05 or below is great. A disadvantage of PCoA and nMDS is that they can only show differences between samples/sites; plotting species scores together with site scores in a biplot is not possible.
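As a rough illustration of the iterative procedure (again not ampvis2 itself), MASS::isoMDS, which ships with R, fits a chosen number of axes to an explicitly supplied distance matrix and reports the final stress. Note that isoMDS reports stress as a percentage, so 20 corresponds to the 0.20 mentioned above.

```r
library(MASS)

# mock data: 30 samples described by 5 variables
set.seed(1)
x <- matrix(runif(30 * 5), nrow = 30)

d   <- dist(x)                          # an explicitly chosen (Euclidean) distance measure
fit <- isoMDS(d, k = 2, trace = FALSE)  # fit 2 axes a priori, iterating from a cmdscale start

fit$stress  # final stress, as a percentage
fit$points  # 30 samples placed on the 2 chosen axes
```

Only the ranked relationships in d are preserved in fit$points, which is what makes the method "non-metric".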

How to perform ordination with ampvis2 in R

Now that you know the basics of the different ordination methods relevant for microbial ecology, let’s demonstrate how to create publication-ready ordination plots with the ampvis2 package in R and discuss the results of different methods on the same data. The amp_ordinate function aims to simplify the process by wrapping numerous functions from other R packages (many of them from the vegan, ggplot2, and ape packages) into one large function. Performing ordination is often a multi-step workflow, and amp_ordinate performs all the required steps with just one function call, with several error checks along the way to help guide the user. All steps are fully customisable; by default amp_ordinate does the following, in order:

  1. Filters low abundant OTUs.
  2. Performs data transformation.
  3. Calculates a distance matrix if the chosen method is nMDS or PCoA.
  4. Calculates both site- and species scores by the chosen ordination method.
  5. Generates the ordination plot with numerous visual layers defined by the user.
  6. Fits the correlation of environmental variables to the ordination plot.
  7. Returns a nice looking plot or an interactive plotly plot.

The ordination is performed on the read counts contained in the OTU table (which simply contains the read counts of each OTU in each sample), and the corresponding sample metadata is used for all aesthetic options, like coloring or shaping the sample points based on a grouping variable, as well as for environmental interpretation by constrained ordination methods or for fitting the correlation of environmental variables onto the ordination plot.


Some of our 7th semester student groups have really done a great job collecting samples from a plethora of interesting environments, from nature to animal faeces, to hunt for novel bacteria. In the following examples, data obtained by 16S rRNA amplicon sequencing (on the Illumina MiSeq platform) of a handful of their samples is used. The samples are from quite different environments, including the faeces of a bearded dragon and a vicuña from Aalborg Zoo, sediment from the Limfjord, a wooden boardwalk from Lille Vildmose, and a rusting pipe from somewhere.

To keep this blog post from getting even longer than it already is, only PCA and CA will be demonstrated in the following.

Principal Components Analysis

Principal Components Analysis can be performed by simply setting type = "pca" as shown below. This alone would produce a plot with just black dots, which is not very useful, so setting sample_color_by and sample_colorframe colors the sample points based on the metadata variable "SampleContent", which simply contains information about where the samples were taken.

amp_ordinate(
  data = data,
  type = "pca",
  transform = "hellinger",
  sample_color_by = "SampleContent",
  sample_colorframe = TRUE,
  sample_colorframe_label = "SampleContent")

This only highlights the differences between the samples/sites. To show even more information, making a biplot that also plots the species scores (with this data, species are defined as OTUs clustered at 97% similarity) can reveal which species are abundant in which samples or groups of samples, as seen below. The plot is an interactive plotly plot, so go ahead and zoom, hover, export etc 🙂

amp_ordinate(
  data = data,
  type = "pca",
  transform = "hellinger",
  sample_color_by = "SampleContent",
  sample_colorframe = TRUE,
  sample_colorframe_label = "SampleContent",
  species_plotly = TRUE)

Notice that most of the species points are positioned on a line outwards from (0,0) towards the samples. This is very characteristic of PCA, where the relative position of the sample points along these “species lines” or “species gradients” is an expression of how abundant the species are in the samples.

Principal Components Analysis is therefore favored when answering questions involving only the most abundant species in the samples.

For example, the 3 samples from the Lille Vildmose boardwalk must have a high abundance of OTU 1775, and the Limfjorden samples a high abundance of OTU 610. Low-abundance species, however, have no impact at all on the observed differences when using PCA; they will all be positioned near (0,0). Often the species scores are plotted as arrows starting from (0,0) in PCA, but since microbial community data in most cases contain hundreds or even thousands of different species when sequenced with modern DNA sequencers, they are plotted as points by amp_ordinate.

Correspondence Analysis

The differences between the samples observed with Correspondence Analysis are in this case somewhat similar to those obtained with Principal Components Analysis; however, there is a big difference when it comes to the species scores, as seen in the biplot below.

amp_ordinate(
  data = data,
  type = "ca",
  transform = "hellinger",
  sample_color_by = "SampleContent",
  sample_colorframe = TRUE,
  sample_colorframe_label = "SampleContent",
  species_plot = TRUE)

As the name suggests, CA tries to find how species “correspond” to samples. Any species positioned near a sample is therefore characteristic of that sample and is most likely only present in that sample; if a species point is positioned at the exact same position as a sample, it is only found in that sample. Conversely, any species present in all samples will be positioned near (0,0), as they are not directly representative of any particular sample or group of samples. The species abundances also play a role in the observed differences with CA, but not nearly as much as with PCA.

This makes Correspondence Analysis a more qualitative method, which is therefore favored when answering questions about community composition and ‘what’ is present ‘where’.

To illustrate this, try zooming in closely on some of the groups of samples in the plot above. For example, the two faeces samples from the bearded dragon seem to be the only samples containing a group of about 50 species, which are positioned on a straight line between the two sample points. The same goes for the rusting pipe and the sediment samples from Limfjorden:

Remember to check the scree plot!

Judging from the Correspondence Analysis above, it may be tempting to conclude that the samples from the rusting pipe are more similar to the sediment samples from Limfjorden than to the faeces samples. Well, the scree plot below tells us that we may be wrong!

result <- amp_ordinate(data, 
  type = "ca",
  transform = "hellinger",
  detailed_output = TRUE)
result$screeplot # with detailed_output = TRUE, the returned list also contains the scree plot

The first two axes show about 43% of the total inertia in the data, but the 3rd and 4th axes actually reveal almost as much information, about 39% combined. By default amp_ordinate plots the first two axes, but the x_axis and y_axis arguments make it possible to show any of the remaining axes. Usually, keeping the first axis and only adjusting y_axis is fine, but any combination of axes can be shown. Looking at a fancy 3D plot of the first 3 axes makes it clear that the rusting pipe samples are actually quite different from all the other samples:

It is therefore important to always validate the resulting ordination plot by looking at the scree plot, and maybe even plot other axes than just the first two, before drawing any conclusions!

Read more



Starting from scratch – building a package in R

For the first time, I am going to share something more directly related to my master’s thesis. When I started the thesis, I did not know how to use R. To learn it, I started using DataCamp, a series of interactive courses where you can start from scratch and build your skills step by step. My favorite course so far is called “Writing Functions in R”. During the course, you are told:

If you need to copy something three times or more – build a function.

As a rookie, it sounds complicated, but in fact it is really simple AND it will save you so much trouble later. Defining a function will allow you to produce code that is easy to read, easy to reuse, and easy to share with colleagues. A function is defined with the following syntax:

my_function <- function(arguments){
  # function body
}
my_function can be any valid variable name; however, it is a good idea to avoid names used elsewhere in R. Arguments (also called formals) can be any R objects that are needed for my_function to run, for example numbers or data frames. Arguments may have a default value; if not, a value must be provided for the function to run. The function body is the code between the { brackets }, and it is run every time the function is called. Preferably, the function body should be short, and a function should do just one thing. If a large function cannot be avoided, it can often be constructed from a series of small functions. An example of a simple function could be:

cookies <- function(who, number=10){
  print(paste(who, "ate", number, "cookies", sep = " "))
}
The cookies function has two arguments. The number argument defaults to 10, so the user does not necessarily need to provide a value. The who argument, on the other hand, has no default, so a value must be provided. I had some cookies, BUT I only had nine, so I had better change the number argument:

cookies(who="Julie", number=9)
[1] "Julie ate 9 cookies"

So, now I have defined a function to keep track of my cookie consumption. What if I want to share it with the rest of Albertsen Lab? I could forward a script for them to save locally. No no, I will build a personal R package. This might seem like overkill for the cookies function, but imagine a more complex one. In my search for helpful tools for calculating correlations, I have come across several functions/sets of functions with no documentation, and it is nearly impossible to piece together how, what, and when to use their arguments with no help provided. So, I will now build a bare-minimum package to help me share my function with the group, strongly inspired by Not So Standard Deviations. For more information, check out the excellent book “R Packages” by Hadley Wickham.

First, you will need the following packages:
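The original code block listing the packages did not survive; judging from the rest of the post, devtools and roxygen2 are the ones used, so a reasonable reconstruction is:

```r
# install once, then load; devtools wraps the build/install tooling,
# roxygen2 generates the documentation
# install.packages(c("devtools", "roxygen2"))
library(devtools)
library(roxygen2)
```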


After this we need to create a package directory:

create("cookies") #create package

So now a package called cookies has been created (you can change the destination folder beforehand with setwd("my_directory")).

It is a good idea to update the DESCRIPTION file so that it contains the relevant information about the package (cookies) and the author (me). The next step is to add the cookies function to the package. For this, I save a script containing the function in the R folder. If you want to add more functions to your package, you can either create a new file for each function (recommended) or define the functions sequentially in one file.

Now comes the important part: documentation. Good documentation is key if you want other people to benefit from your work. This can be done easily using the roxygen2 package, also by Hadley Wickham. roxygen2 uses a custom syntax: text starting with #' will be compiled into the correct format for R documentation when processed. Make a new R script with the following code and save it as cookies.R in the folder cookies/R:

#' Cookies
#'
#' This function will allow you to keep track of who ate your cookies.
#' @param who Who ate the cookies? (Requires input)
#' @param number How many cookies have been eaten? (Default: 10)
#' @keywords cookies
#' @export

cookies <- function(who, number=10){
  print(paste(who, "ate", number, "cookies", sep = " "))
}

After creating the script, roxygen2 can be used to create all the needed files and documentation:
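The code block here was lost from the original post; with devtools loaded, the documentation step is presumably a single call from inside the package directory:

```r
setwd("./cookies") # step into the package directory
document()         # devtools::document() runs roxygen2, writing the man/ files and NAMESPACE
```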


Lastly the package needs to be installed:
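The install command was also lost; with devtools this is presumably:

```r
setwd("..")        # back to the parent directory
install("cookies") # devtools::install() builds and installs the package
```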


You can now access your awesome functions by loading your brand new package:
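The loading step, reconstructed from context:

```r
library(cookies)
cookies(who = "Julie") # uses the default number = 10
# [1] "Julie ate 10 cookies"
```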


Now you have built a real R package! If you type ?cookies in the console, a help page will actually pop up.

Finally, you can upload your package to GitHub (Guide). This will allow others to try out your package, point out issues, and suggest changes. Downloading and installing directly from GitHub is easy using install_github() from the devtools package. Try it out by typing this:
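The original snippet was lost; with devtools, installing from GitHub looks like this ("yourusername" is a placeholder for the actual account name):

```r
# "yourusername" is a placeholder; substitute the real GitHub account
install_github("yourusername/cookies")
library(cookies)
```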


It really can be this easy! So next time you copy something more than three times or really want to share your work, consider defining a function and building your own personal package with associated documentation.