ampvis2: A guide to ordination and how to use amp_ordinate in R

(This post is part of a series of ongoing posts about selected ampvis2 features)

A common way for microbial ecologists to compare the microbial communities of different samples is by using ordination methods. These methods seek to highlight differences between samples based on their microbial community composition. By plotting samples as points in an x/y-plane, the ecological differences between samples can then be interpreted simply by the distance between the points (see figure 1). The term “ordination” in fact stems from the German word “ordnung”, meaning “order”, so ordination is essentially a way of ordering samples based on similarity.

Figure 1: Principal Components Analysis (Hellinger transformation applied) of 8 samples from the activated sludge of two Danish wastewater treatment plants Aalborg East and Aalborg West. Obtained by 16S rRNA amplicon sequencing on the Illumina platform. The sample points are shaped by which seasonal period the samples were taken in 2012.

However, there are numerous types of ordination methods and it is not clear to many researchers which method to use when, and why. This is not surprising, as some rely on highly complex mathematics, but there are a few key differences that are important to know when comparing the microbial communities using ordination.

Assumptions and terminology

An underlying assumption in ordination analysis is that the microbial community composition at a given site (a.k.a. alpha-diversity) is highly dependent on the surrounding environmental and ecological conditions. The microbes present at a given site are therefore termed the response variables, as they are dependent on other factors (the surrounding environment). It follows that the microorganisms present at a particular site (which for a researcher is: in a single sample) are thus also likely to be present at different sites, as long as the environmental and ecological conditions are highly similar. When the environmental conditions, including physical and chemical conditions, as well as the presence of other species, differ, so does the overall microbial community composition. These differences in the environmental conditions over time or space are termed environmental gradients and the species are expected to be gradually less abundant the more the conditions are different from an optimal set of conditions for the individual species to thrive (niche theory). Ordination methods mainly differ in how they hypothesise the abundance distribution of the response variables (aka. microbial communities) along the environmental gradient to be, commonly either showing a unimodal or a linear distribution.

Figure 2: The main structure of two-dimensional data can often be described in one dimension by relative positions on a straight line. Adopted from (Pearson, 1901)

Distances, differences, and dimensionality reduction

When using ordination methods to visualise the differences between samples with respect to their microbial community composition, differences and distances are one and the same thing. The distances are essentially defined by a mathematical formula with which to represent the differences between the samples as a distance- or (dis)similarity coefficient, which is used to calculate a symmetrical distance matrix containing a coefficient for each pair of samples. This multidimensional matrix is all but user-friendly to look at and difficult to visualise, and this is where ordination really gets to shine. The matrix is then subjected to, let’s keep it simple, dimensionality reduction, where the matrix is sort of “compressed” and only the most significant differences are extracted and expressed by new, synthetic axes, and usually plotted in a normal x/y scatter plot. The principle is similar to that of linear regression, where drawing a straight line through points in two dimensions can represent the overall structure of the data in only one dimension (see figure 2). The only difference is that there are just as many dimensions as there are samples. Yes, hard to imagine.

Six common ordination methods used within microbial ecology

In general, there are six methods in particular that are relevant for microbial ecology, see the table below. There are several ways to categorise them, below they are categorised by whether the distance matrix is calculated explicitly or implicitly. That means, whether the user has to manually calculate a distance matrix based on a suited distance measure before performing the ordination or not, respectively.

Unconstrained (exploratory)Constrained (explanatory)
Implicit distance measure
Principal Components Analysis (PCA)Redundancy Analysis (RDA)
Correspondence Analysis (CA)
Canonical Correspondance Analysis (CCA)
Explicit distance measure
Principal Coordinates Analysis (PcoA)
non-metric Multidimensional Scaling (nMDS)

Principal Components Analysis (PCA) vs. Correspondence Analysis (CA)

The distance measure incorporated in Principal Components Analysis (PCA) and Redundancy Analysis (RDA) is the Euclidean distance, which is calculated by the Pythagorean theorem. Correspondence Analysis (CA) (a.k.a. weighted- or reciprocal averaging) and Canonical Correspondance Analysis (CCA) are instead based on the Pearson chi-squared measure.

A common way of distinguishing PCA+RDA from CA+CCA is that the former two are best applied to data showing a linear distribution of the species abundances along the environmental gradient, while the latter two are best applied to a unimodal distribution.

This means that in theory, CA and CCA should be the most ideal for ecological data, because of the chi-squared measure better deals with the fact that some species can be present in some samples while absent in others (double-zeros), which is common in ecological data and fits niche theory well. PCA (and its constrained version RDA) is widely used, perhaps due to its simplicity (it just highlights numerical variance), but within ecology, it underestimates ecological differences due to these double-zeros, and only the most abundant species will influence the distances predominantly. This is by some considered a flaw of PCA when used within ecology, but with appropriate data transformation, for example the Hellinger transformation (see Legendre 2001), PCA can prove quite useful when answering questions involving only the most abundant species and their numerical differences between samples. CA (and its constrained version CCA) on the other hand is very sensitive to the less abundant and often unique species in the samples, which are often numerous in microbial communities. Again, data transformation like the Hellinger transformation, can cope with this problem. Then, CA becomes a powerful tool to tell which species are characteristic of, or correspond to, each sample or group of samples, as we will see in the examples below.

One can think of it like CA (and CCA) is a more qualitative method, while PCA (and RDA) is a more quantitative method, and thus they can supplement each other nicely and reveal different aspects of the data.

Principal Coordinates Analysis (PCoA) and non-metric Multidimensional Scaling (nMDS)

The advantage of Principal Coordinates Analysis (PCoA, a.k.a. metric Multidimensional Scaling or mMDS) and non-metric Multidimensional Scaling (nMDS) is that the user has the freedom to choose the distance measure explicitly. There are well over 50 different measures to choose from, but luckily only a handful are relevant for most analyses (see GUSTA ME). Common measures include the Bray-Curtis dissimilarity index, which has been widely used for decades, Pearson chi-squares, Jaccard, Chord, and lately UniFrac, which incorporates phylogeny. PCoA is very similar to PCA, RDA, CA, and CCA in that they are all based on eigenanalysis: each of the resulting axes is an eigenvector associated with an eigenvalue, and all axes are orthogonal to each other. This means that all axes reveal unique information about the inertia in the data, and exactly how much inertia is indicated by the eigenvalue. When plotting the ordination result in an x/y scatterplot, the axis with the largest eigenvalue is plotted on the first axis, and the one with the second largest on the second axis. How much each axis then contributes to the total inertia in the data is often indicated on the axis labels as a fraction of the eigenvalue of the particular axis and the sum of all eigenvalues. Often the remaining axes other than the two plotted can highlight a large amount information as well, but the best case scenario is that the most inertia is expressed by the first two. A simple bar plot of all the axes and their eigenvalues, called a scree plot, can be used to get an overview of each axis’ contribution to the total inertia and confirm that those plotted are the most significant ones.

In contrast to the other five methods, non-metric Multidimensional Scaling (nMDS) is quite different. It is not an eigenanalysis based method, it is instead a rank-based approach, which means that the magnitudes of the differences are lost (hence the term non-Metric), and only the ranked relationships between the samples are preserved.

This makes nMDS particularly robust and independent of any underlying environmental gradient.

It can even deal with missing values and different types of data at the same time. As opposed to the eigenanalysis-based analyses, nMDS is an iterative procedure, where the resulting ordination does not have a single solution. Instead of eigenvalues, nMDS utilise a stress value to indicate the “success” of the ordination. The number of axes obtained by nMDS is chosen by the user a-priori, and a number of iterations (20 is usually enough) are run to best fit the data to the chosen number of axes. Each solution is used as a starting point for the next iteration and a stress value is calculated after each iteration and will after a number of iterations reach a stable value. A stress value of 0.20 or more is considered unacceptable, 0.1 is acceptable, and 0.05 or below is great. A disadvantage of PCoA and nMDS is that it is only possible to show differences between samples/sites. Plotting both species scores together with site scores in a biplot is not possible.

How to perform ordination with ampvis2 in R

Now that you know the basics of the different ordination methods relevant for microbial ecology, let’s demonstrate how to create publication-ready ordination plots with the ampvis2 package in R and discuss the results of different methods on the same data. The amp_ordinate function aims to simplify the process of performing ordination by basically wrapping numerous functions from other R packages (many of which are from the vegan, ggplot2, and APE packages) into one large function. Performing ordination is often a multi-step workflow and amp_ordinate performs all the required steps with just one function call. Along the way there are also several error checks to help guide the user in the process. All steps are fully customisable by the user, by default amp_ordinate does the following in order:

  1. Filters low abundant OTUs.
  2. Performs data transformation.
  3. Calculates a distance matrix if nMDS/PCoA.
  4. Calculates both site- and species scores by the chosen ordination method.
  5. Generates the ordination plot with numerous visual layers defined by the user.
  6. Fits the correlation of environmental variables to the ordination plot.
  7. Returns a nice looking plot or a interactive plotly plot.

The ordination is performed on the read counts contained in the OTU table (which simply contains the read counts of each OTU in each sample), and the corresponding metadata of those samples are used for all aesthetic options, like the coloring or shape of sample points based on a group variable, as well as for environmental interpretation by constrained ordination methods or the fitting of the correlation of environmental variables onto the ordination plot.

Examples

Some of our 7th semester student groups have really done a great job collecting samples from a plethora of interesting environments in nature and from animal faeces to hunt for novel bacteria. In the following examples, data obtained by 16S rRNA amplicon sequencing (on the Illumina MiSeq platform) of a handful of their samples is used. The samples are from quite different environments, including the faeces of a Bearded dragon and a Vikunja from Aalborg Zoo, sediment from the Limfjord, a wooden boardwalk from Lille Vildmose, and a rusting pipe from somewhere.

To avoid that this blog post gets longer than it already is, only PCA and CA will be demonstrated in the following.

Principal Components Analysis

Principal Components Analysis can be performed by simply setting type = “pca” as shown below. This would produce a plot with just black dots, which is not very useful, so setting sample_color_by and sample_colorframe can color the sample points based on the metadata variable “SampleContent”, which simply contains information about where the samples were taken.

amp_ordinate(
  data = data,
  type = "pca",
  transform = "hellinger",
  sample_color_by = "SampleContent",
  sample_colorframe = TRUE,
  sample_colorframe_label = "SampleContent")

This only highlights the differences between the different samples/sites. To show even more information, making a biplot with also the species scores (with this data species are defined as OTU’s clustered at 97% similarity) plotted can reveal information about which species are abundant in samples or groups of samples, as seen below. The plot is an interactive plotly plot, so go ahead and zoom, hover, export etc 🙂

amp_ordinate(
    data = data,
    type = "pca",
    transform = "hellinger",
    sample_color_by = "SampleContent",
    sample_colorframe = TRUE,
    sample_colorframe_label = "SampleContent",
    species_plotly = TRUE)

Notice that most of the species points are positioned on a line outwards from (0,0) towards the samples. This is very characteristic of PCA, where the relative position of the sample points along these “species lines” or “species gradients” is an expression of how abundant the species are in the samples.

Principal Components Analysis is therefore favored when answering questions involving only the most abundant species in the samples

For example, the 3 samples from Lille Vildmose boardwalk must have a high abundance of OTU 1775 and the Limfjorden samples must have a high abundance of OTU 610. Low abundant species, however, have no impact at all on the observed differences when using PCA, they will all be positioned near (0,0). Often the species scores are plotted as arrows starting from (0,0) when performing PCA, but since microbial community data in most cases contain hundreds or even thousands of different species when sequenced with modern DNA sequencers, they are plotted as points when using amp_ordinate.

Correspondence Analysis

The differences between the samples as observed with Correspondence Analysis are in this case somewhat similar to those obtained with Principal Components Analysis, however there is a big difference when it comes to the species scores, as seen in the biplot below.

amp_ordinate(
  data = data,
  type = "ca",
  transform = "hellinger",
  sample_color_by = "SampleContent",
  sample_colorframe = TRUE,
  sample_colorframe_label = "SampleContent",
  species_plot = TRUE)

As the name suggests, CA tries to find how species “correspond” to samples. Therefore, any species positioned near a sample are then characteristic of that sample and is most likely only present in that sample. If a species point is positioned at the exact same position as a sample, then it is only found in that sample. On the contrary, any species that are present in all samples will be positioned near (0,0) as they are not directly representative of any sample or group of samples. The species abundances also play a role in the observed differences with CA, but not at all as much as with PCA.

This makes Correspondence Analysis a more qualitative method and is therefore favored when answering questions involving community composition and ‘what’ is present ‘where’.

To illustrate this, try zooming in closely at some of the groups of samples in the plot above. For example, the two faeces samples from the Bearded dragon seem to be the only samples containing about 50 species, since they are positioned on a straight line between the two samples. This is also the case for the rusting pipe or the sediment samples from Limfjorden:

Remember to check the scree plot!

Judging from the Correspondence Analysis above it may be tempting to conclude that the samples from the rusting pipe are more similar to the sediment samples from Limfjorden than they are to the faeces samples. Well, the scree plot below tells us that we may be wrong!

result <- amp_ordinate(data, 
  type = "ca",
  transform = "hellinger",
  detailed_output = TRUE)
result$screeplot

The first two axes show about 43% of the total inertia in the data, but the 3rd and 4th axes can actually reveal just as much information as the first two, about 39% combined. By default amp_ordinate plots the first two axes, but by using the x_axis and y_axis arguments it is possible to show any of the remaining axes. Usually, keeping the first axis and only adjusting the y_axis is fine, but any combination of axes can be shown. By looking at a fancy 3D plot with the first 3 axes plotted it is clear that the rusting pipe samples are actually quite different from all the other samples:

It is therefore important to always validate the resulting ordination plot by looking at the scree plot, and maybe even plot other axes than just the first two, before drawing any conclusions!

Read more

Websites

Literature

The following two tabs change content below.
Kasper S. Andersen

Kasper S. Andersen

My name is Kasper and I am a research assistant at Mads Albertsen's lab. My main focus area is data analysis and bioinformatics mainly using R and Shiny. I currently develop and maintain the ampvis2 and mmgenome2 R-packages.
Posted in ampvis2, R.

3 Comments

  1. Hello, I would like to ask how to use ampvis2 ordination function for Pcoa analysis. When I choose distmeasure with unifrac, the r replied ” the tree? needed. Then I tried to find the example or tutorial for the pcoa analysis with r. It appeared to be a little bit complicated. How to build the tree from otu_table.biom? Thank you for your time the kindness. Best, lin

  2. Hi. UniFrac is based on branch lengths in a phylogenetic tree. You have to generate a tree outside of R, (with for example clustalo for alignment, then input to fasttree). Then you have to root it, and load it with amp_load(…, tree = “path/to/tree.file”).

  3. Hello,
    I am trying to use ampvis2 to generate CCAs. I am having trouble figuring out whether my environmental data is not producing arrows when I fit it to the species plot with my sample IDs because it is not significant or whether this is a coding bug on my end. I’ve tried adjusting the significance level of the envfit argument to 0.9 and even 1 and it still won’t work. Could you provide an example of code that generates these visualizations?

    Thank you very much!
    Dan

Leave a Reply

Your email address will not be published. Required fields are marked *