ampvis2: The bread and butter of our amplicon analyses: amp_heatmap!

(This post is part of a series of ongoing posts about selected ampvis2 features)

Now that the ampvis2 R package has got its own paper on bioRxiv it is a good time to also write a small blog post about our beloved heatmap, which Albertsen lab and the Center for Microbial Communities use in practically every amplicon analysis.

Why not stacked bar charts?

Getting a simple overview of the microbial community composition of environmental samples can be done in different ways. However, when visualisation of both the composition as well as abundance is desired, very few options exist. For visualisation at higher taxonomic levels (mostly phylum level) a common way of achieving this is to use a stacked bar chart, which can show the relative abundances of all taxa in each sample (example) for comparison. Sadly, stacked bar charts have somehow found a way into microbial ecology even though they are best used for much simpler data and to illustrate gradual changes between values, not large compositional differences as with microbial community data. One of the reasons is that there is only about a dozen or so of different colors that are easily distinguishable by the human eye. There are just not enough colors to cover just a fraction of the diversity in microbial community data. Another reason is that it can be very difficult, if not impossible, to easily read the exact values of each constituent bar (the abundance of a taxa in a sample) as they are shown as an interval. You would first have to judge what the lower and upper limits are, calculate the difference, and only then get the value, and then it would only be approximate. Have a look at this example plot showing the species level composition of a range of samples obtained by 16S amplicon sequencing:

 

This just makes you reconsider why you would ever visualise anything. If the purpose of graphical visualisation is not to simplify and communicate an important message that would otherwise require pages of words to convey, then there are just no reasons to do it. For example, there are probably 7-8 different shades of green. Which one is E. coli? I can’t tell, but maybe you can. In any case, there should not be any doubt. When it comes to reading the relative abundances of the taxa in the samples, in most cases “around” X % abundance is fine to answer most research questions (there are biases anyways), but when you can’t tell the difference between taxa, the plot almost becomes useless.

Rethinking the heatmap

Enough hate on stacked bar charts, but these are some of the main reasons why we had to come up with something different without compromising both precision and simplicity, as well as being pleasant for the eye. This resulted in the amp_heatmap which has several additional features compared to other software tools that are also able to generate heatmaps, these will be demonstrated in the following. The example data used in the following is from the activated sludge of two wastewater treatment plants (WWTP’s) in Aalborg and is included in the ampvis2 R-package.

A heatmap is basically a grid of colored tiles with samples on the x-axis, taxa on the y-axis, and then the abundance (relative or absolute) of each taxon in each sample is indicated by coloring the tiles by a gradient of usually 2 or 3 colors. The colors are then used to identify patterns between the samples and the exact abundance values can be written on top of each tile. Here is an overview of the 10 most abundant Genera in all 16 samples in the dataset and the corresponding R code to generate the plot, by default the abundances are shown relative to the total in each sample:

amp_heatmap(
  data = data,
  tax_aggregate = "Genus"
)

 

Simple, right? We have all information we need in terms of composition and abundance in one plot. All OTU’s are aggregated to a specific taxonomic level of choice and the order of the taxa on the y-axis is automatically sorted by the cumulative abundance of each taxa across all samples. amp_heatmap particularly stands out in the way that it can incorporate sample metadata to dynamically group samples based on one or more metadata variable(s). These variables can contain any relevant information about the samples (for example when and where the samples were taken, physical parameters like pH or temperature, etc…) and amp_heatmap then calculates the mean (or even median, minimum, or maximum) abundance of each taxon in each group of samples as defined by the variable. For example, we can show the yearly average for each WWTP by setting the group_by argument (figure A below). A different way to show the same information is by using faceting on one of the variables instead by setting the facet_by argument (figure B below).

#plot_A:

amp_heatmap(
  data = data,
  tax_aggregate = "Genus",
  group_by = c("Plant", "Year")

#plot_B:
 
amp_heatmap(
  data = data,
  tax_aggregate = "Genus",
  group_by = "Plant",
  facet_by = "Year")

 

We can come a long way with just the group_by and facet_by features alone, but we may also be interested in looking at a different taxonomic level (by setting tax_aggregate), for example family level, add more taxonomic information (tax_add), or show more of the most abundant taxa (tax_show). Furthermore, we can hide the values (plot_values = FALSE) if we are more interested in patterns between the groups and not the exact values, or even adjust the color scale (plot_colorscale) and the colors themselves (color_vector):

amp_heatmap(
  data = data,
  group_by = "Plant",
  facet_by = "Period",
  tax_show = 20,
  tax_aggregate = "Family",
  tax_add = "Phylum",
  plot_values = FALSE,
  plot_colorscale = "sqrt",
  color_vector = c("royalblue4",
                   "whitesmoke",
                   "darkred")
)

 

To compare the top most abundant taxa in a single sample to all other samples it can be done by the sort_by argument (in this case the sample “16SAMP-3941“). This is also useful to check control samples. As mentioned, by default the read counts are normalised, but this can of course also be disabled by setting normalise = FALSE:

amp_heatmap(
  data = data,
  normalise = FALSE,
  tax_aggregate = "Genus",
  sort_by = "16SAMP-3941"
)

 

Last but not least, we primarily analyse samples from wastewater treatment systems, so we have made a nice way to show known biological functions of Genus level taxa next to the heatmap according to the MiDAS field guide database. Furthermore, when using group_by to group samples, the x-axis will be alphabetically ordered, so in the case of seasonal period, it makes more sense to order it chronologically (by setting order_x_by):

amp_heatmap(
  data = data,
  group_by = "Period",
  tax_aggregate = "Genus",
  tax_show = 20,
  plot_functions = TRUE,
  functions = c("FIL", 
                "AOB",
                "NOB",
                "PAO",
                "GAO"),
  order_x_by = c("Spring",
                 "Summer",
                 "Fall",
                 "Winter"),
  rel_widths = c(0.6, 0.4)
)

Conclusion

I hope this will convince you to try out amp_heatmap from the ampvis2 R package. Only a selection of features have been demonstrated, there are even more, have a look in the amp_heatmap online documentation. We have been using it for quite some years and it has stood the test in many of our analyses, papers and posters. It gives a nice, informative overview of any microbial community data obtained by amplicon sequencing, while still being eye-candy. In fact, it can even be used for transcriptomics as well.

 

What about pie-charts? I love pie-charts!

We certainly don’t, and definitely not for microbial community data. Let Max Roser explain why:

Analysing amplicon data, how easy does it get?

Ever done amplicon DNA sequencing of the 16S rRNA gene to identify microbes? If so, then you must know about the challenge of analysing such complex data easily.

My name is Kasper, and I am currently a master student here at the Albertsens lab. When I first learned how DNA is sequenced today, I was astonished by the rapid development DNA sequencing technologies have been experiencing during the last decade. A whole human genome can now be sequenced within a day for less than $1000! The applications of DNA sequencing are countless and there are countless questions yet to be answered. But the first and most important question is perhaps… how? What to do with millions of sequences of just A, C, T and G’s? Well, that question is the foundation of a huge field within biology: Bioinformatics! Which is something we try to expand here at Albertsen Lab.

During my master thesis I have been working with ampvis to take it a step or two further. Using R for bioinformatics takes a certain skill level and I’ve spent weeks of my project at learning R in depth to be able to write R functions for my thesis. During my project, I have specifically been applying various ordination methods (such as Principal Components Analysis, Correspondence Analysis, Redundancy Analysis and more) and other multivariate statistics to activated sludge samples to identify patterns between and within danish wastewater treatment plants – more on that will follow in subsequent blog posts. However, after I spent all that time learning R I thought:

Does everyone need to spend so much time at learning how to do complex bioinformatics, at least, just to do simple data analysis?

Interactive data analysis through Shiny

If you want to do reproducible research, then yes. There is no other way. But if you just want to do brief analysis of amplicon data using basic functions of ampvis, I have done all the work for you by making an interactive Shiny app of ampvis. A shiny app is basically designed from a collection of HTML widgets custom made to communicate with an active R session, so anything that can be done in R, can be done with a shiny app. This means that you can now use ampvis using only mouse clicks! No R experience required.

Amplicon Visualiser in action

All you have to do is upload your data and start clicking around! If the data is suited for ampvis it is also suited for the app (an OTU table + metadata if any, minimal examples can be downloaded from within the app). We recommend using UPARSE for OTU clustering into an otutable. You can then do basic filtering based on the metadata columns and also subset taxa if your are looking for specific microbes. With the app is some example data from the MiDAS database with samples taken from Danish wastewater treatment plants (WWTP’s) from 2014 and 2015. As a start you can simply click analysis, then render plot to get a simple overview of all the samples taken grouped by WWTP. No harm can be done, feel free to try out every clickable element!

The app is available here: https://kasperskytte.shinyapps.io/shinyampvis/.

Of course, it is open source, so the sourcecode is available on my Github. If you encounter any issues please report them here or in the comments below. Any suggestions for improvements are of course welcome.
(spoiler: Among future plans of the app is more extensive ordination with various metrics, data transformation and ordination types)