(This post is part of a series of ongoing posts about selected ampvis2 features)
Now that the ampvis2 R package has got its own paper on bioRxiv it is a good time to also write a small blog post about our beloved heatmap, which Albertsen lab and the Center for Microbial Communities use in practically every amplicon analysis.
Why not stacked bar charts?
Getting a simple overview of the microbial community composition of environmental samples can be done in different ways. However, when visualisation of both the composition as well as abundance is desired, very few options exist. For visualisation at higher taxonomic levels (mostly phylum level) a common way of achieving this is to use a stacked bar chart, which can show the relative abundances of all taxa in each sample (example) for comparison. Sadly, stacked bar charts have somehow found a way into microbial ecology even though they are best used for much simpler data and to illustrate gradual changes between values, not large compositional differences as with microbial community data. One of the reasons is that there is only about a dozen or so of different colors that are easily distinguishable by the human eye. There are just not enough colors to cover just a fraction of the diversity in microbial community data. Another reason is that it can be very difficult, if not impossible, to easily read the exact values of each constituent bar (the abundance of a taxa in a sample) as they are shown as an interval. You would first have to judge what the lower and upper limits are, calculate the difference, and only then get the value, and then it would only be approximate. Have a look at this example plot showing the species level composition of a range of samples obtained by 16S amplicon sequencing:
This just makes you reconsider why you would ever visualise anything. If the purpose of graphical visualisation is not to simplify and communicate an important message that would otherwise require pages of words to convey, then there are just no reasons to do it. For example, there are probably 7-8 different shades of green. Which one is E. coli? I can’t tell, but maybe you can. In any case, there should not be any doubt. When it comes to reading the relative abundances of the taxa in the samples, in most cases “around” X % abundance is fine to answer most research questions (there are biases anyways), but when you can’t tell the difference between taxa, the plot almost becomes useless.
Rethinking the heatmap
Enough hate on stacked bar charts, but these are some of the main reasons why we had to come up with something different without compromising both precision and simplicity, as well as being pleasant for the eye. This resulted in the amp_heatmap which has several additional features compared to other software tools that are also able to generate heatmaps, these will be demonstrated in the following. The example data used in the following is from the activated sludge of two wastewater treatment plants (WWTP’s) in Aalborg and is included in the ampvis2 R-package.
A heatmap is basically a grid of colored tiles with samples on the x-axis, taxa on the y-axis, and then the abundance (relative or absolute) of each taxon in each sample is indicated by coloring the tiles by a gradient of usually 2 or 3 colors. The colors are then used to identify patterns between the samples and the exact abundance values can be written on top of each tile. Here is an overview of the 10 most abundant Genera in all 16 samples in the dataset and the corresponding R code to generate the plot, by default the abundances are shown relative to the total in each sample:
amp_heatmap( data = data, tax_aggregate = "Genus" )
Simple, right? We have all information we need in terms of composition and abundance in one plot. All OTU’s are aggregated to a specific taxonomic level of choice and the order of the taxa on the y-axis is automatically sorted by the cumulative abundance of each taxa across all samples. amp_heatmap particularly stands out in the way that it can incorporate sample metadata to dynamically group samples based on one or more metadata variable(s). These variables can contain any relevant information about the samples (for example when and where the samples were taken, physical parameters like pH or temperature, etc…) and amp_heatmap then calculates the mean (or even median, minimum, or maximum) abundance of each taxon in each group of samples as defined by the variable. For example, we can show the yearly average for each WWTP by setting the group_by argument (figure A below). A different way to show the same information is by using faceting on one of the variables instead by setting the facet_by argument (figure B below).
#plot_A: amp_heatmap( data = data, tax_aggregate = "Genus", group_by = c("Plant", "Year") #plot_B: amp_heatmap( data = data, tax_aggregate = "Genus", group_by = "Plant", facet_by = "Year")
We can come a long way with just the group_by and facet_by features alone, but we may also be interested in looking at a different taxonomic level (by setting tax_aggregate), for example family level, add more taxonomic information (tax_add), or show more of the most abundant taxa (tax_show). Furthermore, we can hide the values (plot_values = FALSE) if we are more interested in patterns between the groups and not the exact values, or even adjust the color scale (plot_colorscale) and the colors themselves (color_vector):
amp_heatmap( data = data, group_by = "Plant", facet_by = "Period", tax_show = 20, tax_aggregate = "Family", tax_add = "Phylum", plot_values = FALSE, plot_colorscale = "sqrt", color_vector = c("royalblue4", "whitesmoke", "darkred") )
To compare the top most abundant taxa in a single sample to all other samples it can be done by the sort_by argument (in this case the sample “16SAMP-3941“). This is also useful to check control samples. As mentioned, by default the read counts are normalised, but this can of course also be disabled by setting normalise = FALSE:
amp_heatmap( data = data, normalise = FALSE, tax_aggregate = "Genus", sort_by = "16SAMP-3941" )
Last but not least, we primarily analyse samples from wastewater treatment systems, so we have made a nice way to show known biological functions of Genus level taxa next to the heatmap according to the MiDAS field guide database. Furthermore, when using group_by to group samples, the x-axis will be alphabetically ordered, so in the case of seasonal period, it makes more sense to order it chronologically (by setting order_x_by):
amp_heatmap( data = data, group_by = "Period", tax_aggregate = "Genus", tax_show = 20, plot_functions = TRUE, functions = c("FIL", "AOB", "NOB", "PAO", "GAO"), order_x_by = c("Spring", "Summer", "Fall", "Winter"), rel_widths = c(0.6, 0.4) )
I hope this will convince you to try out amp_heatmap from the ampvis2 R package. Only a selection of features have been demonstrated, there are even more, have a look in the amp_heatmap online documentation. We have been using it for quite some years and it has stood the test in many of our analyses, papers and posters. It gives a nice, informative overview of any microbial community data obtained by amplicon sequencing, while still being eye-candy. In fact, it can even be used for transcriptomics as well.
What about pie-charts? I love pie-charts!
We certainly don’t, and definitely not for microbial community data. Let Max Roser explain why:
Why you should not use pie charts. pic.twitter.com/AahWfGYU6N
— Max Roser (@MaxCRoser) April 27, 2017
Kasper S. Andersen
Latest posts by Kasper S. Andersen (see all)
- ampvis2: The bread and butter of our amplicon analyses: amp_heatmap! - April 13, 2018
- ampvis2: A guide to ordination and how to use amp_ordinate in R - February 22, 2018
- Analysing amplicon data, how easy does it get? - March 27, 2017