(This post is the first in a small series compiled during my visit to Segata Lab in Trento, Italy)
A provocative title I know, but as a young scientist venturing into the field of microbiology my concern about this is increasing to the point of now writing this blogpost. Sequencing data are inherently relative [Lovén et al., 2012; Gloor et al., 2017], because there is a maximum read capacity of every sequencing run. However, correlating relative data is like comparing bananas and apples – and although everyone does it – it makes little sense (In the vast majority of cases). Consider the plot from Vandeputte et al. [2017] shown below:

Figure 1: Figure 2 redrawn from Vandeputte et al. [2017]. The composition of 40 microbiomes was analysed using 16S amplicon sequencing, with colours representing different genera. Shown in (a) are the relative abundances which is quantitative output after sequencing. The authors simultanous quantified the cell numbers for each sample, which enabled them to rescale each sample according to the cell count and perform quantitative profiling as shown in (b).
I will give an intuitive argument about why correlation of relative data is problematic, very much inspired by this post. Let’s start with the relative abundance of two bacteria measured by e.g 16S rRNA gene amplicon sequencing; and
. Since they are relative they have to sum to a constant:
It’s easy to see here that knowing the relative abundance of one immediately tells you the relative abundance of the other. Whatever does,
must go in the opposite direction with no room to vary: their correlation is a perfect -1. If we introduce a third bacteria
it is possible to pick any two bacteria before we are forced to make a decision on the last one. The freedom to vary is shared between two bacteria, which means the expected correlation of any two bacteria will be -0.5. We can repeat this with any number of bacteria and find that as the number of bacteria goes up the effect of negative correlation is mitigated. Phew, we dodged the bullet!
But wait a minute…
This is not the end of it, by far. The above example is highly idealized and assumes all bacteria to be independent and identically distributed, which is not true in real data as figure 1
exemplifies. In reality, bacterial abundance exists on multiple levels, often with few dominating bacteria and a long tail of low-abundant ones. Additionally, intricate correlations between different bacteria are commonplace for biological reasons. Under these circumstances our idealized view simply does not cut it. Contrary, there seems to be a overrepresentation of negative correlations in real data [Lovell et al., 2015]. Perhaps more disturbing, the correlation of relative data tells nothing about the correlation of the underlying absolute values. This is perfectly illustrated in figure 2 from a very nice publication by Lovell et al. [2015]:

Figure 2: Figure 1A redrawn from Lovell et al. [2015]. The relative abundances for two hypothetical mRNAs across seven samples are shown in red in the lower left corner. The three sets of blue, green and purple points are absolute abundances that corresponds to the observed relative abundances with correlation coefficients of -1,1, and 0, respectively.
But we are not all lost
I would argue that most scientist working with sequencing data know about this problem. But it is often circumvented by the ubiquitous assumption that the amount of input material is the same across all samples [Lovén et al., 2012; Lovell et al., 2015]. If that is true, the relative and absolute quantification are proportional across samples and we will not find discrepancies between the relative and absolute quantification. This is a very, very strong assumption! I will not go into details how this assumption can be violated, but many systems under study are constantly changing and heterogeneous. Given the complexity of microbial communities and their interactions this at least requires verification. And it is rarely done or required by the scientific community in general.
We can do other things with the data besides correlation. Compositional data, which is the statistical term for relative data, can be analysed by instead looking at ratios [Aitchison, 1982]. This is because ratios are invariant (i.e. unaffected) to scaling:
This is exactly what happens with compositional data, in the case of sequencing data is the total abundance for the sample. A common starting point for analysis is to quantify how much this ratio varies across samples based on the logratio variance [Aitchison, 1982]:
The size of for
and
can be interpreted as the proportionality of
and
. A
close to zero means that
and
are proportional, i.e. “synchronized” with a constant multiplication factor across samples. Contrary, large
indicate that
and
have different multiplication factors across samples. This
gives a great intuitive understanding of the concept, but is not directly used for calculating proportionality [Lovell et al., 2015; Quiin et al., 2017]. Instead, the
-statistic can be used, which conceptualize the logratio variance in a statistical framework. The exact formulation is too technical for the purpose of this blogpost – I encurage you to the read the papers!
Implementation
The propr package [Quiin et al., 2017] implements the calculations of the -statistic. I’ve wrapped some of the functionality in in my own function
mt_phi
for streamlined processing, which is included in my mmtravis package for transcriptomics data analysis. This function allows you to input a matrix of relative abundances and output a
matrix of the calculated
-statistic for each gene pair. This matrix can then be used for downstream analysis, such as clustering analysis, heatmap visualization etc. Let’s compare the proportionality and correlation matrices using some of my own transcriptomics data. You can find the code and data here. First we load the data:
library(mmtravis) library(tidyverse) library(magrittr) mmt <- readRDS("mt_example.rds")
The data consists of some 2224 genes measured across 10 timepoints. Before we can analyse the data we need to transform it from raw counts into relative abundances. For this purpose I use transcripts per million (TPM) [Wagner et al., 2012]. Prior to this I also subset to genes with > 50 reads across all samples, to remove noisy genes.
mmt_sub <- mt_subset(mmt,minreads = 50,normalise = "TPM")
Now we can compute the -statistic and correlation matrices.
mmt_phi <- mt_phi(mmt_sub) mmt_cor <- mmt_sub$mtdata %>% column_to_rownames("GeneID") %>% t() %>% cor()
Let’s convert the – and correlation-matrices into distance objects and do hierarchical clustering.
phi_dist <- as.dist(mmt_phi) cor_dist <- as.dist(1 - abs(mmt_cor)) phi_clst <- hclust(phi_dist) cor_clst <- hclust(cor_dist)
Finally we can visualize the clusterings by drawing heatmaps shown in figure 3.
par(mfrow = c(1,2),mar = c(1,1,1,1)) image( mmt_phi[phi_clst$order,phi_clst$order], breaks = quantile(mmt_phi,seq(0,1,0.1)), col = (heat.colors(10)), useRaster = TRUE, yaxt = "n", xaxt = "n", asp = 1, bty = "n",main = expression(phi~statistic)) image( mmt_cor[cor_clst$order,cor_clst$order], breaks = quantile(mmt_cor,seq(0,1,0.1)), col = (heat.colors(10)), useRaster = TRUE, yaxt = "n", xaxt = "n", asp = 1, bty = "n",main = expression(Correlation~rho))

Figure 3: Clustered heatmap of all genes based on the proportionality calculated using the -statistic (left) and using simple correlation (right). Red indicates high proportionality or correlation.
Note how different the two matrices in figure 3 are! The -statistic seems to yield tight clusteres of various sizes, while the correlation-based matrix is highly segmented into small clusters, with sets of genes that seem to be strongly correlated with almost all other genes, evident by all the vertical and horizontal red lines. This simple visualization is not enough for real conclusions, but the
-matrix seems to have a more “biological” topology in concordance with what one may expect – cluster of genes of various sizes that are up- or downregulated together. In contrary, the clustering based on correlation seems more homogenous suggesting that noise from spurious correlations are dominating. From here the
statistic could be used for further analysis, such as pathway enrichment, network analysis etc.
Conclusions
I have only scratched the surface on this topic and other approaches to handle issues with relative data exists [Pawlowsky-Glahn and Buccianti, 2011; Faust et al., 2012; Friedman and Alm, 2012; Knight et al., 2018]. The concepts of proportionality and comparing ratios are relatively easy to understand but might at first be hard to interpret. The concept of correlation is hardwired into any scientist and facing new concepts will require rewiring, additional theoretical studies, and use-case examples to establish guidelines for interpretation. But that being said we should not rely on measures that in many cases are plain wrong, or just blindly trust our assumptions to hold true. Either you validate your assumptions to be true or you adapt your analysis to the constraints of the data. The former will always be problematic, therefore I believe proportionality provides a meaningful path forward analyzing sequencing data and compositional data in general.
References
Aitchison J (1982) The statistical analysis of compositional data. Chapman & Hall, Ltd.
Faust K, Sathirapongsasuti JF, Izard J, Segata N, Gevers D, Raes J, Huttenhower C. (2012). Microbial Co-occurrence Relationships in the Human Microbiome. PLoS Computational Biology 8.
Friedman J, Alm EJ (2012) Inferring Correlation Networks from Genomic Survey Data. PLoS Computational Biology 8:1–11.
Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ (2017) Microbiome Datasets Are Compositional: And This Is Not Optional. Front. Microbiol. 8:2224.
Knight R, Vrbanac A, Taylor BC, Aksenov A, Callewaert C, Debelius J, Gonzalez A, et al. (2018) Best practices for analysing microbiomes. Nature Reviews Microbiology.
Lovell D, Pawlowsky-Glahn V, Egozcue JJ, Marguerat S, Bähler J (2015) Proportionality: A Valid Alternative to Correlation for Relative Data. PLoS Computational Biology 11:1–12.
Lovén J, Orlando DA, Sigova AA, Lin CY, Rahl PB, Burge CB, Levens DL, Lee TI, Young RA (2012) Revisiting global gene expression analysis. Cell 151:476–482.
Pawlowsky-Glahn V, Buccianti A (2011) Compositional Data Analysis: Theory and Applications. John Wiley & Sons.
Quinn TP, Richardson MF, Lovell D, Crowley TM (2017) propr: An R-package for identifying proportionally abundant features using compositional data analysis. Scientific reports 7(1)
Vandeputte D, Kathagen G, D’hoe K, Vieira-Silva S, Valles-Colomer M, Sabino J, Wang J, et al. (2017) Quantitative microbiome profiling links gut community variation to microbial load. Nature 551(7681):507-511.
Wagner GP, Kin K, Lynch VJ (2012) Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences 131:281–285.
Supplementary
You can download the R script and data used to generate figure 3 here.

Thomas Yssing Michaelsen

Latest posts by Thomas Yssing Michaelsen (see all)
- EliteForsk scholarship – failure is a virtue - March 4, 2019
- What is wrong with correlating relative abundance? Everything! - November 13, 2018
- The Roblon Prize: my first prize - April 16, 2018
I wanted to point out that LR_{var} is equivalent to the pairwise variation V to find stable reference genes in the geNorm concept (Vandesompele et al., Genome Biology, 2002; equation 3). Indeed, stable reference genes are proportionally expressed.
https://genomebiology.biomedcentral.com/articles/10.1186/gb-2002-3-7-research0034
Hej Thomas,
thanks for sharing your experience and creating this example. Just a quick note, that the mmtravis package requires the ampvis2 package.
Those who have not already obtained this package should add:
remotes::install_github(“MadsAlbertsen/ampvis2”)
to the next line after the installing command for the the remotes package.