What is a good genome assembly?

With short reads you will often get fragmented but high single base accuracy assemblies  and with long error prone reads you can get a single contig assembly with just a few ultra long reads but with a lot of errors due to the lower read quality. Some of these errors can be fixed by adding more coverage and polishing the assembly.  Another approach can be to use both technologies in combination and one can fairly easy produce a single contig assembly. A good assembly should be in as many pieces as the original genetic elements they represent (one contig – one chromosome) but to allow gene calling, genome alignments single base accuracy is also essential. There are many genome assemblers, polishing tools etc. that will help you make a genome assembly but how do you know if you have a good assembly? To test this people have developed tools to calculate the AverageNucleotideIdentity (ANI), estimate genome completeness, assess the quality of genome assemblies compared to a reference. However, it may also be useful to use annotation tools to assess whether genes can be called correctly. The genome assembly, polishing, and annotation strategy is an ongoing discussion in the scientific community on twitter.

To decide which strategy should be our “preferred” genome assembly approach based on data rather than my gut-feeling about the “best assembly” I decided to do some testing with a known “true” reference E Coli K12 MG1655 (U00096.2).

Objective: develop a strategy to compare genome assemblies

Strategy: Use E. Coli K12 MG1655 reference strain data to produce a set assemblies and evaluate these with different assembly metrics in comparison to the “true” reference

Data:

 MbpX coverageRessource 
Nanopore data18340in house E. Coli K12 MG1655 grown and prepped for 1D R9.4 with RAD002(a subset of our total run to get a more realistic view than with 7+Gbp data for a bacterial genome)
Illumina data27058ENA: SRR2627175(I used a subset of 1095594 reads)

Assemblies:

 Read typeContigsSize (bp)Relative sizeANI (%)CheckM completenessProkka # CDSProkka # rRNAProkka # tRNAMedian CDS sizeQUAST NGA50QUAST mismatches per 100kbQUAST indels per 100kb
SpadesShort8445462200.9810099.91423511878101327000.420.09
Spades-hybridHybrid146203770.99610099.914282228881528288795.910.43
MiniasmLong144101010.95183.950.0011241615206NANANA
Miniasm +1xRaconLong146198180.99699.0270.871019222802874618042239.94541.42
Miniasm +2xRaconLong146220210.99699.2374.97962622823084620219222.51449.98
Miniasm +2xRacon +PilonHybrid146220210.99699.9698.5145092288761359021210.3519.19
CANULong145335740.97798.6950.001047821732634531613113.53662.93
CANU +NanopolishLong145631520.98499.2771.79966422842964561292146.74468.02
CANU +Nanopolish +PilonHybrid145674110.98499.9798.414415228777013295246.1715.79
UnicyclerHybrid146339760.99999.9999.914305228881524718074.282.25
Reference14639675110099.9743002288818463967500

Conclusion:

The long read only assemblies are nice contiguous assemblies with approximately the same length as the reference. However, the genes are broken due to indel errors present in the reads as is seen in the high number of CDS detected and the low median CDS size. Polishing the assemblies with long reads improves the ANI value a lot but the genes are still fragmented. Polishing with short reads is needed to get the number of CDS and median CDS size in the range of that of the reference. The short read only assembly has a high sequence identity with the reference but is fragmented and cannot recreate the repeat structure of the genome. The number of CDS is lower than that of the reference and the rRNA genes, which are known to be very similar if not identical, are messed up and co-assembled.  The unicycler assembly seems to deliver the best metrics across the board until we come to the NGA50 value reported by QUAST which was surprisingly low. I used mummer to visualise the alignment of the assembly against the reference genome to investigate if there was a mis-assembly causing this. However, the figure below shows no big inversions or anything, so I guess that for now we will stick to UNIcycler assemblies that combine the benefit of the short read accuracy with the power of long reads.

Dotplot of unicycler assembly aligned to the reference genome

Methods:

I have used the following assemblers

I have used the following mappers

I have used the following polishing tools

I have used the following tools to assess genome assembly characteristics

If you have any ideas or superior tools we have missed please let us know in the comments.

The following two tabs change content below.
Rasmus H. Kirkegaard

Rasmus H. Kirkegaard

Post Doc
Playing with microbes and bioinformatics on the path towards "finished" genomes for "everyone" of them and rapid detection. My bet is currently on nanopore sequencing and I am fortunate to be involved in MinION and PromethION research in our lab.
Posted in Genomics.

One Comment

  1. Pingback: Nanopore Community Meeting 2017: Prologue – Albertsen Lab

Leave a Reply

Your email address will not be published. Required fields are marked *