With short reads you will often get fragmented assemblies with high single-base accuracy, whereas with long, error-prone reads you can get a single-contig assembly from just a few ultra-long reads, but with many errors due to the lower read quality. Some of these errors can be fixed by adding more coverage and polishing the assembly. Another approach is to use both technologies in combination, which makes it fairly easy to produce a single-contig assembly. A good assembly should be in as many pieces as the original genetic elements it represents (one contig per chromosome), but single-base accuracy is also essential for gene calling and genome alignments. There are many genome assemblers, polishing tools, etc. that will help you make a genome assembly, but how do you know if you have a good assembly? To test this, people have developed tools to calculate the Average Nucleotide Identity (ANI), estimate genome completeness, and assess the quality of genome assemblies compared to a reference. It may also be useful to use annotation tools to assess whether genes can be called correctly. Which genome assembly, polishing, and annotation strategy to use is the subject of an ongoing discussion in the scientific community on Twitter.
To decide which strategy should be our “preferred” genome assembly approach based on data rather than my gut feeling about the “best assembly”, I decided to do some testing with a known “true” reference, E. coli K12 MG1655 (U00096.2).
Objective: develop a strategy to compare genome assemblies
Strategy: Use E. coli K12 MG1655 reference strain data to produce a set of assemblies and evaluate these with different assembly metrics in comparison to the “true” reference
|Nanopore data||183||40||in-house E. coli K12 MG1655 grown and prepped for 1D R9.4 with RAD002||(a subset of our total run to get a more realistic view than with 7+ Gbp of data for a bacterial genome)|
|Illumina data||270||58||ENA: SRR2627175||(I used a subset of 1095594 reads)|
|Assembly||Read type||Contigs||Size (bp)||Relative size||ANI (%)||CheckM completeness||Prokka # CDS||Prokka # rRNA||Prokka # tRNA||Median CDS size||QUAST NGA50||QUAST mismatches per 100kb||QUAST indels per 100kb|
|Miniasm +2xRacon +Pilon||Hybrid||1||4622021||0.996||99.96||98.51||4509||22||88||761||3590212||10.35||19.19|
|CANU +Nanopolish +Pilon||Hybrid||1||4567411||0.984||99.97||98.41||4415||22||87||770||1329524||6.17||15.79|
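As a rough illustration of how two of the table columns relate to each other, the sketch below recomputes the relative size and converts the QUAST per-100 kb rates into approximate absolute error counts. The reference length of 4,639,675 bp for U00096.2 is an assumption on my part (check it against your own FASTA), the function names are made up for this example, and scaling the rates by assembly size is only an approximation, since QUAST reports them relative to aligned bases.

```python
# Assumed length of the E. coli K12 MG1655 reference (U00096.2) in bp.
# Verify against the reference FASTA actually used.
REFERENCE_LENGTH_BP = 4_639_675

# Values copied from the two hybrid rows of the table above.
assemblies = {
    "Miniasm +2xRacon +Pilon": {
        "size_bp": 4_622_021,
        "mismatches_per_100kb": 10.35,
        "indels_per_100kb": 19.19,
    },
    "CANU +Nanopolish +Pilon": {
        "size_bp": 4_567_411,
        "mismatches_per_100kb": 6.17,
        "indels_per_100kb": 15.79,
    },
}


def relative_size(size_bp, reference_bp=REFERENCE_LENGTH_BP):
    """Assembly size as a fraction of the reference size."""
    return size_bp / reference_bp


def approx_total_errors(rate_per_100kb, size_bp):
    """Rough absolute error count from a per-100 kb rate.

    Approximation: uses the assembly size, not the aligned length.
    """
    return rate_per_100kb * size_bp / 100_000


for name, m in assemblies.items():
    rel = relative_size(m["size_bp"])
    mism = approx_total_errors(m["mismatches_per_100kb"], m["size_bp"])
    indel = approx_total_errors(m["indels_per_100kb"], m["size_bp"])
    print(f"{name}: relative size {rel:.3f}, "
          f"~{mism:.0f} mismatches, ~{indel:.0f} indels")
```

Seen this way, even the polished hybrid assemblies still carry on the order of several hundred small errors genome-wide, which is why the per-gene metrics from annotation are a useful complement.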
The long-read-only assemblies are nicely contiguous and have approximately the same length as the reference. However, the genes are broken by indel errors present in the reads, as is seen in the high number of CDS detected and the low median CDS size. Polishing the assemblies with long reads improves the ANI value a lot, but the genes are still fragmented. Polishing with short reads is needed to bring the number of CDS and the median CDS size into the range of the reference.
The short-read-only assembly has a high sequence identity with the reference but is fragmented and cannot recreate the repeat structure of the genome. The number of CDS is lower than that of the reference, and the rRNA genes, which are known to be very similar if not identical, are collapsed and co-assembled.
The Unicycler assembly seems to deliver the best metrics across the board until we come to the NGA50 value reported by QUAST, which was surprisingly low. I used MUMmer to visualise the alignment of the assembly against the reference genome to investigate whether a mis-assembly was causing this. However, the figure below shows no large inversions or other rearrangements, so for now we will stick to Unicycler assemblies, which combine the accuracy of short reads with the power of long reads.
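The CDS count and median CDS size used above can be recomputed from an annotation file. The following is a minimal sketch, assuming a GFF3 file such as the one Prokka writes, with tab-separated columns, 1-based inclusive coordinates, and an optional `##FASTA` section at the end. The function name `cds_stats` and the toy records are hypothetical and only serve the illustration. This is not necessarily how the numbers in the table were produced.

```python
from statistics import median


def cds_stats(gff_lines):
    """Count CDS features and compute their median length in bp.

    gff_lines: any iterable of GFF3 lines (an open file works too).
    """
    lengths = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        start, end = int(fields[3]), int(fields[4])
        lengths.append(end - start + 1)  # GFF3 is 1-based, inclusive
    return {
        "n_cds": len(lengths),
        "median_cds_bp": median(lengths) if lengths else None,
    }


# Toy records, made up for this example.
example_gff = [
    "##gff-version 3",
    "contig_1\tProdigal\tCDS\t1\t300\t.\t+\t0\tID=gene_1",
    "contig_1\tProdigal\tCDS\t400\t1299\t.\t-\t0\tID=gene_2",
    "contig_1\tAragorn\ttRNA\t1400\t1475\t.\t+\t.\tID=gene_3",
    "contig_1\tProdigal\tCDS\t1500\t2099\t.\t+\t0\tID=gene_4",
    "##FASTA",
    ">contig_1",
]

example_result = cds_stats(example_gff)
print(example_result)
```

An indel-rich assembly would show up here as a higher `n_cds` and a lower `median_cds_bp` than the reference annotation, because frameshifts split genes into several shorter open reading frames.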
I have used the following assemblers
I have used the following mappers
I have used the following polishing tools
I have used the following tools to assess genome assembly characteristics
- ANI.pl (https://github.com/chjp/ANI)
- CheckM (v. 1.0.7)
- Prokka (v. 1.12)
- QUAST (v. 2.3)
- MUMmer (v. not available)
If you have any ideas or know of superior tools we have missed, please let us know in the comments.