What is a good genome assembly?

With short reads you will often get fragmented but high single base accuracy assemblies and with long error prone reads you can get a single contig assembly with just a few ultra long reads but with a lot of errors due to the lower read quality. Some of these errors can be fixed by adding more coverage and polishing the assembly. Another approach can be to use both technologies in combination and one can fairly easy produce a single contig assembly. A good assembly should be in as many pieces as the original genetic elements they represent (one contig – one chromosome) but to allow gene calling, genome alignments single base accuracy is also essential. There are many genome assemblers, polishing tools etc. that will help you make a genome assembly but how do you know if you have a good assembly? To test this people have developed tools to calculate the AverageNucleotideIdentity (ANI), estimate genome completeness, assess the quality of genome assemblies compared to a reference. However, it may also be useful to use annotation tools to assess whether genes can be called correctly. The genome assembly, polishing, and annotation strategy is an ongoing discussion in the scientific community on twitter.

To decide which strategy should be our “preferred” genome assembly approach based on data rather than my gut-feeling about the “best assembly” I decided to do some testing with a known “true” reference E Coli K12 MG1655 (U00096.2).

Objective: develop a strategy to compare genome assemblies

Strategy: Use E. Coli K12 MG1655 reference strain data to produce a set assemblies and evaluate these with different assembly metrics in comparison to the “true” reference

Data:

	Mbp	X coverage	Ressource
Nanopore data	183	40	in house E. Coli K12 MG1655 grown and prepped for 1D R9.4 with RAD002	(a subset of our total run to get a more realistic view than with 7+Gbp data for a bacterial genome)
Illumina data	270	58	ENA: SRR2627175	(I used a subset of 1095594 reads)

Assemblies:

	Read type	Contigs	Size (bp)	Relative size	ANI (%)	CheckM completeness	Prokka # CDS	Prokka # rRNA	Prokka # tRNA	Median CDS size	QUAST NGA50	QUAST mismatches per 100kb	QUAST indels per 100kb
Spades	Short	84	4546220	0.98	100	99.91	4235	11	87	810	132700	0.42	0.09
Spades-hybrid	Hybrid	1	4620377	0.996	100	99.91	4282	22	88	815	2828879	5.91	0.43
Miniasm	Long	1	4410101	0.951	83.95	0.00	1124	16	15	206	NA	NA	NA
Miniasm +1xRacon	Long	1	4619818	0.996	99.02	70.87	10192	22	80	287	4618042	239.94	541.42
Miniasm +2xRacon	Long	1	4622021	0.996	99.23	74.97	9626	22	82	308	4620219	222.51	449.98
Miniasm +2xRacon +Pilon	Hybrid	1	4622021	0.996	99.96	98.51	4509	22	88	761	3590212	10.35	19.19
CANU	Long	1	4533574	0.977	98.69	50.00	10478	21	73	263	4531613	113.53	662.93
CANU +Nanopolish	Long	1	4563152	0.984	99.27	71.79	9664	22	84	296	4561292	146.74	468.02
CANU +Nanopolish +Pilon	Hybrid	1	4567411	0.984	99.97	98.41	4415	22	87	770	1329524	6.17	15.79
Unicycler	Hybrid	1	4633976	0.999	99.99	99.91	4305	22	88	815	2471807	4.28	2.25
Reference		1	4639675	1	100	99.97	4300	22	88	818	4639675	0	0

Conclusion:

The long read only assemblies are nice contiguous assemblies with approximately the same length as the reference. However, the genes are broken due to indel errors present in the reads as is seen in the high number of CDS detected and the low median CDS size. Polishing the assemblies with long reads improves the ANI value a lot but the genes are still fragmented. Polishing with short reads is needed to get the number of CDS and median CDS size in the range of that of the reference. The short read only assembly has a high sequence identity with the reference but is fragmented and cannot recreate the repeat structure of the genome. The number of CDS is lower than that of the reference and the rRNA genes, which are known to be very similar if not identical, are messed up and co-assembled. The unicycler assembly seems to deliver the best metrics across the board until we come to the NGA50 value reported by QUAST which was surprisingly low. I used mummer to visualise the alignment of the assembly against the reference genome to investigate if there was a mis-assembly causing this. However, the figure below shows no big inversions or anything, so I guess that for now we will stick to UNIcycler assemblies that combine the benefit of the short read accuracy with the power of long reads.

Dotplot of unicycler assembly aligned to the reference genome

Methods:

I have used the following assemblers

Spades (v. 3.10.1)
CANU (v. 1.6)
Unicycler (v. v0.4.1)
Miniasm (v. 0.2-r137-dirty)

I have used the following mappers

minimap2 (v. 2.0rc1-r232)
minimap (v. 0.2-r124-dirty)
bwa (v. 0.7.12-r1039)

I have used the following polishing tools

Racon (v. not available)
Pilon (v. 1.18)
Nanopolish (v. 0.8.3)

I have used the following tools to assess genome assembly characteristics

ANI.pl (https://github.com/chjp/ANI)
CheckM (v. 1.0.7)
Prokka (v. 1.12)
QUAST (v. 2.3)
mummer (v. not available)

If you have any ideas or superior tools we have missed please let us know in the comments.

Bio
Latest Posts

Rasmus H. Kirkegaard

Staff scientist

Playing with microbes and bioinformatics on the path towards "finished" genomes for "everyone" of them and rapid detection. My bet is currently on nanopore sequencing and I am fortunate to be involved in MinION and PromethION research in our lab.

Latest posts by Rasmus H. Kirkegaard (see all)

We aR(10.)3 pretty close now!!! - February 10, 2020
AR(10)E we there yet? - September 2, 2019
Why is it important to remove short molecules? - January 15, 2019

3 Comments

Pingback: Nanopore Community Meeting 2017: Prologue – Albertsen Lab
Rasmus H. Kirkegaard

July 3, 2018 at 8:36 am Reply

Inspired by a blogpost by Mick Watson (http://www.opiniomics.org/on-stuck-records-and-indel-errors-or-stop-publishing-bad-genomes/) I tested some of these assemblies using his strategy for determining whether the genes are fragmented in the different assembly strategies: https://twitter.com/kirk3gaard/status/1014057842700152843
Pingback: AR(10)E we there yet? – Albertsen Lab