There is no doubt that long read DNA sequencing is transforming genome assembly for genomes at all scales, thanks to its ability to span repetitive elements such as the rRNA genes. For years DNA sequencing has been a choice between low accuracy (85-95%) long reads with systematic errors (PacBio, Nanopore) and high accuracy (>99.9%) short reads (Illumina), or a difficult game of combining the two data types in a hybrid approach to boost the accuracy of the contiguous assemblies that long reads make possible. With recent updates PacBio is finally able to produce HIgh FIdelity (HiFi) reads (>99.9%) at much higher scale (15-30 Gbp/run). This is a game changer, as we can suddenly have the benefit of long reads without sacrificing accuracy or throughput. It makes the assembly problem a lot easier, because the criteria for overlapping reads can be much stricter than what is possible when your long reads have 5-20% errors. HiFi long reads potentially also enable the separate assembly of closely related genomes in metagenomes, which would be extremely difficult or impossible with error prone reads. PacBio HiFi reads currently seem to sit on the throne if you need reference quality genomes from a single sequencing technology. However, besides the risk of going out of business, PacBio suffers from being a platform with a huge capital investment and lab footprint, so owning one is not for #everyone (we ship DNA to a service provider).
When it comes to affordability and mobility there is no real competition to the MinION DNA sequencer. For a long period Nanopore has been the leader of long read sequencing when it comes to read length and yield. However, the technology has suffered from a high error rate and, even worse, systematic errors in homopolymers that made perfect sequences through consensus polishing impossible. To get rid of the systematic errors in the Nanopore platform, the basecalling algorithms, the chemistry and even the pores themselves are continuously being optimised. The recent release of a pore called “R10” was promised to help with the issue but still left us with a need for hybrid polishing. Just half a year later they are back with “R10.3”, which hopefully gets us closer to bacterial genomes for #anyone.
Objective: Test if R10.3 data can produce Nanopore only assemblies with an indel error rate on par with that of the hybrid approach or PacBio CCS (I finally got my hands on some data).
Strategy: Use E. coli K12 MG1655 reference strain data to produce a set of assemblies and evaluate these with different assembly metrics in comparison to the “true” reference.
The best indel rate for a Nanopore assembly has dropped from 34/100 kb to ~3/100 kb with the release of R10.3!!! This is an amazing jump in accuracy and means that most genes will now be perfect. There is still a little room for improvement before catching up with the PacBio CCS assemblies. However, if we consider each break in a genome assembly a severe deletion, the Nanopore only assembly has already surpassed the Illumina only assembly in many cases! The seesaw is finally tipping in favour of long read only assemblies, and no-one is going back once they see Q50+ assemblies. Longer homopolymers might still be a challenge (for now), but there is no doubt that these technologies are going to work on nailing those as well.
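To put the indel rates above on the same Phred scale as the Q-scores, here is a minimal sketch. It assumes the usual definition Q = -10·log10(error rate) and counts indels only, so the result is an indel-only Q-score and not the ANI-based Q-score reported in the table. The function name is made up for illustration.

```python
import math


def phred_from_errors_per_100kb(errors_per_100kb: float) -> float:
    """Convert an error density (errors per 100 kbp) to a Phred-scaled Q-score.

    Hypothetical helper: assumes Q = -10 * log10(per-base error rate).
    """
    if errors_per_100kb <= 0:
        raise ValueError("error density must be positive to compute a Q-score")
    per_base_error_rate = errors_per_100kb / 100_000
    return -10 * math.log10(per_base_error_rate)


if __name__ == "__main__":
    # Indel densities quoted in the post (before and after R10.3).
    for label, indels in [("before R10.3", 34), ("with R10.3", 3)]:
        q = phred_from_errors_per_100kb(indels)
        print(f"{label}: {indels} indels/100 kbp is roughly Q{q:.1f} (indels only)")
```

On this scale one error per 100 kbp corresponds to Q50, which is a handy anchor when reading the numbers.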
|Contigs|Size (bp)|ANI (%)|ANI (Q-score)|QUAST (per 100 kbp)|
- Nanopore R10.3 basecalled with guppy 3.4.5 in high accuracy mode (250002227 bp, ~50x coverage). Raw fast5s are available at ENA under PRJEB36648
- PacBio CCS from [SRR10971019]
- Illumina from [SRR2627175] (227773707 bp, ~49x coverage)
- Reference assembly: [U00096.2]
- [Medaka (v. 0.11.5)]
- [racon (v. 1.3.3)]
- I intentionally left out nanopolish; see the blog post by Ryan Wick
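The coverage figures quoted for the read sets in the list above are simply total bases divided by genome length. A minimal sketch, assuming a reference length of 4,639,675 bp for U00096.2 (that length is my assumption, it is not stated in the post) and using hypothetical names:

```python
# Assumed length of the U00096.2 reference in bp (not stated in the post).
ASSUMED_REFERENCE_LENGTH_BP = 4_639_675


def mean_coverage(total_bases: int, genome_length: int = ASSUMED_REFERENCE_LENGTH_BP) -> float:
    """Mean depth of coverage: total sequenced bases divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome length must be positive")
    return total_bases / genome_length


if __name__ == "__main__":
    # Base counts as quoted in the data list.
    datasets = {
        "Nanopore R10.3": 250_002_227,
        "Illumina": 227_773_707,
    }
    for name, bases in datasets.items():
        print(f"{name}: {mean_coverage(bases):.1f}x")
```

The result depends directly on the genome size you plug in, so a rounded size such as 5 Mbp will give somewhat lower numbers than the exact reference length.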
**BONUS** How well does R10.3+guppy HAC currently capture homopolymers?
To check this I used counterr and the results are presented here:
R10.3 dist len hp (As & Ts are called pretty well up to length 9, and E. coli does not seem to have longer stretches than this; Cs & Gs are somewhat okay up to length 8)
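To check how long the homopolymer stretches in a genome actually get, the runs can be tallied directly from the sequence. This is a minimal sketch and not what counterr does (counterr compares reads against the reference); the function name and toy sequence are made up for illustration.

```python
from collections import Counter
from itertools import groupby


def homopolymer_lengths(sequence: str, min_length: int = 2) -> dict:
    """Count homopolymer runs per base and run length in a DNA sequence.

    Returns {base: Counter({run_length: number_of_runs})} for A, C, G and T.
    """
    counts = {}
    for base, run in groupby(sequence.upper()):
        length = sum(1 for _ in run)
        if base in "ACGT" and length >= min_length:
            counts.setdefault(base, Counter())[length] += 1
    return counts


if __name__ == "__main__":
    # Toy sequence: A x4, C x2, a single G, T x9, G x3, a single A.
    toy = "A" * 4 + "CC" + "G" + "T" * 9 + "GGG" + "A"
    for base, runs in sorted(homopolymer_lengths(toy).items()):
        print(base, "longest run:", max(runs), "all runs:", dict(runs))
```

Run on the reference FASTA sequence instead of the toy string, this gives the longest stretch per base, which is the upper bound any basecaller has to handle for that genome.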
If you have any ideas or superior tools we have missed please let us know in the comments.
Rasmus H. Kirkegaard
Out of curiosity I ran tip Canu since 1.9 had only preliminary HiFi support, using only the first 15,000 reads in the input, defaults with -pacbio-hifi. FastANI reports the ANI to the reference as 99.9995 (Q53).
I also mapped the HiFi reads and the assembly to the reference using MUMmer 4. There are very few differences (0 SNPs, 6 indels, 2 structural differences) and most are supported by the read mappings (that is the sequenced sample doesn’t match the reference). You can see one of the structural differences as an IGV screenshot here: https://obj.umiacs.umd.edu/sergek/seqdata/ecoli_HiFi/igv_snapshot.png. All the reads are calling an insertion like the assembly. At least two indels look like strain differences too: https://obj.umiacs.umd.edu/sergek/seqdata/ecoli_HiFi/igv_snapshot2.png while some others look like minor frequency variants. So I’d estimate maybe 4 true differences which gives a QV over 60. The asm is available here: https://obj.umiacs.umd.edu/sergek/seqdata/ecoli_HiFi/ecoli_k12_HiFi.fasta.gz if you want to run more validation on it.
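For readers who want to reproduce the Q and QV figures in this comment, a minimal sketch follows. It assumes the Phred definition Q = -10·log10(error rate) and an assembly length of 4,639,675 bp (my assumption for the U00096.2 reference, not a number from the comment); the function names are hypothetical.

```python
import math

# Assumed genome length in bp (not stated in the comment).
ASSUMED_GENOME_LENGTH_BP = 4_639_675


def q_from_identity_percent(identity_percent: float) -> float:
    """Phred-scaled Q-score from a percent identity such as a FastANI value."""
    error_rate = 1 - identity_percent / 100
    if error_rate <= 0:
        raise ValueError("identity must be below 100% to compute a finite Q-score")
    return -10 * math.log10(error_rate)


def q_from_differences(n_differences: int, length_bp: int = ASSUMED_GENOME_LENGTH_BP) -> float:
    """Phred-scaled QV from a count of differences over a sequence length."""
    if n_differences <= 0 or length_bp <= 0:
        raise ValueError("need a positive difference count and length")
    return -10 * math.log10(n_differences / length_bp)


if __name__ == "__main__":
    print(f"ANI 99.9995% gives about Q{q_from_identity_percent(99.9995):.0f}")
    print(f"4 differences over the genome give about QV{q_from_differences(4):.1f}")
```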
That’s not to say HiFi is without its own issues; there are some coverage biases now, and I’m sure R10.3 is much improved over R9.4. But I think the HiFi Canu assembly is close to perfect, and I expect it to be more accurate than the R10.3 or the R10.3 + Illumina polished result.
Awesome news! I did run the PB HiFi assembly with the HICANU branch that I downloaded some time ago. It is really nice and a lot faster than when running assemblies with error prone reads 😀
I completely agree that PB HiFi data is the leader at the moment when it comes to assembly quality and was happy to finally get my hands on some public HiFi data from this strain (thanks to the PacBio team for uploading). I will try to rerun my assembly with the newest version of CANU on github (looking forward to CANU 2.0 :D). I think this field will be extremely interesting to follow in the coming years.
You are right that the reference might not be representative anymore for benchmarking, as we approach a limit where we see true differences for “E. coli K12 MG1655” rather than assembly errors. To take this further I would probably have to sequence the exact same DNA rather than rely on public data + a “reference” genome.
Agree that short read polishing is far from ideal. That is the main reason why I keep looking at this for every release of ONT data: to keep an eye out for the point where it is perfect, or good enough that we will no longer need short reads. I find it likely that polishing assemblies with short reads could introduce new errors in repeat regions (or miss polishing some copies completely), but I have not looked much at it yet.