There is no doubt that long-read DNA sequencing is transforming genome assembly for genomes at all scales, thanks to its ability to span repetitive elements such as the rRNA genes. For years, DNA sequencing has been a choice between low-accuracy (85-95%) long reads with systematic errors (PacBio, Nanopore) and high-accuracy (>99.9%) short reads (Illumina), or a difficult game of combining the two data types in a hybrid approach to boost the accuracy of the contiguous assembly that long reads make possible. With recent updates, PacBio is finally able to produce High Fidelity (HiFi) reads (>99.9%) at much higher scale (15-30 Gbp/run). This is a game changer, as we can suddenly have the benefit of long reads without sacrificing accuracy or throughput. It makes the assembly problem a lot easier, because the criteria for overlapping reads can be much stricter than what is possible if your long reads have 5-20% errors. HiFi long reads potentially also enable the separate assembly of closely related genomes in metagenomes, which would be extremely difficult or impossible with error-prone reads. PacBio HiFi reads currently seem to sit on the throne if you need reference-quality genomes from a single sequencing technology. However, besides the risk of going out of business, PacBio suffers from being a platform with a huge capital investment and lab footprint, so owning one is not for #everyone (we ship DNA to a service provider).
When it comes to affordability and mobility, there is no real competition to the MinION DNA sequencer. For a long period, Nanopore has been the leader of long-read sequencing when it comes to read length and yield. However, the technology has suffered from a high error rate and, even worse, systematic errors in homopolymers that made perfect sequences through consensus polishing impossible. To get rid of the systematic errors in the Nanopore platform, the basecalling algorithms, the chemistry and even the pores themselves are continuously being optimised. The recent release of a pore called “R10” was promised to help with the issue but still left us with a need for hybrid polishing. Just half a year later they are back with “R10.3”, which hopefully gets us closer to bacterial genomes for #anyone.
Objective: Test if R10.3 data can produce Nanopore-only assemblies with an indel error rate on par with that of the hybrid approach or PacBio CCS (I finally got my hands on some data)
Strategy: Use E. coli K12 MG1655 reference strain data to produce a set of assemblies and evaluate these with different assembly metrics in comparison to the “true” reference
The best indel rate for a Nanopore assembly has dropped from 34/100kb to ~3/100kb with the release of R10.3!!! This is an amazing jump in accuracy and means that most genes will now be perfect. There is still a little room for improvement before catching up with the PacBio CCS assemblies. However, if we consider each break in a genome assembly a severe deletion, the Nanopore-only assembly has already surpassed the Illumina-only assembly in many cases! The seesaw is finally tipping in favour of long-read-only assemblies and no-one is going back once they see Q50+ assemblies. Longer homopolymers might still be a challenge (for now) but there is no doubt that these technologies are going to work on nailing those as well.
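To put the indel rates above on the same scale as the Q-scores mentioned, here is a minimal sketch of the standard Phred conversion, Q = -10 * log10(error rate). The function name is made up for illustration, and the sketch counts indels only, so it is an upper bound on the assembly Q-score rather than the number an evaluation tool would report once mismatches are included.

```python
import math


def phred_from_error_rate(errors: float, per_bases: float = 100_000) -> float:
    """Phred-scaled quality for `errors` errors per `per_bases` bases.

    Hypothetical helper for illustration; not taken from any tool used in this post.
    """
    if errors <= 0:
        raise ValueError("errors must be positive; zero errors has no finite Q-score")
    return -10 * math.log10(errors / per_bases)


if __name__ == "__main__":
    # Indel rates quoted in the text (indels per 100 kb).
    for label, indels in [("before R10.3", 34), ("with R10.3", 3)]:
        q = phred_from_error_rate(indels)
        print(f"{label}: {indels} indels/100 kb -> Q{q:.1f}")
```

By the same formula, Q50 corresponds to one error per 100 kb, which is why a rate of ~3/100kb leaves "a little room for improvement".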
| Contigs | Size (bp) | ANI (%) | ANI (Q-score) | QUAST (per 100 kbp) |
- Some Nanopore R10.3 data basecalled with guppy 3.4.5 in high accuracy mode (250002227 bp, ~50x coverage). Raw fast5s are available at ENA PRJEB36648
- PacBio CCS from [SRR10971019]
- Illumina from [SRR2627175] (227773707 bp, ~49x coverage)
- Reference assembly: [U00096.2]
- [Medaka (v. 0.11.5)]
- [racon (v. 1.3.3)]
- I intentionally left out nanopolish; see the blog post by Ryan Wick
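The coverage figures in the data list are simply total sequenced bases divided by genome size. A rough sketch of that arithmetic follows; the genome size constant is an assumption (an approximate value for E. coli K12 MG1655, not a number stated in this post), so the result is only a ballpark and should be recomputed against the exact length of the reference you use.

```python
# Assumption: approximate E. coli K12 MG1655 genome size in bp.
# Replace with the exact length of your reference sequence.
ASSUMED_GENOME_SIZE_BP = 4_640_000


def mean_coverage(total_bases: int, genome_size_bp: int = ASSUMED_GENOME_SIZE_BP) -> float:
    """Mean depth of coverage: sequenced bases divided by genome length."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases / genome_size_bp


if __name__ == "__main__":
    # Base counts as listed for the two datasets above.
    datasets = {"Nanopore R10.3": 250_002_227, "Illumina": 227_773_707}
    for name, bases in datasets.items():
        print(f"{name}: ~{mean_coverage(bases):.0f}x")
```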
**BONUS** How well does R10.3+guppy HAC currently capture homopolymers?
To check this I used counterr and the results are presented here:
R10.3 dist len hp (As & Ts are called pretty well up to length 9 (and E. coli does not seem to have longer stretches than this), Cs & Gs are somewhat okay up to length 8)
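The remark that E. coli does not seem to have longer stretches than 9 can be checked directly on the reference sequence. This is a small sketch of one way to do it, not what counterr does internally: it tallies homopolymer run lengths per base with `itertools.groupby`. The function names and the toy sequence are made up; loading the real reference FASTA into a string is left out.

```python
from collections import Counter
from itertools import groupby


def homopolymer_runs(seq: str) -> Counter:
    """Count homopolymer runs as {(base, run_length): occurrences}."""
    runs = Counter()
    for base, group in groupby(seq.upper()):
        runs[(base, sum(1 for _ in group))] += 1
    return runs


def longest_run_per_base(seq: str) -> dict:
    """Longest homopolymer run observed for each base in `seq`."""
    longest = {}
    for base, length in homopolymer_runs(seq):
        longest[base] = max(longest.get(base, 0), length)
    return longest


if __name__ == "__main__":
    toy = "AAACCGTTTTA"  # made-up sequence, stands in for the reference genome
    print(longest_run_per_base(toy))
```

Run on the full reference, `longest_run_per_base` gives the ceiling that any homopolymer evaluation can reach for that genome, so a plot that stops at length 9 may reflect the genome rather than the basecaller.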
If you have any ideas or superior tools we have missed please let us know in the comments.