We aR(10.)3 pretty close now!!!

There is no doubt that long-read DNA sequencing is transforming genome assembly for genomes at all scales due to its ability to span repetitive elements such as the rRNA genes. For years, DNA sequencing has meant choosing between low-accuracy (85-95%) long reads with systematic errors (PacBio, nanopore) and high-accuracy (>99.9%) short reads (Illumina), or playing a difficult game of combining the two data types in a hybrid approach to boost the accuracy of the contiguous assembly that can be achieved with long reads.

With recent updates, PacBio is finally able to produce High Fidelity (HiFi) reads (>99.9%) at much higher scale (15-30 Gbp/run). This is a game changer, as we can suddenly have the benefit of long reads without sacrificing accuracy or throughput. It makes the assembly problem a lot easier, because the criteria for overlapping reads can be much stricter than what is possible if your long reads have 5-20% errors. HiFi long reads potentially also enable the separate assembly of closely related genomes in metagenomes, which would be extremely difficult or impossible with error-prone reads. PacBio HiFi reads currently seem to sit on the throne if you need reference-quality genomes using a single type of sequencing technology. However, besides the risk of going out of business, PacBio suffers from being a platform with a huge capital investment and lab footprint, so owning one is not for #everyone (we ship DNA to a service provider).
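To put the point about stricter overlap criteria into rough numbers, here is a minimal sketch. It assumes that errors in two overlapping reads are independent and occur at the same per-base rate, and it ignores the small chance that two errors agree. The function name is hypothetical and not taken from any assembler.

```python
def expected_overlap_identity(read_error_rate: float) -> float:
    """Rough expected identity between two overlapping reads.

    Assumption: errors are independent and equally likely in both reads,
    so a position matches only if it is correct in both of them.
    """
    return (1.0 - read_error_rate) ** 2


# Illustrative error rates: 15% sits inside the 5-20% range mentioned above,
# 0.1% corresponds to >99.9% accurate HiFi reads.
for label, err in [("error-prone long reads (15% error)", 0.15),
                   ("HiFi reads (0.1% error)", 0.001)]:
    print(f"{label}: ~{expected_overlap_identity(err):.3f} expected overlap identity")
```

Under these assumptions, an overlapper working with error-prone reads has to accept overlaps at roughly 70-something percent identity, while with HiFi reads it can demand near-identical overlaps, which is what makes it plausible to keep closely related sequences apart.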

When it comes to affordability and mobility there is no real competition to the MinION DNA sequencer. For a long time nanopore has been the leader in long-read sequencing when it comes to read length and yield. However, the technology has suffered from a high error rate and, even worse, systematic errors in homopolymers that made perfect sequences through consensus polishing impossible. To get rid of the systematic errors in the nanopore platform, the basecalling algorithms, the chemistry and even the pores themselves are continuously being optimised. The recent release of a pore called “R10” promised to help with the issue but still left us with a need for hybrid polishing. Just half a year later they are back with “R10.3”, which hopefully gets us closer to bacterial genomes for #anyone.

Objective: Test whether R10.3 data can produce nanopore-only assemblies with an indel error rate on par with that of the hybrid approach or PacBio CCS (I finally got my hands on some data).

Strategy: Use E. coli K12 MG1655 reference strain data to produce a set of assemblies and evaluate these with different assembly metrics in comparison to the “true” reference.

Conclusion:

The best indel rate for a nanopore assembly has dropped from 34/100 kb to ~3/100 kb with the release of R10.3!!! This is an amazing jump in accuracy and means that most genes will now be perfect. There is still a little room for improvement before catching up with the PacBio CCS assemblies. However, if we consider each break in a genome assembly a severe deletion, the nanopore-only assembly has already surpassed the Illumina-only assembly in many cases! The seesaw is finally tipping in favour of long-read-only assemblies, and no one is going back once they see Q50+ assemblies. Longer homopolymers might still be a challenge (for now), but there is no doubt that these technologies are going to work on nailing those as well.

Assemblies:

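| Assembly | Read type | Contigs | Size (bp) | ANI (%) | ANI (Q-score) | QUAST mismatches per 100 kbp | # mismatch errors | QUAST indels per 100 kbp | # indel errors | Prokka # CDS |
|---|---|---|---|---|---|---|---|---|---|---|
| EcoliK12MG1655_reference | | 1 | 4639675 | 100 | Inf | 0 | 0 | 0 | 0 | 4300 |
| R10.3_miniasm+raconNP+medaka+raconILM | hybrid | 1 | 4638654 | 99.9978 | 47 | 0.28 | 13 | 0.47 | 22 | 4311 |
| R10.3_WTDBG2+raconNP+medaka+raconILM | hybrid | 6 | 4632844 | 99.9969 | 45 | 0.22 | 10 | 0.69 | 32 | 4302 |
| PB_HICANU | PB Hifi | 1 | 4657726 | 99.9955 | 43 | 0.06 | 3 | 0.11 | 5 | 4321 |
| PB_flye+racon1x | PB Hifi | 4 | 4645523 | 99.9955 | 43 | 0.19 | 9 | 0.13 | 6 | 4309 |
| PB_flye+racon2x | PB Hifi | 4 | 4645523 | 99.9955 | 43 | 0.19 | 9 | 0.13 | 6 | 4309 |
| PB_flye | PB Hifi | 4 | 4645710 | 99.9954 | 43 | 0.19 | 9 | 0.22 | 10 | 4308 |
| R10.3_CANU_trimmed+raconNP+medaka | nanopore | 1 | 4638992 | 99.9952 | 43 | 0.32 | 15 | 3.28 | 152 | 4360 |
| R10.3_FLYE+raconNP+medaka+raconILM | hybrid | 1 | 4638578 | 99.9952 | 43 | 0.13 | 6 | 0.63 | 29 | 4310 |
| R10.3_miniasm+raconNP+medaka | nanopore | 1 | 4638989 | 99.9942 | 42 | 0.88 | 41 | 4.55 | 211 | 4366 |
| R10.3_FLYE+raconNP2x+medaka2x | nanopore | 1 | 4638951 | 99.9938 | 42 | 0.32 | 15 | 3.21 | 149 | 4357 |
| R10.3_WTDBG2+raconNP+medaka | nanopore | 6 | 4633241 | 99.9934 | 42 | 0.78 | 36 | 3.81 | 177 | 4359 |
| R10.3_FLYE+raconNP+medaka | nanopore | 1 | 4639020 | 99.9932 | 42 | 0.39 | 18 | 3.88 | 180 | 4369 |
| R10.3_FLYE+raconNP2x+medaka | nanopore | 1 | 4638983 | 99.993 | 42 | 0.45 | 21 | 3.69 | 171 | 4364 |
| spades | illumina | 86 | 4560577 | 99.9923 | 41 | 2.24 | 104 | 0.13 | 6 | 4240 |
| R10.3_Unicycler | hybrid | 1 | 4620633 | 99.9908 | 40 | 1.47 | 68 | 0.17 | 8 | 4283 |
| R10.3_FLYE+raconNP2x | nanopore | 1 | 4638583 | 99.9648 | 35 | 11.17 | 518 | 16.17 | 750 | 4632 |
| R10.3_CANU_trimmed+raconNP | nanopore | 1 | 4638301 | 99.9633 | 34 | 11.47 | 532 | 20.31 | 942 | 4719 |
| R10.3_FLYE+raconNP | nanopore | 1 | 4638675 | 99.9632 | 34 | 10.97 | 509 | 17.61 | 817 | 4648 |
| R10.3_WTDBG2+raconNP | nanopore | 5 | 4633413 | 99.9583 | 34 | 11.06 | 513 | 22.9 | 1062 | 4769 |
| R10.3_CANU | nanopore | 1 | 4647804 | 99.9446 | 33 | 8.43 | 391 | 42.38 | 1966 | 5153 |
| R10.3_CANU_trimmed | nanopore | 1 | 4636879 | 99.9444 | 33 | 8.47 | 393 | 42.83 | 1987 | 5141 |
| R10.3_miniasm+raconNP | nanopore | 1 | 4638114 | 99.9435 | 32 | 14.79 | 686 | 37.7 | 1749 | 5030 |
| R10.3_WTDBG2 | nanopore | 6 | 4640365 | 99.8001 | 27 | 39.02 | 1810 | 160.54 | 7449 | 6533 |
| R10.3_FLYE | nanopore | 1 | 4652834 | 99.7478 | 26 | 6.51 | 302 | 243.66 | 11305 | 7658 |
| R10.3_miniasm | nanopore | 1 | 4609125 | 95.8061 | 14 | 2051.04 | 95162 | 1735.49 | 80521 | 9556 |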
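For readers who want to relate the columns to each other, here is a minimal sketch of the arithmetic. It assumes the Q-score column is the Phred-scaled ANI, Q = -10·log10(1 - ANI/100), and it approximates the per-100 kbp rates by dividing by the assembly size (QUAST itself normalises by aligned bases, so small deviations are expected). The function names are hypothetical helpers, not part of FastANI or QUAST.

```python
import math


def ani_to_qscore(ani_percent: float) -> float:
    """Phred-scale an identity percentage: Q = -10 * log10(1 - ANI/100)."""
    error_fraction = 1.0 - ani_percent / 100.0
    if error_fraction <= 0:
        return math.inf  # a perfect match has no finite Q-score
    return -10.0 * math.log10(error_fraction)


def errors_per_100kbp(n_errors: int, n_bases: int) -> float:
    """Normalise an error count to errors per 100 kbp.

    Assumption: n_bases is the assembly size; QUAST uses aligned bases.
    """
    return n_errors / n_bases * 100_000


# Values taken from two rows of the table above.
print(round(ani_to_qscore(99.9978)))                # hybrid miniasm row
print(round(errors_per_100kbp(152, 4_638_992), 2))  # CANU_trimmed+raconNP+medaka indels
```

With these definitions, a Q50 assembly corresponds to about one error per 100 kbp, which is a handy yardstick when reading the mismatch and indel columns.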

Methods:

Data:

Assemblers:

Polishing tools:

QC

Misc.

**BONUS** How well does R10.3+guppy HAC currently capture homopolymers?

To check this I used counterr. In the resulting R10.3 homopolymer length distribution, As and Ts are called pretty well up to length 9 (and E. coli does not seem to have longer stretches than this), while Cs and Gs are somewhat okay up to length 8.
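If you want to check how long the homopolymers in a reference actually get, here is a small sketch. It is independent of counterr, works on a plain sequence string, and leaves FASTA parsing out. The helper names and the toy sequence are made up for illustration.

```python
from collections import Counter
from itertools import groupby


def homopolymer_lengths(sequence: str) -> dict:
    """Count homopolymer runs per base as {base: Counter({run_length: count})}."""
    counts = {}
    # groupby yields one group per run of identical consecutive characters
    for base, run in groupby(sequence.upper()):
        counts.setdefault(base, Counter())[len(list(run))] += 1
    return counts


def longest_run(sequence: str) -> dict:
    """Longest homopolymer run observed for each base."""
    return {base: max(lengths) for base, lengths in homopolymer_lengths(sequence).items()}


# Toy sequence (not real E. coli data): runs of T x4, A x9, G x3, C x2.
toy = "ACG" + "T" * 4 + "A" * 9 + "G" * 3 + "C" * 2
print(longest_run(toy))
```

Running the same functions on the concatenated reference sequence would show the longest stretch per base, which is the number to compare against the lengths where basecalling starts to struggle.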

If you have any ideas or superior tools we have missed please let us know in the comments.

Rasmus H. Kirkegaard

Staff scientist
Playing with microbes and bioinformatics on the path towards "finished" genomes for "everyone" of them and rapid detection. My bet is currently on nanopore sequencing and I am fortunate to be involved in MinION and PromethION research in our lab.

2 Comments

  1. Out of curiosity I ran tip Canu since 1.9 had only preliminary HiFi support, using only the first 15,000 reads in the input, defaults with -pacbio-hifi. FastANI reports the ANI to the reference as 99.9995 (Q53).

    I also mapped the HiFi reads and the assembly to the reference using MUMmer 4. There are very few differences (0 SNPs, 6 indels, 2 structural differences) and most are supported by the read mappings (that is the sequenced sample doesn’t match the reference). You can see one of the structural differences as an IGV screenshot here: https://obj.umiacs.umd.edu/sergek/seqdata/ecoli_HiFi/igv_snapshot.png. All the reads are calling an insertion like the assembly. At least two indels look like strain differences too: https://obj.umiacs.umd.edu/sergek/seqdata/ecoli_HiFi/igv_snapshot2.png while some others look like minor frequency variants. So I’d estimate maybe 4 true differences which gives a QV over 60. The asm is available here: https://obj.umiacs.umd.edu/sergek/seqdata/ecoli_HiFi/ecoli_k12_HiFi.fasta.gz if you want to run more validation on it.

    That’s not to say HiFi is without its own issues: there are some coverage biases now, and I’m sure R10.3 is much improved over R9.4. But I think the HiFi Canu assembly is close to perfect, and I expect it to be more accurate than the R10.3 or the R10.3 + Illumina polished result.

    • Awesome news! I did run the PB HiFi assembly with the HICANU branch that I downloaded some time ago. It is really nice and a lot faster than when running assemblies with error prone reads 😀

      I completely agree that PB HiFi data is the leader at the moment when it comes to assembly quality and was happy to finally get my hands on some public HiFi data from this strain (thanks to the PacBio team for uploading). I will try to rerun my assembly with the newest version of CANU on github (looking forward to CANU 2.0 :D). I think this field will be extremely interesting to follow in the coming years.

      You are right that the reference might not be representative anymore for benchmarking as we approach a limit where we see true differences for “E Coli K12MG1655” rather than assembly errors. To take this further I would probably have to sequence the exact same DNA rather than rely on public data + a “reference” genome.

      Agree that short-read polishing is far from ideal. That is the main reason why I keep looking at this for every release of ONT data: to keep an eye out for the point where it is perfect, or good enough that we will no longer need short reads. I find it likely that polishing assemblies with short reads could introduce new errors in repeat regions (or miss polishing some copies completely), but I have not looked much at it yet.
