AR(10)E we there yet?

Genome assembly has been a major challenge ever since the first methods for analysing DNA sequences saw the light of day, primarily because we have been unable to read long fragments of DNA. With the explosion in the market for long-read sequencing technologies (PacBio, Nanopore, 10X, Longas, etc.), there is now hope that assembly will soon be a solved problem, at least for simple genomes. However, some of these long-read platforms struggle with a high error rate and, even worse, with systematic errors in homopolymers. These cause insertions and deletions that result in incomplete genes. To mitigate the problem, a hybrid approach that adds short, high-accuracy reads has been needed, but this increases price and complexity and is generally just a bit annoying.

To get rid of the systematic errors on the nanopore platform (the platform with the longest reads reported so far), the company is continually modifying not only the basecalling algorithms but also the chemistry and the pores themselves. The recently released pore called “R10” promised to help with the issue and will hopefully allow us to drop the short-read polishing.

Objective: Test whether R10 data can produce nanopore-only assemblies with an indel error rate on par with that of the hybrid approach.

Strategy: Use data from the E. coli K12 MG1655 reference strain to produce a set of assemblies and evaluate them with different assembly metrics against the “true” reference.

Conclusion:

With the release of R10, the best indel rate for a nanopore-only assembly has dropped from 51 to 34 per 100 kb. However, that is still more than 34 times higher than the best hybrid approaches, and it inflates the number of predicted genes (Prokka CDS) to 816 more than in the reference. This indicates that the current version of R10 data is insufficient to produce nanopore-only assemblies of sufficient quality to allow annotation.
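
The numbers behind this conclusion can be reproduced with a few lines of arithmetic. The sketch below uses indel rates and CDS counts taken from the assembly table in this post; the variable and function names are only illustrative. Note that the fold difference depends a lot on which hybrid assembly is used as the baseline.

```python
# Values copied from the assembly table in this post.
REFERENCE_CDS = 4300       # Prokka CDS in the reference
BEST_R94_INDELS = 51.0     # indels per 100 kb, NPR94_CANUtrimmed+raconNP+medakax2
BEST_R10_INDELS = 34.1     # indels per 100 kb, NPR10_CANUtrimmed+raconNP+medakax2
BEST_R10_CDS = 5116        # Prokka CDS for that R10 assembly

# Indel rates (per 100 kb) of two hybrid assemblies, for comparison.
HYBRID_INDELS = {
    "NPR10_Unicycler": 0.2,
    "NPR10_CANUtrimmed+raconNP+medakax2+raconILM": 0.9,
}


def fold_difference(rate, baseline):
    """Return how many times higher `rate` is than `baseline`."""
    return rate / baseline


# Relative improvement of the best R10 assembly over the best R94 assembly.
relative_drop = 1 - BEST_R10_INDELS / BEST_R94_INDELS

# Extra predicted genes compared to the reference annotation.
excess_cds = BEST_R10_CDS - REFERENCE_CDS

print(f"R10 vs R94 indel rate: {relative_drop:.0%} lower")
print(f"Excess predicted CDS: {excess_cds}")
for name, rate in HYBRID_INDELS.items():
    fold = fold_difference(BEST_R10_INDELS, rate)
    print(f"Nanopore-only R10 has about {fold:.0f}x the indel rate of {name}")
```

Against the nanopore assembly polished with Illumina reads the gap is a bit under 40x, while against Unicycler it is well over 100x.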

Assemblies:

| Assembly | Read type | Contigs | Size (bp) | ANI (%) | QUAST mismatches per 100kb | Equivalent to ~mismatch errors | QUAST indels per 100kb | Equivalent to ~indel errors | Prokka # CDS |
|---|---|---|---|---|---|---|---|---|---|
| Reference | | 1 | 4639675 | 100.000 | 0.0 | 0 | 0.0 | 0 | 4300 |
| Spades | short | 77 | 4558759 | 99.998 | 2.3 | 108 | 0.2 | 8 | 4237 |
| NPR10_Unicycler | hybrid | 1 | 4620631 | 99.991 | 1.5 | 68 | 0.2 | 10 | 4283 |
| NPR94_Unicycler | hybrid | 1 | 4620629 | 99.991 | 1.5 | 68 | 0.2 | 11 | 4284 |
| NPR94_Hybrid_spades | hybrid | 6 | 4624669 | 99.983 | 6.8 | 315 | 0.5 | 23 | 4283 |
| NPR10_Hybrid_spades | hybrid | 5 | 4624274 | 99.986 | 7.0 | 322 | 0.8 | 35 | 4283 |
| NPR10_CANUtrimmed+raconNP+medakax2+raconILM | hybrid | 1 | 4638512 | 99.997 | 0.6 | 27 | 0.9 | 42 | 4315 |
| NPR10_CANUtrimmed+raconNP+medaka+raconILM | hybrid | 1 | 4638599 | 99.996 | 0.7 | 30 | 1.0 | 44 | 4317 |
| NPR10_miniasm+raconNP+medaka+raconILM | hybrid | 1 | 4638595 | 99.996 | 1.1 | 53 | 1.2 | 54 | 4318 |
| NPR10_FLYE+raconNP+medaka+raconILM | hybrid | 1 | 4638556 | 99.995 | 0.8 | 39 | 1.2 | 56 | 4322 |
| NPR10_WTDBG2+raconNP+medaka+raconILM | hybrid | 1 | 4638664 | 99.996 | 0.7 | 34 | 1.3 | 58 | 4317 |
| NPR94_CANUtrimmed+raconNP+medaka+raconILM | hybrid | 1 | 4638403 | 99.995 | 1.1 | 52 | 1.7 | 80 | 4318 |
| NPR94_CANUtrimmed+raconNP+medakax2+raconILM | hybrid | 1 | 4638462 | 99.995 | 0.6 | 28 | 1.9 | 86 | 4331 |
| NPR94_WTDBG2+raconNP+medaka+raconILM | hybrid | 1 | 4638598 | 99.995 | 1.1 | 51 | 1.9 | 86 | 4327 |
| NPR94_FLYE+raconNP+medaka+raconILM | hybrid | 1 | 4638047 | 99.993 | 2.7 | 123 | 2.4 | 110 | 4328 |
| NPR94_miniasm+raconNP+medaka+raconILM | hybrid | 1 | 4637389 | 99.989 | 3.2 | 149 | 3.1 | 142 | 4340 |
| NPR10_CANUtrimmed+raconNP+medakax2 | long | 1 | 4638979 | 99.927 | 26.5 | 1228 | 34.1 | 1581 | 5116 |
| NPR10_CANUtrimmed+raconNP+medaka | long | 1 | 4638976 | 99.922 | 28.0 | 1297 | 38.5 | 1787 | 5184 |
| NPR10_WTDBG2+raconNP+medaka | long | 1 | 4639209 | 99.921 | 28.8 | 1337 | 40.4 | 1874 | 5251 |
| NPR10_FLYE+raconNP+medaka | long | 1 | 4639515 | 99.920 | 29.9 | 1387 | 44.2 | 2051 | 5259 |
| NPR10_miniasm+raconNP+medaka | long | 1 | 4639280 | 99.911 | 35.6 | 1651 | 49.4 | 2291 | 5384 |
| NPR94_CANUtrimmed+raconNP+medakax2 | long | 1 | 4637323 | 99.942 | 3.7 | 173 | 51.0 | 2368 | 5272 |
| NPR10_CANUtrimmed+raconNP+medaka+medakaR94 | long | 1 | 4637849 | 99.838 | 82.0 | 3805 | 52.9 | 2455 | 5328 |
| NPR94_WTDBG2+raconNP+medaka | long | 1 | 4637369 | 99.928 | 7.7 | 358 | 58.8 | 2730 | 5428 |
| NPR94_CANUtrimmed+raconNP+medaka | long | 1 | 4637304 | 99.927 | 8.5 | 395 | 60.1 | 2788 | 5451 |
| NPR10_FLYE+raconNP | long | 1 | 4638585 | 99.847 | 63.4 | 2942 | 84.5 | 3922 | 5957 |
| NPR10_CANUtrimmed+raconNP | long | 1 | 4636825 | 99.839 | 61.2 | 2841 | 88.0 | 4084 | 6089 |
| NPR10_WTDBG2+raconNP | long | 1 | 4636668 | 99.827 | 62.1 | 2883 | 97.9 | 4541 | 6244 |
| NPR94_FLYE+raconNP+medaka | long | 2 | 4669779 | 99.670 | 181.5 | 8419 | 127.1 | 5898 | 6671 |
| NPR10_miniasm+raconNP | long | 1 | 4636746 | 99.781 | 77.9 | 3613 | 133.5 | 6192 | 6748 |
| NPR94_miniasm+raconNP+medaka | long | 1 | 4634385 | 99.667 | 167.3 | 7762 | 143.4 | 6652 | 6867 |
| NPR10_CANUtrimmed | long | 1 | 4629337 | 99.764 | 38.0 | 1764 | 184.8 | 8574 | 7448 |
| NPR94_FLYE | long | 2 | 4695820 | 99.553 | 238.5 | 11067 | 235.3 | 10916 | 8022 |
| NPR94_FLYE+raconNP | long | 2 | 4681169 | 99.573 | 167.0 | 7747 | 243.8 | 11311 | 8137 |
| NPR10_FLYE | long | 1 | 4652035 | 99.690 | 35.6 | 1649 | 282.3 | 13096 | 8111 |
| NPR94_CANUtrimmed+raconNP | long | 1 | 4626143 | 99.548 | 133.9 | 6210 | 294.1 | 13644 | 8720 |
| NPR94_WTDBG2+raconNP | long | 1 | 4626108 | 99.550 | 128.8 | 5976 | 296.8 | 13771 | 8682 |
| NPR94_miniasm+raconNP | long | 1 | 4625539 | 99.446 | 105.8 | 4909 | 320.4 | 14866 | 9309 |
| NPR10_WTDBG2 | long | 1 | 4618223 | 99.453 | 73.8 | 3424 | 448.3 | 20802 | 9505 |
| NPR94_CANUtrimmed | long | 1 | 4608892 | 99.381 | 86.7 | 4023 | 506.2 | 23485 | 9951 |
| NPR94_WTDBG2 | long | 2 | 4662573 | 99.211 | 82.6 | 3832 | 675.0 | 31316 | 10632 |
| NPR10_miniasm | long | 1 | 4534705 | 93.494 | 2698.7 | 125209 | 1499.3 | 69560 | 5262 |
| NPR94_miniasm | long | 1 | 4484577 | 91.458 | 10000000 | 10000000 | 10000000 | 10000000 | 5053 |
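
The “Equivalent to” columns translate QUAST's per-100 kb rates into approximate absolute error counts for the whole genome. Here is a minimal sketch of that conversion, assuming the rate is simply scaled by the reference length of 4,639,675 bp. The table's values appear consistent with this, and small differences are expected because the rates shown are rounded to one decimal. The function name is made up for illustration.

```python
REFERENCE_SIZE_BP = 4_639_675  # size of the reference, from the table


def approx_errors(rate_per_100kb, genome_size_bp=REFERENCE_SIZE_BP):
    """Scale a per-100 kb error rate to an approximate genome-wide count."""
    return round(rate_per_100kb * genome_size_bp / 100_000)


# Indel rate of the best nanopore-only R10 assembly (34.1 per 100 kb).
print(approx_errors(34.1))

# Mismatch rate of the unpolished NPR10_miniasm assembly (2698.7 per 100 kb).
print(approx_errors(2698.7))
```

Both results land within a few errors of the corresponding table entries (1581 and 125209).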

Methods:

Data:

Assemblers:

Polishing tools:

QC

Misc.

**BONUS** How well does R10+guppy HAC currently capture homopolymers?

To check this I used counterr, and the results are available here: R10_dist_len_hp (pretty good for homopolymers of length 6) and, for comparison, R94_dist_len_hp (pretty good for homopolymers of length 4).
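
For readers who have not looked at homopolymer length distributions before, the basic idea is to tally the runs of identical bases in a sequence by their length. The snippet below is only a toy illustration of that tally. It is not how counterr works internally, and the function name and example sequence are made up.

```python
from collections import Counter
from itertools import groupby


def homopolymer_lengths(sequence, min_length=2):
    """Count runs of identical bases, keyed by run length."""
    counts = Counter()
    for _base, run in groupby(sequence.upper()):
        length = sum(1 for _ in run)
        if length >= min_length:
            counts[length] += 1
    return counts


# Toy sequence with runs AAA, GG and TTTTTT.
example = "AAACGGTTTTTTA"
print(homopolymer_lengths(example))
```

Comparing such a tally for the reads against the tally for the reference regions they align to shows at which run length the basecaller starts to lose bases.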

If you have any ideas, or know of superior tools we have missed, please let us know in the comments.

Rasmus H. Kirkegaard

Post Doc
Playing with microbes and bioinformatics on the path towards "finished" genomes for "everyone" of them and rapid detection. My bet is currently on nanopore sequencing and I am fortunate to be involved in MinION and PromethION research in our lab.

Posted in Data analysis, Genomics.

4 Comments

  1. Nice post! Would be cool to see the marginPolish + HELEN polishing pipeline next to it. Although i think they are still working on a model for R10.

  2. Great work!
    What do you think if the long reads coverage was 100×? Would you guess much more coverage would be still insufficient?
