Genome assembly has been a big challenge since the first methods for analysing DNA sequences saw the light of day. The challenge is primarily a result of our inability to read long fragments of DNA. With an explosion in the market for long read sequencing technologies (PacBio, Nanopore, 10X, Longas, etc.) there is now hope that assembly will soon be a solved problem, at least for simple genomes. However, some of these long read sequencing platforms struggle with a high error rate and, even worse, systematic errors related to homopolymers, which cause insertion and deletion errors that result in incomplete genes. To mitigate this problem, a hybrid approach using short high-accuracy reads has been needed, but this increases price and complexity and is generally just a bit annoying.
To get rid of the systematic errors in the nanopore platform (the platform with the longest reads reported so far), the company is continually modifying not only the basecalling algorithms but also the chemistry and the pores themselves. The recent release of a pore called “R10” promised to help with the issue and hopefully allow us to get rid of the short read polishing.
Objective: Test if R10 data can produce nanopore-only assemblies with an indel error rate on par with that of the hybrid approach.
Strategy: Use E. coli K12 MG1655 reference strain data to produce a set of assemblies and evaluate these with different assembly metrics in comparison to the “true” reference.
Conclusion:
The best indel rate for a nanopore-only assembly has dropped from 51/100kb to 34/100kb with the release of R10. However, that is still roughly 38 times higher than the same assembly after Illumina polishing (0.9/100kb) and about 170 times higher than the best hybrid approach (Unicycler, 0.2/100kb). As a result, Prokka predicts 816 more genes (CDS) than the reference has (5116 vs. 4300), presumably because the remaining indels break genes into fragments. This indicates that the current version of R10 data is insufficient to produce nanopore-only assemblies of sufficient quality to allow annotation.
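The comparison behind this conclusion can be reproduced from the assembly table. The snippet below is a minimal sketch that uses values copied from the table; the variable and function names are made up for illustration.

```python
# Values copied from the assembly table in this post.
REFERENCE_CDS = 4300            # Prokka CDS count for the reference
BEST_R10_LONG_INDELS = 34.1     # NPR10_CANUtrimmed+raconNP+medakax2, indels per 100 kb
BEST_R10_LONG_CDS = 5116        # Prokka CDS count for the same assembly
SAME_PIPELINE_HYBRID = 0.9      # same pipeline + raconILM, indels per 100 kb
BEST_HYBRID = 0.2               # Unicycler, indels per 100 kb


def fold_difference(rate: float, baseline: float) -> float:
    """How many times higher `rate` is than `baseline`."""
    return rate / baseline


# Extra CDS predicted in the best nanopore-only assembly
excess_cds = BEST_R10_LONG_CDS - REFERENCE_CDS

print(f"Excess CDS vs. reference: {excess_cds}")
print(f"vs. Illumina-polished CANU: {fold_difference(BEST_R10_LONG_INDELS, SAME_PIPELINE_HYBRID):.0f}x")
print(f"vs. Unicycler hybrid: {fold_difference(BEST_R10_LONG_INDELS, BEST_HYBRID):.0f}x")
```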
Assemblies:
Assembly | Read type | Contigs | Size (bp) | ANI (%) | QUAST mismatches per 100kb | Equivalent to ~mismatch errors | QUAST indels per 100kb | Equivalent to ~indel errors | Prokka # CDS |
---|---|---|---|---|---|---|---|---|---|
Reference | | 1 | 4639675 | 100.000 | 0.0 | 0 | 0.0 | 0 | 4300 |
Spades | short | 77 | 4558759 | 99.998 | 2.3 | 108 | 0.2 | 8 | 4237 |
NPR10_Unicycler | hybrid | 1 | 4620631 | 99.991 | 1.5 | 68 | 0.2 | 10 | 4283 |
NPR94_Unicycler | hybrid | 1 | 4620629 | 99.991 | 1.5 | 68 | 0.2 | 11 | 4284 |
NPR94_Hybrid_spades | hybrid | 6 | 4624669 | 99.983 | 6.8 | 315 | 0.5 | 23 | 4283 |
NPR10_Hybrid_spades | hybrid | 5 | 4624274 | 99.986 | 7.0 | 322 | 0.8 | 35 | 4283 |
NPR10_CANUtrimmed+raconNP+medakax2+raconILM | hybrid | 1 | 4638512 | 99.997 | 0.6 | 27 | 0.9 | 42 | 4315 |
NPR10_CANUtrimmed+raconNP+medaka+raconILM | hybrid | 1 | 4638599 | 99.996 | 0.7 | 30 | 1.0 | 44 | 4317 |
NPR10_miniasm+raconNP+medaka+raconILM | hybrid | 1 | 4638595 | 99.996 | 1.1 | 53 | 1.2 | 54 | 4318 |
NPR10_FLYE+raconNP+medaka+raconILM | hybrid | 1 | 4638556 | 99.995 | 0.8 | 39 | 1.2 | 56 | 4322 |
NPR10_WTDBG2+raconNP+medaka+raconILM | hybrid | 1 | 4638664 | 99.996 | 0.7 | 34 | 1.3 | 58 | 4317 |
NPR94_CANUtrimmed+raconNP+medaka+raconILM | hybrid | 1 | 4638403 | 99.995 | 1.1 | 52 | 1.7 | 80 | 4318 |
NPR94_CANUtrimmed+raconNP+medakax2+raconILM | hybrid | 1 | 4638462 | 99.995 | 0.6 | 28 | 1.9 | 86 | 4331 |
NPR94_WTDBG2+raconNP+medaka+raconILM | hybrid | 1 | 4638598 | 99.995 | 1.1 | 51 | 1.9 | 86 | 4327 |
NPR94_FLYE+raconNP+medaka+raconILM | hybrid | 1 | 4638047 | 99.993 | 2.7 | 123 | 2.4 | 110 | 4328 |
NPR94_miniasm+raconNP+medaka+raconILM | hybrid | 1 | 4637389 | 99.989 | 3.2 | 149 | 3.1 | 142 | 4340 |
NPR10_CANUtrimmed+raconNP+medakax2 | long | 1 | 4638979 | 99.927 | 26.5 | 1228 | 34.1 | 1581 | 5116 |
NPR10_CANUtrimmed+raconNP+medaka | long | 1 | 4638976 | 99.922 | 28.0 | 1297 | 38.5 | 1787 | 5184 |
NPR10_WTDBG2+raconNP+medaka | long | 1 | 4639209 | 99.921 | 28.8 | 1337 | 40.4 | 1874 | 5251 |
NPR10_FLYE+raconNP+medaka | long | 1 | 4639515 | 99.920 | 29.9 | 1387 | 44.2 | 2051 | 5259 |
NPR10_miniasm+raconNP+medaka | long | 1 | 4639280 | 99.911 | 35.6 | 1651 | 49.4 | 2291 | 5384 |
NPR94_CANUtrimmed+raconNP+medakax2 | long | 1 | 4637323 | 99.942 | 3.7 | 173 | 51.0 | 2368 | 5272 |
NPR10_CANUtrimmed+raconNP+medaka+medakaR94 | long | 1 | 4637849 | 99.838 | 82.0 | 3805 | 52.9 | 2455 | 5328 |
NPR94_WTDBG2+raconNP+medaka | long | 1 | 4637369 | 99.928 | 7.7 | 358 | 58.8 | 2730 | 5428 |
NPR94_CANUtrimmed+raconNP+medaka | long | 1 | 4637304 | 99.927 | 8.5 | 395 | 60.1 | 2788 | 5451 |
NPR10_FLYE+raconNP | long | 1 | 4638585 | 99.847 | 63.4 | 2942 | 84.5 | 3922 | 5957 |
NPR10_CANUtrimmed+raconNP | long | 1 | 4636825 | 99.839 | 61.2 | 2841 | 88.0 | 4084 | 6089 |
NPR10_WTDBG2+raconNP | long | 1 | 4636668 | 99.827 | 62.1 | 2883 | 97.9 | 4541 | 6244 |
NPR94_FLYE+raconNP+medaka | long | 2 | 4669779 | 99.670 | 181.5 | 8419 | 127.1 | 5898 | 6671 |
NPR10_miniasm+raconNP | long | 1 | 4636746 | 99.781 | 77.9 | 3613 | 133.5 | 6192 | 6748 |
NPR94_miniasm+raconNP+medaka | long | 1 | 4634385 | 99.667 | 167.3 | 7762 | 143.4 | 6652 | 6867 |
NPR10_CANUtrimmed | long | 1 | 4629337 | 99.764 | 38.0 | 1764 | 184.8 | 8574 | 7448 |
NPR94_FLYE | long | 2 | 4695820 | 99.553 | 238.5 | 11067 | 235.3 | 10916 | 8022 |
NPR94_FLYE+raconNP | long | 2 | 4681169 | 99.573 | 167.0 | 7747 | 243.8 | 11311 | 8137 |
NPR10_FLYE | long | 1 | 4652035 | 99.690 | 35.6 | 1649 | 282.3 | 13096 | 8111 |
NPR94_CANUtrimmed+raconNP | long | 1 | 4626143 | 99.548 | 133.9 | 6210 | 294.1 | 13644 | 8720 |
NPR94_WTDBG2+raconNP | long | 1 | 4626108 | 99.550 | 128.8 | 5976 | 296.8 | 13771 | 8682 |
NPR94_miniasm+raconNP | long | 1 | 4625539 | 99.446 | 105.8 | 4909 | 320.4 | 14866 | 9309 |
NPR10_WTDBG2 | long | 1 | 4618223 | 99.453 | 73.8 | 3424 | 448.3 | 20802 | 9505 |
NPR94_CANUtrimmed | long | 1 | 4608892 | 99.381 | 86.7 | 4023 | 506.2 | 23485 | 9951 |
NPR94_WTDBG2 | long | 2 | 4662573 | 99.211 | 82.6 | 3832 | 675.0 | 31316 | 10632 |
NPR10_miniasm | long | 1 | 4534705 | 93.494 | 2698.7 | 125209 | 1499.3 | 69560 | 5262 |
NPR94_miniasm | long | 1 | 4484577 | 91.458 | 10000000 | 10000000 | 10000000 | 10000000 | 5053 |
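The “Equivalent to ~… errors” columns appear to be the QUAST per-100kb rates scaled up to the full genome. Here is a minimal sketch of that conversion, assuming the reference length from the table is used as the genome size. The results differ by a few errors from the table values, presumably because the table was computed from unrounded rates. The function name is made up for illustration.

```python
REFERENCE_LENGTH_BP = 4_639_675  # size of the reference row in the table


def errors_per_genome(rate_per_100kb: float,
                      genome_length_bp: int = REFERENCE_LENGTH_BP) -> float:
    """Scale a per-100-kb error rate to an approximate count for the whole genome."""
    return rate_per_100kb * genome_length_bp / 100_000


# (assembly, QUAST indels per 100 kb) copied from the table
examples = [
    ("NPR10_Unicycler", 0.2),
    ("NPR10_CANUtrimmed+raconNP+medakax2", 34.1),
    ("NPR94_CANUtrimmed+raconNP+medakax2", 51.0),
]

for name, indel_rate in examples:
    print(f"{name}: ~{errors_per_genome(indel_rate):.0f} indels in the genome")
```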
Methods:
Data:
- Some Nanopore R9 guppy w. flipflop (232018357 bp, ~50x coverage)
- Some Nanopore R10 guppy w. high accuracy mode (232012362 bp, ~50x coverage). Raw fast5s are available at Figshare (look for barcode 13 after basecalling and demultiplexing)
- Illumina from: [SRR2627175] (227773707 bp, ~49x coverage)
- Reference assembly: [U00096.2]
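The coverage figures above follow from dividing the total number of sequenced bases by the genome size. A minimal sketch, assuming the reference length from the assembly table (4639675 bp) as the genome size; the names are made up for illustration:

```python
GENOME_SIZE_BP = 4_639_675  # reference length from the assembly table

# Total bases per dataset, copied from the data list above
datasets = {
    "Nanopore R9": 232_018_357,
    "Nanopore R10": 232_012_362,
    "Illumina": 227_773_707,
}


def coverage(total_bases: int, genome_size: int = GENOME_SIZE_BP) -> float:
    """Average depth of coverage = sequenced bases / genome size."""
    return total_bases / genome_size


for name, total_bases in datasets.items():
    print(f"{name}: ~{coverage(total_bases):.0f}x")
```

Keeping all datasets at a similar depth means that differences between the assemblies are less likely to be explained by the amount of data alone.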
Assemblers:
Polishing tools:
- [Medaka (v. 0.8.1)]
- [racon (v. 1.3.3)]
- I intentionally left out nanopolish; see the blog post by Ryan Wick
QC
Misc.
**BONUS** How well does R10+guppy HAC currently capture homopolymers?
To check this I used counterr, and the results are presented here: R10_dist_len_hp (pretty good for homopolymers up to length 6) and, for comparison, R94_dist_len_hp (pretty good for homopolymers up to length 4).
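For readers unfamiliar with the term, a homopolymer is a run of identical bases, and the length of that run is what the basecaller has to get right. The sketch below only illustrates how run lengths can be tallied in a single sequence. It is not how counterr works, and the function name is made up for illustration.

```python
from collections import Counter
from itertools import groupby


def homopolymer_run_lengths(sequence: str, min_length: int = 1) -> Counter:
    """Count how many runs of identical bases there are for each run length."""
    runs = Counter()
    # groupby yields one group per stretch of consecutive identical characters
    for _base, group in groupby(sequence.upper()):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs[length] += 1
    return runs


# Toy sequence: runs are AAA (3), CC (2), G (1), TTTT (4)
toy = "AAACCGTTTT"
print(homopolymer_run_lengths(toy, min_length=3))
```

Comparing such a tally for a reference with the tally for the corresponding basecalled reads gives a rough idea of the run lengths at which the calls start to drift.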
If you have any ideas or superior tools we have missed please let us know in the comments.
Rasmus H. Kirkegaard
Nice post! Would be cool to see the marginPolish + HELEN polishing pipeline next to it, although I think they are still working on a model for R10.
Cool. I will check it out.
Great work!
What do you think would happen if the long read coverage were 100×? Would you guess that much more coverage would still be insufficient?
My first test with higher coverage indicates that the systematic errors cannot be solved by higher coverage.