Tagging of individual molecules has been used as an effective consensus error-correction strategy for Illumina data (Kivioja et al 2011, Burke et al 2016, Zhang et al 2016). The principle is similar to the circular consensus sequencing strategy used to generate consensus reads with an error rate of < 1% on the PacBio (Travers et al 2010, Schloss et al 2016, Singer et al 2016) and the Oxford Nanopore platforms (Li et al 2016). You simply sequence the same molecule several times and compare the reads to generate a consensus that is more accurate than the individual reads. As far as we know, a tag-based consensus error-correction strategy has not been attempted for long reads before, probably because the raw error rate complicates identification of the tags. However, we see several benefits of the tag-based strategy in the long run, which is why we decided to pursue it.
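As a rough illustration of why repeated sequencing helps, consider an idealized model (an assumption made only for this illustration, not something we measured): each of n reads miscalls a given base independently with probability p, and the consensus is wrong only if more than half of the reads are wrong at that position. Real nanopore errors are neither independent nor substitution-only, so this is a sketch of the principle rather than a prediction:

```latex
\[
P_{\text{err}}(n, p) = \sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k}\, p^{k} (1-p)^{n-k}
\]

% Worked example with p = 0.1 and n = 3 reads:
\[
P_{\text{err}}(3, 0.1) = 3\,(0.1)^{2}(0.9) + (0.1)^{3} = 0.027 + 0.001 = 0.028
\]
```

Under these assumptions, three reads with a 10% error rate already give a consensus error of roughly 3% at a given position, and more reads per molecule reduce it further.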
My colleague @SorenKarst tested tag-based error correction on the Nanopore MinION in connection with our work on generating primer-free, full-length 16S/18S sequences from environmental rRNA (see our bioRxiv paper: Karst et al 2016). The main approach used in the paper is based on Illumina sequencing inspired by Burke et al 2016, but moving to nanopore sequencing in the future would make the approach considerably easier. His approach was relatively “simple”: individual cDNA molecules were uniquely tagged at both ends with a 10 bp random sequence, diluted to a few thousand molecules, and amplified by PCR to generate thousands of copies of each molecule, which were then prepared for 2D sequencing on the Nanopore MinION. The resulting sequence reads were binned based on the unique tags, which indicate that the reads originate from the same parent molecule, and a consensus was generated from each read bin. The approach was tested on a simple mock community with three reference organisms (E. coli MG1655, B. subtilis str. 168, and P. aeruginosa PAO1), which allowed us to calculate error rates.
To locate the unique tags, we used cutadapt with loose settings to find the flanking adaptor sequences and extract the tag sequences. The tags were clustered and filtered based on abundance to remove false tags. As both tags and adaptors contain errors, it can be a challenge to cluster the tags correctly without merging groups that do not belong together. Afterwards, the filtered tags were used to extract and bin sequence reads using a custom Perl script ;). For each bin we used the CANU correction tool followed by USEARCH consensus calling. With this very naive approach we were able to improve the median sequence similarity from 90% to 99%.
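The abundance filter described above can be illustrated with a simplified stand-in. The sketch below counts exact tag matches and keeps tags seen at least a given number of times; it is not what our pipeline does (we cluster at 80% identity with USEARCH, because exact matching would split true tags that contain sequencing errors). The function name `filter_tags_by_abundance`, the threshold, and the toy tags are made up for illustration.

```shell
# Hypothetical helper: read one tag per line on stdin and print
# "tag<TAB>count" for every tag seen at least MIN times.
filter_tags_by_abundance() {
    min="$1"
    LC_ALL=C sort | uniq -c | awk -v min="$min" '$1 >= min { print $2 "\t" $1 }'
}

# Toy input: one tag seen four times, one seen twice, one singleton.
printf '%s\n' AAAACCCC AAAACCCC AAAACCCC GGGGTTTT AAAACCCC GGGGTTTT CCCCAAAA |
    filter_tags_by_abundance 2
```

With a threshold of 2 the singleton tag is dropped, which is the intended effect: tags that occur only once are more likely to be erroneous versions of a real tag than independent molecules.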
We think this is a good start, but we are sure that someone in the nanopore community will be able to come up with a better solution that improves the error rate even further. The data is freely available and a short description of the sequence read composition is provided below. We look forward to hearing your input!
PS: If you come up with a solution that beats our “quick and dirty” one and post it here or on Twitter, I will make sure to mention you in my slides at ASM ;).
Data and script availability:
The nanopore reads are available as fastq (2D.fq), fasta (2Dr.fa), and fast5 (PRJEB20906).
The 16S rRNA gene reference sequences: mockrRNAall.fasta
Scripts:
Our approach in a shell script
#############################################################################
#                                                                           #
# Shell script for generating error corrected FL16S and getting error rates #
#                                                                           #
# Use at your own RISK!                                                     #
#############################################################################

####################
#     Variables    #
####################
ID_adapt=0.1;
ID_cluster=0.8;
LINKtoCANU=/space/users/rkirke08/Desktop/canu/canu-1.3/Linux-amd64/bin;
# Update path to include poretools installation
# export PATH=$PATH:/space/users/rkirke08/.local/bin
####################
# End of variables #
####################

###############################################
# Depends on the following files and software #
###############################################
# folder with fast5 files "data/pass/"
# file with reference 16S sequences "mockrRNAall.fasta"
# perl script "F16S.cluster.split.pl"
# poretools
# cutadapt
# usearch8.1
# CANU
###############################################
#             End of dependencies             #
###############################################

# Extract fastq files
# poretools fastq --type 2D data/pass/ > data/2D.fq

# Rename headers (Some tools do not accept the long poretools headers)
#awk '{print (NR%4 == 1) ? "@" ++i : $0}' data/2D.fq | sed -n '1~4s/^@/>/p;2~4p' > 2Dr.fa

# Find adapters
cutadapt -g AAAGATGAAGAT -e $ID_adapt -O 12 -m 1300 --untrimmed-output un1.fa -o a1.fa 2Dr.fa
cutadapt -a ATGGATGAGTCT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o a1_a2.fa a1.fa
usearch8.1 -fastx_revcomp un1.fa -label_suffix _RC -fastaout un1_rc.fa
cutadapt -g AAAGATGAAGAT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o ua1.fa un1_rc.fa
cutadapt -a ATGGATGAGTCT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o ua1_a2.fa ua1.fa
cat a1_a2.fa ua1_a2.fa > c.fa

# Extract barcodes
cut -c1-12 c.fa > i1.fa
rev c.fa | cut -c1-12 | rev > i2.fa
paste i1.fa i2.fa -d "" | cut -f1-2 -d ">" > i1i2.fa

# Cluster barcodes
usearch8.1 -cluster_fast i1i2.fa -id $ID_cluster -centroids nr.fa -uc res.uc -sizeout

# Extract raw sequences
perl F16S.cluster.split.pl -c res.uc -i c.fa -m 3 -f 50 -r 40

# Count number of files in directory
find clusters -type f | wc -l

FILES=clusters/*.fa
for OTU in $FILES
do
  wc -l $OTU >> lines.txt
done

FILES=clusters/*.fa
for OTU in $FILES
do
  OTUNO=$(echo $OTU | cut -f2 -d\/);

  # Rename header
  sed "s/>/>$OTUNO/" clusters/$OTUNO > clusters/newHeaders_$OTUNO

  # Correct reads using CANU
  $LINKtoCANU/canu -correct -p OTU_$OTUNO -d clusters/OTU_$OTUNO genomeSize=1.5k -nanopore-raw $OTU

  # Unzip corrected reads
  gunzip clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta.gz
  sed -i "s/>/>$OTUNO/" clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta

  # Call consensus using Usearch
  usearch8.1 -cluster_fast clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta -id 0.9 -centroids clusters/OTU_$OTUNO/nr_cor_$OTUNO.fa -uc clusters/OTU_$OTUNO/res_$OTUNO.uc -sizeout -consout clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa
  sed -i "s/>/>$OTUNO/" clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa

  # Map reads to references to estimate error rate
  # Raw reads
  # Map FL16S back to references
  usearch8.1 -usearch_global clusters/newHeaders_$OTUNO -db mockrRNAall.fasta -strand both -id 0.60 -top_hit_only -maxaccepts 10 -query_cov 0.5 -userout clusters/map_raw_$OTUNO.txt -userfields query+target+id+ql+tl+alnlen

  # Usearch consensus corrected sequence
  # Map FL16S back to references
  usearch8.1 -usearch_global clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa -db mockrRNAall.fasta -strand both -id 0.60 -top_hit_only -maxaccepts 10 -query_cov 0.5 -userout clusters/map_cor_Ucons_$OTUNO.txt -userfields query+target+id+ql+tl+alnlen

  cat clusters/map_raw_$OTUNO.txt >> myfile.txt
  cat clusters/map_cor_Ucons_$OTUNO.txt >> myfile.txt

  # Collect corrected sequences
  cat clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa >> final_corrected.fa
done
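The script stops at collecting the mapping results. To get from those tables to a median similarity, something like the sketch below could be used. It assumes the tab-separated `-userout` layout requested in the script (`query+target+id+ql+tl+alnlen`), so the percent identity is in column 3. The function name `median_identity` and the toy rows are made up for illustration and are not part of our script.

```shell
# Hypothetical helper: print the median of column 3 (percent identity)
# from tab-separated mapping lines read on stdin.
median_identity() {
    cut -f3 | LC_ALL=C sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# Toy mapping table with made-up values (query, target, id, ql, tl, alnlen).
printf 'r1\tref1\t90.1\t1500\t1520\t1490\n' >  toy_map.txt
printf 'r2\tref1\t88.0\t1490\t1520\t1480\n' >> toy_map.txt
printf 'r3\tref2\t92.5\t1510\t1530\t1500\n' >> toy_map.txt
median_identity < toy_map.txt
```

On real output, the raw and corrected mappings should be summarized separately, for example `cat clusters/map_raw_*.txt | median_identity` and `cat clusters/map_cor_Ucons_*.txt | median_identity`, because myfile.txt mixes both.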
Requirements:
a perl script F16S-cluster.split.pl
cDNA molecule composition
The cDNA molecules are tagged by attaching adaptors to each end of the molecule. Each adaptor contains a priming site (red), the unique 10 bp tag sequence (blue), and a flanking sequence (black). Note that the design is more complicated than necessary because we simply modified it from the approach in our paper "Thousands of primer-free, high-quality, full-length SSU rRNA sequences from all domains of life".
AAAGATGAAGAT–NNNNNNNNNNCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTTTTTTTTTTTTTTTT<—- fragment of SSU cDNA molecule—->GGGCAATATCAGCACCAACAGAAATAGATCGCNNNNNNNNNN–ATGGATGAGTCT
The number of T’s before the fragment will vary between molecules because it is a result of the polyA tailing described in the paper. The black parts of the sequence are generally not needed for the purpose of Nanopore sequencing but are present in the molecule because they were needed for the Illumina sequencing.
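Given this layout, a read with the priming sites trimmed off starts and ends with a tag. The sketch below mirrors the `cut`/`rev` step in our script on a single made-up sequence: it takes the first and last 12 characters, which by our reading of the layout above is the 10 bp tag plus two flanking bases at each end. The function name `extract_tags` and the toy read are hypothetical.

```shell
# Hypothetical helper: print the first 12 and last 12 characters of a
# sequence, joined into one combined tag string.
extract_tags() {
    seq="$1"
    head_tag=$(printf '%s\n' "$seq" | cut -c1-12)
    tail_tag=$(printf '%s\n' "$seq" | rev | cut -c1-12 | rev)
    printf '%s%s\n' "$head_tag" "$tail_tag"
}

# Toy trimmed read (made up): tag ACGTACGTAC + CG ... GC + tag TTGCATTGCA
extract_tags "ACGTACGTACCGTTTTTGGGGGGCTTGCATTGCA"
```

Because the positions are fixed, any insertion or deletion in the first or last 12 bases shifts the extracted tag, which is one reason the tags are clustered at a relaxed identity rather than matched exactly.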
Example Nanopore sequence read:
>18125
Example cluster with same random sequence:
>18125
Bonus data
We also have an even bigger set of 1D data with more reads per cluster:
Example read:
>42
Rasmus H. Kirkegaard
Dear Rasmus, I just listened to you at ASM. How do I buy my own nanopore machine?
Dear Bukola.
You can buy it at https://store.nanoporetech.com/devices.html and Oxford Nanopore also has a spot at ASM (Booth 1242) where you can have a chance to win one!
https://twitter.com/nanopore/status/870324096126652417
Hey Rasmus,
bit late but I just posted something in the Nanopore community that goes into this direction. Would be curious to hear what you think about it:
https://community.nanoporetech.com/posts/consensus-reads-accuracy
Best wishes,
Philipp
Hi Philipp
Great to hear that you are pursuing the idea. I have provided some comments on your post in the community forum.
Best regards
Rasmus
I have now added a Dropbox link to the raw fast5 files for the 1D data (https://www.dropbox.com/s/lt472r4en625a3i/20170104_MockFL16S.tar.gz?dl=0). There have been many basecalling updates since this blog post, so the basecalled reads could probably be improved quite a bit by re-basecalling the data. I hope to see someone pick this up and provide a method for consensus calling of individual molecules in complex systems from error-prone sequencing data. We published this demonstration as part of our paper generating millions of 16S rRNA reference sequences (https://www.nature.com/articles/nbt.4045). A method to generate high-quality sequences on the MinION could have several advantages compared to hacking “long reads” on the short-read sequencing platforms.