Can you beat our Nanopore read error correction? We hope so!

Tagging of individual molecules has been used as an effective consensus error-correction strategy for Illumina data (Kivioja et al 2011, Burke et al 2016, Zhang et al 2016), and the principle is similar to the circular consensus sequencing strategy used to generate consensus reads with an error rate of < 1 % on the PacBio (Travers et al 2010, Schloss et al 2016, Singer et al 2016) and the Oxford Nanopore platforms (Li et al 2016). You simply sequence the same molecule several times and compare the reads to generate a consensus with better accuracy than the individual reads. As far as we know, a tag-based consensus error correction strategy has not been attempted for long reads before, probably because the raw error rate complicates identification of the tags. However, we see several benefits of the tag-based strategy in the long run, which is why we decided to pursue it.

My colleague @SorenKarst tested tag-based error correction on the Nanopore MinION in connection with our work on generating primer-free, full-length 16S/18S sequences from environmental rRNA (see our bioRxiv paper: Karst et al 2016). The main approach used in the paper is based on Illumina sequencing inspired by Burke et al 2016, but moving to nanopore sequencing in the future would make the approach considerably easier. His approach was relatively “simple”: individual cDNA molecules were uniquely tagged at both ends with a 10 bp random sequence, then diluted to a few thousand molecules and amplified by PCR to generate thousands of copies of each molecule, which were prepared for 2D sequencing on the Nanopore MinION. The resulting sequence reads were binned based on the unique tags, which indicated that they originated from the same parent molecule, and a consensus was generated from each read bin. The approach was tested on a simple mock community with three reference organisms (E. coli MG1655, B. subtilis str. 168, and P. aeruginosa PAO1), which allowed us to calculate error rates.

For locating the unique tags we used cutadapt with loose settings to locate flanking adaptor sequences and extract the tag sequences. The tags were clustered and filtered based on abundance to remove false tags. As tags and adaptors contain errors, it can be a challenge to cluster the tags correctly without merging groups that do not belong together. Afterwards the filtered tags were used to extract and bin sequence reads using a custom perl script ;). For each bin we used the CANU correction tool followed by USEARCH consensus calling. By this very naive approach we were able to improve the median sequence similarity from 90% to 99%.
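To make the abundance filtering step a bit more concrete, here is a minimal sketch. It is not the custom perl script we used: it assumes a hypothetical two-column table (read id, tag sequence) and only groups tags that match exactly, whereas the real pipeline clusters tags with USEARCH to tolerate errors. The function name `filter_tags` and the toy reads are made up for illustration.

```shell
# Hypothetical input: one "read_id<TAB>tag" line per read.
# Prints "tag<TAB>read_id" for every read whose tag occurs at least
# min_reads times (exact matches only, no clustering).
filter_tags() {
  min_reads="$1"
  table="$2"
  awk -F'\t' -v min="$min_reads" '
    NR == FNR { count[$2]++; next }          # first pass: count reads per tag
    count[$2] >= min { print $2 "\t" $1 }    # second pass: keep abundant tags
  ' "$table" "$table" | LC_ALL=C sort
}

# Toy data: three reads share a tag, one tag is a singleton and one
# differs from the abundant tag by a single sequencing error.
toy="$(mktemp)"
printf 'r1\tGTTCGTTATT\nr2\tGTTCGTTATT\nr3\tGTTCGTTATT\nr4\tACGTACGTAC\nr5\tGTTCGTTATA\n' > "$toy"
filter_tags 3 "$toy"
rm -f "$toy"
```

Note that read r5 is lost here even though it most likely belongs to the same molecule as r1 to r3. That is exactly why exact matching is not enough and the tags need to be clustered with some tolerance for errors.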

We think this is a good start, but we are sure that someone in the nanopore community will be able to come up with a better solution to improve the error rate even further. The data is freely available and a short description of the sequence read composition is provided below. We look forward to hearing your input!

Ps. If you come up with a solution that beats our “quick and dirty” one and post it here or on twitter, I will make sure to mention you in my slides at ASM ;).

 

Data and script availability:

The nanopore reads are available as fastq (2D.fq), fasta (2Dr.fa) and fast5 (PRJEB20906).

The 16S rRNA gene reference sequences: mockrRNAall.fasta

Scripts:

Our approach in a shell script
#############################################################################
# 									    #
# Shell script for generating error corrected FL16S and getting error rates #
#									    #
# Use at your own RISK!							    #
#############################################################################

####################
#     Variables    #
####################
ID_adapt=0.1;
ID_cluster=0.8;
LINKtoCANU=/space/users/rkirke08/Desktop/canu/canu-1.3/Linux-amd64/bin;
# Update path to include poretools installation
# export PATH=$PATH:/space/users/rkirke08/.local/bin
####################
# End of variables #
####################
###############################################
# Depends on the following files and software #
###############################################
# folder with fast5 files "data/pass/"
# file with reference 16S sequences "mockrRNAall.fasta"
# perl script "F16S.cluster.split.pl"
# poretools
# cutadapt
# usearch8.1
# CANU
###############################################
# End of dependencies			      #
###############################################

# Extract fastq files
# poretools fastq --type 2D data/pass/ > data/2D.fq


# Rename headers (Some tools do not accept the long poretools headers)
#awk '{print (NR%4 == 1) ? "@" ++i : $0}' data/2D.fq | sed -n '1~4s/^@/>/p;2~4p' > 2Dr.fa

# Find adapters
cutadapt -g AAAGATGAAGAT -e $ID_adapt -O 12 -m 1300 --untrimmed-output un1.fa -o a1.fa 2Dr.fa
cutadapt -a ATGGATGAGTCT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o a1_a2.fa a1.fa

usearch8.1 -fastx_revcomp un1.fa -label_suffix _RC -fastaout un1_rc.fa
cutadapt -g AAAGATGAAGAT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o ua1.fa un1_rc.fa
cutadapt -a ATGGATGAGTCT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o ua1_a2.fa ua1.fa

cat a1_a2.fa ua1_a2.fa > c.fa

# Extract barcodes
cut -c1-12 c.fa > i1.fa
rev c.fa | cut -c1-12 | rev > i2.fa

paste i1.fa i2.fa -d "" | cut -f1-2 -d ">" > i1i2.fa


# Cluster barcodes
usearch8.1 -cluster_fast i1i2.fa -id $ID_cluster -centroids nr.fa -uc res.uc -sizeout

# Extract raw sequences
perl F16S.cluster.split.pl -c res.uc -i c.fa -m 3 -f 50 -r 40
# Count number of files in directory
find clusters -type f | wc -l

FILES=clusters/*.fa
for OTU in $FILES

do
  wc -l $OTU >> lines.txt	
done

FILES=clusters/*.fa
for OTU in $FILES

do
  OTUNO=$(echo $OTU | cut -f2 -d\/);
  # Rename header
  sed "s/>/>$OTUNO/" clusters/$OTUNO > clusters/newHeaders_$OTUNO

  # Correct reads using CANU
  $LINKtoCANU/canu -correct -p OTU_$OTUNO -d clusters/OTU_$OTUNO genomeSize=1.5k -nanopore-raw  $OTU

  # Unzip corrected reads
  gunzip clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta.gz

  sed -i "s/>/>$OTUNO/" clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta

  # Call consensus using Usearch
  usearch8.1 -cluster_fast clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta -id 0.9 -centroids clusters/OTU_$OTUNO/nr_cor_$OTUNO.fa -uc clusters/OTU_$OTUNO/res_$OTUNO.uc -sizeout -consout clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa

  sed -i "s/>/>$OTUNO/" clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa

  # Map reads to references to estimate error rate
  # Raw reads
  # Map FL16S back to references
  usearch8.1 -usearch_global clusters/newHeaders_$OTUNO -db mockrRNAall.fasta -strand both -id 0.60 -top_hit_only -maxaccepts 10 -query_cov 0.5 -userout clusters/map_raw_$OTUNO.txt -userfields query+target+id+ql+tl+alnlen

  # Usearch consensus corrected sequence
  # Map FL16S back to references
  usearch8.1 -usearch_global clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa -db mockrRNAall.fasta -strand both -id 0.60 -top_hit_only -maxaccepts 10 -query_cov 0.5 -userout clusters/map_cor_Ucons_$OTUNO.txt -userfields query+target+id+ql+tl+alnlen

  cat clusters/map_raw_$OTUNO.txt >> myfile.txt
  cat clusters/map_cor_Ucons_$OTUNO.txt >> myfile.txt

  # Collect corrected sequences
  cat clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa >> final_corrected.fa
done
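The median sequence similarity can be pulled out of the mapping tables with a few lines of shell. This is only a sketch of how one could do it: it assumes the tab-separated -userout format requested above (identity in column 3) and that raw and corrected hits are kept in separate files, whereas the script above appends both to myfile.txt. The function name `median_identity` and the toy numbers are hypothetical.

```shell
# Print the median of column 3 (percent identity) of a tab-separated
# usearch -userout table (query, target, id, ql, tl, alnlen).
median_identity() {
  cut -f3 "$1" | LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Toy table with three made-up hits
toy="$(mktemp)"
printf 'r1\tref1\t89.5\t1400\t1500\t1390\n' >  "$toy"
printf 'r2\tref1\t91.0\t1410\t1500\t1400\n' >> "$toy"
printf 'r3\tref1\t90.2\t1395\t1500\t1385\n' >> "$toy"
median_identity "$toy"
rm -f "$toy"
```

Running it once on the raw-read table of a cluster and once on the consensus table gives the before and after numbers for that cluster.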

Requirements:

cutadapt

USEARCH

a perl script F16S.cluster.split.pl

cDNA molecule composition

The cDNA molecules are tagged by attaching adaptors to each end of the molecule. The adaptor contains a priming site (red), the unique 10 bp tag sequence (blue) and a flanking sequence (black). Note that the design is complicated as we simply modified it from our approach to get “Thousands of primer-free, high-quality, full-length SSU rRNA sequences from all domains of life”.

AAAGATGAAGATNNNNNNNNNNCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTTTTTTTTTTTTTTTT<—- fragment of SSU cDNA molecule—->GGGCAATATCAGCACCAACAGAAATAGATCGCNNNNNNNNNNATGGATGAGTCT

The number of T’s before the fragment will vary between molecules because it is a result of the polyA tailing described in the paper. The black parts of the sequence are generally not needed for the purpose of Nanopore sequencing but are present in the molecule because they were needed for the Illumina sequencing.
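As a small illustration of the layout, the tag at the start of a read can be pulled out by looking for the priming site and taking the 10 bases that follow. This is a sketch only: it requires an exact match to AAAGATGAAGAT, which many real nanopore reads will not have (that is why the pipeline uses cutadapt with an error tolerance). The function name `extract_tag` is hypothetical, and the test sequence is the start of the example read shown further down.

```shell
# Print the 10 bases following an exact copy of the 5' priming site,
# or nothing if the priming site is not found without errors.
extract_tag() {
  printf '%s\n' "$1" | sed -n -E 's/.*AAAGATGAAGAT([ACGT]{10}).*/\1/p'
}

# Start of example read 18125
read_start="GGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCC"
extract_tag "$read_start"   # prints GTTCGTTATT
```

The tag at the other end can be found the same way by taking the 10 bases in front of ATGGATGAGTCT, although errors in the read make an exact match less likely the more sequence you require.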

Example Nanopore sequence read:

>18125
GATCTGGCTTCGTTCGGTTACGTATTGCTGGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTGGTCAAGCCTCACGAGCAATTAGTACTGGTTAACTCAACGCCTCACAACGCTTACACACCCAGCCTATCAACGTCGTAGTCTCCGACGGCCCTTCAGGGGAATCAAGTTCCAGTGAGATCTCATCTTGAGGCAAGTTTCCGCTTAGATGCTTTCAGCGGTTATCTTTTCCGAACATGGCTACCCGGCAATGCCACTGGCGTGACAACCGGAACACCAGAGTTCGTCCACCCGGTCCTTCCGTACTAGGAGCAGCCCCTCTCAAATTCAAACGTCCACGGCCAGATGGGGACCGAACTGTCTCACGACGTTCTAAGCCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTAGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATCGAGGTGCCAAACACCGCCGTCGATAAACTCTTGGGCGGTATCAGCCTGTTATCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCTTCCATACAGAACCACCGGATCTTCAAGACCTACTTTCGTACCTGCTCGACGTGTCTGCTCTGATCAAGCGCTTTTGCCTTTATATTCTCTGCGACCGATTTCCGACCGGTCTGAGCGCACCTTCGTGGTACTCCTCCGTTACTCTTTTAGGAGGAGACCGCCCCAGTCAAACTGCCCACCATACACTGTCCTCGATCCGGATTACGGACCAGAGTTAGAACCTCAAGCATGCCAGGATGGTGATTTCAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCCACCTAATCCTACACAGCAGGCTCAGTCCAGTGCCGCTACAGTAAAGGTTCACGGGGTCTTTCCGTCGCCGCGGATACACTGCATCTTCACAGCGATTTCAATTTCACTGAGTCTCGGGTGGAGACAGCGCCGCCATCGTTACGCCACTCGTGCAGGTCGGAACTTACCCGACAAGGAATTTCGCTACCTTGGACCGTTATCGTTACGGCCGCCGTTTACCGGGGCTAGATCAGGCTTCGCGCCCCATCAATACTTCCGGCACCGGGAGGCGTCACACTTATACGCCGTCCACTTTCGTGTTTTGCAGAGTGCTGTGTTTTTAATAAACAGTCGCAGCGGCCTGGTATCTTCGACCAGCCAGAGCTTACGGAGTAAATCCTTCACCCTAGCCGGCGCACCTTCTCCCGAAGTTACGGTGCCATTTGCCTAGTTCCTTCACCCGAGTTCTCAAGCGCCTTGGTATTCTCTACCCGACCACCTGTGTCGGTTTGGGTGCAGTTCCTGGTGCCTGAAGCTTAGAAGCTTTTGGAAGCATGGCATCAACCACTTCGTCGTCTAAAAGACGACTCGTCATCAACTCTCGGCCTTGAAACCCCGGATTTACCTAAGATTTCAGCCTACCACCTTAAACTTGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGGTGGAACGTTCTGTTTATGTTTCTGAAA

Example cluster with same random sequence:

>18125
GATCTGGCTTCGTTCGGTTACGTATTGCTGGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTGGTCAAGCCTCACGAGCAATTAGTACTGGTTAACTCAACGCCTCACAACGCTTACACACCCAGCCTATCAACGTCGTAGTCTCCGACGGCCCTTCAGGGGAATCAAGTTCCAGTGAGATCTCATCTTGAGGCAAGTTTCCGCTTAGATGCTTTCAGCGGTTATCTTTTCCGAACATGGCTACCCGGCAATGCCACTGGCGTGACAACCGGAACACCAGAGTTCGTCCACCCGGTCCTTCCGTACTAGGAGCAGCCCCTCTCAAATTCAAACGTCCACGGCCAGATGGGGACCGAACTGTCTCACGACGTTCTAAGCCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTAGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATCGAGGTGCCAAACACCGCCGTCGATAAACTCTTGGGCGGTATCAGCCTGTTATCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCTTCCATACAGAACCACCGGATCTTCAAGACCTACTTTCGTACCTGCTCGACGTGTCTGCTCTGATCAAGCGCTTTTGCCTTTATATTCTCTGCGACCGATTTCCGACCGGTCTGAGCGCACCTTCGTGGTACTCCTCCGTTACTCTTTTAGGAGGAGACCGCCCCAGTCAAACTGCCCACCATACACTGTCCTCGATCCGGATTACGGACCAGAGTTAGAACCTCAAGCATGCCAGGATGGTGATTTCAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCCACCTAATCCTACACAGCAGGCTCAGTCCAGTGCCGCTACAGTAAAGGTTCACGGGGTCTTTCCGTCGCCGCGGATACACTGCATCTTCACAGCGATTTCAATTTCACTGAGTCTCGGGTGGAGACAGCGCCGCCATCGTTACGCCACTCGTGCAGGTCGGAACTTACCCGACAAGGAATTTCGCTACCTTGGACCGTTATCGTTACGGCCGCCGTTTACCGGGGCTAGATCAGGCTTCGCGCCCCATCAATACTTCCGGCACCGGGAGGCGTCACACTTATACGCCGTCCACTTTCGTGTTTTGCAGAGTGCTGTGTTTTTAATAAACAGTCGCAGCGGCCTGGTATCTTCGACCAGCCAGAGCTTACGGAGTAAATCCTTCACCCTAGCCGGCGCACCTTCTCCCGAAGTTACGGTGCCATTTGCCTAGTTCCTTCACCCGAGTTCTCAAGCGCCTTGGTATTCTCTACCCGACCACCTGTGTCGGTTTGGGTGCAGTTCCTGGTGCCTGAAGCTTAGAAGCTTTTGGAAGCATGGCATCAACCACTTCGTCGTCTAAAAGACGACTCGTCATCAACTCTCGGCCTTGAAACCCCGGATTTACCTAAGATTTCAGCCTACCACCTTAAACTTGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGGTGGAACGTTCTGTTTATGTTTCTGAAA
>42262
AGCGTTCAGATTACGTATTGCTAGGGGGCAAAGATGAAGATGTTCGTTATTCTTTAGACTTGCCTGTCGCTCTATCTTCTCTTTTTGGTCAAGCCTCACGGGCAATTAGTACTGGTTAGCTCAACGCCTCACAACGCTTACAGCCTATCAACGTCATAATTCTTCTGACGGCCCTTCAGAATCAAGTTCCCAGTGAGATCTCATCTTGAGCAAGTTTCCCACCGTCTTTCAGCGGTTATCTTTTCGAACCTGCTTCCAGCAATACCACTGGCGTGACAACCGGAACACCAGAGGTTCGTCCACTCCGGTCCTCTCCGTACTAGGGCAGCCCTCTCAAATCTCAGAACGTCCACGGCAGATAGGACCGAACTGTCTCACGACGTTCTAAGCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATGCAGGACCGGCTTCGGCCCCAGGATGTGATGAGCCGGCATCGGGTGCCAAACACCGCCGTCGATATAAACTCGGGCATTGACCTGTTATCCCCGGGTACCTTTTTATCGTTGAGCGATGGCCCTTCCATACCAGAACCACCGGATCACTACAGACCTACTTTCGTACCTGCTCGCTGTCTGTCGCGGCCAAGCGCTTTTGCTATGCTCTGCGACCGATTTCCGACCGGTCTGGGCGCACCTTCGTACTCCGTTGCCTCTTTTGGAGACCGCTGATCAAACTGCCCACCATACACTGTCCTCGATCCGGATTACCAGAGTTTAGAACTCAATGCCAGGGTGGTATTTCAAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCACCTATCCTACAAGCAGGCTCAAAGTCCAGTACAACTACAGTAGGTTCACGGGGTCTTTCCGTCTAGCCGCGGATACCTGCATCTTCAGCGTTTCAATTTCACTGAGTCTCAGGTGGAGACAGCGCCGCCATCGTTACGCCATTCGTGCAGGTCGGAACTTGCCGACAAGGAATTTTGCACCTTGGGACCATTCGTTACGCCGTTTACCGGGGCTGATCAAGAGCTTGCTTGCGCTAACCCCATCAATTAATTTTTCCGGCACCGGGGAGGCGTCACACCTACGTCCCACTGCGTGTTTGCAGAGTGCTGTGTTTAATAAGTCGCAGCAGCTCAGTATCTTCGACCAGCCAGAGCTTACGGAGTAAATCTTCACCTAGCCGGCGACCTTCTCCCGGAAGTTACGGTGCCATTTGCCTAGTTCCTCCGCACCCGAAAGCCCTTCGCGCCTTGGTATTCTCTACCCGACCTGTGTCGGTTTGGGGCACGGTTCCTGGCCTGAAGCAGAAGCTTTTCTTGGAAGCCTGGCATCAACCACTTCGTCATCTAAAAGACGACTCGTCATCAGCTCTCGGCCTTGAAACCGGATTTACCTAAGATTTCAGCCTACCACCTTAAACTTGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGTGGAGACGTTCTGTTTATGTTTCTATC
>50101
CCCGGTTACGTATTGCTAGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTGGTCAAGCCTGCGGGCAATTATACTGGATAGCTCAACGCCTCACAACGCATACACCCAGCTTCTATCAACGTCGTAGTCTTCGACGGCCCTTCAGGAATCAAGTTCCCAGTGAGATCTCATCTTGAGGCAAGTTTCCCGCTTAGATGCTTTCAGCGGTTATCTTTTCCGAACATAGCTACCCGGCAATGCCACTGGCGTGACAACCGGAACACCAGAGGTTCGTCCACTCCGGTCCTCTCGTACTAGGAGCAGCCTCTCAAATCAAACGTCCACGGCAGATATAGGGACCGAACTGTCTCACGACGTTTCTAAACCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCACCCTTGGGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATCGGGAACAAACACCGCCGTCGATATAAACTCTTGGGCGGTATCAGCCTGTTATCCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCCTTCCATACAGAACCACCGGATCACTAAGACCTACTTTCGTACCTGCTCGACGTGTCTGTCTCGCAGTCAAGCGCGCTTTTGCTTTATACTCTGCGACCGATTTCCGACCGGTCTGAGCGCACCTTCGTACTCTCCGTTACTCTTTAGGAGACCGCCCCAGTCAAACTGCCCACCATACACTGTCCCTATCGATCCGGATTACGGACAGAGTTAGAACCTCAAGCATGCCAGGGTGGTATTTCAAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCACCTATCCTACACAAGCAGGCTCAAAGTCCAGTGCAAAGCTACAGTAAGGTTCACGGGTCTTTCCGTCTAGCCGCGGATACACTGCATCTCCACAGCGATTTCACCTCACTGAGTCTCTCGGGTGGAGACAGCGCCGCCATCGTTACGCCATTCGTGCAGGTCGGAACTTACCGACAAGGAATTTCGCTACCTTAGACCGTTATCGTTACGGCCGCCGTTTACCGGGGCTTCGATCAAGAGCTTCGCTTGCGCTAACCCCATCAATTAACCTTCGGCACCGGGGAGGCGTCACACCCTATACGTCCACTTTCGTGTTTGCAGAGTGCTGTGGCTTTTAATAAACAGTCGCAGCGGCCTGGTATCTTTTCGACCAGCCAGAGCTTACGGAGTAAATCCTTCACCTTAGCCGGCGCACCTTCTCCCGAAGTTACGGTGCCATTTGCTAGTTCCTTCACCCGAGTTCTCTCAAGCGCCTTGGTATTCTCTACCCGACCACCTGTGTCGGTTTGGGGTACGGTTCTGGTTACCTGAAGCTTAGAAGCTTTTCTTGGAAGCATGGCATCAACCACTTCGTCGTCTAAAGACGACTCGTCATCAGCTCTCGGCCTTGAAACCCCGGATTTACCTAAAGATTTCAGCCTACCACCTTAAACTTAGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGGTGGAAGTTCTGTTTATGTTTCTTGAGC


Fast, easy and robust DNA extraction for on-site bacterial identification using MinION

My name is Peter Rendbæk, and I’m currently a master’s student in the Albertsen lab. My master’s project serves as a pre-test for several of the new big projects in the group, which focus on applying online bacterial identification for process control at wastewater treatment plants. Hence, for the last couple of months I have been working on the project “Developing methods for on-site DNA sequencing using the Oxford Nanopore MinION platform”. The MinION has improved a lot since its release three years ago, and it can now be used to make rapid determination of bacterial compositions.

The potential for this fast and mobile DNA sequencing is mind-blowing. However, given that the technology is here now (!), there has been relatively little focus on portable, fast, easy and robust DNA extraction. Hence, I’ve spent the last months trying to develop a fast, cheap, mobile, robust and easy-to-use DNA extraction method.

There is a significant amount of bias connected with DNA extraction, but the bias associated with wastewater treatment samples has been investigated in depth. However, the “optimized method” is not suited for on-site DNA extraction. There are three principal steps in DNA extraction: cell lysis, debris removal and DNA isolation. I will cover each below and discuss how I simplified it.

In general, complex samples require bead beating for cell lysis and homogenization. The problem is that our in-house bead beating is done by a big table-top tool weighing 17 kg, which makes it hard to transport. However, I came across a blog post from Loman Labs about sample preparation and DNA extraction in the field for Nanopore sequencing. In the blog post, the possibilities of a portable bead beater are outlined, using a remodeled power tool. I thought this was interesting, so I went out and bought an oscillating multi-tool cutter and tried this with lots of duct tape…

The amazing part was that it worked! But the problem was that the samples would get “beaten” differently depending on how you taped the sample to the power tool, which could give rise to large variations in the observed microbial community.

I solved this by 3D printing an adapter to the power-tool that fits the bead-beater tube (Finally, a good excuse to use a 3D printer!). I used Solidworks to design the adapter and collaborated with our local department of mechanical and manufacturing engineering (m-tech) in 3D printing it. You can make your own by simply downloading my design from Thingiverse (It did take a few iterations to make it durable enough, and I still use a little duct tape..).

 

After the bead beating, the cell debris is removed by centrifugation. Our “standard” protocol recommends centrifugation at 14,000 x g for 10 minutes at 4 °C. However, in our minds that seemed a little excessive, and it requires a huge non-transportable centrifuge… There are a lot of small, easy-to-transport and easy-to-use centrifuges available if we do not have to centrifuge at 14,000 x g at 4 °C. There is even the possibility to 3D print a hand-powered centrifuge. However, I did not follow this path, as it seems a bit dangerous… After several tests, we discovered that a simple table-top centrifuge could do the job perfectly well, using 2,000 x g for 1 min at room temperature, if we combined it with the DNA isolation described below.

The last step is DNA isolation. I tried several different methods, but we got the idea to simply use Agencourt AMPure XP, which is routinely used in e.g. PCR purification (we diluted the AMPure XP beads 1:10 to save some money and it seems to work just as well). And… It works..

So, now you have an overview of the method I developed. The most amazing part is that it works! It takes 10-15 minutes from the moment the sample is taken until the DNA is ready for use, compared to 60+ minutes for our “standard” protocol. Furthermore, it requires only inexpensive equipment that can be carried in a small suitcase. So, just to prove that this approach is fast, I filmed myself doing the DNA extraction with a GoPro camera, as you can see below.

The next part is to test the MinION in the lab. How fast can we identify bacteria, and is the extracted DNA compatible with the downstream library preparation, which we hope to do on our new and shiny VolTRAX (which is now moving liquids!)?

How to spend 10.000.000 DKK (1.3 mill. EUR)

Recently, I was one of 16 recipients of a 10 mill. DKK grant (1.3 mill. EUR) from the VILLUM foundation under their Young Investigator Program (YIP). The program is unique in Denmark and offers young scientists an opportunity to build a research group on their own terms. The foundation works on the premise of its founder, who famously said:

“One experiment is better than a thousand expert opinions”

Villum Kann Rasmussen

Hence, they simply support good experiments and trust that the researchers will come up with great solutions if the foundation interferes as little as possible. This means as little administration as possible and flexible funding if new opportunities arise during the project. While this sounds almost too good to be true, previous grantees have all said that it actually works this way!

So, how do I plan to spend 10 mill. DKK (1.3 mill. EUR)?

Microbial communities underpin all processes in the environment and have direct impact on human health. Despite their importance, only a tiny fraction of the millions of different microbes is known. This is mainly due to the immense difficulties of cultivating microbes from natural systems in the laboratory. This discrepancy is also known as the “microbial dark matter”.

For any microbe, the genome is the blueprint of its physiological properties. Having this in hand, it is possible to reconstruct its potential metabolism and establish hypotheses for evolution, function and ecology. Furthermore, it provides a foundation for further validating its function through a variety of in situ methods. However, genomes are extremely difficult to obtain from the microbial dark matter.

Currently, multiple metagenomes combined with bioinformatic approaches are used to retrieve individual genomes from complex samples (see e.g. our paper from 2013). This has led to numerous fundamental discoveries, including the discovery of bacteria capable of complete ammonia oxidation (Comammox, see here and here), which radically changed our view of the global nitrogen cycle and earned us the “Danish research result of the year, 2015”.

However, we are still far from realizing the full potential of metagenomics to retrieve genomes. This is mainly due to the complexity of nature, where multiple closely related strains co-exist, which renders the current approaches useless.

Using the VILLUM YIP grant we want to use cutting-edge DNA sequencing related techniques to enable access to all genomes despite strain-complexity, link genomes, plasmids and phages, and enable direct measurements of in situ bacterial activity. The ability to readily obtain activity measurements of any bacteria, in any microbial ecosystem, will radically change microbial ecology and environmental biotechnology.

Obtaining complete bacterial genomes

Retrieving individual bacterial genomes from complex microbial communities can be compared to mixing hundreds of puzzles with millions of pieces, all containing different shades of blue sky. However, one way to circumvent the problem with closely related strains is to use bigger pieces of DNA to assemble the genomes. The current standard approach is to use short-read sequencing (Illumina; approx. 2 x 250 bp). However, the rapid development within long-read DNA sequencing means that it is now possible to start experimenting and to envision how this is going to be solved.

The newest technology on the long-read market is Oxford Nanopore sequencing. It has successfully been used to generate complete genomes from pure cultures, and we have used it for metagenomics of simple enrichment reactors to obtain the first complete Comammox genome. We have been early access testers of the MinION and are currently involved in the developer program. The improvements to the technology in the first half of 2016 mean that its quality and throughput are now sufficient to attempt medium-complexity metagenomes. Furthermore, we are one of the early customers of the high-throughput version of the MinION, the PromethION, which, in theory, would allow us to tackle even complex metagenomes.

Furthermore, while long-read DNA sequencing might enable closed bacterial chromosomes, they are still not associated directly with e.g. plasmids and phages. However, in the last couple of years several new methods have appeared, e.g. Hi-C and 3C, that utilize physical cross-linking of the DNA inside cells to generate sequencing libraries from which proximity information can be retrieved. This information can then be used to infer which genetic elements were in close proximity and thereby originated from the same bacterial cell. Until now, the methods have only been used in microbial communities of limited complexity, but there do not seem to be theoretical limits that would hinder their use if complete genomes are available.

 

Measuring in situ replication rates using DNA sequencing

An exciting new possibility is that complete genomes enable measurements of bacterial replication rates directly from metagenomic data (see here and here). The method is very simple and based on the fact that the majority of bacteria start replication at a single origin and then proceed bidirectionally. Hence, in an actively dividing bacterial population there will be more DNA at the origin of replication than at the terminus. This can be directly measured using DNA sequencing, as the number of reads is proportional to the amount of DNA in the original sample. Hence, by comparing the number of reads (coverage) at the origin to that at the terminus, a measure of bacterial replication rate is obtained. This allows direct observations of individual bacterial responses to stimuli in the environment, even in very complex environments such as the human gut, and with sub-hourly resolution. This type of information has been the dream of microbial ecologists since the field emerged over 100 years ago and will allow for countless new experiments within microbial ecology. Recently, the method has even been demonstrated to work with high-quality metagenome bins (see here). It is going to be interesting to further explore the potential and limitations of the method using complete genomes at an unprecedented scale.
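The core of the calculation can be sketched in a few lines of shell. This is a deliberately naive illustration and not the published methods: it assumes a hypothetical table of mean coverage per genome window and simply takes the highest window as the origin and the lowest as the terminus, without any of the smoothing or fitting a real analysis would need. The function name `coverage_ratio` and the numbers are made up.

```shell
# Input: tab-separated "window_start<TAB>mean_coverage" lines for one
# closed genome (hypothetical format). Prints max/min coverage as a
# crude origin-to-terminus ratio.
coverage_ratio() {
  awk -F'\t' '
    NR == 1 || $2 > max { max = $2 }
    NR == 1 || $2 < min { min = $2 }
    END { if (NR > 0 && min > 0) printf "%.2f\n", max / min }
  ' "$1"
}

# Toy genome with four windows: coverage falls from 40 X near the
# origin to 20 X near the terminus and rises again.
toy="$(mktemp)"
printf '0\t40\n1000000\t30\n2000000\t20\n3000000\t30\n' > "$toy"
coverage_ratio "$toy"   # prints 2.00
rm -f "$toy"
```

A ratio close to 1 would suggest a population that is hardly replicating, while higher values suggest more active replication.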

A few closing remarks

I am thrilled to have the next five years to explore how we can apply new DNA sequencing methods to understand the bacterial world and have the chance to build up a group of young scientists that share my excitement! If you think the project sounds great and either want to collaborate or work with us – then drop me an email!

Finally, I have to thank the people and mentors that made this possible. First of all my long-term mentor Per H. Nielsen; 6 years ago he introduced me to the world of microbial communities and throughout the years he has given me the freedom to pursue my own ideas, “freedom with responsibility” as we say in Danish. It is a leadership style that I very much try to adopt in my own newly found role as group leader. Secondly, my colleagues and friends Søren M. Karst and Rasmus H. Kirkegaard, whom I have persuaded to join me on further adventures down the rabbit hole! Furthermore, the long list of collaborators over the past years, where I have been fortunate to learn from some of the best scientists in the world (if you ask me). There are too many to mention, but special thanks go out to Holger Daims, Michael Wagner, Gene Tyson and Phil Hugenholtz.

PromethION unboxing

Following our experiences with DNA sequencing using the MinION since 2014 as a part of the MinION (early) Access Programme (MAP) and their developers programme, we applied for a spot on the PromethION Early Access Programme (PEAP) back in May 2015. The MinION was the mind-blowing DNA sequencer that allows you to do long-read (no fixed limit) DNA sequencing by plugging it into a laptop!!! It was an absolutely amazing piece of tech, but the initial throughput was not enough for our aim of retrieving the complete (and closed) genomes from all the abundant organisms in complex samples such as wastewater treatment systems. The PromethION promised a solution to this lack of throughput by having 48 times more flow cells with 6 times more pores in each cell.

As we were waiting for the PromethION, we used the MinION frequently. Our first try at a metagenome sample was a simple two-species culture, where we used the long reads to scaffold the Nitrospira genome and thus helped show that all the genes needed for complete nitrification were present in a single organism (Comammox). At the time, we could scaffold the Illumina-based assembly with some nanopore reads, but since then ONT has improved their technology tremendously and people have started to get data in the ~5 Gbp range from a single flowcell.

Hence, back-of-the-envelope calculations say that without any further improvements the PromethION would now be able to generate:

[5 Gbp per flowcell] * [6 x the number of pores] * [48 flowcells] = 1440 Gbp (in just ~48 hrs)

In other words, this is equivalent to 288,000 X coverage of a microbial genome of 5 Mbp (1,440,000 Mbp / 5 Mbp). If we want to retrieve genomes of organisms at 0.1% abundance, that would still amount to 288 X coverage! While we expected improvements in throughput, we never foresaw that they would come this quickly, and then suddenly the day came when our PromethION configuration unit arrived. The unit was delivered by ONT in a small van and we had a nice little unboxing experience. The Nanopore hype has finally reached the entire department, which has started dreaming about applications for long-read sequencing.
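For anyone who wants to plug in their own numbers, the back-of-the-envelope calculation can be written out in shell. The variable names are made up for this sketch; the values are the ones used in the text (5 Gbp per flowcell, 6 times the pores, 48 flowcells, a 5 Mbp genome and an organism at 0.1% abundance).

```shell
gbp_per_flowcell=5   # MinION-level yield per flowcell
pore_factor=6        # PromethION flowcells have 6 x the pores
flowcells=48         # flowcells per PromethION
genome_mbp=5         # size of a typical microbial genome
abundance_denom=1000 # 0.1% abundance = 1 in 1000

total_gbp=$((gbp_per_flowcell * pore_factor * flowcells))
coverage=$((total_gbp * 1000 / genome_mbp))
rare_coverage=$((coverage / abundance_denom))

echo "Total yield: ${total_gbp} Gbp"                  # 1440 Gbp
echo "Coverage of one genome: ${coverage} X"          # 288000 X
echo "Coverage at 0.1% abundance: ${rare_coverage} X" # 288 X
```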

As the PromethION is expected to produce massive amounts of data in very little time, fast data transport and storage become another challenge. Even storing data for a single MinION is causing trouble for people.

ONT therefore ships a PromethION configuration unit to test whether the local infrastructure is ready before shipping the actual PromethION. The accompanying manual states that the maximum expected signal data output would be 80 GB/hr per flowcell. The spec sheet for a NAS server suggested by ONT to move the data away from the PromethION itself, while running the sequencing, includes 2 fibre connections and 12 x 6 TB SSDs to support the internal buffer of 24 TB SSD storage on the PromethION. This amount of SSD storage at enterprise quality does not come cheap and only covers a machine for temporary storage, not the following bioinformatic computations. Compute costs should not be neglected when considering whether to buy a PromethION. As prices tend to drop fast for computer equipment, postponing any unnecessary upgrades could save you a lot of money or give you much more compute power for the same amount. We therefore planned to buy a “cheap” storage server (for now) with the specs below to hopefully meet the needs for the configuration unit and pass the test.

  • 2 x Intel Xeon 2650v4 (12 cores each)
  • 768 GB DDR4 RAM (2400 MHz)
  • 2 x 400 GB SSD (for the OS)
  • 16 x 8 TB NL-SAS (12 Gbps)
  • 2 x 10 Gbit SFP+ fibre ports

We plan to upgrade our entire compute facility when we get a better overview of the true needs for running the sequencing and bioinformatics. With PromethION-level output of signal data we do not expect to be able to store the raw data files, or upload them to the read archives, in the long term. Instead, we hope to obtain fastq or fasta files as early as possible and discard the raw signals. Re-sequencing samples could well end up being a lot cheaper than storing raw signal data.
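To get a feeling for the scale, here is a rough worst-case estimate in shell. It is based on the 80 GB/hr per flowcell figure from the manual and the 24 TB internal buffer, and it assumes (this is our assumption, not a number from ONT) that all 48 flowcells produce data at the maximum rate for a full ~48 hour run and that 1 TB = 1000 GB. The variable names are made up for this sketch.

```shell
gb_per_hour_per_flowcell=80  # maximum signal output stated in the manual
flowcells=48                 # assumption: all flowcells run at once
run_hours=48                 # assumption: a full ~48 hr run at maximum rate
buffer_gb=24000              # 24 TB internal SSD buffer

gb_per_hour=$((gb_per_hour_per_flowcell * flowcells))
gb_per_run=$((gb_per_hour * run_hours))
hours_to_fill=$(awk -v b="$buffer_gb" -v r="$gb_per_hour" 'BEGIN { printf "%.2f", b / r }')

echo "Signal data per hour: ${gb_per_hour} GB"   # 3840 GB
echo "Signal data per run: ${gb_per_run} GB"     # 184320 GB, i.e. ~184 TB
echo "Hours until the buffer is full: ${hours_to_fill}"   # 6.25
```

Under these assumptions the internal buffer would fill up in a little over six hours if nothing is moved off the machine, and a single run would produce more raw signal than our planned 16 x 8 TB storage server can hold.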

Currently, we are working with our IT support department to get everything connected and hope to be able to share a “hello world” from the PromethION soon!