# Why is it important to remove short molecules?

In the age of short-read sequencing you could treat cells with very harsh physical disruption, as there was little chance that the DNA extraction would shear your DNA into fragments shorter than the machines could sequence. However, with the emerging long-read sequencers, short fragments are suddenly no fun, and people have started opening long-forgotten books to study ancient molecular methods. Preparing high molecular weight DNA can be a pain (tricky and toxic chemicals), and if your DNA libraries do not produce much data you do not always know why. The presence of even a small mass of short molecules can be one of the reasons; in this blog post I try to explore why.

### Ligation depends on the number of molecules

Often when running reactions in the lab we do not think too much about reagents running out, unless we run way too many PCR cycles. However, when it comes to nanopore sequencing the recommendation is to load only 100-200 fmol into the reaction. This suggests that some of the reagents in the ligation will be depleted if more DNA is added.

The number of moles of dsDNA with a given length and mass is given by

$$n_{\text{mol}} = \frac{\text{mass (g)}}{(\#\,\text{nucleotides} \times 607.4) + 157.9\ \text{(g/mol)}}$$

Converting this to femtomoles is just a matter of multiplying by $10^{15}$. E.g. if we have 1 µg of 8 kb fragments the calculation is

$$\frac{1 \times 10^{-6}\ \text{g}}{8000 \times 607.4\ \text{g/mol} + 157.9\ \text{g/mol}} = 2.058 \times 10^{-13}\ \text{mol} = 205.8\ \text{fmol}$$

Multiplying the number of moles by Avogadro's constant ($6.022 \times 10^{23}\ \text{mol}^{-1}$) gives you the number of molecules, e.g.

$$2.058 \times 10^{-13}\ \text{mol} \times 6.022 \times 10^{23}\ \text{mol}^{-1} \approx 1.239 \times 10^{11}\ \text{molecules}$$

That is roughly 124 billion molecules.
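The conversion above can be sketched in a few lines of Python. This is a minimal sketch that simply reuses the constants from the formula (607.4 g/mol per position, 157.9 g/mol end correction); the function names are my own and not part of any library.

```python
AVOGADRO = 6.022e23      # molecules per mole
MASS_PER_BP = 607.4      # g/mol per position, as used in the formula above
END_CORRECTION = 157.9   # g/mol, as used in the formula above


def dsdna_moles(mass_g, length_bp):
    """Moles of dsDNA for a given mass (g) and fragment length (bp)."""
    return mass_g / (length_bp * MASS_PER_BP + END_CORRECTION)


def dsdna_fmol(mass_g, length_bp):
    """Same quantity expressed in femtomoles (1 mol = 1e15 fmol)."""
    return dsdna_moles(mass_g, length_bp) * 1e15


def dsdna_molecules(mass_g, length_bp):
    """Number of molecules, using Avogadro's constant."""
    return dsdna_moles(mass_g, length_bp) * AVOGADRO


if __name__ == "__main__":
    # 1 ug of 8 kb fragments, the example from the text
    print(f"{dsdna_fmol(1e-6, 8000):.1f} fmol")
    print(f"{dsdna_molecules(1e-6, 8000):.3e} molecules")
```

Plugging in your own mass and average fragment length gives a quick check of whether you are inside the 100-200 fmol window before starting a ligation.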

### How is the number of molecules related to length?

The number of molecules in a given mass of DNA is inversely proportional to the length. E.g. if we have equal masses of 1000 bp molecules and 10000 bp molecules, we have $(1000/10000)^{-1} = 10$ times more short molecules than long molecules. If we take this concept and look at some distributions, we can explore what that means for your input to a ligation reaction (Figures 1, 3 and 4).

#### Figure 1 – Your gel vs the molecule distribution at equal mass

In the simple case where we have equal masses of short (1,000 bp) and somewhat long (10,000 bp) fragments, the distribution of molecules shows that we have 10 times more short molecules than long ones. If we perform an adapter ligation reaction with enough adapters for 50 % of our molecules, and every molecule has the same chance of reacting, we will end up with 10 times more short molecules with adapters than long molecules. If we then perform steps that can favour short molecules, such as PCR, clean-ups or diffusion-limited sequencing, we are going to skew this even more…

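$$m_{\text{tot}} = m_{\text{single molecule}} \times N_{\text{molecules}}$$

Where $m_{\text{tot}}$ is the total mass, $m_{\text{single molecule}}$ is the mass of a single molecule, and $N_{\text{molecules}}$ is the number of molecules.

$$m_{\text{single molecule}} = L_{\text{molecule}} \times \text{constant}$$

Where $L_{\text{molecule}}$ is the length of the molecule, and the constant is the mass per base pair (X g/bp).

$$m_{\text{tot}} = L_{\text{molecule}} \times \text{constant} \times N_{\text{molecules}}$$

$$N_{\text{molecules}} = m_{\text{tot}} \times L_{\text{molecule}}^{-1} \times \text{constant}^{-1}$$

Where $\text{constant}^{-1}$ is just another constant. The number of molecules is proportional to $\text{mass} \times \text{length}^{-1}$. Thus the distribution of the molecules is equal to the mass distribution weighted by $\text{length}^{-1}$.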
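As a rough illustration of the weighting described above, the sketch below turns a mass distribution into a molecule (molar) distribution by dividing each mass fraction by its fragment length and renormalising. The function name and the input format are my own choices for the example.

```python
def molar_fractions(mass_fractions, lengths_bp):
    """Convert mass fractions to molecule fractions.

    Each mass fraction is weighted by 1/length; the constant (g/bp)
    is the same for every fragment class and cancels on normalisation.
    """
    weights = [m / length for m, length in zip(mass_fractions, lengths_bp)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Equal mass of 1,000 bp and 10,000 bp fragments (the Figure 1 case)
    short, long_ = molar_fractions([0.5, 0.5], [1_000, 10_000])
    print(f"short: {short:.3f}, long: {long_:.3f}, ratio: {short / long_:.1f}")
```

For the equal-mass case this reproduces the 10:1 ratio of short to long molecules.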

### How does molecule size impact data generation?

If we sequenced all of the DNA in 1 µg it would be equivalent to roughly $10^{15}$ bp ≈ 1,000 terabases (1 µg divided by the ~607.4 g/mol per base pair used above, times Avogadro's constant; counting the nucleotides on both strands doubles this to about 2,000 terabases) (https://www.thermofisher.com/us/en/home/references/ambion-tech-support/rna-tools-and-calculators/dna-and-rna-molecular-weights-and-conversions.html). Unfortunately we can only expect to sequence a tiny fraction of these molecules. The unofficial record at the moment is 34 Gbp on a MinION flowcell (https://twitter.com/DrT1973/status/1067915825783353344). Because the molecules we sequence are a finite subset (we do not sequence all the DNA), the molar distribution becomes important: the more short molecules we sequence, the less data we produce. If we imagine sequencing a DNA pool with equal masses of 1 kb and 10 kb fragments, this would give a ~80 % lower sequencing yield than with only the long molecules.

#### Figure 2 – Impact of short molecules on sequencing yield

Using the very simplistic assumption that we sequence a fixed amount of molecules regardless of length, it is clear that even a little mass of short molecules has huge impact on our chances of getting record breaking yields. At equal mass of short and long molecules with a size ratio of 10, we get only ~18 % of the yield that we could have had with only the long molecules, and even with only 10 % of the mass in short molecules we get only about 50 % of the yield.

We can calculate how the short molecules affect our sequencing yield.

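The amount of data we generate from sequencing is:

$$D = N \times L$$

Where $N$ is the number of reads, and $L$ is the length of the reads (bp).

The amount of short read data is given by:

$$D_{\text{short}} = N \times F_{\text{short}} \times L_{\text{short}}$$

Where $F$ is the fraction of reads of a given length.

The amount of long read data is given by:

$$D_{\text{long}} = N \times F_{\text{long}} \times L_{\text{long}}$$

The maximum amount of data will be where all reads are long and is given by:

$$D_{\text{max}} = N \times L_{\text{long}}$$

The presence of short reads means that the total amount of data is given by:

$$D_{\text{tot}} = D_{\text{short}} + D_{\text{long}} = N \times (F_{\text{short}} \times L_{\text{short}} + F_{\text{long}} \times L_{\text{long}})$$

The impact on our sequencing is a loss in efficiency from the short reads:

$$\frac{D_{\text{tot}}}{D_{\text{max}}} = \frac{F_{\text{short}} \times L_{\text{short}} + F_{\text{long}} \times L_{\text{long}}}{L_{\text{long}}}$$

With the molecules in the example from before (equal mass of 1 kbp and 10 kbp):

$$\frac{D_{\text{tot}}}{D_{\text{max}}} = \frac{(10/11) \times 1000 + (1/11) \times 10000}{10000} = 0.18$$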

This gives us an efficiency of 18% compared to having only the long fragments. If you expected 10 Gbp you will only be getting 1.8 Gbp!!!
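The efficiency calculation can be sketched as a small Python function. It rests on the same simplistic assumption as above (a fixed number of reads, each molecule equally likely to be sequenced); the function name and its parameters are my own for illustration.

```python
def yield_efficiency(mass_fraction_short, short_bp, long_bp):
    """D_tot / D_max for a two-size pool under the fixed-read-count assumption.

    mass_fraction_short: fraction of the total DNA mass in short fragments.
    """
    # Relative molecule counts are mass divided by length.
    n_short = mass_fraction_short / short_bp
    n_long = (1.0 - mass_fraction_short) / long_bp
    f_short = n_short / (n_short + n_long)
    f_long = 1.0 - f_short
    return (f_short * short_bp + f_long * long_bp) / long_bp


if __name__ == "__main__":
    for mass_short in (0.5, 0.1, 0.01):
        eff = yield_efficiency(mass_short, 1_000, 10_000)
        print(f"{mass_short:.0%} short by mass -> {eff:.0%} of maximum yield")
```

With equal mass this gives about 18 %, and with 10 % of the mass in short fragments it gives a little over 50 %, matching the numbers quoted for Figure 2.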

#### Figure 3 – Your gel vs molecule distribution at only 1 % mass of short molecules and 99% very long molecules

What does the molecule distribution look like if we have succeeded in making a pool of DNA with 99 % of the mass in long (100,000 bp) fragments but still have 1 % of the mass in short (1,000 bp) fragments? On the gel it would look like there is hardly anything short, but the molecule size distribution tells a different story. This is pretty bad: half of our molecules will be short ones! That makes it clear that we should try to get rid of short fragments. This can be done using gel-based methods or the modified AMPure clean-up protocol in the community pages.

#### Figure 4 – Your gel vs molecule distribution at only 1 % mass of short molecules and 99% fairly long molecules

Imagine that we shear the 100,000 bp fragments to 10,000 bp fragments, which leaves us with 1 % of the mass in short (1,000 bp) fragments and 99 % of the mass in fairly long (10,000 bp) fragments. The numbers game looks much better than before (Figure 3). We will not get super long reads, but at least most of our data will be from ~10 kbp fragments.
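To put rough numbers on the two pools in Figures 3 and 4, here is a small sketch that takes a pool as (length, mass fraction) pairs and reports the share of molecules in each size class together with the number-average fragment length. The function name and the input format are assumptions made for this example.

```python
def pool_summary(fragments):
    """Summarise a DNA pool given as (length_bp, mass_fraction) pairs.

    Returns the molecule fraction of each size class and the
    number-average fragment length in bp.
    """
    # Relative molecule counts: mass divided by length (the g/bp constant cancels).
    counts = [mass / length for length, mass in fragments]
    total = sum(counts)
    molar = [c / total for c in counts]
    mean_length = sum(f * length for f, (length, _) in zip(molar, fragments))
    return molar, mean_length


if __name__ == "__main__":
    pools = {
        "Figure 3": [(1_000, 0.01), (100_000, 0.99)],
        "Figure 4": [(1_000, 0.01), (10_000, 0.99)],
    }
    for name, pool in pools.items():
        molar, mean_length = pool_summary(pool)
        print(f"{name}: {molar[0]:.1%} short molecules, "
              f"number-average length {mean_length:.0f} bp")
```

With these inputs the short molecules come out at roughly half of the pool for Figure 3 and a bit under a tenth for Figure 4.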

If you happen to know cool tricks to boost sequencing yields, remove short molecules, or generate long ones without expensive equipment or toxic chemicals, I would love to hear your input.

#### Rasmus H. Kirkegaard

Post Doc
Playing with microbes and bioinformatics on the path towards "finished" genomes for "everyone" of them and rapid detection. My bet is currently on nanopore sequencing and I am fortunate to be involved in MinION and PromethION research in our lab.

Posted in DNA extraction, Genomics.

1. Gary Barker

Nice post! One point though where the assumptions may not hold in practice, according to speakers at the Plant and Animal Genome conference today:
“Using the very simplistic assumption that we sequence a fixed amount of molecules regardless of length, it is clear that even a little mass of short molecules has huge impact on our chances of getting record breaking yields.”

Evidence seems to suggest that using higher molecular weight DNA actually lowers sequence output in practice, although for many folks this is an acceptable trade-off.

2. Rasmus H. Kirkegaard

Hi Gary.
Thanks for your comments. I agree that my assumptions may not hold in practice and may be too simplistic. However, we have observed in our lab that when running ligation-based library preps without removing short fragments (a smear of DNA from 100 bp to 2000 bp, not just a nice single peak) we produce less data than when incorporating the modified AMPure protocol. I have to admit that we have not made a huge effort to get really long molecules but are happy with anything above 7 kbp (https://www.sciencedirect.com/science/article/pii/S1369527414001817), as we work mostly with bacteria and archaea.

Other reasons why HMW DNA (>100kb) even without any short molecules may not produce good yields could be because of incomplete reactions (diffusion limitations or too few DNA molecules) and poor delivery to the pores.

I completely agree that for people working with eukaryotic genomes it must be very powerful to get a few reads in the Mbp range (https://www.biorxiv.org/content/early/2018/05/03/312256) even if it comes at the cost of a lower total yield. Hopefully someone will figure out how to produce Mbp reads more consistently and without losing throughput.