NCM17 – Day 2: Into the unknown

Day 2 at NCM17 started, once again, with Oxford Nanopore CEO Gordon Sanghera taking the stage and restating his dream that sequencing should be available to anybody, anytime, anywhere. There are currently around 12,500 mainframe DNA sequencers around the globe, a number he believes ONT will surpass in the not-too-distant future. The availability of portable sequencers to the masses will change the way we sequence DNA (the sequencing singularity?), not to mention all the ethics involved in sequencing the genomes of everyone – a topic which was also covered in a panel discussion. It will undoubtedly improve our understanding of diversity on our planet.

Arwyn Edwards employs Gordon’s motto in his work as he, in his own words, sequences in strange places. One example is a field trip to an old coal mine in Wales, where risks were high due to methane gas. Although all equipment was plugged in above ground to avoid sparks in the explosive environment, a battery-powered centrifuge still blew up in a blue flame. Nevertheless, the team managed to get some reads out of the run, even though the input material was a mere 3×6 ng of DNA! Other field trips have included Greenland, to investigate microbial darkening, and Svalbard, to monitor microbial populations in the melting Arctic ice. He has also been testing the new lyophilised reagents and found the results consistent with other kits and with Illumina sequencing. It is important to note, however, that all methods have biases, all databases are incomplete and all extraction methods have problems. We need to address these issues before we can go out and sequence metagenomes in the field.

Continuing this subject was Mads Albertsen, my PI, who had a clear message about what we need – more genomes! There are some problems in recovering genomes from metagenomes though: microdiversity within samples and separation of the genomes (binning). His approach is to use differential coverage, so binning can be performed by abundance. This can sometimes be done with Illumina reads alone, but hybrid assemblies with long Nanopore reads are simplifying it. Mads believes it won’t be long before we can beat evolution in speed and sequence everything! If you want to do metagenome binning with differential coverage, check out our mmgenome R package.
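
To give an idea of what differential coverage binning looks like in practice, here is a minimal, purely illustrative R sketch (ggplot2): each point is a contig, plotted by its coverage in two samples, so contigs that belong to the same genome cluster together. The column names and numbers are made up for illustration – the actual workflow lives in mmgenome.

# Hypothetical contig table: coverage of each contig in two samples plus GC content
library(ggplot2)

contigs <- data.frame(
  contig      = paste0("contig_", 1:6),
  cov_sample1 = c(120, 115, 4, 5, 60, 58),   # coverage in sample 1
  cov_sample2 = c(10, 12, 80, 85, 55, 60),   # coverage in sample 2
  gc          = c(41, 42, 65, 64, 52, 53)    # GC content (%)
)

# Contigs from the same genome share abundance profiles and form clusters
ggplot(contigs, aes(x = cov_sample1, y = cov_sample2, colour = gc)) +
  geom_point(size = 3) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Coverage, sample 1", y = "Coverage, sample 2")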

It has been a great community meeting (my first!) with plenty of activity in the demonstration zone and flow cell loading workshop.

NCM17 – Day 1: New tech and exciting applications

All the gear

We start out with blazing punk rock (things are never too boring at Oxford Nanopore meetings) and plenty of promises from Oxford Nanopore CEO Gordon Sanghera. He gave us updates on various products, such as the PromethION flow cells, which have been harder to produce than expected. However, we and others have now sequenced data on the PromethION! No, not yet the 50 Gbp that Oxford Nanopore claim to be able to achieve in-house, but we are starting to be more optimistic than we have ever been! The VolTRAX should be working now, and a second version was announced! The second version will have a built-in fluorescent detector for quantification, and it will be possible to load the prepped sample directly onto a flow cell. It will be released in the first half of 2018.

VolTRAX v. 2

The much-anticipated 128-channel Flongle will be available in the second half of 2018. We are looking forward to fast and cheap sequencing in the field and laboratory – although a MinION flowcell is relatively cheap (approx. 500 USD), it is still way too expensive for regular use in many applications where we do not need Gbps of data. Gordon Sanghera highlighted that they are currently also moving the Flongle into the diagnostic field. The sky seems to be the limit for small, cheap and generic diagnostic kits.

Oxford Nanopore has realised that data yield on flow cells depends largely on extraction and library preparation techniques and has started focusing more on this. Automation of everything from extraction to library prep would be ideal, but no updates were given on Zumbador other than the fact that a lot of work remains to be done. Along these lines, lyophilised reagents have been sent to a selected few for initial trials, and removing the cold chain will be a large step towards enabling sequencing anywhere. Furthermore, the new mobile FPGA basecaller was shown, and it will be up to the community to name the unit in a competition – use the #NanoporeAnyone tag! The SmidgION and the mobile basecaller were displayed in the demonstration area and look very fancy:

The talks

Nick Loman was up first and gave us an outlook on what we will be able to achieve with sequencing in the future. Based on the sequencing of Ebola in West Africa he showed a video of how the disease had spread and infected different regions. By using real-time genome sequencing we will be able to track diseases as they develop and although outbreaks will happen – we should be able to avoid epidemics.

This sequencing singularity, as he termed it, is however dependent on sequencing genomes instead of genes, as is standard today. Long reads will make this possible as they can often span the repeat regions in genomes. The long reads also mean that much less compute power is needed. An example was given with an E. coli sample, where an assembly was produced with miniasm in 1.5 seconds, resulting in a single contig built from 8 reads (now that Nick and his team have lost the world record for the longest sequenced read, he still claims the world record for assembling E. coli from the fewest reads).

Sergey Koren continued on a related subject, talking about how we may finally be able to close the human genome. He showed very convincing results on integrating Hi-C with long-read Nanopore data; see their new tool SALSA and their related paper here. Combining Nanopore and Illumina reads in hybrid assemblies is the preferred way to get both the desired accuracy and read length, which was also the theme of Ryan Wick’s talk. He brought up some shortcomings of assemblers and stated that all genomes are somewhat metagenomes because of variation. What should assemblers do when encountering these differences? There is a need for more variation-aware assemblers and more graph-friendly tools.

The diversity of life on Earth has always been tricky to study. Steven Salzberg addressed this in his talk about closing large genomes. Even with long reads you need bioinformagic to assemble the wheat genome, which is 16 Gbp long and full of repetitive transposon regions. The fact that it has taken 8–9 months of sequencing, 1 trillion bases sequenced on Illumina and 545 billion bases sequenced on PacBio is a good indicator of how difficult it still is to close these genomes. And it is still not closed, but a lot closer than before!

The evening speaker, Aaron Pomerantz, gave us a great presentation about his trip to the Ecuadorian rainforest with a mobile lab. The development of portable sequencing has the potential to give us bigger insights into the diversity in remote areas. Often samples have to be shipped halfway around the globe to be sequenced, adding further expenses and time to the workflow.

Nanopore Community Meeting 2017: Prologue

Oxford Nanopore is hosting their annual community meeting in New York City this week. I’ll be there, thanks to finishing in the top 5 of their “one more flowcell” competition – thanks to all who voted!

Although not as “big” as the London Calling conference they host, the line-up of speakers is impressive and promises talks on everything from megabase reads and gigabase yields to the newest assembly algorithms, field applications and maybe even some product updates? I’m excited! And in this blog post, I will give an overview of what I expect to hear at the meeting.

The first plenary speaker is Nick Loman, who, until recently, held the boasting rights to the longest read on the MinION. The record now sits with Kinghorn Genomics, but still below the megabase mark.

Although megabase reads may not be practical or strictly necessary for most applications, the importance of read length is bound to be brought up by Loman, who will also speak about the future of sequencing – the sequencing singularity, as he terms it. I am also looking forward to hearing Ryan Wick speak, as he will undoubtedly talk about his work on Unicycler, which we recently described as our preferred hybrid genome assembler. Wick will give us an overview of what to improve in order to obtain perfect assemblies (in general you should check out his awesome analyses and tools hosted on GitHub, e.g. his comparison of current basecalling algorithms for Oxford Nanopore data).

In general, the algorithm developers are very well represented, with both Sergey Koren, developer of Canu, who will give an update on assembling the human genome with ultra-long reads, and Steven Salzberg, who will present an update on their hybrid assembly method for large genomes (MaSuRCA), which is bound to be interesting (you should also check out his blog).

The current record holder in data yield, Baptiste Mayjonade, has a talk on day 1. It will be interesting to hear updates on yields from both researchers and Oxford Nanopore, as many of us are still struggling to reach 10 Gbp. Perhaps with the right methods we could all compete with Baptiste’s 15.7 Gbp on a MinION flowcell? We could also hope for some updates on PromethION flowcell yields, although these are still under heavy development (we have been running our own PromethION and are now waiting for the next flowcell shipment – a blog post should be in the works..).

The last speaker of day 1 is Aaron Pomerantz, who has been sequencing barcoded DNA in a rainforest in Ecuador. There is also Arwyn Edwards on day 2, who has been sequencing in extreme environments. As I am working with on-site DNA sequencing myself, I am very interested in hearing more about these projects. The general trend lately seems to be increased portability, which will undoubtedly spark a wide array of new exciting projects and give insights into the ecology of the Earth.

Another session of interest to me is the “metagenomics, microbiomes and microbiology” session on day 2, where my PI, Mads Albertsen, is giving a talk on metagenomics in the long-read era. I am currently moving into the metagenomics field and hope to gain some insights from this session.

With some Oxford Nanopore talks throughout the two days, I hope to get some updates on the new tech released. I’m particularly interested in the “flongle” (the small and dirt-cheap flowcell), which I have heard might be ready soon. Recently, we have seen tweets about the first version of the all-in-one Zumbador, new lyophilised reagents being tested and the first version of chemistry for the VolTRAX sample prep device.

I am sure I will be wiser by the end of next week. I will be giving some daily digests here, so stay tuned. If you are going to NCM17 yourself, please come talk to me by poster 18 on Thursday 10.50-11.15 AM and 3.25-3.50 PM, and I can give you an overview of how we recently sequenced activated sludge samples on-site at a wastewater treatment plant (and on the way home).

What is a good genome assembly?

With short reads you will often get fragmented assemblies with high single-base accuracy, and with long error-prone reads you can get a single-contig assembly from just a few ultra-long reads, but with a lot of errors due to the lower read quality. Some of these errors can be fixed by adding more coverage and polishing the assembly. Another approach is to use both technologies in combination, with which one can fairly easily produce a single-contig assembly. A good assembly should be in as many pieces as the original genetic elements it represents (one contig – one chromosome), but to allow gene calling and genome alignments, single-base accuracy is also essential. There are many genome assemblers, polishing tools etc. that will help you make a genome assembly, but how do you know if you have a good assembly? To test this, people have developed tools to calculate the Average Nucleotide Identity (ANI), estimate genome completeness, and assess the quality of genome assemblies compared to a reference. However, it may also be useful to use annotation tools to assess whether genes can be called correctly. The genome assembly, polishing, and annotation strategy is an ongoing discussion in the scientific community on Twitter.
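
To make one of these metrics concrete: contiguity statistics such as N50 (the NGA50 values reported by QUAST below are the reference-aware analogue) can be computed from the contig lengths alone. A minimal R sketch, with made-up contig lengths:

# N50: the length of the contig at which half of the total assembly size is
# reached when contigs are summed from longest to shortest
n50 <- function(lengths) {
  sorted <- sort(lengths, decreasing = TRUE)
  sorted[which(cumsum(sorted) >= sum(sorted) / 2)[1]]
}

n50(c(4200000, 300000, 100000, 40000))  # 4200000, i.e. one contig dominates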

To decide which strategy should be our “preferred” genome assembly approach based on data rather than my gut feeling about the “best assembly”, I decided to do some testing with a known “true” reference, E. coli K-12 MG1655 (U00096.2).

Objective: develop a strategy to compare genome assemblies

Strategy: Use E. coli K-12 MG1655 reference strain data to produce a set of assemblies and evaluate these with different assembly metrics in comparison to the “true” reference

Data:

|  | Mbp | X coverage | Resource |
|---|---|---|---|
| Nanopore data | 183 | 40 | In-house E. coli K-12 MG1655, grown and prepped for 1D R9.4 with RAD002 (a subset of our total run, to get a more realistic view than with 7+ Gbp of data for a bacterial genome) |
| Illumina data | 270 | 58 | ENA: SRR2627175 (I used a subset of 1,095,594 reads) |

Assemblies:

| Assembly | Read type | Contigs | Size (bp) | Relative size | ANI (%) | CheckM completeness | Prokka # CDS | Prokka # rRNA | Prokka # tRNA | Median CDS size | QUAST NGA50 | QUAST mismatches per 100 kb | QUAST indels per 100 kb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Spades | Short | 84 | 4546220 | 0.98 | 100 | 99.91 | 4235 | 11 | 87 | 810 | 132700 | 0.42 | 0.09 |
| Spades-hybrid | Hybrid | 1 | 4620377 | 0.996 | 100 | 99.91 | 4282 | 22 | 88 | 815 | 2828879 | 5.91 | 0.43 |
| Miniasm | Long | 1 | 4410101 | 0.951 | 83.95 | 0.00 | 11241 | 6 | 15 | 206 | NA | NA | NA |
| Miniasm +1xRacon | Long | 1 | 4619818 | 0.996 | 99.02 | 70.87 | 10192 | 22 | 80 | 287 | 4618042 | 239.94 | 541.42 |
| Miniasm +2xRacon | Long | 1 | 4622021 | 0.996 | 99.23 | 74.97 | 9626 | 22 | 82 | 308 | 4620219 | 222.51 | 449.98 |
| Miniasm +2xRacon +Pilon | Hybrid | 1 | 4622021 | 0.996 | 99.96 | 98.51 | 4509 | 22 | 88 | 761 | 3590212 | 10.35 | 19.19 |
| CANU | Long | 1 | 4533574 | 0.977 | 98.69 | 50.00 | 10478 | 21 | 73 | 263 | 4531613 | 113.53 | 662.93 |
| CANU +Nanopolish | Long | 1 | 4563152 | 0.984 | 99.27 | 71.79 | 9664 | 22 | 84 | 296 | 4561292 | 146.74 | 468.02 |
| CANU +Nanopolish +Pilon | Hybrid | 1 | 4567411 | 0.984 | 99.97 | 98.41 | 4415 | 22 | 87 | 770 | 1329524 | 6.17 | 15.79 |
| Unicycler | Hybrid | 1 | 4633976 | 0.999 | 99.99 | 99.91 | 4305 | 22 | 88 | 815 | 2471807 | 4.28 | 2.25 |
| Reference | – | 1 | 4639675 | 1 | 100 | 99.97 | 4300 | 22 | 88 | 818 | 4639675 | 0 | 0 |

Conclusion:

The long-read-only assemblies are nicely contiguous, with approximately the same length as the reference. However, the genes are broken due to the indel errors present in the reads, as seen in the high number of CDS detected and the low median CDS size. Polishing the assemblies with long reads improves the ANI value a lot, but the genes are still fragmented. Polishing with short reads is needed to get the number of CDS and the median CDS size into the range of the reference. The short-read-only assembly has a high sequence identity with the reference but is fragmented and cannot recreate the repeat structure of the genome. The number of CDS is lower than that of the reference, and the rRNA genes, which are known to be very similar if not identical, are messed up and co-assembled. The Unicycler assembly seems to deliver the best metrics across the board until we come to the NGA50 value reported by QUAST, which was surprisingly low. I used MUMmer to visualise the alignment of the assembly against the reference genome to investigate whether a misassembly was causing this. However, the figure below shows no big inversions or anything, so I guess that for now we will stick to Unicycler assemblies, which combine the benefit of short-read accuracy with the power of long reads.

Dotplot of the Unicycler assembly aligned to the reference genome

Methods:

I have used the following assemblers: SPAdes (including its hybrid mode), miniasm, Canu and Unicycler.

I have used the following mappers

I have used the following polishing tools: Racon, Nanopolish and Pilon.

I have used the following tools to assess genome assembly characteristics: ANI calculation, CheckM, Prokka, QUAST and MUMmer.

If you have any ideas or superior tools we have missed please let us know in the comments.

ampvis2: A guide to ordination and how to use amp_ordinate in R

(This post is part of a series of ongoing posts about selected ampvis2 features)

A common way for microbial ecologists to compare the microbial communities of different samples is by using ordination methods. These methods seek to highlight significant differences between samples based on their microbial community composition, which is useful when analysing the complex data obtained with next-generation sequencers. By plotting samples as points in an x/y plane, the ecological differences between samples can then be interpreted simply from the distance between the points (see figure 1).

Figure 1: Principal Components Analysis (Hellinger transformation applied) of 8 samples from the activated sludge of two Danish wastewater treatment plants, Aalborg East and Aalborg West, obtained by 16S rRNA amplicon sequencing on the Illumina platform. The sample points are shaped by the seasonal period in which the samples were taken in 2012.

There are numerous different ordination methods, however, and it is not clear to many researchers which method to use when, and why. This is not surprising, as they are complex to understand under the hood (it’s all math wizardry), but there are a few key differences that are important to know about when comparing the microbial communities in different samples using ordination.

First, it is important to understand that the microbial community composition at a given site (a.k.a. alpha-diversity) is nothing more than an indication of the surrounding environment and ecological conditions. The species are therefore termed response variables, as they depend on other factors (the surrounding environment), not the other way around. The microorganisms present at a particular site (which for a researcher means: in a single sample) are thus also likely to be present at a different site, as long as the ecological conditions are identical. When the environmental conditions, including physical and chemical conditions as well as the presence of other species, change over time or as a result of physical distance between sites, so does the overall microbial community composition. These changes in the environmental conditions are termed environmental gradients, and the species are expected to become gradually less abundant the more the conditions differ from the optimal set of conditions under which the individual species thrives (niche theory), and ultimately absent. Ordination methods mainly differ in how they hypothesise the abundance distribution of the response variables along the environmental gradient, commonly either a unimodal or a linear distribution.
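
To make the unimodal/linear distinction concrete, here is a small, purely illustrative R sketch plotting the two hypothesised response shapes along an artificial environmental gradient (all numbers are made up):

# A hypothetical environmental gradient, e.g. temperature
gradient <- seq(0, 30, length.out = 200)

# Unimodal response: the species peaks at its optimum and fades away from it
unimodal <- 100 * exp(-(gradient - 15)^2 / (2 * 4^2))

# Linear response: abundance simply increases along the gradient
linear <- 3 * gradient

plot(gradient, unimodal, type = "l", col = "darkgreen",
     xlab = "Environmental gradient", ylab = "Abundance")
lines(gradient, linear, col = "steelblue")
legend("topleft", legend = c("Unimodal (CA/CCA)", "Linear (PCA/RDA)"),
       col = c("darkgreen", "steelblue"), lty = 1)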

When using ordination methods to visualise the differences between samples with respect to their microbial community composition, differences and distances are one and the same thing. The distances are essentially defined by a mathematical formula with which to represent the differences between the samples as a distance- or (dis)similarity coefficient, which is used to calculate a symmetrical distance matrix containing a coefficient for each pair of samples. This multidimensional matrix is all but user-friendly to look at and difficult to visualise, and this is where ordination really gets to shine. The matrix is then subjected to, let’s keep it simple, dimensionality reduction, where the matrix is sort of “compressed” and only the most significant differences are extracted and expressed by new, synthetic axes, and usually plotted in a normal x/y scatter plot. The principle is similar to that of linear regression, where drawing a straight line through points in two dimensions can represent the overall structure of the data in only one dimension (see figure 2). The only difference is that there are just as many dimensions as there are samples. Yes, hard to imagine.

Figure 2: The main structure of two-dimensional data can often be described in one dimension by relative positions on a straight line. Adapted from Pearson (1901).

In general, six methods in particular are relevant for microbial ecology; see the table below. There are several ways to categorise them; below they are categorised by whether the distance matrix is calculated implicitly or explicitly – that is, whether the user has to manually calculate a distance matrix based on a suitable distance measure before performing the ordination (explicit) or not (implicit).

|  | Unconstrained (exploratory) | Constrained (explanatory) |
|---|---|---|
| Implicit distance measure | Principal Components Analysis (PCA) | Redundancy Analysis (RDA) |
|  | Correspondence Analysis (CA) | Canonical Correspondence Analysis (CCA) |
| Explicit distance measure | Principal Coordinates Analysis (PCoA) |  |
|  | non-metric Multidimensional Scaling (nMDS) |  |

The distance measure incorporated in Principal Components Analysis (PCA) and Redundancy Analysis (RDA) is the Euclidean distance, which is calculated by the Pythagorean theorem. Correspondence Analysis (CA) (a.k.a. weighted or reciprocal averaging) and Canonical Correspondence Analysis (CCA) are instead based on the Pearson chi-squared measure. A common way of distinguishing PCA+RDA from CA+CCA is that the former two are best applied to data showing a linear distribution of the species abundances along the environmental gradient, while the latter two are best applied to a unimodal distribution. This means that, in theory, CA and CCA should be the most suitable for ecological data, because the chi-squared measure better deals with the fact that some species can be present in some samples while absent in others (double-zeros), which is common in ecological data and fits niche theory well. PCA (and its constrained version RDA) is widely used, perhaps due to its simplicity (it just highlights numerical variance), but within ecology it underestimates ecological differences due to these double-zeros, and the most abundant species will predominantly influence the distances. This is considered by some to be a flaw of PCA when used within ecology, but with appropriate data transformation, for example the Hellinger transformation (see Legendre 2001), PCA can prove quite useful when answering questions involving only the most abundant species and their numerical differences between samples. CA (and its constrained version CCA), on the other hand, is very sensitive to the less abundant and often unique species in the samples, which are numerous in microbial communities. Again, data transformation, like the Hellinger transformation, can cope with this problem. Then CA becomes a powerful tool to tell which species are characteristic of, or correspond to, each sample or group of samples, as we will see in the examples below. One can perhaps think of it as CA (and CCA) being a more qualitative method, while PCA (and RDA) is a more quantitative method; thus they can supplement each other nicely and reveal different aspects of the data.
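
This is roughly what amp_ordinate (introduced below) wraps for you via the vegan package; a minimal sketch of PCA and CA with a Hellinger transformation, assuming otu is a samples-by-OTUs abundance matrix (the object name is hypothetical):

library(vegan)

# Hellinger transformation: square root of relative abundances,
# which tames double-zeros and dominant species
otu_hel <- decostand(otu, method = "hellinger")

# PCA is an unconstrained RDA in vegan (linear response model)
pca <- rda(otu_hel)

# CA uses the chi-squared measure (unimodal response model)
ca <- cca(otu_hel)

biplot(pca)  # samples and species in the same plot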

The advantage of Principal Coordinates Analysis (PCoA) and non-metric Multidimensional Scaling (nMDS) is that the user has the freedom to choose the distance measure explicitly. There are well over 50 different measures to choose from, but luckily only a handful are relevant for most analyses (see e.g. GUSTA ME, https://sites.google.com/site/mb3gustame/reference/dissimilarity). Common measures include the Bray-Curtis dissimilarity index, which has been widely used for decades, Pearson chi-squared, Jaccard, Chord, and lately UniFrac, which incorporates phylogeny. PCoA is very similar to PCA, RDA, CA, and CCA in that they are all based on eigenanalysis: each of the resulting axes is an eigenvector associated with an eigenvalue, and all axes are orthogonal to each other. This means that all axes reveal unique information about the inertia in the data, and exactly how much is indicated by the eigenvalue. When plotting the ordination result in an x/y scatterplot, the axis with the largest eigenvalue is plotted on the first axis, and the one with the second largest on the second axis. How much each axis contributes to the total inertia in the data is then often indicated on the axis labels as the fraction of that axis’ eigenvalue over the sum of all eigenvalues. Often the remaining axes, other than the two plotted, can reveal a large amount of information as well, but the best-case scenario is that most of the inertia is expressed by the first two. A simple bar plot of all the axes and their eigenvalues, called a scree plot, can be used to get an overview of each axis’ contribution and confirm that those plotted are the most significant ones.
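
Continuing the hedged vegan sketch from above, checking the eigenvalues and drawing a scree plot takes only a couple of lines:

# Eigenvalues of the PCA from the sketch above; their proportions are what
# usually end up on the axis labels of an ordination plot
eig <- eigenvals(pca)

# A simple scree plot: one bar per axis, the tallest (most inertia) first
barplot(eig / sum(eig), las = 2, ylab = "Proportion of total inertia")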

In contrast to the other five methods, non-metric Multidimensional Scaling is quite different. It is not an eigenanalysis-based method; it is instead a rank-based approach, which means that the magnitudes of the differences are lost (hence the term non-metric) and only the ranked relationships between the samples are preserved. This makes nMDS particularly robust and independent of any underlying environmental gradient. It can even deal with missing values and different types of data at the same time. As opposed to the eigenanalysis-based analyses, nMDS is an iterative procedure, where the resulting ordination is not a single solution. Instead of eigenvalues, nMDS uses a stress value to indicate the “success” of the ordination. The number of axes obtained by nMDS is chosen by the user a priori, and a number of iterations (20 is usually enough) are run to best fit the data onto the chosen number of axes. Each solution is used as a starting point for the next iteration, and a stress value is calculated after each iteration, stabilising after a number of iterations. A stress value of 0.20 or more is considered unacceptable, 0.1 is acceptable, and 0.05 or below is great.
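
For the explicit-distance methods the distance matrix is calculated first and then handed to the ordination; a minimal sketch of PCoA and nMDS on a Bray-Curtis matrix, again assuming the hypothetical otu abundance matrix from above:

library(vegan)

# Explicit distance measure: Bray-Curtis dissimilarities between samples
bray <- vegdist(otu, method = "bray")

# PCoA (classical/metric multidimensional scaling) on the distance matrix
pcoa <- cmdscale(bray, k = 2, eig = TRUE)
plot(pcoa$points, xlab = "PCoA 1", ylab = "PCoA 2")

# nMDS: iterative and rank-based; the stress value indicates the goodness of fit
nmds <- metaMDS(bray, k = 2, trymax = 20)
nmds$stress  # below ~0.1 is acceptable, below ~0.05 is great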

The constrained analyses (RDA and CCA) have the ability to highlight how measured environmental variables, e.g. pH, temperature or a nutrient concentration, can explain the observed differences in the data. This is done by introducing an additional step in the analysis, where a linear combination of the variable(s) is used to obtain the axes; hence RDA is considered an extension of PCA, and CCA an extension of CA.
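
In vegan this extra step is just a model formula with the environmental variables on the right-hand side; a hedged sketch, reusing otu_hel from before and assuming env is a hypothetical data frame of measured variables:

# Constrained ordination: the axes are linear combinations of the
# environmental variables, so they directly "explain" the differences
rda_fit <- rda(otu_hel ~ pH + Temperature, data = env)
cca_fit <- cca(otu ~ pH + Temperature, data = env)

# Permutation test of whether the constraints explain anything at all
anova(rda_fit, permutations = 999)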

How to perform ordination with ampvis2 in R

Now that you know the basics of the different ordination methods relevant for microbial ecology, let’s demonstrate how to create publication-ready ordination plots with the ampvis2 package in R and compare the results of different methods on the same data. The amp_ordinate function aims to simplify the process of performing ordination by basically wrapping numerous functions from other R packages (many of which are from the vegan and ggplot2 packages) into one large function. Performing ordination is often a multi-step workflow, and amp_ordinate performs all the required steps with just one function call. Along the way there are also several error checks to help guide the user in the process. All steps are fully customisable by the user. By default, amp_ordinate does the following, in order:

  1. filters low-abundance OTUs
  2. performs data transformation
  3. calculates a distance matrix (if nMDS/PCoA)
  4. calculates both site and species scores by the chosen ordination method
  5. generates the ordination plot with numerous visual layers defined by the user
  6. fits the correlation of environmental variables to the ordination plot
  7. returns a nice-looking plot or an interactive plotly plot

The ordination is performed on the read counts contained in the OTU table (which simply contains the read counts of each OTU in each sample), and the corresponding metadata of those samples are used for all aesthetic options, like the coloring or shape of sample points based on a group variable, as well as for environmental interpretation by constrained ordination methods or the fitting of the correlation of environmental variables onto the ordination plot.

Examples

Some of our 7th semester student groups have really done a great job collecting samples from a plethora of interesting environments, from nature to animal faeces, in the hunt for novel bacteria. In the following examples, data obtained by 16S rRNA amplicon sequencing (on the Illumina MiSeq platform) of a handful of their samples is used. The samples are from quite different environments, including the faeces of a bearded dragon and a vicuña from Aalborg Zoo, sediment from the Limfjord, a wooden boardwalk from Lille Vildmose, and a rusting pipe from somewhere.
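
Before running any of the examples, the data must be loaded into an ampvis2 object; a minimal sketch, assuming the OTU table and metadata are available as plain text files (the file names are hypothetical):

library(ampvis2)

# The OTU table holds read counts per OTU per sample (plus taxonomy),
# the metadata holds one row per sample, including the "SampleContent" variable
otutable <- read.csv("otutable.csv", check.names = FALSE)
metadata <- read.csv("metadata.csv")

data <- amp_load(otutable = otutable, metadata = metadata)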

A Principal Components Analysis can be performed by simply setting type = “pca” as shown below. This would produce a plot with just black dots, which is not very useful, so setting sample_color_by and sample_colorframe can color the sample points based on the metadata variable “SampleContent”, which simply contains where the samples were taken.

library(ampvis2)
amp_ordinate(
  data = data,
  type = "pca",
  transform = "hellinger",
  sample_color_by = "SampleContent",
  sample_colorframe = TRUE,
  sample_colorframe_label = "SampleContent"
)

The same ordination can also be shown with the species (OTU) scores as an interactive plotly plot by additionally setting species_plotly = TRUE:

amp_ordinate(
  data = data,
  type = "pca",
  transform = "hellinger",
  sample_color_by = "SampleContent",
  sample_colorframe = TRUE,
  sample_colorframe_label = "SampleContent",
  species_plotly = TRUE
)

Switching to Correspondence Analysis, and plotting the species scores as well, is just a matter of setting type = "ca" and species_plot = TRUE:

amp_ordinate(
  data = data,
  type = "ca",
  transform = "hellinger",
  sample_color_by = "SampleContent",
  sample_colorframe = TRUE,
  sample_colorframe_label = "SampleContent",
  species_plot = TRUE
)

The term “ordination” in fact stems from the German word “Ordnung”, meaning “order”.

Populating the tree-of-life

Hi everybody, and welcome to my first blog post at Albertsen Lab. As a newly started PhD student, I have engaged myself with the simple, yet Herculean, task of populating the tree of life. As most people are aware, microorganisms are more or less inescapably present everywhere in the world — no matter how hostile an environment you encounter, it will probably accommodate some sort of small living organism. As it was once elegantly put: “If you don’t like bacteria, you’re on the wrong planet”. Or in this case, if you don’t like bacteria, you’re in the wrong blog post. Recently, a study was published by Kenneth J. Locey & Jay T. Lennon (2016) trying to illuminate just how omnipresent microorganisms are. This was carried out by applying scaling laws to predict the global microbial diversity. Main conclusion: our planet hosts up to 10^12 microbial species. That is 1 trillion species! Of course, one trillion microbial species is still a number that is dwarfed by the approximately 10^30 bacterial and archaeal cells living on Earth. Since incredibly large numbers tend to be hard to grasp, the article kindly supplied some illustrative examples. For example, the ~10^30 bacterial and archaeal cells exceed the 22 orders of magnitude that separate the mass of a Prochlorococcus cell from that of a blue whale. It also exceeds the 26 orders of magnitude that result from measuring Earth’s surface area at a spatial grain equivalent to bacteria.

From the perspective of my research, it is naturally only the so-called microbial dark matter (microbes not yet discovered) that is of interest. Fortunately, only an infinitesimal fraction of microorganisms have been discovered to date. For some reason, it is currently a prerequisite to have a bacterium growing in pure culture before naming it. Alas, if you take a quick look at DSMZ’s homepage (one of the largest bioresource centres worldwide, located in Germany), you will find that they boast a collection of around 31,000 cultures representing some 10,000 species and 2,000 genera in total. Only a tiny bit short of one trillion. On a side note, I guess DSMZ will eventually face some kind of capacity-related problem if we insist on requiring each new species as a pure culture before an ‘official’ name can be granted. Luckily, nowadays bacterial species can also be catalogued by their DNA sequence. Currently, one of the most widespread methods for identifying bacteria is 16S rRNA amplicon sequencing. Large databases such as SILVA and the Ribosomal Database Project (RDP) use huge amounts of digital ink to keep up with the ever-increasing influx of new sequences. The current version 128 of SILVA (released in September 2016) includes over 6,300,000 SSU/LSU sequences, whereas RDP Release 11 contains 3,356,809 16S rRNA sequences (also released in September 2016). Although this is definitely a lot more than 10,000 pure cultures, it is still ridiculously less than the total estimated microbial diversity of the Earth. Hence, as I begin my PhD study, the estimates from Kenneth J. Locey & Jay T. Lennon suggest that potentially 99.999% of all microbial taxa remain undiscovered!

This may sound like my research is going to be a cakewalk. After all, I should be able to go to the lab and find novel microorganisms literally by accident. However, I have decided to drastically confine my explorations of microbial dark matter to a specific environment — although travelling around the world chasing novel bacteria would be pretty cool. So far, my primary focus has been sampling of microbial biomass from drinking water. You may think it is a bit weird to look for novelty in a resource where one of the main goals is to restrict living things. However, there is more than one good reason for this choice. 1) I worked with drinking water during my master’s thesis, so I already have experience with sampling and extraction procedures as well as a few connections with people working in this field. 2) Recently, articles such as Antonia Bruno et al. 2017, Karthik Anantharaman et al. 2016 and Birgit Luef et al. 2015 have illustrated that drinking water may potentially contain a large portion of microbial novelty. 3) Unless you are living in a third-world country, gaining access to water is not very complicated. But just to clarify, do not mistake “easy access” for “easy sampling”, as the sampling of microorganisms from drinking water can be a king-sized pain in the ass.

A drastically under-represented visualization of the microbial diversity in drinking water.

As you might have guessed, I am not going to dedicate much time to trying to make bacteria living in large water reservoirs deep below the surface grow on plates in the lab. For identification, I am instead going to one-up the conventional 16S rRNA amplicon sequencing method. A new technique developed by some of the people in the Albertsen Lab has shown very promising results, generating full-length 16S rRNA gene sequences. Hopefully, I will have the opportunity to address it in a bit more detail in a later blog post. However, the first hurdle is simply collecting sufficient amounts of biomass for subsequent extraction. As stated in the protocol, the input material is ~800 ng RNA. If you — like many of my colleagues at Albertsen Lab — have been working with wastewater, 800 ng RNA is no big deal. Getting the same amount from drinking water is a big deal. Drinking water typically has bacterial concentrations in the range of 10^3–10^5 cells/ml, which is why collecting adequate amounts of biomass is difficult. I naively started out using the same sampling setup that I used in my master’s thesis (where 1 ng/µl DNA would be plenty for further analysis). It basically consisted of a vacuum pump and a disposable funnel, and after spending too many hours pouring several litres of water into a puny 250 ml funnel, I ended up with negligible amounts of RNA. This is the point where you break every single pipette in the lab in despair and tell your supervisor that the current sampling method is not feasible. Instead, the sampling task has been partly outsourced to people with more specialised equipment. So far I have worked with samples based on 100+ litres of water, yielding RNA amounts of more than 800 ng, but the workflow still needs some optimisation.

Another aspect complicating the sampling step is the really obvious fact that bacteria are really small. I am perfectly aware that all bacteria can, with very good reason, be categorised as “small”; my statement refers to the aforementioned article from Birgit Luef et al. 2015, “Diverse uncultivated ultra-small bacterial cells in groundwater”. The article highlights findings that demonstrate how a wide range of bacteria can pass through a 0.2 µm-pore-sized membrane filter. FYI, filtration with a 0.2 µm filter is also commonly referred to as “sterile filtration”. One could argue it is a poorly chosen term for a filtration type that apparently allows numerous types of bacteria to pass the membrane unhindered. Also, sterile filtration is by far the most used sampling method in papers concerning 16S rRNA amplicon sequencing of drinking water. During my master’s thesis, I also used 0.2 µm filters for water samples; however, in the pursuit of novelty, the filter size has now been reduced to 0.1 µm. This should, as far as I am concerned, capture all microbes inhabiting drinking water (although it is hard not to imagine at least one out of a trillion species slipping through).

Hopefully, I will start to generate full-length 16S rRNA sequences from novel bacteria in the near future and maybe share some interesting findings here on albertsenlab.org.

Can you beat our Nanopore read error correction? We hope so!

Tagging of individual molecules has been used as an effective consensus error-correction strategy for Illumina data (Kivioja et al 2011, Burke et al 2016, Zhang et al 2016), and the principle is similar to the circular consensus sequencing strategy used to generate consensus reads with error rates of <1% on the PacBio (Travers et al 2010, Schloss et al 2016, Singer et al 2016) and Oxford Nanopore platforms (Li et al 2016). You simply sequence the same molecule several times and compare the reads to generate a consensus with better accuracy than the individual reads. As far as we know, a tag-based consensus error-correction strategy has not been attempted for long reads before, probably because the raw error rate complicates identification of the tags. However, we see several benefits of the tag-based strategy in the long run, which is why we decided to pursue it.

My colleague @SorenKarst tested tag-based error correction on the Nanopore MinION in connection with our work on generating primer-free, full-length 16S/18S sequences from environmental rRNA (see our bioRxiv paper: Karst et al 2016). The main approach used in the paper is based on Illumina sequencing, inspired by Burke et al 2016, but moving to Nanopore sequencing in the future would make the approach considerably easier. His approach was relatively “simple”: individual cDNA molecules were uniquely tagged at both ends with a 10 bp random sequence, then diluted to a few thousand molecules and amplified by PCR to generate thousands of copies of each molecule, which were prepared for 2D sequencing on the Nanopore MinION. The resulting sequence reads were binned based on the unique tags, which indicated that they originated from the same parent molecule, and a consensus was generated from each read bin. The approach was tested on a simple mock community with three reference organisms (E. coli MG1655, B. subtilis str. 168, and P. aeruginosa PAO1), which allowed us to calculate error rates.

For locating the unique tags we used cutadapt with loose settings to locate the flanking adaptor sequences and extract the tag sequences. The tags were clustered and filtered based on abundance to remove false tags. As the tags and adaptors contain errors, it can be a challenge to cluster the tags correctly without merging groups that do not belong together. Afterwards, the filtered tags were used to extract and bin sequence reads using a custom perl script ;). For each bin we used the CANU correction tool followed by USEARCH consensus calling. By this very naive approach we were able to improve the median sequence similarity from 90% to 99%.

We think this is a good start, but we are sure that someone in the nanopore community will be able to come up with a better solution to improve the error rate even further. The data is freely available, and a short description of the sequence read composition is provided below. We are looking forward to hearing your input!

Ps. If you come up with a solution that beats our “quick and dirty” one and post it here or on twitter, I will make sure to mention you in my slides at ASM ;).

 

Data and script availability:

The nanopore reads are available as fastq at: 2D.fq or fasta: 2Dr.fa and fast5: PRJEB20906

The 16S rRNA gene reference sequences: mockrRNAall.fasta

Scripts:

Our approach in a shell script
#############################################################################
# 									    #
# Shell script for generating error corrected FL16S and getting error rates #
#									    #
# Use at your own RISK!							    #
#############################################################################

####################
#     Variables    #
####################
ID_adapt=0.1;
ID_cluster=0.8;
LINKtoCANU=/space/users/rkirke08/Desktop/canu/canu-1.3/Linux-amd64/bin;
# Update path to include poretools installation
# export PATH=$PATH:/space/users/rkirke08/.local/bin
####################
# End of variables #
####################
###############################################
# Depends on the following files and software #
###############################################
# folder with fast5 files "data/pass/"
# file with reference 16S sequences "mockrRNAall.fasta"
# perl script "F16S.cluster.split.pl"
# poretools
# cutadapt
# usearch8.1
# CANU
###############################################
# End of dependencies			      #
###############################################

# Extract fastq files
# poretools fastq --type 2D data/pass/ > data/2D.fq


# Rename headers (Some tools do not accept the long poretools headers)
#awk '{print (NR%4 == 1) ? "@" ++i : $0}' data/2D.fq | sed -n '1~4s/^@/>/p;2~4p' > 2Dr.fa

# Find adapters
cutadapt -g AAAGATGAAGAT -e $ID_adapt -O 12 -m 1300 --untrimmed-output un1.fa -o a1.fa 2Dr.fa
cutadapt -a ATGGATGAGTCT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o a1_a2.fa a1.fa

usearch8.1 -fastx_revcomp un1.fa -label_suffix _RC -fastaout un1_rc.fa
cutadapt -g AAAGATGAAGAT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o ua1.fa un1_rc.fa
cutadapt -a ATGGATGAGTCT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o ua1_a2.fa ua1.fa

cat a1_a2.fa ua1_a2.fa > c.fa

# Extract barcodes
cut -c1-12 c.fa > i1.fa
rev c.fa | cut -c1-12 | rev > i2.fa

paste i1.fa i2.fa -d "" | cut -f1-2 -d ">" > i1i2.fa


# Cluster barcodes
usearch8.1 -cluster_fast i1i2.fa -id $ID_cluster -centroids nr.fa -uc res.uc -sizeout

# Extract raw sequences
perl F16S.cluster.split.pl -c res.uc -i c.fa -m 3 -f 50 -r 40
# Count number of files in directory
find clusters -type f | wc -l

FILES=clusters/*.fa
for OTU in $FILES

do
  wc -l $OTU >> lines.txt	
done

FILES=clusters/*.fa
for OTU in $FILES

do
  OTUNO=$(echo $OTU | cut -f2 -d\/);
  # Rename header
  sed "s/>/>$OTUNO/" clusters/$OTUNO > clusters/newHeaders_$OTUNO

  # Correct reads using CANU
  $LINKtoCANU/canu -correct -p OTU_$OTUNO -d clusters/OTU_$OTUNO genomeSize=1.5k -nanopore-raw  $OTU

  # Unzip corrected reads
  gunzip clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta.gz

  sed -i "s/>/>$OTUNO/" clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta

  # Call consensus using Usearch
  usearch8.1 -cluster_fast clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta -id 0.9 -centroids clusters/OTU_$OTUNO/nr_cor_$OTUNO.fa -uc clusters/OTU_$OTUNO/res_$OTUNO.uc -sizeout -consout clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa

  sed -i "s/>/>$OTUNO/" clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa

  # Map reads to references to estimate error rate
  # Raw reads
  # Map FL16S back to references
  usearch8.1 -usearch_global clusters/newHeaders_$OTUNO -db mockrRNAall.fasta -strand both -id 0.60 -top_hit_only -maxaccepts 10 -query_cov 0.5 -userout clusters/map_raw_$OTUNO.txt -userfields query+target+id+ql+tl+alnlen

  # Usearch consensus corrected sequence
  # Map FL16S back to references
  usearch8.1 -usearch_global clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa -db mockrRNAall.fasta -strand both -id 0.60 -top_hit_only -maxaccepts 10 -query_cov 0.5 -userout clusters/map_cor_Ucons_$OTUNO.txt -userfields query+target+id+ql+tl+alnlen

  cat clusters/map_raw_$OTUNO.txt >> myfile.txt
  cat clusters/map_cor_Ucons_$OTUNO.txt >> myfile.txt

  # Collect corrected sequences
  cat clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa >> final_corrected.fa
done

Requirements:

cutadapt

USEARCH

the perl script F16S.cluster.split.pl

cDNA molecule composition

The cDNA molecules are tagged by attaching adaptors to each end of the molecule. The adaptor contains a priming site (red), the unique 10 bp tag sequence (blue) and a flanking sequence (black). Note that the design is a bit convoluted, as we simply modified it from our approach for getting “Thousands of primer-free, high-quality, full-length SSU rRNA sequences from all domains of life”.

AAAGATGAAGATNNNNNNNNNNCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTTTTTTTTTTTTTTTT<—- fragment of SSU cDNA molecule—->GGGCAATATCAGCACCAACAGAAATAGATCGCNNNNNNNNNNATGGATGAGTCT

The number of T’s before the fragment will vary between molecules because it is a result of the polyA tailing described in the paper. The black parts of the sequence are generally not needed for the purpose of Nanopore sequencing but are present in the molecule because they were needed for the Illumina sequencing.
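
To illustrate how the tags relate to the adaptors, here is a small, purely illustrative R sketch that pulls the two 10 bp tags out of a read, assuming a perfect match to the adaptor sequences (the read below is made up; the real pipeline uses cutadapt precisely because the reads contain errors):

# Front tag: the 10 bp directly after AAAGATGAAGAT
# Rear tag:  the 10 bp directly before ATGGATGAGTCT
extract_tags <- function(read) {
  front <- regmatches(read, regexpr("AAAGATGAAGAT([ACGT]{10})", read))
  rear  <- regmatches(read, regexpr("([ACGT]{10})ATGGATGAGTCT", read))
  c(front_tag = substr(front, 13, 22),
    rear_tag  = substr(rear, 1, 10))
}

# A made-up, error-free read following the layout described above
read <- paste0("TTCAG", "AAAGATGAAGAT", "GTTCGTTATT", "CGTACTAGACTTGCCTGTCGCTCTATCTTCT",
               "TTTTTTTTTT", "ACGTACGTACGT",
               "GGGCAATATCAGCACCAACAGAAATAGATCGC", "ACCATGGACA", "ATGGATGAGTCT", "GGT")

extract_tags(read)  # front_tag "GTTCGTTATT", rear_tag "ACCATGGACA"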

Example Nanopore sequence read:

>18125
GATCTGGCTTCGTTCGGTTACGTATTGCTGGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTGGTCAAGCCTCACGAGCAATTAGTACTGGTTAACTCAACGCCTCACAACGCTTACACACCCAGCCTATCAACGTCGTAGTCTCCGACGGCCCTTCAGGGGAATCAAGTTCCAGTGAGATCTCATCTTGAGGCAAGTTTCCGCTTAGATGCTTTCAGCGGTTATCTTTTCCGAACATGGCTACCCGGCAATGCCACTGGCGTGACAACCGGAACACCAGAGTTCGTCCACCCGGTCCTTCCGTACTAGGAGCAGCCCCTCTCAAATTCAAACGTCCACGGCCAGATGGGGACCGAACTGTCTCACGACGTTCTAAGCCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTAGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATCGAGGTGCCAAACACCGCCGTCGATAAACTCTTGGGCGGTATCAGCCTGTTATCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCTTCCATACAGAACCACCGGATCTTCAAGACCTACTTTCGTACCTGCTCGACGTGTCTGCTCTGATCAAGCGCTTTTGCCTTTATATTCTCTGCGACCGATTTCCGACCGGTCTGAGCGCACCTTCGTGGTACTCCTCCGTTACTCTTTTAGGAGGAGACCGCCCCAGTCAAACTGCCCACCATACACTGTCCTCGATCCGGATTACGGACCAGAGTTAGAACCTCAAGCATGCCAGGATGGTGATTTCAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCCACCTAATCCTACACAGCAGGCTCAGTCCAGTGCCGCTACAGTAAAGGTTCACGGGGTCTTTCCGTCGCCGCGGATACACTGCATCTTCACAGCGATTTCAATTTCACTGAGTCTCGGGTGGAGACAGCGCCGCCATCGTTACGCCACTCGTGCAGGTCGGAACTTACCCGACAAGGAATTTCGCTACCTTGGACCGTTATCGTTACGGCCGCCGTTTACCGGGGCTAGATCAGGCTTCGCGCCCCATCAATACTTCCGGCACCGGGAGGCGTCACACTTATACGCCGTCCACTTTCGTGTTTTGCAGAGTGCTGTGTTTTTAATAAACAGTCGCAGCGGCCTGGTATCTTCGACCAGCCAGAGCTTACGGAGTAAATCCTTCACCCTAGCCGGCGCACCTTCTCCCGAAGTTACGGTGCCATTTGCCTAGTTCCTTCACCCGAGTTCTCAAGCGCCTTGGTATTCTCTACCCGACCACCTGTGTCGGTTTGGGTGCAGTTCCTGGTGCCTGAAGCTTAGAAGCTTTTGGAAGCATGGCATCAACCACTTCGTCGTCTAAAAGACGACTCGTCATCAACTCTCGGCCTTGAAACCCCGGATTTACCTAAGATTTCAGCCTACCACCTTAAACTTGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGGTGGAACGTTCTGTTTATGTTTCTGAAA

Example cluster with same random sequence:

>18125
GATCTGGCTTCGTTCGGTTACGTATTGCTGGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTGGTCAAGCCTCACGAGCAATTAGTACTGGTTAACTCAACGCCTCACAACGCTTACACACCCAGCCTATCAACGTCGTAGTCTCCGACGGCCCTTCAGGGGAATCAAGTTCCAGTGAGATCTCATCTTGAGGCAAGTTTCCGCTTAGATGCTTTCAGCGGTTATCTTTTCCGAACATGGCTACCCGGCAATGCCACTGGCGTGACAACCGGAACACCAGAGTTCGTCCACCCGGTCCTTCCGTACTAGGAGCAGCCCCTCTCAAATTCAAACGTCCACGGCCAGATGGGGACCGAACTGTCTCACGACGTTCTAAGCCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTAGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATCGAGGTGCCAAACACCGCCGTCGATAAACTCTTGGGCGGTATCAGCCTGTTATCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCTTCCATACAGAACCACCGGATCTTCAAGACCTACTTTCGTACCTGCTCGACGTGTCTGCTCTGATCAAGCGCTTTTGCCTTTATATTCTCTGCGACCGATTTCCGACCGGTCTGAGCGCACCTTCGTGGTACTCCTCCGTTACTCTTTTAGGAGGAGACCGCCCCAGTCAAACTGCCCACCATACACTGTCCTCGATCCGGATTACGGACCAGAGTTAGAACCTCAAGCATGCCAGGATGGTGATTTCAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCCACCTAATCCTACACAGCAGGCTCAGTCCAGTGCCGCTACAGTAAAGGTTCACGGGGTCTTTCCGTCGCCGCGGATACACTGCATCTTCACAGCGATTTCAATTTCACTGAGTCTCGGGTGGAGACAGCGCCGCCATCGTTACGCCACTCGTGCAGGTCGGAACTTACCCGACAAGGAATTTCGCTACCTTGGACCGTTATCGTTACGGCCGCCGTTTACCGGGGCTAGATCAGGCTTCGCGCCCCATCAATACTTCCGGCACCGGGAGGCGTCACACTTATACGCCGTCCACTTTCGTGTTTTGCAGAGTGCTGTGTTTTTAATAAACAGTCGCAGCGGCCTGGTATCTTCGACCAGCCAGAGCTTACGGAGTAAATCCTTCACCCTAGCCGGCGCACCTTCTCCCGAAGTTACGGTGCCATTTGCCTAGTTCCTTCACCCGAGTTCTCAAGCGCCTTGGTATTCTCTACCCGACCACCTGTGTCGGTTTGGGTGCAGTTCCTGGTGCCTGAAGCTTAGAAGCTTTTGGAAGCATGGCATCAACCACTTCGTCGTCTAAAAGACGACTCGTCATCAACTCTCGGCCTTGAAACCCCGGATTTACCTAAGATTTCAGCCTACCACCTTAAACTTGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGGTGGAACGTTCTGTTTATGTTTCTGAAA
>42262
AGCGTTCAGATTACGTATTGCTAGGGGGCAAAGATGAAGATGTTCGTTATTCTTTAGACTTGCCTGTCGCTCTATCTTCTCTTTTTGGTCAAGCCTCACGGGCAATTAGTACTGGTTAGCTCAACGCCTCACAACGCTTACAGCCTATCAACGTCATAATTCTTCTGACGGCCCTTCAGAATCAAGTTCCCAGTGAGATCTCATCTTGAGCAAGTTTCCCACCGTCTTTCAGCGGTTATCTTTTCGAACCTGCTTCCAGCAATACCACTGGCGTGACAACCGGAACACCAGAGGTTCGTCCACTCCGGTCCTCTCCGTACTAGGGCAGCCCTCTCAAATCTCAGAACGTCCACGGCAGATAGGACCGAACTGTCTCACGACGTTCTAAGCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATGCAGGACCGGCTTCGGCCCCAGGATGTGATGAGCCGGCATCGGGTGCCAAACACCGCCGTCGATATAAACTCGGGCATTGACCTGTTATCCCCGGGTACCTTTTTATCGTTGAGCGATGGCCCTTCCATACCAGAACCACCGGATCACTACAGACCTACTTTCGTACCTGCTCGCTGTCTGTCGCGGCCAAGCGCTTTTGCTATGCTCTGCGACCGATTTCCGACCGGTCTGGGCGCACCTTCGTACTCCGTTGCCTCTTTTGGAGACCGCTGATCAAACTGCCCACCATACACTGTCCTCGATCCGGATTACCAGAGTTTAGAACTCAATGCCAGGGTGGTATTTCAAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCACCTATCCTACAAGCAGGCTCAAAGTCCAGTACAACTACAGTAGGTTCACGGGGTCTTTCCGTCTAGCCGCGGATACCTGCATCTTCAGCGTTTCAATTTCACTGAGTCTCAGGTGGAGACAGCGCCGCCATCGTTACGCCATTCGTGCAGGTCGGAACTTGCCGACAAGGAATTTTGCACCTTGGGACCATTCGTTACGCCGTTTACCGGGGCTGATCAAGAGCTTGCTTGCGCTAACCCCATCAATTAATTTTTCCGGCACCGGGGAGGCGTCACACCTACGTCCCACTGCGTGTTTGCAGAGTGCTGTGTTTAATAAGTCGCAGCAGCTCAGTATCTTCGACCAGCCAGAGCTTACGGAGTAAATCTTCACCTAGCCGGCGACCTTCTCCCGGAAGTTACGGTGCCATTTGCCTAGTTCCTCCGCACCCGAAAGCCCTTCGCGCCTTGGTATTCTCTACCCGACCTGTGTCGGTTTGGGGCACGGTTCCTGGCCTGAAGCAGAAGCTTTTCTTGGAAGCCTGGCATCAACCACTTCGTCATCTAAAAGACGACTCGTCATCAGCTCTCGGCCTTGAAACCGGATTTACCTAAGATTTCAGCCTACCACCTTAAACTTGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGTGGAGACGTTCTGTTTATGTTTCTATC
>50101
CCCGGTTACGTATTGCTAGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTGGTCAAGCCTGCGGGCAATTATACTGGATAGCTCAACGCCTCACAACGCATACACCCAGCTTCTATCAACGTCGTAGTCTTCGACGGCCCTTCAGGAATCAAGTTCCCAGTGAGATCTCATCTTGAGGCAAGTTTCCCGCTTAGATGCTTTCAGCGGTTATCTTTTCCGAACATAGCTACCCGGCAATGCCACTGGCGTGACAACCGGAACACCAGAGGTTCGTCCACTCCGGTCCTCTCGTACTAGGAGCAGCCTCTCAAATCAAACGTCCACGGCAGATATAGGGACCGAACTGTCTCACGACGTTTCTAAACCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCACCCTTGGGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATCGGGAACAAACACCGCCGTCGATATAAACTCTTGGGCGGTATCAGCCTGTTATCCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCCTTCCATACAGAACCACCGGATCACTAAGACCTACTTTCGTACCTGCTCGACGTGTCTGTCTCGCAGTCAAGCGCGCTTTTGCTTTATACTCTGCGACCGATTTCCGACCGGTCTGAGCGCACCTTCGTACTCTCCGTTACTCTTTAGGAGACCGCCCCAGTCAAACTGCCCACCATACACTGTCCCTATCGATCCGGATTACGGACAGAGTTAGAACCTCAAGCATGCCAGGGTGGTATTTCAAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCACCTATCCTACACAAGCAGGCTCAAAGTCCAGTGCAAAGCTACAGTAAGGTTCACGGGTCTTTCCGTCTAGCCGCGGATACACTGCATCTCCACAGCGATTTCACCTCACTGAGTCTCTCGGGTGGAGACAGCGCCGCCATCGTTACGCCATTCGTGCAGGTCGGAACTTACCGACAAGGAATTTCGCTACCTTAGACCGTTATCGTTACGGCCGCCGTTTACCGGGGCTTCGATCAAGAGCTTCGCTTGCGCTAACCCCATCAATTAACCTTCGGCACCGGGGAGGCGTCACACCCTATACGTCCACTTTCGTGTTTGCAGAGTGCTGTGGCTTTTAATAAACAGTCGCAGCGGCCTGGTATCTTTTCGACCAGCCAGAGCTTACGGAGTAAATCCTTCACCTTAGCCGGCGCACCTTCTCCCGAAGTTACGGTGCCATTTGCTAGTTCCTTCACCCGAGTTCTCTCAAGCGCCTTGGTATTCTCTACCCGACCACCTGTGTCGGTTTGGGGTACGGTTCTGGTTACCTGAAGCTTAGAAGCTTTTCTTGGAAGCATGGCATCAACCACTTCGTCGTCTAAAGACGACTCGTCATCAGCTCTCGGCCTTGAAACCCCGGATTTACCTAAAGATTTCAGCCTACCACCTTAAACTTAGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGGTGGAAGTTCTGTTTATGTTTCTTGAGC

Starting from scratch – building a package in R

For the first time, I am going to share something more closely related to my master’s thesis. When I started this thesis, I did not know how to use R. In order to learn R, I started using DataCamp, which is a series of interactive courses. You can start from scratch and build your skills step by step. My favourite course so far is called “Writing Functions in R”. During the course, you are told:

If you need to copy something three times or more – build a function.

As a rookie, it sounds complicated, but in fact it is really simple AND it will save you so much trouble later. Defining a function will allow you to produce code that is easy to read, easy to reuse and easy to share with colleagues. A function can be defined with the following syntax:

my_function <- function(arguments){
  Function Body
}

my_function can be any valid variable name; however, it is a good idea to avoid names used elsewhere in R. Arguments (also called formals) can be any R objects that are needed for my_function to run, for example numbers or data frames. Arguments can have a default value or not; if not, a value must be provided for the function to run. The function body is the code between the { brackets }, and this is run every time the function is applied. Preferably, the function body should be short, and a function should do just one thing. If a large function cannot be avoided, it can often be constructed from a series of small functions. An example of a simple function could be:

cookies <- function(who, number=10){
  print(paste(who, "ate", number, "cookies", sep = " "))
}

The cookies function has two arguments: the number argument defaults to 10, so the user does not necessarily need to provide a value, while the who argument has no default and a name must be provided. I had some cookies, BUT I only had nine cookies, so I had better change the number argument:

cookies(who="Julie", number=9)
[1] "Julie ate 9 cookies"

So, now I have defined a function to keep track of my cookie consumption. What if I want to share this with the rest of Albertsen Lab? I could forward a script for them to save locally. No no, I will build a personal R package. This might seem like overkill for the cookie function, but imagine a more complex function. In my search for helpful tools for calculating correlations, I have come across several functions/sets of functions with no documentation. It is nearly impossible to piece together how, what and when to use arguments with no help provided. So, now I will build a bare-minimum package to help me share my function with the group, strongly inspired by Not So Standard Deviations. For more information, check out the excellent book “R Packages” by Hadley Wickham.

First, you will need the following packages:

install.packages("devtools")
library("devtools")
install.packages("roxygen2")
library("roxygen2")

After this we need to create a package directory:

create("cookies") #create package

So now a package called cookies has been created (you can change the working directory with setwd("my_directory")).

It is a good idea to update the DESCRIPTION file so that it contains the relevant information about the package (cookies) and the author (me). The next step is to add the cookie function to the package. For this, I save a script containing the function in the R folder. If you want to add more functions to your package, you can either create a new file for each function (recommended) or define the functions sequentially in one file.

Now comes the important part – documentation. Good documentation is key if you want other people to benefit from your work. This can be done easily using the roxygen2 package also by Hadley Wickham. roxygen2 uses a custom syntax so that the text starting with #' will be compiled into the correct format for R documentation when processed. Make a new R script with the following code and save it as cookies.R in the folder cookies/R:

#' Cookies
#'
#' This function will allow you to keep track of who ate your cookies.
#'
#' @param who Who ate the cookies? (Requires input)
#' @param number How many cookies have been eaten? (Default: 10)
#' @keywords cookies
#' @export


cookies <- function(who, number=10){
  print(paste(who, "ate", number, "cookies", sep = " "))
}

After creating the script then roxygen2 can be used to create all the needed files and documentation:

roxygenise("cookies/")

Lastly the package needs to be installed:

install_local("cookies")

You can now access your awesome functions by loading your brand new package:

library("cookies")

Now you have built a real R package! If you type ?cookies in the console, a help page will actually pop up.

Finally, you can upload your package to github.com (Guide). This will allow others to try out your package, point out issues and suggest changes. Downloading and installing directly from GitHub is easy using install_github() from the devtools package. Try it out by typing this:

devtools::install_github("julieklessner/cookies")

It really can be this easy! So next time you copy something more than three times or really want to share your work, consider defining a function and building your own personal package with associated documentation.

Analysing amplicon data, how easy does it get?

Ever done amplicon DNA sequencing of the 16S rRNA gene to identify microbes? If so, you know how challenging it can be to analyse such complex data.

My name is Kasper, and I am currently a master's student here at the Albertsen Lab. When I first learned how DNA is sequenced today, I was astonished by the rapid development of DNA sequencing technologies over the last decade. A whole human genome can now be sequenced within a day for less than $1000! The applications of DNA sequencing are countless, and there are countless questions yet to be answered. But the first and most important question is perhaps… how? What do you do with millions of sequences of just A, C, T and G's? Well, that question is the foundation of a huge field within biology: bioinformatics! Which is something we try to expand here at Albertsen Lab.

During my master's thesis I have been working on taking ampvis a step or two further. Using R for bioinformatics takes a certain skill level, and I have spent weeks of my project learning R in depth to be able to write R functions for my thesis. Specifically, I have been applying various ordination methods (such as Principal Components Analysis, Correspondence Analysis, Redundancy Analysis and more) and other multivariate statistics to activated sludge samples to identify patterns between and within Danish wastewater treatment plants – more on that will follow in subsequent blog posts. However, after I spent all that time learning R, I thought:

Does everyone need to spend so much time learning how to do complex bioinformatics just to do simple data analysis?

Interactive data analysis through Shiny

If you want to do reproducible research, then yes. There is no other way. But if you just want to do a quick analysis of amplicon data using the basic functions of ampvis, I have done all the work for you by making an interactive Shiny app for ampvis. A Shiny app is basically a collection of HTML widgets custom-made to communicate with an active R session, so anything that can be done in R can be done with a Shiny app. This means that you can now use ampvis with mouse clicks only! No R experience required.
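
To illustrate the idea, here is a minimal, generic Shiny sketch (not the actual Amplicon Visualiser code); the ui object defines the HTML widgets and the server function holds the R code that reacts to them:

library(shiny)

ui <- fluidPage(
  titlePanel("Minimal example"),
  sidebarLayout(
    sidebarPanel(
      # A widget the user can click/drag; its value is available as input$n
      sliderInput("n", "Number of points:", min = 10, max = 500, value = 100)
    ),
    mainPanel(
      plotOutput("scatter")
    )
  )
)

server <- function(input, output) {
  # Any R code can run here; the plot re-renders whenever input$n changes
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n), xlab = "x", ylab = "y")
  })
}

shinyApp(ui = ui, server = server)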

Amplicon Visualiser in action

All you have to do is upload your data and start clicking around! If the data is suited for ampvis, it is also suited for the app (an OTU table plus metadata, if any; minimal examples can be downloaded from within the app). We recommend using UPARSE for clustering OTUs into an OTU table. You can then do basic filtering based on the metadata columns and also subset taxa if you are looking for specific microbes. The app ships with some example data from the MiDAS database, with samples taken from Danish wastewater treatment plants (WWTPs) in 2014 and 2015. As a start, you can simply click analysis and then render plot to get a simple overview of all the samples grouped by WWTP. No harm can be done, so feel free to try out every clickable element!

The app is available here: https://kasperskytte.shinyapps.io/shinyampvis/.

Of course, it is open source, so the source code is available on my GitHub. If you encounter any issues, please report them here or in the comments below. Any suggestions for improvements are of course welcome.
(Spoiler: future plans for the app include more extensive ordination options, with various distance metrics, data transformations and ordination types.)

Traversing the fog… Soon graduating – then what??

The other day I watched my boyfriend playing the game Dark Souls III. At some point he had to step through this fog, entering a new area or a boss area or something else. You never know.

It was a boss, but I don’t remember how the fight went from that point on, because the fog got me thinking: this summer I will graduate and, for the first time since I was a kid, I will not be a student. Wow, it is scary! My boyfriend is also graduating this summer, so everyday life for our family will change. We live in a dormitory with our 1.5-year-old son, and we will have to move. But where? Closer to family? Closer to job opportunities? Somewhere in the middle? And where will that be?

The above are a small selection of the thoughts that crossed my mind. All of them circled around one thing – can I find a job? I believe that the answer is yes. But can I do something before I graduate to help the process? Again, I believe that the answer is yes. So, as the to-do-list kind of person I am – I made a plan! Firstly, I must figure out what direction I want to go. The projects I did earlier are very different from my thesis, so if I want to continue down this path, the words I usually use to describe my professional profile must change. This is a work in progress.

Second – LinkedIn. My profile is a mess at the moment and does not accurately describe my capabilities. The next project is therefore to tailor my profile to the path I want to take. Together with my fellow Albertsen Lab students, Kasper and Peter, I attended an afternoon course on LinkedIn, which was very useful. For example, changing something as simple as your headline can make a big difference.

It still needs some work, I think, and there are several great guides out there. I will find some inspiration in one available at the university called Pejling (also notice the great interview about being a parent and a student on page 20). As before, we are giving each other feedback. This way I hope to get a clear-cut profile before I start the job hunt.

The next step will be updating my basic résumé. This brings up the eternal question: how do you write a good CV? The answer differs depending on who you ask. Again, I plan to find some inspiration in Pejling. Afterwards, I will kindly ask my supervisors for feedback to help me improve it even further. After building my basic résumé, I plan to tailor my CV for each application, including only what is relevant for the particular job.

Lastly, I plan to apply for jobs before I graduate. It would be awesome to have the security of a job before I graduate, and the only way to achieve that is to start before I’m done. Furthermore, I think it demonstrates a capability to juggle several tasks at once. Finally, I plan to ask for feedback if/when some of my applications are turned down.

In the end, this turned out to be very personal. When this is published, people will know my intentions, even if I don’t succeed in finding a job right away. That makes me feel vulnerable. However, for me, being close to graduation, and the huge decisions that follow, has become a big part of being a master’s student.