The Roblon Prize: my first prize

Photo: Lars Horn / Baghuset

Each year, the Roblon Foundation awards the Roblon Prize of 100,000 DKK (~16,500 USD) to a master's thesis from Aalborg University judged to fulfill the following:

“The thesis should be of highest quality and innovative. It should have a positive effect on the research field as well as the future research and career of the recipient, preferably in the North Jutland region.”

This year the recipient was me (!) for my thesis entitled “A differentiation dependent classification of diffuse large B-cell lymphomas by the NanoString technology”, which I submitted this summer. In short, my thesis concerned the development of a gene expression assay on the NanoString platform [1] that can separate diffuse large B-cell lymphoma (DLBCL) into the so-called B-cell-associated gene signature (BAGS) subtypes, which differ in survival and resistance to chemotherapy [2]. The NanoString technology is highly accurate and cheap compared to previous methods, and it is needed to facilitate the use of BAGS classification in clinical applications, both as a novel routine tool for diagnostic and prognostic assessment and as a window into intrinsic cancer biology.

Building the assay was a three-step process involving a training and a validation dataset. First, I searched the training data bioinformatically for a set of genes with good discriminative properties between BAGS subtypes. Second, I trained a model using these genes to accurately predict BAGS subtypes in the training data. Third, the set of genes and the associated model (i.e. the assay) was applied to an independent validation cohort to test its predictive properties. Our preliminary results look convincing, and we are currently testing the clinical and biological differences between BAGS subtypes in an ongoing study of German DLBCL patients [3].

My most sincere thanks to the Roblon Foundation for taking the time to evaluate my work and finding it worthy of the Roblon Prize. Great thanks also to my thesis supervisors Martin Bøgsted and Mads Albertsen for nominating me in the first place. As a young aspiring scientist, such acknowledgement and support really boosts one’s confidence! I feel more inspired than ever to dive head-first into the unforgiving, manic-depressive, wonderful world of science.

Read more about the prize and the university's annual celebration here.

References

  1. G. K. Geiss et al., “Direct multiplexed measurement of gene expression with color-coded probe pairs,” Nat. Biotechnol., vol. 26, no. 3, pp. 317–325, 2008.
  2. K. Dybkær et al., “Diffuse large B-cell lymphoma classification system that associates normal B-cell subset phenotypes with prognosis,” J. Clin. Oncol., vol. 33, no. 12, pp. 1379–1388, 2015.
  3. T. Y. Michaelsen et al., “B-cell differentiation dependent classification of diffuse large B-cell lymphoma by the NanoString technology,” Blood, vol. 130, no. Suppl 1, p. 3995, 2017.

ampvis2: The bread and butter of our amplicon analyses: amp_heatmap!

(This post is part of a series of ongoing posts about selected ampvis2 features)

Now that the ampvis2 R package has its own paper on bioRxiv, it is a good time to also write a small blog post about our beloved heatmap, which the Albertsen Lab and the Center for Microbial Communities use in practically every amplicon analysis.

Why not stacked bar charts?

Getting a simple overview of the microbial community composition of environmental samples can be done in different ways, but when visualisation of both composition and abundance is desired, very few options exist. For visualisation at higher taxonomic levels (mostly phylum level), a common way of achieving this is a stacked bar chart, which can show the relative abundances of all taxa in each sample (example) for comparison. Sadly, stacked bar charts have somehow found their way into microbial ecology even though they are best suited for much simpler data and for illustrating gradual changes between values, not the large compositional differences seen in microbial community data. One reason is that there are only about a dozen colors that are easily distinguishable by the human eye; that is not nearly enough to cover even a fraction of the diversity in microbial community data. Another reason is that it can be very difficult, if not impossible, to read the exact value of each constituent bar (the abundance of a taxon in a sample), as it is shown as an interval: you first have to judge the lower and upper limits, calculate the difference, and only then do you get the value, and even then it is only approximate. Have a look at this example plot showing the species-level composition of a range of samples obtained by 16S amplicon sequencing:

 

This just makes you reconsider why you would visualise anything at all. If the purpose of graphical visualisation is not to simplify and communicate an important message that would otherwise require pages of words to convey, then there is no reason to do it. For example, there are probably 7-8 different shades of green in the plot. Which one is E. coli? I can’t tell, but maybe you can. In any case, there should not be any doubt. When it comes to reading the relative abundances of the taxa in the samples, “around X %” is fine for answering most research questions (there are biases anyway), but when you can’t tell the taxa apart, the plot becomes almost useless.

Rethinking the heatmap

Enough hate on stacked bar charts, but these are some of the main reasons why we had to come up with something different that compromises neither precision nor simplicity, while still being pleasant to the eye. The result was amp_heatmap, which has several features not found in other software tools that can generate heatmaps; these will be demonstrated in the following. The example data used below comes from the activated sludge of two wastewater treatment plants (WWTPs) in Aalborg and is included in the ampvis2 R package.
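To follow along, the plots below assume an ampvis2 object called data. Here is a minimal sketch of how to get there; the name of the bundled example dataset and the amp_load() arguments are written from memory of the ampvis2 documentation, so treat them as assumptions and check ?amp_load:

# Load ampvis2 and the bundled example data (assuming it is exposed as AalborgWWTPs)
library(ampvis2)
data("AalborgWWTPs")
data <- AalborgWWTPs

# For your own data, combine an OTU table and sample metadata with amp_load(), e.g.
# data <- amp_load(otutable = myotutable, metadata = mymetadata)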

A heatmap is basically a grid of colored tiles with samples on the x-axis and taxa on the y-axis, where the abundance (relative or absolute) of each taxon in each sample is indicated by coloring the tiles with a gradient of usually 2 or 3 colors. The colors are used to identify patterns between the samples, and the exact abundance values can be written on top of each tile. Here is an overview of the 10 most abundant genera across all 16 samples in the dataset and the corresponding R code to generate the plot. By default the abundances are shown relative to the total in each sample:

amp_heatmap(
  data = data,
  tax_aggregate = "Genus"
)

 

Simple, right? We have all the information we need about composition and abundance in one plot. All OTUs are aggregated to a taxonomic level of choice, and the order of the taxa on the y-axis is automatically sorted by the cumulative abundance of each taxon across all samples. Where amp_heatmap particularly stands out is in the way it can incorporate sample metadata to dynamically group samples based on one or more metadata variables. These variables can contain any relevant information about the samples (for example when and where the samples were taken, or physical parameters like pH and temperature), and amp_heatmap then calculates the mean (or median, minimum, or maximum) abundance of each taxon in each group of samples as defined by the variable. For example, we can show the yearly average for each WWTP by setting the group_by argument (figure A below). A different way to show the same information is to facet on one of the variables instead by setting the facet_by argument (figure B below).

#plot_A:

amp_heatmap(
  data = data,
  tax_aggregate = "Genus",
  group_by = c("Plant", "Year")

#plot_B:
 
amp_heatmap(
  data = data,
  tax_aggregate = "Genus",
  group_by = "Plant",
  facet_by = "Year")

 

We can come a long way with just the group_by and facet_by features, but we may also be interested in looking at a different taxonomic level (by setting tax_aggregate), for example family level, adding more taxonomic information (tax_add), or showing more of the most abundant taxa (tax_show). Furthermore, we can hide the values (plot_values = FALSE) if we are more interested in patterns between the groups than in the exact values, or even adjust the color scale (plot_colorscale) and the colors themselves (color_vector):

amp_heatmap(
  data = data,
  group_by = "Plant",
  facet_by = "Period",
  tax_show = 20,
  tax_aggregate = "Family",
  tax_add = "Phylum",
  plot_values = FALSE,
  plot_colorscale = "sqrt",
  color_vector = c("royalblue4",
                   "whitesmoke",
                   "darkred")
)

 

The most abundant taxa in a single sample can be compared to all other samples by using the sort_by argument (in this case the sample “16SAMP-3941”). This is also useful for checking control samples. As mentioned, the read counts are normalised by default, but this can of course be disabled by setting normalise = FALSE:

amp_heatmap(
  data = data,
  normalise = FALSE,
  tax_aggregate = "Genus",
  sort_by = "16SAMP-3941"
)

 

Last but not least, we primarily analyse samples from wastewater treatment systems, so we have made a nice way to show known biological functions of genus-level taxa next to the heatmap, according to the MiDAS field guide database. Furthermore, when using group_by to group samples, the x-axis is ordered alphabetically, so in the case of seasonal periods it makes more sense to order it chronologically (by setting order_x_by):

amp_heatmap(
  data = data,
  group_by = "Period",
  tax_aggregate = "Genus",
  tax_show = 20,
  plot_functions = TRUE,
  functions = c("FIL", 
                "AOB",
                "NOB",
                "PAO",
                "GAO"),
  order_x_by = c("Spring",
                 "Summer",
                 "Fall",
                 "Winter"),
  rel_widths = c(0.6, 0.4)
)

Conclusion

I hope this will convince you to try out amp_heatmap from the ampvis2 R package. Only a selection of features has been demonstrated here; there are even more, so have a look at the amp_heatmap online documentation. We have been using it for quite a few years and it has stood the test of time in many of our analyses, papers and posters. It gives a nice, informative overview of any microbial community data obtained by amplicon sequencing, while still being eye candy. In fact, it can even be used for transcriptomics as well.

 

What about pie-charts? I love pie-charts!

We certainly don’t, and definitely not for microbial community data. Let Max Roser explain why:

ampvis2: A guide to ordination and how to use amp_ordinate in R

(This post is part of a series of ongoing posts about selected ampvis2 features)

A common way for microbial ecologists to compare the microbial communities of different samples is to use ordination methods. These methods seek to highlight differences between samples based on their microbial community composition. By plotting samples as points in an x/y-plane, the ecological differences between samples can then be interpreted simply from the distance between the points (see figure 1). The term “ordination” in fact stems from the German word “Ordnung”, meaning “order”, so ordination is essentially a way of ordering samples based on similarity.

Figure 1: Principal Components Analysis (Hellinger transformation applied) of 8 samples from the activated sludge of two Danish wastewater treatment plants, Aalborg East and Aalborg West. Obtained by 16S rRNA amplicon sequencing on the Illumina platform. The sample points are shaped by the seasonal period in which the samples were taken in 2012.

However, there are numerous types of ordination methods and it is not clear to many researchers which method to use when, and why. This is not surprising, as some rely on highly complex mathematics, but there are a few key differences that are important to know when comparing the microbial communities using ordination.

Assumptions and terminology

An underlying assumption in ordination analysis is that the microbial community composition at a given site is highly dependent on the surrounding environmental and ecological conditions. The microbes present at a given site are therefore termed the response variables, as they depend on other factors (the surrounding environment). It follows that the microorganisms present at a particular site (which for a researcher means: in a single sample) are likely to also be present at different sites, as long as the environmental and ecological conditions are highly similar. When the environmental conditions differ, including physical and chemical conditions as well as the presence of other species, so does the overall microbial community composition. These differences in environmental conditions over time or space are termed environmental gradients, and each species is expected to become gradually less abundant the more the conditions deviate from the optimal set of conditions for that species to thrive (niche theory). Ordination methods mainly differ in how they assume the abundances of the response variables (i.e. the microbial communities) to be distributed along the environmental gradient, commonly either unimodally or linearly.

Figure 2: The main structure of two-dimensional data can often be described in one dimension by relative positions on a straight line. Adapted from Pearson (1901).

Distances, differences, and dimensionality reduction

When ordination methods are used to visualise the differences between samples with respect to their microbial community composition, differences and distances are one and the same thing. The distances are defined by a mathematical formula that represents the differences between samples as a distance or (dis)similarity coefficient, which is used to calculate a symmetrical distance matrix containing a coefficient for each pair of samples. This multidimensional matrix is anything but user-friendly to look at and difficult to visualise, and this is where ordination really gets to shine. The matrix is then subjected to, let’s keep it simple, dimensionality reduction, where the matrix is in a sense “compressed” and only the most significant differences are extracted and expressed on new, synthetic axes, usually plotted in a normal x/y scatter plot. The principle is similar to that of linear regression, where drawing a straight line through points in two dimensions can represent the overall structure of the data in only one dimension (see figure 2), except that here there are as many dimensions as there are samples. Yes, hard to imagine.
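As a minimal, hedged sketch of these two steps (the distance matrix and its dimensionality reduction) outside of ampvis2, here is how it could be done with the vegan package in R, assuming a hypothetical abundance matrix otu with samples as rows and OTUs as columns:

library(vegan)

# One Bray-Curtis (dis)similarity coefficient per pair of samples
bray <- vegdist(otu, method = "bray")

# Dimensionality reduction: Principal Coordinates Analysis on the distance matrix
pcoa <- cmdscale(bray, k = 2, eig = TRUE)

# Fraction of the total inertia captured by the two plotted axes
round(pcoa$eig[1:2] / sum(pcoa$eig), 2)
plot(pcoa$points, xlab = "PCo 1", ylab = "PCo 2")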

Six common ordination methods used within microbial ecology

In general, there are six methods that are particularly relevant for microbial ecology, see the table below. There are several ways to categorise them; below they are categorised by whether the distance measure is implicit or explicit, that is, whether the ordination method applies its own built-in distance measure or whether the user has to calculate a distance matrix with a suitable distance measure before performing the ordination.

| | Unconstrained (exploratory) | Constrained (explanatory) |
|---|---|---|
| Implicit distance measure | Principal Components Analysis (PCA) | Redundancy Analysis (RDA) |
| | Correspondence Analysis (CA) | Canonical Correspondence Analysis (CCA) |
| Explicit distance measure | Principal Coordinates Analysis (PCoA) | |
| | non-metric Multidimensional Scaling (nMDS) | |

Principal Components Analysis (PCA) vs. Correspondence Analysis (CA)

The distance measure incorporated in Principal Components Analysis (PCA) and Redundancy Analysis (RDA) is the Euclidean distance, which is calculated by the Pythagorean theorem. Correspondence Analysis (CA) (a.k.a. weighted or reciprocal averaging) and Canonical Correspondence Analysis (CCA) are instead based on the Pearson chi-squared measure.

A common way of distinguishing PCA+RDA from CA+CCA is that the former two are best applied to data showing a linear distribution of the species abundances along the environmental gradient, while the latter two are best applied to a unimodal distribution.

This means that in theory, CA and CCA should be the most suitable for ecological data, because the chi-squared measure better deals with the fact that some species can be present in some samples while absent in others (double zeros), which is common in ecological data and fits niche theory well. PCA (and its constrained version RDA) is widely used, perhaps due to its simplicity (it just highlights numerical variance), but within ecology it underestimates ecological differences because of these double zeros, and the distances are dominated by the most abundant species. This is by some considered a flaw of PCA when used within ecology, but with an appropriate data transformation, for example the Hellinger transformation (see Legendre & Gallagher 2001), PCA can prove quite useful for answering questions involving only the most abundant species and their numerical differences between samples. CA (and its constrained version CCA), on the other hand, is very sensitive to the less abundant and often unique species in the samples, which are numerous in microbial communities. Again, a data transformation such as the Hellinger transformation can cope with this problem. Then CA becomes a powerful tool to tell which species are characteristic of, or correspond to, each sample or group of samples, as we will see in the examples below.
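As a hedged sketch of what this looks like when done manually, the Hellinger transformation and the corresponding PCA/CA can be run directly with vegan (otu again being a hypothetical samples-by-OTUs abundance matrix):

library(vegan)

otu_hel <- decostand(otu, method = "hellinger")  # Hellinger transformation

pca <- rda(otu_hel)  # rda() without constraints performs a PCA
ca  <- cca(otu_hel)  # cca() without constraints performs a CA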

One can think of it as CA (and CCA) being a more qualitative method, while PCA (and RDA) is a more quantitative method; they can thus supplement each other nicely and reveal different aspects of the data.

Principal Coordinates Analysis (PCoA) and non-metric Multidimensional Scaling (nMDS)

The advantage of Principal Coordinates Analysis (PCoA, a.k.a. metric Multidimensional Scaling or mMDS) and non-metric Multidimensional Scaling (nMDS) is that the user is free to choose the distance measure explicitly. There are well over 50 different measures to choose from, but luckily only a handful are relevant for most analyses (see GUSTA ME). Common measures include the Bray-Curtis dissimilarity index, which has been widely used for decades, Pearson chi-squared, Jaccard, Chord, and lately UniFrac, which incorporates phylogeny. PCoA is very similar to PCA, RDA, CA, and CCA in that they are all based on eigenanalysis: each of the resulting axes is an eigenvector associated with an eigenvalue, and all axes are orthogonal to each other. This means that each axis reveals unique information about the inertia in the data, and exactly how much is indicated by the eigenvalue. When plotting the ordination result in an x/y scatterplot, the axis with the largest eigenvalue is plotted as the first axis and the one with the second largest as the second axis. How much each axis contributes to the total inertia in the data is often indicated on the axis labels as the fraction between the eigenvalue of that particular axis and the sum of all eigenvalues. Often the remaining axes beyond the two plotted can reveal a large amount of information as well, but the best-case scenario is that most of the inertia is expressed by the first two. A simple bar plot of all the axes and their eigenvalues, called a scree plot, can be used to get an overview of each axis’ contribution to the total inertia and confirm that the axes plotted are the most significant ones.
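Continuing the small PCoA sketch from earlier (the pcoa object is the hypothetical result of cmdscale above), a scree plot is just a bar plot of the positive eigenvalues relative to their sum:

# Scree plot: each axis' fraction of the total inertia
eig <- pcoa$eig[pcoa$eig > 0]
barplot(eig / sum(eig),
        names.arg = seq_along(eig),
        xlab = "Axis", ylab = "Fraction of total inertia")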

In contrast to the other five methods, non-metric Multidimensional Scaling (nMDS) is quite different. It is not an eigenanalysis-based method but a rank-based approach, which means that the magnitudes of the differences are lost (hence the term non-metric) and only the ranked relationships between the samples are preserved.

This makes nMDS particularly robust and independent of any underlying environmental gradient.

It can even deal with missing values and different types of data at the same time. As opposed to the eigenanalysis-based methods, nMDS is an iterative procedure, and the resulting ordination does not have a single solution. Instead of eigenvalues, nMDS uses a stress value to indicate the “success” of the ordination. The number of axes obtained by nMDS is chosen by the user a priori, and a number of iterations (20 is usually enough) are run to best fit the data to the chosen number of axes. Each solution is used as the starting point for the next iteration, and the stress value is recalculated after each iteration until it reaches a stable value. A stress value of 0.20 or more is considered unacceptable, 0.1 is acceptable, and 0.05 or below is great. A disadvantage of PCoA and nMDS is that it is only possible to show differences between samples/sites; plotting species scores together with site scores in a biplot is not possible.
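A minimal sketch of an nMDS with vegan, showing where the stress value comes from (again assuming a hypothetical otu matrix with samples as rows):

library(vegan)

# Two axes chosen a priori, Bray-Curtis distances, up to 20 random starts
mds <- metaMDS(otu, distance = "bray", k = 2, trymax = 20)

mds$stress  # > 0.20 unacceptable, ~0.1 acceptable, <= 0.05 great
plot(mds, display = "sites")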

How to perform ordination with ampvis2 in R

Now that you know the basics of the different ordination methods relevant for microbial ecology, let’s demonstrate how to create publication-ready ordination plots with the ampvis2 package in R and discuss the results of different methods on the same data. The amp_ordinate function aims to simplify the process by wrapping numerous functions from other R packages (many of them from the vegan, ggplot2, and ape packages) into one large function. Performing ordination is often a multi-step workflow, and amp_ordinate performs all the required steps with a single function call, with several error checks along the way to help guide the user. All steps are fully customisable, and by default amp_ordinate does the following, in order (a rough manual equivalent using vegan directly is sketched after the list):

  1. Filters low-abundance OTUs.
  2. Performs data transformation.
  3. Calculates a distance matrix (for nMDS/PCoA).
  4. Calculates both site and species scores with the chosen ordination method.
  5. Generates the ordination plot with numerous visual layers defined by the user.
  6. Fits the correlation of environmental variables to the ordination plot.
  7. Returns a nice-looking plot or an interactive plotly plot.
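For orientation, here is a rough, hedged sketch of what steps 2, 4, 5 and 6 look like when done manually with vegan; amp_ordinate wraps all of this (and more) for you. The otu matrix and the env metadata data frame are hypothetical stand-ins:

library(vegan)

# 2) Transformation and 4) ordination (unconstrained rda() = PCA)
ord <- rda(decostand(otu, method = "hellinger"))

# 5) A basic plot of the site scores
plot(ord, display = "sites")

# 6) Fit environmental variables (e.g. pH, temperature) onto the ordination
fit <- envfit(ord, env, permutations = 999)
plot(fit, p.max = 0.05)  # overlay only the significant variables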

The ordination is performed on the read counts contained in the OTU table (which simply holds the read counts of each OTU in each sample), while the corresponding sample metadata are used for all aesthetic options, such as coloring or shaping the sample points by a group variable, as well as for environmental interpretation by constrained ordination methods or by fitting the correlation of environmental variables onto the ordination plot.

Examples

Some of our 7th semester student groups have done a great job collecting samples from a plethora of interesting environments, from nature to animal faeces, to hunt for novel bacteria. In the following examples, data obtained by 16S rRNA amplicon sequencing (on the Illumina MiSeq platform) of a handful of their samples is used. The samples are from quite different environments, including the faeces of a bearded dragon and a vicuña from Aalborg Zoo, sediment from the Limfjord, a wooden boardwalk in Lille Vildmose, and a rusting pipe from somewhere.

To keep this blog post from getting even longer than it already is, only PCA and CA will be demonstrated in the following.

Principal Components Analysis

Principal Components Analysis can be performed by simply setting type = “pca” as shown below. This alone would produce a plot with just black dots, which is not very useful, so the sample_color_by and sample_colorframe arguments are used to color the sample points based on the metadata variable “SampleContent”, which simply contains information about where the samples were taken.

amp_ordinate(
  data = data,
  type = "pca",
  transform = "hellinger",
  sample_color_by = "SampleContent",
  sample_colorframe = TRUE,
  sample_colorframe_label = "SampleContent")

This only highlights the differences between the samples/sites. To show even more information, making a biplot that also includes the species scores (with this data, species are defined as OTUs clustered at 97% similarity) can reveal which species are abundant in which samples or groups of samples, as seen below. The plot is an interactive plotly plot, so go ahead and zoom, hover, export etc 🙂

amp_ordinate(
    data = data,
    type = "pca",
    transform = "hellinger",
    sample_color_by = "SampleContent",
    sample_colorframe = TRUE,
    sample_colorframe_label = "SampleContent",
    species_plotly = TRUE)

Notice that most of the species points are positioned on a line going outwards from (0,0) towards the samples. This is very characteristic of PCA, where the relative position of the sample points along these “species lines” or “species gradients” expresses how abundant the species are in the samples.

Principal Components Analysis is therefore favored when answering questions involving only the most abundant species in the samples

For example, the 3 samples from the Lille Vildmose boardwalk must have a high abundance of OTU 1775, and the Limfjorden samples must have a high abundance of OTU 610. Low-abundance species, however, have no impact at all on the observed differences when using PCA; they will all be positioned near (0,0). The species scores are often plotted as arrows starting from (0,0) in PCA, but since microbial community data in most cases contain hundreds or even thousands of different species when sequenced with modern DNA sequencers, they are plotted as points in amp_ordinate.

Correspondence Analysis

The differences between the samples observed with Correspondence Analysis are in this case somewhat similar to those obtained with Principal Components Analysis; there is, however, a big difference when it comes to the species scores, as seen in the biplot below.

amp_ordinate(
  data = data,
  type = "ca",
  transform = "hellinger",
  sample_color_by = "SampleContent",
  sample_colorframe = TRUE,
  sample_colorframe_label = "SampleContent",
  species_plot = TRUE)

As the name suggests, CA tries to find how species “correspond” to samples. Any species positioned near a sample is therefore characteristic of that sample and is most likely only present in that sample. If a species point is positioned at the exact same position as a sample, it is only found in that sample. Conversely, species that are present in all samples will be positioned near (0,0), as they are not particularly representative of any sample or group of samples. The species abundances also play a role in the differences observed with CA, but not nearly as much as with PCA.

This makes Correspondence Analysis a more qualitative method, which is therefore favored when answering questions involving community composition and ‘what’ is present ‘where’.

To illustrate this, try zooming in closely on some of the groups of samples in the plot above. For example, the two faeces samples from the bearded dragon seem to be the only samples containing a particular set of about 50 species, since these species are positioned on a straight line between the two samples. The same is the case for the rusting pipe and the sediment samples from Limfjorden:

Remember to check the scree plot!

Judging from the Correspondence Analysis above, it may be tempting to conclude that the samples from the rusting pipe are more similar to the sediment samples from Limfjorden than to the faeces samples. Well, the scree plot below tells us that we may be wrong!

result <- amp_ordinate(data, 
  type = "ca",
  transform = "hellinger",
  detailed_output = TRUE)
result$screeplot

The first two axes show about 43% of the total inertia in the data, but the 3rd and 4th axes actually reveal almost as much information, about 39% combined. By default amp_ordinate plots the first two axes, but by using the x_axis and y_axis arguments it is possible to show any of the remaining axes. Usually, keeping the first axis and only adjusting y_axis is fine, but any combination of axes can be shown. Looking at a fancy 3D plot with the first 3 axes plotted makes it clear that the rusting pipe samples are actually quite different from all the other samples:

It is therefore important to always validate the resulting ordination plot by looking at the scree plot, and maybe even plot other axes than just the first two, before drawing any conclusions!
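As a final sketch, other axis combinations can be plotted directly with the x_axis and y_axis arguments mentioned above, for example axis 1 against axis 3:

amp_ordinate(
  data = data,
  type = "ca",
  transform = "hellinger",
  sample_color_by = "SampleContent",
  x_axis = 1,
  y_axis = 3
)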

Read more

Websites

Literature

NCM17 – Day 2: Into the unknown

Day 2 at NCM17 started, again, with Oxford Nanopore CEO Gordon Sanghera taking the stage, once again stating his dream that sequencing should be available to anybody, anytime, anywhere. There are currently around 12,500 mainframe DNA sequencers around the globe, a number he believes will be surpassed by ONT in the not too distant future. The availability of portable sequencers to the masses will change the way we sequence DNA (the sequencing singularity?), not to mention all the ethics involved in sequencing the genomes of everyone, a topic that was also discussed in a panel discussion. It will undoubtedly improve our understanding of the diversity on our planet.

Arwyn Edwards employs Gordon’s motto in his work as he, in his own words, sequences in strange places. One example is a field trip to an old coal mine in Wales, where risks were high due to methane gas. Although all equipment was plugged in above ground to avoid sparks in the explosive environment, he still had a battery-powered centrifuge blow up in a blue flame. Nevertheless, the team managed to get some reads out of the run, even though the input material was a mere 3×6 ng of DNA! Other field trips have included Greenland, to investigate microbial darkening, and Svalbard, to monitor microbial populations in the melting Arctic ice. He has also been testing the new lyophilised reagents and found the results coherent with other kits and with Illumina sequencing. It is important to note, however, that all methods have biases, all databases are incomplete and all extraction methods have problems. We need to address these issues before we can go out and sequence metagenomes in the field.

Continuing this subject was Mads Albertsen, my PI, who had a clear message about what we need: more genomes! There are some problems in recovering genomes from metagenomes though: microdiversity in samples and separation of the genomes (binning). His approach is to use differential coverage, so binning can be performed by abundance. This can sometimes be done with Illumina reads only, but hybrid assemblies with long Nanopore reads are simplifying it. Mads believes it won’t be long before we can beat evolution in speed and sequence everything! If you want to do metagenomics with differential coverage binning, check out our mmgenome R package.

It has been a great community meeting (my first!) with plenty of activity in the demonstration zone and flow cell loading workshop.

NCM17 – Day 1: New tech and exciting applications

All the gear

We started out with blazing punk rock (things are never too boring at Oxford Nanopore meetings) and plenty of promises from Oxford Nanopore CEO Gordon Sanghera. He gave us updates on various products, such as the PromethION flow cells, which have been harder to produce than expected. However, we and others have now sequenced data on the PromethION! Granted, not the 50 Gbp that Oxford Nanopore claims to be able to achieve in-house, but we are starting to be much more optimistic than we have ever been! The VolTRAX should be working now, and a second version was announced! The second version will have a built-in fluorescent detector for quantification, and it will be possible to load the prepped sample directly onto a flow cell. It will be released in the first half of 2018.

VolTRAX v. 2

The much-anticipated 128-channel Flongle will be available in the second half of 2018. We are looking forward to fast and cheap sequencing in the field and in the laboratory – although a MinION flow cell is relatively cheap (approx. 500 USD), it is still way too expensive for regular use in many applications where we do not need Gbps of data. Gordon Sanghera highlighted that they are currently also moving the Flongle into the diagnostics field. The sky seems to be the limit for small, cheap and generic diagnostic kits.

Oxford Nanopore has realised that data yield on flow cells depends largely on extraction and library preparation techniques and has started focusing more on this. Automation of everything from extraction to library prep would be ideal, but no updates were given on Zumbador other than the fact that a lot of work remains to be done. Along these lines, lyophilised reagents have been sent to a selected few for initial trials; removing the cold chain will be a large step towards enabling sequencing anywhere. Furthermore, the new mobile FPGA basecaller was shown, and it will be up to the community to name the unit in a competition – use the #NanoporeAnyone tag! The SmidgION and the mobile basecaller were displayed in the demonstration area and look very fancy:

The talks

Nick Loman was up first and gave us an outlook on what we will be able to achieve with sequencing in the future. Based on the sequencing of Ebola in West Africa, he showed a video of how the disease had spread and infected different regions. By using real-time genome sequencing we will be able to track diseases as they develop, and although outbreaks will happen, we should be able to avoid epidemics.

This sequencing singularity, as he termed it, is however dependent on sequencing genomes instead of genes, as is standard today. Long reads will make this possible as they can often span repeat regions in the genomes. The long reads will also mean that much less compute power is needed. An example was given with an E. coli sample, where an assembly was produced with miniasm in 1.5 seconds, producing a single contig from 8 reads (now that Nick and his team have lost the world record for the longest sequenced read, they still claim the world record for assembling E. coli with the fewest number of reads).

Sergey Koren continued on a related subject, talking about how we may finally be able to close the human genome. He showed very convincing results on integrating Hi-C with long-read Nanopore data; see their new tool SALSA and the related paper here. Combining Nanopore and Illumina reads in hybrid assemblies is the preferred way to get both the desired accuracy and read length, which was also the theme of Ryan Wick’s talk. He brought up some shortcomings of assemblers and stated that all genomes are somewhat metagenomes because of variation. What should assemblers do when encountering these differences? There is a need for more variation-aware assemblers and more graph-friendly tools.

The diversity of life on Earth has always been tricky to study. Steven Salzberg addressed this in his talk about closing large genomes. Even with long reads you need bioinformagic to assemble the wheat genome, which is 16 Gbp long and full of repetitive transposon regions. The fact that it has taken 8-9 months of sequencing, 1 trillion bases sequenced on Illumina and 545 billion bases sequenced on PacBio is a good indicator of how difficult it still is to close these genomes. And it is still not closed, but a lot closer than before!

The evening speaker, Aaron Pomerantz, gave us a great presentation about his trip to the Ecuadorian rainforest with a mobile lab. The development of portable sequencing has the potential to give us deeper insights into the diversity of remote areas. Often, samples have to be shipped halfway around the globe to be sequenced, adding further expense and time to the workflow.

Nanopore Community Meeting 2017: Prologue

Oxford Nanopore is hosting their annual community meeting in New York City this week. I’ll be there, thanks to finishing in the top 5 of their “one more flowcell” competition – thanks to all who voted!

Although not as “big” as the London Calling conference they also host, the line-up of speakers is impressive and promises talks on everything from megabase reads and gigabase yields to the newest assembly algorithms, field applications and maybe even some product updates? I’m excited! In this blog post, I will give an overview of what I expect to hear at the meeting.

The first plenary speaker is Nick Loman, who, until recently, had the boasting rights to the longest read on the MinION. The record is now with Kinghorn Genomics, but it still sits below a megabase.

Although megabase reads may not be practical or strictly necessary for most applications, the importance of read length is bound to be brought up by Loman, who will also speak about the future of sequencing – the sequencing singularity, as he terms it. I am also looking forward to hearing Ryan Wick speak, as he will undoubtedly tell us about his work on Unicycler, which we recently described as our preferred hybrid genome assembler. Wick will give us an overview of what to improve in order to obtain perfect assemblies (in general you should check out his awesome analyses and tools hosted on GitHub, see e.g. his comparison of current basecalling algorithms for Oxford Nanopore data).

In general, the algorithm developers are very well represented, with both Sergey Koren, developer of Canu, who will give an update on assembling the human genome with ultra-long reads, and Steven Salzberg, who will present an update on their hybrid assembly method for large genomes (MaSuRCA), which is bound to be interesting (you should also check out his blog).

The current record holder in data yield, Baptiste Mayjonade, has a talk on day 1. It will be interesting to hear updates on yields from both researchers and Oxford Nanopore, as a lot of us are still struggling to reach 10 Gbp. Perhaps with the right methods we could all compete with Baptiste’s 15.7 Gbp on a MinION flow cell? We can also hope for some updates on PromethION flow cell yields, although these are still under heavy development (we have been running our own PromethION and are now waiting for the next flow cell shipment – a blog post should be in the works…).

The last speaker of day 1 is Aaron Pomerantz, who has been sequencing barcoded DNA in a rainforest in Ecuador. There is also Arwyn Edwards on day 2, who has been sequencing in extreme environments. As I am working with on-site DNA sequencing myself, I am very interested in hearing more about these projects. The general trend lately seems to be increased portability, which will undoubtedly spark a wide array of exciting new projects and give insights into the ecology of the Earth.

Another session of interest to me is the “metagenomics, microbiomes and microbiology” session on day 2, where my PI, Mads Albertsen, is giving a talk on metagenomics in the long-read era. I am currently moving into the metagenomics field and hope to gain some insights from this session.

With some Oxford Nanopore talks scattered throughout the two days, I hope to get some updates on newly released tech. I’m particularly interested in the “Flongle” (the small and dirt-cheap flow cell), which I have heard might be ready soon. Recently, we have seen tweets about the first version of the all-in-one Zumbador, new lyophilised reagents being tested and the first version of the chemistry for the VolTRAX sample prep device.

I am sure I will be wiser by the end of next week. I will be posting some daily digests here, so stay tuned. If you are going to NCM17 yourself, please come talk to me by poster 18 on Thursday 10.50-11.15 AM and 3.25-3.50 PM, and I can give you an overview of how we recently sequenced activated sludge samples on-site at a wastewater treatment plant (and on the way home).

What is a good genome assembly?

With short reads you will often get fragmented assemblies with high single-base accuracy, while with long, error-prone reads you can get a single-contig assembly from just a few ultra-long reads, but with a lot of errors due to the lower read quality. Some of these errors can be fixed by adding more coverage and polishing the assembly. Another approach is to use both technologies in combination, in which case one can fairly easily produce a single-contig assembly. A good assembly should be in as many pieces as the original genetic elements it represents (one contig – one chromosome), but to allow gene calling and genome alignments, single-base accuracy is also essential. There are many genome assemblers, polishing tools etc. that will help you make a genome assembly, but how do you know if you have a good assembly? To test this, people have developed tools to calculate the Average Nucleotide Identity (ANI), estimate genome completeness, and assess the quality of genome assemblies compared to a reference. However, it may also be useful to use annotation tools to assess whether genes can be called correctly. The genome assembly, polishing, and annotation strategy is an ongoing discussion in the scientific community on Twitter.

To decide which strategy should be our “preferred” genome assembly approach based on data rather than my gut feeling about the “best assembly”, I decided to do some testing with a known “true” reference: E. coli K-12 MG1655 (U00096.2).

Objective: develop a strategy to compare genome assemblies

Strategy: use E. coli K-12 MG1655 reference strain data to produce a set of assemblies and evaluate them with different assembly metrics in comparison to the “true” reference

Data:

| Data | Mbp | X coverage | Resource |
|---|---|---|---|
| Nanopore data | 183 | 40 | In-house E. coli K-12 MG1655 grown and prepped for 1D R9.4 with RAD002 (a subset of our total run, to get a more realistic view than with 7+ Gbp of data for a bacterial genome) |
| Illumina data | 270 | 58 | ENA: SRR2627175 (I used a subset of 1,095,594 reads) |

Assemblies:

| Assembly | Read type | Contigs | Size (bp) | Relative size | ANI (%) | CheckM completeness | Prokka # CDS | Prokka # rRNA | Prokka # tRNA | Median CDS size | QUAST NGA50 | QUAST mismatches per 100 kbp | QUAST indels per 100 kbp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SPAdes | Short | 84 | 4546220 | 0.98 | 100 | 99.91 | 4235 | 11 | 87 | 810 | 132700 | 0.42 | 0.09 |
| SPAdes hybrid | Hybrid | 1 | 4620377 | 0.996 | 100 | 99.91 | 4282 | 22 | 88 | 815 | 2828879 | 5.91 | 0.43 |
| Miniasm | Long | 1 | 4410101 | 0.951 | 83.95 | 0.00 | 11241 | 6 | 15 | 206 | NA | NA | NA |
| Miniasm +1xRacon | Long | 1 | 4619818 | 0.996 | 99.02 | 70.87 | 10192 | 22 | 80 | 287 | 4618042 | 239.94 | 541.42 |
| Miniasm +2xRacon | Long | 1 | 4622021 | 0.996 | 99.23 | 74.97 | 9626 | 22 | 82 | 308 | 4620219 | 222.51 | 449.98 |
| Miniasm +2xRacon +Pilon | Hybrid | 1 | 4622021 | 0.996 | 99.96 | 98.51 | 4509 | 22 | 88 | 761 | 3590212 | 10.35 | 19.19 |
| Canu | Long | 1 | 4533574 | 0.977 | 98.69 | 50.00 | 10478 | 21 | 73 | 263 | 4531613 | 113.53 | 662.93 |
| Canu +Nanopolish | Long | 1 | 4563152 | 0.984 | 99.27 | 71.79 | 9664 | 22 | 84 | 296 | 4561292 | 146.74 | 468.02 |
| Canu +Nanopolish +Pilon | Hybrid | 1 | 4567411 | 0.984 | 99.97 | 98.41 | 4415 | 22 | 87 | 770 | 1329524 | 6.17 | 15.79 |
| Unicycler | Hybrid | 1 | 4633976 | 0.999 | 99.99 | 99.91 | 4305 | 22 | 88 | 815 | 2471807 | 4.28 | 2.25 |
| Reference | | 1 | 4639675 | 1 | 100 | 99.97 | 4300 | 22 | 88 | 818 | 4639675 | 0 | 0 |

Conclusion:

The long-read-only assemblies are nicely contiguous, with approximately the same length as the reference. However, the genes are broken due to indel errors present in the reads, as seen in the high number of CDS detected and the low median CDS size. Polishing the assemblies with long reads improves the ANI value a lot, but the genes are still fragmented. Polishing with short reads is needed to get the number of CDS and the median CDS size into the range of the reference. The short-read-only assembly has a high sequence identity to the reference but is fragmented and cannot recreate the repeat structure of the genome. The number of CDS is lower than that of the reference, and the rRNA genes, which are known to be very similar if not identical, are messed up and co-assembled. The Unicycler assembly seems to deliver the best metrics across the board, until we come to the NGA50 value reported by QUAST, which was surprisingly low. I used MUMmer to visualise the alignment of the assembly against the reference genome to investigate whether a misassembly was causing this. However, the figure below shows no big inversions or anything, so I guess that for now we will stick to Unicycler assemblies, which combine the benefit of short-read accuracy with the power of long reads.

Dot plot of the Unicycler assembly aligned to the reference genome

Methods:

I have used the following assemblers: SPAdes, miniasm, Canu, and Unicycler (see the assemblies table above)

I have used the following mappers

I have used the following polishing tools: Racon, Nanopolish, and Pilon

I have used the following tools to assess genome assembly characteristics: ANI calculation, CheckM, Prokka, QUAST, and MUMmer

If you have any ideas or superior tools we have missed, please let us know in the comments.

Populating the tree-of-life

 

Hi everybody, and welcome to my first blog post at Albertsen Lab. As a newly started PhD student, I have engaged myself with the simple, yet Herculean, task of populating the tree-of-life. As most people are aware, microorganisms are more or less inescapably present in all places of the world; no matter how hostile an environment you encounter, it will probably accommodate some sort of small living organisms. As it was once elegantly put: “If you don’t like bacteria, you’re on the wrong planet”. Or in this case, if you don’t like bacteria, you’re in the wrong blog post. Recently, a study published by Kenneth J. Locey & Jay T. Lennon (2016) tried to illuminate just how omnipresent microorganisms are by applying scaling laws to predict the global microbial diversity. Main conclusion: our planet hosts up to 10^12 microbial species. That is 1 trillion species! Of course, one trillion microbial species is still a number that is dwarfed by the approximately 10^30 bacterial and archaeal cells living on Earth. Since incredibly large numbers tend to be hard to grasp, the article kindly supplied some illustrative examples. For example, the ~10^30 bacterial and archaeal cells exceed the 22 orders of magnitude that separate the mass of a Prochlorococcus cell from that of a blue whale. It also exceeds the 26 orders of magnitude that result from measuring Earth’s surface area at a spatial grain equivalent to the size of a bacterium.

From the perspective of my research, it is naturally only the so-called microbial dark matter (microbes not yet discovered) that is of interest. Fortunately, only an infinitesimal fraction of microorganisms have been discovered to date. For some reason, it is currently a prerequisite to have a bacterium growing in pure culture before naming it. Alas, if you take a quick look at DSMZ’s homepage (one of the largest bioresource centres worldwide, located in Germany), you will find that they boast a collection of around 31,000 cultures representing some 10,000 species and 2,000 genera in total. Only a tiny bit short of one trillion. On a side note, I guess DSMZ will eventually face some kind of capacity-related problem if we insist on requiring each new species as a pure culture before an ‘official’ name can be granted. Luckily, nowadays a bacterial species can also be catalogued by its DNA sequence. Currently, one of the most widespread methods for identifying bacteria is 16S rRNA amplicon sequencing. Large databases such as SILVA and the Ribosomal Database Project (RDP) use huge amounts of digital ink to keep up with the ever-increasing influx of new sequences. The current version 128 of SILVA (released in September 2016) includes over 6,300,000 SSU/LSU sequences, whereas RDP Release 11 contains 3,356,809 16S rRNA sequences (also released in September 2016). Although this is definitely a lot more than 10,000 pure cultures, it is still ridiculously less than the total estimated microbial diversity of the Earth. Hence, as I begin my PhD study, the estimates from Kenneth J. Locey & Jay T. Lennon suggest that potentially 99.999% of all microbial taxa remain undiscovered!

This may sound like my research is going to be a cakewalk. After all, I should be able to go to the lab and find novel microorganisms literally by accident. However, I have decided to drastically confine my explorations of microbial dark matter to a specific environment, although traveling around the world chasing novel bacteria would be pretty cool. So far, my primary focus has been sampling of microbial biomass from drinking water. You may think looking for novelty in a resource where one of the main goals is to keep living things out is a bit weird, but there is more than one good reason for this choice. 1) I worked with drinking water during my master’s thesis, so I already have experience with sampling and extraction procedures as well as a few connections with people working in this field. 2) Recently, articles such as Antonia Bruno et al. 2017, Karthik Anantharaman et al. 2016 and Birgit Luef et al. 2015 have illustrated that drinking water may potentially contain a large portion of microbial novelty. 3) Unless you are living in a third world country, gaining access to water is not very complicated. But just to clarify, do not mistake “easy access” for “easy sampling”, as sampling microorganisms from drinking water can be a king-sized pain in the ass.

A drastically under-represented visualization of the microbial diversity in drinking water.

As you might have guessed, I am not going to dedicate much time to trying to make bacteria living in large water reservoirs deep below the surface grow on plates in the lab. For identification, I am instead going to one-up the conventional 16S rRNA amplicon sequencing method. A new technique developed by some of the people in the Albertsen Lab has shown very promising results, generating full-length 16S rRNA gene sequences. Hopefully, I will have the opportunity to address it in a bit more detail in a later blog post. However, the first hurdle is simply collecting sufficient amounts of biomass for subsequent extraction. As stated in the protocol, the input material is ~800 ng RNA. If you, like many of my colleagues at Albertsen Lab, have been working with wastewater, 800 ng RNA is no big deal. Getting the same amount from drinking water is a big deal. Drinking water typically has bacterial concentrations in the range of 10^3–10^5 cells/ml, which is why collecting adequate amounts of biomass is difficult. I naively started out using the same sampling setup that I used in my master’s thesis (where 1 ng/µl DNA would be plenty for further analysis). It basically consisted of a vacuum pump and a disposable funnel, and after spending too many hours pouring several liters of water into a puny 250 ml funnel, I ended up with negligible amounts of RNA. This is the point where you break every single pipette in the lab in despair and tell your supervisor that the current sampling method is not feasible. Instead, the sampling task has been partly outsourced to people with more specialized equipment. I have now worked with samples based on 100+ liters of water yielding more than 800 ng of RNA, but the workflow still needs some optimization.

Another aspect complicating the sampling step is the really obvious fact that bacteria are really small. I am perfectly aware that all bacteria can, with very good reason, be categorized as “small”; my statement refers to the aforementioned article from Birgit Luef et al. 2015, “Diverse uncultivated ultra-small bacterial cells in groundwater”. The article highlights findings that demonstrate how a wide range of bacteria can pass through a 0.2 µm pore-sized membrane filter. FYI, filtration with a 0.2 µm filter is also commonly referred to as “sterile filtration”. One could argue it is a poorly chosen term for a filtration type that apparently allows numerous types of bacteria to pass the membrane unhindered. Also, sterile filtration is by far the most used sampling method in papers concerning 16S rRNA amplicon sequencing of drinking water. During my master’s thesis I also used 0.2 µm filters for water samples; however, in the pursuit of novelty, the filter size has now been reduced to 0.1 µm. This should, as far as I am concerned, capture all microbes inhabiting drinking water (although it is hard not to imagine at least one out of a trillion species slipping through).

Hopefully, I will start to generate full-length 16S rRNA sequences from novel bacteria in the near future and maybe share some interesting findings here on albertsenlab.org.

Can you beat our Nanopore read error correction? We hope so!

Tagging of individual molecules has been used as an effective consensus error-correction strategy for Illumina data (Kivioja et al. 2011, Burke et al. 2016, Zhang et al. 2016), and the principle is similar to the circular consensus sequencing strategy used to generate consensus reads with error rates of < 1% on the PacBio (Travers et al. 2010, Schloss et al. 2016, Singer et al. 2016) and Oxford Nanopore platforms (Li et al. 2016). You simply sequence the same molecule several times and compare the reads to generate a consensus with better accuracy than the individual reads. As far as we know, a tag-based consensus error-correction strategy has not been attempted for long reads before, probably because the raw error rate complicates identification of the tags. However, we see several benefits of the tag-based strategy in the long run, which is why we decided to pursue it.

My colleague @SorenKarst tested tag-based error correction on the Nanopore MinION in connection with our work on generating primer-free, full-length 16S/18S sequences from environmental rRNA (see our bioRxiv paper: Karst et al. 2016). The main approach used in the paper is based on Illumina sequencing, inspired by Burke et al. 2016, but moving to Nanopore sequencing in the future would make the approach considerably easier. His approach was relatively “simple”: individual cDNA molecules were uniquely tagged at both ends with a 10 bp random sequence, then diluted to a few thousand molecules and amplified by PCR to generate thousands of copies of each molecule, which were prepared for 2D sequencing on the Nanopore MinION. The resulting sequence reads were binned based on the unique tags, which indicated that they originated from the same parent molecule, and a consensus was generated from each read bin. The approach was tested on a simple mock community with three reference organisms (E. coli MG1655, B. subtilis str. 168, and P. aeruginosa PAO1), which allowed us to calculate error rates.

For locating the unique tags we used cutadapt with loose settings to find the flanking adaptor sequences and extract the tag sequences. The tags were clustered and filtered based on abundance to remove false tags. As both tags and adaptors contain errors, it can be a challenge to cluster the tags correctly without merging groups that do not belong together. Afterwards, the filtered tags were used to extract and bin the sequence reads using a custom perl script ;). For each bin we used the CANU correction tool followed by USEARCH consensus calling. With this very naive approach we were able to improve the median sequence similarity from 90% to 99%.

We think this is a good start, but we are sure that someone in the nanopore community will be able to come up with a better solution and improve the error rate even further. The data is freely available, and a short description of the sequence read composition is provided below. We are looking forward to hearing your input!

Ps. If you come up with a solution that beats our “quick and dirty” one and post it here or on Twitter, I will make sure to mention you in my slides at ASM ;).

 

Data and script availability:

The nanopore reads are available as fastq at: 2D.fq or fasta: 2Dr.fa and fast5: PRJEB20906

The 16S rRNA gene reference sequences: mockrRNAall.fasta

Scripts:

Our approach in a shell script
#############################################################################
# 									    #
# Shell script for generating error corrected FL16S and getting error rates #
#									    #
# Use at your own RISK!							    #
#############################################################################

####################
#     Variables    #
####################
ID_adapt=0.1;
ID_cluster=0.8;
LINKtoCANU=/space/users/rkirke08/Desktop/canu/canu-1.3/Linux-amd64/bin;
# Update path to include poretools installation
# export PATH=$PATH:/space/users/rkirke08/.local/bin
####################
# End of variables #
####################
###############################################
# Depends on the following files and software #
###############################################
# folder with fast5 files "data/pass/"
# file with reference 16S sequences "mockrRNAall.fasta"
# perl script "F16S.cluster.split.pl"
# poretools
# cutadapt
# usearch8.1
# CANU
###############################################
# End of dependencies			      #
###############################################

# Extract fastq files
# poretools fastq --type 2D data/pass/ > data/2D.fq


# Rename headers (Some tools do not accept the long poretools headers)
#awk '{print (NR%4 == 1) ? "@" ++i : $0}' data/2D.fq | sed -n '1~4s/^@/>/p;2~4p' > 2Dr.fa

# Find adapters
cutadapt -g AAAGATGAAGAT -e $ID_adapt -O 12 -m 1300 --untrimmed-output un1.fa -o a1.fa 2Dr.fa
cutadapt -a ATGGATGAGTCT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o a1_a2.fa a1.fa

usearch8.1 -fastx_revcomp un1.fa -label_suffix _RC -fastaout un1_rc.fa
cutadapt -g AAAGATGAAGAT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o ua1.fa un1_rc.fa
cutadapt -a ATGGATGAGTCT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o ua1_a2.fa ua1.fa

cat a1_a2.fa ua1_a2.fa > c.fa

# Extract barcodes
cut -c1-12 c.fa > i1.fa
rev c.fa | cut -c1-12 | rev > i2.fa

paste i1.fa i2.fa -d "" | cut -f1-2 -d ">" > i1i2.fa


# Cluster barcodes
usearch8.1 -cluster_fast i1i2.fa -id $ID_cluster -centroids nr.fa -uc res.uc -sizeout

# Extract raw sequences
perl F16S.cluster.split.pl -c res.uc -i c.fa -m 3 -f 50 -r 40
# Count number of files in directory
find clusters -type f | wc -l

FILES=clusters/*.fa
for OTU in $FILES

do
  wc -l $OTU >> lines.txt	
done

FILES=clusters/*.fa
for OTU in $FILES

do
  OTUNO=$(echo $OTU | cut -f2 -d\/);
  # Rename header
  sed "s/>/>$OTUNO/" clusters/$OTUNO > clusters/newHeaders_$OTUNO

  # Correct reads using CANU
  $LINKtoCANU/canu -correct -p OTU_$OTUNO -d clusters/OTU_$OTUNO genomeSize=1.5k -nanopore-raw  $OTU

  # Unzip corrected reads
  gunzip clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta.gz

  sed -i "s/>/>$OTUNO/" clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta

  # Call consensus using Usearch
  usearch8.1 -cluster_fast clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta -id 0.9 -centroids clusters/OTU_$OTUNO/nr_cor_$OTUNO.fa -uc clusters/OTU_$OTUNO/res_$OTUNO.uc -sizeout -consout clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa

  sed -i "s/>/>$OTUNO/" clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa

  # Map reads to references to estimate error rate
  # Raw reads
  # Map FL16S back to references
  usearch8.1 -usearch_global clusters/newHeaders_$OTUNO -db mockrRNAall.fasta -strand both -id 0.60 -top_hit_only -maxaccepts 10 -query_cov 0.5 -userout clusters/map_raw_$OTUNO.txt -userfields query+target+id+ql+tl+alnlen

  # Usearch consensus corrected sequence
  # Map FL16S back to references
  usearch8.1 -usearch_global clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa -db mockrRNAall.fasta -strand both -id 0.60 -top_hit_only -maxaccepts 10 -query_cov 0.5 -userout clusters/map_cor_Ucons_$OTUNO.txt -userfields query+target+id+ql+tl+alnlen

  cat clusters/map_raw_$OTUNO.txt >> myfile.txt
  cat clusters/map_cor_Ucons_$OTUNO.txt >> myfile.txt

  # Collect corrected sequences
  cat clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa >> final_corrected.fa
done

Requirements:

cutadapt

USEARCH

a perl script F16S.cluster.split.pl

cDNA molecule composition

The cDNA molecules are tagged by attaching adaptors to each end of the molecule. The adaptor contains a priming site (red), the unique 10 bp tag sequence (blue) and a flanking sequence (black). Note that the design is more complicated than strictly necessary, as we simply modified it from our approach for getting “Thousands of primer-free, high-quality, full-length SSU rRNA sequences from all domains of life”.

AAAGATGAAGATNNNNNNNNNNCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTTTTTTTTTTTTTTTT<—- fragment of SSU cDNA molecule—->GGGCAATATCAGCACCAACAGAAATAGATCGCNNNNNNNNNNATGGATGAGTCT

The number of T’s before the fragment varies between molecules because it is a result of the polyA tailing described in the paper. The black parts of the sequence are generally not needed for the purpose of Nanopore sequencing, but are present in the molecule because they were needed for the Illumina sequencing.
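As a small illustration of the read layout, the two 10 bp tags can be pulled out of a read in R by exact matching of the outer adaptor sequences. This is not how the actual pipeline does it (cutadapt tolerates errors in the adaptors, exact matching does not):

extract_tags <- function(read,
                         left  = "AAAGATGAAGAT",   # adaptor in front of the first tag
                         right = "ATGGATGAGTCT") { # adaptor after the second tag
  l <- regexpr(left,  read, fixed = TRUE)
  r <- regexpr(right, read, fixed = TRUE)
  if (l == -1 || r == -1) return(NULL)  # adaptor not found, e.g. too many sequencing errors
  c(tag1 = substr(read, l + nchar(left), l + nchar(left) + 9),  # 10 bases after the left adaptor
    tag2 = substr(read, r - 10, r - 1))                         # 10 bases before the right adaptor
}

Applied to the example reads below, it returns GTTCGTTATT as the front tag for all three reads in the cluster.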

Example Nanopore sequence read:

>18125
GATCTGGCTTCGTTCGGTTACGTATTGCTGGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTGGTCAAGCCTCACGAGCAATTAGTACTGGTTAACTCAACGCCTCACAACGCTTACACACCCAGCCTATCAACGTCGTAGTCTCCGACGGCCCTTCAGGGGAATCAAGTTCCAGTGAGATCTCATCTTGAGGCAAGTTTCCGCTTAGATGCTTTCAGCGGTTATCTTTTCCGAACATGGCTACCCGGCAATGCCACTGGCGTGACAACCGGAACACCAGAGTTCGTCCACCCGGTCCTTCCGTACTAGGAGCAGCCCCTCTCAAATTCAAACGTCCACGGCCAGATGGGGACCGAACTGTCTCACGACGTTCTAAGCCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTAGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATCGAGGTGCCAAACACCGCCGTCGATAAACTCTTGGGCGGTATCAGCCTGTTATCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCTTCCATACAGAACCACCGGATCTTCAAGACCTACTTTCGTACCTGCTCGACGTGTCTGCTCTGATCAAGCGCTTTTGCCTTTATATTCTCTGCGACCGATTTCCGACCGGTCTGAGCGCACCTTCGTGGTACTCCTCCGTTACTCTTTTAGGAGGAGACCGCCCCAGTCAAACTGCCCACCATACACTGTCCTCGATCCGGATTACGGACCAGAGTTAGAACCTCAAGCATGCCAGGATGGTGATTTCAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCCACCTAATCCTACACAGCAGGCTCAGTCCAGTGCCGCTACAGTAAAGGTTCACGGGGTCTTTCCGTCGCCGCGGATACACTGCATCTTCACAGCGATTTCAATTTCACTGAGTCTCGGGTGGAGACAGCGCCGCCATCGTTACGCCACTCGTGCAGGTCGGAACTTACCCGACAAGGAATTTCGCTACCTTGGACCGTTATCGTTACGGCCGCCGTTTACCGGGGCTAGATCAGGCTTCGCGCCCCATCAATACTTCCGGCACCGGGAGGCGTCACACTTATACGCCGTCCACTTTCGTGTTTTGCAGAGTGCTGTGTTTTTAATAAACAGTCGCAGCGGCCTGGTATCTTCGACCAGCCAGAGCTTACGGAGTAAATCCTTCACCCTAGCCGGCGCACCTTCTCCCGAAGTTACGGTGCCATTTGCCTAGTTCCTTCACCCGAGTTCTCAAGCGCCTTGGTATTCTCTACCCGACCACCTGTGTCGGTTTGGGTGCAGTTCCTGGTGCCTGAAGCTTAGAAGCTTTTGGAAGCATGGCATCAACCACTTCGTCGTCTAAAAGACGACTCGTCATCAACTCTCGGCCTTGAAACCCCGGATTTACCTAAGATTTCAGCCTACCACCTTAAACTTGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGGTGGAACGTTCTGTTTATGTTTCTGAAA

Example cluster of reads with the same random tag sequence:

>18125
GATCTGGCTTCGTTCGGTTACGTATTGCTGGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTGGTCAAGCCTCACGAGCAATTAGTACTGGTTAACTCAACGCCTCACAACGCTTACACACCCAGCCTATCAACGTCGTAGTCTCCGACGGCCCTTCAGGGGAATCAAGTTCCAGTGAGATCTCATCTTGAGGCAAGTTTCCGCTTAGATGCTTTCAGCGGTTATCTTTTCCGAACATGGCTACCCGGCAATGCCACTGGCGTGACAACCGGAACACCAGAGTTCGTCCACCCGGTCCTTCCGTACTAGGAGCAGCCCCTCTCAAATTCAAACGTCCACGGCCAGATGGGGACCGAACTGTCTCACGACGTTCTAAGCCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTAGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATCGAGGTGCCAAACACCGCCGTCGATAAACTCTTGGGCGGTATCAGCCTGTTATCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCTTCCATACAGAACCACCGGATCTTCAAGACCTACTTTCGTACCTGCTCGACGTGTCTGCTCTGATCAAGCGCTTTTGCCTTTATATTCTCTGCGACCGATTTCCGACCGGTCTGAGCGCACCTTCGTGGTACTCCTCCGTTACTCTTTTAGGAGGAGACCGCCCCAGTCAAACTGCCCACCATACACTGTCCTCGATCCGGATTACGGACCAGAGTTAGAACCTCAAGCATGCCAGGATGGTGATTTCAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCCACCTAATCCTACACAGCAGGCTCAGTCCAGTGCCGCTACAGTAAAGGTTCACGGGGTCTTTCCGTCGCCGCGGATACACTGCATCTTCACAGCGATTTCAATTTCACTGAGTCTCGGGTGGAGACAGCGCCGCCATCGTTACGCCACTCGTGCAGGTCGGAACTTACCCGACAAGGAATTTCGCTACCTTGGACCGTTATCGTTACGGCCGCCGTTTACCGGGGCTAGATCAGGCTTCGCGCCCCATCAATACTTCCGGCACCGGGAGGCGTCACACTTATACGCCGTCCACTTTCGTGTTTTGCAGAGTGCTGTGTTTTTAATAAACAGTCGCAGCGGCCTGGTATCTTCGACCAGCCAGAGCTTACGGAGTAAATCCTTCACCCTAGCCGGCGCACCTTCTCCCGAAGTTACGGTGCCATTTGCCTAGTTCCTTCACCCGAGTTCTCAAGCGCCTTGGTATTCTCTACCCGACCACCTGTGTCGGTTTGGGTGCAGTTCCTGGTGCCTGAAGCTTAGAAGCTTTTGGAAGCATGGCATCAACCACTTCGTCGTCTAAAAGACGACTCGTCATCAACTCTCGGCCTTGAAACCCCGGATTTACCTAAGATTTCAGCCTACCACCTTAAACTTGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGGTGGAACGTTCTGTTTATGTTTCTGAAA
>42262
AGCGTTCAGATTACGTATTGCTAGGGGGCAAAGATGAAGATGTTCGTTATTCTTTAGACTTGCCTGTCGCTCTATCTTCTCTTTTTGGTCAAGCCTCACGGGCAATTAGTACTGGTTAGCTCAACGCCTCACAACGCTTACAGCCTATCAACGTCATAATTCTTCTGACGGCCCTTCAGAATCAAGTTCCCAGTGAGATCTCATCTTGAGCAAGTTTCCCACCGTCTTTCAGCGGTTATCTTTTCGAACCTGCTTCCAGCAATACCACTGGCGTGACAACCGGAACACCAGAGGTTCGTCCACTCCGGTCCTCTCCGTACTAGGGCAGCCCTCTCAAATCTCAGAACGTCCACGGCAGATAGGACCGAACTGTCTCACGACGTTCTAAGCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATGCAGGACCGGCTTCGGCCCCAGGATGTGATGAGCCGGCATCGGGTGCCAAACACCGCCGTCGATATAAACTCGGGCATTGACCTGTTATCCCCGGGTACCTTTTTATCGTTGAGCGATGGCCCTTCCATACCAGAACCACCGGATCACTACAGACCTACTTTCGTACCTGCTCGCTGTCTGTCGCGGCCAAGCGCTTTTGCTATGCTCTGCGACCGATTTCCGACCGGTCTGGGCGCACCTTCGTACTCCGTTGCCTCTTTTGGAGACCGCTGATCAAACTGCCCACCATACACTGTCCTCGATCCGGATTACCAGAGTTTAGAACTCAATGCCAGGGTGGTATTTCAAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCACCTATCCTACAAGCAGGCTCAAAGTCCAGTACAACTACAGTAGGTTCACGGGGTCTTTCCGTCTAGCCGCGGATACCTGCATCTTCAGCGTTTCAATTTCACTGAGTCTCAGGTGGAGACAGCGCCGCCATCGTTACGCCATTCGTGCAGGTCGGAACTTGCCGACAAGGAATTTTGCACCTTGGGACCATTCGTTACGCCGTTTACCGGGGCTGATCAAGAGCTTGCTTGCGCTAACCCCATCAATTAATTTTTCCGGCACCGGGGAGGCGTCACACCTACGTCCCACTGCGTGTTTGCAGAGTGCTGTGTTTAATAAGTCGCAGCAGCTCAGTATCTTCGACCAGCCAGAGCTTACGGAGTAAATCTTCACCTAGCCGGCGACCTTCTCCCGGAAGTTACGGTGCCATTTGCCTAGTTCCTCCGCACCCGAAAGCCCTTCGCGCCTTGGTATTCTCTACCCGACCTGTGTCGGTTTGGGGCACGGTTCCTGGCCTGAAGCAGAAGCTTTTCTTGGAAGCCTGGCATCAACCACTTCGTCATCTAAAAGACGACTCGTCATCAGCTCTCGGCCTTGAAACCGGATTTACCTAAGATTTCAGCCTACCACCTTAAACTTGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGTGGAGACGTTCTGTTTATGTTTCTATC
>50101
CCCGGTTACGTATTGCTAGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTGGTCAAGCCTGCGGGCAATTATACTGGATAGCTCAACGCCTCACAACGCATACACCCAGCTTCTATCAACGTCGTAGTCTTCGACGGCCCTTCAGGAATCAAGTTCCCAGTGAGATCTCATCTTGAGGCAAGTTTCCCGCTTAGATGCTTTCAGCGGTTATCTTTTCCGAACATAGCTACCCGGCAATGCCACTGGCGTGACAACCGGAACACCAGAGGTTCGTCCACTCCGGTCCTCTCGTACTAGGAGCAGCCTCTCAAATCAAACGTCCACGGCAGATATAGGGACCGAACTGTCTCACGACGTTTCTAAACCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCACCCTTGGGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATCGGGAACAAACACCGCCGTCGATATAAACTCTTGGGCGGTATCAGCCTGTTATCCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCCTTCCATACAGAACCACCGGATCACTAAGACCTACTTTCGTACCTGCTCGACGTGTCTGTCTCGCAGTCAAGCGCGCTTTTGCTTTATACTCTGCGACCGATTTCCGACCGGTCTGAGCGCACCTTCGTACTCTCCGTTACTCTTTAGGAGACCGCCCCAGTCAAACTGCCCACCATACACTGTCCCTATCGATCCGGATTACGGACAGAGTTAGAACCTCAAGCATGCCAGGGTGGTATTTCAAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCACCTATCCTACACAAGCAGGCTCAAAGTCCAGTGCAAAGCTACAGTAAGGTTCACGGGTCTTTCCGTCTAGCCGCGGATACACTGCATCTCCACAGCGATTTCACCTCACTGAGTCTCTCGGGTGGAGACAGCGCCGCCATCGTTACGCCATTCGTGCAGGTCGGAACTTACCGACAAGGAATTTCGCTACCTTAGACCGTTATCGTTACGGCCGCCGTTTACCGGGGCTTCGATCAAGAGCTTCGCTTGCGCTAACCCCATCAATTAACCTTCGGCACCGGGGAGGCGTCACACCCTATACGTCCACTTTCGTGTTTGCAGAGTGCTGTGGCTTTTAATAAACAGTCGCAGCGGCCTGGTATCTTTTCGACCAGCCAGAGCTTACGGAGTAAATCCTTCACCTTAGCCGGCGCACCTTCTCCCGAAGTTACGGTGCCATTTGCTAGTTCCTTCACCCGAGTTCTCTCAAGCGCCTTGGTATTCTCTACCCGACCACCTGTGTCGGTTTGGGGTACGGTTCTGGTTACCTGAAGCTTAGAAGCTTTTCTTGGAAGCATGGCATCAACCACTTCGTCGTCTAAAGACGACTCGTCATCAGCTCTCGGCCTTGAAACCCCGGATTTACCTAAAGATTTCAGCCTACCACCTTAAACTTAGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGGTGGAAGTTCTGTTTATGTTTCTTGAGC


Starting from scratch – building a package in R

For the first time, I am going to share something more closely related to my master thesis. When I started the thesis, I did not know how to use R. To learn R, I started using DataCamp, a series of interactive courses where you can start from scratch and build your skills step by step. My favorite course so far is called “Writing Functions in R”. During the course, you are told:

If you need to copy something three times or more – build a function.

As a rookie, this sounds complicated, but it is in fact really simple AND it will save you a lot of trouble later. Defining a function allows you to produce code that is easy to read, easy to reuse and easy to share with colleagues. A function is defined with the following syntax:

my_function <- function(arguments) {
  # function body
}

my_function can be any valid variable name; however, it is a good idea to avoid names already used elsewhere in R. Arguments (also called formals) can be any R objects that are needed for my_function to run, for example numbers or data frames. An argument can have a default value or not; if not, a value must be provided for the function to run. The function body is the code between the curly brackets { } and is run every time the function is called. Preferably, the function body should be short, and a function should do just one thing. If a large function cannot be avoided, it can often be built from a series of smaller functions. An example of a simple function could be:

cookies <- function(who, number=10){
  print(paste(who, "ate", number, "cookies", sep = " "))
}

The cookies function has two arguments: the number argument defaults to 10, so the user does not necessarily need to provide a value, whereas the who argument has no default and a name must be provided. I had some cookies, BUT I only had nine, so I better change the number argument:

cookies(who="Julie", number=9)
[1] "Julie ate 9 cookies"

So, now I have defined a function to keep track of my cookie consumption. What if I want to share this with the rest of Albertsen Lab? I could forward a script for them to save locally. No no, I will build a personal R package. This might seem like overkill for the cookies function, but imagine a more complex one. In my search for helpful tools for calculating correlations, I have come across several functions or sets of functions with no documentation, and it is nearly impossible to piece together how, what and when to use the arguments with no help provided. So, I will now build a bare minimum package to help me share my function with the group, strongly inspired by Not So Standard Deviations. For more information, check out the excellent book “R Packages” by Hadley Wickham.

First, you will need the following packages:

install.packages("devtools")
library("devtools")
install.packages("roxygen2")
library("roxygen2")

After this we need to create a package directory:

create("cookies") #create package

So now a package called cookies has been created in the working directory (you can change it with setwd("my_directory")).

It is a good idea to update the DESCRIPTION file so that it contains the relevant information about the package (cookies) and the author (me); a sketch is shown below. The next step is to add the cookies function to the package. To do this, I save a script containing the function in the R folder. If you want to add more functions to your package, you can either create a new file for each function (recommended) or define the functions sequentially in one file.
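For reference, a bare-bones DESCRIPTION could look roughly like this; the field values here are just placeholders, and the file generated by devtools may differ slightly:

Package: cookies
Title: Keep Track of Cookie Consumption
Version: 0.1.0
Authors@R: person("Your", "Name", email = "you@example.com", role = c("aut", "cre"))
Description: Provides the cookies() function for logging who ate how many cookies.
License: GPL-3
Encoding: UTF-8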

Now comes the important part – documentation. Good documentation is key if you want other people to benefit from your work, and it can be done easily using the roxygen2 package, also by Hadley Wickham. roxygen2 uses a special comment syntax: text starting with #' is compiled into the correct format for R documentation when the package is processed. Make a new R script with the following code and save it as cookies.R in the folder cookies/R:

#' Cookies
#'
#' This function allows you to keep track of who ate your cookies.
#'
#' @param who Who ate the cookies? (Requires input)
#' @param number How many cookies have been eaten? (Default: 10)
#' @keywords cookies
#' @export


cookies <- function(who, number=10){
  print(paste(who, "ate", number, "cookies", sep = " "))
}

After creating the script, roxygen2 can be used to create all the needed files and documentation:

roxygenise("cookies/")

Lastly, the package needs to be installed:

install_local("cookies")

You can now access your awesome functions by loading your brand new package:

library("cookies")

Now you have built a real R package! If you type ?cookies in the console, a help page will actually pop up.

Finally, you can upload your package to github.com (Guide). This will allow others to try out your package, point out issues and suggest changes. Downloading and installing directly from GitHub is easy using install_github() from the devtools package. Try it out by typing this:

devtools::install_github("julieklessner/cookies")

It really can be this easy! So next time you copy something more than three times or really want to share your work, consider defining a function and building your own personal package with associated documentation.