What is a good genome assembly?

With short reads you will often get fragmented assemblies with high single-base accuracy, while with long, error-prone reads you can get a single-contig assembly from just a few ultra-long reads, but with many errors due to the lower read quality. Some of these errors can be fixed by adding more coverage and polishing the assembly. Another approach is to use both technologies in combination, which makes it fairly easy to produce a single-contig assembly. A good assembly should be in as many pieces as the original genetic elements it represents (one contig – one chromosome), but single-base accuracy is also essential to allow gene calling and genome alignments. There are many genome assemblers, polishing tools, etc. that will help you make a genome assembly, but how do you know if you have a good assembly? To test this, people have developed tools to calculate the Average Nucleotide Identity (ANI), estimate genome completeness, and assess the quality of genome assemblies compared to a reference. However, it may also be useful to use annotation tools to assess whether genes can be called correctly. The genome assembly, polishing, and annotation strategy is an ongoing discussion in the scientific community on Twitter.
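Several of the contiguity metrics used below boil down to N50-style statistics. As a toy illustration (not any particular tool's implementation), N50 can be computed from a list of contig lengths like this:

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L
    cover at least half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# A fragmented assembly vs. a single-contig assembly of the same total size
print(n50([4000, 3000, 2000, 1000]))  # → 3000
print(n50([10000]))                   # → 10000
```

NGA50, reported by QUAST in the table below, is the same idea computed on the blocks of the assembly that actually align to the reference, which is why it can drop sharply when an assembly contains mis-joins.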

To decide which strategy should be our “preferred” genome assembly approach based on data rather than my gut feeling about the “best assembly”, I decided to do some testing with a known “true” reference, E. coli K12 MG1655 (U00096.2).

Objective: develop a strategy to compare genome assemblies

Strategy: Use E. coli K12 MG1655 reference strain data to produce a set of assemblies and evaluate them with different assembly metrics in comparison to the “true” reference


Data          | Mbp | X coverage | Resource
Nanopore data | 183 | 40         | in house E. coli K12 MG1655 grown and prepped for 1D R9.4 with RAD002 (a subset of our total run to get a more realistic view than with 7+ Gbp data for a bacterial genome)
Illumina data | 270 | 58         | ENA: SRR2627175 (I used a subset of 1,095,594 reads)


Assembly                   | Read type | Contigs | Size (bp) | Relative size | ANI (%) | CheckM completeness | Prokka # CDS | Prokka # rRNA | Prokka # tRNA | Median CDS size | QUAST NGA50 | QUAST mismatches per 100 kb | QUAST indels per 100 kb
Miniasm + 1x Racon         | Long      | 1       | 4619818   | 0.996         | 99.02   | 70.87               | 10192        | 22            | 80            | 287             | 4618042     | 239.94                      | 541.42
Miniasm + 2x Racon         | Long      | 1       | 4622021   | 0.996         | 99.23   | 74.97               | 9626         | 22            | 82            | 308             | 4620219     | 222.51                      | 449.98
Miniasm + 2x Racon + Pilon | Hybrid    | 1       | 4622021   | 0.996         | 99.96   | 98.51               | 4509         | 22            | 88            | 761             | 3590212     | 10.35                       | 19.19
CANU + Nanopolish          | Long      | 1       | 4563152   | 0.984         | 99.27   | 71.79               | 9664         | 22            | 84            | 296             | 4561292     | 146.74                      | 468.02
CANU + Nanopolish + Pilon  | Hybrid    | 1       | 4567411   | 0.984         | 99.97   | 98.41               | 4415         | 22            | 87            | 701             | 329524      | 6.17                        | 15.79


The long-read-only assemblies are nice contiguous assemblies of approximately the same length as the reference. However, the genes are broken due to indel errors present in the reads, as seen in the high number of CDS detected and the low median CDS size. Polishing the assemblies with long reads improves the ANI value a lot, but the genes are still fragmented. Polishing with short reads is needed to get the number of CDS and the median CDS size in the range of the reference. The short-read-only assembly has a high sequence identity with the reference but is fragmented and cannot recreate the repeat structure of the genome. The number of CDS is lower than that of the reference, and the rRNA genes, which are known to be very similar if not identical, are messed up and co-assembled. The Unicycler assembly seems to deliver the best metrics across the board until we come to the NGA50 value reported by QUAST, which was surprisingly low. I used MUMmer to visualise the alignment of the assembly against the reference genome to investigate if there was a mis-assembly causing this. However, the figure below shows no big inversions or anything, so I guess that for now we will stick to Unicycler assemblies that combine the benefit of short-read accuracy with the power of long reads.

Dotplot of unicycler assembly aligned to the reference genome


I have used the following assemblers: Miniasm, CANU, and Unicycler

I have used the following mappers

I have used the following polishing tools: Racon, Nanopolish, and Pilon

I have used the following tools to assess genome assembly characteristics: CheckM, Prokka, QUAST, and MUMmer

If you have any ideas or superior tools we have missed please let us know in the comments.

Populating the tree-of-life


Hi everybody and welcome to my first blog post at Albertsen Lab. As a newly started PhD student, I have engaged myself with the simple, yet Herculean task of populating the tree-of-life. As most people are aware, microorganisms are more or less inescapably present in all places of the world; no matter how hostile an environment you encounter, it will probably accommodate some sort of small living organisms. As it was once elegantly put: “If you don’t like bacteria, you’re on the wrong planet”. Or in this case, if you don’t like bacteria, you’re in the wrong blog post. Recently, a study was published by Kenneth J. Locey & Jay T. Lennon (2016) trying to illuminate just how omnipresent microorganisms are. This was carried out by applying scaling laws to predict the global microbial diversity. Main conclusion: our planet hosts up to 10^12 microbial species. That is 1 trillion species! Of course, one trillion microbial species is still a number that is dwarfed by the approximately 10^30 bacterial and archaeal cells living on Earth. Since incredibly large numbers tend to be hard to grasp, the article kindly supplied some illustrative examples. For example, the ~10^30 bacterial and archaeal cells exceed the 22 orders of magnitude that separate the mass of a Prochlorococcus cell from a blue whale. It also exceeds the 26 orders of magnitude that result from measuring Earth’s surface area at a spatial grain equivalent to the size of a bacterium.

From the perspective of my research, it is naturally only the so-called microbial dark matter (microbes not yet discovered) that is of interest. Fortunately, only a vanishingly small fraction of microorganisms have been discovered to date. For some reason, it is currently a prerequisite to have a bacterium growing in pure culture before naming it. If you take a quick look at DSMZ’s homepage (one of the largest bioresource centers worldwide, located in Germany), you will find that they boast a collection of around 31,000 cultures representing some 10,000 species and 2,000 genera in total. Only a tiny bit short of one trillion. On a side note, I guess DSMZ will eventually face some kind of capacity-related problems if we insist on requiring each new species to be deposited as a pure culture before an ‘official’ name can be granted. Luckily, nowadays bacterial species can also be cataloged by their DNA sequences. Currently, one of the most widespread methods for identifying bacteria is 16S rRNA amplicon sequencing. Large databases such as SILVA and the Ribosomal Database Project (RDP) use huge amounts of digital ink to keep up with the ever-increasing influx of new sequences. The current version 128 of SILVA (released in September 2016) includes over 6,300,000 SSU/LSU sequences, whereas RDP Release 11 contains 3,356,809 16S rRNA sequences (also released in September 2016). Although this is definitely a lot more than 10,000 pure cultures, it is still ridiculously less than the total estimated microbial diversity of the Earth. Hence, as I begin my PhD study, estimates from Kenneth J. Locey & Jay T. Lennon suggest that potentially 99.999% of all microbial taxa remain undiscovered!

This may sound like my research is going to be a cakewalk. After all, I should be able to go to the lab and find novel microorganisms literally by accident. However, I have decided to drastically confine my explorations of microbial dark matter to a specific environment, although traveling around the world chasing novel bacteria would be pretty cool. So far, my primary focus has been sampling of microbial biomass from drinking water. You may think that looking for novelty in a resource where one of the main goals is to restrict living things is a bit weird. However, there is more than one good reason for this choice. 1) I worked with drinking water during my master’s thesis, so I already have experience with sampling and extraction procedures as well as a few connections with people working in this field. 2) Recently, articles such as Antonia Bruno et al. 2017, Karthik Anantharaman et al. 2016 and Birgit Luef et al. 2015 have illustrated that drinking water may potentially contain a large portion of microbial novelty. 3) Unless you are living in a third world country, gaining access to water is not very complicated. But just to clarify, do not mistake “easy access” for “easy sampling”, as the sampling of microorganisms from drinking water can be a king-sized pain in the ass.

A drastically under-represented visualization of the microbial diversity in drinking water.

As you might have guessed, I am not going to dedicate much time to trying to make bacteria living in large water reservoirs deep below the surface grow on plates in the lab. For identification, I am instead going to one-up the conventional 16S rRNA amplicon sequencing method. A new technique developed by some of the people in the Albertsen Lab has shown very promising results, generating full-length 16S rRNA gene sequences. Hopefully, I will have the opportunity to address it in more detail in a later blog post. However, the first hurdle is simply collecting sufficient amounts of biomass for subsequent extraction. As stated in the protocol, the input material is ~800 ng RNA. If you, like many of my colleagues at Albertsen Lab, have been working with wastewater, 800 ng RNA is no big deal. Getting the same amount from drinking water is a big deal. Drinking water typically has bacterial concentrations in the range of 10^3–10^5 cells/ml, which is why collecting adequate amounts of biomass is difficult. I naively started out using the same sampling setup that I used in my master’s thesis (where 1 ng/µl DNA would be plenty for further analysis). It basically consisted of a vacuum pump and a disposable funnel, and after spending too many hours pouring several liters of water into a puny 250 ml funnel, I ended up with negligible amounts of RNA. This is the point where you break every single pipette in the lab in despair and tell your supervisor that the current sampling method is not feasible. Instead, the sampling task has been partly outsourced to people with more specialized equipment. So far I have worked with samples based on 100+ liters of water yielding RNA amounts of more than 800 ng, but the workflow still needs some optimization.
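As a quick back-of-envelope sketch of why biomass is the bottleneck, using the cell concentrations quoted above (per-cell RNA content is deliberately left out, since it varies enormously between environments):

```python
# Rough number of cells captured from drinking water, given the
# concentration range quoted in the text (10^3 - 10^5 cells/ml)
def cells_captured(liters, cells_per_ml):
    return liters * 1000 * cells_per_ml

# One liter at the low end vs. 100+ liters at a mid-range concentration
print(f"{cells_captured(1, 1e3):.0e}")    # → 1e+06
print(f"{cells_captured(100, 1e4):.0e}")  # → 1e+09
```

Three orders of magnitude more cells from the large-volume sampling is the difference between negligible RNA and something an extraction can work with.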

Another aspect complicating the sampling step is the really obvious fact that bacteria are really small. I am perfectly aware that all bacteria can, with very good reason, be categorized as “small”; my statement refers to the aforementioned article from Birgit Luef et al. 2015, “Diverse uncultivated ultra-small bacterial cells in groundwater”. The article highlights findings that demonstrate how a wide range of bacteria can pass through a 0.2 µm-pore-sized membrane filter. FYI, filtration with a 0.2 µm filter is also commonly referred to as “sterile filtration”. One could argue it is a poorly chosen term for a filtration type that apparently allows numerous types of bacteria to pass the membrane unhindered. Also, sterile filtration is by far the most used sampling method in papers concerning 16S rRNA amplicon sequencing of drinking water. During my master’s thesis, I also utilized 0.2 µm filters for water samples; however, in the pursuit of novelty, the filter size has been reduced to 0.1 µm. This should, as far as I am concerned, capture all microbes inhabiting drinking water (although it is hard not to imagine at least one out of a trillion species slipping through).

Hopefully, I will start to generate full-length 16S rRNA sequences from novel bacteria in the near future and maybe share some interesting findings here on albertsenlab.org.

Can you beat our Nanopore read error correction? We hope so!

Tagging of individual molecules has been used as an effective consensus error-correction strategy for Illumina data (Kivioja et al 2011, Burke et al 2016, Zhang et al 2016), and the principle is similar to the circular consensus sequencing strategy used to generate consensus reads with error rates of < 1% on the PacBio (Travers et al 2010, Schloss et al 2016, Singer et al 2016) and Oxford Nanopore platforms (Li et al 2016). You simply sequence the same molecule several times and compare the reads to generate a consensus with a better accuracy than the individual reads. As far as we know, a tag-based consensus error-correction strategy has not been attempted for long reads before, probably because the raw error rate complicates identification of the tags. However, we see several benefits of the tag-based strategy in the long run, which is why we decided to pursue it.

My colleague @SorenKarst tested tag-based error correction on the Nanopore MinION in connection with our work on generating primer-free, full-length 16S/18S sequences from environmental rRNA (see our bioRxiv paper: Karst et al 2016). The main approach used in the paper is based on Illumina sequencing inspired by Burke et al 2016, but moving to Nanopore sequencing in the future would make the approach considerably easier. His approach was relatively “simple”: individual cDNA molecules were uniquely tagged at both ends with a 10 bp random sequence, then diluted to a few thousand molecules and amplified by PCR to generate 1000s of copies of each molecule, which were prepared for 2D sequencing on the Nanopore MinION. The resulting sequence reads were binned based on the unique tags, which indicated they originated from the same parent molecule, and a consensus was generated from each read bin. The approach was tested on a simple mock community with three reference organisms (E. coli MG1655, B. subtilis str. 168, and P. aeruginosa PAO1), which allowed us to calculate error rates.

For locating the unique tags, we used cutadapt with loose settings to locate the flanking adaptor sequences and extract the tag sequences. The tags were clustered and filtered based on abundance to remove false tags. As the tags and adaptors contain errors, it can be a challenge to cluster the tags correctly without merging groups that do not belong together. Afterwards, the filtered tags were used to extract and bin sequence reads using a custom perl script ;). For each bin we used the CANU correction tool followed by USEARCH consensus calling. With this very naive approach we were able to improve the median sequence similarity from 90% to 99%.
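The binning-and-consensus idea can be sketched in a few lines. This is not our actual pipeline: the mini-reads below are made up, the tags are matched exactly, and the consensus is a per-position majority vote over equal-length sequences, whereas real Nanopore reads have errors in the tags and indels that require alignment-based consensus (which is exactly what makes the problem hard):

```python
from collections import Counter, defaultdict

def bin_by_tag(reads, tag_len=10):
    """Group reads by their leading tag sequence (exact match only)."""
    bins = defaultdict(list)
    for read in reads:
        bins[read[:tag_len]].append(read[tag_len:])
    return bins

def naive_consensus(seqs):
    """Per-position majority vote over equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

reads = [
    "ACGTACGTAC" + "TTGACCA",
    "ACGTACGTAC" + "TTGACCA",
    "ACGTACGTAC" + "TTCACCA",  # one sequencing error in the fragment
    "GGGGCCCCAA" + "AAATTTT",  # a read from a different parent molecule
]
bins = bin_by_tag(reads)
print(naive_consensus(bins["ACGTACGTAC"]))  # → TTGACCA
```

The majority vote recovers the original fragment because the error occurs in only one of the three reads in the bin; this is the same principle the CANU + USEARCH steps apply, just with error-tolerant tools.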

We think this is a good start, but we are sure that someone in the nanopore community will be able to come up with a better solution to improve the error rate even further. The data is freely available, and a short description of the sequence read composition is provided below. We are looking forward to hearing your input!

Ps. If you come up with a solution that beats our “quick and dirty” one and post it here or on twitter, I will make sure to mention you in my slides at ASM ;).


Data and script availability:

The nanopore reads are available as fastq at: 2D.fq or fasta: 2Dr.fa and fast5: PRJEB20906

The 16S rRNA gene reference sequences: mockrRNAall.fasta


Our approach in a shell script
# 									    #
# Shell script for generating error corrected FL16S and getting error rates #
#									    #
# Use at your own RISK!							    #

#     Variables    #
# Update path to include poretools installation
# export PATH=$PATH:/space/users/rkirke08/.local/bin
# End of variables #
# Depends on the following files and software #
# folder with fast5 files "data/pass/"
# file with reference 16S sequences "mockrRNAall.fasta"
# perl script "F16S.cluster.split.pl"
# poretools
# cutadapt
# usearch8.1
# End of dependencies			      #

# Extract fastq files
# poretools fastq --type 2D data/pass/ > data/2D.fq

# Rename headers (Some tools do not accept the long poretools headers)
#awk '{print (NR%4 == 1) ? "@" ++i : $0}' data/2D.fq | sed -n '1~4s/^@/>/p;2~4p' > 2Dr.fa

# Find adapters
cutadapt -g AAAGATGAAGAT -e $ID_adapt -O 12 -m 1300 --untrimmed-output un1.fa -o a1.fa 2Dr.fa
cutadapt -a ATGGATGAGTCT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o a1_a2.fa a1.fa

usearch8.1 -fastx_revcomp un1.fa -label_suffix _RC -fastaout un1_rc.fa
cutadapt -g AAAGATGAAGAT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o ua1.fa un1_rc.fa
cutadapt -a ATGGATGAGTCT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o ua1_a2.fa ua1.fa

cat a1_a2.fa ua1_a2.fa > c.fa

# Extract barcodes
cut -c1-12 c.fa > i1.fa
rev c.fa | cut -c1-12 | rev > i2.fa

paste i1.fa i2.fa -d "" | cut -f1-2 -d ">" > i1i2.fa

# Cluster barcodes
usearch8.1 -cluster_fast i1i2.fa -id $ID_cluster -centroids nr.fa -uc res.uc -sizeout

# Extract raw sequences
perl F16S.cluster.split.pl -c res.uc -i c.fa -m 3 -f 50 -r 40
# Count number of files in directory
find clusters -type f | wc -l

for OTU in clusters/*
do
  wc -l $OTU >> lines.txt
done

for OTU in clusters/*
do
  OTUNO=$(echo $OTU | cut -f2 -d\/);
  # Rename header
  sed "s/>/>$OTUNO/" clusters/$OTUNO > clusters/newHeaders_$OTUNO

  # Correct reads using CANU
  $LINKtoCANU/canu -correct -p OTU_$OTUNO -d clusters/OTU_$OTUNO genomeSize=1.5k -nanopore-raw $OTU

  # Unzip corrected reads
  gunzip clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta.gz

  sed -i "s/>/>$OTUNO/" clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta

  # Call consensus using Usearch
  usearch8.1 -cluster_fast clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta -id 0.9 -centroids clusters/OTU_$OTUNO/nr_cor_$OTUNO.fa -uc clusters/OTU_$OTUNO/res_$OTUNO.uc -sizeout -consout clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa

  sed -i "s/>/>$OTUNO/" clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa

  # Map reads to references to estimate error rate
  # Raw reads
  # Map FL16S back to references
  usearch8.1 -usearch_global clusters/newHeaders_$OTUNO -db mockrRNAall.fasta -strand both -id 0.60 -top_hit_only -maxaccepts 10 -query_cov 0.5 -userout clusters/map_raw_$OTUNO.txt -userfields query+target+id+ql+tl+alnlen

  # Usearch consensus corrected sequence
  # Map FL16S back to references
  usearch8.1 -usearch_global clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa -db mockrRNAall.fasta -strand both -id 0.60 -top_hit_only -maxaccepts 10 -query_cov 0.5 -userout clusters/map_cor_Ucons_$OTUNO.txt -userfields query+target+id+ql+tl+alnlen

  cat clusters/map_raw_$OTUNO.txt >> myfile.txt
  cat clusters/map_cor_Ucons_$OTUNO.txt >> myfile.txt

  # Collect corrected sequences
  cat clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa >> final_corrected.fa
done




The perl script: F16S.cluster.split.pl

cDNA molecule composition

The cDNA molecules are tagged by attaching adaptors to each end of the molecule. The adaptor contains a priming site (red), the unique 10 bp tag sequence (blue) and a flanking sequence (black). Note that the design is complicated because we simply modified it from our approach to get Thousands of primer-free, high-quality, full-length SSU rRNA sequences from all domains of life.


The number of T’s before the fragment will vary between molecules because it is a result of the polyA tailing described in the paper. The black parts of the sequence are generally not needed for the purpose of Nanopore sequencing but are present in the molecule because they were needed for the Illumina sequencing.

Example Nanopore sequence read:


Example cluster with same random sequence:



Starting from scratch – building a package in R

For the first time, I am going to share something more related to my master’s thesis. When I started this thesis, I did not know how to use R. In order to learn R, I started using DataCamp, which offers a series of interactive courses. You can start from scratch and build your skills step by step. My favorite course so far is called “Writing Functions in R”. During the course, you are told:

If you need to copy something three times or more – build a function.

As a rookie, it sounds complicated, but in fact it is really simple AND it will save you so much trouble later. Defining a function will allow you to produce code that is easy to read, easy to reuse and easy to share with colleagues.  A function can be defined by the following syntax:

my_function <- function(arguments){
  Function Body
}

my_function can be any valid variable name; however, it is a good idea to avoid names used elsewhere in R. Arguments (also called formals) can be any R objects that are needed for my_function to run, for example numbers or data frames. Arguments can have a default value or not. If not, a value must be provided for the given function to run. The function body is the code between the { brackets }, and this is run every time the function is applied. Preferably, the function body should be short, and a function should do just one thing. If a large function cannot be avoided, it can often be constructed from a series of small functions. An example of a simple function could be:

cookies <- function(who, number=10){
  print(paste(who, "ate", number, "cookies", sep = " "))
}

The cookie function has two arguments: the number argument defaults to 10, so the user does not necessarily need to provide a value. The who argument, on the other hand, has no default and a name must be provided. I had some cookies BUT I only had nine cookies, so I better change the number argument:

cookies(who="Julie", number=9)
[1] "Julie ate 9 cookies"

So, now I have defined a function to keep track of my cookie consumption. What if I want to share this with the rest of Albertsen Lab? I could forward a script for them to save locally.  No no, I will build a personal R package. This might seem like overkill for the cookie function, but imagine a more complex function.  In my search for helpful tools for calculating correlations, I have come by several functions/sets of functions with no documentation. It is nearly impossible to piece together how, what and when to use arguments with no provided help.  So, now I will build a bare minimum package to help me share my function with the group, strongly inspired by Not So Standard Deviations. For more information check out the excellent book  “R-packages” by Hadley Wickham.

First, you will need the following packages:
install.packages("devtools")
library(devtools)
library(roxygen2)


After this we need to create a package directory:

create("cookies") #create package

So now, a package called cookies has been created (you can change the working directory with: setwd("my_directory")).

It is a good idea to update the DESCRIPTION file, so that it contains the relevant information about the package (cookies) and the author (me). Next step is to add the cookie function to the package. For this I save a script containing the function in the R folder. If you want to add more functions to your package, you can either create a new file for each function (recommended) or define the functions sequentially in one file.

Now comes the important part – documentation. Good documentation is key if you want other people to benefit from your work. This can be done easily using the roxygen2 package also by Hadley Wickham. roxygen2 uses a custom syntax so that the text starting with #' will be compiled into the correct format for R documentation when processed. Make a new R script with the following code and save it as cookies.R in the folder cookies/R:

#' Cookies
#'
#' This function will allow you to keep track of who ate your cookies.
#' @param who Who ate the cookies? (Requires input)
#' @param number How many cookies have been eaten? (Default: 10)
#' @keywords cookies
#' @export

cookies <- function(who, number=10){
  print(paste(who, "ate", number, "cookies", sep = " "))
}

After creating the script, roxygen2 can be used to create all the needed files and documentation:

document("cookies")

Lastly the package needs to be installed:

install("cookies")

You can now access your awesome functions by loading your brand new package:

library(cookies)

 Now you have built a real R package! If you type ?cookies in the console a help page will actually pop up.

Finally, you can upload your package to github.com (Guide). This will allow others to try out your package, point out issues and suggest changes. Downloading and installing directly from GitHub is easy using install_github() from the devtools package. Try it out by typing this (with your own GitHub user name in place of the placeholder):

install_github("<username>/cookies")

It really can be this easy! So next time you copy something more than three times or really want to share your work, consider defining a function and building your own personal package with associated documentation.

Analysing amplicon data, how easy does it get?

Ever done amplicon DNA sequencing of the 16S rRNA gene to identify microbes? If so, then you must know about the challenge of analysing such complex data easily.

My name is Kasper, and I am currently a master’s student here at the Albertsen Lab. When I first learned how DNA is sequenced today, I was astonished by the rapid development DNA sequencing technologies have experienced during the last decade. A whole human genome can now be sequenced within a day for less than $1000! The applications of DNA sequencing are countless and there are countless questions yet to be answered. But the first and most important question is perhaps… how? What to do with millions of sequences of just A, C, T and G’s? Well, that question is the foundation of a huge field within biology: bioinformatics! Which is something we try to expand here at Albertsen Lab.

During my master’s thesis I have been working with ampvis to take it a step or two further. Using R for bioinformatics takes a certain skill level, and I’ve spent weeks of my project on learning R in depth to be able to write R functions for my thesis. During my project, I have specifically been applying various ordination methods (such as Principal Components Analysis, Correspondence Analysis, Redundancy Analysis and more) and other multivariate statistics to activated sludge samples to identify patterns between and within Danish wastewater treatment plants – more on that will follow in subsequent blog posts. However, after I spent all that time learning R I thought:

Does everyone need to spend so much time learning how to do complex bioinformatics just to do simple data analysis?

Interactive data analysis through Shiny

If you want to do reproducible research, then yes. There is no other way. But if you just want to do a brief analysis of amplicon data using basic functions of ampvis, I have done all the work for you by making an interactive Shiny app of ampvis. A Shiny app is basically designed from a collection of HTML widgets custom made to communicate with an active R session, so anything that can be done in R can be done with a Shiny app. This means that you can now use ampvis using only mouse clicks! No R experience required.

Amplicon Visualiser in action

All you have to do is upload your data and start clicking around! If the data is suited for ampvis, it is also suited for the app (an OTU table + metadata if any; minimal examples can be downloaded from within the app). We recommend using UPARSE for OTU clustering into an otutable. You can then do basic filtering based on the metadata columns and also subset taxa if you are looking for specific microbes. With the app comes some example data from the MiDAS database with samples taken from Danish wastewater treatment plants (WWTPs) in 2014 and 2015. As a start you can simply click analysis, then render plot to get a simple overview of all the samples grouped by WWTP. No harm can be done, so feel free to try out every clickable element!

The app is available here: https://kasperskytte.shinyapps.io/shinyampvis/.

Of course, it is open source, so the sourcecode is available on my Github. If you encounter any issues please report them here or in the comments below. Any suggestions for improvements are of course welcome.
(spoiler: Among future plans of the app is more extensive ordination with various metrics, data transformation and ordination types)

Traversing the fog… Soon graduating – then what??

The other day I watched my boyfriend playing the game Dark Souls III. At some point he had to step through this fog, entering a new area or a boss area or something else. You never know.

It was a boss, but I don’t remember how it went from that point on. Because it got me thinking, this summer I will graduate and, for the first time since I was a kid, I will not be a student. Wow, it is scary! My boyfriend is also graduating this summer. So, everyday life for our family will change. We live in a dormitory with our 1.5 year old son and we will have to move. But where? Closer to family? Closer to job opportunities? Somewhere in the middle? And where will that be?

The above are a small selection of the thoughts that crossed my mind. All of them circled one thing – can I find a job? I believe that the answer is yes. But can I do something before I graduate to help the process? Again, I believe that the answer is yes. So, as the to-do-list kinda person I am – I made a plan! Firstly, I must figure out what direction I want to go. The projects I did earlier are very different from my thesis. If I want to continue this path, the words I usually use to describe my professional profile, must change. This is a work in progress.

Second – LinkedIn. My profile is a mess at the moment and does not accurately describe my capabilities. The next project is therefore to tailor my profile to the path I want to take. Together with my fellow Albertsen Lab students, Kasper and Peter, I attended an afternoon course in LinkedIn, which was very useful. For example, changing a simple thing such as your headline can make a big difference.







It still needs some work I think, and there are several great guides out there. I will find some inspiration in one available at the university called Pejling (also notice the great interview about being a parent and a student on page 20). As before, we are giving each other feedback. This way I hope to get a clear-cut profile before I start the job hunt.

Next step will be updating my basic résumé. This brings up the eternal question: How to write a good CV? There is a different answer depending on who you ask. Again, I plan to find some inspiration in Pejling. Afterwards I will kindly ask for feedback from my supervisors to help me improve it even further. After building my basic résumé, I plan to tailor my CV for each application, including only what is relevant for the particular job.

Lastly, I plan to apply for jobs before I graduate. It would be awesome to have the security of a job before I graduate, and the only way to do this is to start before I’m done. Furthermore, I think it demonstrates a capability to juggle several tasks at once. Finally, I plan to get feedback if/when some of my applications are turned down.

In the end, this turned out to be very personal. When this is published, people will know my intentions, even if I don’t succeed in finding a job right away. This makes me feel vulnerable. However, for me, being close to graduation and the huge decisions that follow has become a big part of being a master’s student.

Fast, easy and robust DNA extraction for on-site bacterial identification using MinION

My name is Peter Rendbæk, and I’m currently a master’s student in the Albertsen lab. The overarching aim of my master’s project is to serve as a pre-test for several of the new big projects in the group, which focus on applying on-line bacterial identification for process control at wastewater treatment plants. Hence, for the last couple of months I have been working on the project “Developing methods for on-site DNA sequencing using the Oxford Nanopore MinION platform”. The MinION has improved a lot since its release three years ago, and it can now be used to make rapid determinations of bacterial composition.

The potential for this fast and mobile DNA-sequencing is mind-blowing. However, given that the technology is here now (!), there has been relatively little focus on portable, fast, easy and robust DNA extraction. Hence, I’ve spent the last months on trying to develop a fast, cheap, mobile, robust and easy to use DNA extraction method.

There is a significant amount of bias associated with DNA extraction, but for wastewater treatment samples this bias has been investigated in depth. However, the "optimized method" is not suited for on-site DNA extraction. There are three principal steps in DNA extraction: cell lysis, debris removal and DNA isolation. I will cover each step below and discuss how I simplified it.

In general, complex samples require bead beating for cell lysis and homogenization. The problem is that our in-house bead beating is done on a big tabletop instrument weighing 17 kg, which makes it hard to transport. However, I came across a blog post from the Loman labs about sample preparation and DNA extraction in the field for Nanopore sequencing. The post outlines the possibilities of a portable bead beater built from a remodeled power tool. I thought this was interesting, so I went out, bought an oscillating multi-tool cutter, and tried it with lots of duct tape…

The amazing part was that it worked! The problem was that the samples would get "beaten" differently depending on how you taped the sample to the power tool, which could give rise to large variations in the observed microbial community.

I solved this by 3D printing an adapter for the power tool that fits the bead-beating tube (finally, a good excuse to use a 3D printer!). I used Solidworks to design the adapter and collaborated with our local department of mechanical and manufacturing engineering (m-tech) in 3D printing it. You can make your own by simply downloading my design from Thingiverse (it did take a few iterations to make it durable enough, and I still use a little duct tape..).


After the bead beating, cell debris is removed by centrifugation. Our "standard" protocol recommends centrifugation at 14,000 × g for 10 minutes at 4 °C. However, that seemed a little excessive to us, and it requires a huge, non-transportable centrifuge… Alternatively, there are plenty of small, easy-to-transport and easy-to-use centrifuges available if we do not have to centrifuge at 14,000 × g at 4 °C. There is even the possibility to 3D print a hand-powered centrifuge. However, I did not follow this path, as it seems a bit dangerous… After several tests, we discovered that a simple tabletop centrifuge could do the job perfectly well at 2,000 × g for 1 min at room temperature, if we combined it with the DNA isolation described below.

The last step is DNA isolation. I tried several different methods, but we got the idea to simply use Agencourt AMPure XP beads, which are routinely used in e.g. PCR purification (we diluted the AMPure XP beads 1:10 to save some money, and it seems to work just as well). And… it works!

So, now you have an overview of the method I developed. The most amazing part is that it works! It takes 10-15 minutes from when the sample is taken until you have DNA ready for use, compared to 60+ minutes for our "standard" protocol. Furthermore, it requires inexpensive equipment that can be carried in a small suitcase. So, just to prove that this approach is fast, I filmed myself doing the DNA extraction with a GoPro camera, as you can see below.

The next part is to test the MinION in the lab. How fast can we identify bacteria, and is the extracted DNA compatible with the downstream library preparation, which we hope to do on our new and shiny VolTRAX (which is now moving liquids!)?

Promethion configuration test and “reboxing”

Following our PromethION unboxing event we have finally managed to connect the machine to our University network and pass the configuration run. The configuration test demonstrates that the network can handle the data produced and transfer it fast enough to a network storage solution.


After a successful configuration test (millions of fast5 files generated) we contacted Oxford Nanopore and they organised to pick up the configuration unit.

So now we just have to wait for the sequencing unit which they mentioned should be available around the end of February (Pretty soon!). Maybe the CTO is building it himself this very moment.

The hardware installation was super-easy, but the network configuration was a challenge. The PromethION settings are controlled through a web interface, which seems to run some configuration scripts whenever we change something and submit the change. However, it is not completely transparent what is going on, and handling the PromethION as a black-box system with multiple network interfaces was not exactly straightforward. Furthermore, our University runs what they call an "enterprise configuration". This essentially means that we are not allowed to play with settings, and in order to keep maintenance of the entire network feasible, we cannot make too many "quick and dirty" workarounds.

Hence, our local IT-experts have been essential in getting the PromethION up and running!

I would highly recommend that you get your hands on a "linux ninja" and a true network expert if you risk running into non-standard configurations (my personal guess is that it would probably have worked fairly straightforwardly if we had plugged it into a standard off-the-shelf router with its default configuration).

Our IT guys came up with the following wishes for improving the network setup:

“For faster understanding of the device, it would be nice with a network drawing and maybe a few lines on how it is designed to work (for a sysadmin or network admin).

It should be possible to use static addresses for everything.

It would be really nice if you could tell the setup script a start or end IP. We had to reconfigure our network because the setup insisted on using the top two addresses in the /24 network we had assigned to it (.253 and .254), which are used for HSRP in our network. We had to remove HSRP on our 10G network and enable DHCP on the local server network (Ethernet), which is not desirable.

Having a “management” interface on a DHCP-assigned address is a hassle. Allow static addresses, so it’s easy to use DNS names instead of guessing IPs.

Allow configuring static DNS.

The test device came with statically defined routes (can’t remember the subnet on this one). Both — the 172.16 and the 192.168 routes — are problematic, as they are WAY bigger than they need to be. 172.16 is especially problematic for our network, as our DNS servers are on 172.18.x.y, and I did not find any good reason for 172.16 anywhere. 192.168 I am guessing is there for the default setup; that should, in my point of view, be limited to the actual range used. Static routes can be needed, but should be kept as small as possible.”
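To make the route-size complaint concrete, here is a small sketch using Python's standard `ipaddress` module. The exact subnets on the device are not recalled above, so the /12 and /16 prefixes and the DNS server address below are assumed examples:

```python
import ipaddress

# Hypothetical overly broad route, e.g. the full RFC 1918 "172" private block.
broad_route = ipaddress.ip_network("172.16.0.0/12")
# A narrower route, limited to what the device might actually need.
narrow_route = ipaddress.ip_network("172.16.0.0/16")

# Example address on a 172.18.x.y DNS server range.
dns_server = ipaddress.ip_address("172.18.1.1")

print(dns_server in broad_route)   # True  -> DNS traffic follows the device's route
print(dns_server in narrow_route)  # False -> DNS traffic is unaffected
```

The broad /12 route silently captures traffic destined for anything in 172.16.0.0–172.31.255.255, including the 172.18.x.y DNS servers, which is exactly why routes should be kept as small as possible.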

Learning how to make a good presentation

As a student, you will have to present sometime during your education. Despite this, there is hardly any time allocated to learning the skills required to give a good presentation.

As part of your Master's degree at Aalborg University, you'll have to participate in at least one status seminar presenting your thesis (20 minutes). Afterwards, there is a 5-minute slot for questions from the audience. The audience will be your fellow students, your supervisor(s), and other students or employees who may be interested in your project.

My fellow (Albertsen Lab) master's students and I spent approximately two weeks preparing for this. During this period, it became clear that the amount of guidance we got was pretty unusual. Hence, I thought I would share how we prepared and the difference it made, both in general and specifically to our slides.

  • 19th: Meeting, brainstorming about content of presentation
  • 20th: Sending the first draft of the presentation and receiving feedback.
  • 26th: Rehearsal of presentations. Each student within our group presented and we were constructively critiqued by others in the group regarding slide content and presentation skills.
  • 27th: Improved slideshow was sent once again and feedback was given for the final time.
  • 31st: Status seminar

Although it seems to be rather extensive, I feel all of our presentations benefited from the extra effort.

Example from Peter's presentation

Before: Peter wanted to illustrate how he had optimized the method.

After: A line-up of conditions before and after optimization.

Example from Kasper's presentation

Before: Kasper wanted to illustrate how an ordination plot can change depending on the choice of distance metric.

After: Kasper added a progress bar (with neutral colors), found an example that better illustrated his point, added the citation, and underlined his point with a big red statement.

Example from my presentation (1)

Before: I wanted to show the current status of my network function.

After: I changed some visual properties in my tools for better visualization. I also changed the specific OTU names to example names, as my audience could not relate to the MiDAS database.

Example from my presentation (2)

Before: I wanted to make a quick introduction to correlation.


After: Removed text for simplification and added a citation.


What you cannot see from the examples, is the improvement in the delivery of our presentations. As a student it can be nerve-racking to present science in front of an audience. If you haven’t had feedback, that is just one more thing to be nervous about. Getting feedback both on my slides and my way of presenting them gave me the safety of proper preparation.

After this experience, I can't help but feel thankful that learning to present is a high priority in our group. It is key to be able to communicate your message clearly, especially in a scientific community. It is not part of our curriculum, and maybe it is too much to expect that students can learn to master this without any guidance.

The final presentation slides from Kasper, Peter, and me can be found on SlideShare.

How to spend 10.000.000 DKK (1.3 mill. EUR)

Recently, I was one of 16 recipients of a 10 mill. DKK grant (1.3 mill. EUR) from the VILLUM foundation under their Young Investigator Program (YIP). The program is unique in Denmark and offers young scientists an opportunity to build a research group on their own terms. The foundation works on the premise of its founder, who famously said:

“One experiment is better than a thousand expert opinions”

Villum Kann Rasmussen

Hence, they simply support good experiments and trust that the researchers will come up with great solutions if the foundation interferes as little as possible. This means as little administration as possible and flexible funding if new opportunities arise during the project. While this sounds almost too good to be true, previous grantees have all said that it actually works this way!

So, how do I plan to spend 10 mill. DKK (1.3 mill. EUR)?

Microbial communities underpin all processes in the environment and have a direct impact on human health. Despite their importance, only a tiny fraction of the millions of different microbes is known, mainly due to the immense difficulty of cultivating microbes from natural systems in the laboratory. This unknown majority is also known as the “microbial dark matter”.

For any microbe, the genome is the blueprint of its physiological properties. Having this in hand, it is possible to reconstruct its potential metabolism and establish hypotheses for evolution, function and ecology. Furthermore, it provides a foundation for further validating its function through a variety of in situ methods. However, genomes are extremely difficult to obtain from the microbial dark matter.

Currently, multiple metagenomes combined with bioinformatic approaches are used to retrieve individual genomes from complex samples (see e.g. our paper from 2013). This has led to numerous fundamental discoveries, including the discovery of bacteria capable of complete ammonia oxidation (comammox, see here and here), which radically changed our view of the global nitrogen cycle and granted us the “Danish research result of the year, 2015”.

However, we are still far from realizing the full potential of metagenomics to retrieve genomes, mainly due to the complexity of nature, where multiple closely related strains co-exist, which renders the current approaches useless.

Using the VILLUM YIP grant we want to use cutting-edge DNA sequencing related techniques to enable access to all genomes despite strain-complexity, link genomes, plasmids and phages, and enable direct measurements of in situ bacterial activity. The ability to readily obtain activity measurements of any bacteria, in any microbial ecosystem, will radically change microbial ecology and environmental biotechnology.

Obtaining complete bacterial genomes

Retrieving individual bacterial genomes from complex microbial communities can be compared to mixing hundreds of puzzles with millions of pieces, all containing different shades of blue sky. However, one way to circumvent the problem of closely related strains is to use bigger pieces of DNA to assemble the genomes. The current standard approach is to use short-read sequencing (Illumina; approx. 2 x 250 bp). However, the rapid development within long-read DNA sequencing means that it is possible to start to experiment and envision how this is going to be solved.

The newest technology on the long-read market is Oxford Nanopore. It has successfully been used to generate complete genomes from pure cultures, and we have used it for metagenomics of simple enrichment reactors to obtain the first complete comammox genome. We have been early access testers of the MinION and are currently involved in the developer program. The improvement of the technology in the first half of 2016 means that the quality and throughput are now sufficient to attempt medium-complexity metagenomes. Furthermore, we are one of the early customers of the high-throughput version of the MinION, the PromethION, which, in theory, would allow us to tackle even complex metagenomes.

Furthermore, while long-read DNA sequencing might enable closed bacterial chromosomes, these are still not associated directly with e.g. plasmids and phages. However, over the last couple of years several new methods have appeared, e.g. Hi-C and 3C, that utilize physical cross-linking of the DNA inside cells to generate sequencing libraries from which proximity information can be retrieved. This information can then be used to infer which genetic elements were in close proximity, and thereby originated from the same bacterial cell. Until now, the methods have only been used in microbial communities of limited complexity, but there do not seem to be theoretical limits that would hinder their use if complete genomes are available.


Measuring in situ replication rates using DNA sequencing

An exciting new possibility is that complete genomes enable measurements of bacterial replication rates directly from metagenomic data (see here and here). The method is very simple and based on the fact that the majority of bacteria start replication at a single origin and then proceed bi-directionally. Hence, in an actively dividing bacterial population there will be more DNA at the origin of replication than at the terminus. This can be directly measured using DNA sequencing, as the number of reads is proportional to the amount of DNA in the original sample. Hence, by comparing the number of reads (coverage) at the origin to that at the terminus, a measure of the bacterial replication rate is obtained. This allows direct observations of individual bacterial responses to stimuli in the environment, even in very complex environments such as the human gut, and with sub-hourly resolution. This type of information has been the dream of microbial ecologists since the field emerged over 100 years ago and will allow for countless new experiments within microbial ecology. Recently, the method has even been demonstrated to work with high-quality metagenome bins (see here). It is going to be interesting to further explore the potential and limitations of the method using complete genomes at an unprecedented scale.
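The coverage comparison behind this idea can be sketched in a few lines of Python. This is a minimal illustration, not any of the published tools; the function name, the smoothing approach and the window size are my own choices:

```python
import numpy as np

def peak_to_trough_ratio(coverage, window=10_000):
    """Estimate the replication rate as the ratio between read coverage
    at the origin of replication (peak) and the terminus (trough).

    coverage : 1-D array of read depth along a closed, complete genome,
               ordered by genome position.
    window   : moving-average window (in positions) to damp local noise.
    """
    cov = np.asarray(coverage, dtype=float)
    # Smooth with a circular moving average, since the chromosome is circular:
    # pad each end with the opposite end before convolving, then trim.
    kernel = np.ones(window) / window
    padded = np.concatenate([cov[-window:], cov, cov[:window]])
    smooth = np.convolve(padded, kernel, mode="same")[window:-window]
    # Peak ~ origin of replication, trough ~ terminus.
    return smooth.max() / smooth.min()
```

A non-replicating population has flat coverage and a ratio near 1, while an actively dividing one shows a ratio above 1, since reads pile up around the origin.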

A few closing remarks

I am thrilled to have the next five years to explore how we can apply new DNA sequencing methods to understand the bacterial world and have the chance to build up a group of young scientists that share my excitement! If you think the project sounds great and either want to collaborate or work with us – then drop me an email!

Finally, I have to thank the people and mentors that made this possible. First of all, my long-term mentor Per H. Nielsen; 6 years ago he introduced me to the world of microbial communities, and throughout the years he has given me the freedom to pursue my own ideas – “freedom with responsibility” as we say in Danish. A leadership style that I very much try to adopt in my own newfound role as group leader. Secondly, my colleagues and friends Søren M. Karst and Rasmus H. Kirkegaard, whom I have persuaded to join me on further adventures down the rabbit hole! Furthermore, the long list of collaborators over the past years, where I have been fortunate to learn from some of the best scientists in the world (if you ask me). There are too many to mention, but a special thanks goes out to Holger Daims, Michael Wagner, Gene Tyson, and Phil Hugenholtz.