ampvis2: The bread and butter of our amplicon analyses: amp_heatmap!

(This post is part of a series of ongoing posts about selected ampvis2 features)

Now that the ampvis2 R package has its own paper on bioRxiv, it is a good time to write a small blog post about our beloved heatmap, which the Albertsen lab and the Center for Microbial Communities use in practically every amplicon analysis.

Why not stacked bar charts?

Getting a simple overview of the microbial community composition of environmental samples can be done in different ways. However, when both the composition and the abundance need to be visualised, very few options exist. For visualisation at higher taxonomic levels (mostly phylum level) a common approach is a stacked bar chart, which can show the relative abundances of all taxa in each sample (example) for comparison. Sadly, stacked bar charts have somehow found their way into microbial ecology even though they are best used for much simpler data and to illustrate gradual changes between values, not large compositional differences as in microbial community data. One reason is that only about a dozen or so colors are easily distinguishable by the human eye. There are simply not enough colors to cover even a fraction of the diversity in microbial community data. Another reason is that it can be very difficult, if not impossible, to read the exact value of each constituent bar (the abundance of a taxon in a sample), as it is shown as an interval. You would first have to judge what the lower and upper limits are and calculate the difference, and even then the value would only be approximate. Have a look at this example plot showing the species level composition of a range of samples obtained by 16S amplicon sequencing:

 

This just makes you reconsider why you would ever visualise anything. If the purpose of graphical visualisation is not to simplify and communicate an important message that would otherwise require pages of words to convey, then there are just no reasons to do it. For example, there are probably 7-8 different shades of green. Which one is E. coli? I can’t tell, but maybe you can. In any case, there should not be any doubt. When it comes to reading the relative abundances of the taxa in the samples, in most cases “around” X % abundance is fine to answer most research questions (there are biases anyways), but when you can’t tell the difference between taxa, the plot almost becomes useless.

Rethinking the heatmap

Enough hate on stacked bar charts. These are some of the main reasons why we had to come up with something different that compromises neither precision nor simplicity and is still pleasant to the eye. The result is amp_heatmap, which has several additional features compared to other software tools that can also generate heatmaps. These features are demonstrated in the following. The example data is from the activated sludge of two wastewater treatment plants (WWTPs) in Aalborg and is included in the ampvis2 R package.

A heatmap is basically a grid of colored tiles with samples on the x-axis and taxa on the y-axis, where the abundance (relative or absolute) of each taxon in each sample is indicated by coloring the tiles by a gradient of usually 2 or 3 colors. The colors are then used to identify patterns between the samples, and the exact abundance values can be written on top of each tile. Here is an overview of the 10 most abundant genera in all 16 samples in the dataset and the corresponding R code to generate the plot. By default the abundances are shown relative to the total in each sample:

amp_heatmap(
  data = data,
  tax_aggregate = "Genus"
)

 

Simple, right? We have all the information we need in terms of composition and abundance in one plot. All OTUs are aggregated to a specific taxonomic level of choice, and the taxa on the y-axis are automatically sorted by the cumulative abundance of each taxon across all samples. amp_heatmap particularly stands out in the way that it can incorporate sample metadata to dynamically group samples based on one or more metadata variable(s). These variables can contain any relevant information about the samples (for example when and where the samples were taken, physical parameters like pH or temperature, etc.) and amp_heatmap then calculates the mean (or even median, minimum, or maximum) abundance of each taxon in each group of samples as defined by the variable. For example, we can show the yearly average for each WWTP by setting the group_by argument (figure A below). A different way to show the same information is to facet on one of the variables instead by setting the facet_by argument (figure B below).

#plot_A:

amp_heatmap(
  data = data,
  tax_aggregate = "Genus",
  group_by = c("Plant", "Year")
)

#plot_B:
 
amp_heatmap(
  data = data,
  tax_aggregate = "Genus",
  group_by = "Plant",
  facet_by = "Year")

 

We can come a long way with just the group_by and facet_by features alone, but we may also be interested in looking at a different taxonomic level (by setting tax_aggregate), for example family level, add more taxonomic information (tax_add), or show more of the most abundant taxa (tax_show). Furthermore, we can hide the values (plot_values = FALSE) if we are more interested in patterns between the groups and not the exact values, or even adjust the color scale (plot_colorscale) and the colors themselves (color_vector):

amp_heatmap(
  data = data,
  group_by = "Plant",
  facet_by = "Period",
  tax_show = 20,
  tax_aggregate = "Family",
  tax_add = "Phylum",
  plot_values = FALSE,
  plot_colorscale = "sqrt",
  color_vector = c("royalblue4",
                   "whitesmoke",
                   "darkred")
)

 

To compare the most abundant taxa in a single sample to all other samples, use the sort_by argument (in this case the sample “16SAMP-3941”). This is also useful to check control samples. As mentioned, by default the read counts are normalised, but this can of course be disabled by setting normalise = FALSE:

amp_heatmap(
  data = data,
  normalise = FALSE,
  tax_aggregate = "Genus",
  sort_by = "16SAMP-3941"
)

 

Last but not least, we primarily analyse samples from wastewater treatment systems, so we have made a nice way to show known biological functions of Genus level taxa next to the heatmap according to the MiDAS field guide database. Furthermore, when using group_by to group samples, the x-axis will be alphabetically ordered, so in the case of seasonal period, it makes more sense to order it chronologically (by setting order_x_by):

amp_heatmap(
  data = data,
  group_by = "Period",
  tax_aggregate = "Genus",
  tax_show = 20,
  plot_functions = TRUE,
  functions = c("FIL", 
                "AOB",
                "NOB",
                "PAO",
                "GAO"),
  order_x_by = c("Spring",
                 "Summer",
                 "Fall",
                 "Winter"),
  rel_widths = c(0.6, 0.4)
)

Conclusion

I hope this will convince you to try out amp_heatmap from the ampvis2 R package. Only a selection of features has been demonstrated here. There are even more, so have a look at the amp_heatmap online documentation. We have been using it for quite some years and it has stood the test in many of our analyses, papers and posters. It gives a nice, informative overview of any microbial community data obtained by amplicon sequencing, while still being eye-candy. In fact, it can even be used for transcriptomics as well.

 

What about pie-charts? I love pie-charts!

We certainly don’t, and definitely not for microbial community data. Let Max Roser explain why:

Populating the tree-of-life

 

Hi everybody and welcome to my first blog post at Albertsen Lab. As a newly started PhD student, I have engaged myself with the simple, yet Herculean task of populating the tree-of-life. As most people are aware, microorganisms are more or less inescapably present in all places of the world. No matter how hostile an environment you encounter, it will probably accommodate some sort of small living organisms. As it was once elegantly put: “If you don’t like bacteria, you’re on the wrong planet”. Or in this case, if you don’t like bacteria, you’re in the wrong blog post. Recently, a study was published by Kenneth J. Locey & Jay T. Lennon (2016) trying to illuminate just how omnipresent microorganisms are. This was carried out by applying scaling laws to predict the global microbial diversity. Main conclusion: our planet hosts up to 10^12 microbial species. That is 1 trillion species! Of course, one trillion microbial species is still a number that is dwarfed by the approximately 10^30 bacterial and archaeal cells living on Earth. Since incredibly large numbers tend to be hard to grasp, the article kindly supplied some illustrative examples. For example, the ~10^30 bacterial and archaeal cells exceed the 22 orders of magnitude that separate the mass of a Prochlorococcus cell from a blue whale. It also exceeds the 26 orders of magnitude that result from measuring Earth’s surface area at a spatial grain equivalent to bacteria.

From a perspective concerning my research, it is naturally only the so-called microbial dark matter (microbes not yet discovered) that is of interest. Fortunately, only an infinitesimal number of microorganisms have been discovered to date. For some reason, it is currently a prerequisite to have a bacterium growing in a pure culture before naming it. Alas, if you take a quick look at DSMZ’s homepage (one of the largest bioresource centers worldwide, located in Germany), you will find that they boast a collection containing around 31,000 cultures representing some 10,000 species and 2,000 genera in total. Only a tiny bit short of one trillion. On a side note, I guess DSMZ will eventually face some kind of capacity-related problems if we insist on requiring a pure culture of each new species before an ‘official’ name can be granted. Luckily, nowadays bacterial species can also be cataloged by their DNA sequences. Currently, one of the most widespread methods for identifying bacteria is 16S rRNA amplicon sequencing. Large databases such as SILVA and the Ribosomal Database Project (RDP) use huge amounts of digital ink to keep up with the ever-increasing influx of new sequences. The current version 128 of SILVA (released in September 2016) includes over 6,300,000 SSU/LSU sequences, whereas RDP Release 11 contains 3,356,809 16S rRNA sequences (also released in September 2016). Although this is definitely a lot more than 10,000 pure cultures, it is still ridiculously little compared to the total estimated microbial diversity of the Earth. Hence, as I begin my PhD study, estimates from Kenneth J. Locey & Jay T. Lennon suggest that potentially 99.999% of all microbial taxa remain undiscovered!

This may sound like my research is going to be a cakewalk. After all, I should be able to go to the lab and find novel microorganisms literally by accident. However, I have decided to drastically confine my explorations of microbial dark matter to a specific environment, although traveling around the world chasing novel bacteria would be pretty cool. So far, my primary focus has been sampling of microbial biomass from drinking water. You may think it is a bit weird to look for novelty in a resource where one of the main goals is to keep living things out. However, there is more than one good reason for this choice. 1) I worked with drinking water during my master’s thesis. Thus I already have experience with sampling and extraction procedures as well as a few connections with people working in this field. 2) Recently, articles such as Antonia Bruno et al. 2017, Karthik Anantharaman et al. 2016 and Birgit Luef et al. 2015 have illustrated that drinking water may potentially contain a large portion of microbial novelty. 3) Unless you are living in a third world country, gaining access to water is not very complicated. But just to clarify, do not mistake “easy access” for “easy sampling”, as the sampling of microorganisms from drinking water can be a king-sized pain in the ass.

A drastically under-represented visualization of the microbial diversity in drinking water.

As you might have guessed, I am not going to dedicate much time to trying to make bacteria living in large water reservoirs deep below the surface grow on plates in the lab. For identification, I am instead going to one-up the conventional 16S rRNA amplicon sequencing method. A new technique developed by some of the people in the Albertsen Lab has shown very promising results, generating full-length 16S rRNA gene sequences. Hopefully, I will have the opportunity to address it in a bit more detail in a later blog post. However, the first hurdle is simply collecting sufficient amounts of biomass for subsequent extraction. As the protocol states, the input material is ~800 ng RNA. If you, like many of my colleagues at Albertsen Lab, have been working with wastewater, 800 ng RNA is no big deal. Getting the same amount from drinking water is a big deal. Drinking water typically has bacterial concentrations in the range of 10^3 – 10^5 cells/ml, which is why collecting adequate amounts of biomass is difficult. I naively started out using the same sampling setup that I used in my master’s thesis (where 1 ng/µl DNA would be plenty for further analysis). It basically consisted of a vacuum pump and a disposable funnel, and after spending too many hours pouring several liters of water into a puny 250 ml funnel, I ended up with negligible amounts of RNA. This is the point where you break every single pipette in the lab in despair and tell your supervisor that the current sampling method is not feasible. Instead, the sampling task has been partly outsourced to people with more specialized equipment. So I have worked with samples based on 100+ liters of water yielding RNA amounts of more than 800 ng, but the workflow still needs some optimization.

Another aspect complicating the sampling step is the really obvious fact that bacteria are really small. I am perfectly aware that all bacteria can, with very good reason, be categorized as “small”, so my statement refers to the aforementioned article from Birgit Luef et al. 2015, “Diverse uncultivated ultra-small bacterial cells in groundwater”. The article highlights findings that demonstrate how a wide range of bacteria can pass through a 0.2 µm-pore-sized membrane filter. FYI, filtration with a 0.2 µm filter is also commonly referred to as “sterile filtration”. One could argue it is a poorly chosen term for a filtration type that apparently allows numerous types of bacteria to pass the membrane unhindered. Also, sterile filtration is by far the most used sampling method in papers concerning 16S rRNA amplicon sequencing of drinking water. During my master’s thesis, I also used 0.2 µm filters for water samples. However, in the pursuit of novelty, the filter size has been reduced to 0.1 µm. This should, as far as I am concerned, capture all microbes inhabiting drinking water (although it is hard not to imagine at least one out of a trillion species slipping through).

Hopefully, I will start to generate full-length 16S rRNA sequences from novel bacteria in the near future and maybe share some interesting findings here on albertsenlab.org.

Can you beat our Nanopore read error correction? We hope so!

Tagging of individual molecules has been used as an effective consensus error-correction strategy for Illumina data (Kivioja et al 2011, Burke et al 2016, Zhang et al 2016), and the principle is similar to the circular consensus sequencing strategy used to generate consensus reads with an error rate of < 1 % on the PacBio (Travers et al 2010, Schloss et al 2016, Singer et al 2016) and the Oxford Nanopore platforms (Li et al 2016). You simply sequence the same molecule several times and compare the reads to generate a consensus with a better accuracy than the individual reads. As far as we know, a tag-based consensus error correction strategy has not been attempted for long reads before, probably because the raw error rate complicates identification of the tags. However, we see several benefits of the tag-based strategy in the long run, which is why we decided to pursue it.
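
To make the consensus idea concrete, here is a minimal sketch in shell (awk). It is not the method used in this post: it assumes the copies of a molecule are already aligned and of equal length, and it simply takes a majority vote at each position. Real nanopore reads contain insertions and deletions and therefore need to be aligned first. The function name majority_consensus and the toy reads are made up for illustration.

```shell
# Column-wise majority vote over equal-length reads, one read per line.
# Assumptions: reads are pre-aligned; only A, C, G, T are counted;
# ties go to the first base in A, C, G, T order; empty columns become N.
majority_consensus() {
  awk '
    {
      if (length($0) > len) len = length($0)
      for (i = 1; i <= length($0); i++) count[i, substr($0, i, 1)]++
    }
    END {
      n = split("A C G T", bases, " ")
      out = ""
      for (i = 1; i <= len; i++) {
        best = "N"; max = 0
        for (b = 1; b <= n; b++) {
          if (count[i, bases[b]] > max) { max = count[i, bases[b]]; best = bases[b] }
        }
        out = out best
      }
      print out
    }
  '
}

# Three toy copies of the same molecule; the second and third each carry
# one substitution error at a different position.
consensus=$(printf '%s\n' ACGTACGTAC ACGAACGTAC ACGTACCTAC | majority_consensus)
echo "$consensus"
```

Because the two errors sit at different positions, each is outvoted by the other two copies, which is the whole point of sequencing the same molecule several times.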

My colleague @SorenKarst tested tag-based error correction on the Nanopore MinION in connection with our work on generating primer free, full length 16S/18S sequences from environmental rRNA (see our bioRxiv paper: Karst et al 2016). The main approach used in the paper is based on Illumina sequencing inspired by Burke et al 2016, but moving to nanopore sequencing in the future would make the approach considerably easier. His approach was relatively “simple”: individual cDNA molecules were uniquely tagged at both ends with a 10 bp random sequence, then diluted to a few thousand molecules, amplified by PCR to generate thousands of copies of each molecule, and prepared for 2D sequencing on the Nanopore MinION. The resulting sequence reads were binned based on the unique tags, which indicated they originated from the same parent molecule, and a consensus was generated from each read bin. The approach was tested on a simple mock community with three reference organisms (E. coli MG1655, B. subtilis str. 168, and P. aeruginosa PAO1), which allowed us to calculate error rates.

For locating the unique tags we used cutadapt with loose settings to locate flanking adaptor sequences and extract the tag sequences. The tags were clustered and filtered based on abundance to remove false tags. As tags and adaptors contain errors, it can be a challenge to cluster the tags correctly without merging groups that do not belong together. Afterwards the filtered tags were used to extract and bin sequence reads using a custom perl script ;). For each bin we used the CANU correction tool followed by USEARCH consensus calling. By this very naive approach we were able to improve the median sequence similarity from 90% to 99%.
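
The tag-locating and binning steps can be sketched in a few lines of shell. This is a simplified illustration under stated assumptions, not the actual pipeline: it requires exact matches to the two 12 bp adaptor sequences used in the cutadapt commands (so it tolerates no sequencing errors), assumes one sequence line per FASTA record, takes the first occurrence of each adaptor, and groups reads by identical tag pairs instead of clustering them. The function names and the toy reads are hypothetical.

```shell
# Hypothetical helper: print "<5' tag><3' tag><TAB><read id>" for every FASTA
# record in which both flanking adaptors are found as exact matches.
extract_tags() {
  awk -v a5="AAAGATGAAGAT" -v a3="ATGGATGAGTCT" -v taglen=10 '
    /^>/ { id = substr($0, 2); next }
    {
      p5 = index($0, a5)                 # first exact hit of the 5 prime adaptor
      p3 = index($0, a3)                 # first exact hit of the 3 prime adaptor
      if (p5 == 0 || p3 == 0) next       # skip reads missing an adaptor
      start5 = p5 + length(a5)           # tag starts right after the adaptor
      start3 = p3 - taglen               # tag ends right before the adaptor
      if (start3 < start5 + taglen) next # tags would overlap, skip the read
      print substr($0, start5, taglen) substr($0, start3, taglen) "\t" id
    }
  '
}

# Toy input: r1 and r2 are copies of one molecule (r2 has an error in the
# insert), r3 is a different molecule, r4 lacks the 3 prime adaptor.
toy_fasta() {
  printf '>r1\nGGAAAGATGAAGATGTTCGTTATTCGTACTAGTTTTGGTCAAGCACCATGGACAATGGATGAGTCTGG\n'
  printf '>r2\nCCAAAGATGAAGATGTTCGTTATTCGTACTAGTTTAGGTCAAGCACCATGGACAATGGATGAGTCTCC\n'
  printf '>r3\nGGAAAGATGAAGATACACACACACCGTACTAGAAAAGGTCAAGCTGTGTGTGTGATGGATGAGTCTGG\n'
  printf '>r4\nGGAAAGATGAAGATGTTCGTTATTCGTACTAGTTTTGGTCAAGC\n'
}

# One line per tagged read, sorted so identical tag pairs are adjacent
bins=$(toy_fasta | extract_tags | LC_ALL=C sort)

# Bin sizes as "tag:count"; small bins could then be filtered out as false tags
bin_sizes=$(printf '%s\n' "$bins" | cut -f1 | uniq -c | awk '{ print $2 ":" $1 }')
printf '%s\n' "$bin_sizes"
```

With real data the exact-match requirement would discard most reads, which is why the actual workflow uses cutadapt with a tolerant error rate and clusters the tags at a reduced identity.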

We think this is a good start, but we are sure that someone in the nanopore community will be able to come up with a better solution to improve the error rate even further. The data is freely available and a short description of the sequence read composition is provided below. We are looking forward to hearing your input!

Ps. If you come up with a solution that beats our “quick and dirty” one and post it here or on twitter, I will make sure to mention you in my slides at ASM ;).

 

Data and script availability:

The nanopore reads are available as fastq at: 2D.fq or fasta: 2Dr.fa and fast5: PRJEB20906

The 16S rRNA gene reference sequences: mockrRNAall.fasta

Scripts:

Our approach in a shell script
#############################################################################
#                                                                           #
# Shell script for generating error corrected FL16S and getting error rates #
#                                                                           #
# Use at your own RISK!                                                     #
#############################################################################

####################
#     Variables    #
####################
ID_adapt=0.1;
ID_cluster=0.8;
LINKtoCANU=/space/users/rkirke08/Desktop/canu/canu-1.3/Linux-amd64/bin;
# Update path to include poretools installation
# export PATH=$PATH:/space/users/rkirke08/.local/bin
####################
# End of variables #
####################
###############################################
# Depends on the following files and software #
###############################################
# folder with fast5 files "data/pass/"
# file with reference 16S sequences "mockrRNAall.fasta"
# perl script "F16S.cluster.split.pl"
# poretools
# cutadapt
# usearch8.1
# CANU
###############################################
# End of dependencies                         #
###############################################

# Extract fastq files
# poretools fastq --type 2D data/pass/ > data/2D.fq


# Rename headers (Some tools do not accept the long poretools headers)
#awk '{print (NR%4 == 1) ? "@" ++i : $0}' data/2D.fq | sed -n '1~4s/^@/>/p;2~4p' > 2Dr.fa

# Find adapters
cutadapt -g AAAGATGAAGAT -e $ID_adapt -O 12 -m 1300 --untrimmed-output un1.fa -o a1.fa 2Dr.fa
cutadapt -a ATGGATGAGTCT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o a1_a2.fa a1.fa

usearch8.1 -fastx_revcomp un1.fa -label_suffix _RC -fastaout un1_rc.fa
cutadapt -g AAAGATGAAGAT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o ua1.fa un1_rc.fa
cutadapt -a ATGGATGAGTCT -e $ID_adapt -O 12 -m 1300 --discard-untrimmed -o ua1_a2.fa ua1.fa

cat a1_a2.fa ua1_a2.fa > c.fa

# Extract barcodes
cut -c1-12 c.fa > i1.fa
rev c.fa | cut -c1-12 | rev > i2.fa

paste i1.fa i2.fa -d "" | cut -f1-2 -d ">" > i1i2.fa


# Cluster barcodes
usearch8.1 -cluster_fast i1i2.fa -id $ID_cluster -centroids nr.fa -uc res.uc -sizeout

# Extract raw sequences
perl F16S.cluster.split.pl -c res.uc -i c.fa -m 3 -f 50 -r 40
# Count number of files in directory
find clusters -type f | wc -l

FILES=clusters/*.fa
for OTU in $FILES

do
  wc -l "$OTU" >> lines.txt
done

FILES=clusters/*.fa
for OTU in $FILES

do
  OTUNO=$(echo $OTU | cut -f2 -d\/);
  # Rename header
  sed "s/>/>$OTUNO/" clusters/$OTUNO > clusters/newHeaders_$OTUNO

  # Correct reads using CANU
  $LINKtoCANU/canu -correct -p OTU_$OTUNO -d clusters/OTU_$OTUNO genomeSize=1.5k -nanopore-raw  $OTU

  # Unzip corrected reads
  gunzip clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta.gz

  sed -i "s/>/>$OTUNO/" clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta

  # Call consensus using Usearch
  usearch8.1 -cluster_fast clusters/OTU_$OTUNO/OTU_$OTUNO.correctedReads.fasta -id 0.9 -centroids clusters/OTU_$OTUNO/nr_cor_$OTUNO.fa -uc clusters/OTU_$OTUNO/res_$OTUNO.uc -sizeout -consout clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa

  sed -i "s/>/>$OTUNO/" clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa

  # Map reads to references to estimate error rate
  # Raw reads
  # Map FL16S back to references
  usearch8.1 -usearch_global clusters/newHeaders_$OTUNO -db mockrRNAall.fasta -strand both -id 0.60 -top_hit_only -maxaccepts 10 -query_cov 0.5 -userout clusters/map_raw_$OTUNO.txt -userfields query+target+id+ql+tl+alnlen

  # Usearch consensus corrected sequence
  # Map FL16S back to references
  usearch8.1 -usearch_global clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa -db mockrRNAall.fasta -strand both -id 0.60 -top_hit_only -maxaccepts 10 -query_cov 0.5 -userout clusters/map_cor_Ucons_$OTUNO.txt -userfields query+target+id+ql+tl+alnlen

  cat clusters/map_raw_$OTUNO.txt >> myfile.txt
  cat clusters/map_cor_Ucons_$OTUNO.txt >> myfile.txt

  # Collect corrected sequences
  cat clusters/OTU_$OTUNO/Ucons_CANUcor_$OTUNO.fa >> final_corrected.fa
done
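
To turn the mapping output into a single number such as the median similarity, something along these lines could be used. It is a sketch, assuming the tab-separated -userout layout requested in the script above (query, target, id, ql, tl, alnlen), so that the identity is in the third column. The function name median_identity and the example rows are made up.

```shell
# Median of the identity column (3rd tab-separated field) read from stdin
median_identity() {
  cut -f3 | LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }
  '
}

# Made-up example rows in the userfields order query+target+id+ql+tl+alnlen
median=$(
  {
    printf 'read1\tEcoli\t89.5\t1500\t1542\t1498\n'
    printf 'read2\tEcoli\t91.2\t1510\t1542\t1505\n'
    printf 'read3\tEcoli\t90.0\t1495\t1542\t1490\n'
  } | median_identity
)
echo "$median"
```

On the real output it could be run separately on the raw and the corrected mappings, for example with cat clusters/map_raw_*.txt | median_identity, since myfile.txt mixes both.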

Requirements:

cutadapt

USEARCH

a perl script F16S-cluster.split.pl

cDNA molecule composition

The cDNA molecules are tagged by attaching adaptors to each end of the molecule. The adaptor contains a priming site (red), the unique 10 bp tag sequence (blue) and a flanking sequence (black). Note that the design is complicated as we simply modified it from our approach to get Thousands of primer-free, high-quality, full-length SSU rRNA sequences from all domains of life.

AAAGATGAAGATNNNNNNNNNNCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTTTTTTTTTTTTTTTT<—- fragment of SSU cDNA molecule—->GGGCAATATCAGCACCAACAGAAATAGATCGCNNNNNNNNNNATGGATGAGTCT

The number of T’s before the fragment will vary between molecules because it is a result of the polyA tailing described in the paper. The black parts of the sequence are generally not needed for the purpose of Nanopore sequencing but are present in the molecule because they were needed for the Illumina sequencing.

Example Nanopore sequence read:

>18125
GATCTGGCTTCGTTCGGTTACGTATTGCTGGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTGGTCAAGCCTCACGAGCAATTAGTACTGGTTAACTCAACGCCTCACAACGCTTACACACCCAGCCTATCAACGTCGTAGTCTCCGACGGCCCTTCAGGGGAATCAAGTTCCAGTGAGATCTCATCTTGAGGCAAGTTTCCGCTTAGATGCTTTCAGCGGTTATCTTTTCCGAACATGGCTACCCGGCAATGCCACTGGCGTGACAACCGGAACACCAGAGTTCGTCCACCCGGTCCTTCCGTACTAGGAGCAGCCCCTCTCAAATTCAAACGTCCACGGCCAGATGGGGACCGAACTGTCTCACGACGTTCTAAGCCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTAGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATCGAGGTGCCAAACACCGCCGTCGATAAACTCTTGGGCGGTATCAGCCTGTTATCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCTTCCATACAGAACCACCGGATCTTCAAGACCTACTTTCGTACCTGCTCGACGTGTCTGCTCTGATCAAGCGCTTTTGCCTTTATATTCTCTGCGACCGATTTCCGACCGGTCTGAGCGCACCTTCGTGGTACTCCTCCGTTACTCTTTTAGGAGGAGACCGCCCCAGTCAAACTGCCCACCATACACTGTCCTCGATCCGGATTACGGACCAGAGTTAGAACCTCAAGCATGCCAGGATGGTGATTTCAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCCACCTAATCCTACACAGCAGGCTCAGTCCAGTGCCGCTACAGTAAAGGTTCACGGGGTCTTTCCGTCGCCGCGGATACACTGCATCTTCACAGCGATTTCAATTTCACTGAGTCTCGGGTGGAGACAGCGCCGCCATCGTTACGCCACTCGTGCAGGTCGGAACTTACCCGACAAGGAATTTCGCTACCTTGGACCGTTATCGTTACGGCCGCCGTTTACCGGGGCTAGATCAGGCTTCGCGCCCCATCAATACTTCCGGCACCGGGAGGCGTCACACTTATACGCCGTCCACTTTCGTGTTTTGCAGAGTGCTGTGTTTTTAATAAACAGTCGCAGCGGCCTGGTATCTTCGACCAGCCAGAGCTTACGGAGTAAATCCTTCACCCTAGCCGGCGCACCTTCTCCCGAAGTTACGGTGCCATTTGCCTAGTTCCTTCACCCGAGTTCTCAAGCGCCTTGGTATTCTCTACCCGACCACCTGTGTCGGTTTGGGTGCAGTTCCTGGTGCCTGAAGCTTAGAAGCTTTTGGAAGCATGGCATCAACCACTTCGTCGTCTAAAAGACGACTCGTCATCAACTCTCGGCCTTGAAACCCCGGATTTACCTAAGATTTCAGCCTACCACCTTAAACTTGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGGTGGAACGTTCTGTTTATGTTTCTGAAA

Example cluster with same random sequence:

>18125
GATCTGGCTTCGTTCGGTTACGTATTGCTGGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTGGTCAAGCCTCACGAGCAATTAGTACTGGTTAACTCAACGCCTCACAACGCTTACACACCCAGCCTATCAACGTCGTAGTCTCCGACGGCCCTTCAGGGGAATCAAGTTCCAGTGAGATCTCATCTTGAGGCAAGTTTCCGCTTAGATGCTTTCAGCGGTTATCTTTTCCGAACATGGCTACCCGGCAATGCCACTGGCGTGACAACCGGAACACCAGAGTTCGTCCACCCGGTCCTTCCGTACTAGGAGCAGCCCCTCTCAAATTCAAACGTCCACGGCCAGATGGGGACCGAACTGTCTCACGACGTTCTAAGCCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATACCCTTAGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATCGAGGTGCCAAACACCGCCGTCGATAAACTCTTGGGCGGTATCAGCCTGTTATCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCTTCCATACAGAACCACCGGATCTTCAAGACCTACTTTCGTACCTGCTCGACGTGTCTGCTCTGATCAAGCGCTTTTGCCTTTATATTCTCTGCGACCGATTTCCGACCGGTCTGAGCGCACCTTCGTGGTACTCCTCCGTTACTCTTTTAGGAGGAGACCGCCCCAGTCAAACTGCCCACCATACACTGTCCTCGATCCGGATTACGGACCAGAGTTAGAACCTCAAGCATGCCAGGATGGTGATTTCAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCCACCTAATCCTACACAGCAGGCTCAGTCCAGTGCCGCTACAGTAAAGGTTCACGGGGTCTTTCCGTCGCCGCGGATACACTGCATCTTCACAGCGATTTCAATTTCACTGAGTCTCGGGTGGAGACAGCGCCGCCATCGTTACGCCACTCGTGCAGGTCGGAACTTACCCGACAAGGAATTTCGCTACCTTGGACCGTTATCGTTACGGCCGCCGTTTACCGGGGCTAGATCAGGCTTCGCGCCCCATCAATACTTCCGGCACCGGGAGGCGTCACACTTATACGCCGTCCACTTTCGTGTTTTGCAGAGTGCTGTGTTTTTAATAAACAGTCGCAGCGGCCTGGTATCTTCGACCAGCCAGAGCTTACGGAGTAAATCCTTCACCCTAGCCGGCGCACCTTCTCCCGAAGTTACGGTGCCATTTGCCTAGTTCCTTCACCCGAGTTCTCAAGCGCCTTGGTATTCTCTACCCGACCACCTGTGTCGGTTTGGGTGCAGTTCCTGGTGCCTGAAGCTTAGAAGCTTTTGGAAGCATGGCATCAACCACTTCGTCGTCTAAAAGACGACTCGTCATCAACTCTCGGCCTTGAAACCCCGGATTTACCTAAGATTTCAGCCTACCACCTTAAACTTGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGGTGGAACGTTCTGTTTATGTTTCTGAAA
>42262
AGCGTTCAGATTACGTATTGCTAGGGGGCAAAGATGAAGATGTTCGTTATTCTTTAGACTTGCCTGTCGCTCTATCTTCTCTTTTTGGTCAAGCCTCACGGGCAATTAGTACTGGTTAGCTCAACGCCTCACAACGCTTACAGCCTATCAACGTCATAATTCTTCTGACGGCCCTTCAGAATCAAGTTCCCAGTGAGATCTCATCTTGAGCAAGTTTCCCACCGTCTTTCAGCGGTTATCTTTTCGAACCTGCTTCCAGCAATACCACTGGCGTGACAACCGGAACACCAGAGGTTCGTCCACTCCGGTCCTCTCCGTACTAGGGCAGCCCTCTCAAATCTCAGAACGTCCACGGCAGATAGGACCGAACTGTCTCACGACGTTCTAAGCAGCTCGCGTACCACTTTAAATGGCGAACAGCCATGCAGGACCGGCTTCGGCCCCAGGATGTGATGAGCCGGCATCGGGTGCCAAACACCGCCGTCGATATAAACTCGGGCATTGACCTGTTATCCCCGGGTACCTTTTTATCGTTGAGCGATGGCCCTTCCATACCAGAACCACCGGATCACTACAGACCTACTTTCGTACCTGCTCGCTGTCTGTCGCGGCCAAGCGCTTTTGCTATGCTCTGCGACCGATTTCCGACCGGTCTGGGCGCACCTTCGTACTCCGTTGCCTCTTTTGGAGACCGCTGATCAAACTGCCCACCATACACTGTCCTCGATCCGGATTACCAGAGTTTAGAACTCAATGCCAGGGTGGTATTTCAAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCACCTATCCTACAAGCAGGCTCAAAGTCCAGTACAACTACAGTAGGTTCACGGGGTCTTTCCGTCTAGCCGCGGATACCTGCATCTTCAGCGTTTCAATTTCACTGAGTCTCAGGTGGAGACAGCGCCGCCATCGTTACGCCATTCGTGCAGGTCGGAACTTGCCGACAAGGAATTTTGCACCTTGGGACCATTCGTTACGCCGTTTACCGGGGCTGATCAAGAGCTTGCTTGCGCTAACCCCATCAATTAATTTTTCCGGCACCGGGGAGGCGTCACACCTACGTCCCACTGCGTGTTTGCAGAGTGCTGTGTTTAATAAGTCGCAGCAGCTCAGTATCTTCGACCAGCCAGAGCTTACGGAGTAAATCTTCACCTAGCCGGCGACCTTCTCCCGGAAGTTACGGTGCCATTTGCCTAGTTCCTCCGCACCCGAAAGCCCTTCGCGCCTTGGTATTCTCTACCCGACCTGTGTCGGTTTGGGGCACGGTTCCTGGCCTGAAGCAGAAGCTTTTCTTGGAAGCCTGGCATCAACCACTTCGTCATCTAAAAGACGACTCGTCATCAGCTCTCGGCCTTGAAACCGGATTTACCTAAGATTTCAGCCTACCACCTTAAACTTGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGTGGAGACGTTCTGTTTATGTTTCTATC
>50101
CCCGGTTACGTATTGCTAGGGGCAAAGATGAAGATGTTCGTTATTCGTACTAGACTTGCCTGTCGCTCTATCTTCTTTTTGGTCAAGCCTGCGGGCAATTATACTGGATAGCTCAACGCCTCACAACGCATACACCCAGCTTCTATCAACGTCGTAGTCTTCGACGGCCCTTCAGGAATCAAGTTCCCAGTGAGATCTCATCTTGAGGCAAGTTTCCCGCTTAGATGCTTTCAGCGGTTATCTTTTCCGAACATAGCTACCCGGCAATGCCACTGGCGTGACAACCGGAACACCAGAGGTTCGTCCACTCCGGTCCTCTCGTACTAGGAGCAGCCTCTCAAATCAAACGTCCACGGCAGATATAGGGACCGAACTGTCTCACGACGTTTCTAAACCCAGCTCGCGTACCACTTTAAATGGCGAACAGCCACCCTTGGGACCGGCTTCAGCCCCAGGATGTGATGAGCCGACATCGGGAACAAACACCGCCGTCGATATAAACTCTTGGGCGGTATCAGCCTGTTATCCCCGGAGTACCTTTTATCCGTTGAGCGATGGCCCTTCCATACAGAACCACCGGATCACTAAGACCTACTTTCGTACCTGCTCGACGTGTCTGTCTCGCAGTCAAGCGCGCTTTTGCTTTATACTCTGCGACCGATTTCCGACCGGTCTGAGCGCACCTTCGTACTCTCCGTTACTCTTTAGGAGACCGCCCCAGTCAAACTGCCCACCATACACTGTCCCTATCGATCCGGATTACGGACAGAGTTAGAACCTCAAGCATGCCAGGGTGGTATTTCAAGGATGGCTCCACGCGAACTGGCGTCCACGCTTCAAAGCCTCCACCTATCCTACACAAGCAGGCTCAAAGTCCAGTGCAAAGCTACAGTAAGGTTCACGGGTCTTTCCGTCTAGCCGCGGATACACTGCATCTCCACAGCGATTTCACCTCACTGAGTCTCTCGGGTGGAGACAGCGCCGCCATCGTTACGCCATTCGTGCAGGTCGGAACTTACCGACAAGGAATTTCGCTACCTTAGACCGTTATCGTTACGGCCGCCGTTTACCGGGGCTTCGATCAAGAGCTTCGCTTGCGCTAACCCCATCAATTAACCTTCGGCACCGGGGAGGCGTCACACCCTATACGTCCACTTTCGTGTTTGCAGAGTGCTGTGGCTTTTAATAAACAGTCGCAGCGGCCTGGTATCTTTTCGACCAGCCAGAGCTTACGGAGTAAATCCTTCACCTTAGCCGGCGCACCTTCTCCCGAAGTTACGGTGCCATTTGCTAGTTCCTTCACCCGAGTTCTCTCAAGCGCCTTGGTATTCTCTACCCGACCACCTGTGTCGGTTTGGGGTACGGTTCTGGTTACCTGAAGCTTAGAAGCTTTTCTTGGAAGCATGGCATCAACCACTTCGTCGTCTAAAGACGACTCGTCATCAGCTCTCGGCCTTGAAACCCCGGATTTACCTAAAGATTTCAGCCTACCACCTTAAACTTAGGGGGCAATATCAGCACCAACAGAAACTCTCTATACCATGGACAATGGATGAGTCTGGGTGGAAGTTCTGTTTATGTTTCTTGAGC
