Is improvement possible? Quality control of an Illumina Nextera dataset for de novo genome assembly

In my current postdoc I work on genome analysis of multiple thermophilic bacteria belonging to the phylum Thermotogae.


Electron micrograph of a typical Thermotogae bacterium: Thermotoga maritima. Thermotogae cells (dark structure) are surrounded by a membrane called a “toga”, which is the light circle around the cell. Figure by K.O. Stetter.

The data for such analyses comes from DNA shotgun sequencing, a technique in which DNA chains are broken up into many small pieces, after which sophisticated algorithms reconstruct the original sequence on our computers. In a perfect world we could combine the shotgun sequences into a complete genome sequence without any holes. But as we know, the world is not perfect, and many issues can prevent us from recreating the original genome sequence of our bug of interest. This blog post is an example of what stops us.

Recently, I was asked to assemble the genome of a bacterium isolated from an oil field using a paired-end Illumina MiSeq dataset. (A neat explanation of what paired-end sequences are can be found here.) I got this question from my collaborators because they only obtained very fragmented genome assemblies (more than 400 contigs) using the CLC genome assembler. We thought that the problem was due to the assembly software, so my job was to try other assemblers on the data, such as Velvet. However, in the process we figured out that the sequence quality of the shotgun data was the real problem. The shotgun sequence library had been created with the Illumina Nextera DNA sample preparation kit.

Quality control of Illumina sequence data
The data that we got from the sequencing lab had already been processed by the lab. The sequences came in two data files, one with the forward paired-end sequences and one with the corresponding reverse paired-end sequences, and the lab had removed the adapters and the barcodes from all the sequences. Since we used Illumina sequencing, the first thing I did was to peek at the data quality. I use the FastQC software for this, which generates several informative plots about the data. The website hosting the FastQC software has examples of good and bad Illumina data. In the figure below you see the quality scores of our troublesome dataset. They are okay for the forward sequences, but quite awful for the reverse sequences.
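
For readers who want to repeat this first step: a minimal sketch of such a QC run, written in Python, is shown below. The read file names are placeholders, and I assume the fastqc executable is installed and on your PATH.

    # Minimal sketch: run FastQC on a pair of read files (file names are
    # placeholders; assumes the fastqc executable is on the PATH).
    import subprocess
    from pathlib import Path

    Path("qc").mkdir(exist_ok=True)  # FastQC expects the output directory to exist
    subprocess.run(
        ["fastqc", "--outdir", "qc",
         "reads_R1.fastq.gz", "reads_R2.fastq.gz"],
        check=True,
    )
    # FastQC writes one report (HTML plus a zip archive) per input file into qc/.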


Phred quality scores of the raw data for the forward and reverse reads. A base quality score of at least 20 is required to trust the DNA nucleotide identified.

But this is not the real trouble. We can improve our data by removing sequences and DNA bases with poor quality scores. The real trouble is seen in the next figure, which I only show for the reverse sequences (FastQC gave identical plots for the forward sequences).


FastQC plots showing the per-base sequence content (left) and the per-base GC-content (right). The two plots are complementary to each other. In the left plot, the four colors represent the bases A, T, G and C.

The figure above shows a few things. The most important is that between base positions 25 and 230 the GC-content is 35%, which represents the average GC-content of our genome. The left plot shows the same for each nucleotide separately. The problem with our data lies in the first 25 bases and the last 20 bases, which have a GC-content that differs from the genome average. My first idea was that this was due to adapters still being present despite the processing done by the lab.

But on second thought, that processing should have left only adapters and barcodes that were not recognized, and here it is clear that the majority of sequences have this issue. Something else is happening here. While I was struggling to understand what was wrong with the data, Anthony Underwood posted a similar plot on Twitter.

Some of the people commenting on his tweet suggested that the problem was due to library preparation with the Nextera kit, and Anthony confirmed that his data had also been generated using it. The Nextera kit uses a transposase to cut the genome at random locations. The problem is that the enzyme might not actually cut randomly, and the GC-content bias in our dataset could be an indication of just that.

Quality control with preqc
Since FastQC indicated that something was wrong with the dataset, I wanted to explore it a bit more. An alternative option for quality control of sequence data is found in the SGA assembly software, which comes with a module called preqc. This module takes sequence data and estimates how complex the genome assembly is going to be. The results are presented in a PDF report.
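
As a rough sketch, a preqc run follows the pattern below. I assume SGA and its report script are installed; the file names and thread count are placeholders, and the flags follow SGA's documented preqc example, so double-check them against your version.

    # Sketch of the preqc workflow: interleave the pairs, build an FM-index,
    # run preqc, then render the PDF report. Assumes SGA is on the PATH.
    import subprocess

    def run(cmd, outfile=None):
        # Run a command, optionally redirecting its stdout to a file.
        if outfile:
            with open(outfile, "w") as fh:
                subprocess.run(cmd, stdout=fh, check=True)
        else:
            subprocess.run(cmd, check=True)

    # 1) Interleave the forward/reverse pairs into a single file.
    run(["sga", "preprocess", "--pe-mode", "1",
         "reads_R1.fastq", "reads_R2.fastq"], outfile="reads.pp.fastq")
    # 2) Build the FM-index that preqc needs (--no-reverse suffices here).
    run(["sga", "index", "-a", "ropebwt", "--no-reverse", "-t", "8",
         "reads.pp.fastq"])
    # 3) Run preqc and render the PDF report.
    run(["sga", "preqc", "-t", "8", "reads.pp.fastq"], outfile="reads.preqc")
    run(["sga-preqc-report.py", "reads.preqc"])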

Here I used preqc on the raw Phalo data, and I compared it to an Illumina MiSeq dataset for a different Thermotoga strain (1070) that was not created with the Nextera kit. For that dataset, the Norwegian Sequencing Centre provided us with a FastQC report for the forward and the reverse sequences; the only thing I did was remove the adapters and barcodes myself using AdapterRemoval.

The preqc comparison

The full preqc report displays all the figures and information on the data coming out of preqc. Not all of that is interesting for us, therefore I focus on the differences between the datasets compared here.


Selection of preqc plots comparing the Illumina MiSeq datasets for two bacterial strains. A) Simulated contig lengths. B) k-mer position of first error. C) Per-position error rate. D) 51-mer count distribution. E) GC content vs k-mer coverage for phalophila. F) GC content vs k-mer coverage for 1070.

The most striking result in the figure above is that the k-mer (a DNA sequence of length “k”) coverage versus GC-content differs between the two datasets (figures E and F). A “good” dataset would show a distribution like that of 1070, with some k-mers at really low coverage (e.g. found only once) and the bulk of the k-mers in a relatively narrow coverage range (here 200-300). Our problem dataset shows k-mer coverages from 1 all the way up to 350. Furthermore, coverage seems to increase with GC-content. I believe one would not see such a smear if the genome had been fragmented randomly. Something is odd with the Phalo data, and the question is whether we can do anything to correct this dataset.

Basepair trimming

The first thing that comes to mind is to trim away the first 25 bp and the last bases of the reads, because of the clear GC-content bias in those sections (see figure C above as well). I used fastx_trimmer from the FASTX-Toolkit to do this, and then compared the trimmed data to the original dataset with preqc.
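
A sketch of that trimming step is shown below. The coordinates (keep bases 26 through 230) are my reading of the FastQC plots above, so treat them as an example rather than a recipe. Since fastx_trimmer only cuts bases and never drops reads, the forward and reverse files stay in sync.

    # Sketch: trim the biased head and tail off each read with fastx_trimmer.
    # -f 26 keeps reads from base 26 onwards (dropping the first 25 bases),
    # -l 230 keeps bases up to position 230, and -Q33 declares the
    # Sanger/Illumina 1.8+ quality encoding. File names are placeholders.
    import subprocess

    for infile, outfile in [("reads_R1.fastq", "trimmed_R1.fastq"),
                            ("reads_R2.fastq", "trimmed_R2.fastq")]:
        subprocess.run(
            ["fastx_trimmer", "-Q33", "-f", "26", "-l", "230",
             "-i", infile, "-o", outfile],
            check=True,
        )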


Preqc plots comparing the raw and the trimmed Phalo dataset for the k-mer position of the first error, the per-position error rate, and the GC-content vs k-mer coverage. Note the different scaling on the k-mer coverage bars. The full report of the comparison is found here.

So the trimming had an effect at the 5′ end of the sequences: the per-position error rate plot shows that the initial bases of the raw data have been trimmed away. We also see that the maximum k-mer coverage of the dataset has dropped from 350 to 300. The big error peak is still there, and more trimming could remove that as well, but it would reduce the read length further and might not help our assembly.

Performing error correction

Another option is to error correct the reads. Some assembly software comes with error correction tools that try to reduce the number of badly called bases. For instance, the SGA assembler and the SPAdes assembler each come with their own error correction method. The error correction in the SPAdes assembler V3.0 is done within the program and there are no settings you can tweak. The SGA assembler comes with two modules that can help you reduce the noise in the data: “correct” and “filter”. The filter module removes reads that have low-abundance k-mers and/or are exact-match duplicates; the default is to check for both. With the SGA modules you can play around with the settings, but for simplicity I used the default settings here. I first ran the correct module on the raw Phalo data and then the filter module. With the SPAdes assembler I only ran the error correction on the raw Phalo data. Below I compare the preqc results for the three treatments tested here.
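
In practice the two error correction routes look roughly like the sketch below, using the default settings as described above. The exact flags are my assumptions based on the tools' documented interfaces, and the file names are placeholders.

    # Sketch of the two error-correction routes with default settings.
    # Assumes sga and spades.py are on the PATH; file names are placeholders.
    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # --- SGA route: correct, then filter (each step needs an FM-index) ---
    run(["sga", "index", "-a", "ropebwt", "reads.pp.fastq"])
    run(["sga", "correct", "reads.pp.fastq", "-o", "reads.ec.fastq"])
    run(["sga", "index", "-a", "ropebwt", "reads.ec.fastq"])
    # By default, filter checks for both low-abundance k-mers and exact
    # duplicates, and writes its output next to the input
    # (here: reads.ec.filter.pass.fa).
    run(["sga", "filter", "reads.ec.fastq"])

    # --- SPAdes route: run only the built-in read error correction ---
    # The corrected reads end up in a corrected/ subdirectory of spades_ec/.
    run(["spades.py", "--only-error-correction",
         "-1", "reads_R1.fastq", "-2", "reads_R2.fastq", "-o", "spades_ec"])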

First, I check the assembly simulation plots generated by preqc.


Preqc plots comparing the simulated assemblies using the data from Phalo_raw, Phalo_ec (SGA error corrected), Phalo_ec.filter (SGA error corrected and filtered) and Phalo_Spades_ec_pairs (SPAdes 3.0 error corrected). The full report is found here.

Here we see that the datasets are all very similar when comparing the simulated contig lengths and the frequency of repeat branches in the assembly de Bruijn graphs. The SGA error corrected and filtered dataset shows a higher frequency of variant branches and fewer error branches than the other treatments. Interestingly, with increasing k-mer size the frequency of error branches for the SPAdes error corrected dataset drops towards a value closer to that of the SGA filtered dataset.


Preqc plots comparing the Illumina MiSeq reads before and after the different treatments: Phalo_raw, Phalo_ec (SGA error corrected), Phalo_ec.filter (SGA error corrected and filtered) and Phalo_Spades_ec_pairs (SPAdes 3.0 error corrected).

When we take a closer look at the sequences, we see that the SGA filtered dataset has almost no duplicates left. In addition, for the SGA filtered dataset the estimated fragment insert size distribution increases sharply at 250 bp (not sure why…), while the error rate is almost zero. The SPAdes error correction dramatically reduces the major peak at about k-mer position 40, and the starts of the sequences have lower error rates as well. It is clear from this graph that the SPAdes error correction does a lot more than the standard SGA error correction.


Preqc plots comparing the GC-content vs k-mer coverage before and after the different treatments: Phalo_raw, Phalo_ec (SGA error corrected), Phalo_ec.filter (SGA error corrected and filtered) and Phalo_Spades_ec_pairs (SPAdes 3.0 error corrected).

The GC-content versus k-mer coverage distribution is much narrower for the filtered dataset, where the maximum coverage dropped to about 125. The SPAdes error correction shows an improvement only in the reduction of low-coverage k-mers.

The question now is: do any of my data manipulations improve the assembly of this particular bacterial genome?

Doing the assemblies

For simplicity I decided to run an assembly on the various trimmed and error corrected datasets using the Velvet assembler. For k-mer selection I used the VelvetOptimiser script and ran it with k-mers ranging from 21 to 127 with a step size of 5. The best assembly was chosen using an optimization based on N50. The results of the assemblies are in the table below.
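
For reference, such a VelvetOptimiser run can be sketched as below. The k-mer range, step size and N50 optimization match what I describe above; note that Velvet must be compiled with MAXKMERLENGTH of at least 127 to scan this range, and the read file names are placeholders.

    # Sketch of the VelvetOptimiser run: scan k = 21..127 in steps of 5 and
    # keep the assembly with the best N50. Assumes Velvet was compiled with
    # MAXKMERLENGTH >= 127 and that your velveth accepts -separate
    # (otherwise interleave the pairs first). File names are placeholders.
    import subprocess

    subprocess.run(
        ["VelvetOptimiser.pl",
         "-s", "21", "-e", "127", "-x", "5",   # k-mer range and step size
         "--optFuncKmer", "n50",               # choose k by optimizing N50
         "-f", "-shortPaired -fastq -separate reads_R1.fastq reads_R2.fastq"],
        check=True,
    )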

Table with the Velvet assembly statistics for the datasets compared here (posted as a screenshot).

Conclusions

Looking at these results, I conclude that none of my data corrections really improved the assemblies. Having more than 2000 contigs for a 2 million base pair genome is bad.

Based on these results, we decided to send the data back to the sequencing lab, so they can sort out what needs to be done here. My impression is that in this situation the decision to use Nextera for library preparation was a bad one. I do not want to disqualify the Nextera kit here, since it might be a good kit in the proper hands. However, for de novo genome assembly the Nextera kit does not seem to be the right option; at least, the above analysis indicates that for this particular genome. For de novo genome assembly you need high-quality, high-coverage data, and in my opinion both the Nextera kit and the sequencing lab failed here in providing such data.

And I end this post by answering the question in the title: no, it is not possible to improve a bad dataset with the methods I have used. Rubbish in, rubbish out.


About Thomas Haverkamp

A microbial ecologist, an amateur photographer and a proud father of a tiny little girl.

8 Responses to Is improvement possible? Quality control of an Illumina Nextera dataset for de novo genome assembly

  1. Mabula says:

    Nice, once you know rubbish goes in, you can expect rubbish to come out… and I think you found a plausible explanation for the rubbish going in, in the first place 😉

  2. akorobeynikov says:

    It’s really interesting that you used SPAdes for error correction, but not for assembly. Any particular reason? It would be really nice if you added SPAdes results on these reads to your tables (at least in the default mode).

    Also, note that you cannot compare N50 values across the datasets, because the lengths of your assemblies are different.

    • Thomas Haverkamp says:

      Thanks for the comment. The only reason why I used Velvet over SPAdes was that I had been using Velvet before, and with this particular dataset I was not convinced that any assembler would work properly. Here at my lab a colleague ran an assembly with Celera, and that was sort of an improvement, but still over 300 contigs. I will try to add SPAdes results to the table in the next few days.

      One thing I am still confused about is what to do with all the small, low-coverage contigs that SPAdes produces. Any suggestions on how I should deal with them? I am guessing that most of them come from low quality reads, or from contamination.

      I did not know that you cannot compare N50 values between different assemblies, and I do not understand why. Do you have a good reference / explanation for that?

      • akorobeynikov says:

        Feel free to email your spades.log to spades.support@bioinf.spbau.ru, so we can double check that indeed stuff is bad 🙂

        SPAdes happily assembles all the data you feed into it. It does not do any filtering based on coverage or length. This is essential for single cell assemblies, but it may benefit multi-cell ones as well, because it may assemble low-coverage regions which appeared due to various biases. So it’s up to you how to proceed with them. Probably you can just remove all the contigs smaller than 500 bp when calculating N50 and similar statistics (QUAST does this by default, btw).

        As for N50: N50 gives you the median contig length with respect to the assembly length. So when you compare different assemblies you need to take this into account; the N50 value of the longer assembly tends to be smaller because of this, and even a small change in assembly length can cause a significant N50 drop, especially if N50 is high. Therefore for an honest comparison the use of NG50 is usually advised (there the genome length is the same for all the assemblies). If you do not have a reference, you can still provide QUAST with an estimate of the genome length (via the --est-ref-size option) and it will happily calculate NG50 with respect to the estimated genome size for you.

  3. Nick says:

    You are correct, the difference in the distribution of k-mer coverage between Phalo_raw and TT1070 in the GC plots is extreme, and is likely responsible for your poor assembly. But I would think it unlikely this is due to the Nextera preparation, at least based on our results with this kit. Have you checked that your original DNA isn’t contaminated, perhaps with another strain? Another thing to look at is that your insert size distribution has a lot of short fragments, which should have been removed in the library prep stage.

    • Thomas Haverkamp says:

      Thanks Nick, I do not know whether the DNA is contaminated, but that is for sure a thing to look at. As for the insert size, I have also been puzzled by all the short fragments. Currently, Illumina is checking with the sequencing lab what happened in the run. Maybe that will explain the short fragments…

      • akorobeynikov says:

        We have seen short fragments (and short reads) many times in Nextera datasets. This makes me think that this is a pretty common pattern of bias in the tagmentation process. We’re also seeing some other “interesting” artifacts, like pretty uneven coverage, which we’re trying to understand.
