In my current postdoc I work on genome analysis of multiple thermophilic bacteria belonging to the phylum Thermotogae.
The data for such analyses comes from DNA shotgun sequencing, a technique where DNA is broken up into many small pieces that are sequenced, after which sophisticated algorithms reconstruct the original sequence on our computers. In a perfect world we could combine the shotgun sequences into a complete genome sequence without any holes. But as we know, the world is not perfect. Many issues can prevent us from recreating the original genome sequence of our bug of interest. This blog post is an example of what stops us.
Recently, I was asked to assemble the genome of a bacterium isolated from an oil field using a paired-end Illumina MiSeq dataset. (A neat explanation of what paired-end sequences are can be found here.) My collaborators came to me because they only got very fragmented genome assemblies (more than 400 contigs) using the CLC genome assembler. We thought that the problem was the assembly software, so my job was to try other assemblers on the data, such as Velvet. However, in the process we figured out that the quality of the shotgun data was the real problem. The shotgun sequence library was created with the Illumina Nextera DNA sample preparation kit.
Quality control of Illumina sequence data
The data that we got from the sequencing lab had already been processed. The sequences came in two files, one with the forward paired-end sequences and one with the corresponding reverse paired-end sequences, and the lab had removed the adapters and barcodes from all the sequences. Since this was Illumina data, the first thing I did was to peek at the data quality. I use the Fastqc software for this, which generates several informative plots about the data. The website hosting the Fastqc software has examples of good and bad Illumina data. In the figure below you see the quality scores of our troublesome dataset. They are okay for the forward sequences, but quite awful for the reverse sequences.
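For reference, a Fastqc run is a one-liner on the command line; the read file names here are placeholders for our two data files:

```bash
# Run Fastqc on both read files; it writes an HTML report per file
mkdir -p fastqc_reports
fastqc -o fastqc_reports phalo_R1.fastq.gz phalo_R2.fastq.gz
```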
But this is not the real trouble. We can improve our data by removing sequences and bases with poor quality scores. The real trouble is seen in the next figure, which I only show for the reverse sequences (Fastqc gave identical plots for the forward sequences).
The figure above shows a few things. Most importantly, between base positions 25 and 230 the GC-content is 35 %, which represents the average GC-content of our genome. The same is shown on the left for each nucleotide separately. The problem with our data lies in the first 25 bases and the last 20 bases, which have a GC-content that deviates from the genome average. My first idea was that this was due to adapters still being present despite the processing done by the lab.
But on second thought, the lab's processing should have left only adapters and barcodes that were not recognized, and here it is clear that the majority of the sequences are affected. Something else is happening here. While I was struggling to understand what was wrong with the data, Anthony Underwood posted a similar plot on Twitter.
Some of the people commenting on his tweet suggested that the problem was due to the library preparation with the Nextera kit, and Anthony confirmed that his data had also been generated with it. The Nextera kit uses a transposase to cut the genome at random locations. The problem is that the enzyme might not actually cut randomly, and the GC-content bias in our dataset could be an indication of just that.
Quality control with preqc
Since Fastqc indicated that something was wrong with the dataset, I wanted to explore it a bit more. An alternative option for quality control of sequence data is found in the SGA assembly software, which comes with a module called preqc. This module takes sequence data and estimates how complex the genome assembly is going to be. The results are presented in a PDF report.
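A preqc run follows the example workflow that ships with SGA and looks roughly like this (read file names are placeholders):

```bash
# Interleave the paired-end files into a single file for SGA
sga preprocess --pe-mode 1 phalo_R1.fastq phalo_R2.fastq > phalo.fastq
# Build the FM-index that preqc needs (ropebwt is the fast construction algorithm)
sga index -a ropebwt -t 8 --no-reverse phalo.fastq
# Run preqc and turn its output into a PDF report; giving the report script
# more than one .preqc file plots the datasets side by side
sga preqc -t 8 phalo.fastq > phalo.preqc
sga-preqc-report.py phalo.preqc
```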
Here I used preqc on the raw Phalo data and compared it to an Illumina MiSeq dataset for a different Thermotoga strain that was not prepared with the Nextera kit. The Norwegian Sequencing Centre provided us with a Fastqc report for the forward and the reverse sequences of that dataset. The only thing I did was to remove the adapters and barcodes myself using AdapterRemoval.
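For those curious, that adapter removal step was a single command along these lines (AdapterRemoval v2 syntax, with placeholder file names for the comparison dataset):

```bash
# Strip adapters from the paired-end reads of the comparison strain;
# output files are named after --basename
AdapterRemoval --file1 t1070_R1.fastq --file2 t1070_R2.fastq --basename t1070
```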
The preqc comparison
The full preqc report displays all the figures and information coming out of preqc. Not all of that is interesting for us, so I focus on the differences between the datasets compared here.
The most striking result in the figure above is that the k-mer (a DNA sequence of length "k") coverage versus GC-content differs between the two datasets (figures E and F). A "good" dataset would show a distribution like that of 1070, with some k-mers at really low coverage, e.g. found only once, and the bulk of the k-mers in a relatively narrow coverage range (here 200-300). Our problem dataset shows k-mer coverages from 1 all the way up to 350. Furthermore, coverage seems to increase with GC-content. I believe that one would not see such a smear if the genome had been fragmented randomly. Something is odd with the Phalo data, and the question is whether we can do anything to correct this dataset.
The first thing that comes to mind is to trim away the first 25 bp and the last bases of the reads, because of the clear GC-content bias in those sections (see figure C above as well). I used fastx_trimmer from the FASTX-Toolkit to do this, and then compared the trimmed data to the original dataset with preqc.
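The trimming itself looks roughly like this; the exact coordinates are illustrative and the file names are placeholders:

```bash
# Keep bases 26 through 230 of every read, i.e. drop the first 25 bp and the
# biased tail; -Q33 tells the tool the quality scores use the Sanger/Illumina 1.8+ encoding
fastx_trimmer -Q33 -f 26 -l 230 -i phalo_R1.fastq -o phalo_R1.trimmed.fastq
fastx_trimmer -Q33 -f 26 -l 230 -i phalo_R2.fastq -o phalo_R2.trimmed.fastq
```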
So the trimming had an effect at the 5′ end of the sequences: the per-position error rate plot shows that the initial bases present in the raw data are now trimmed away. We also see that the maximum k-mer coverage of the dataset has dropped from 350 to 300. The big error peak is still there, and more trimming could remove that as well, but that would further reduce the read length and might not help our assembly.
Performing error correction
Another option is to error correct the reads. Some assembly software comes with error correction tools that try to reduce the number of badly called bases. For instance, the SGA assembler and the SPAdes assembler each come with their own error correction method. The error correction in the SPAdes assembler v3.0 is done within the program and there are no settings you can tweak. The SGA assembler comes with two modules that can help reduce the noise in the data: "correct" and "filter". The filter module removes reads that have low-abundance k-mers and/or exact duplicates; the default is to check for both. With the SGA modules you can play around with the settings, but for simplicity I used the defaults here. I first ran the correct module on the raw Phalo data and then the filter module. With the SPAdes assembler I only ran the error correction on the raw Phalo data. Below I compare the preqc results for the three treatments tested here.
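The commands for the two error correction treatments look roughly as follows, with default settings and placeholder file names; the SGA steps mirror the example workflow that ships with the assembler:

```bash
# SGA: index the raw reads, correct them, then re-index and filter
sga preprocess --pe-mode 1 phalo_R1.fastq phalo_R2.fastq > phalo.fastq
sga index -a ropebwt -t 8 --no-reverse phalo.fastq
sga correct -t 8 -o phalo.ec.fastq phalo.fastq   # default k-mer based correction
sga index -a ropebwt -t 8 phalo.ec.fastq         # filter needs an index of the corrected reads
sga filter -t 8 phalo.ec.fastq                   # default: k-mer and duplicate checks

# SPAdes v3.0: run only its built-in error correction (BayesHammer) module
spades.py --only-error-correction -1 phalo_R1.fastq -2 phalo_R2.fastq -o spades_ec
```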
The first thing I check is the assembly simulation plots generated by preqc.
Here we see that the datasets are all very similar when comparing simulated contig lengths and the frequency of repeat branches in the assembly de Bruijn graphs. The SGA error-corrected and filtered dataset shows a higher frequency of variant branches and fewer error branches than the other treatments. Interestingly, with increasing k-mer size the frequency of error branches for the SPAdes error-corrected data drops towards a value closer to that of the SGA filtered dataset.
When we take a closer look at the sequences, we see that the SGA filtered dataset has almost no duplicates left. In addition, its estimated fragment insert size distribution shows a sharp increase in proportion at 250 bp (not sure why…), while its error rate is almost zero. The SPAdes error correction dramatically reduces the major peak at about k-mer 40, and the start of the sequences has lower error rates as well. It is clear from this graph that the SPAdes error correction does a lot more than the standard SGA error correction.
The GC-content versus k-mer coverage plot is much narrower for the filtered dataset, where the maximum coverage dropped to about 125. The SPAdes error correction only shows an improvement in the form of a reduction of low-coverage k-mers.
The question now is: do any of my data manipulations improve the assembly of this particular bacterial genome?
Doing the assemblies
For simplicity I decided to run an assembly on each of the trimmed and error-corrected datasets using the Velvet assembler. For k-mer selection I used the VelvetOptimiser script, with k-mers ranging from 21 to 127 and a step size of 5. The best assembly was chosen using an optimization based on N50. The results of the three assemblies are in the table below.
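Such a VelvetOptimiser run looks roughly like this (placeholder file names; it assumes Velvet was compiled with a MAXKMERLENGTH of at least 127):

```bash
# Try k-mers 21-127 in steps of 5, keeping the assembly with the best N50
VelvetOptimiser.pl -s 21 -e 127 -x 5 -t 4 --optFuncKmer 'n50' \
    -f '-shortPaired -fastq -separate phalo_R1.fastq phalo_R2.fastq'
```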
Looking at these results, I conclude that none of my data corrections really improved the assemblies. Having more than 2000 contigs for a 2 million base pair genome is bad.
Based on these results, we decided to send the data back to the sequencing lab, so they can sort out what needs to be done here. My impression is that in this situation the decision to use Nextera for library preparation was a bad one. I do not want to disqualify the Nextera kit here, since it might be a good kit in the proper hands. However, for de novo genome assembly the Nextera kit does not seem to be the right option; at least, the above analysis indicates as much for this particular genome. For de novo genome assembly you need high-quality, high-coverage data, and in my opinion both the Nextera kit and the sequencing lab failed to provide such data here.
And I end this post by answering the question in the title: no, it is not possible to improve a bad dataset with the methods I have used. Rubbish in is rubbish out.