Removing small & low coverage contigs from a Spades assembly

At the moment I am processing quite a few bacterial genome assemblies that I created for various Thermotogae species.

I am comparing assemblies created by CLC, Velvet and the latest SPAdes assembler with REAPR. Both CLC and Velvet were set to only keep contigs larger than 500 bp, but for SPAdes I have not found that restriction. In addition, I did not specify what the coverage cut-off should be in order to keep or thrash low coverage contigs.

So when I use SPAdes I end up with a few hundred contigs for my bacterial genome assemblies.

In order to process these assemblies and remove small and low coverage contigs I made a little bash one liner that includes the perl fastagrep.pl script.

I thought I share it since it is quick, efficient and uses a variety of unix commands such as, grep, sort, sed, and awk.

The script:

for file in contigs.fasta;do
grep -F “>” $file | sed -e ‘s/_/ /g’ |sort -nrk 6 |awk ‘$6>=2.0 && $4>=500 {print $0}’| sed -e ‘s/ /_/g’|sed -e ‘s/>//g’>$file.txt
echo sequences to keep
wc -l $file.txt
echo running fastagrep.pl
../fastagrep.pl -f $file.txt $file > HCov.$file
echo sequences kept
grep -c “>” HCov.$file

done

If you copy this to a text file, make sure you use tabs for the indented lines, and at the end keep the done.

The awk command : awk ‘$6>=2.0 && $4>=500 {print $0}’ specifies the coverage is >= 2.0 and the length is >=500. These data are extracted from the fasta header in the contigs.fasta file.

This little script produces two files:

contigs.fasta.txt  – contains a list of the high coverage contigs

HCov.contigs.fasta – the fasta file with the high coverage contigs.

The fastagrep.pl script can be in the same folder, or better in your path so you can call it from anywhere.

Easy does it.

Advertisements

About Thomas Haverkamp

A microbial ecologist, an amateur photographer and a proud father of a tiny little girl.
Quote | This entry was posted in Genomics & more, software. Bookmark the permalink.

6 Responses to Removing small & low coverage contigs from a Spades assembly

  1. In BioPython it’s pretty easy to filter contigs based on length:


    #!/usr/bin/env python
    import sys
    from Bio import SeqIO
    min_length, fasta_file_path = sys.argv[1:]
    with open(fasta_file_path.replace('fasta', 'filter{}.fasta'.format(min_length)), 'w') as filtered_fasta:
    with open(fasta_file_path, 'rU') as input_fasta:
    def filtered_contigs_generator(min):
    for contig in SeqIO.parse(input_fasta, 'fasta'):
    if len(contig) >= min:
    yield contig
    SeqIO.write(filtered_contigs_generator(int(min_length)), filtered_fasta, 'fasta')

    Put this in a file (“filter_contigs”, let’s say), chmod 755 it, put it on your path, and then when you invoke

    filter_contigs 500 my_assembly.fasta

    then it will produce, in the same directory, a file called “my_assembly.filter500.fasta” that contains only the contigs of length 500 or larger, in the same order they appear in the source file (if that’s important.) More importantly it’ll do it without loading the entire assembly into memory first, which can be important if you’re working with larger genomes.

    It can be easily adapted to work on streams as well (it’s basically a stream processor inside two open file contexts, anyway) so it may be of use to someone’s pipeline. And it wouldn’t be hard to add additional parameters for coverage filtering, as well (though in our experience there’s not a lot of difference between filtering short contigs and low-coverage ones.)

    (If the commenting system borks the line indents, I’ll put it up on Github and link.)

    • Derp, that’s exactly what it did. My bad! Here’s the same code in a repo:

      https://github.com/CFSAN-Biostatistics/filter_contigs

    • Thomas Haverkamp says:

      Thanks for giving the python script. But, with an assembly it is also important to remove contigs that have low coverage, since they are ussualy based from bad reads. Do you also know if it is easy to filter fasta sequences based on their coverage in an assembly? Such a command needs to check the header of each fasta sequence in Spades assembly contig / scaffold fasta files. I am just curious?

      • Thomas Haverkamp says:

        Argh, the ipad wordpress app made your comment quite unreadable. It seems i was asking things you could do easily 🙂

      • SeqRecords returned by SeqIO.parse have a ‘name’ attribute which gives you the fasta header, so provided that your assembler produces “Velvet-style” headers (and SPAdes is one such) then it would be easy to use a regex match (via python’s ‘re’ library), grab the parameter, and compare it against a minimum:


        import re
        cov_pattern = re.compile("cov_([0-9.]+)$")
        # ... open the files, iterate through the contigs, test each contig for coverage:
        result = cov_pattern.search(contig.name)
        if result and float(result.group(1)) >= min:
        yield contig

        I’ve added such an example to the Git repo linked above. Left as an exercise to the reader: adapt these two examples to process streams instead of files, so that they can be piped together, as in:


        $ cat my_assembly.fa | filter_length 500 | filter_coverage 20 > my_filtered_assembly.fa

        Hint: you’ll want to import sys, parse the contigs on sys.stdin, yield them to sys.stdout, but override this behavior if one of the command line arguments is a file.

      • Thomas Haverkamp says:

        Very nice! And thanks for the explanation of your code. You made my post more interesting by your comments.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s