Genome assemblers under evaluation in GAGE

What genome assemblers are under evaluation?

We are evaluating the performance of the following genome assemblers on our data sets:

ABySS (Assembly By Short Sequences) (Birol et al): A de novo assembler for short-read sequence data that uses a distributed representation of a de Bruijn graph, allowing the assembly algorithm to run in parallel across a network of commodity computers. Developed at Canada's Michael Smith Genome Sciences Centre.
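
To make the idea of a distributed de Bruijn graph concrete, here is a minimal single-process sketch in Python. It is not ABySS's implementation: the function names, the CRC32-based partitioning scheme, and the in-memory "workers" (plain Python sets) are all assumptions made for illustration, and reverse complements are ignored. The point it shows is that each k-mer has a single owner, so a neighbour lookup only needs to contact the worker that owns the candidate k-mer.

```python
import zlib


def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def owner(kmer, n_workers):
    """Assign a k-mer to a worker with a deterministic hash (assumed scheme)."""
    return zlib.crc32(kmer.encode("ascii")) % n_workers


def build_partitions(reads, k, n_workers):
    """Return one k-mer set per worker; together they hold the graph's vertices."""
    parts = [set() for _ in range(n_workers)]
    for read in reads:
        for km in kmers(read, k):
            parts[owner(km, n_workers)].add(km)
    return parts


def successors(kmer, parts):
    """Find the right-hand neighbours of a k-mer.

    Each of the four candidate neighbours is looked up only on the worker
    that owns it, which is what lets the graph be spread across machines.
    """
    found = []
    for base in "ACGT":
        cand = kmer[1:] + base
        if cand in parts[owner(cand, len(parts))]:
            found.append(cand)
    return found


# Hypothetical toy reads, used only to exercise the sketch.
reads = ["ACGTAC", "CGTACG"]
parts = build_partitions(reads, k=4, n_workers=3)
print(successors("ACGT", parts))
```

In a real cluster the per-worker sets would live on different machines and the lookups would be network messages; the sketch keeps them in one list so it can be run as-is.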

ALLPATHS-LG (Gnerre et al): a de Bruijn graph-based de novo assembler for large (and small) genomes. ALLPATHS-LG is being developed by scientists at the Broad Institute.

Bambus2: The second-generation Bambus scaffolder relies on a combination of a novel method for detecting genomic repeats and algorithms that analyze assembly graphs to identify biologically meaningful genomic variants. Scaffolds produced by Bambus2 compare favorably to those generated by CABOG, Newbler and SOAPdenovo with respect to contiguity and error rate. While Bambus2 was specifically designed for polymorphic and metagenomic scaffolding, its modular and efficient algorithm allows it to scaffold mammalian genomes and to serve as a drop-in replacement scaffolder for CABOG, Newbler, and SOAPdenovo. Bambus2 is being primarily developed by Sergey Koren and Mihai Pop, with input from Todd Treangen.

Celera Assembler: an Overlap-Layout-Consensus based de novo whole-genome shotgun (WGS) DNA sequence assembler. It reconstructs long sequences of genomic DNA from fragmentary data produced by whole-genome shotgun sequencing. Celera Assembler has enabled many advances in genomics, including the first whole genome shotgun sequence of a multi-cellular organism (Myers 2000) and the first diploid sequence of an individual human (Levy 2007). Celera Assembler was developed at Celera Genomics starting in 1999. It was released to SourceForge in 2004 as the wgs-assembler under the GNU General Public License. The pipeline revised for 454 data was named CABOG (Miller 2008).
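
As a rough illustration of the overlap and layout stages, the Python sketch below greedily merges the pair of reads with the longest exact suffix-prefix overlap until no overlaps remain. This is a toy under strong assumptions, not Celera Assembler's algorithm: the function names and the `min_len` parameter are hypothetical, reads are assumed error-free and from one strand, and the consensus stage is skipped because exact matches leave nothing to reconcile.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the best-overlapping pair of reads (greedy layout)."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        # Overlap stage: score every ordered pair of distinct reads.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return reads  # nothing left to merge
        # Layout stage: join the pair, keeping the shared bases once.
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]


# Hypothetical toy reads, used only to exercise the sketch.
contigs = greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"])
print(contigs)
```

The all-pairs comparison is quadratic in the number of reads, which is fine for a demonstration but is exactly the cost that production overlappers avoid with indexing.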

MSR-CA (pronounced "MizerKa") is a new technique that pre-processes the short read data and then performs the final assembly using a modified version of Celera Assembler. MSR-CA stands for Maryland Super-Reads + Celera Assembler. The pre-processing steps include error correction and subsequent coverage reduction by creating "super-reads," which are produced using a de Bruijn graph. The algorithm then groups together the reads that map to the same sets of nodes and edges, and for each set replaces them by a single super-read that contains these nodes and edges. This can reduce the number of reads by a factor of 50 or more, resulting in a data set that is much easier to manage.
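
The coverage-reduction step can be sketched in Python as follows. This is a simplified illustration, not the MSR-CA code: it assumes error-free, single-strand reads, and the names `kmer_set`, `extend`, and `super_reads` are hypothetical. Each read is extended through the k-mer graph in both directions for as long as exactly one extension exists, and reads that end up as the same extended sequence collapse into one super-read.

```python
def kmer_set(reads, k):
    """Collect every k-mer seen in the reads (the de Bruijn graph's vertices)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def extend(seq, kset, k, forward=True):
    """Grow seq one base at a time while the extension is unambiguous.

    Stops at a dead end, at a branch, or when a k-mer repeats (a cycle).
    """
    seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    while True:
        if forward:
            cands = [seq[-(k - 1):] + b for b in "ACGT"]
        else:
            cands = [b + seq[:k - 1] for b in "ACGT"]
        hits = [c for c in cands if c in kset]
        if len(hits) != 1 or hits[0] in seen:
            return seq
        seen.add(hits[0])
        if forward:
            seq = seq + hits[0][-1]
        else:
            seq = hits[0][0] + seq


def super_reads(reads, k):
    """Replace each read by its maximal unambiguous extension and deduplicate."""
    kset = kmer_set(reads, k)
    result = set()
    for r in reads:
        if len(r) >= k:
            s = extend(r, kset, k, forward=True)
            s = extend(s, kset, k, forward=False)
            result.add(s)
    return result


# Hypothetical toy reads, used only to exercise the sketch.
reads = ["ATGGCGT", "GGCGTGCA", "GTGCAATC", "CGTGCAA"]
print(len(reads), "reads ->", len(super_reads(reads, k=4)), "super-read(s)")
```

In this toy input all four reads lie on one unbranched path, so they collapse into a single sequence; the reduction on real data depends on coverage and on how often the graph branches.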

SGA (Simpson et al): String Graph Assembler, an experimental de novo assembler based on string graphs. SGA is being developed by scientists at the Wellcome Trust Sanger Institute.

SOAPdenovo (Li et al): the short-read assembler that was used for the panda genome, the first mammalian genome assembled entirely from Illumina reads, and subsequently for several human genomes and other genomes. It is being developed by scientists at BGI.

Velvet (Zerbino et al): Velvet is a de novo genome assembler specially designed for short read sequencing technologies, particularly Illumina reads, and was one of the first short-read assemblers to be published. It was developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, England.

In addition to these assemblers, we are also evaluating error correction methods, both standalone methods and the error correctors built into some assemblers, specifically ALLPATHS-LG (Gnerre et al) and Quake (Kelley et al).