Optimal reference sequence selection for genome assembly using minimum description length principle
1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA
2 Department of Electrical Engineering, University of Engineering & Technology, Lahore, Punjab 54890, Pakistan
3 Department of Chemical Engineering, Texas A&M University, Doha, Qatar
4 Department of Electrical and Computer Engineering, , Doha, Qatar
EURASIP Journal on Bioinformatics and Systems Biology 2012, 2012:18 doi:10.1186/1687-4153-2012-18
Published: 27 November 2012
Reference-assisted assembly uses a reference sequence as a model to guide the assembly of a novel genome. The standard method for choosing that reference counts the reads that align to each candidate sequence and selects the candidate with the highest count. This article explores the use of the minimum description length (MDL) principle, and two of its variants, the two-part MDL and the sophisticated MDL, for identifying the optimal reference sequence for genome assembly. Comparing the proposed MDL-based scheme with the standard method, the article concludes that counting the reads of the novel genome that are present in the reference sequence is not a sufficient criterion. The proposed MDL scheme therefore subsumes the standard read-counting method and additionally takes the model itself, the reference sequence, into account when identifying the optimal reference. The proposed scheme not only provides a sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for assembling the novel genome.
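As a rough illustration of why read counting alone may not be sufficient, the sketch below contrasts a read-count score with a toy two-part description length, L(model) + L(data | model). It is not the article's scheme. Every detail is an assumption made for illustration: exact substring matching stands in for alignment, the reference costs 2 bits per base, a matched read costs one flag bit plus a position pointer, and an unmatched read costs one flag bit plus 2 bits per base. The function names and the example sequences are hypothetical.

```python
import math


def count_aligned_reads(reference, reads):
    """Read-count score: how many reads occur in the reference.

    Assumption: exact substring match stands in for read alignment.
    """
    return sum(1 for read in reads if read in reference)


def two_part_code_length(reference, reads):
    """Toy two-part MDL score in bits: L(model) + L(data | model).

    Assumption: a crude encoding chosen for illustration only.
    """
    # Cost of the model: 2 bits per base over the alphabet {A, C, G, T}.
    model_bits = 2 * len(reference)
    # Cost of pointing at a start position inside the reference.
    position_bits = math.log2(len(reference)) if len(reference) > 1 else 0.0

    data_bits = 0.0
    for read in reads:
        data_bits += 1  # flag bit: matched or spelled out literally
        if read in reference:
            data_bits += position_bits
        else:
            data_bits += 2 * len(read)
    return model_bits + data_bits


# Hypothetical reads and candidate references.
reads = ["ACGT", "CGTA", "TTGA"]
candidates = {
    "ref_short": "ACGTA",
    "ref_long": "ACGTA" + "G" * 20,
}

counts = {name: count_aligned_reads(seq, reads) for name, seq in candidates.items()}
lengths = {name: two_part_code_length(seq, reads) for name, seq in candidates.items()}
best = min(lengths, key=lengths.get)
```

In this toy setting both candidates contain the same two reads, so the read count cannot separate them, while the description length favours the shorter reference because the longer one pays for bases that explain none of the reads.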