Identification of CpG islands in DNA sequences using statistically optimal null filters
1 Department of Electrical and Computer Engineering, Concordia University, 1455 de Maisonneuve Blvd. West Montreal, QC H3G1M8, Canada
2 Department of Electrical Engineering and Computer Science, University of Toledo, MS 308, 2801 W. Bancroft St., Toledo, OH 43606, USA
EURASIP Journal on Bioinformatics and Systems Biology 2012, 2012:12 doi:10.1186/1687-4153-2012-12Published: 29 August 2012
CpG dinucleotide clusters also referred to as CpG islands (CGIs) are usually located in the promoter regions of genes in a deoxyribonucleic acid (DNA) sequence. CGIs play a crucial role in gene expression and cell differentiation, as such, they are normally used as gene markers. The earlier CGI identification methods used the rich CpG dinucleotide content in CGIs, as a characteristic measure to identify the locations of CGIs. The fact, that the probability of nucleotide G following nucleotide C in a CGI is greater as compared to a non-CGI, is employed by some of the recent methods. These methods use the difference in transition probabilities between subsequent nucleotides to distinguish between a CGI from a non-CGI. These transition probabilities vary with the data being analyzed and several of them have been reported in the literature sometimes leading to contradictory results. In this article, we propose a new and efficient scheme for identification of CGIs using statistically optimal null filters. We formulate a new CGI identification characteristic to reliably and efficiently identify CGIs in a given DNA sequence which is devoid of any ambiguities. Our proposed scheme combines maximum signal-to-noise ratio and least squares optimization criteria to estimate the CGI identification characteristic in the DNA sequence. The proposed scheme is tested on a number of DNA sequences taken from human chromosomes 21 and 22, and proved to be highly reliable as well as efficient in identifying the CGIs.