Techreport

A Grammar based Strongly Typed Genetic Programming system for finding Regular Expression which predict Affymetrix DNA probe performance

Computing and Electronic Systems, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, UK (April 2008)

Abstract

Commercial GeneChips provide highly redundant but noisy data. Rapid identification and subsequent rejection of bad data effectively increases the quality of the remaining data at little cost, whilst serving as a basis for better understanding the bio-physics of short surface-mounted DNA sequences. Affymetrix High Density Oligonucleotide Arrays (HDONA) simultaneously measure expression of thousands of genes using millions of probes. Regular expressions can be evolved from a Backus-Naur form (BNF) context-free grammar using tree-based strongly typed genetic programming written in gawk. Fitness is given by egrep. The quality of each individual HG-U133A probe is indicated by its correlation, across 6685 human tissue samples from NCBI's GEO database, with other measurements for the same gene. Low concordance indicates a poor probe. The evolved, data-mined motif is better at predicting poor DNA sequences than an existing human-generated RE, suggesting that runs of cytosine and runs of guanine should both be avoided. Code is available at ftp://cs.ucl.ac.uk/genetic/gp-code/RE_gp.tar
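
To illustrate how a regular expression can act as a poor-probe predictor, here is a minimal sketch in Python. It is not the evolved expression from the report, and it does not use the report's gawk and egrep code. The motif (any run of four or more G, or four or more C), the function names, and the probe sequences and labels are all hypothetical, chosen only to show an egrep-style match being compared against known labels.

```python
import re

# Hypothetical motif, NOT the evolved expression from the report:
# matches any run of four or more G, or four or more C.
POOR_PROBE_RE = re.compile(r"G{4,}|C{4,}")


def is_predicted_poor(probe: str) -> bool:
    """Return True if the motif matches anywhere in the probe (egrep-style search)."""
    return POOR_PROBE_RE.search(probe.upper()) is not None


def prediction_accuracy(probes, labels):
    """Fraction of probes whose motif prediction agrees with the poor/good label."""
    agree = sum(is_predicted_poor(p) == bad for p, bad in zip(probes, labels))
    return agree / len(probes)


# Made-up sequences and labels (True = poor probe), for illustration only.
probes = [
    "ATGGGGTCATTCAGTACCATGACTA",  # contains a run of four G
    "ATGCATGCATGCATGCATGCATGCA",  # no run of G or C
    "TTACCCCCAGTAGTCATGACTGATC",  # contains a run of five C
]
labels = [True, False, True]

accuracy = prediction_accuracy(probes, labels)
```

In a setting like the one the abstract describes, an agreement score of this kind could serve as the fitness of a candidate expression, with the labels derived from each probe's correlation with other measurements for the same gene.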
