Abstract
Commercial GeneChips provide highly redundant but
noisy data. Rapidly identifying and rejecting bad data
increases the quality of the remaining data at little
cost, whilst also serving as a basis for better
understanding the biophysics of short surface-mounted
DNA sequences.
Affymetrix High Density Oligonucleotide Arrays (HDONA)
simultaneously measure expression of thousands of genes
using millions of probes. Regular expressions can be
evolved from a Backus-Naur form (BNF) context-free
grammar using tree-based, strongly typed genetic
programming written in gawk. Fitness is given by egrep.
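
As a rough illustration of grammar-guided generation, the sketch below derives random regular expressions from a tiny BNF-style grammar. It is written in Python rather than gawk, and the grammar, the `GRAMMAR` table and the `derive` helper are hypothetical stand-ins, not the grammar or code used in this work. It covers only the creation of an initial candidate; crossover, mutation and selection are omitted.

```python
import random
import re

# Hypothetical toy grammar (an assumption, not the paper's BNF).
# A regex is one or more elements; an element is a base or a
# two-base character class, optionally followed by "+".
# The first production of every rule is non-recursive.
GRAMMAR = {
    "<re>": [["<elem>"], ["<elem>", "<re>"]],
    "<elem>": [["<atom>"], ["<atom>", "+"]],
    "<atom>": [["A"], ["C"], ["G"], ["T"],
               ["[", "<base>", "<base>", "]"]],
    "<base>": [["A"], ["C"], ["G"], ["T"]],
}

def derive(symbol, rng, depth=0, max_depth=6):
    """Expand a grammar symbol into a regex string by random derivation."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        options = options[:1]  # force the non-recursive production
    production = rng.choice(options)
    return "".join(derive(s, rng, depth + 1, max_depth)
                   for s in production)

if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(3):
        candidate = derive("<re>", rng)
        re.compile(candidate)  # every derivation is a valid regex
        print(candidate)
```

Because every string is derived from the grammar, each candidate is syntactically valid by construction, which is the usual motivation for typed or grammar-based genetic programming.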
The quality of each HG-U133A probe is indicated by the
correlation of its measurements, across 6685 human
tissue samples from NCBI's GEO database, with other
measurements for the same gene. Low concordance
indicates a poor probe.
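
The concordance idea can be sketched as follows. The abstract does not state the exact statistic, so this example assumes the mean Pearson correlation between one probe and the other probes for the same gene; `pearson` and `probe_concordance` are illustrative names, and the input layout (one list of intensities per probe, aligned by sample) is likewise an assumption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation signal
    return sxy / sqrt(sxx * syy)

def probe_concordance(probes):
    """Map each probe id to its mean correlation with the other probes.

    `probes` maps probe id -> intensities across samples, where all
    probes are assumed to target the same gene (an assumption of
    this sketch).
    """
    scores = {}
    for pid, values in probes.items():
        others = [pearson(values, v)
                  for other, v in probes.items() if other != pid]
        scores[pid] = sum(others) / len(others) if others else 0.0
    return scores
```

Under this assumption, a probe whose score falls well below those of its siblings would be flagged as poor; any threshold is a separate choice not specified here.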
The evolved, data-mined motif is better at predicting
poor DNA sequences than an existing human-generated
regular expression (RE), suggesting that runs of
cytosine and of guanine should both be avoided.
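
A minimal sketch of how a candidate regular expression might be scored as a predictor of poor probes. It uses Python's `re` module in place of egrep, and plain accuracy in place of the actual fitness function, which the abstract does not specify. The `score` function is hypothetical, and any patterns or sequences passed to it would be illustrative inputs, not results from this work.

```python
import re

def score(pattern, sequences, is_poor):
    """Fraction of probes where 'pattern matches' agrees with 'is poor'.

    `sequences` are probe sequences as strings; `is_poor` is a
    parallel list of booleans (assumed to come from a concordance
    measure such as the one described above).
    """
    rx = re.compile(pattern)
    agree = sum((rx.search(seq) is not None) == poor
                for seq, poor in zip(sequences, is_poor))
    return agree / len(sequences)
```

Comparing `score` for two patterns on the same labelled probes gives a simple way to check whether one motif predicts poor probes better than another.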
Code is available at
ftp://cs.ucl.ac.uk/genetic/gp-code/RE_gp.tar