Abstract
A tremendous amount of genomic sequence data of
relatively high quality has become publicly available
due to the human genome sequencing projects that were
completed a few years ago. Despite considerable
efforts, we do not yet know everything there is to know
about the various parts of the genome, what all the
regions code for, and how their gene products
contribute to the myriad of biological processes that
are performed within the cells. New high-performance
methods are needed to extract knowledge from this vast
amount of information.
Furthermore, the traditional view that DNA codes for
RNA that codes for protein, which is known as the
central dogma of molecular biology, seems to be only
part of the story. The discovery of many
non-protein-coding gene families with housekeeping and
regulatory functions brings an entirely new perspective
to molecular biology. Also, sequence analysis of the
new gene families requires new methods, as there are
significant differences between protein-coding and
non-protein-coding genes.
This work describes a new search processor that can
search for complex patterns in sequence data for which
no efficient lookup-index is known. When several chips
are mounted on search cards that are fitted into PCs in
a small cluster configuration, the system's performance
is orders of magnitude higher than that of comparable
solutions for selected applications. The applications
treated in this work fall into two main categories,
namely pattern screening and data mining, and both take
advantage of the search capacity of the cluster to
achieve adequate performance. Specifically, the thesis
describes an interactive system for exploration of all
types of genomic sequence data. Moreover, a genetic
programming-based data mining system finds classifiers
that consist of potentially complex patterns that are
characteristic of groups of sequences. The screening
and mining capacity has been used to develop an
algorithm for identification of new non-protein-coding
genes in bacteria; a system for rational design of
effective and specific short interfering RNA for
sequence-specific silencing of protein-coding genes;
and an improved algorithmic step for identification of
new regulatory targets for the microRNA family of
non-protein-coding genes.