Abstract
This paper describes the application of genetic
programming as a novel approach to the problem of
filtering junk e-mail. We benchmark our results against
the common standard: the naive Bayes classifier. While
the genetically programmed classifier demonstrated a
precision comparable to that of naive Bayes, it was
slightly outperformed in recall. Since both learning
methods gave similar results, it is recommended that a
larger study be undertaken to ascertain whether these
differences are indeed statistically significant.
Further, it is recommended that the performance of these
classifiers be tested in a richer feature space more
typical of real-world classifiers. Although the
genetically programmed classifier greatly outperformed
the naive Bayes classifier in speed, it is concluded
that a more efficient implementation of naive Bayes
needs to be used in order to provide a fair comparison.
We show that, when left unprocessed, e-mail signatures (also
known as taglines) reduce the value of several important
features in junk e-mail detection; however, it is also
shown that these e-mail signatures may be harvested as
advantageous features if some of their components are
removed and noted as a feature. We therefore recommend
that a better parser capable of meeting this criterion
be implemented. To aid the reader in the theoretical
aspects of our work, we have included introductory
background for both approaches, including a full
derivation of the generative naive Bayes model.