Filtering Junk E-Mail: A Performance Comparison
between Genetic Programming and Naive Bayes
H. Katirai. (10 September 1999). 4A Year student project.
Abstract
This paper describes the application of genetic
programming as a novel approach to the problem of
filtering junk e-mail. We benchmark our results against
the common standard: the naive Bayes classifier. While
the genetically programmed classifier demonstrated a
precision comparable to that of naive Bayes, it was
slightly outperformed in recall. Since both learning
methods gave similar results, it is recommended that a
larger study be undertaken to ascertain whether these
differences are indeed statistically significant.
Further, it is recommended that the performance of these
classifiers be tested in a richer feature space more
typical of real-world classifiers. Although the
genetically programmed classifier greatly outperformed
the naive Bayes classifier in speed, it is concluded
that a more efficient implementation of naive Bayes
needs to be used in order to provide a fair comparison.
We show that, when left unabated, e-mail signatures (also
known as taglines) reduce the value of several important
features in junk e-mail detection; however, it is also
shown that these e-mail signatures may be harvested as
advantageous features if some of their components are
removed and noted as a feature. We therefore recommend
that a better parser capable of meeting these criteria
be implemented. To aid the reader in the theoretical
aspects of our work, we have included introductory
background for both approaches, including a full
derivation of the generative naive Bayes model.
%0 Unpublished Work
%1 katirai99
%A Katirai, Hooman
%D 1999
%K algorithms, genetic programming
%T Filtering Junk E-Mail: A Performance Comparison
between Genetic Programming and Naive Bayes
%U http://citeseer.ist.psu.edu/310632.html
%X This paper describes the application of genetic
programming as a novel approach to the problem of
filtering junk e-mail. We benchmark our results against
the common standard: the naive Bayes classifier. While
the genetically programmed classifier demonstrated a
precision comparable to that of naive Bayes, it was
slightly outperformed in recall. Since both learning
methods gave similar results, it is recommended that a
larger study be undertaken to ascertain whether these
differences are indeed statistically significant.
Further, it is recommended that the performance of these
classifiers be tested in a richer feature space more
typical of real-world classifiers. Although the
genetically programmed classifier greatly outperformed
the naive Bayes classifier in speed, it is concluded
that a more efficient implementation of naive Bayes
needs to be used in order to provide a fair comparison.
We show that, when left unabated, e-mail signatures (also
known as taglines) reduce the value of several important
features in junk e-mail detection; however, it is also
shown that these e-mail signatures may be harvested as
advantageous features if some of their components are
removed and noted as a feature. We therefore recommend
that a better parser capable of meeting these criteria
be implemented. To aid the reader in the theoretical
aspects of our work, we have included introductory
background for both approaches, including a full
derivation of the generative naive Bayes model.
@unpublished{katirai99,
abstract = {This paper describes the application of genetic
programming as a novel approach to the problem of
filtering junk e-mail. We benchmark our results against
the common standard: the naive Bayes classifier. While
the genetically programmed classifier demonstrated a
precision comparable to that of naive Bayes, it was
slightly outperformed in recall. Since both learning
methods gave similar results, it is recommended that a
larger study be undertaken to ascertain whether these
differences are indeed statistically significant.
Further, it is recommended that the performance of these
classifiers be tested in a richer feature space more
typical of real-world classifiers. Although the
genetically programmed classifier greatly outperformed
the naive Bayes classifier in speed, it is concluded
that a more efficient implementation of naive Bayes
needs to be used in order to provide a fair comparison.
We show that, when left unabated, e-mail signatures (also
known as taglines) reduce the value of several important
features in junk e-mail detection; however, it is also
shown that these e-mail signatures may be harvested as
advantageous features if some of their components are
removed and noted as a feature. We therefore recommend
that a better parser capable of meeting these criteria
be implemented. To aid the reader in the theoretical
aspects of our work, we have included introductory
background for both approaches, including a full
derivation of the generative naive Bayes model.},
added-at = {2008-06-19T17:35:00.000+0200},
author = {Katirai, Hooman},
biburl = {https://www.bibsonomy.org/bibtex/2d663b3f5137587ced418b59b66884a52/brazovayeye},
interhash = {c77dfc78fc615af3f37d421d998b7661},
intrahash = {d663b3f5137587ced418b59b66884a52},
keywords = {algorithms, genetic programming},
month = {10 September},
note = {4A Year student project},
size = {27 pages},
timestamp = {2008-06-19T17:42:58.000+0200},
title = {Filtering Junk {E}-Mail: {A} Performance Comparison
between Genetic Programming and Naive Bayes},
url = {http://citeseer.ist.psu.edu/310632.html},
year = 1999
}