Unpublished,

Filtering Junk E-Mail: A Performance Comparison between Genetic Programming and Naive Bayes

H. Katirai.
(10 September 1999). 4A Year student project.

Abstract

This paper describes the application of genetic programming as a novel approach to the problem of filtering junk e-mail. We benchmark our results against the common standard: the naive Bayes classifier. While the genetically programmed classifier demonstrated a precision comparable to that of naive Bayes, it was slightly outperformed in recall. Since both learning methods gave similar results, it is recommended that a larger study be undertaken to ascertain whether these differences are indeed statistically significant. Further, it is recommended that the performance of these classifiers be tested in a richer feature space more typical of real-world classifiers. Although the genetically programmed classifier greatly outperformed the naive Bayes classifier in speed, it is concluded that a more efficient implementation of naive Bayes needs to be used in order to provide a fair comparison. We show that, when left unabated, e-mail signatures (also known as taglines) reduce the value of several important features in junk e-mail detection; however, it is also shown that these e-mail signatures may be harvested as advantageous features if some of their components are removed and noted as a feature. We therefore recommend that a better parser capable of meeting this criterion be implemented. To aid the reader in the theoretical aspects of our work, we have included introductory background for both approaches, including a full derivation of the generative naive Bayes model.

BibTeX key: katirai99
entry type: unpublished
year: 1999
month: 10 September
size: 27 pages
Document: http://citeseer.ist.psu.edu/310632.html
note: 4A Year student project

Cite this publication

%0 Unpublished Work
%1 katirai99
%A Katirai, Hooman
%D 1999
%K algorithms, genetic programming
%T Filtering Junk E-Mail: A Performance Comparison between Genetic Programming and Naive Bayes
%U http://citeseer.ist.psu.edu/310632.html
%X This paper describes the application of genetic programming as a novel approach to the problem of filtering junk e-mail. We benchmark our results against the common standard: the naive Bayes classifier. While the genetically programmed classifier demonstrated a precision comparable to that of naive Bayes, it was slightly outperformed in recall. Since both learning methods gave similar results, it is recommended that a larger study be undertaken to ascertain whether these differences are indeed statistically significant. Further, it is recommended that the performance of these classifiers be tested in a richer feature space more typical of real-world classifiers. Although the genetically programmed classifier greatly outperformed the naive Bayes classifier in speed, it is concluded that a more efficient implementation of naive Bayes needs to be used in order to provide a fair comparison. We show that, when left unabated, e-mail signatures (also known as taglines) reduce the value of several important features in junk e-mail detection; however, it is also shown that these e-mail signatures may be harvested as advantageous features if some of their components are removed and noted as a feature. We therefore recommend that a better parser capable of meeting this criterion be implemented. To aid the reader in the theoretical aspects of our work, we have included introductory background for both approaches, including a full derivation of the generative naive Bayes model.

@unpublished{katirai99,
  abstract = {This paper describes the application of genetic programming as a novel approach to the problem of filtering junk e-mail. We benchmark our results against the common standard: the naive Bayes classifier. While the genetically programmed classifier demonstrated a precision comparable to that of naive Bayes, it was slightly outperformed in recall. Since both learning methods gave similar results, it is recommended that a larger study be undertaken to ascertain whether these differences are indeed statistically significant. Further, it is recommended that the performance of these classifiers be tested in a richer feature space more typical of real-world classifiers. Although the genetically programmed classifier greatly outperformed the naive Bayes classifier in speed, it is concluded that a more efficient implementation of naive Bayes needs to be used in order to provide a fair comparison. We show that, when left unabated, e-mail signatures (also known as taglines) reduce the value of several important features in junk e-mail detection; however, it is also shown that these e-mail signatures may be harvested as advantageous features if some of their components are removed and noted as a feature. We therefore recommend that a better parser capable of meeting this criterion be implemented. To aid the reader in the theoretical aspects of our work, we have included introductory background for both approaches, including a full derivation of the generative naive Bayes model.},
  added-at = {2008-06-19T17:35:00.000+0200},
  author = {Katirai, Hooman},
  biburl = {https://www.bibsonomy.org/bibtex/2d663b3f5137587ced418b59b66884a52/brazovayeye},
  interhash = {c77dfc78fc615af3f37d421d998b7661},
  intrahash = {d663b3f5137587ced418b59b66884a52},
  keywords = {algorithms, genetic programming},
  month = {10 September},
  note = {4A Year student project},
  size = {27 pages},
  timestamp = {2008-06-19T17:42:58.000+0200},
  title = {Filtering Junk {E}-Mail: {A} Performance Comparison between Genetic Programming and Naive Bayes},
  url = {http://citeseer.ist.psu.edu/310632.html},
  year = 1999
}
