Unpublished

Filtering Junk E-Mail: A Performance Comparison between Genetic Programming and Naive Bayes

(10 September 1999) 4A Year student project.

Abstract

This paper describes the application of genetic programming as a novel approach to the problem of filtering junk e-mail. We benchmark our results against the common standard: the naive Bayes classifier. While the genetically programmed classifier demonstrated precision comparable to that of naive Bayes, it was slightly outperformed in recall. Since both learning methods gave similar results, we recommend that a larger study be undertaken to ascertain whether these differences are statistically significant. We further recommend that the performance of these classifiers be tested in a richer feature space more typical of real-world classifiers. Although the genetically programmed classifier greatly outperformed the naive Bayes classifier in speed, we conclude that a more efficient implementation of naive Bayes is needed to provide a fair comparison. We show that e-mail signatures (also known as taglines), when left in place, reduce the value of several important features in junk e-mail detection; however, we also show that these signatures can themselves be harvested as advantageous features if some of their components are removed and their presence is noted as a feature. We therefore recommend implementing a better parser capable of meeting these criteria. To aid the reader with the theoretical aspects of our work, we include introductory background for both approaches, including a full derivation of the generative naive Bayes model.
