Combining Winnow and orthogonal sparse bigrams for incremental spam filtering
C. Siefkes, F. Assis, S. Chhabra, and W. Yerazunis. PKDD '04: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 410--421. New York, NY, USA, Springer-Verlag New York, Inc. (2004)
Abstract
Spam filtering is a text categorization task that has attracted significant attention due to the increasingly huge amounts of junk email on the Internet. While current best-practice systems use Naive Bayes filtering and other probabilistic methods, we propose using a statistical, but non-probabilistic classifier based on the Winnow algorithm. The feature space considered by most current methods is either limited in expressivity or imposes a large computational cost. We introduce orthogonal sparse bigrams (OSB) as a feature combination technique that overcomes both these weaknesses. By combining Winnow and OSB with refined preprocessing and tokenization techniques we are able to reach an accuracy of 99.68% on a difficult test corpus, compared to 98.88% previously reported by the CRM114 classifier on the same test corpus.
%0 Conference Paper
%1 osb
%A Siefkes, Christian
%A Assis, Fidelis
%A Chhabra, Shalendra
%A Yerazunis, William S.
%B PKDD '04: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases
%C New York, NY, USA
%D 2004
%I Springer-Verlag New York, Inc.
%K bi bigram categorization gram kdd n ngram orthogonal osb spam spamassassin text window winnow
%P 410--421
%T Combining Winnow and orthogonal sparse bigrams for incremental spam filtering
%U http://www.cs.ucr.edu/~schhabra/winnow-spam.pdf
%X Spam filtering is a text categorization task that has attracted significant attention due to the increasingly huge amounts of junk email on the Internet. While current best-practice systems use Naive Bayes filtering and other probabilistic methods, we propose using a statistical, but non-probabilistic classifier based on the Winnow algorithm. The feature space considered by most current methods is either limited in expressivity or imposes a large computational cost. We introduce orthogonal sparse bigrams (OSB) as a feature combination technique that overcomes both these weaknesses. By combining Winnow and OSB with refined preprocessing and tokenization techniques we are able to reach an accuracy of 99.68% on a difficult test corpus, compared to 98.88% previously reported by the CRM114 classifier on the same test corpus.
%@ 3-540-23108-0
@inproceedings{osb,
abstract = {Spam filtering is a text categorization task that has attracted significant attention due to the increasingly huge amounts of junk email on the Internet. While current best-practice systems use Naive Bayes filtering and other probabilistic methods, we propose using a statistical, but non-probabilistic classifier based on the Winnow algorithm. The feature space considered by most current methods is either limited in expressivity or imposes a large computational cost. We introduce orthogonal sparse bigrams (OSB) as a feature combination technique that overcomes both these weaknesses. By combining Winnow and OSB with refined preprocessing and tokenization techniques we are able to reach an accuracy of 99.68\% on a difficult test corpus, compared to 98.88\% previously reported by the CRM114 classifier on the same test corpus.},
added-at = {2007-11-10T16:00:52.000+0100},
address = {New York, NY, USA},
author = {Siefkes, Christian and Assis, Fidelis and Chhabra, Shalendra and Yerazunis, William S.},
biburl = {https://www.bibsonomy.org/bibtex/271047c2340c155bf8dab876110b8df75/jil},
booktitle = {PKDD '04: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases},
interhash = {fc59135554967cf59590b2ef9c08c9de},
intrahash = {71047c2340c155bf8dab876110b8df75},
isbn = {3-540-23108-0},
keywords = {bi bigram categorization gram kdd n ngram orthogonal osb spam spamassassin text window winnow},
location = {Pisa, Italy},
pages = {410--421},
publisher = {Springer-Verlag New York, Inc.},
timestamp = {2013-11-23T20:11:51.000+0100},
title = {Combining {W}innow and orthogonal sparse bigrams for incremental spam filtering},
url = {http://www.cs.ucr.edu/~schhabra/winnow-spam.pdf},
year = 2004
}