Combining winnow and orthogonal sparse bigrams for incremental spam filtering

PKDD '04: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 410--421. New York, NY, USA, Springer-Verlag New York, Inc. (2004)

Abstract

Spam filtering is a text categorization task that has attracted significant attention due to the increasingly huge amounts of junk email on the Internet. While current best-practice systems use Naive Bayes filtering and other probabilistic methods, we propose using a statistical, but non-probabilistic classifier based on the Winnow algorithm. The feature space considered by most current methods is either limited in expressivity or imposes a large computational cost. We introduce orthogonal sparse bigrams (OSB) as a feature combination technique that overcomes both these weaknesses. By combining Winnow and OSB with refined preprocessing and tokenization techniques we are able to reach an accuracy of 99.68% on a difficult test corpus, compared to 98.88% previously reported by the CRM114 classifier on the same test corpus.
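
The abstract names two techniques without giving their details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that orthogonal sparse bigrams pair each token with each of the few tokens preceding it inside a sliding window, marking skipped positions with a placeholder, and that Winnow keeps one multiplicative weight per feature and updates only on mistakes. The window size, the `<skip>` marker, the promotion and demotion factors, the averaged score with a threshold of 1.0, and the names `osb_features` and `Winnow` are all illustrative assumptions.

```python
from collections import defaultdict


def osb_features(tokens, window=5):
    """Pair each token with each of the up to (window - 1) tokens before it.

    Skipped positions between the two tokens are written as "<skip>",
    so the same word pair at different distances yields different features.
    The window size and the marker are assumptions made for this sketch.
    """
    features = []
    for i, right in enumerate(tokens):
        for dist in range(1, window):
            j = i - dist
            if j < 0:
                break
            gap = " ".join(["<skip>"] * (dist - 1))
            features.append(" ".join(p for p in (tokens[j], gap, right) if p))
    return features


class Winnow:
    """Mistake-driven linear classifier with multiplicative weight updates."""

    def __init__(self, promotion=1.5, demotion=0.5):
        # Hypothetical factors; every unseen feature starts at weight 1.0.
        self.weights = defaultdict(lambda: 1.0)
        self.promotion = promotion
        self.demotion = demotion

    def score(self, features):
        # Average weight of the distinct features present in the message.
        active = set(features)
        return sum(self.weights[f] for f in active) / max(len(active), 1)

    def predict(self, features):
        # Assumed decision rule: spam if the average weight exceeds 1.0.
        return self.score(features) > 1.0

    def train(self, features, is_spam):
        # Update only when the current prediction is wrong.
        if self.predict(features) == is_spam:
            return
        factor = self.promotion if is_spam else self.demotion
        for f in set(features):
            self.weights[f] *= factor


if __name__ == "__main__":
    clf = Winnow()
    spam = osb_features("buy cheap pills now".split())
    clf.train(spam, is_spam=True)
    print(clf.predict(spam))
```

Because training is incremental, each message can be fed to `train` as soon as its true label is known, which fits the incremental filtering setting in the title. Tokenization is reduced to `str.split` here; the refined preprocessing the abstract mentions is not modelled.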
