Abstract
There are numerous text documents available in electronic form. More and more
are becoming available every day. Such documents represent a massive amount of
information that is easily accessible. Seeking value in this huge collection requires
organization; much of the work of organizing documents can be automated through
text classification. The accuracy of such systems, and our understanding of them, greatly
influence their usefulness. In this paper, we seek 1) to advance the understanding
of commonly used text classification techniques, and 2) through that understanding,
to improve the tools that are available for text classification. We begin by clarifying
the assumptions made in the derivation of Naive Bayes, noting basic properties and
proposing ways for its extension and improvement. Next, we investigate the quality
of Naive Bayes parameter estimates and their impact on classification. Our analysis
leads to a theorem which gives an explanation for the improvements that can be
found in multiclass classification with Naive Bayes using Error-Correcting Output
Codes. We use experimental evidence on two commonly used data sets to exhibit an
application of the theorem. Finally, we show fundamental flaws in a commonly used
feature selection algorithm and develop a statistics-based framework for text feature
selection. Greater understanding of Naive Bayes and the properties of text allows us
to make better use of Naive Bayes in text classification.