Inproceedings,

Combining link-based and content-based methods for web document classification

Proceedings of the twelfth international conference on Information and knowledge management, pages 394-401. New York, NY, USA, ACM, (2003)
DOI: 10.1145/956863.956938

Abstract

This paper studies how link information can be used to improve classification results for Web collections. We evaluate four different measures of subject similarity, derived from the Web link structure, and determine how accurate they are in predicting document categories. Using a Bayesian network model, we combine these measures with the results obtained by traditional content-based classifiers. Experiments on a Web directory show that best results are achieved when links from pages outside the directory are considered. Link information alone is able to obtain gains of up to 46 points in F1, when compared to a traditional content-based classifier. The combination with content-based methods can further improve the results, but too much noise may be introduced, since the text of Web pages is a much less reliable source of information. This work provides an important insight on which measures derived from links are more appropriate to compare Web documents and how these measures can be combined with content-based algorithms to improve the effectiveness of Web classification.
