Inproceedings,

Combining link-based and content-based methods for web document classification

P. Calado, M. Cristo, E. Moura, N. Ziviani, B. Ribeiro-Neto, and M. Gonçalves.
Proceedings of the twelfth international conference on Information and knowledge management, pages 394--401. New York, NY, USA, ACM (2003)
DOI: 10.1145/956863.956938

Abstract

This paper studies how link information can be used to improve classification results for Web collections. We evaluate four different measures of subject similarity, derived from the Web link structure, and determine how accurate they are in predicting document categories. Using a Bayesian network model, we combine these measures with the results obtained by traditional content-based classifiers. Experiments on a Web directory show that best results are achieved when links from pages outside the directory are considered. Link information alone is able to obtain gains of up to 46 points in F1, when compared to a traditional content-based classifier. The combination with content-based methods can further improve the results, but too much noise may be introduced, since the text of Web pages is a much less reliable source of information. This work provides an important insight on which measures derived from links are more appropriate to compare Web documents and how these measures can be combined with content-based algorithms to improve the effectiveness of Web classification.

BibTeX key: Calado:2003:CLC:956863.956938
entry type: inproceedings
address: New York, NY, USA
booktitle: Proceedings of the twelfth international conference on Information and knowledge management
year: 2003
pages: 394--401
publisher: ACM
series: CIKM '03
acmid: 956938
location: New Orleans, LA, USA
isbn: 1-58113-723-0
numpages: 8
DOI: 10.1145/956863.956938
url: http://doi.acm.org/10.1145/956863.956938

Cite this publication

@inproceedings{Calado:2003:CLC:956863.956938,
  abstract = {This paper studies how link information can be used to improve classification results for Web collections. We evaluate four different measures of subject similarity, derived from the Web link structure, and determine how accurate they are in predicting document categories. Using a Bayesian network model, we combine these measures with the results obtained by traditional content-based classifiers. Experiments on a Web directory show that best results are achieved when links from pages outside the directory are considered. Link information alone is able to obtain gains of up to 46 points in F$_1$, when compared to a traditional content-based classifier. The combination with content-based methods can further improve the results, but too much noise may be introduced, since the text of Web pages is a much less reliable source of information. This work provides an important insight on which measures derived from links are more appropriate to compare Web documents and how these measures can be combined with content-based algorithms to improve the effectiveness of Web classification.},
  acmid = {956938},
  added-at = {2011-11-30T17:39:42.000+0100},
  address = {New York, NY, USA},
  author = {Calado, P\'{a}vel and Cristo, Marco and Moura, Edleno and Ziviani, Nivio and Ribeiro-Neto, Berthier and Gon\c{c}alves, Marcos Andr\'{e}},
  biburl = {https://www.bibsonomy.org/bibtex/2149e564bc41e9097278fce30322eaad2/telekoma},
  booktitle = {Proceedings of the twelfth international conference on Information and knowledge management},
  description = {Combining link-based and content-based methods for web document classification},
  doi = {10.1145/956863.956938},
  interhash = {8b64b9219ff3dc4b1758bcd136849b15},
  intrahash = {149e564bc41e9097278fce30322eaad2},
  isbn = {1-58113-723-0},
  keywords = {bachelor:2011:bachmann combining contentbased link_analysis webpage},
  location = {New Orleans, LA, USA},
  numpages = {8},
  pages = {394--401},
  publisher = {ACM},
  series = {CIKM '03},
  timestamp = {2011-11-30T17:39:43.000+0100},
  title = {Combining link-based and content-based methods for web document classification},
  url = {http://doi.acm.org/10.1145/956863.956938},
  year = 2003
}
