An Application of Improved PageRank in Focused Crawler
Y. Zhang, C. Yin, and F. Yuan. Fuzzy Systems and Knowledge Discovery, 2007. FSKD 2007. Fourth International Conference on, 2, pp. 331--335. (2007)
DOI: 10.1109/FSKD.2007.142
Abstract
The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to exploit all regions of the Web. The PageRank algorithm is often used in ranking web pages, and it is also used in URL ordering for focused crawlers. It estimates a page's authority by taking into account the link structure of the Web. However, it assigns each outlink the same weight and is independent of topics, resulting in topic drift. In this paper, we propose an improved PageRank algorithm, which we call "To-PageRank", and then we present a crawling strategy that combines the To-PageRank algorithm with the topic similarity of the hyperlink metadata. Experiments with a focused crawler show that the improved crawling strategy performs better than the breadth-first and PageRank algorithms.
%0 Conference Paper
%1 citeulike:4544608
%A Zhang, Yulian
%A Yin, Chunxia
%A Yuan, Fuyong
%B Fuzzy Systems and Knowledge Discovery, 2007. FSKD 2007. Fourth International Conference on
%D 2007
%K crawling, indexing
%P 331--335
%R 10.1109/FSKD.2007.142
%T An Application of Improved PageRank in Focused Crawler
%U http://dx.doi.org/10.1109/FSKD.2007.142
%V 2
%X The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to exploit all regions of the Web. The PageRank algorithm is often used in ranking web pages, and it is also used in URL ordering for focused crawlers. It estimates a page's authority by taking into account the link structure of the Web. However, it assigns each outlink the same weight and is independent of topics, resulting in topic drift. In this paper, we propose an improved PageRank algorithm, which we call "To-PageRank", and then we present a crawling strategy that combines the To-PageRank algorithm with the topic similarity of the hyperlink metadata. Experiments with a focused crawler show that the improved crawling strategy performs better than the breadth-first and PageRank algorithms.
@inproceedings{citeulike:4544608,
abstract = {The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to exploit all regions of the Web. The PageRank algorithm is often used in ranking web pages, and it is also used in URL ordering for focused crawlers. It estimates a page's authority by taking into account the link structure of the Web. However, it assigns each outlink the same weight and is independent of topics, resulting in topic drift. In this paper, we propose an improved PageRank algorithm, which we call "To-PageRank", and then we present a crawling strategy that combines the To-PageRank algorithm with the topic similarity of the hyperlink metadata. Experiments with a focused crawler show that the improved crawling strategy performs better than the breadth-first and PageRank algorithms.},
added-at = {2009-05-19T18:00:18.000+0200},
author = {Zhang, Yulian and Yin, Chunxia and Yuan, Fuyong},
biburl = {https://www.bibsonomy.org/bibtex/2d909a1d33e8dcd0210b206b2a5d85935/earthfare},
booktitle = {Fuzzy Systems and Knowledge Discovery, 2007. FSKD 2007. Fourth International Conference on},
citeulike-article-id = {4544608},
description = {CiteULike: Everyone's library},
doi = {10.1109/FSKD.2007.142},
interhash = {6c30741aaddb2c872e16563cd78058df},
intrahash = {d909a1d33e8dcd0210b206b2a5d85935},
keywords = {crawling, indexing},
pages = {331--335},
posted-at = {2009-05-19 10:01:13},
priority = {2},
timestamp = {2009-05-19T18:03:27.000+0200},
title = {An Application of Improved PageRank in Focused Crawler},
url = {http://dx.doi.org/10.1109/FSKD.2007.142},
volume = {2},
year = {2007}
}