Abstract
We compare several algorithms for identifying mirrored
hosts on the World Wide Web. The algorithms operate on
the basis of URL strings and linkage data: the type of
information easily available from web proxies and
crawlers. Identification of mirrored hosts can improve
web-based information retrieval in several ways. First,
by identifying mirrored hosts, search engines can avoid
storing and returning duplicate documents. Second,
several new information retrieval techniques for the
Web make inferences based on the explicit links among
hypertext documents; mirroring perturbs their graph
model and degrades performance. Third, mirroring
information can be used to redirect users to alternate
mirror sites to compensate for various failures, and
can thus improve the performance of web browsers and
proxies.