Abstract
The key factors in the success of the World Wide Web are its large size and the lack of centralized control
over its contents. These same two properties are also the main source of problems for locating information. The
Web is a context in which traditional Information Retrieval methods are challenged, and given the volume
of the Web and its speed of change, the coverage of modern search engines is relatively small. Moreover,
the distribution of quality is very skewed, and interesting pages are scarce in comparison with the rest of the
content.
Web crawling is the process used by search engines to collect pages from the Web. This thesis studies
Web crawling at several different levels, ranging from the long-term goal of crawling important pages first,
to the short-term goal of using the network connectivity efficiently, including implementation issues that are
essential for crawling in practice.
We start by designing a new model and architecture for a Web crawler that tightly integrates the crawler
with the rest of the search engine, providing access to the metadata and links of the documents that can be
used to guide the crawling process effectively. We implement this design in the WIRE project as an efficient
Web crawler that provides an experimental framework for this research. In fact, we have used our crawler to
characterize the Chilean Web, using the results as feedback to improve the crawler design.
We argue that the number of pages on the Web can be considered infinite, and given that a Web crawler
cannot download all the pages, it is important to capture the most important ones as early as possible during
the crawling process. We propose, study, and implement algorithms for achieving this goal, showing that we
can crawl 50% of a large Web collection and capture 80% of its total PageRank value in both simulated and
real Web environments.
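
The idea of capturing important pages early can be illustrated with a minimal sketch. It is not the thesis's algorithm: the ordering heuristic (download next the page with the most links seen so far from already-downloaded pages), the toy link graph, and the function names pagerank, crawl_order, and captured_fraction are all assumptions made for illustration. The sketch computes PageRank on the full graph and then measures how much of it the first half of the crawl captures.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [outlinks]}."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            out = graph[p]
            if out:
                share = damping * rank[p] / len(out)
                for q in out:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


def crawl_order(graph, seed):
    """Hypothetical scheduler: always download the frontier page with the
    most in-links discovered so far (ties broken alphabetically)."""
    known_inlinks = {seed: 0}  # frontier: discovered but not yet downloaded
    done = set()
    order = []
    while known_inlinks:
        page = max(sorted(known_inlinks), key=known_inlinks.get)
        del known_inlinks[page]
        done.add(page)
        order.append(page)
        for target in graph[page]:
            if target not in done:
                known_inlinks[target] = known_inlinks.get(target, 0) + 1
    return order


def captured_fraction(order, rank, crawl_fraction):
    """Share of total PageRank held by the first crawl_fraction of the crawl."""
    cutoff = int(len(order) * crawl_fraction)
    return sum(rank[p] for p in order[:cutoff]) / sum(rank.values())


# Toy link graph (an assumption, not data from the thesis).
graph = {
    "home": ["news", "about", "products"],
    "news": ["home", "story1", "story2"],
    "about": ["home"],
    "products": ["home", "item1", "item2"],
    "story1": ["news", "home"],
    "story2": ["news"],
    "item1": ["products", "home"],
    "item2": ["products"],
}

rank = pagerank(graph)
order = crawl_order(graph, "home")
half = captured_fraction(order, rank, 0.5)
print(order)
print(f"PageRank captured after crawling 50% of pages: {half:.2f}")
```

On a graph this small the numbers carry no weight; the point is only the shape of the experiment: fix an ordering strategy, crawl, and plot cumulative PageRank against the fraction of pages downloaded.
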
We also model and study user browsing behavior in Web sites, concluding that it is not necessary to
go deeper than five levels from the home page to capture most of the pages actually visited by people, and
support this conclusion with log analysis of several Web sites. We also propose several mechanisms for
server cooperation to reduce network traffic and improve the representation of a Web page in a search engine
with the help of Web site managers.
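
The five-level observation suggests a simple depth cutoff for a per-site crawl. The following is a minimal sketch under assumptions: the site is represented as a hypothetical dict of links, depth is measured as the shortest click distance from the home page, and the function name pages_within_depth is illustrative rather than part of the WIRE crawler.

```python
from collections import deque


def pages_within_depth(site_links, home, max_depth=5):
    """Breadth-first traversal that records each page's click distance from
    the home page and does not follow links beyond max_depth."""
    depth = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # do not expand pages already at the depth limit
        for target in site_links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Toy site (an assumption): a single chain p0 -> p1 -> ... -> p7.
site = {f"p{i}": [f"p{i + 1}"] for i in range(7)}
reached = pages_within_depth(site, "p0")
print(sorted(reached.items(), key=lambda item: item[1]))
```

With the default limit, the traversal stops at p5 and never requests p6 or p7, which is how a crawler could avoid descending into very deep or dynamically generated page chains.
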