
Semi-automatic web resource discovery using ontology-focused crawling

Agder University College, Grimstad, Norway (May 2005)

Abstract

The enormous amount of information available on the Internet makes it difficult to find relevant resources using regular breadth-first crawlers. Focused crawlers seek to download only web pages that are relevant to the user and to avoid downloading irrelevant ones. Ontologies have recently been proposed as a tool for defining the target domain for focused crawlers.

In this project we have developed a prototype of an ontology-focused crawler by adding modules to the open source Java crawler Heritrix. One module measures the relevance of web pages in relation to an ontology describing the area of interest. A second module performs link analysis to determine the importance of web pages; it uses the link analysis component from the open source search engine Nutch. The importance measure is used to ensure that the most important web pages are downloaded first.

This thesis also contains an evaluation of several open source crawlers. We found that Heritrix was the easiest to extend and the best suited for our purpose, so our prototype is built upon it.

To measure the performance of the prototype, several test crawls with different settings have been carried out. Focused crawlers are often evaluated by harvest rate, which is the ratio of the number of relevant web pages to the total number of web pages downloaded. The prototype performed well in the tests: in one of them it reached a harvest rate of about 0.55, whereas a similar unfocused crawl reached only about 0.15. Both the prototype and the algorithm are designed to be easily configured. More testing and adjustment of the settings could improve the performance of the prototype even further, but we have shown that ontologies are a suitable technology for creating focused crawlers.
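The harvest rate described in the abstract can be illustrated with a minimal Java sketch. It is not taken from the thesis or from Heritrix: the class name `HarvestRate`, the relevance threshold, and the sample scores are hypothetical and chosen only to show the arithmetic.

```java
import java.util.List;

/**
 * Illustrative sketch only: computes harvest rate as
 * (relevant pages downloaded) / (all pages downloaded).
 * Names and the threshold-based relevance test are assumptions,
 * not the thesis's implementation.
 */
public final class HarvestRate {
    private HarvestRate() {}

    /** Returns relevant / downloaded, or 0.0 when nothing was downloaded. */
    public static double harvestRate(long relevant, long downloaded) {
        if (relevant < 0 || downloaded < 0 || relevant > downloaded) {
            throw new IllegalArgumentException("need 0 <= relevant <= downloaded");
        }
        return downloaded == 0 ? 0.0 : (double) relevant / downloaded;
    }

    /**
     * Treats a page as relevant when its score is at or above the threshold.
     * The scores stand in for whatever relevance measure the crawler produces.
     */
    public static double harvestRate(List<Double> scores, double threshold) {
        long relevant = scores.stream().filter(s -> s >= threshold).count();
        return harvestRate(relevant, scores.size());
    }

    public static void main(String[] args) {
        // Made-up page counts that happen to reproduce the reported rates.
        System.out.println("focused:   " + harvestRate(55, 100));
        System.out.println("unfocused: " + harvestRate(15, 100));

        // Hypothetical per-page relevance scores with a hypothetical cutoff.
        List<Double> scores = List.of(0.9, 0.2, 0.7, 0.1);
        System.out.println("scored:    " + harvestRate(scores, 0.5));
    }
}
```

Under this definition, a harvest rate of 0.55 means that roughly 55 of every 100 downloaded pages were judged relevant, compared with about 15 of every 100 in the unfocused crawl.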
