A Fast Template-Based Approach to Automatically Identify Primary Text Content of a Web Page

D. Nguyen, D. Nguyen, S. Pham, and T. Bui.
Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, pp. 232--236. IEEE Computer Society, (2009)

Abstract

Search engines have become an indispensable tool for browsing information on the Internet. Users, however, are often annoyed by redundant results from irrelevant web pages. One reason is that search engines also index noninformative blocks of web pages, such as advertisements and navigation links. In this paper, we propose a fast algorithm called FastContentExtractor that automatically detects the main content blocks in a web page by improving the ContentExtractor algorithm. By automatically identifying and storing templates that represent the structure of content blocks in a website, the content blocks of a new web page from that website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks appear in the same order as in the original page.
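
The template idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's FastContentExtractor implementation. It assumes a page has already been split into an ordered list of (block path, text) pairs, and it treats a block position as content when its text varies across sample pages from the same site. The names `learn_template`, `extract_content`, and the `max_share` threshold are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Assumption: a page is an ordered list of (block_path, text) pairs.
# This data model is illustrative and is not taken from the paper.


def learn_template(sample_pages, max_share=0.5):
    """Return the set of block paths whose text varies across sample pages.

    A path is treated as boilerplate when a single text value occurs on
    more than `max_share` of the sample pages (for example a navigation bar).
    """
    texts_by_path = defaultdict(list)
    for page in sample_pages:
        for path, text in page:
            texts_by_path[path].append(text)

    template = set()
    for path, texts in texts_by_path.items():
        # Count how often the most frequent text appears at this path.
        most_common = max(texts.count(t) for t in set(texts))
        if most_common / len(sample_pages) <= max_share:
            template.add(path)
    return template


def extract_content(page, template):
    """Keep only blocks whose path is in the stored template.

    Iterating over the page in its original order preserves block order.
    """
    return [text for path, text in page if path in template]


def make_page(title, body):
    """Build a toy page with shared navigation and footer blocks."""
    return [
        ("html/body/nav", "Home | About"),
        ("html/body/div/h1", title),
        ("html/body/div/p", body),
        ("html/body/footer", "Copyright 2009"),
    ]


samples = [make_page("Title A", "Body A"),
           make_page("Title B", "Body B"),
           make_page("Title C", "Body C")]
site_template = learn_template(samples)
new_page_content = extract_content(make_page("Title D", "Body D"), site_template)
```

Once the template is stored, extracting content from a new page is a single pass with a set lookup per block, which is the kind of speed-up the abstract attributes to reusing templates. How the paper actually builds and matches its templates may differ from this sketch.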

BibTeX key: NguyenNPB09
entry type: inproceedings
booktitle: Proceedings of the 2009 International Conference on Knowledge and Systems Engineering
year: 2009
pages: 232--236
publisher: IEEE Computer Society
series: KSE 2009
url: http://dx.doi.org/10.1109/KSE.2009.39
