Abstract
Standard techniques for web page classification usually take a simple
text-based approach that discards most of the information provided by the
visual layout of a page. We propose a new classification approach in which
visual layout analysis is carried out before standard classification
techniques are applied. A page is represented as a hierarchical structure,
the Visual Adjacency Multigraph, in which nodes represent simple HTML
objects (text, images) and directed edges reflect the spatial relations
‘immediately before’, ‘immediately after’, ‘immediately left’ and
‘immediately right’ on the browser screen. Using the visual information
contained in the multigraph, one can define heuristics for recognizing
common page entities such as vertical and horizontal link lists, titles and
subtitles, and paragraphs of text. Visual analysis yields a more accurate
representation of the page contents, one that splits the text features into
different subsets according to the groups they belong to. Finally, we
introduce a classification system that takes the proposed layout analysis
into account and clearly outperforms a standard bag-of-words approach.
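To make the representation concrete, the following is a minimal sketch of how a visual adjacency multigraph and one layout heuristic might look. It is an illustration under stated assumptions, not the authors' implementation: the class and function names, the node kinds ("text", "image", "link"), the reading of 'before'/'after' as vertical neighbours, and the rule "at least three vertically stacked links form a vertical link list" are all hypothetical choices made here.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    kind: str       # assumed labels: "text", "image" or "link"
    text: str = ""


class VisualAdjacencyMultigraph:
    """Nodes are simple page objects; labelled directed edges are spatial relations."""

    RELATIONS = ("before", "after", "left", "right")

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)   # (source id, relation) -> [target ids]

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, relation, dst):
        # Convention assumed here: dst lies immediately <relation> src.
        if relation not in self.RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges[(src, relation)].append(dst)

    def stack(self, upper, lower):
        # Assumption: 'before'/'after' describe vertical neighbours on screen.
        self.add_edge(upper, "after", lower)
        self.add_edge(lower, "before", upper)

    def neighbours(self, src, relation):
        return self.edges.get((src, relation), [])


def vertical_link_lists(graph, min_len=3):
    """Hypothetical heuristic: chains of links joined by 'after' edges."""
    found, seen = [], set()
    for node_id, node in graph.nodes.items():
        if node.kind != "link" or node_id in seen:
            continue
        # Start only at the head of a chain (no link immediately before it).
        if any(graph.nodes[p].kind == "link"
               for p in graph.neighbours(node_id, "before")):
            continue
        chain, current = [node_id], node_id
        while True:
            nxt = [n for n in graph.neighbours(current, "after")
                   if graph.nodes[n].kind == "link" and n not in chain]
            if not nxt:
                break
            current = nxt[0]
            chain.append(current)
        if len(chain) >= min_len:
            found.append(chain)
            seen.update(chain)
    return found


def grouped_bag_of_words(graph, link_lists):
    """Split word counts into subsets according to the recognized group."""
    in_list = {n for chain in link_lists for n in chain}
    features = {"link_list": Counter(), "other": Counter()}
    for node_id, node in graph.nodes.items():
        group = "link_list" if node_id in in_list else "other"
        features[group].update(node.text.lower().split())
    return features


def build_example_page():
    """A made-up page: a title, three stacked links, and a text block."""
    g = VisualAdjacencyMultigraph()
    g.add_node(Node(1, "text", "Products overview"))
    g.add_node(Node(2, "link", "Home"))
    g.add_node(Node(3, "link", "News"))
    g.add_node(Node(4, "link", "Contact"))
    g.add_node(Node(5, "text", "Welcome home"))
    g.stack(1, 2)
    g.stack(2, 3)
    g.stack(3, 4)
    g.add_edge(2, "right", 5)
    g.add_edge(5, "left", 2)
    return g
```

On the example page, the word "home" is counted once in the link-list subset and once in the remaining text, whereas a plain bag-of-words model would merge the two occurrences into a single feature. Keeping such subsets apart is the kind of separation the abstract describes.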