Abstract
This paper shows how citation-based information and
structural content (e.g., title, abstract) can be
combined to improve classification of text documents
into predefined categories. We evaluate different
measures of similarity -- five derived from the
citation information of the collection, and three
derived from the structural content -- and determine
how they can be fused to improve classification
effectiveness. To discover the best fusion framework,
we apply Genetic Programming (GP) techniques. Our
experiments with the ACM Computing Classification
Scheme, using documents from the ACM Digital Library,
indicate that GP can discover similarity functions
superior to those based solely on a single type of
evidence. The discovered similarity functions, applied
through simple majority voting, are more effective
than both content-based and combination-based
Support Vector Machine classifiers. We also
compared GP with other fusion techniques such as Genetic
Algorithms (GA) and linear fusion. Empirical results
show that GP was able to discover better similarity
functions than GA or other fusion techniques.
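
As a rough illustration of the linear fusion baseline and majority voting mentioned above, the following minimal sketch combines several similarity scores with fixed weights and labels a document by the majority class among its most similar training documents. The function names, the nearest-neighbour selection step, the weights, and the toy data are all assumptions made for illustration; the abstract does not specify these details.

```python
from collections import Counter

def linear_fusion(scores, weights):
    """Weighted sum of per-evidence similarity scores for one document pair."""
    return sum(w * s for w, s in zip(weights, scores))

def classify_by_majority_vote(query_sims, train_labels, weights, k=3):
    """query_sims[i] holds the similarity scores (one per evidence type)
    between the query document and training document i."""
    fused = [linear_fusion(s, weights) for s in query_sims]
    # Indices of the k training documents most similar to the query.
    top_k = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:k]
    votes = Counter(train_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two evidence types per training document
# (say, one citation-based score and one content-based score).
query_sims = [(0.9, 0.2), (0.1, 0.8), (0.7, 0.6), (0.2, 0.1)]
train_labels = ["H.3", "I.2", "H.3", "D.2"]
predicted = classify_by_majority_vote(query_sims, train_labels,
                                      weights=(0.5, 0.5), k=3)
print(predicted)
```

A GP approach would, in place of the fixed weighted sum, search over arbitrary expression trees built from the same similarity scores; that search is not shown here.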