Abstract
Text document clustering plays an important role in providing intuitive
navigation and browsing mechanisms by organizing large amounts of information
into a small number of meaningful clusters. Standard partitional or agglomerative
clustering methods compute such clusterings efficiently.
However, the bag-of-words representation used by these clustering methods is often
unsatisfactory, as it ignores relationships between important terms that do not
co-occur literally. Moreover, it is mostly left to the user to figure out why a particular
partitioning has been produced, because the clusters are specified only extensionally.
To address these two problems, we integrate background knowledge into the process of
clustering text documents.
First, we preprocess the texts, enriching their representations with background knowledge
provided by a core ontology, in our application WordNet. Then we cluster
the documents with a partitional algorithm. Our experimental evaluation on Reuters
newsfeeds compares the clustering results against a pre-categorization of the news.
In these experiments, background knowledge improves the results over the baseline
for many interesting tasks.
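
The paper itself ships no code, but the first step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes NLTK's WordNet interface as the core ontology, scikit-learn's KMeans as the partitional algorithm, first-sense disambiguation, and a simple "append hypernym concepts to the token stream" enrichment strategy; the function name and the toy documents are hypothetical.

    # Minimal sketch (assumptions noted above), not the authors' method:
    # enrich each document's bag of words with WordNet concepts, then
    # cluster with a standard partitional algorithm.
    # Requires: pip install nltk scikit-learn, then nltk.download('wordnet')
    from nltk.corpus import wordnet as wn
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    def enrich_with_concepts(tokens, hypernym_depth=2):
        """Append WordNet concept ids to the token list, so that terms
        which never co-occur literally can still overlap in the vector
        space via shared hypernym concepts."""
        enriched = list(tokens)
        for tok in tokens:
            synsets = wn.synsets(tok, pos=wn.NOUN)
            if not synsets:
                continue
            node = synsets[0]  # naive disambiguation: first sense only
            for _ in range(hypernym_depth + 1):
                # underscores keep each concept id a single token
                enriched.append(node.name().replace('.', '_'))
                hypernyms = node.hypernyms()
                if not hypernyms:
                    break
                node = hypernyms[0]  # walk up the hypernym hierarchy
        return enriched

    docs = ["beef prices rose", "corn harvest fell", "pork exports grew"]
    texts = [" ".join(enrich_with_concepts(d.split())) for d in docs]
    vectors = TfidfVectorizer().fit_transform(texts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    print(labels)  # documents sharing ontology concepts tend to co-cluster
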
Second, the clustering partitions the large number of documents into a relatively small
number of clusters, which may then be analyzed by conceptual clustering. In our approach,
we apply Formal Concept Analysis. Conceptual clustering techniques are
known to be too slow for directly clustering several hundred documents, but they
give an intensional account of the cluster results: they allow for a concise description
of the commonalities and distinctions of different clusters. With background knowledge
they even find abstractions like “food” (vs. specializations like “beef” or “corn”).
Thus, in our approach, partitional clustering first reduces the size of the problem
so that it becomes tractable for conceptual clustering, which in turn facilitates the
understanding of the results.
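
For the second step, the following self-contained sketch computes all formal concepts of a toy context whose objects are clusters and whose attributes are ontology-derived terms. The brute-force closure computation and the toy data are illustrative assumptions, not the authors' implementation; it is only feasible because partitional clustering has already reduced the input to a handful of clusters, which is exactly the division of labor described above.

    # Minimal Formal Concept Analysis sketch (not the authors' code).
    # A context maps each object (here: a cluster) to its attribute set
    # (here: terms and concepts from the enriched representation).
    def formal_concepts(context):
        """Return all formal concepts (extent, intent) of the context.
        The intents of a context are exactly the intersections of the
        object intents (plus the full attribute set), so we close the
        object intents under pairwise intersection until a fixpoint."""
        all_attrs = frozenset().union(*context.values())
        intents = {all_attrs} | {frozenset(a) for a in context.values()}
        changed = True
        while changed:
            changed = False
            for a in list(intents):
                for b in list(intents):
                    c = a & b
                    if c not in intents:
                        intents.add(c)
                        changed = True
        # The extent of an intent is the set of objects having all its
        # attributes; for closed intents this yields a formal concept.
        return sorted(
            ((frozenset(o for o, attrs in context.items() if i <= attrs), i)
             for i in intents),
            key=lambda c: len(c[1]))

    # Toy context: "food" emerges as the shared abstraction of the
    # clusters about "beef" and "corn", as in the example above.
    context = {
        "cluster_1": {"food", "beef"},
        "cluster_2": {"food", "corn"},
        "cluster_3": {"metal", "gold"},
    }
    for extent, intent in formal_concepts(context):
        print(sorted(extent), "share", sorted(intent))
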