Abstract
Text document clustering plays an important role in providing intuitive
navigation and browsing mechanisms by organizing large amounts of information
into a small number of meaningful clusters. Standard partitional or agglomerative
clustering methods compute such clusterings efficiently.
However, the bag-of-words representation used by these clustering methods is often
unsatisfactory, as it ignores relationships between important terms that do not
co-occur literally. Moreover, it is mostly left to the user to figure out why a particular
partitioning has been produced, because the clusters are specified only extensionally.
To address these two problems, we integrate background knowledge into the process of
clustering text documents.
First, we preprocess the texts, enriching their representations with background knowledge
provided by a core ontology, in our application WordNet. Then we cluster
the documents with a partitional algorithm. Our experimental evaluation on Reuters
newsfeeds compares the clustering results against a pre-categorization of the news.
In these experiments, background knowledge improves the results over the baseline
for many interesting tasks.
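
The paper itself ships no code, but the first step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes NLTK's WordNet interface as the core ontology, scikit-learn's KMeans as the partitional algorithm, first-sense disambiguation, and a simple "append hypernym concepts to the token stream" enrichment strategy; the function name and the toy documents are hypothetical.

    # Minimal sketch (assumptions noted above), not the authors' method:
    # enrich each document's bag of words with WordNet concepts, then
    # cluster with a standard partitional algorithm.
    # Requires: pip install nltk scikit-learn, then nltk.download('wordnet')
    from nltk.corpus import wordnet as wn
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    def enrich_with_concepts(tokens, hypernym_depth=2):
        """Append WordNet concept ids to the token list, so that terms
        which never co-occur literally can still overlap in the vector
        space via shared hypernym concepts."""
        enriched = list(tokens)
        for tok in tokens:
            synsets = wn.synsets(tok, pos=wn.NOUN)
            if not synsets:
                continue
            node = synsets[0]  # naive disambiguation: first sense only
            for _ in range(hypernym_depth + 1):
                # underscores keep each concept id a single token
                enriched.append(node.name().replace('.', '_'))
                hypernyms = node.hypernyms()
                if not hypernyms:
                    break
                node = hypernyms[0]  # walk up the hypernym hierarchy
        return enriched

    docs = ["beef prices rose", "corn harvest fell", "pork exports grew"]
    texts = [" ".join(enrich_with_concepts(d.split())) for d in docs]
    vectors = TfidfVectorizer().fit_transform(texts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    print(labels)  # documents sharing ontology concepts tend to co-cluster
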
Second, the clustering partitions the large number of documents into a relatively small
number of clusters, which may then be analyzed by conceptual clustering. In our approach,
we apply Formal Concept Analysis. Conceptual clustering techniques are
known to be too slow for directly clustering several hundred documents, but they
give an intensional account of the cluster results: they allow for a concise description
of the commonalities and distinctions of different clusters. With background knowledge
they even find abstractions like “food” (vs. specializations like “beef” or “corn”).
Thus, in our approach, partitional clustering first reduces the size of the problem
so that it becomes tractable for conceptual clustering, which in turn facilitates the
understanding of the results.
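
For the second step, the following self-contained sketch computes all formal concepts of a toy context whose objects are clusters and whose attributes are ontology-derived terms. The brute-force closure computation and the toy data are illustrative assumptions, not the authors' implementation; it is only feasible because partitional clustering has already reduced the input to a handful of clusters, which is exactly the division of labor described above.

    # Minimal Formal Concept Analysis sketch (not the authors' code).
    # A context maps each object (here: a cluster) to its attribute set
    # (here: terms and concepts from the enriched representation).
    def formal_concepts(context):
        """Return all formal concepts (extent, intent) of the context.
        The intents of a context are exactly the intersections of the
        object intents (plus the full attribute set), so we close the
        object intents under pairwise intersection until a fixpoint."""
        all_attrs = frozenset().union(*context.values())
        intents = {all_attrs} | {frozenset(a) for a in context.values()}
        changed = True
        while changed:
            changed = False
            for a in list(intents):
                for b in list(intents):
                    c = a & b
                    if c not in intents:
                        intents.add(c)
                        changed = True
        # The extent of an intent is the set of objects having all its
        # attributes; for closed intents this yields a formal concept.
        return sorted(
            ((frozenset(o for o, attrs in context.items() if i <= attrs), i)
             for i in intents),
            key=lambda c: len(c[1]))

    # Toy context: "food" emerges as the shared abstraction of the
    # clusters about "beef" and "corn", as in the example above.
    context = {
        "cluster_1": {"food", "beef"},
        "cluster_2": {"food", "corn"},
        "cluster_3": {"metal", "gold"},
    }
    for extent, intent in formal_concepts(context):
        print(sorted(extent), "share", sorted(intent))
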