The problem of determining key words and phrases which best characterize a text
document has important applications such as building a compact index for a
large-scale text processing system, or using a keyword set for summarization and
topic detection. We approached this problem from two perspectives. Our
knowledge-poor approach is based on statistical collocation detection using the
t-test and likelihood ratio, and applying latent semantic analysis to identify
terms important in a particular document. The knowledge-rich approach addresses
the problem using noun phrase chunking and coreference resolution. Both
approaches use a decision tree classifier to decide whether a given phrase is a
key word, based on the set of calculated features. We have built prototypes and
compared results of these two approaches.

SIPs are not infallible and may produce phrases that have no bearing on the
content in general. Therefore, it is clear that there are phrases in text that
are significant inasmuch as they signify the content of the document. We pose
this research question: by selecting and displaying significant phrases, are we
able to give users a sense of the general ideas, better understanding, and
increased search power of the text? What properties do significant phrases
possess and how can we identify them?

We will approach this problem from two perspectives: Knowledge Poor and
Knowledge Rich. Knowledge Poor techniques rely on using shallow text processing
which primarily utilizes the information about word and collocation
frequencies. From the Knowledge Rich perspective, we hope to use many
computational linguistic techniques to intelligently parse documents and rank
words to discover meaningful phrases. We have compared these two approaches in
selecting significant phrases, and found that they should be combined to
augment each other. The knowledge poor approach is robust and fast, but the
knowledge rich approach has the advantage of tackling phrases relevant to the
contents more precisely.
Description
Algorithm for key words detection based on SIPs
(Statistically Improbable Phrases)
%0 Journal Article
%1 BautinHart2007
%A Hart, Michael
%A Bautin, Mikhail
%D 2007
%K algorithms detection idiom keywords phrases similarity
%T Significant Phrases Detection
%X The problem of determining key words and phrases which best characterize a text
document has important applications such as building a compact index for a
large-scale text processing system, or using a keyword set for summarization and
topic detection. We approached this problem from two perspectives. Our
knowledge-poor approach is based on statistical collocation detection using the
t-test and likelihood ratio, and applying latent semantic analysis to identify
terms important in a particular document. The knowledge-rich approach addresses
the problem using noun phrase chunking and coreference resolution. Both
approaches use a decision tree classifier to decide whether a given phrase is a
key word, based on the set of calculated features. We have built prototypes and
compared results of these two approaches.
SIPs are not infallible and may produce phrases that have no bearing on the
content in general. Therefore, it is clear that there are phrases in text that
are significant inasmuch as they signify the content of the document. We pose
this research question: by selecting and displaying significant phrases, are we
able to give users a sense of the general ideas, better understanding, and
increased search power of the text? What properties do significant phrases
possess and how can we identify them?
We will approach this problem from two perspectives: Knowledge Poor and
Knowledge Rich. Knowledge Poor techniques rely on using shallow text processing
which primarily utilizes the information about word and collocation
frequencies. From the Knowledge Rich perspective, we hope to use many
computational linguistic techniques to intelligently parse documents and rank
words to discover meaningful phrases. We have compared these two approaches in
selecting significant phrases, and found that they should be combined to
augment each other. The knowledge poor approach is robust and fast, but the
knowledge rich approach has the advantage of tackling phrases relevant to the
contents more precisely.
@article{BautinHart2007,
abstract = {The problem of determining key words and phrases which best characterize a text
document has important applications such as building a compact index for a
large-scale text processing system, or using a keyword set for summarization and
topic detection. We approached this problem from two perspectives. Our
knowledge-poor approach is based on statistical collocation detection using the
t-test and likelihood ratio, and applying latent semantic analysis to identify
terms important in a particular document. The knowledge-rich approach addresses
the problem using noun phrase chunking and coreference resolution. Both
approaches use a decision tree classifier to decide whether a given phrase is a
key word, based on the set of calculated features. We have built prototypes and
compared results of these two approaches.
SIPs are not infallible and may produce phrases that have no bearing on the
content in general. Therefore, it is clear that there are phrases in text that
are significant inasmuch as they signify the content of the document. We pose
this research question: by selecting and displaying significant phrases, are we
able to give users a sense of the general ideas, better understanding, and
increased search power of the text? What properties do significant phrases
possess and how can we identify them?
We will approach this problem from two perspectives: Knowledge Poor and
Knowledge Rich. Knowledge Poor techniques rely on using shallow text processing
which primarily utilizes the information about word and collocation
frequencies. From the Knowledge Rich perspective, we hope to use many
computational linguistic techniques to intelligently parse documents and rank
words to discover meaningful phrases. We have compared these two approaches in
selecting significant phrases, and found that they should be combined to
augment each other. The knowledge poor approach is robust and fast, but the
knowledge rich approach has the advantage of tackling phrases relevant to the
contents more precisely.
},
added-at = {2010-12-23T18:55:37.000+0100},
author = {Hart, Michael and Bautin, Mikhail},
biburl = {https://www.bibsonomy.org/bibtex/27c4b26cc63190a1bc27161b5f425b2f4/dzibold},
description = {Algorithm for key words detection based on SIPs
(Statistically Improbable Phrases)},
interhash = {635c90448b1f1a8c5b000ea4f578c06a},
intrahash = {7c4b26cc63190a1bc27161b5f425b2f4},
keywords = {algorithms detection idiom keywords phrases similarity},
timestamp = {2010-12-23T18:55:38.000+0100},
title = {Significant Phrases Detection},
year = 2007
}