
SIPs are not infallible and may produce phrases that have no bearing on the content in general. Therefore, it is clear that there are phrases in text that are significant inasmuch as they signify the content of the document. We pose this research question: by selecting and displaying significant phrases, are we able to give users a sense of the general ideas, better understanding, and increased search power of the text? What properties do significant phrases possess and how can we identify them?

We will approach this problem from two perspectives: Knowledge Poor and Knowledge Rich. Knowledge Poor techniques rely on shallow text processing, which primarily uses information about word and collocation frequencies. From the Knowledge Rich perspective, we hope to use many computational linguistic techniques to intelligently parse documents and rank words to discover meaningful phrases. We have compared these two approaches in selecting significant phrases, and found that they should be combined to augment each other. The knowledge poor approach is robust and fast, but the knowledge rich approach has the advantage of tackling phrases relevant to the contents more precisely.

The problem of determining key words and phrases which best characterize a text document has important applications such as building a compact index for a large-scale text processing system, or using a keyword set for summarization and topic detection. We approached this problem from two perspectives. Our knowledge-poor approach is based on statistical collocation detection using the t-test and likelihood ratio, and applying latent semantic analysis to identify terms important in a particular document. The knowledge-rich approach addresses the problem using noun phrase chunking and coreference resolution. Both approaches use a decision tree classifier to decide whether a given phrase is a key word, based on a set of calculated features. We have built prototypes and compared results of these two approaches.
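To make the knowledge-poor idea more concrete, the following is a minimal sketch of how a t-test score for two-word collocations might be computed from raw frequencies. It is an illustration only and not the prototype described here: the function name `bigram_t_scores` is hypothetical, the input is assumed to be an already tokenized list of words, and the sample variance is approximated by the observed bigram probability, a common simplification when that probability is small.

```python
import math
from collections import Counter


def bigram_t_scores(tokens):
    """Return {(w1, w2): t} for every adjacent word pair in `tokens`.

    Hypothetical helper for illustration. The null hypothesis is that
    w1 and w2 occur independently of each other.
    """
    n = len(tokens)
    if n < 2:
        return {}
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = n - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        # Observed probability of the bigram (the sample mean).
        observed = count / n_bigrams
        # Expected probability if the two words were independent.
        expected = (unigrams[w1] / n) * (unigrams[w2] / n)
        # Assumption: variance p(1 - p) is approximated by p.
        variance = observed
        scores[(w1, w2)] = (observed - expected) / math.sqrt(variance / n_bigrams)
    return scores


if __name__ == "__main__":
    text = "new york is big and new york is old and the city is new"
    ranked = sorted(bigram_t_scores(text.split()).items(),
                    key=lambda item: item[1], reverse=True)
    for pair, t in ranked[:3]:
        print(pair, round(t, 3))
```

A higher score means the pair co-occurs more often than chance would predict. In a pipeline of the kind outlined above, such a score would be one of several features passed to the classifier rather than a decision on its own.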


Algorithm for key words detection based on SIPs (Statistically Improbable Phrases)

