
The Automatic Construction of Large-Scale Corpora for Summarization Research.

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 137-144. ACM, (1999)


Summarization research is notorious for its lack of adequate corpora: today, there exist only a few small collections of texts whose units have been manually annotated for textual importance. Given the cost and tediousness of the annotation process, it is very unlikely that we will ever manually annotate for textual importance sufficiently large corpora of texts. To circumvent this problem, we have developed an algorithm that constructs such corpora automatically. Our algorithm takes as input an <Abstract, Text> tuple and generates the corresponding Extract, i.e., the set of clauses (sentences) in the Text that were used to write the Abstract. The performance of the algorithm is shown to be close to that of humans by means of an empirical experiment. The experiment also suggests extraction strategies that could improve the performance of automatic summarization systems.
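The abstract states only the input and output of the algorithm, not its internals. To make the <Abstract, Text> to Extract mapping concrete, the following is a minimal sketch under stated assumptions: it greedily drops sentences from the Text for as long as doing so raises the bag-of-words cosine similarity between the remaining sentences and the Abstract. The similarity measure, the greedy stopping rule, and the names `bag_of_words`, `cosine`, and `build_extract` are illustrative choices and should not be read as the procedure described in the paper.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts (an assumed, deliberately simple representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_extract(abstract, sentences):
    """Return the sentences of the Text kept as the Extract.

    Hypothetical greedy strategy: repeatedly remove the sentence whose
    removal most increases similarity to the Abstract, and stop when no
    single removal helps.
    """
    target = bag_of_words(abstract)
    kept = list(range(len(sentences)))

    def score(indices):
        joined = " ".join(sentences[i] for i in indices)
        return cosine(bag_of_words(joined), target)

    best = score(kept)
    while len(kept) > 1:
        # Score every candidate set obtained by dropping one sentence.
        candidates = [(score([i for i in kept if i != j]), j) for j in kept]
        top_score, drop = max(candidates)
        if top_score <= best:
            break  # no removal improves similarity to the Abstract
        best = top_score
        kept.remove(drop)
    return [sentences[i] for i in kept]
```

Called with an Abstract and a list of Text sentences, `build_extract` returns the subset of sentences it retains, in their original order. Clause-level units, as mentioned in the abstract, would require a clause segmenter that this sketch does not attempt.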
