
Mining the web for discourse markers

Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 407-410. (2004)


This paper proposes a methodology for obtaining sentences containing discourse markers from the World Wide Web. The proposed methodology is particularly suitable for collecting large numbers of discourse marker tokens. It relies on the automatic identification of discourse markers, and we show that this can be done with an accuracy within 9% of human performance. We also show that the distribution of discourse markers on the web correlates highly with that in a conventional balanced corpus.
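The distribution comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's procedure: the marker list, the function names, and the choice of Pearson correlation are all assumptions made for illustration. Plain string matching also counts non-discourse uses of a word (for example "so" as an intensifier), which is the identification problem the abstract refers to.

```python
import math
import re
from collections import Counter

# Hypothetical marker inventory, chosen only for illustration.
MARKERS = ["because", "although", "however", "but", "so"]


def count_markers(sentences, markers=MARKERS):
    """Count whole-word, case-insensitive occurrences of each marker string."""
    counts = Counter({m: 0 for m in markers})
    for sentence in sentences:
        for m in markers:
            pattern = r"\b" + re.escape(m) + r"\b"
            counts[m] += len(re.findall(pattern, sentence, flags=re.IGNORECASE))
    return counts


def relative_frequencies(counts, markers=MARKERS):
    """Turn raw counts into proportions, in the fixed order of `markers`."""
    total = sum(counts[m] for m in markers)
    if total == 0:
        return [0.0 for _ in markers]
    return [counts[m] / total for m in markers]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy stand-ins for web-derived and balanced-corpus sentences.
    web = ["I stayed home because it rained.", "However, we left, but late."]
    ref = ["She left because she was tired.", "It was late, but we stayed."]
    web_freq = relative_frequencies(count_markers(web))
    ref_freq = relative_frequencies(count_markers(ref))
    print(pearson(web_freq, ref_freq))
```

With real data, the two sentence lists would be replaced by the web sample and the reference corpus, and a rank correlation could be substituted for Pearson if the frequency distribution is heavily skewed.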
