Inproceedings

A study of using search engine page hits as a proxy for n-gram frequencies

Proceedings of Recent Advances in Natural Language Processing 2005 (2005)

Abstract

The idea of using the Web as a corpus for linguistic research is becoming increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines, as well as for the same search engine across time, finding that although there are measurable differences, they are not statistically significant for the task examined.
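The core idea the abstract describes can be sketched as follows: if page-hit counts are a usable proxy for n-gram frequencies, then the relative ordering of n-grams by count should be stable across engines, even when the absolute counts differ. A minimal sketch, using invented hit counts (the n-grams and numbers below are hypothetical, not data from the paper) and a hand-rolled Spearman rank correlation to compare two engines:

```python
# Hypothetical illustration: page-hit counts from two search engines
# as proxies for n-gram frequencies. All counts are invented.

hits_engine_a = {"strong tea": 1_200_000, "powerful tea": 45_000,
                 "strong computer": 310_000, "powerful computer": 980_000}
hits_engine_b = {"strong tea": 2_900_000, "powerful tea": 97_000,
                 "strong computer": 650_000, "powerful computer": 2_100_000}

def ranks(counts):
    """Map each n-gram to its rank (1 = most frequent)."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {ngram: i + 1 for i, ngram in enumerate(ordered)}

def spearman(a, b):
    """Spearman rank correlation between two count dictionaries
    over the same set of n-grams (no tie handling, for brevity)."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in ra)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Absolute counts differ by roughly 2x, but the ranking is identical,
# so the rank correlation is 1.0.
print(spearman(hits_engine_a, hits_engine_b))
```

A correlation near 1.0 would suggest that, despite differing absolute counts, the engines agree on relative frequency, which is what matters for many NLP tasks (e.g. choosing the more frequent of two candidate phrases).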
