a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed. These documents were obtained by searching the Yahoo and Google search engines for documents of specified file types that resided on web servers in the .gov domain. The search terms were words randomly chosen from the Unix dictionary, numbers randomly chosen between 1 and 1 million, and randomized combinations of the two.
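The sampling procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the collectors' actual tooling: the function name `random_queries`, the even mix of the three query kinds, and the `site:` / `filetype:` query syntax are all assumptions, and the word list is passed in rather than read from a particular dictionary file.

```python
import random


def random_queries(words, count, file_type, seed=None):
    """Build `count` random search queries restricted to .gov servers.

    Each query is a random dictionary word, a random number between
    1 and 1 million, or a randomized combination of the two.
    The 'site:' and 'filetype:' operators are an assumption about how
    the domain and file-type restrictions might be expressed.
    """
    rng = random.Random(seed)
    queries = []
    for _ in range(count):
        kind = rng.choice(("word", "number", "both"))  # assumed even mix
        word = rng.choice(words)
        number = str(rng.randint(1, 1_000_000))  # inclusive on both ends
        if kind == "word":
            terms = [word]
        elif kind == "number":
            terms = [number]
        else:
            terms = [word, number]
            rng.shuffle(terms)  # randomize the order of the combination
        queries.append(" ".join(terms) + f" site:.gov filetype:{file_type}")
    return queries


if __name__ == "__main__":
    # A tiny stand-in word list; a real run would load a full dictionary.
    for query in random_queries(["harbor", "ledger", "quartz"], 5, "pdf", seed=1):
        print(query)
```

Passing a seed makes a run repeatable, which is useful when a sample has to be regenerated or audited later.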
PIE incorporates a database derived from the second, or World, Edition of the British National Corpus (BNC 2000). It aims to provide a simple yet powerful interface, appropriate for both experienced researchers and novice users, for studying words and phrases up to eight words long.
MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories and blog posts per day from 1 million online sources, ranging from mass media to personal blogs. We track the quotes and phrases that appear most frequently over time across this entire online news spectrum. This makes it possible to see how different stories compete for news and blog coverage each day, and how certain stories persist while others fade quickly.
the Google Books corpus of American English, 155 billion words in size. Access is limited to what you can do via the website at Brigham Young University. The easiest thing to do is type in a word or phrase and see its frequency by decade, going back to the 1810s. The interface lets you look for collocates (words that go with other words), view charts showing relative word frequency in the corpus by decade, search by part of speech, and apply various limits and display options. Other kinds of analysis that might be done with text corpora can't be done through the interface.
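To make "relative word frequency by decade" concrete, here is a minimal sketch of one common normalization, occurrences per million words. It is not the website's own code: the function name `per_million_by_decade` and the example counts are hypothetical, and whether the site normalizes per million words is an assumption.

```python
def per_million_by_decade(hits, totals):
    """Return occurrences per million words for each decade.

    hits   -- raw counts of the word or phrase, keyed by decade
    totals -- total number of words in the corpus, keyed by decade
    Decades with no hits are reported as 0.0.
    """
    return {
        decade: hits.get(decade, 0) * 1_000_000 / totals[decade]
        for decade in sorted(totals)
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    hits = {1810: 5, 1820: 20}
    totals = {1810: 1_000_000, 1820: 2_000_000, 1830: 4_000_000}
    for decade, rate in per_million_by_decade(hits, totals).items():
        print(decade, rate)
```

Normalizing by each decade's total matters because the decades differ greatly in size; raw counts alone would mostly reflect how much text survives from each period.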