Article

Term frequencies for 235k language and literature texts

(March 2016)

Abstract

Corpus-level term statistics are useful for many text analysis tasks, such as term weighting and smoothing of probability distributions. When the corpus at hand is too small to estimate such statistics reliably, a general corpus of similar texts can serve as a fallback. This dataset provides statistics for a collection of 235k books from the HathiTrust that are classified as Language and Literature (class P in the Library of Congress Classification, LCC). For each term seen in these books, three counts are provided: book frequency, the number of books in which the term appears; page frequency, the number of pages on which the term appears; and term frequency, the total number of occurrences of the term. The data is derived from the holdings of the HathiTrust, using the Extracted Features dataset from the HathiTrust Research Center.
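
To make the three counts concrete, here is a minimal Python sketch that computes them for a toy corpus. It is an illustration only, not the pipeline used to build the dataset: the function name `corpus_stats`, the nested-list input layout (books, then pages, then tokens), and the sample tokens are all assumptions made for this example.

```python
from collections import Counter


def corpus_stats(books):
    """Return (book_freq, page_freq, term_freq) for a toy corpus.

    `books` is a hypothetical input layout: a list of books, where each
    book is a list of pages and each page is a list of token strings.
    """
    book_freq, page_freq, term_freq = Counter(), Counter(), Counter()
    for pages in books:
        seen_in_book = set()
        for tokens in pages:
            term_freq.update(tokens)      # every occurrence counts
            page_terms = set(tokens)
            page_freq.update(page_terms)  # at most once per page
            seen_in_book |= page_terms
        book_freq.update(seen_in_book)    # at most once per book
    return book_freq, page_freq, term_freq


# Two tiny "books": the first has two pages, the second has one.
books = [
    [["the", "whale", "the"], ["sea", "whale"]],
    [["the", "sea"]],
]
bf, pf, tf = corpus_stats(books)
print(bf["the"], pf["the"], tf["the"])
print(bf["whale"], pf["whale"], tf["whale"])
```

In this toy corpus "whale" appears on two pages of a single book, so its page frequency exceeds its book frequency, while "the" occurs twice on one page, so its term frequency exceeds its page frequency. Counts of this kind could feed a weighting such as inverse document frequency, with either books or pages treated as the document unit.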
