Term frequencies for 235k language and literature texts

Abstract

Corpus-level term statistics are valuable for many text analysis tasks, such as term weighting or smoothing probability distributions. When the corpus at hand is too small to calculate such statistics, falling back on a general corpus of similar texts is useful. This dataset provides statistics for a collection of 235k books from the HathiTrust that are classified as Language and Literature (i.e. class P in the Library of Congress Classification, LCC). For each term seen in these books, three values are provided: book frequency, page frequency, and term frequency. Book frequency is the number of books in which the term appears, page frequency is the number of pages on which the term appears, and term frequency is the total number of occurrences of the term. The data is derived from the holdings of the HathiTrust, using the Extracted Features dataset from the HathiTrust Research Center.
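The three statistics defined in the abstract can be illustrated with a minimal sketch. It assumes a toy in-memory corpus in which each book is a list of pages and each page is a list of tokens; the function names `corpus_term_stats` and `idf`, and the toy data, are hypothetical and are not part of the dataset or of the HathiTrust tooling. The inverse-document-frequency helper is only one common example of term weighting, not a method the dataset prescribes.

```python
import math
from collections import Counter


def corpus_term_stats(books):
    """Return (book_freq, page_freq, term_freq) for a corpus.

    `books` is an iterable of books; each book is a list of pages;
    each page is a list of string tokens (an assumed toy layout).
    """
    book_freq = Counter()  # number of books containing the term
    page_freq = Counter()  # number of pages containing the term
    term_freq = Counter()  # total occurrences of the term
    for pages in books:
        terms_in_book = set()
        for tokens in pages:
            term_freq.update(tokens)       # count every occurrence
            page_terms = set(tokens)
            page_freq.update(page_terms)   # count each term once per page
            terms_in_book |= page_terms
        book_freq.update(terms_in_book)    # count each term once per book
    return book_freq, page_freq, term_freq


def idf(term, book_freq, n_books):
    """Plain inverse document frequency, treating each book as a document."""
    return math.log(n_books / book_freq[term])


# Hypothetical toy corpus: two books, three pages in total.
toy_books = [
    [["the", "whale", "the"], ["sea", "whale"]],
    [["the", "sea"]],
]
bf, pf, tf = corpus_term_stats(toy_books)
print(bf["whale"], pf["whale"], tf["whale"])
```

In a fallback setting, the counters computed here would be replaced by the values read from the published dataset, so that a small collection can borrow weights from the larger general corpus.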

BibTeX key: organisciak_term_2016
Entry type: article
Year: 2016
Month: mar
language: en
urldate: 2016-03-23
URL: https://www.ideals.illinois.edu/handle/2142/89515
