The SemCor corpus
The SemCor corpus, created at Princeton University, is a subset of the English Brown corpus containing almost 700,000 running words. In SemCor, all words are tagged with part of speech (PoS), and more than 200,000 content words are also lemmatized and sense-tagged according to Princeton WordNet 1.6.
In more detail, the SemCor corpus is composed of 352 texts. In 186 texts, all open-class words (nouns, verbs, adjectives, and adverbs) are annotated with PoS, lemma, and sense, while in the remaining 166 texts only verbs are annotated with lemma and sense.
The "all-words" component of SemCor has 359,732 tokens, of which 192,639 are semantically annotated, while the "only-verbs" component has 316,814 tokens, of which 41,497 verb occurrences are semantically annotated.
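The figures above can be cross-checked with a little arithmetic. The following is a minimal sketch in Python that uses only the counts quoted in this description. The names `COMPONENTS` and `summarize` are illustrative and not part of any SemCor distribution or tool.

```python
# Counts copied from the description above; nothing is read from the corpus.
COMPONENTS = {
    "all-words": {"texts": 186, "tokens": 359_732, "annotated": 192_639},
    "only-verbs": {"texts": 166, "tokens": 316_814, "annotated": 41_497},
}


def summarize(components):
    """Return corpus-wide totals and the annotated share of each component."""
    totals = {
        key: sum(c[key] for c in components.values())
        for key in ("texts", "tokens", "annotated")
    }
    # Share of tokens that carry a semantic annotation, per component.
    totals["coverage"] = {
        name: c["annotated"] / c["tokens"] for name, c in components.items()
    }
    return totals


summary = summarize(COMPONENTS)
print(summary["texts"], summary["tokens"], summary["annotated"])
for name, share in summary["coverage"].items():
    print(f"{name}: {share:.1%} of tokens semantically annotated")
```

The totals come to 352 texts, 676,546 tokens, and 234,136 semantically annotated tokens, consistent with the "almost 700,000 running words" and "more than 200,000" sense-tagged words stated earlier.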
Different versions of SemCor are available for downloading here.
Related Publications:
1. Landes, S., Leacock, C., and Tengi, R.I. (1998). "Building semantic concordances". In Fellbaum, C. (ed.) (1998) WordNet: An Electronic Lexical Database. Cambridge (Mass.): The MIT Press.
2. Fellbaum, C., Grabowski, J. and Landes, S. (1998). "Performance and confidence in a semantic annotation task". In Fellbaum, C. (ed.) (1998) WordNet: An Electronic Lexical Database. Cambridge (Mass.): The MIT Press.
3. Fellbaum, C. (ed.) (1998) WordNet: An Electronic Lexical Database. Cambridge (Mass.): The MIT Press.