Misc,

W3C Corpus Annotated with W3C People Identity

http://ir.nist.gov/w3c/contrib/W3Ctagged.html (September 2006)

Abstract

The annotated W3C corpus was produced by tagging W3C people found in the W3C test collection. TREC provides the list of 1092 W3C people's names and IDs at http://trec.nist.gov/data/enterprise/05/ent05.expert.candidates and a description of the W3C test collection, crawled in June 2004, can be found at http://ir.nist.gov/w3c/w3c-summary.html

Each W3C-related person is identified by his/her name, name variations, email addresses, email user ID, etc. For example, "Dan Brickley" is identified by "Dan Brickley", "D. Brickley", "Brickley, Dan", "danbri@w3.org", and "danbri". I tagged each occurrence of a W3C person in the format "<candidate-id>original text</candidate-id>", e.g., "<candidate-0001>D. Brickley</candidate-0001>". Please note that identity tagging is the only operation performed on the original W3C corpus; no pre-processing is done. HTML tags are preserved so that documents can still be rendered in web browsers.

The total number of occurrences of W3C people's identities in the corpus is 1,662,024. The largest number of occurrences of a single person's identity is 129,109. The number of people whose identity does not occur at all (zero occurrences) is 303. The occurrences of people's identities are shown in Figure 1, where people's IDs are in ascending order. When we sort the people's IDs by their number of occurrences, we get the graph in Figure 2. We can see that the identity occurrences are distributed exponentially: a small number of people have a large number of occurrences, and the majority of people have a small number of occurrences.
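The tagging format and the per-person occurrence counts described above can be illustrated with a minimal Python sketch. The abstract does not say how the tagging was actually implemented, so this is only one plausible approach: the `VARIANTS` table, the function names `tag_text` and `count_occurrences`, and the plain string matching with word boundaries are all assumptions made for illustration. Only the tag format and the "Dan Brickley" variants come from the abstract.

```python
import re
from collections import Counter

# Hypothetical variant table: candidate ID -> known surface forms.
# The forms for candidate-0001 are the examples given in the abstract.
VARIANTS = {
    "candidate-0001": [
        "Dan Brickley", "D. Brickley", "Brickley, Dan",
        "danbri@w3.org", "danbri",
    ],
}

def tag_text(text, variants_by_id):
    """Wrap every known surface form in <candidate-id>...</candidate-id>."""
    form_to_id = {
        form: cid for cid, forms in variants_by_id.items() for form in forms
    }
    # Longest forms first, so "danbri@w3.org" wins over its prefix "danbri".
    forms = sorted(form_to_id, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(f) for f in forms) + r")(?!\w)"
    )

    def wrap(match):
        cid = form_to_id[match.group(0)]
        return f"<{cid}>{match.group(0)}</{cid}>"

    return pattern.sub(wrap, text)

# Matches a tagged occurrence and captures the candidate ID.
TAG_RE = re.compile(r"<(candidate-\d+)>(.*?)</\1>", re.DOTALL)

def count_occurrences(tagged_text):
    """Count tagged occurrences per candidate ID."""
    return Counter(m.group(1) for m in TAG_RE.finditer(tagged_text))

sample = "Mail from danbri@w3.org: D. Brickley wrote the draft."
tagged = tag_text(sample, VARIANTS)
counts = count_occurrences(tagged)
print(tagged)
print(counts)
```

Sorting the resulting counts in descending order (for example with `counts.most_common()`) gives the kind of ranking plotted in Figure 2. Note that this sketch matches raw text only and would also tag names inside HTML attributes, which may or may not reflect how the real corpus was annotated.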
