Researchers at Google annotated English-language Web pages from the ClueWeb09 and ClueWeb12 corpora. The annotation process was automatic and hence imperfect. However, the annotations are of generally high quality, since the process was tuned for high precision (and, by necessity, lower recall). For each entity recognized with high confidence, they provide the beginning and end byte offsets of the entity mention in the input text, its Freebase identifier (mid), and two confidence levels (computed differently, see below).
You might consider using this data in conjunction with the recently released Freebase annotations of several TREC query sets.
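To make the per-mention fields concrete, here is a minimal Scala sketch of a record type and parser for one annotation line. The tab-separated layout and the exact column order are assumptions for illustration; consult the release notes for the actual file format.

```scala
// Hypothetical record for one entity annotation; the column order below
// is an assumed layout, not taken from the official release notes.
case class EntityAnnotation(
  mention: String,   // surface text of the entity mention
  beginByte: Int,    // beginning byte offset in the input text
  endByte: Int,      // end byte offset in the input text
  mid: String,       // Freebase identifier, e.g. "/m/02mjmr"
  conf1: Double,     // first confidence level
  conf2: Double      // second confidence level (computed differently)
)

def parseLine(line: String): EntityAnnotation = {
  val f = line.split('\t')
  EntityAnnotation(f(0), f(1).toInt, f(2).toInt, f(3), f(4).toDouble, f(5).toDouble)
}
```

A parsed line then carries everything needed to align the mention back to the page bytes and look up the entity in Freebase by its mid.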
I've been thinking about the best way to implement pure-function verification in the Scala compiler. An approach similar to D's would fit much better than Haskell's, which would break all existing code and cause some problems due to strict evaluation. A solution using annotations would be quite simple to implement:
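A minimal sketch of what such an annotation-based approach could look like. The `@pure` name and the idea that a compiler plugin (not shown) enforces it are assumptions, not an existing Scala feature:

```scala
import scala.annotation.StaticAnnotation

// Hypothetical marker annotation: a compiler plugin or macro would reject
// annotated methods that perform side effects or call unannotated methods.
class pure extends StaticAnnotation

object MathOps {
  @pure
  def square(x: Int): Int = x * x  // no side effects, only calls pure code

  // Unannotated: free to perform side effects, so it may not be
  // called from a @pure method under the hypothetical checker.
  def logAndSquare(x: Int): Int = {
    println(s"squaring $x")
    square(x)
  }
}
```

Unannotated code compiles exactly as before, which is the main advantage over a Haskell-style type-system change: purity checking is opt-in and existing code keeps working.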
M. Sabou, K. Bontcheva, L. Derczynski, and A. Scharl. Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland, May 26-31, 2014, pp. 859--866. European Language Resources Association (ELRA), (2014)
R. Snow, B. O'Connor, D. Jurafsky, and A. Ng. Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 254--263. Honolulu, Hawaii, Association for Computational Linguistics, (October 2008)
J. Parvanova, V. Alexiev, and S. Kostadinov. International Workshop on Collaborative Annotations in Shared Environment: metadata, vocabularies and techniques in the Digital Humanities (DH-CASE 2013). Collocated with DocEng 2013, Florence, Italy, (September 2013)
A. List. Proceedings of the 17th International Conference on Educational Data Mining, pp. 692--697. Atlanta, Georgia, USA, International Educational Data Mining Society, (July 2024)