This page provides two large hyperlink graphs for public download. The graphs were extracted from the 2012 and 2014 versions of the Common Crawl web corpora. The 2012 graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, it is the largest hyperlink graph available to the public outside companies such as Google, Yahoo, and Microsoft. The 2014 graph covers 1.7 billion web pages connected by 64 billion hyperlinks. Below we provide instructions on how to download the graphs as well as basic statistics about their topology.
A STATEMENT OF COMMITMENT BY STM PUBLISHERS TO A ROADMAP TO ENABLE TEXT AND DATA MINING (TDM) FOR NON COMMERCIAL SCIENTIFIC RESEARCH IN THE EUROPEAN UNION
We have released over a million images onto Flickr Commons for anyone to use, remix and repurpose. These images were taken from the pages of 17th, 18th and 19th century books digitised by Microsoft who then generously gifted the scanned images to us, allowing us to release them back into...
The files below contain XML (and only XML) for all the articles in the PMC open access subset. These files were created for users who need PMC XML for data mining and processing purposes, but do not need PDFs, images, or supplementary data.
To help researchers investigate relation extraction, we’re releasing a human-judged dataset of two relations about public figures on Wikipedia: nearly 10,000 examples of “place of birth”, and over 40,000 examples of “attended or graduated from an institution”. Each of these was judged by at least 5 raters, and can be used to train or evaluate relation extraction systems. We also plan to release more relations of new types in the coming months.
H. Zhang, A. Santos, and J. Freire. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, ACM, (October 2021)
A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B. Hsu, and K. Wang. Proceedings of the 24th International Conference on World Wide Web, pp. 243--246. Republic and Canton of Geneva, Switzerland, International World Wide Web Conferences Steering Committee, (2015)