More and more websites embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as Microdata, JSON-LD, RDFa, and Microformats. The Web Data Commons project extracts this data from several billion web pages. So far, the project has provided 11 data set releases extracted from the Common Crawl corpora of 2010 to 2022. The extracted data is available for download, and the project publishes statistics about the deployment of the different formats.
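As an illustration of what such embedded markup looks like and how it might be pulled out of a page, here is a minimal sketch for one of the formats mentioned above, JSON-LD. It is not the Web Data Commons extraction pipeline; the names `JsonLdExtractor`, `extract_jsonld`, and `SAMPLE_HTML` are hypothetical, the sample page is invented for the example, and only the Python standard library is used.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed <script type="application/ld+json"> blocks (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # The attribute value can be None for a bare attribute, hence the "or".
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_jsonld(html):
    """Return a list of decoded JSON-LD objects found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Invented sample page, for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Widget"}
</script>
</head><body><p>Example Widget</p></body></html>
"""

if __name__ == "__main__":
    for item in extract_jsonld(SAMPLE_HTML):
        print(item["@type"], item["name"])
```

Microdata, RDFa, and Microformats are embedded as attributes on ordinary HTML elements rather than in a separate script block, so they would need a different parsing approach than this sketch.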
J. Choi, A. Khlif, and E. Epure. Proceedings of the 1st Workshop on NLP for Music and Audio (NLP4MusA), pages 23-27. Online, Association for Computational Linguistics, (2020)
S. Staab, J. Lehmann, and R. Verborgh. Companion Proceedings of the The Web Conference 2018, pages 885-886. Republic and Canton of Geneva, Switzerland, International World Wide Web Conferences Steering Committee, (2018)
R. Zgheib, A. Nicola, M. Villani, E. Conchon, and R. Bastide. 2017 IEEE 26th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), pages 284-289. (June 2017)
A. Dridi, S. Sassi, and S. Faiz. 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA), pages 1421-1428. (October 2017)