Dataset or Not? A Study on the Veracity of Semantic Markup for Dataset Pages

T. Alrashed, D. Paparas, O. Benjelloun, Y. Sheng, and N. Noy. The Semantic Web -- ISWC 2021, pages 338--356. Cham, Springer International Publishing, (2021)

Abstract

Semantic markup, such as Schema.org, allows providers on the Web to describe content using a shared controlled vocabulary. This markup is invaluable in enabling a broad range of applications, from vertical search engines, to rich snippets in search results, to actions on emails, to many others. In this paper, we focus on semantic markup for datasets, specifically in the context of developing a vertical search engine for datasets on the Web, Google's Dataset Search. Dataset Search relies on Schema.org to identify pages that describe datasets. While Schema.org was the core enabling technology for this vertical search, we also discovered that we need to address the following problem: pages from 61% of internet hosts that provide Schema.org/Dataset markup do not actually describe datasets. We analyze the veracity of dataset markup for Dataset Search's Web-scale corpus and categorize pages where this markup is not reliable. We then propose a way to drastically increase the quality of the dataset metadata corpus by developing a deep neural-network classifier that identifies whether or not a page with Schema.org/Dataset markup is a dataset page. Our classifier achieves 96.7% recall at the 95% precision point. This level of precision enables Dataset Search to circumvent the noise in semantic markup and to use the metadata to provide high quality results to users.
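The abstract reports recall "at the 95% precision point". As a reading aid, the sketch below shows one common way such a figure can be computed: sweep the decision threshold over the classifier scores and keep the highest recall among thresholds whose precision meets the target. This is not the authors' evaluation code; the function name `recall_at_precision` and the toy scores and labels are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple


def recall_at_precision(
    scores: List[float],
    labels: List[int],
    min_precision: float = 0.95,
) -> Tuple[float, Optional[float]]:
    """Return (best recall, threshold) among thresholds with precision >= min_precision.

    scores: classifier confidence that a page is a dataset page (hypothetical).
    labels: 1 if the page really describes a dataset, else 0.
    A page is predicted positive when its score is >= the threshold.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive label")

    # Rank pages from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best_recall, best_threshold = 0.0, None
    true_pos = false_pos = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            true_pos += 1
        else:
            false_pos += 1
        # Tied scores share one threshold, so evaluate only after the whole tie group.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / total_positives
        if precision >= min_precision and recall > best_recall:
            best_recall, best_threshold = recall, score
    return best_recall, best_threshold


# Toy data, invented for this example only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
strict = recall_at_precision(toy_scores, toy_labels, min_precision=0.95)
relaxed = recall_at_precision(toy_scores, toy_labels, min_precision=0.75)
```

On the toy data, the strict target keeps only the two top-ranked pages, while the relaxed target admits a lower threshold and recovers every true dataset page. The same trade-off underlies the reported figure: fixing precision first and then reporting recall describes how many genuine dataset pages are retained at an acceptable noise level.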

Description

Dataset or Not? A Study on the Veracity of Semantic Markup for Dataset Pages | SpringerLink

Cite this publication

@inproceedings{alrashed2021dataset,
  abstract = {Semantic markup, such as Schema.org, allows providers on the Web to describe content using a shared controlled vocabulary. This markup is invaluable in enabling a broad range of applications, from vertical search engines, to rich snippets in search results, to actions on emails, to many others. In this paper, we focus on semantic markup for datasets, specifically in the context of developing a vertical search engine for datasets on the Web, Google's Dataset Search. Dataset Search relies on Schema.org to identify pages that describe datasets. While Schema.org was the core enabling technology for this vertical search, we also discovered that we need to address the following problem: pages from 61{\%} of internet hosts that provide Schema.org/Dataset markup do not actually describe datasets. We analyze the veracity of dataset markup for Dataset Search's Web-scale corpus and categorize pages where this markup is not reliable. We then propose a way to drastically increase the quality of the dataset metadata corpus by developing a deep neural-network classifier that identifies whether or not a page with Schema.org/Dataset markup is a dataset page. Our classifier achieves 96.7{\%} recall at the 95{\%} precision point. This level of precision enables Dataset Search to circumvent the noise in semantic markup and to use the metadata to provide high quality results to users.},
  added-at = {2022-06-27T14:10:31.000+0200},
  address = {Cham},
  author = {Alrashed, Tarfah and Paparas, Dimitris and Benjelloun, Omar and Sheng, Ying and Noy, Natasha},
  biburl = {https://www.bibsonomy.org/bibtex/259e0f31eaaf5f10c5155a44f5ee3cbab/jaeschke},
  booktitle = {The Semantic Web -- ISWC 2021},
  description = {Dataset or Not? A Study on the Veracity of Semantic Markup for Dataset Pages | SpringerLink},
  editor = {Hotho, Andreas and Blomqvist, Eva and Dietze, Stefan and Fokoue, Achille and Ding, Ying and Barnaghi, Payam and Haller, Armin and Dragoni, Mauro and Alani, Harith},
  interhash = {6ca941366cb64f3d041444ac77ec0ad0},
  intrahash = {59e0f31eaaf5f10c5155a44f5ee3cbab},
  isbn = {978-3-030-88361-4},
  keywords = {dataset extraction markup semantics semanticweb unknowndata web},
  pages = {338--356},
  publisher = {Springer International Publishing},
  timestamp = {2022-06-27T14:10:31.000+0200},
  title = {Dataset or Not? A Study on the Veracity of Semantic Markup for Dataset Pages},
  year = 2021
}

BibSonomy
