Abstract
Over recent years, the Semantic Web has been growing steadily. Today, more than 10,000 datasets are available online following Semantic Web standards.
Nevertheless, many applications, such as data integration, search, and interlinking, may not take full advantage of the data without a priori statistical information about its internal structure and coverage.
In fact, a number of tools already offer such statistics, providing basic information about RDF datasets and vocabularies.
However, these tools usually show severe performance deficiencies once the dataset size grows beyond the capabilities of a single machine.
In this paper, we introduce a software component for statistical calculations of large RDF datasets, which scales out to clusters of machines.
More specifically, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark.
The preliminary results show that our distributed approach improves upon a previous centralized approach we compare against and provides approximately linear horizontal scale-up.
The set of criteria is extensible beyond the 32 default criteria; the component is integrated into the larger SANSA framework and has been employed in at least four major usage scenarios beyond the SANSA community.
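To give a concrete flavor of what such a statistical criterion looks like when computed with Spark, the following minimal Scala sketch counts how often each predicate is used over a toy in-memory RDD of triples. It is illustrative only and not taken from the paper or the SANSA API; the object name, the inline toy data, and the triple representation as string tuples are assumptions, and a real pipeline would parse serialized RDF (e.g. N-Triples) from distributed storage instead.

```scala
// Illustrative sketch (not the paper's or SANSA's code): one simple
// criterion -- predicate usage counts -- over an RDD of (s, p, o) strings.
import org.apache.spark.sql.SparkSession

object PredicateUsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PredicateUsageSketch")
      .master("local[*]") // replace with a cluster master URL to scale out
      .getOrCreate()

    // Hypothetical toy data; a real job would read N-Triples from HDFS.
    val triples = spark.sparkContext.parallelize(Seq(
      ("ex:alice", "foaf:knows", "ex:bob"),
      ("ex:alice", "foaf:name",  "\"Alice\""),
      ("ex:bob",   "foaf:name",  "\"Bob\"")
    ))

    // Criterion: number of occurrences of each predicate,
    // computed in memory across the cluster.
    val predicateCounts = triples
      .map { case (_, p, _) => (p, 1L) }
      .reduceByKey(_ + _)

    predicateCounts.collect().foreach { case (p, n) => println(s"$p -> $n") }
    spark.stop()
  }
}
```

In the same spirit, other criteria (e.g. distinct subjects, class usage, literal counts) reduce to map/aggregate passes over the triples, which is what allows the computation to scale out approximately linearly with the number of machines.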