Using Provenance for Personalized Quality Ranking of Scientific Datasets
Y. Simmhan and B. Plale. International Journal of Computers and Their Applications (IJCA), 18(3):
180--195, September 2011
Abstract
The rapid growth of eScience has led to an explosion in the creation
and availability of scientific datasets that include raw instrument
data and derived datasets from model simulations. A large number
of these datasets are surfacing online in public and private catalogs,
often annotated with XML metadata, as part of community efforts to
foster open research. With this rapid expansion comes the challenge
of filtering and selecting datasets that best match the needs of
scientists. We address a key aspect of the scientific data discovery
process by ranking search results according to a personalized data
quality score based on a declarative quality profile, to help scientists
select the most suitable data for their applications. Our quality
model is resilient to missing metadata through a novel strategy that
uses provenance in its absence. Intuitively, our premise is that
the quality score for a dataset depends on its provenance – the scientific
task and its inputs that created the dataset – and that it is possible
to define a quality function based on provenance metadata that predicts
the same quality score as one evaluated using the user’s quality
profile over the complete metadata. Here, we present a model and
architecture for data quality scoring, apply machine learning techniques
to construct a quality function that uses provenance as a proxy for
missing metadata, and empirically test the predictive power of our
quality function. Our results show that for some scientific tasks,
quality scores based on provenance closely track the quality scores
based on complete metadata properties, with error margins between
1% and 29%.
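The abstract's premise (learning a quality function over provenance features that approximates the score a user's quality profile would assign over complete metadata) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the profile weights, the single provenance feature, and the training data below are all made-up assumptions.

```python
# Illustrative sketch of the paper's premise. A dataset's quality score
# is a weighted sum of metadata properties given by a user's declarative
# quality profile; when metadata is missing, a model fit on provenance
# features predicts the same score. All names and values are hypothetical.

# A user's declarative quality profile: weights over metadata properties.
PROFILE = {"completeness": 0.5, "timeliness": 0.3, "accuracy": 0.2}

def profile_score(metadata):
    """Quality score evaluated over COMPLETE metadata (training target)."""
    return sum(w * metadata[prop] for prop, w in PROFILE.items())

def fit_1d(xs, ys):
    """Ordinary least squares for a single provenance feature, e.g. the
    mean quality score of the inputs to the task that created the dataset."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Training data: datasets with full metadata plus one provenance feature.
train = [
    ({"completeness": 0.9, "timeliness": 0.8, "accuracy": 0.7}, 0.85),
    ({"completeness": 0.4, "timeliness": 0.5, "accuracy": 0.6}, 0.50),
    ({"completeness": 0.7, "timeliness": 0.6, "accuracy": 0.9}, 0.75),
]
xs = [feat for _, feat in train]
ys = [profile_score(md) for md, _ in train]
a, b = fit_1d(xs, ys)

# Rank a dataset whose metadata is MISSING using only its provenance feature.
predicted = a * 0.8 + b
print(round(predicted, 3))
```

The linear fit stands in for the machine-learning step; the paper's point is only that some provenance-derived function can track the profile-based score closely.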
%0 Journal Article
%1 Simmhan:ijca:2011
%A Simmhan, Yogesh
%A Plale, Beth
%D 2011
%I ISCA
%J International Journal of Computers and Their Applications (IJCA)
%K iu, karma, provenance, peer reviewed, special issue, usc
%N 3
%P 180--195
%T Using Provenance for Personalized Quality Ranking of Scientific Datasets
%U http://ceng.usc.edu/~simmhan/pubs/simmhan-ijca-2011.pdf
%V 18
%X The rapid growth of eScience has led to an explosion in the creation
and availability of scientific datasets that include raw instrument
data and derived datasets from model simulations. A large number
of these datasets are surfacing online in public and private catalogs,
often annotated with XML metadata, as part of community efforts to
foster open research. With this rapid expansion comes the challenge
of filtering and selecting datasets that best match the needs of
scientists. We address a key aspect of the scientific data discovery
process by ranking search results according to a personalized data
quality score based on a declarative quality profile, to help scientists
select the most suitable data for their applications. Our quality
model is resilient to missing metadata through a novel strategy that
uses provenance in its absence. Intuitively, our premise is that
the quality score for a dataset depends on its provenance – the scientific
task and its inputs that created the dataset – and that it is possible
to define a quality function based on provenance metadata that predicts
the same quality score as one evaluated using the user’s quality
profile over the complete metadata. Here, we present a model and
architecture for data quality scoring, apply machine learning techniques
to construct a quality function that uses provenance as a proxy for
missing metadata, and empirically test the predictive power of our
quality function. Our results show that for some scientific tasks,
quality scores based on provenance closely track the quality scores
based on complete metadata properties, with error margins between
1% and 29%.
@article{Simmhan:ijca:2011,
abstract = {The rapid growth of eScience has led to an explosion in the creation
and availability of scientific datasets that include raw instrument
data and derived datasets from model simulations. A large number
of these datasets are surfacing online in public and private catalogs,
often annotated with XML metadata, as part of community efforts to
foster open research. With this rapid expansion comes the challenge
of filtering and selecting datasets that best match the needs of
scientists. We address a key aspect of the scientific data discovery
process by ranking search results according to a personalized data
quality score based on a declarative quality profile, to help scientists
select the most suitable data for their applications. Our quality
model is resilient to missing metadata through a novel strategy that
uses provenance in its absence. Intuitively, our premise is that
the quality score for a dataset depends on its provenance – the scientific
task and its inputs that created the dataset – and that it is possible
to define a quality function based on provenance metadata that predicts
the same quality score as one evaluated using the user’s quality
profile over the complete metadata. Here, we present a model and
architecture for data quality scoring, apply machine learning techniques
to construct a quality function that uses provenance as a proxy for
missing metadata, and empirically test the predictive power of our
quality function. Our results show that for some scientific tasks,
quality scores based on provenance closely track the quality scores
based on complete metadata properties, with error margins between
1% and 29%.},
added-at = {2014-08-13T04:08:36.000+0200},
author = {Simmhan, Yogesh and Plale, Beth},
biburl = {https://www.bibsonomy.org/bibtex/2f342c776ed18a0211a4cf9334f7c8332/simmhan},
entrytype = {journal},
interhash = {1d1f95297e0ece15b8eae4775eaaf470},
intrahash = {f342c776ed18a0211a4cf9334f7c8332},
issn = {1076-5204},
journal = {International Journal of Computers and Their Applications (IJCA)},
keywords = {iu, karma, provenance, peer reviewed, special issue, usc},
month = {September},
number = 3,
owner = {Simmhan},
pages = {180--195},
publisher = {ISCA},
timestamp = {2014-08-13T04:08:36.000+0200},
title = {Using Provenance for Personalized Quality Ranking of Scientific Datasets},
url = {http://ceng.usc.edu/~simmhan/pubs/simmhan-ijca-2011.pdf},
volume = 18,
year = 2011
}