Abstract
With the availability of powerful computational and communication
systems, scientists now readily access large, complicated derived
datasets and build on those results to produce, through further
processing, yet other derived datasets of interest. The scientific
processes used to create such datasets must be clearly documented so
that scientists can evaluate their soundness, reproduce the results,
and build upon them in responsible and appropriate ways. Here, we
present the concept of an analytic web, which defines the scientific
processes employed and details the exact application of those processes
in creating derived datasets. The work described here is similar to
work often referred to as "scientific workflow," but emphasizes the
need for a semantically rich, rigorously defined process definition
language. We illustrate the information that comprises an analytic web
for a scientific process that measures and analyzes the flux of water
through a forested watershed. This is a complex and demanding
scientific process that illustrates the benefits of using a
semantically rich, executable language for defining processes and for
supporting automatic creation of process provenance metadata.
Note to Practitioners: The Internet and associated computing
capabilities have made it possible for scientists to derive novel
datasets through complex processing of existing datasets that may be
collected from many locations. But scientists rarely document dataset
provenance (the set of processes and a description of how those
processes were used) to allow derived datasets to be recreated.
Enabling such recreation is an essential part of repeatable science,
and thus it is imperative that any dataset generated by scientific
computation include provenance metadata, that is, documentation of the precise
way in which that dataset was produced. Provenance metadata can help
assure that scientists and others understand the value and limitations
associated with using that data, but creating provenance metadata is
difficult and time-consuming. This paper describes an approach
for helping scientists deal with the production and management of their
datasets, including the automated generation of provenance metadata.
The approach is based on the use of a precisely defined process
definition language. The language is relatively clear and easy for
scientists to understand, yet it is precise enough to support their
control of the application of computing capabilities to the generation
of datasets, and is also an aid to the management and understanding of
these datasets. This paper illustrates these ideas by providing a case
study of a specific problem in ecological dataset production and
provenance metadata generation.