Abstract

With the availability of powerful computational and communication systems, scientists now readily access large, complicated derived datasets and build on those results to produce, through further processing, yet other derived datasets of interest. The scientific processes used to create such datasets must be clearly documented so that scientists can evaluate their soundness, reproduce the results, and build upon them in responsible and appropriate ways. Here, we present the concept of an analytic web, which defines the scientific processes employed and details the exact application of those processes in creating derived datasets. The work described here is similar to work often referred to as "scientific workflow," but emphasizes the need for a semantically rich, rigorously defined process definition language. We illustrate the information that comprises an analytic web for a scientific process that measures and analyzes the flux of water through a forested watershed. This is a complex and demanding scientific process that illustrates the benefits of using a semantically rich, executable language for defining processes and for supporting automatic creation of process provenance metadata.

Note to Practitioners: The Internet and associated computing capabilities have made it possible for scientists to derive novel datasets through complex processing of existing datasets that may be collected from many locations. But scientists rarely document dataset provenance (the set of processes and a description of how those processes were used) in enough detail to allow derived datasets to be recreated. Enabling such recreation is an essential part of repeatable science, and thus it is imperative that any dataset generated by scientific computation include provenance metadata: documentation of the precise way in which that dataset was produced. Provenance metadata can help ensure that scientists and others understand the value and limitations associated with using that data, but creating provenance metadata is a difficult and time-consuming task. This paper describes an approach for helping scientists deal with the production and management of their datasets, including the automated generation of provenance metadata. The approach is based on the use of a precisely defined process definition language. The language is relatively clear and easy for scientists to understand, yet it is precise enough to support their control of the application of computing capabilities to the generation of datasets, and is also an aid to the management and understanding of these datasets. This paper illustrates these ideas by providing a case study of a specific problem in ecological dataset production and provenance metadata generation.
