Abstract
The pervasive availability of scientific data from sensors and field observations is posing a challenge to the data valets responsible for accumulating and managing it in data repositories. Science collaborations, big and small, are standing up repositories built on commodity clusters that need to ingest data reliably and continuously and ensure its availability to a wide user community. Workflows provide several benefits for modeling data-intensive science applications, and many of these benefits carry over effectively to managing data ingest pipelines. But using workflows is not a panacea in itself: data valets need to consider several issues when designing workflows that behave reliably on fault-prone hardware while retaining the consistency of the scientific data, and when selecting workflow frameworks that support these requirements. In this paper, we propose workflow design models for reliable data ingest in a distributed environment and identify workflow framework features needed to support resilience. We illustrate these concepts with the data ingest pipeline of the Pan-STARRS sky survey, one of the largest digital sky surveys, which accumulates 100 TB of data annually.