DASH -- Event-Driven Pipeline Infrastructure


John "Scooter" Morris, Conrad C. Huang, Doug Stryke, and Thomas E. Ferrin

Computer Graphics Laboratory
University of California, San Francisco

Collaboration is increasingly a common theme among "big science" research projects, and is especially relevant in the field of computational biology. The collaborative model often consists of several self-contained research units (e.g. PI labs), each performing some specialized data analysis based on the expertise available within that unit/lab to contribute to the overall project goals. For these projects to succeed, each group must be able to share its results with the others and, in turn, easily access the data generated by them.
The greater the number of collaborators in such projects, and the more distributed the required storage and computational resources, the harder it is to manage these research networks. Concerns such as security, data provisioning, and concurrency control become increasingly difficult as the project scales up. The potentially large volume of generated data can take on many forms, including flat files arranged into directory hierarchies on a server's filesystem, XML (extensible markup language) data, and relational data stored in databases. Data newly deposited or modified by one group may require updating of related or dependent data used or maintained by another group. A system that relies on manual means (e.g. sending an e-mail to interested parties) to announce new data is prone to bottlenecks caused by human factors.
DASH is designed to address many of these concerns, by providing a software infrastructure aimed at facilitating data sharing in small- to medium-sized collaborative computational biology projects. DASH will enable users to describe a data network in terms of the component data sources and processing protocols, and to specify the inherent relationships between them. This information can then be fed to a subsequent component of the data-processing "pipeline," which will monitor the relevant data sources, and invoke the required protocols automatically in the presence of data updates, thereby maintaining data integrity across the network. Additionally, DASH will provide software tools for monitoring various aspects of the network, including important information such as the amount and nature of data available, currently executing/scheduled protocols, and provenance data (e.g. protocol execution and data deposition logs). Finally, DASH will provide tools and libraries for constructing custom web interfaces to present data from the data network in a semantically meaningful manner.
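To make the idea of a described data network concrete, the following is a minimal sketch in Python, not DASH's actual interface. It assumes a network can be written down as protocols with named input and output data sources, and it computes which protocols are affected when one source is updated. The class names (Protocol, DataNetwork), the method protocols_to_run, and the example source and protocol names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Protocol:
    """Hypothetical processing step: reads some data sources, writes others."""
    name: str
    inputs: tuple
    outputs: tuple


@dataclass
class DataNetwork:
    """Hypothetical description of data sources and the protocols linking them."""
    protocols: list = field(default_factory=list)

    def add_protocol(self, name, inputs, outputs):
        self.protocols.append(Protocol(name, tuple(inputs), tuple(outputs)))

    def protocols_to_run(self, updated_source):
        """Return the names of all protocols affected by an update.

        A protocol is affected if it reads the updated source directly, or
        reads anything produced by another affected protocol.
        """
        affected = set()
        queue = deque([updated_source])
        while queue:
            source = queue.popleft()
            for protocol in self.protocols:
                if source in protocol.inputs and protocol.name not in affected:
                    affected.add(protocol.name)
                    # Outputs of an affected protocol count as updated in turn.
                    queue.extend(protocol.outputs)
        return affected


# Example network with made-up source and protocol names.
network = DataNetwork()
network.add_protocol("align", inputs=["sequences"], outputs=["alignments"])
network.add_protocol("build_models", inputs=["alignments"], outputs=["models"])
network.add_protocol("annotate", inputs=["sequences", "models"], outputs=["annotations"])

print(sorted(network.protocols_to_run("sequences")))
```

The sketch only identifies the affected protocols. Ordering them for execution (for instance, running build_models before annotate) and actually invoking them are left out.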
Development on this project began in August 2003, and thus far has concentrated on developing preliminary versions of key components. Work over the past year has focused on building a robust, event-based notification infrastructure, which will serve as the core of the DASH system. This event model uses lightweight messages (referred to as events) to communicate changes such as the modification of data or the completion of a protocol. Further work will build on this generic event model to create a system optimized for data sharing.
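The event model described above can be illustrated with a small publish/subscribe sketch in Python. This is an assumption-laden illustration rather than the DASH implementation: the Event fields, the EventBus class, and the event kinds "data_modified" and "protocol_completed" are hypothetical names chosen for the example, and delivery here is synchronous and in-process.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """Hypothetical lightweight message: what happened, and to what."""
    kind: str
    subject: str
    timestamp: float = field(default_factory=time.time)


class EventBus:
    """Minimal in-process publish/subscribe dispatcher (illustrative only)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, kind, handler):
        """Register a callable to be invoked for every event of this kind."""
        self._handlers[kind].append(handler)

    def publish(self, event):
        """Deliver the event to each subscriber; return how many received it."""
        handlers = list(self._handlers.get(event.kind, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = EventBus()
log = []


def rerun_protocol(event):
    # Stand-in for invoking a real protocol on the modified data.
    log.append(f"rerun protocol for {event.subject}")
    bus.publish(Event("protocol_completed", "align"))


bus.subscribe("data_modified", rerun_protocol)
bus.subscribe("protocol_completed", lambda e: log.append(f"{e.subject} completed"))

delivered = bus.publish(Event("data_modified", "sequences"))
print(log)
```

A data modification triggers a protocol, whose completion is itself announced as an event, so downstream listeners can react without any party notifying them by hand.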

References:

Enhancing data sharing in collaborative research projects with DASH; Ferrin TE, Huang CC, Greenblatt DM, Stryke D, Giacomini KM, Morris JH; Pac. Symp. Biocomput. 2005, pp. 260-271. PMID: 15759632