Ph.D
Group : Bioinformatics
Designing scientific workflows following a structure and provenance-aware strategy
Started on 15/12/2011
Advisor : FROIDEVAUX, Christine
[COHEN-BOULAKIA Sarah]
Funding : ETR-BGF
Affiliation : Université Paris-Saclay
Laboratory : LRI-Bioinfo
Defended on 11/10/2013, committee :
Prof. Christine Froidevaux, Université Paris Sud (thesis advisor)
Dr. Sarah Cohen-Boulakia, Université Paris Sud (co-advisor)
Prof. Mohand-Said Hacid, Université Lyon 1 (reviewer)
Prof. Therese Libourel, Université Montpellier II (reviewer)
Prof. Daniela Grigori, Université Paris Dauphine (examiner)
Prof. Chantal Reynaud, Université Paris Sud (examiner)
Research activities :
Abstract :
Scientific workflow systems are equipped with provenance modules that collect the data produced and consumed during workflow runs in order to enhance reproducibility. For several reasons, the complexity of workflow structures and of workflow execution structures is increasing over time, which has a clear impact on the reuse of scientific workflows.
The global aim of this thesis is to enhance workflow reuse by providing strategies to reduce the complexity of workflow structures while preserving provenance. Two strategies are introduced.
First, we propose an approach to rewrite any scientific workflow, represented as a directed acyclic graph (DAG), into a series-parallel (SP) structure while preserving provenance. SP structures allow the design of polynomial-time algorithms for complex workflow operations (e.g., comparing workflows), whereas such operations are related to an NP-hard problem for general DAG structures. To this end, we introduce SPFlow, a provenance-preserving rewriting algorithm.
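To illustrate what distinguishes an SP structure from a general DAG, the following minimal sketch tests whether a two-terminal DAG is series-parallel by repeatedly applying the classical series and parallel reductions. It is not the SPFlow algorithm, which rewrites workflows rather than merely classifying them; the function name `is_series_parallel` and the edge-list encoding are assumptions made for this example only.

```python
from collections import defaultdict


def is_series_parallel(edges, source, sink):
    """Return True if the DAG reduces to the single edge (source, sink).

    `edges` is an iterable of (u, v) pairs; the graph is assumed to be
    acyclic with one source and one sink (hypothetical encoding).
    """
    # Storing edges in a set merges duplicates, so every parallel
    # reduction happens implicitly.
    edge_set = set(edges)
    changed = True
    while changed:
        changed = False
        preds = defaultdict(set)
        succs = defaultdict(set)
        for u, v in edge_set:
            succs[u].add(v)
            preds[v].add(u)
        for node in list(preds.keys() | succs.keys()):
            if node in (source, sink):
                continue
            # Series reduction: replace u -> node -> w by u -> w.
            if len(preds[node]) == 1 and len(succs[node]) == 1:
                (u,) = preds[node]
                (w,) = succs[node]
                edge_set -= {(u, node), (node, w)}
                edge_set.add((u, w))
                changed = True
                break  # adjacency is stale, rebuild it
    return edge_set == {(source, sink)}


# A diamond-shaped workflow is SP; adding a cross edge a -> b breaks it.
diamond = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")]
bridge = diamond + [("a", "b")]
print(is_series_parallel(diamond, "s", "t"))
print(is_series_parallel(bridge, "s", "t"))
```

The sketch rebuilds the adjacency maps after each reduction for clarity rather than speed; linear-time recognition algorithms exist but are longer to state.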
Second, we provide a methodology and a technique to reduce the redundancy present in workflows by detecting and removing the "anti-patterns" responsible for it. The DistillFlow algorithm transforms a workflow into a distilled, semantically equivalent workflow that is free or partly free of anti-patterns and has a simpler, more concise structure.
The two main approaches (SPFlow and DistillFlow) are based on a provenance model that we have introduced to represent the provenance structure of the workflow executions. Our solutions are available for use at https://www.lri.fr/~chenj. They have been systematically tested on large collections of real workflows, especially from the Taverna system.