Master Internship opportunity
Querying heterogeneous phylogenetic data in a relational framework
Advisors
Sarah Cohen-Boulakia (Assistant Professor), cohen AT lri.fr
Christine Froidevaux (Professor), chris AT lri.fr
Laboratoire de Recherche en Informatique, Universite Paris-Sud, Orsay, France
Collaborators
Olivier Lespinet (Institut de Genetique et de Microbiologie, Universite Paris-Sud, Orsay, France)
Val Tannen (University of Pennsylvania, USA)
Location
Bioinformatics group at LRI (Laboratoire de Recherche en Informatique), Universite Paris-Sud, Orsay, France
Topic
Understanding the relationships between different species can have consequences both at a practical level (e.g., tracing the history of a pathogen involved in a disease),
and at a fundamental level (e.g., constructing the tree of life, that is, recovering the history of all organisms).
The study of evolution requires many kinds of data, such as the morphological characteristics of a set of specimens
or, more importantly, the genomic and proteomic sequences of a group of species, together with functional and structural annotations.
For several years now, an avalanche of such data has become available. Combining and integrating large volumes of phylogenomic data is of paramount importance for a better understanding of evolution.
Background
More specifically, this subject is part of two projects, pPOD and Microbiogenomics.
On the one hand, pPOD is an international project in which data are collected throughout the world by research groups with distinct interests (each specializing in different groups of species). The data collected by these groups are available in a variety of formats (relational, tabulated files, and so on).
On the other hand, partners of the French project "ANR masses de donnees" Microbiogenomics study evolution by building phylogenetic trees based on families of proteins. Part of their data is currently stored in flat files (trees) and in a relational warehouse (sequences, annotations).
In both projects, phylogeneticists need to ask complex questions involving all the data that are produced and consumed by the various steps of the generation and analysis of phylogenetic trees, including the trees themselves or families of trees.
Examples of queries include:
"Which supertree was produced from these two phylogenetic trees in this experiment?
What are the differences between these two trees?
If I modify the alignment by adding gaps, what would be the impact on the final generated tree?"
or
"Among the available trees, which subtrees contain only proteins involved in a given metabolic pathway? Which protein modules appear in trees in which a given group of species is monophyletic?"
Work
The aim of this work is to enable phylogenomics researchers to make use of all these data in a unified way, in the context of a relational database. One of the main challenges is that the relational model is not a natural fit for hierarchical (tree-based) data.
- As a first step, queries frequently asked by major pPOD and Microbiogenomics partners should be identified. These queries may be of increasing complexity and can be expressed using different kinds of languages (relational algebra, SQL, and so on). In particular, queries may involve families of trees, comparison and clustering algorithms, and topological features of the trees.
- A classification of queries should then be proposed.
- The third step consists of exploring the limitations of current relational standards (latest versions of SQL) for representing phylogenetic data and expressing queries. This study should provide the building blocks for a more expressive high-level query language.
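
To make the mismatch between trees and relations concrete, here is a minimal sketch, not part of either project, of one common way to store a rooted tree in a relational table (one row per node, with a pointer to its parent) and to query it with a recursive common table expression. The table name `node`, its columns, the toy species labels, and the helper functions `clade_leaves` and `is_monophyletic` are all assumptions made for illustration; the sketch also assumes that leaves are exactly the nodes carrying a label. It uses Python's standard `sqlite3` module.

```python
import sqlite3

# Hypothetical toy schema: one row per tree node, parent_id is NULL for the root.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES node(id),
    label     TEXT            -- species name for leaves, NULL for internal nodes
);
INSERT INTO node (id, parent_id, label) VALUES
    (1, NULL, NULL),
    (2, 1, NULL),
    (3, 1, NULL),
    (4, 2, NULL),
    (5, 2, 'V_cholerae'),
    (6, 4, 'E_coli'),
    (7, 4, 'S_enterica'),
    (8, 3, 'B_subtilis'),
    (9, 3, 'S_aureus');
""")


def clade_leaves(conn, node_id):
    """Return the set of leaf labels in the subtree rooted at node_id."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM node WHERE id = ?
            UNION ALL
            SELECT n.id FROM node AS n JOIN subtree AS s ON n.parent_id = s.id
        )
        SELECT n.label
        FROM node AS n JOIN subtree AS s ON n.id = s.id
        WHERE n.label IS NOT NULL
        """,
        (node_id,),
    ).fetchall()
    return {label for (label,) in rows}


def is_monophyletic(conn, group):
    """A group is monophyletic in a rooted tree if it equals the leaf set of some node."""
    node_ids = [row[0] for row in conn.execute("SELECT id FROM node")]
    return any(clade_leaves(conn, nid) == set(group) for nid in node_ids)


print(sorted(clade_leaves(conn, 2)))                     # ['E_coli', 'S_enterica', 'V_cholerae']
print(is_monophyletic(conn, {"E_coli", "S_enterica"}))   # True
print(is_monophyletic(conn, {"E_coli", "V_cholerae"}))   # False
```

Even this simple topological question needs recursion in SQL plus a loop in the host language, which hints at why a more expressive high-level query language may be worth investigating.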
More information on the pPOD project: http://phylodata.seas.upenn.edu
More information on the Microbiogenomics project: http://microbiogenomics.u-psud.fr
More information on the Bioinformatics group at LRI: http://www.lri.fr/bioinfo