Master Internship opportunity

Querying heterogeneous phylogenetic data in a relational framework



Advisors


Sarah Cohen-Boulakia (Assistant Professor), cohen AT lri.fr
Christine Froidevaux (Professor), chris AT lri.fr
Laboratoire de Recherche en Informatique, Universite Paris-Sud, Orsay, France


Collaborators


Olivier Lespinet (Institut de Genetique et de Microbiologie, Universite Paris-Sud, Orsay, France)
Val Tannen (University of Pennsylvania, USA)


Location


Bioinformatics group at LRI (Laboratoire de Recherche en Informatique), Universite Paris-Sud, Orsay, France


Topic


Understanding the relationships between species can have consequences both at a practical level (e.g., tracing the history of a pathogenic agent involved in a disease) and at a fundamental level (e.g., constructing the tree of life, that is, finding the history of all organisms). The study of evolution requires many kinds of data, such as morphological characteristics of a set of specimens or, more importantly, genomic and proteomic sequences of a group of species together with their functional and structural annotations. For several years now, an avalanche of such data has become available. Combining and integrating masses of phylogenomic data is of paramount importance for better understanding evolution.


Background


More specifically, this subject is part of two projects, pPOD and Microbiogenomics.

On the one hand, pPOD is an international project in which data are collected throughout the world by research groups with distinct interests (each specializing in different groups of species). The data collected by these groups are available in a variety of formats (relational databases, tabulated files, and so on).

On the other hand, partners of the French project "ANR masses de donnees" Microbiogenomics study evolution by building phylogenetic trees based on families of proteins. Part of their data is currently stored in flat files (trees) and part in a relational warehouse (sequences, annotations).

In both projects, phylogeneticists need to ask complex questions involving all the data produced and consumed by the various steps of the generation and analysis of phylogenetic trees, including the trees themselves or families of trees. Examples of queries include:
"Which supertree was produced from those two phylogenetic trees in this experiment? What are the differences between those two trees? If I modified the alignment by adding gaps, what would be the impact on the final generated tree?" or
"Among the available trees, which subtrees contain only proteins involved in a given metabolic pathway? Which protein modules appear in trees in which a given group of species is monophyletic?"


Work


The aim of this work is to enable phylogenomists to use all these data in a unified way within a relational database. One of the main challenges is that the relational model is not a natural fit for hierarchical (tree-based) data: a table of rows has no built-in notion of ancestor or subtree.
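To illustrate the challenge, here is a minimal sketch of one possible way to store a tree in relational form and to ask whether a group of species is monophyletic. It is not the design adopted by either project. It assumes a hypothetical single table "node" in which each row points to its parent (an adjacency list), uses SQLite through Python's standard sqlite3 module, and relies on a recursive SQL query to walk a subtree. The toy tree ((A,B),(C,D)) and the function names are made up for the example.

```python
import sqlite3

# Hypothetical schema: one row per tree node, each pointing to its parent.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES node(id),"
    " label TEXT)"  # species name for leaves, NULL for internal nodes
)

# Toy tree ((A,B),(C,D)): node 1 is the root, nodes 2 and 3 are internal.
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [
        (1, None, None),
        (2, 1, None),
        (3, 1, None),
        (4, 2, "A"),
        (5, 2, "B"),
        (6, 3, "C"),
        (7, 3, "D"),
    ],
)


def leaves_under(conn, node_id):
    """Return the set of leaf labels in the subtree rooted at node_id."""
    sql = """
        WITH RECURSIVE subtree(id) AS (
            SELECT ?
            UNION ALL
            SELECT n.id FROM node n JOIN subtree s ON n.parent_id = s.id
        )
        SELECT n.label
        FROM node n JOIN subtree s ON n.id = s.id
        WHERE NOT EXISTS (SELECT 1 FROM node c WHERE c.parent_id = n.id)
    """
    return {row[0] for row in conn.execute(sql, (node_id,))}


def is_monophyletic(conn, species):
    """True if some node's leaf set is exactly the given group of species."""
    target = set(species)
    node_ids = [row[0] for row in conn.execute("SELECT id FROM node")]
    return any(leaves_under(conn, nid) == target for nid in node_ids)


print(leaves_under(conn, 2))
print(is_monophyletic(conn, ["A", "B"]))
print(is_monophyletic(conn, ["A", "C"]))
```

Even this simple question needs a recursive query per node, which hints at why other encodings (for example, storing an interval or a path for each node) may be worth investigating for larger collections of trees.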

More information on the pPOD project: http://phylodata.seas.upenn.edu
More information on the Microbiogenomics project: http://microbiogenomics.u-psud.fr
More information on the Bioinformatics group at LRI: http://www.lri.fr/bioinfo