Towards a semi-automatic functional annotation tool based on decision tree techniques

MLSB'07
Jérôme Azé1,*, Lucie Gentils1,*, Claire Toffano-Nioche1, Valentin Loux2, Jean-François Gibrat2, Philippe Bessières2, Céline Rouveirol1,3, Anne Poupon4, Christine Froidevaux1

* The first and second authors contributed equally to the paper
1 LRI - UMR 8623 CNRS, Univ. Paris-Sud 11, F-91405 Orsay, France
2 MIG - INRA, Domaine de Vilvert, F-78352 Jouy-en-Josas, France
3 LIPN - UMR 7030 CNRS, Institut Galilée - Univ. Paris-Nord, F-93430 Villetaneuse, France
4 IBBMC - UMR 8619 CNRS, Univ. Paris-Sud 11, F-91405 Orsay, France

contact : jerome.aze [at] lri.fr

Abstract :

Owing to continuous improvements in high-throughput technologies and experimental procedures, the number of sequenced genomes is increasing exponentially.

Expert biologists play a central role in the analysis of this massive amount of raw data. To annotate a new genome, they need to integrate many pieces of information from various sources: results of bioinformatics analysis programs; data stored in specialized databases; results of high-throughput experiments such as transcriptomics and proteomics; information in the literature; and general knowledge about the domain of interest (biological properties of the studied organism, its ecology, etc.). To cope with the deluge of new genomic data, there is a pressing need to automate the annotation process itself as far as possible.

Numerous annotation platforms have been designed to help scientists in this task. They differ from one another in how the annotators' strategy is taken into account and automated, in the degree of interactivity with the user, and in the richness of the protein characteristics that are considered. A comprehensive survey of automatic annotation software and methods is given in [1], and a table summarising the characteristics of 20 platforms is given in [2]. However, these platforms share one characteristic: the final annotation still relies on the annotator's analysis of the gathered information. In the context of the RAFALE project, our goal is to provide biologists with a semi-automatic tool for functional annotation that improves both the productivity of the annotators and the consistency of the annotations. The tool is semi-automatic in the sense that the process is collaborative: annotations are suggested by rules learnt from already annotated genomes, and biologists decide on the final annotation. We chose to learn rules through decision trees because such rules have several desirable features: they are easily understood and used by human annotators, and they are modular pieces of information that can serve as explanations of the proposed annotations. In our approach, we aim not only at obtaining good-quality annotations but also at making explicit how they were obtained.

In the following, we propose to apply decision tree techniques to the problem of predicting classes from a functional hierarchy. Two different frameworks have been chosen to represent rules that are more or less expressive and, accordingly, more or less expensive to learn: first-order decision trees [3] and multi-labeled attribute-value decision trees [4]. The protein characteristics used for the training stage are available from the AGMIAL annotation platform, which allows us to combine various pieces of information about many proteins, such as homology relationships between proteins and intrinsic properties of protein sequences (isoelectric point, molecular mass and number of transmembrane segments) [5]. This platform has been used to annotate two lactic acid bacteria, Lactobacillus sakei [6] and Lactobacillus bulgaricus [7], using the Subtilist functional hierarchy [8]. In our study, we have excluded classes 4, 5, 6 and their subclasses. Class 4 groups together proteins with very diverse functions and is therefore not homogeneous. Classes 5 and 6 contain proteins of unknown function. In this paper we focus on the methods used to learn the prediction rules. Detailed examples of rules are available here.
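
As an illustration only, the following Python sketch shows how a rule read off a decision tree could suggest a hierarchy class together with a readable explanation, while leaving out classes 4, 5, 6 and their subclasses. It is not the RAFALE or AGMIAL implementation. The dot-separated class codes (such as "1.2"), the names `Protein`, `Rule` and `suggest`, the mass unit, and the threshold and class used in the example rule are all hypothetical assumptions made for the sake of the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Top-level classes excluded in the study (together with their subclasses).
EXCLUDED_TOP_LEVEL = {"4", "5", "6"}


def is_excluded(class_code: str) -> bool:
    """True if the class lies under an excluded top-level class.

    Assumes dot-separated codes such as "1.2.3" (a hypothetical encoding).
    """
    return class_code.split(".")[0] in EXCLUDED_TOP_LEVEL


def ancestors(class_code: str) -> List[str]:
    """Return the class and its ancestors, most general first."""
    parts = class_code.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


@dataclass
class Protein:
    name: str
    isoelectric_point: float
    molecular_mass: float          # unit (Da) is an assumption
    transmembrane_segments: int


@dataclass
class Rule:
    """One root-to-leaf path of a decision tree, kept human-readable."""
    explanation: str                       # shown to the annotator
    condition: Callable[[Protein], bool]   # conjunction of tests on the path
    predicted_class: str


def suggest(protein: Protein, rules: List[Rule]) -> List[Tuple[str, str]]:
    """Return (class, explanation) pairs for every rule that fires.

    The result is only a suggestion: the biologist makes the final decision.
    """
    return [
        (rule.predicted_class, rule.explanation)
        for rule in rules
        if not is_excluded(rule.predicted_class) and rule.condition(protein)
    ]


# Hypothetical rule: the threshold and the class code are invented.
example_rules = [
    Rule(
        explanation="transmembrane_segments >= 4",
        condition=lambda p: p.transmembrane_segments >= 4,
        predicted_class="1.2",
    ),
    Rule(
        explanation="isoelectric_point > 9.0",
        condition=lambda p: p.isoelectric_point > 9.0,
        predicted_class="5.1",  # under an excluded class, never suggested
    ),
]

protein = Protein("hypothetical_protein", 9.5, 42000.0, 6)
suggestions = suggest(protein, example_rules)
```

Keeping the explanation next to each predicted class reflects the aim stated above: the annotator sees why a class is proposed as well as which class it is.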
