Towards a semi-automatic functional annotation tool based on decision tree techniques
MLSB'07
Jérôme Azé1,*, Lucie Gentils1,*, Claire Toffano-Nioche1, Valentin Loux2, Jean-François Gibrat2, Philippe Bessières2, Céline Rouveirol1,3, Anne Poupon4, Christine Froidevaux1
* First and second authors have contributed equally to the paper
1 LRI - UMR 8623 CNRS, Univ. Paris-Sud 11, F-91405 Orsay, France
2 MIG - INRA, Domaine de Vilvert, F-78352 Jouy-en-Josas, France
3 LIPN - UMR 7030 CNRS, Institut Galilée - Univ. Paris-Nord, F-93430 Villetaneuse, France
4 IBBMC - UMR 8619 CNRS, Univ. Paris-Sud 11, F-91405 Orsay, France
contact : jerome.aze [at] lri.fr
Abstract :
Due to the continuous improvement of high-throughput technologies and
experimental procedures, the number of sequenced genomes is increasing
exponentially.
Expert biologists play a central role in the analysis of this massive
amount of raw data. To annotate a new genome they need to integrate
many pieces of information coming from various sources: results of
bioinformatics analysis programs, data stored in specialized
databases, results of high-throughput experiments such as
transcriptomics, proteomics, etc., information stored in the
literature, general knowledge about the domain of interest (biological
properties of the studied organism, its ecology, etc.). To face the
deluge of new genomic data, there is a crying need to automate, as far
as possible, the annotation process itself.
Numerous annotation platforms have been designed to help scientists in
this task. They differ from one another in how the strategy of
annotators is taken into account and automated, in the degree of
interactivity with the user, and in the richness of the protein
characteristics that are considered. A comprehensive survey of
automatic annotation software and methods is given in [1]
and a table summarising characteristics of 20
platforms is given in [2].
However, a common characteristic
is that the final annotation still relies on the analysis of the
information gathered by the annotator. In the context of the RAFALE project our goal is to
provide biologists with a semi-automatic tool for functional
annotation that improves both the productivity of the
annotators and the consistency of the annotations. It is a
semi-automatic tool in the sense that the process is collaborative:
annotations are suggested by rules learnt from annotated genomes, and
biologists make the final decision. We chose to learn
rules from decision trees because such rules have several
desirable properties: they can be easily understood and used by human
annotators, and represent modular pieces of information that can be
considered as explanations of the annotations proposed. In our
approach not only do we aim at obtaining good quality annotations but
we also focus on how they have been obtained.
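To illustrate how a rule read off a decision tree can double as an explanation, here is a minimal sketch in Python. It is not the RAFALE implementation: the tree is hand-written, the thresholds and class names are invented, and the two attributes (number of transmembrane segments, isoelectric point) are simply borrowed from the protein characteristics mentioned in this abstract. The function returns the suggested class together with the list of tests that led to it, which is what an annotator would inspect before accepting or rejecting the suggestion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Leaf:
    """Terminal node: the class suggested to the annotator."""
    label: str


@dataclass
class Node:
    """Internal node: a readable test on one protein attribute."""
    description: str
    test: Callable[[Dict[str, float]], bool]
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def predict_with_rule(tree: Union[Node, Leaf],
                      protein: Dict[str, float]) -> Tuple[str, List[str]]:
    """Walk the tree and return (suggested class, conditions met on the path)."""
    conditions: List[str] = []
    node = tree
    while isinstance(node, Node):
        outcome = node.test(protein)
        conditions.append(
            node.description if outcome else f"not ({node.description})"
        )
        node = node.if_true if outcome else node.if_false
    return node.label, conditions


# Hypothetical tree: thresholds and class names are invented for illustration.
TOY_TREE = Node(
    "transmembrane_segments >= 2",
    lambda p: p["transmembrane_segments"] >= 2,
    Leaf("hypothetical class A"),
    Node(
        "isoelectric_point > 9.0",
        lambda p: p["isoelectric_point"] > 9.0,
        Leaf("hypothetical class B"),
        Leaf("hypothetical class C"),
    ),
)

# Hypothetical protein description.
protein = {"transmembrane_segments": 0, "isoelectric_point": 9.6}
label, rule = predict_with_rule(TOY_TREE, protein)

# The path through the tree reads as an IF ... THEN ... rule.
print("IF " + " AND ".join(rule) + " THEN " + label)
```

Each root-to-leaf path is one modular rule, so a biologist can judge a single suggestion without having to read the whole tree.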
In the following, we propose to apply decision tree techniques to
the problem of predicting classes from a functional hierarchy. Two
different frameworks have been chosen to represent rules that are more
or less expressive and, accordingly, more or less expensive:
first-order decision trees [3] and multi-labeled
attribute-value decision trees [4]. The protein characteristics used for the
training stage are available from the AGMIAL annotation platform, which allows us to
combine various pieces of information about many proteins, such as
homology relationships between proteins and intrinsic properties of
protein sequences (isoelectric point, molecular mass and number of
transmembrane segments [5]). This platform has been used to
annotate two lactic acid bacteria: Lactobacillus sakei [6]
and Lactobacillus bulgaricus [7], using the Subtilist
functional hierarchy [8]. In our study, we have excluded
classes 4, 5, 6 and their subclasses. Class 4 groups together
proteins with very diverse functions and thus presents no homogeneity.
Classes 5 and 6 contain proteins of unknown function. In this paper
we focus on the methods used to learn the prediction rules. Detailed
examples of rules are available here.
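The exclusion of classes 4, 5 and 6 and their subclasses can be sketched as a simple filter over multi-labeled annotations. This is only an illustration under stated assumptions: class codes are assumed to be dotted strings such as "2.1.1" whose first component is the top-level class, the protein names and annotations are invented, and dropping a protein once all of its classes are excluded is a choice made here, not something stated in the paper.

```python
from typing import Dict, List

# Top-level classes left out of the study (classes 4, 5 and 6).
EXCLUDED_TOP_LEVEL = {"4", "5", "6"}


def is_excluded(class_code: str) -> bool:
    """True if the class is 4, 5, 6 or one of their subclasses.

    Assumes dotted codes; comparing the first component (rather than a
    string prefix) avoids confusing a code such as "41" with class "4".
    """
    return class_code.split(".")[0] in EXCLUDED_TOP_LEVEL


def filter_training_set(annotations: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Remove excluded classes; drop proteins left with no class (assumption)."""
    kept: Dict[str, List[str]] = {}
    for protein, classes in annotations.items():
        remaining = [c for c in classes if not is_excluded(c)]
        if remaining:
            kept[protein] = remaining
    return kept


# Hypothetical multi-labeled annotations, for illustration only.
toy_annotations = {
    "protein_a": ["1.2"],
    "protein_b": ["2.1.1", "4.3"],
    "protein_c": ["5"],
    "protein_d": ["6.1"],
}

training_set = filter_training_set(toy_annotations)
print(training_set)
```

A protein carrying several classes keeps those that remain, which matches the multi-labeled setting described above.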