Scientific output
Doctoral thesis of

Team: Artificial Intelligence and Inference Systems

Understanding the Hidden Web (Comprendre le web caché)

Started on 01/01/1970
Supervisor: ABITEBOUL, Serge

Doctoral school:
Institution of enrollment: INRIA

Location:

Defended on 12/01/2007 before a jury composed of:
Serge Abiteboul
François Bourdoncle
Patrick Gallinari
Georg Gottlob
Christine Paulin-Mohring
Val Tannen

Research activities:
   - Semantic Web

Abstract:
The hidden Web (also known as deep or invisible Web), that is, the part of the Web not directly accessible through hyperlinks, but through HTML forms or Web services, is of great value, but difficult to exploit.
We discuss a process for the fully automatic discovery, syntactic and semantic analysis, and querying of hidden-Web services. We first propose a general architecture that relies on a semi-structured warehouse of imprecise (probabilistic) content. We provide a detailed complexity analysis of the underlying probabilistic tree model. We describe how a combination of heuristics and probing can be used to understand the structure of an HTML form. We present an original use of a supervised machine-learning method, namely conditional random fields, in an unsupervised manner, on an automatic, imperfect, and imprecise annotation based on domain knowledge, in order to extract relevant information from HTML result pages. To obtain semantic relations between the inputs and outputs of a hidden-Web service, we investigate the complexity of deriving a schema mapping between database instances, relying solely on the presence of constants in the two instances. Finally, we describe a model for the semantic representation and intensional indexing of hidden-Web sources, and discuss how to process a user’s high-level query using such descriptions.