Research results
Faculty habilitation
Group : Artificial Intelligence and Inference Systems

Knowledge Representation meets Databases for the sake of ontology-based data management

Starts on 11/07/2012
Advisor :

Funding :
Affiliation : Université Paris-Sud
Laboratory : Univ. Paris-Sud

Defended on 11/07/2012, committee :
Serge Abiteboul, Research Director, INRIA Saclay Ile-de-France (Examiner)
Alexander Borgida, Professor, Rutgers University (Reviewer)
Michel Chein, Professor, Université de Montpellier (Examiner)
Vassilis Christophides, Professor, ICS-FORTH, Crete (Reviewer)
Philippe Dague, Professor, Université Paris-Sud (Examiner)
Pierre Marquis, Professor, Université de Lens (Reviewer)
Marie-Christine Rousset, Professor, Université de Grenoble (Sponsor)

Research activities :
   - Databases
   - Knowledge representation
   - Automated reasoning

Abstract :
Data management is a longstanding research topic in Knowledge Representation (KR), a prominent discipline of Artificial Intelligence (AI), and -- of course -- in Databases (DB).
Until the end of the 20th century, there were few interactions between these two research fields concerning data management, essentially because they addressed it from different perspectives. KR investigated data management according to human cognitive schemes for the sake of intelligibility, e.g., using Conceptual Graphs or Description Logics, while DB focused on data management according to simple mathematical structures for the sake of efficiency, e.g., using the relational model or the eXtensible Markup Language.
At the beginning of the 21st century, these ideological stances changed with the new era of ontology-based data management. Roughly speaking, ontology-based data management brings data management one step closer to end-users, especially to those who are not computer scientists or engineers. It revisits the traditional architecture of database management systems by decoupling the models with which data is exposed to end-users from the models with which data is stored. Notably, ontology-based data management advocates the use of conceptual models from KR as human-intelligible front-ends called ontologies, relegating DB models to back-end storage.
The World Wide Web Consortium (W3C) has greatly contributed to ontology-based data management by providing standards for handling data through ontologies: the two Semantic Web data models. The first standard, the Resource Description Framework (RDF), was introduced in 1998. It is a graph data model that comes with a very simple ontology language, RDF Schema, strongly related to description logics. The second standard, the Web Ontology Language (OWL), was introduced in 2004. It is actually a family of well-established description logics with varying expressivity/complexity tradeoffs.
The advent of RDF and OWL has rapidly focused the attention of academia and industry on practical ontology-based data management. The research community has taken up this challenge at the highest level, leading to pioneering and compelling contributions in top venues on Artificial Intelligence (e.g., AAAI, ECAI, IJCAI, and KR), on Databases (e.g., ICDT/EDBT, ICDE, SIGMOD/PODS, and VLDB), and on the Web (e.g., ESWC, ISWC, and WWW). In addition, open-source and commercial software providers are releasing an ever-growing number of tools for effective RDF and OWL data management (e.g., Jena, ORACLE 10/11g, OWLIM, Protégé, RDF-3X, and Sesame).
Last but not least, large communities have promptly adopted RDF and OWL data management (e.g., library and information science, life science, and medicine), sustaining and prompting further efforts towards ever more convenient, efficient, and scalable ontology-based data management techniques.
This HDR thesis reports on some of my contributions to the design, the optimization, and the decentralization of data management techniques for RDF and OWL.
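
As an illustration of the RDF graph data model and RDF Schema mentioned in the abstract, the following is a minimal sketch, not taken from the thesis. It assumes a toy in-memory set of triples and applies only two RDFS entailment rules (transitivity of rdfs:subClassOf and propagation of rdf:type along subclass edges). All identifiers, including the ex: names and the functions rdfs_closure and instances_of, are hypothetical and chosen for this example only.

```python
# Toy illustration only: RDF data as a set of (subject, property, object) triples.
# The "ex:" resources and both helper functions are assumptions for this sketch.
RDF_TYPE = "rdf:type"
RDFS_SUBCLASS = "rdfs:subClassOf"


def rdfs_closure(triples):
    """Saturate the triples with two RDFS rules:
    subclass transitivity and type propagation along subclass edges."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in inferred:
            if p != RDFS_SUBCLASS:
                continue
            for (s2, p2, o2) in inferred:
                # (s subClassOf o), (o subClassOf o2) => (s subClassOf o2)
                if p2 == RDFS_SUBCLASS and s2 == o:
                    new.add((s, RDFS_SUBCLASS, o2))
                # (s subClassOf o), (s2 type s) => (s2 type o)
                if p2 == RDF_TYPE and o2 == s:
                    new.add((s2, RDF_TYPE, o))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


def instances_of(cls, triples):
    """Return the sorted subjects explicitly typed with cls in the given triples."""
    return sorted(s for (s, p, o) in triples if p == RDF_TYPE and o == cls)


graph = {
    ("ex:HDRThesis", RDFS_SUBCLASS, "ex:Thesis"),
    ("ex:Thesis", RDFS_SUBCLASS, "ex:Document"),
    ("ex:doc1", RDF_TYPE, "ex:HDRThesis"),
}

print(instances_of("ex:Document", graph))                # explicit triples only
print(instances_of("ex:Document", rdfs_closure(graph)))  # after saturation
```

Querying the explicit triples alone misses ex:doc1 as a document, while querying the saturated graph finds it. This is the kind of implicit information that an ontology contributes on top of the stored data. Real systems handle the full RDFS and OWL semantics, which this sketch does not attempt.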

Ph.D. dissertations & Faculty habilitations
Question Answering is a discipline that lies between natural language processing and information retrieval. The emergence of deep learning approaches in several fields of research, such as computer vision, natural language processing, and speech recognition, has led to the rise of end-to-end models. In the context of the GoASQ project, we investigate, compare, and combine different approaches for answering questions formulated in natural language over textual data, on both open domain and biomedical domain data. The thesis work mainly focuses on 1) building models for small scale and large scale datasets, and 2) leveraging structured and semantic information in question answering models. Hybrid data in our research context is the fusion of knowledge from free text, ontologies, entity information, etc., applied to free text question answering. The current state-of-the-art models for question answering are based on deep learning. To facilitate their use on small scale datasets in closed domains, we propose to use domain adaptation. We model the BIOASQ biomedical question answering task dataset as two different QA tasks and show, by comparing experimental results, that the Open Domain Question Answering task suits it better than the Reading Comprehension task. We pre-train the Reading Comprehension model with different datasets to show the variability in performance when these models are adapted to the biomedical domain. We find that, for single-dataset pre-training, one particular dataset (the SQUAD v2.0 dataset) performs best, and that a combination of four Reading Comprehension datasets performs best overall for biomedical domain adaptation. We perform some of the above experiments using large scale pre-trained language models like BERT, which are fine-tuned to the question answering task. The performance varies based on the type of data used to pre-train BERT.
For BERT pre-training on the language modelling task, we find BIOBERT, which is trained on biomedical data, to be the best choice for biomedical QA. Since deep learning models tend to function in an end-to-end fashion, semantic and structured information coming from expert-annotated information sources is not explicitly used. We highlight the need for Lexical and Expected Answer Types in open domain and biomedical domain question answering by performing several verification experiments. These types are used to highlight entities in two QA tasks, which shows improvements when using entity embeddings based on the answer type annotations. We manually annotated an answer variant dataset for BIOASQ and show the importance of learning a QA model with answer variants present in the paragraphs. Our hypothesis is that the results obtained from deep learning models can be further improved using semantic features and collective features from the different paragraphs retrieved for a question. We propose to use ranking models based on binary classification methods to better rank the Top-1 prediction among the Top-K predictions using these features, leading to a hybrid model that outperforms state-of-the-art results on several datasets. We experiment with several overall Open Domain Question Answering models on QA sub-task datasets built for the Reading Comprehension and Answer Sentence Selection tasks. We show the difference in performance when these are modelled as an overall QA task and highlight the wide gap that remains in building end-to-end models for the overall question answering task.

The original manuscript conceptualizes the recent rise of digital platforms along three main dimensions: their nature as coordination devices fueled by data, the ensuing transformations of labor, and the accompanying promises of societal innovation. The overall ambition is to unpack the coordination role of the platform and where it stands with respect to the classical firm/market duality. It is also to understand precisely how the platform uses data to do so, where it drives labor, and how it accommodates socially innovative projects. I extend this analysis to show continuity between today’s society dominated by platforms and the “organizational society”, claiming that platforms are organized structures that distribute resources, produce asymmetries of wealth and power, and push social innovation to the periphery of the system. I discuss the policy implications of these tendencies and propose avenues for follow-up research.