Research results
Ph.D de

Group : Large-scale Heterogeneous DAta and Knowledge

Automatic key discovery for Data Linking

Started on 05/10/2011
Advisor : PERNELLE-MANSCOUR, Nathalie
[SAIS Fatiha]

Funding :
Affiliation : Université Paris-Sud
Laboratory : LRI-IASI

Defended on 09/10/2014, committee :
Thesis advisor :
- Ms. Nathalie Pernelle, Associate Professor, LRI, Université Paris Sud

Co-advisor :
- Ms. Fatiha Saïs, Associate Professor, LRI, Université Paris Sud

Reviewers :
- Ms. Marie-Christine Rousset, Professor, LIG, Université de Grenoble
- Mr. Aldo Gangemi, Professor, LIPN, Université Paris 13

Examiners :
- Mr. Olivier Curé, Associate Professor, LIGM, Université Marne-la-Vallée
- Mr. Alain Denise, Professor, LRI, Université Paris Sud

Research activities :

Abstract :
In recent years, the Web of Data has grown significantly and now contains a huge number of RDF triples. Integrating data described in different RDF datasets and creating semantic links among them has become one of the most important goals of RDF applications. These links express semantic correspondences between ontology entities or data. Among the different kinds of semantic links that can be established, identity links express that different resources refer to the same real-world entity. Comparing the number of resources published on the Web to the number of identity links shows that the goal of building a Web of Data has not yet been accomplished. Several data linking approaches infer identity links using keys. Nevertheless, in most datasets published on the Web, keys are not available, and declaring them can be difficult, even for an expert.
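To illustrate how a key can be used to infer identity links, here is a minimal Python sketch. It is not taken from the thesis: the function name `link_by_key`, the property names, and the toy data are all hypothetical, and resources are simplified to dictionaries mapping a property to a set of values. Two resources are linked when they share at least one value for every property of the key.

```python
from itertools import product


def link_by_key(source, target, key):
    """Return (source_id, target_id) pairs that agree on every key property.

    Each dataset maps a resource identifier to a dict of
    property -> set of values (RDF properties can be multi-valued).
    Two resources agree on a property when their value sets intersect.
    """
    links = []
    for (s_id, s_props), (t_id, t_props) in product(source.items(), target.items()):
        if all(s_props.get(p, set()) & t_props.get(p, set()) for p in key):
            links.append((s_id, t_id))
    return links


# Hypothetical toy data, for illustration only.
dataset_a = {
    "a:person1": {"name": {"Marie Curie"}, "birthYear": {"1867"}},
    "a:person2": {"name": {"Pierre Curie"}, "birthYear": {"1859"}},
}
dataset_b = {
    "b:p10": {"name": {"Marie Curie", "Maria Sklodowska"}, "birthYear": {"1867"}},
    "b:p11": {"name": {"Marie Curie"}, "birthYear": {"1990"}},
}

# Assumed key: {name, birthYear}.
identity_links = link_by_key(dataset_a, dataset_b, key=("name", "birthYear"))
print(identity_links)
```

With the composite key, only `a:person1` and `b:p10` are linked. Using `name` alone would also link `a:person1` to `b:p11`, which shows why the choice of key matters for linking quality.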

The aim of this thesis is to study the problem of automatic key discovery in RDF data and to propose new, efficient approaches to tackle it. Data published on the Web are usually created automatically and may therefore contain errors or duplicates, or be incomplete. We thus focus on developing key discovery approaches that can handle large datasets containing incomplete or erroneous information. Our objective is to discover as many keys as possible, including keys that are valid only in subparts of the data.

We first introduce KD2R, an approach for the automatic discovery of composite keys in RDF datasets that may conform to different ontologies. KD2R can handle datasets that may be incomplete and for which the Unique Name Assumption holds. To deal with incomplete data, KD2R proposes two heuristics that offer different interpretations of the absence of data, and it uses pruning techniques to reduce the search space. However, this approach does not scale to the huge amounts of data found on the Web. We therefore present our second approach, SAKey, which scales to very large datasets by using effective filtering and pruning techniques. Moreover, SAKey can discover keys in datasets that may contain erroneous data or duplicates. More precisely, the notion of almost keys is proposed to describe sets of properties that fail to be keys only because of a few exceptions.
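The notion of an almost key can be sketched in a few lines of Python. This is a naive, quadratic illustration and not the SAKey algorithm, which relies on filtering and pruning. It assumes one reading of the abstract: an instance is an exception for a set of properties if it shares a value with some other instance on every property of the set, and the set is an n-almost key if it has at most n exceptions. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def exceptions(instances, props):
    """Return the instances that share values with another instance on all props.

    `instances` maps an identifier to a dict of property -> set of values.
    """
    found = set()
    for (a, props_a), (b, props_b) in combinations(instances.items(), 2):
        if all(props_a.get(p, set()) & props_b.get(p, set()) for p in props):
            found.update((a, b))
    return found


def is_almost_key(instances, props, n):
    """Assumed definition: props is an n-almost key if it has at most n exceptions."""
    return len(exceptions(instances, props)) <= n


# Hypothetical toy data, for illustration only.
people = {
    "p1": {"lastName": {"Smith"}, "city": {"Paris"}},
    "p2": {"lastName": {"Smith"}, "city": {"Lyon"}},
    "p3": {"lastName": {"Durand"}, "city": {"Paris"}},
    "p4": {"lastName": {"Martin"}, "city": {"Nice"}},
}

print(sorted(exceptions(people, ["lastName"])))        # p1 and p2 share "Smith"
print(is_almost_key(people, ["lastName"], n=0))        # not an exact key
print(is_almost_key(people, ["lastName"], n=2))        # key up to 2 exceptions
print(is_almost_key(people, ["lastName", "city"], n=0))  # exact composite key
```

In this toy dataset, `lastName` alone is not a key because two instances share a value, but it is a 2-almost key, while the composite set `{lastName, city}` is a key with no exceptions.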
