M2-DS: Knowledge Discovery in Graph Data - Projects

Objectives

We propose seven different projects for the evaluation of the Information Integration course. These projects should be carried out by groups of 4 students (depending on the complexity of the project).

For grading, each group should provide a report of 8 to 10 pages describing what has been done in the project and what remains to be done. You may also suggest possible improvements to what you have developed. The report should be sent by email two days before the presentation.

We also ask each group to prepare an oral presentation of 20 minutes.

Important Dates

  • Reports are due: Monday, January 4th, 2021
  • Project presentation/demo: Tuesday, January 5th, 2021


Project details and materials

1. Link invalidation project

The aim of this project is to develop a simple tool that helps detect incorrect sameAs links by using the functionality degree of properties. Your tool has to compute this degree on one dataset and then use it to detect dissimilar values of the functional properties that describe two URIs from two datasets (loosely inspired by Papaleo et al. 2014).

The RDF dataset with its corresponding OWL ontology:
  • IIMB datasets are available here: link
  • Choose OWL datatrack IIMB large.
  • There are many versions of the dataset, each consisting of an OWL ontology with its instances (the OWL file in folders 000, 001, …).
  • Refalign in folder 001 is the gold standard (the set of correct owl:sameAs links between the data in 000 and the ones in 001).
  • Take the first dataset (000) and extract the functional properties, i.e. compute the degree of functionality of each property and select the ones having a very high degree according to a fixed threshold. Then, develop a simple invalidation tool based on these functional properties.
  • For the evaluation of your tool, you may inject random erroneous sameAs links for the class Film in the refalign and check whether your tool finds these erroneous links (recall and precision measures)
  • A library providing a set of similarity measures: SecondString. You may find additional details here
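
The steps above can be sketched in a few lines of Python. This is only an illustrative starting point under several assumptions that are not part of the project statement: triples are plain `(subject, property, object)` tuples already loaded in memory, the functionality degree of a property is taken to be the number of distinct subjects divided by the number of distinct (subject, object) pairs, both datasets use the same property names, and the value comparison is a naive case-insensitive equality standing in for a real similarity measure. All function names and the sample data are hypothetical.

```python
from collections import defaultdict

def functionality_degrees(triples):
    """Assumed definition: #distinct subjects / #distinct (subject, object) pairs."""
    subjects, pairs = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        subjects[p].add(s)
        pairs[p].add((s, o))
    return {p: len(subjects[p]) / len(pairs[p]) for p in pairs}

def functional_properties(triples, threshold=0.95):
    """Keep the properties whose degree reaches the (arbitrary) threshold."""
    return {p for p, d in functionality_degrees(triples).items() if d >= threshold}

def values_by_property(triples, uri):
    """Group the values describing one URI by property."""
    out = defaultdict(set)
    for s, p, o in triples:
        if s == uri:
            out[p].add(o)
    return out

def naive_similar(a, b):
    """Placeholder comparison; a string similarity measure would go here."""
    return a.strip().lower() == b.strip().lower()

def is_suspicious(link, source, target, functional, similar=naive_similar):
    """Flag a sameAs link if some functional property has values on both
    sides and no pair of values is considered similar."""
    left = values_by_property(source, link[0])
    right = values_by_property(target, link[1])
    for p in functional:
        lv, rv = left.get(p, set()), right.get(p, set())
        if lv and rv and not any(similar(a, b) for a in lv for b in rv):
            return True
    return False

def precision_recall(flagged, injected):
    """Compare the flagged links with the erroneous links that were injected."""
    hits = len(flagged & injected)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(injected) if injected else 0.0
    return precision, recall

# Toy data (hypothetical, not taken from IIMB).
source = [
    ("a:f1", "name", "Alien"), ("a:f1", "actor", "X"), ("a:f1", "actor", "Y"),
    ("a:f2", "name", "Brazil"), ("a:f2", "actor", "Z"),
]
target = [("b:g1", "name", "alien"), ("b:g2", "name", "Heat")]
links = [("a:f1", "b:g1"), ("a:f2", "b:g2")]   # the second link is the injected error
injected = {("a:f2", "b:g2")}

functional = functional_properties(source)
flagged = {l for l in links if is_suspicious(l, source, target, functional)}
precision, recall = precision_recall(flagged, injected)
```

In this toy example only `name` passes the threshold, since `actor` has several values for one subject. On the real data, the triples would first have to be parsed from the OWL files, and the threshold and similarity measure are choices to justify in your report.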

2. Difference project

The aim of this project is to propose a formal definition of the contextual difference relation between two URIs. A context of difference can be a subgraph of the instance descriptions of the URIs. Then, develop a tool that can extract these subgraphs for each pair of URIs.

The RDF dataset with its corresponding OWL ontology:
  • IIMB datasets are available here: link
  • Choose OWL datatrack IIMB large.
  • There are many versions of the dataset, each consisting of an OWL ontology with its instances (the OWL file in folders 000, 001, …).
  • Refalign in folder 001 is the gold standard (the set of correct owl:sameAs links between the data in 000 and the ones in 001).
  • For the qualitative evaluation of your tool, you may inject some random erroneous sameAs links for the class Film in the refalign and compute the contextual difference of each pair of URIs (recall and precision are not needed). You may evaluate the readability of the contexts and the scalability of your tool.
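
To make the idea of a difference context more concrete, here is a deliberately simple sketch. It is not the formal definition the project asks you to propose. It assumes a depth-1 context, triples given as `(subject, property, object)` tuples, shared property names across the two datasets, and exact value comparison. The function names and the sample data are hypothetical.

```python
from collections import defaultdict

def description(triples, uri):
    """Depth-1 description of a URI: property -> set of values."""
    out = defaultdict(set)
    for s, p, o in triples:
        if s == uri:
            out[p].add(o)
    return dict(out)

def difference_context(source, target, uri1, uri2):
    """For each property on which the two descriptions disagree, return the
    values found only for uri1 and the values found only for uri2."""
    d1, d2 = description(source, uri1), description(target, uri2)
    context = {}
    for p in sorted(set(d1) | set(d2)):
        only1 = d1.get(p, set()) - d2.get(p, set())
        only2 = d2.get(p, set()) - d1.get(p, set())
        if only1 or only2:
            context[p] = (only1, only2)
    return context

# Toy data (hypothetical, not taken from IIMB).
source = [("a:f1", "name", "Alien"), ("a:f1", "year", "1979")]
target = [("b:g1", "name", "Alien"), ("b:g1", "year", "1980"),
          ("b:g1", "director", "Scott")]

context = difference_context(source, target, "a:f1", "b:g1")
```

Here the context contains `year` and `director` but not `name`, on which both descriptions agree. A fuller definition would probably need to follow object properties beyond depth 1 and tolerate near-identical literals, and both choices affect the readability and scalability you are asked to evaluate.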