Ph.D
Group : Large-scale Heterogeneous DAta and Knowledge
Performance and optimization in peer to peer data management
Starts on 01/09/2009
Advisor : MANOLESCU-GOUJOT, Ioana
Funding : A
Affiliation : Université Paris-Saclay
Laboratory : LRI INRIA SACLAY
Defended on 05/07/2013, committee :
Alain Denise, Professor, Université Paris-Sud (examiner)
Yanlei Diao, Professor, University of Massachusetts Amherst (reviewer)
Ioana Manolescu, Research Director, INRIA Saclay & Université Paris-Sud (thesis advisor)
Philippe Rigaux, Professor, Conservatoire National des Arts et Métiers (reviewer)
Patrick Valduriez, Research Director, INRIA Sophia Antipolis – Méditerranée (examiner)
Vasilis Vassalos, Professor, Athens University of Economics and Business (examiner)
Abstract :
The W3C recommended XML in 1998 as a markup language for representing information independently of devices and systems. Today, XML also serves as a data model for storing and querying large volumes of data in database systems. Despite significant research and systems development, processing very large amounts of XML data still raises many performance problems.
Materialized views have long been used in databases to speed up queries. They can be seen as precomputed query results that can be reused to evaluate all or part of another query, and they have been studied intensively, in particular in the context of relational data warehousing.
This thesis investigates the applicability of materialized view techniques to optimizing the performance of Web data management tools, in particular in distributed settings, considering XML data and queries. We make three contributions.
We first consider the problem of choosing the best views to materialize within a given space budget in order to improve the performance of a query workload. Our work is the first to address the view selection problem for a rich subset of XQuery. The challenges we face stem from the expressive power and features of both the query and view languages and from the size of the search space of candidate views to materialize. While the general problem has prohibitive complexity, we propose and study a heuristic algorithm and demonstrate its superior performance compared to the state of the art.
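To make the view selection problem concrete, here is a minimal sketch of a generic greedy heuristic that ranks candidate views by estimated benefit per unit of space and keeps those that fit the budget. It is not the algorithm proposed in the thesis. The names `CandidateView` and `greedy_select`, the size and benefit figures, and the assumption that each view's benefit is independent of the other chosen views are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateView:
    name: str
    size: int        # estimated storage cost (assumed > 0)
    benefit: float   # estimated reduction in workload cost (assumed independent)


def greedy_select(candidates, budget):
    """Keep views in decreasing benefit/size order while they fit the budget."""
    chosen = []
    remaining = budget
    ranked = sorted(candidates, key=lambda v: v.benefit / v.size, reverse=True)
    for view in ranked:
        if view.size <= remaining:
            chosen.append(view)
            remaining -= view.size
    return chosen


# Hypothetical candidates; the numbers are illustrative only.
candidates = [
    CandidateView("v_authors", size=40, benefit=120.0),
    CandidateView("v_titles", size=30, benefit=60.0),
    CandidateView("v_full_docs", size=90, benefit=200.0),
]
selected = greedy_select(candidates, budget=100)
print([v.name for v in selected])
```

In practice the benefit of a view depends on which other views are materialized, which is one reason the real search space is so much larger than this sketch suggests.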
Second, we consider the management of large XML corpora in peer-to-peer networks based on distributed hash tables (DHTs, for short). We consider a platform that leverages distributed materialized XML views, defined by arbitrary XML queries, filled in with data published anywhere in the network, and exploited to efficiently answer queries issued by any network peer. This thesis has contributed important scalability-oriented optimizations, as well as a comprehensive set of experiments deployed in a country-wide WAN. These experiments exceed those of comparable systems by orders of magnitude in terms of data volumes and data dissemination throughput. They thus provide the most advanced understanding of the performance behavior of DHT-based XML content management in real settings.
Finally, we present a novel approach for scalable content-based publish/subscribe (pub/sub, for short) in the presence of constraints on the available computational resources of data publishers. We achieve scalability by off-loading subscriptions from the publisher and leveraging view-based query rewriting to feed these subscriptions from the data accumulated in other subscriptions. Our main contribution is a novel algorithm for organizing subscriptions in a multi-level dissemination network in order to serve large numbers of subscriptions, respect capacity constraints, and minimize latency. The efficiency and effectiveness of our algorithm are confirmed through extensive experiments and a large deployment in a WAN.
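The idea of a multi-level dissemination network under capacity constraints can be illustrated with a small sketch. It is not the thesis algorithm: it ignores query rewriting entirely and assumes that any subscription can be fed from any other. It only shows how a fan-out limit per node forces subscriptions onto deeper levels, and how filling the shallowest free node first keeps depth (a rough proxy for latency) low. The function names, the `"publisher"` root label, and the uniform `capacity` parameter are hypothetical.

```python
from collections import deque


def build_dissemination_tree(subscriptions, capacity):
    """Return {child: parent}; every node feeds at most `capacity` children."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    parent = {}
    child_count = {"publisher": 0}
    # Nodes that still have spare capacity, shallowest first.
    open_nodes = deque(["publisher"])
    for sub in subscriptions:
        feeder = open_nodes[0]
        parent[sub] = feeder
        child_count[feeder] += 1
        if child_count[feeder] == capacity:
            open_nodes.popleft()
        child_count[sub] = 0
        open_nodes.append(sub)
    return parent


def depth(node, parent):
    """Number of hops from `node` up to the publisher."""
    hops = 0
    while node in parent:
        node = parent[node]
        hops += 1
    return hops


tree = build_dissemination_tree([f"s{i}" for i in range(1, 7)], capacity=2)
for sub in sorted(tree):
    print(sub, "<-", tree[sub], "depth", depth(sub, tree))
```

With a capacity of 2, only two subscriptions are served directly by the publisher and the remaining four are fed by those two, so the publisher's load stays bounded as subscriptions are added.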