Ph.D
Group : Learning and Optimization
Efficient End-to-End Monitoring for Fault Management in Distributed Systems
Started on 15/10/2010
Advisor : GERMAIN, Cécile
Funding : Scholarship for foreign students
Affiliation : Université Paris-Saclay
Laboratory : LRI
Defended on 27/03/2014, committee :
Cécile Germain-Renaud, Professor, Thesis advisor, Université Paris Sud, LRI/TAO
Joffroy Beauquier, Professor, Committee chair, Université Paris-Sud, LRI/ParSys
Lorenza Saitta, Professor, Reviewer, Università del Piemonte Orientale (Italy)
Johan Montagnat, CNRS Research Director, Reviewer, CNRS
Michèle Sebag, CNRS Research Director, Examiner, CNRS
Xiangliang Zhang, Assistant Professor, Examiner, King Abdullah University of Science & Technology (Saudi Arabia)
Irina Rish, Researcher, Examiner, IBM T. J. Watson Research Center (USA)
Research activities :
Abstract :
In this dissertation, we present our work on fault management in distributed systems, motivated by the monitoring of faults and abrupt changes in large computing systems such as grids and clouds. Instead of building complete a priori knowledge of the software and hardware infrastructures, as conventional detection and diagnosis methods do, we propose end-to-end monitoring for such large-scale systems, leaving the inaccessible details of the components involved in a black box.
For fault monitoring of a distributed system, we first model this probe-based application as a static collaborative prediction (CP) task and experimentally demonstrate the effectiveness of CP methods using max-margin matrix factorization. We then introduce active learning into the CP framework and show its critical advantage in dealing with highly imbalanced data, which is especially useful for identifying the minority fault class.
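As a rough illustration of the collaborative prediction idea, the sketch below treats probe outcomes as a partially observed matrix with entries in {-1, +1} and fits a low-rank factorization with a hinge loss and Frobenius-norm regularization, in the spirit of factored max-margin matrix factorization. It is a simplified stand-in, not the implementation used in the thesis: the function names, the plain SGD optimizer, the hyperparameters, and the margin-based probe selection in `next_probe` are all assumptions made for illustration.

```python
import random

def fit_hinge_mf(observed, n_rows, n_cols, rank=2, reg=0.01, lr=0.05,
                 epochs=200, seed=0):
    """Fit U, V so that sign(U[i] . V[j]) matches observed labels.

    observed: list of (i, j, y) triples with y in {-1, +1}
    (hypothetical encoding: one row per probing host, one column per target).
    """
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    entries = list(observed)
    for _ in range(epochs):
        rng.shuffle(entries)
        for i, j, y in entries:
            score = sum(u * v for u, v in zip(U[i], V[j]))
            active = y * score < 1  # hinge loss is active inside the margin
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Subgradient of hinge loss plus L2 regularization.
                grad_u = reg * u - (y * v if active else 0.0)
                grad_v = reg * v - (y * u if active else 0.0)
                U[i][k] = u - lr * grad_u
                V[j][k] = v - lr * grad_v
    return U, V

def score(U, V, i, j):
    """Real-valued prediction for entry (i, j)."""
    return sum(u * v for u, v in zip(U[i], V[j]))

def predict(U, V, i, j):
    """Predicted label in {-1, +1} for entry (i, j)."""
    return 1 if score(U, V, i, j) >= 0 else -1

def next_probe(U, V, unobserved):
    """Assumed active-learning rule: probe the entry closest to the boundary."""
    return min(unobserved, key=lambda ij: abs(score(U, V, ij[0], ij[1])))
```

In an active-learning loop, one would alternate between fitting on the entries observed so far and calling `next_probe` to choose which probe to launch next, so that the probing budget concentrates on uncertain entries rather than on the dominant non-faulty class.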
We then extend static fault monitoring to the sequential case by proposing the sequential matrix factorization (SMF) method. SMF takes a sequence of partially observed matrices as input and produces predictions using information from both the current and past time windows. Active learning is also applied to SMF so that highly imbalanced data can be handled properly. In addition to the sequential methods, smoothing the estimation sequence has been shown to be a practically useful trick for enhancing sequential prediction performance.
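The abstract does not say which smoothing operation is applied to the estimation sequence. As one plausible reading, the sketch below applies an exponential moving average across time windows to the real-valued score matrices produced for each window. The choice of an exponential moving average, the function name `smooth_scores`, and the parameter `alpha` are assumptions for illustration only.

```python
def smooth_scores(score_sequence, alpha=0.5):
    """Exponentially smooth a sequence of score matrices over time.

    score_sequence: list of matrices (lists of lists of floats), one per
    time window, all with the same shape. alpha weights the current window;
    (1 - alpha) weights the smoothed history.
    """
    smoothed = []
    previous = None
    for scores in score_sequence:
        if previous is None:
            # The first window has no history, so copy it unchanged.
            current = [row[:] for row in scores]
        else:
            current = [
                [alpha * s + (1 - alpha) * p for s, p in zip(s_row, p_row)]
                for s_row, p_row in zip(scores, previous)
            ]
        smoothed.append(current)
        previous = current
    return smoothed
```

Thresholding the smoothed scores instead of the raw per-window scores damps isolated flips in the predicted labels, at the cost of reacting more slowly to genuine changes.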
Since the stationarity assumption underlying static and sequential fault monitoring becomes unrealistic in the presence of abrupt changes, we propose a semi-supervised online change detection (SSOCD) framework to detect intended changes in time series data. In this way, the static model of the system can be recomputed once an abrupt change is detected. In SSOCD, an unsupervised offline method first analyzes a sample data series. The change points thus detected are used to train a supervised online model, which decides online whether a change is present in the arriving data sequence. State-of-the-art change detection methods are employed to demonstrate the usefulness of the framework.
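The two-stage structure described above (offline unsupervised labeling, then a supervised online detector) can be sketched as follows. Every concrete choice here is an assumption standing in for the methods actually used in the thesis: the offline step is a single mean-shift change-point search, the feature is the difference of means between two adjacent windows of width `w`, and the "supervised model" is just a learned threshold.

```python
from math import sqrt
from statistics import mean

def offline_change_point(series, min_seg=3):
    """Unsupervised offline step: best single split for a shift in the mean."""
    n = len(series)
    best_t, best_stat = None, 0.0
    for t in range(min_seg, n - min_seg + 1):
        gap = abs(mean(series[:t]) - mean(series[t:]))
        stat = sqrt(t * (n - t) / n) * gap  # weighted mean-shift statistic
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t

def window_feature(series, t, w):
    """Absolute mean shift between the w points before t and the w from t on."""
    return abs(mean(series[t:t + w]) - mean(series[t - w:t]))

def train_threshold(series, change_point, w=3):
    """Supervised step: threshold halfway between the labeled change and the rest."""
    feats = {t: window_feature(series, t, w)
             for t in range(w, len(series) - w + 1)}
    positive = feats[change_point]
    negative = max(f for t, f in feats.items() if abs(t - change_point) >= w)
    return (positive + negative) / 2

def online_detect(stream, threshold, w=3):
    """Online step: yield the estimated change index when the feature fires."""
    buffer = []
    for i, x in enumerate(stream):
        buffer.append(x)
        if len(buffer) > 2 * w:
            buffer.pop(0)
        if len(buffer) == 2 * w:
            if abs(mean(buffer[w:]) - mean(buffer[:w])) > threshold:
                yield i - w + 1  # first index of the newer window

# Usage: label a sample series offline, then monitor a new stream online.
sample = [0.0, 0.1, -0.1, 0.0, 0.1, -0.1, 0.0, 0.1,
          5.0, 5.1, 4.9, 5.0, 5.1, 4.9, 5.0, 5.1]
labeled_change = offline_change_point(sample)
threshold = train_threshold(sample, labeled_change)
alarms = list(online_detect([1.0] * 6 + [4.0] * 6, threshold))
```

Note that this online detector reports a change only after `w` post-change points have arrived, so the window width trades detection delay against robustness to noise.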
All presented work is validated on real-world datasets. Specifically, the fault monitoring experiments are conducted on a dataset collected from the Biomed grid infrastructure within the European Grid Initiative, and the abrupt change detection framework is validated on a dataset concerning the performance changes of an online site with a large amount of traffic.