Research results
Ph.D de

Group : Parallelism

Nouveaux protocoles de tolérance aux pannes pour les applications de calcul haute performance (New fault tolerance protocols for high-performance computing applications)

Starts on 01/10/2008
Advisor : CAPPELLO, Franck

Funding : AM
Affiliation : Université Paris-Sud
Laboratory : LRI

Defended on 06/12/2011, committee :
André Schiper (Reviewer), Professor, EPFL
Pierre Sens (Reviewer), Professor, Université Paris 6

George Bosilca (Examiner), Research Assistant and Adjunct Assistant Professor, ICL, University of Tennessee
Claude Puech (Examiner), Professor, Université Paris Sud
Jean-Louis Roch (Examiner), Associate Professor, IMAG
Frédéric Vivien (Examiner), Research Director, INRIA

Marc Snir (Invited member), Professor, University of Illinois at Urbana-Champaign

Franck Cappello (Thesis advisor), Research Director, INRIA


Abstract :
As parallel computers evolve, fault
tolerance protocols become necessary. The techniques used must
minimize the impact of failures while providing good failure-free performance.
Existing fault tolerance protocols force either a global restart (coordinated
checkpointing protocols) or the logging of all messages (message logging protocols),
and thus they are not well suited to these architectures.

We studied the characteristics of the existing protocols. We first studied the
determinism of applications, since existing protocols assume either non-deterministic
executions (checkpointing protocols) or piecewise deterministic ones (message
logging protocols). Our study focused on the message passing model, and more
specifically on MPI applications. We analyzed 26 MPI applications and
highlighted a new characteristic called "send-determinism", which holds for
most of the applications studied. In a second step, we examined the communication
patterns of the applications to look for clusters of processes in
these patterns. The study showed that, for most applications, it is possible to
form clusters of processes that minimize both the size of the clusters and the volume of
inter-cluster messages.
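
The abstract does not spell out send-determinism, so the sketch below assumes the usual reading: for fixed inputs, every process emits the same sequence of sends in every correct execution, whatever the order in which it receives messages. The trace format (rank mapped to an ordered list of `(destination, tag, payload digest)` tuples) and the function name `is_send_deterministic` are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
from typing import Dict, List, Tuple

# Hypothetical trace format: one send event is (destination rank, tag, payload digest).
Send = Tuple[int, int, str]
# One execution: rank -> ordered list of the sends issued by that rank.
Trace = Dict[int, List[Send]]


def is_send_deterministic(traces: List[Trace]) -> bool:
    """Return True if every rank issues the same send sequence in every recorded run."""
    if not traces:
        return True
    reference = traces[0]
    for trace in traces[1:]:
        # The same set of ranks must appear in every run.
        if trace.keys() != reference.keys():
            return False
        # Each rank must emit exactly the same ordered sends.
        for rank, sends in reference.items():
            if trace[rank] != sends:
                return False
    return True


# Two runs in which rank 1 sends the same messages in the same order.
run_a = {0: [(1, 0, "x")], 1: [(0, 0, "y"), (0, 1, "z")]}
run_b = {0: [(1, 0, "x")], 1: [(0, 0, "y"), (0, 1, "z")]}
# A run in which rank 1 swaps the order of its two sends.
run_c = {0: [(1, 0, "x")], 1: [(0, 1, "z"), (0, 0, "y")]}

print(is_send_deterministic([run_a, run_b]))  # same send sequences
print(is_send_deterministic([run_a, run_c]))  # send order differs on rank 1
```

Comparing a finite set of recorded runs can only refute the property, not prove it; agreement across runs is evidence, not a guarantee.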

We then designed two fault tolerance protocols. The first is an uncoordinated
checkpointing protocol that relies on the send-determinism assumption and
avoids the domino effect while logging only a subset of the application messages. We
also adapted this protocol to clusters of processes. We then proposed HydEE,
a hierarchical protocol that also relies on the send-determinism assumption and
operates on clusters of processes. It combines a coordinated checkpointing
protocol inside clusters with a message logging protocol for inter-cluster
messages. Both protocols have been implemented in the MPICH2 library, and the
performance evaluation showed that both have a low impact on the
failure-free performance of applications.
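
To illustrate why the choice of clusters matters for a hierarchical scheme of this kind, the sketch below computes how many bytes would be logged if only inter-cluster messages were logged. It is a simplified model, not the HydEE implementation: the communication volumes, the cluster assignment, and the function name `logged_volume` are all hypothetical.

```python
from typing import Dict, Tuple


def logged_volume(
    volume: Dict[Tuple[int, int], int],
    cluster_of: Dict[int, int],
) -> Tuple[int, int]:
    """Return (bytes logged, total bytes) when only inter-cluster messages are logged.

    volume maps (sender rank, receiver rank) to bytes exchanged;
    cluster_of maps each rank to its cluster identifier.
    """
    logged = 0
    total = 0
    for (src, dst), nbytes in volume.items():
        total += nbytes
        # Messages that cross a cluster boundary are the ones that get logged.
        if cluster_of[src] != cluster_of[dst]:
            logged += nbytes
    return logged, total


# Hypothetical 4-process application with two clusters of two processes each.
volume = {(0, 1): 100, (1, 0): 100, (2, 3): 100, (1, 2): 10}
cluster_of = {0: 0, 1: 0, 2: 1, 3: 1}

logged, total = logged_volume(volume, cluster_of)
print(f"logged {logged} of {total} bytes")
```

In this toy model, smaller clusters limit how many processes restart together after a failure, while a lower inter-cluster volume limits how much must be logged; the two goals pull against each other, which is the trade-off the clustering study addresses.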
