Ph.D
Group : Learning and Optimization
Multi-Objective Sequential Decision Making
Started on 23/09/2010
Advisor : SEBAG, Michèle
[SCHOENAUER Marc]
Funding :
Affiliation : Université Paris-Saclay
Laboratory : LRI
Defended on 11/07/2014, committee :
Thesis Supervisor :
- Michèle Sebag, DR CNRS, Université Paris-Sud, LRI/TAO
Examiners :
- Yann Chevaleyre, Professor, Université Paris 13, LIPN
- Cécile Germain-Renaud, Professor, Université Paris-Sud, LRI/TAO
- Dominique Gouyou-Beauchamps, Professor, Université Paris-Sud, LRI
Reviewers :
- Jin-Kao Hao, Professor, Université d'Angers, LERIA
- Philippe Preux, Professor, Université de Lille 3, INRIA Lille
Research activities :
Abstract :
This thesis is concerned with multi-objective sequential decision making (MOSDM).
The motivation is twofold. On the one hand, many decision problems in domains such as robotics, scheduling, or games involve the optimization of sequences of decisions. On the other hand, many real-world applications are most naturally formulated in terms of multi-objective optimization (MOO).
The proposed approach extends the well-known Monte-Carlo tree search (MCTS) framework to the MOO setting, with the goal of discovering several optimal sequences of decisions by growing a single search tree. The main challenge is to design a new reward able to guide the exploration of the tree, even though the MOO setting does not enforce a total order among solutions.
The main contribution of the thesis is to propose and experimentally study two such rewards, inspired by the MOO literature, which assess a solution with respect to the archive of previous solutions (the Pareto archive): the hypervolume indicator reward and the Pareto dominance reward.
The study shows the complementarity of these two criteria. The hypervolume indicator suffers from its known computational complexity; however, the proposed extension thereof provides fine-grained information about the quality of solutions with respect to the current archive. By contrast, the Pareto-dominance reward can be computed in linear time, but the information it provides becomes increasingly rare as the search progresses.
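As an illustration of these archive-based rewards, the following Python sketch gives simplified two-objective versions of the two criteria. The function names, the binary form of the dominance reward, and the hypervolume-contribution formulation are illustrative assumptions, not the exact definitions used in the thesis.

# Illustrative sketch only (simplified; not the thesis's exact reward definitions).
# Objective vectors are maximized; 'archive' denotes the current Pareto archive.

def dominates(a, b):
    # True if a Pareto-dominates b: no worse on every objective, strictly better on one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_dominance_reward(candidate, archive):
    # Binary reward: 1 if no archive point dominates the candidate, 0 otherwise.
    # Cost is linear in the archive size, but non-dominated points become rarer
    # as the archive converges toward the Pareto front.
    return 0.0 if any(dominates(p, candidate) for p in archive) else 1.0

def hypervolume_2d(points, ref):
    # Area dominated by a set of 2-objective points w.r.t. a reference point 'ref'.
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in pts:                       # sweep from largest to smallest x
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

def hypervolume_reward(candidate, archive, ref):
    # Reward a candidate by its hypervolume contribution to the current archive.
    return hypervolume_2d(list(archive) + [candidate], ref) - hypervolume_2d(archive, ref)

# Example: archive {(3, 1), (1, 3)} and a candidate (2, 2) filling the gap between them.
archive = [(3.0, 1.0), (1.0, 3.0)]
print(pareto_dominance_reward((2.0, 2.0), archive))         # 1.0 (non-dominated)
print(hypervolume_reward((2.0, 2.0), archive, (0.0, 0.0)))  # 1.0 (added area)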
Proofs of principle are given on artificial problems and challenges, and confirm the merits of the approach. In particular, MOMCTS is able to discover policies lying in non-convex regions of the Pareto front, in contrast with the state of the art: existing Multi-Objective Reinforcement Learning algorithms are based on linear scalarization and thus fail to sample such non-convex regions.
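To see why linear scalarization cannot reach such points, consider the small Python check below; the three objective vectors are hypothetical and chosen only to exhibit a non-convex (concave) region of the front.

# Hypothetical two-objective example (maximization). A and B are the extreme
# Pareto-optimal points; C is also Pareto-optimal but lies in a concave region.
A, B, C = (0.0, 1.0), (1.0, 0.0), (0.4, 0.4)

def scalarized_best(points, w):
    # Point maximizing the weighted sum w[0]*f1 + w[1]*f2.
    return max(points, key=lambda p: w[0] * p[0] + w[1] * p[1])

# Whatever the weight vector, the linear-scalarization optimum is A or B,
# never C: w.C = 0.4, whereas max(w.A, w.B) = max(w[1], w[0]) >= 0.5.
for i in range(101):
    w = (i / 100.0, 1.0 - i / 100.0)
    assert scalarized_best([A, B, C], w) in (A, B)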
Finally, MOMCTS competes honorably with the state of the art on the 2013 MOPTSP competition.