Ph.D.
Group: Parallel Systems
Automatic code generation methods applied to numerical linear algebra in high-performance computing
Started on 01/10/2013
Advisor: BABOULIN, Marc
Funding: Doctoral contract (research only)
Affiliation: Université Paris-Saclay
Laboratory: LRI PARSYS
Defended on 26/09/2016, committee:
Thesis advisor:
- Marc Baboulin, Professor, Univ. Paris-Sud, Orsay
Thesis co-advisor:
- Joël Falcou, Associate Professor, Univ. Paris-Sud
Reviewers:
- Paolo Bientinesi, Professor, Aachen University, Aachen, Germany
- David Hill, Professor, Univ. Blaise Pascal, Clermont-Ferrand
Examiners:
- Frédéric Magoulès, Professor, Ecole Centrale Paris
- Emmanuel Chailloux, Professor, Université Pierre et Marie Curie
Abstract:
Parallelism is ubiquitous in today's computer architectures, whether in supercomputers, workstations, or portable devices such as smartphones. Efficiently exploiting these systems for a specific application requires a multidisciplinary effort spanning Domain-Specific Languages (DSLs), code generation and optimization techniques, and application-specific numerical algorithms.
In this PhD thesis, we present a high-level programming method that takes into account the features of heterogeneous architectures and the properties of matrices to build a generic dense linear algebra solver. As GPUs have become an asset in high-performance computing, incorporating their use in general solvers is an important issue.
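As an illustration of this idea only (a minimal sketch with hypothetical names, not the solver's actual API), compile-time tag dispatch on a matrix property can select the appropriate dense factorization without any runtime branching:

```cpp
// Hypothetical sketch: dispatch a dense solve on a matrix-property tag.
#include <iostream>

struct general_   {};  // no particular structure -> LU factorization
struct symmetric_ {};  // symmetric positive definite -> Cholesky

template <typename Shape>
struct matrix { /* storage elided for brevity */ };

void solve(matrix<general_> const&)   { std::cout << "LU solve\n"; }
void solve(matrix<symmetric_> const&) { std::cout << "Cholesky solve\n"; }

int main() {
  solve(matrix<general_>{});    // prints "LU solve"
  solve(matrix<symmetric_>{});  // prints "Cholesky solve"
}
```

Because the property is a type, the choice of algorithm is resolved at compile time, which is what lets a generic solver stay both high-level and efficient.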
We extend our approach to a new multistage programming model that alleviates the interoperability problems between the CPU and GPU programming models. Our multistage approach is used to automatically generate GPU code for CPU-based element-wise expressions and parallel skeletons while allowing for type-safe program generation.
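A minimal sketch of the multistage idea, with hypothetical names: a first stage running on the CPU emits CUDA source text for an element-wise expression, and a second compilation stage would then build that kernel for the GPU:

```cpp
// Hypothetical sketch: stage one lowers an element-wise operation,
// captured as a C++ type, into CUDA kernel source text.
#include <iostream>
#include <string>

struct add_ { static std::string op() { return "+"; } };

template <typename Op>
std::string make_kernel(std::string const& name) {
  return "__global__ void " + name +
         "(float* r, float const* a, float const* b, int n) {\n"
         "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
         "  if (i < n) r[i] = a[i] " + Op::op() + " b[i];\n"
         "}\n";
}

int main() {
  // Stage one: emit the GPU kernel for r = a + b.
  std::cout << make_kernel<add_>("vadd");
}
```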
Finally, we investigate how to apply high-level programming techniques to batched computations and tensor contractions. We first explain how to design a simple data container using modern C++14 programming techniques. Then, we study the issues around batched computations, memory locality, and code vectorization to implement a highly optimized matrix-matrix product for small sizes using SIMD instructions.
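A portable sketch of the memory-locality aspect (hypothetical names; the thesis' kernels use explicit SIMD instructions, which this sketch leaves to the compiler): ordering the loops of a small fixed-size matrix product so that the innermost loop is unit-stride makes it straightforward to vectorize:

```cpp
// Hypothetical sketch: small row-major matrix product in i-k-j order,
// so the innermost j-loop reads and writes contiguous rows of b and c.
#include <array>
#include <cstddef>

template <std::size_t N>
using mat = std::array<float, N * N>;  // row-major storage

template <std::size_t N>
void gemm_small(mat<N> const& a, mat<N> const& b, mat<N>& c) {
  c.fill(0.0f);
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t k = 0; k < N; ++k) {
      float aik = a[i * N + k];
      for (std::size_t j = 0; j < N; ++j)  // contiguous, vectorizable
        c[i * N + j] += aik * b[k * N + j];
    }
}
```

The i-k-j order trades the usual dot-product formulation for unit-stride accesses, the pattern that SIMD instructions exploit on small, cache-resident matrices.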