Ph.D
Group : Parallel Systems
Lattice QCD Optimization and Polytopic Representations of Distributed Memory
Starts on 13/09/2010
Advisor : EISENBEIS, Christine
Funding : contrat doctoral INRIA
Affiliation : Université Paris-Saclay
Laboratory : INRIA Alchemy
Defended on 26/09/2014, committee :
Directrice de thèse :
- Christine Eisenbeis, INRIA/LRI/U-PSud
Rapporteurs :
- Philippe Clauss, Université de Strasbourg ;
- J. Ramanujam, Louisiana State University.
Examinateurs :
- Sid Touati, Université Nice Sophia Antipolis ;
- Paul Feautrier, ENS Lyon ;
- Sylvain Conchon, Université Paris-Sud XI ;
- Olivier Pène, CNRS/LPT Orsay.
Research activities :
Abstract :
Motivated by modern day physics which in addition to experiments also
tries to verify and deduce laws of nature by simulating the
state-of-the-art physical models using oversized computers, this
thesis explores means of accelerating such simulations by improving
the simulation programs they run. The primary focus is Lattice
Quantum Chromodynamics (QCD), a branch of quantum field theory,
running on IBM newest supercomputer, the Blue Gene/Q.
In a first approach, the source code of tmLQCD, a Lattice QCD program,
is improved to run faster on the Blue Gene machine. Its most
performance-relevant operation is a 8-point stencil in 4 dimensional
space. Two different optimization strategies are perused: One with
the priority of improving spatial and temporal locality, and a second
making use of the hardware's data stream prefetcher. On Blue Gene/Q
the first strategy reaches up to 20% of the peak theoretical floating
point operation performance of that machine. The second strategy with
up to 54% of peak is much faster at the cost of using 4 times more
memory by storing the data in the order they will be used in the next
stencil operation, duplicating data where necessary.
Other techniques exploited are direct programming of the messaging
hardware (called MUSPI by IBM), a low-overhead work distribution
mechanism for threads, explicit data prefetching of data (using dcbt
instruction) and manual vectorization (using QPX; width-4 SIMD
instructions). Hardware-based list prefetching and transactional
memory - both distinct and novel features of the Blue Gene/Q system --
did not improve the program's performance.
The second approach is the newly-written LLVM compiler extension
called Molly which optimizes the program itself, specifically the
distribution of data and work between the nodes of a cluster machine
such as Blue Gene/Q. Molly represents arrays using integer polyhedra
and uses another already existing compiler extension Polly which
represents statements and loops using polyhedra. When Molly knows how
data is distributed among the nodes and where statements are executed,
it adds code that manages the data flow between the nodes. Molly can
also permute the order of data in memory.
Molly's main task is to cluster data into sets that are sent to the
same target into the same buffer because single transfers involve a
massive overhead. We present an algorithm that minimizes the number
of transfers for unparametrized loops using anti-chains of data flows.
In addition, we implement a heuristic that takes into account how the
programmer wrote the code. Asynchronous communication primitives are
inserted right after the data is available respectively just before it
is used. A runtime library implements these primitives using MPI.
Molly manages to distribute any code that is representable by the
polyhedral model, but does so best for stencils codes such as Lattice
QCD. Compiled using Molly, the Lattice QCD stencil reaches 2.5% of
the theoretical peak performance. The performance gap is mostly
because all the other optimizations are missing, such as
vectorization. Future versions of Molly may also effectively handle
non-stencil codes and use make use of all the optimizations that make
the manually optimized Lattice QCD stencil so fast.