ParSys Team Seminar
MAGMA and Batched Small Dense Matrix Computation on the GPU
Tingxing Dong

26 August 2014, 10:30 - 11:30
Room/Building: 465/PCRI-N
Contact :

Research activities: High-performance computing

Abstract:
Recent Progress of MAGMA (less than 10 min)

The MAGMA (Matrix Algebra on GPU and Multicore Architectures) project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures such as "Multicore+GPU" and "Multicore+MIC" systems.
MAGMA uses a hybrid methodology in which the algorithms of interest are split into tasks of varying granularity and their execution is scheduled over the available hardware components. Small non-parallelizable tasks, which often lie on the critical path, are scheduled on the CPU, and large parallelizable tasks are scheduled on accelerators. We discuss the recent features of MAGMA for CUDA 1.5, MAGMA MIC 1.2, and clMAGMA 1.1.
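As a rough illustration of the hybrid split described above, the following Python sketch lists the tasks of a blocked right-looking factorization and the device each one would be assigned to. The function name `hybrid_schedule`, the task labels "panel" and "update", and the device labels are hypothetical names chosen for this sketch; it only enumerates tasks and does not call MAGMA or perform any computation.

```python
def hybrid_schedule(n_blocks):
    """Enumerate (device, task, block_index) tuples for a blocked
    right-looking factorization with n_blocks block columns.

    Assumption for illustration: the small, mostly sequential panel
    factorization goes to the CPU, and the large, parallel trailing-matrix
    update goes to the accelerator.
    """
    tasks = []
    for k in range(n_blocks):
        # Panel factorization: small task on the critical path.
        tasks.append(("cpu", "panel", k))
        if k + 1 < n_blocks:
            # Trailing-matrix update: large, highly parallel task.
            tasks.append(("accelerator", "update", k))
    return tasks


if __name__ == "__main__":
    for device, task, k in hybrid_schedule(3):
        print(f"step {k}: {task} on {device}")
```

In a real hybrid library the two kinds of tasks would typically overlap in time (for example, the next panel being factored while the current update is still running); the sketch only shows the assignment, not the overlap.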

Batched Small Dense Matrix Computation on the GPU (20 min)

One-sided factorizations (Cholesky, LU and QR) are commonly used to solve
dense linear systems in scientific models. In a large number of
applications, a need arises to solve many small problems
instead of a few large linear systems. The size of each of these
small linear systems depends, for example, on the number of
ordinary differential equations (ODEs) used in the model,
and can be on the order of hundreds of unknowns. To efficiently
exploit the computing power of modern accelerator hardware,
these linear systems are processed in batches. To improve the
numerical stability of Gaussian elimination (LU), at least partial
pivoting is required, most often accomplished with row pivoting.
However, row pivoting can result in a severe performance penalty
on GPUs because it introduces thread divergence and non-coalesced
memory accesses. In this work, we propose a batched LU
factorization for GPUs that uses a multi-level blocked
right-looking algorithm, which preserves the data layout but minimizes
the penalty of partial pivoting. We extend this algorithm to Cholesky and QR.
Our batched LU achieves up to a 2.5-fold speedup compared to the alternative CUBLAS
solution on a K40c GPU. Our batched Cholesky and batched QR achieve a 1.8-fold speedup
compared to the optimized parallel implementation in the MKL
library on two sockets of Intel Sandy Bridge CPUs.
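To make the operation concrete, here is a minimal pure-Python reference sketch of what a batched LU factorization with partial (row) pivoting computes: each small matrix A in the batch is factored independently so that P*A = L*U. This is a plain unblocked CPU loop, not the multi-level blocked GPU algorithm presented in the talk, and the function names `lu_partial_pivot` and `batched_lu` are hypothetical.

```python
def lu_partial_pivot(a):
    """Factor one square matrix (list of row lists) in place so that
    P*A = L*U. L (unit lower triangular) is stored below the diagonal,
    U on and above it. Returns piv, where piv[i] is the index of the
    original row that ended up in row i.
    """
    n = len(a)
    piv = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the row with the largest magnitude in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            # Row swap: this is the step that causes thread divergence and
            # non-coalesced memory accesses on a GPU.
            a[k], a[p] = a[p], a[k]
            piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]                    # multiplier, an entry of L
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]      # right-looking trailing update
    return piv


def batched_lu(batch):
    """Factor every matrix in the batch independently; returns one pivot list
    per matrix. On a GPU the matrices would be processed concurrently."""
    return [lu_partial_pivot(a) for a in batch]


if __name__ == "__main__":
    batch = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [6.0, 3.0]]]
    print(batched_lu(batch))
    print(batch)
```

Because the problems in a batch are independent, the outer loop over matrices is where the parallelism comes from; the sketch runs it sequentially only for clarity.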
