Performance Evaluation of Clusters of Multiprocessors (CLUMPs)

CLUMPs and Programming Models

Parallel machines are currently built either as clusters of workstations using a standard high-speed network (Myrinet, Gigabit Ethernet) or as high-end supercomputers using a proprietary interconnection network. The current trend is towards clusters (or clusters of clusters) of multiprocessor nodes. Clusters of PCs use 2-way or 4-way nodes, and distributed memory parallel machines (IBM SP, Compaq SC cluster, etc.) use wider SMP nodes. In future SP4 machines, the node will be implemented as a cluster of smaller multiprocessors. These hierarchical architectures, which will become widely used in the coming years, raise difficult programming issues because they mix two different memory models: shared memory inside the nodes and distributed memory (message passing) between the nodes. Since many parallel programs have been developed according to a simple programming approach such as message passing (MPI) or shared memory (OpenMP), the best approach for hierarchical architectures is not clear. Two options have to be investigated in terms of performance and programming effort: a unified programming model (either "message passing" or "shared memory") and a hybrid programming model mixing the message passing and shared memory approaches.

Unified or mixed programming models

In 1998 and 1999, I worked with Dr. F. Cappello at LRI, University of Paris Sud in Orsay, France, on the comparison of unified and hybrid programming models for clusters of SMPs. We focused on existing MPI programs, comparing the performance and programming effort of a unified model (unified MPI) and one hybrid model (MPI+OpenMP). The hybrid model combines message passing between nodes with OpenMP fine-grain parallelization of loop nests inside each node, using manual optimizations and profiling to choose the loop nests to parallelize. The experiments were done on Myrinet clusters of 2-way and 4-way PCs and on two different IBM SP3 systems. The experiments with the NAS benchmarks showed that the unified MPI model is generally better than the hybrid one; the explanations were presented in two papers at the HPCA-6 and SC2000 conferences [CAP00a, CAP00b].
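The fine-grain structure described above can be sketched as follows. This is only an illustration of the fork/join pattern, written in Python with a thread pool rather than with OpenMP directives: the stencil loop nest, the function names relax_rows and relax_fine_grain, and the cyclic row assignment are assumptions made for the example and are not taken from the NAS benchmarks or from the papers. In the actual hybrid model, the parallelized loop nest would carry an OpenMP parallel loop directive and the MPI communication between nodes would surround it.

```python
from concurrent.futures import ThreadPoolExecutor


def relax_rows(grid, rows):
    """Apply a 4-point average to the interior points of the given rows.

    Hypothetical loop nest standing in for one profiled hot spot.
    Boundary columns are copied unchanged.
    """
    ncols = len(grid[0])
    out = {}
    for i in rows:
        row = grid[i][:]
        for j in range(1, ncols - 1):
            row[j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                             + grid[i][j - 1] + grid[i][j + 1])
        out[i] = row
    return out


def relax_fine_grain(grid, nthreads):
    """One relaxation step with fork/join parallelism around a single loop nest.

    Threads exist only for the duration of this loop nest; everything
    outside it (including any message passing between nodes, omitted
    here) would run sequentially within the process.
    """
    interior = list(range(1, len(grid) - 1))
    # Cyclic assignment of interior rows to threads (an arbitrary choice).
    chunks = [interior[k::nthreads] for k in range(nthreads)]
    new_grid = [row[:] for row in grid]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:  # fork
        for part in pool.map(lambda rows: relax_rows(grid, rows), chunks):
            for i, row in part.items():
                new_grid[i] = row
    # Leaving the with-block joins the threads.
    return new_grid
```

Because every loop nest pays the fork/join cost and the code between parallelized loops stays sequential, the benefit of this approach depends on how much of the run time the selected loop nests cover. Python threads do not provide real speedup for this kind of computation; the sketch only shows where the parallel region begins and ends.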

Fine-grain and Coarse-grain OpenMP

I will carry out performance comparisons with other hybrid programming models and other hardware platforms. One interesting model is SPMD OpenMP, a coarse-grain parallelization style borrowed from distributed memory computers. In this approach, OpenMP is still used to take advantage of the shared memory inside the SMP nodes, but an SPMD programming style is used instead of the traditional shared memory multithreaded approach. OpenMP spawns N threads in the main program, and each thread acts similarly to an MPI process. The OpenMP parallel directive is used at the outermost level of the program: the threads are spawned just after the MPI processes are spawned (some initializations may separate the two spawns). As in the message passing SPMD approach, the programmer must take care of several issues: array distribution among threads, work distribution among threads, and coordination between threads.
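The three issues listed above (array distribution, work distribution, coordination) can be made concrete with a small sketch. It is written in Python with the standard threading module purely to show the structure; it is not OpenMP code, and the names block_range and spmd_sum_of_squares, the block distribution, and the sum-of-squares workload are assumptions chosen for the example. Each thread is started once at the outermost level, receives a rank, and works only on its own block, much as an MPI process would.

```python
import threading


def block_range(n, nparts, rank):
    """Return the half-open index range [lo, hi) owned by `rank`.

    Block distribution: the first (n % nparts) ranks get one extra element.
    """
    base, extra = divmod(n, nparts)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def spmd_sum_of_squares(data, nthreads):
    """Sum of squares computed in SPMD style by `nthreads` threads."""
    partial = [0] * nthreads              # one private slot per thread
    result = [None]
    barrier = threading.Barrier(nthreads)

    def worker(rank):
        # Array and work distribution: each thread touches only its block.
        lo, hi = block_range(len(data), nthreads, rank)
        partial[rank] = sum(x * x for x in data[lo:hi])
        # Coordination: wait until every thread has written its partial sum.
        barrier.wait()
        if rank == 0:
            result[0] = sum(partial)

    # Threads are spawned once, at the outermost level of the program.
    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

For example, block_range(10, 3, 1) gives the range (4, 7), and spmd_sum_of_squares(list(range(10)), 4) combines four partial sums into 285. In an OpenMP version the rank would come from the thread number inside a single outermost parallel region, and the barrier would be an OpenMP barrier.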

The experiments are under way and will continue on the local cluster of PCs (Computer Group, University of Toronto) and on distributed memory machines (IBM SPs or Compaq SC). This work will be extended to NUMA machines.

A general performance model

The final objective is to derive a general performance model that helps in choosing between unified and hybrid models according to the characteristics of the architecture and the application. Clusters, and clusters of clusters of nodes, will lead to a wider range of parallel architectures, so it is important to help users choose the best programming model for the hardware and application features at hand. Showing that a unified model, which directly reuses the "dusty" parallel codes, gives the best performance or comes close to it would be a significant result. Otherwise, we can derive programming guidelines for using the best programming model with minimum programming effort, which would also be very useful for all users of high-end applications.

References:

[CAP00a] F. Cappello, O. Richard and D. Etiemble, Investigating the performance of two programming models for clusters of SMP PCs, in Proceedings of HPCA-6, January 2000, Toulouse, France.

[CAP00b] F. Cappello and D. Etiemble, MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks, in Proceedings of the High Performance Networking and Computing Conference (SC2000), November 2000, Dallas, Texas.