In regular parallel problems, the data arrays are distributed across the processors following some mathematical law, e.g. cyclic(k), and the data access pattern is an affine function of the loop indices. This situation is typical of dense linear algebra, which mainly uses array sections, and of automatically parallelized code. Irregular parallel problems cover all other cases. In data-parallel programs, irregular parallel problems are typically expressed by vector-valued subscripts, such as A(L(i)), or by the INDIRECT mapping directive proposed in the HPF 2.0 Approved Extensions [12].
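For illustration, the following HPF fragment (our example; the array names, sizes, and distribution are arbitrary) places the two cases side by side:

```fortran
program access_patterns
  integer, parameter :: n = 1000
  real    :: a(n), b(n)
  integer :: l(n), i
!HPF$ DISTRIBUTE (CYCLIC(4)) :: a, b, l

  call random_number(b)
  l = (/ (mod(7 * i, n) + 1, i = 1, n) /)  ! some run-time permutation

  a(1:n:2) = b(2:n:2)   ! regular: affine subscripts, so the owner of
                        ! each element is known at compile time
  do i = 1, n
     a(l(i)) = b(i)     ! irregular: the owner of a(l(i)) depends on
  end do                ! values of l known only at run time
end program access_patterns
```

In the first assignment the compiler can schedule the communications statically; in the second, nothing about the communication pattern is known before l has been computed.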
Our framework is distributed-memory machines and message-passing communication. For regular parallel problems, the data structures required to access the distributed arrays and to optimize the communications can be worked out at compile time, or set up at run time at low cost [2, 7, 9, 10, 6]. For irregular parallel problems, one solution is to set up the analogous data structures at run time; these must now explicitly store all the global-to-local translation tables. Such data structures are called schedules, and the corresponding code generation is the so-called inspector/executor scheme [13, 5, 14]. Setting up schedules is costly: [14] reports that the time to generate a schedule for a 14026-element data array ranges from 0.4 to 1 second for 128 down to 16 processors on an iPSC/860, and that the time to build the translation tables ranges from 1 to 7 seconds. Hence the inspector/executor scheme targets irregular problems which exhibit spatial and temporal locality. Temporal locality allows off-processor data that have already been brought in to be reused (incremental schedules). Spatial locality allows duplicate off-processor references to be eliminated through hash tables. Real-world applications displaying a high level of spatial locality have been shown to benefit extensively from the inspector/executor scheme [13, 14]. Temporal and spatial locality can also be exploited through data-flow analysis [8], with the same purpose: eliminating redundant communications.
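To make the scheme concrete, here is a minimal sequential sketch of an inspector for the gather side of an assignment such as A(i) = B(L(i)); the names, the block distribution, and the full-size seen array (which a memory-scalable runtime would replace by a hash table) are our simplifications, not the actual schedule format of [13, 5, 14]:

```fortran
subroutine inspector(l, me, blk, nglobal, fetch, nfetch)
  ! Collect, once each, the global indices of B that processor me
  ! needs but does not own, assuming B is block-distributed with
  ! blk elements per processor.
  integer, intent(in)  :: l(:)       ! local slice of the indirection array
  integer, intent(in)  :: me         ! this processor's rank (0-based)
  integer, intent(in)  :: blk, nglobal
  integer, intent(out) :: fetch(size(l)), nfetch
  logical :: seen(nglobal)           ! dedup table; a hash table in practice
  integer :: i, g

  seen   = .false.
  nfetch = 0
  do i = 1, size(l)
     g = l(i)
     if ((g - 1) / blk /= me .and. .not. seen(g)) then
        seen(g) = .true.             ! spatial locality: fetch g only once
        nfetch  = nfetch + 1
        fetch(nfetch) = g            ! this list is the heart of the schedule
     end if
  end do
end subroutine inspector
```

The executor then exchanges exactly the fetch lists and performs the assignments locally; reusing the same schedule across outer iterations is what exploits temporal locality.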
Undoubtedly, if there is neither spatial nor temporal locality, any parallel implementation of an irregular problem will behave poorly. Nevertheless, minimizing the penalty deserves some effort, either to avoid canceling the speedup of other parts of the program, or because memory limitations make a parallel implementation mandatory. The straightforward method to compile irregular assignments (used, for instance, by the IBM XL-HPF compiler) is to make all information (data and indirection arrays) available to all processors, and to let each of them independently pick out the data it needs. In the following, we call this method the global method. Although robust, the global method does not really solve the problem, for two reasons. First, the memory requirement on each processor may be the size of the global array. Second, performance deteriorates as the available parallelism, i.e. the number of processors, increases, as we show in Sections 3 and 4.
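For the assignment A(L(i)) = B(i), the global method boils down to the following sketch (ours; it assumes L and B have already been replicated on every processor, e.g. by an all-gather, and that A is block-distributed):

```fortran
subroutine global_method(lhs_lo, lhs_hi, l, b, a_local)
  ! Every processor scans the whole iteration space and keeps only
  ! the assignments whose target falls in its own block of A.
  integer, intent(in)  :: lhs_lo, lhs_hi  ! bounds of this processor's block
  integer, intent(in)  :: l(:)            ! full-size copy of the indirection
  real,    intent(in)  :: b(:)            ! full-size copy: O(global) memory
  real,    intent(out) :: a_local(lhs_lo:lhs_hi)
  integer :: i

  do i = 1, size(l)                       ! O(n) work on every processor,
     if (l(i) >= lhs_lo .and. l(i) <= lhs_hi) then  ! whatever p is
        a_local(l(i)) = b(i)
     end if
  end do
end subroutine global_method
```

Both drawbacks are directly visible: l and b are full-size on every processor, and the scan over the whole iteration space does not shrink when processors are added.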
This paper describes a technique to optimize some irregular array assignments when no locality can be exploited. We show that in many cases, an irregular problem can be converted into a sorting problem. In routing theory this idea is well known [11]; however, to our knowledge it has not previously been applied in the context of parallel compilation.
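Section 2 presents the actual method; purely as a first intuition, the following hypothetical sketch shows the core conversion for A(L(i)) = B(i): sorting the (index, value) pairs by the owner of the destination index makes the values bound for each processor contiguous, turning the scattered references into one regular, block-wise exchange. The insertion sort and the block distribution are ours, for brevity; a real implementation would use a parallel sort.

```fortran
subroutine sort_by_owner(key, val, blk)
  ! Reorder the (L(i), B(i)) pairs so that pairs destined for the
  ! same processor (owner = (key - 1) / blk) become contiguous.
  integer, intent(inout) :: key(:)   ! destination indices L(i)
  real,    intent(inout) :: val(:)   ! values B(i)
  integer, intent(in)    :: blk      ! block size of A's distribution
  integer :: i, j, k
  real    :: v

  do i = 2, size(key)
     k = key(i); v = val(i)
     j = i - 1
     do while (j >= 1)                ! shift while owner is larger
        if ((key(j) - 1) / blk <= (k - 1) / blk) exit
        key(j+1) = key(j); val(j+1) = val(j)
        j = j - 1
     end do
     key(j+1) = k; val(j+1) = v
  end do
end subroutine sort_by_owner
```

After this reordering, each processor sends at most one contiguous message per destination, and the stores on the receiving side are purely local.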
Section 2 describes when and how the sorting method may apply. Section 3 discusses the theoretical advantages of the sorting method over the global method. Section 4 reports an experimental comparison, and demonstrates that the sorting method consistently outperforms the global method in the range of feasible machines. Conclusions are presented in Section 5.