Machine type:                   RISC-based distributed-memory multi-processor
Models:                         IBM BlueGene/L
Operating system:               Linux
Connection structure:           3-D torus, tree network
Compilers:                      XL Fortran (Fortran 90), XL C, C++
Vendor's information Web page:  www-1.ibm.com/servers/deepcomputing/
Year of introduction:           2004
System parameters:
Model:                                                   BlueGene/L
Clock cycle:                                             700 MHz
Theor. peak performance, per proc. (64-bit):             2.8 Gflop/s
Theor. peak performance, maximal:                        367/183.5 Tflop/s
Main memory, per compute card:                           <= 512 MB
Main memory, maximal:                                    <= 16 TB
No. of processors:                                       2×65,536
Communication bandwidth, point-to-point (3-D torus):     175 MB/s
Communication bandwidth, point-to-point (tree network):  350 MB/s
Remarks:
The BlueGene/L is the first in a new generation of systems made
by IBM for very massively parallel computing. The individual speed
of the processor has therefore been traded in favour of very dense
packaging and a low power consumption per processor. The basic
processor in the system is a modified PowerPC 440 running at 700 MHz.
Two of these processors reside on a chip together with 4 MB of shared
L3 cache and a 2 KB L2 cache for each of the processors. Each processor
has two load ports and one store port from/to the L2 cache at 8
bytes/cycle. This is half of the bandwidth required to keep the two
floating-point units (FPUs) busy, and as such quite high. Each CPU also
has 32 KB of instruction cache and 32 KB of data cache on board. In
favourable circumstances a CPU can deliver a peak speed of 2.8 Gflop/s
because each of the two FPUs can perform fused multiply-add operations.
Note that the L2 cache is smaller than the L1 cache, which is quite
unusual but allows it to be fast.
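The per-CPU peak quoted above follows directly from the figures in the text; a minimal arithmetic sketch (the 700 MHz clock, the two FPUs, and a fused multiply-add counting as two floating-point operations are all taken from the description above):

```python
# Peak floating-point rate of one BlueGene/L CPU, from the figures above.
clock_hz = 700e6       # 700 MHz clock cycle
fpus = 2               # two floating-point units per CPU
flops_per_fma = 2      # a fused multiply-add counts as two flops per cycle

peak_flops = clock_hz * fpus * flops_per_fma
print(peak_flops / 1e9)  # -> 2.8 (Gflop/s)
```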
The packaging in the system is as follows: two chips fit on a
compute card with 512 MB of memory. Sixteen of these compute cards
are placed on a node board, and 32 node boards in turn go into one
cabinet. So, one cabinet contains 1,024 chips, i.e., 2,048 CPUs. For
a maximal configuration, 64 cabinets are coupled to form one system
with 65,536 chips, i.e., 131,072 CPUs. In normal operation mode one
of the CPUs on a chip is used for computation while the other takes
care of communication tasks. In this mode the theoretical peak
performance of the system is 183.5 Tflop/s. When the communication
requirements are very low, however, it is possible to use both CPUs
for computation, doubling the peak speed; hence the double entries
in the System parameters table above. The resulting 367 Tflop/s is
also the speed that IBM uses in its marketing material.
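The packaging hierarchy and the two peak-performance figures can be checked with the same numbers; a small sketch using only quantities stated above:

```python
# Packaging hierarchy of a maximal BlueGene/L, from the figures above.
chips_per_card = 2        # two chips per compute card
cards_per_board = 16      # sixteen compute cards per node board
boards_per_cabinet = 32   # 32 node boards per cabinet
cabinets = 64             # maximal configuration

chips = chips_per_card * cards_per_board * boards_per_cabinet * cabinets
cpus = 2 * chips          # two CPUs per chip
peak_per_cpu_gflops = 2.8

print(chips)              # -> 65536 chips
print(cpus)               # -> 131072 CPUs
# Both CPUs computing: ~367 Tflop/s; one CPU per chip: ~183.5 Tflop/s.
print(cpus * peak_per_cpu_gflops / 1000)
print(cpus // 2 * peak_per_cpu_gflops / 1000)
```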
The BlueGene/L possesses no fewer than five networks, two of which
are of interest for inter-processor communication: a 3-D torus
network and a tree network. The torus network is used for most
general communication patterns. The tree network is used for
frequently occurring collective communication patterns, like
broadcasts and reduction operations. The hardware bandwidth of the
tree network is twice that of the torus: 350 MB/s against 175 MB/s
per link.

At the time of writing this report no fully configured system exists
yet. One
of writing this report no fully configured system exists yet. One
such system should be delivered to Lawrence Livermore Lab by the end
of this year. A smaller system of around 34 Tflop/s peak will be
delivered to ASTRON, an astronomical research organisation in the
Netherlands, for the synthesis of radio-astronomical images.
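As an aside on the torus network described above, the average routing distance in a 3-D torus is easy to estimate: on a ring of even size k the average hop count to a random destination is k/4, and the three dimensions add up. The sketch below assumes a 64×32×32 node layout; these dimensions are an illustration only, as the text does not state them.

```python
# Average point-to-point hop count in a 3-D torus.
# The 64x32x32 layout is an assumed illustration, not a figure from the text.

def avg_ring_distance(k):
    # Average shortest-path distance to a uniformly random node on a k-ring.
    return sum(min(d, k - d) for d in range(k)) / k

dims = (64, 32, 32)
avg_hops = sum(avg_ring_distance(k) for k in dims)
print(avg_hops)  # -> 32.0 (16 + 8 + 8 hops on average)
```

Even in so large a machine a message thus traverses only a few dozen links on average, which is what makes the torus practical for general communication.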
Measured Performances: IBM recently reported a speed of 36.01
Tflop/s on the HPC Linpack benchmark. Neither the order of the
linear system nor the size of the BlueGene system was disclosed in
the press release.
Aad van der Steen Mon Oct 11 15:01:44 CEST 2004