High-performance computing (HPC) users face data sets of ever-increasing size and complexity. They continually push the limits of the
operating system by requiring larger numbers of CPUs, higher I/O
bandwidth, and faster and more efficient parallel programming
support. These requirements are changing the way OEMs design
supercomputer systems – creating more efficient, cost-effective
systems that must increasingly leverage embedded capabilities for
specialized breakthrough functionality.
SGI meets
these demands with technical compute servers and supercomputers
based on its SGI NUMAflex cache-coherent non-uniform memory access (ccNUMA) architecture. A key enabler of NUMAflex is SGI NUMAlink interconnect technology, an embedded interconnect that
dramatically reduces the time and resources required to run
technical applications by managing extremely large data sets in a
single, system-wide, shared-memory space called global shared
memory.
In January of this year, SGI introduced SGI Altix
3000, a supercluster based on the unique NUMAflex system
architecture used in SGI Origin 3000 systems. While Origin systems
are based on the MIPS microprocessor and the IRIX operating system,
the SGI Altix 3000 family combines NUMAflex with industry-standard
components, such as the Intel Itanium 2 microprocessor and the Linux
operating system.
With this embedded system interconnect,
the Altix 3000 family enables new capabilities and time-to-solution
breakthroughs that neither traditional Linux OS-based clusters nor
competitive SMP architectures can tackle. The NUMAflex architecture
allows the system to scale application performance to up to 512
processors, all working together in a cache-coherent
manner.
Global shared memory means that a single memory
address space is visible to all system resources, including
microprocessors and I/O, across all nodes. Systems with global
shared memory allow access to all data in the system’s memory
directly and extremely quickly, without having to go through I/O or
networking bottlenecks. Systems with multiple nodes without global
shared memory instead must pass copies of data, often in the form of
messages, which can greatly complicate programming and slow down
performance by significantly increasing the time processors must
wait for data. Global shared memory requires a sophisticated system
memory interconnect like NUMAlink.
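To make the programming-model difference concrete, here is a minimal sketch (not taken from SGI documentation) of how shared-memory code touches one large data set: every thread simply loads and stores the same array, with no explicit messages. OpenMP is used purely for illustration; on a cluster without global shared memory, the equivalent sum would require partitioning the array and exchanging partial results between hosts.

```c
/* Minimal shared-memory sketch: all threads work directly on one array
 * in a single address space.  Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    long n = 1L << 28;                     /* one large array, one address space */
    double *a = malloc((size_t)n * sizeof *a);
    if (!a) return 1;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {         /* every thread loads/stores the shared array */
        a[i] = (double)i;
        sum += a[i];
    }
    printf("threads: %d, sum = %g\n", omp_get_max_threads(), sum);
    free(a);
    return 0;
}
```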
NUMAflex architecture
NUMAflex uses an SGI NUMA protocol
implemented directly in hardware for performance and a modular
packaging scheme. NUMAflex gets its name from the flexibility it has
to scale independently in the three dimensions of processor count,
memory capacity, and I/O capacity.
The key to the NUMAflex
design of the Altix 3000 system is a controller ASIC, referred to as
the SHUB, that interfaces to the Itanium 2 front-side bus, the memory DIMMs, the I/O subsystem, and other NUMAflex components in the
system. Altix 3000 is built from a number of component modules, or
bricks, most of which are shared with Origin 3000. The C-brick, or compute brick, is the module that customizes the system to a given
processor architecture. The Altix C-brick consists of four
processors connected to two SHUBs and up to 32 Gbytes of memory
implemented on two equal “node” boards in a 3U brick. Figure 1 shows a schematic of the Altix 3000 C-brick.
The remaining
components of the system are:
- The R-brick, an eight-port NUMAlink 3 router brick, used to build the interconnect fabric between the C-bricks
- The M-bricks, memory bricks, for
independent memory scaling on the same embedded interconnect
fabric
- The IX-brick, the base I/O brick, and the
PX-brick, a PCI-X expansion brick, which attach to the C-brick via
the I/O channel
- The D-brick2, a second-generation JBOD
brick
SGI supplies a variety of networking, Fibre
Channel SAN, RAID, and offline storage products to complete the
Altix 3000 offering.
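As a quick sanity check on how these bricks compose a large system, the following back-of-the-envelope sketch uses only the figures quoted above (four processors and up to 32 Gbytes of memory per C-brick); the arithmetic, not the code, is the point.

```c
/* Back-of-the-envelope composition of a large Altix 3000 system from the
 * per-brick figures quoted above (4 CPUs, up to 32 Gbytes per C-brick). */
#include <stdio.h>

int main(void)
{
    const int cpus            = 512;   /* maximum cache-coherent configuration */
    const int cpus_per_cbrick = 4;
    const int gb_per_cbrick   = 32;    /* upper bound quoted for a C-brick */

    int cbricks = cpus / cpus_per_cbrick;
    printf("C-bricks needed:      %d\n", cbricks);                 /* 128 */
    printf("Memory via C-bricks:  %d Gbytes (%.1f Tbytes)\n",
           cbricks * gb_per_cbrick, cbricks * gb_per_cbrick / 1024.0);
    /* M-bricks can raise the memory total further, independently of CPUs. */
    return 0;
}
```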
[Figure 1: Schematic of the Altix 3000 C-brick]
Cache coherency
The cache-coherency protocol on the
Altix 3000 family is implemented in the SHUB ASIC, which interfaces
to both the snooping operations of the Itanium 2 processor and the
directory-based scheme used across the NUMAlink interconnect fabric.
If the contents of a neighboring cache can satisfy a processor’s request, data will flow directly from one processor’s cache to the other without the extra latency of sending the request to
memory. The directory-based cache-coherence scheme used in Altix requires the system to inform only the processors currently sharing a given cache line about an operation on that line. This reduces the coherence traffic that must flow around the system, resulting in lower system overhead, reduced latency, and higher delivered bandwidth for actual data operations.
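The following toy model (a sketch of the general directory idea, not the SHUB implementation) shows why directory-based coherence cuts traffic: each line’s directory entry records its current sharers, so a write triggers invalidations only to those nodes rather than a system-wide broadcast.

```c
/* Schematic model of a directory entry for one cache line.  Illustrative
 * only -- not the SHUB design. */
#include <stdint.h>
#include <stdio.h>

#define MAX_NODES 64

struct dir_entry {
    uint64_t sharers;   /* bit i set => node i currently caches this line */
    int      owner;     /* node holding the line exclusively, or -1 */
};

/* Node 'writer' requests exclusive access to the line.  Only the recorded
 * sharers receive invalidations; the rest of the machine is untouched. */
static int acquire_exclusive(struct dir_entry *e, int writer)
{
    int invalidations = 0;
    for (int node = 0; node < MAX_NODES; node++) {
        if (node != writer && (e->sharers & (1ULL << node))) {
            e->sharers &= ~(1ULL << node);   /* send invalidate to this sharer */
            invalidations++;
        }
    }
    e->sharers |= 1ULL << writer;
    e->owner    = writer;
    return invalidations;                    /* messages actually sent */
}

int main(void)
{
    struct dir_entry line = { .sharers = (1ULL << 3) | (1ULL << 17), .owner = -1 };
    printf("invalidates sent: %d (only the two current sharers)\n",
           acquire_exclusive(&line, 5));
    return 0;
}
```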
Interconnection network
Altix 3000 makes use of a new advance
in NUMAflex technology – the NUMAlink 4 communications channel.
Developers have employed prior generations of NUMAlink in SGI
scalable systems. NUMAlink 4 provides double the bandwidth of
NUMAlink 3 while maintaining compatibility with NUMAlink 3 physical
connections. NUMAlink 4 is able to achieve this performance boost by
employing advanced bidirectional signaling
technology.
Engineers configured the NUMAflex network for
Altix in a fat-tree topology. Figure 2 shows this topology for a
512-processor configuration. The circles in the figure represent
R-bricks, the lines represent NUMAlink cables, and the 128 small
squares across the center of the diagram represent
C-bricks.
The fat-tree topology enables system
performance to scale well by providing a linear increase in
bisection bandwidth as the systems increase in size. Altix 3000 also
provides dual-plane or parallel networks for increased bisection
bandwidth. This dual-plane configuration is made possible by the two NUMAlink ports on the Altix C-brick. Because Altix has initially been deployed using the NUMAlink 3 router brick, the system will be able to double the bisection bandwidth
when the NUMAlink 4 router brick becomes available, allowing the
systems’ capabilities to grow along with the demands of new
generations of Itanium 2 family microprocessors.
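A rough way to see the scaling claim is the proportionality below; it is only a relative model, since this section does not quote absolute NUMAlink link rates.

```c
/* Relative bisection-bandwidth model for the dual-plane fat tree: a full
 * fat tree scales bisection bandwidth linearly with the number of
 * C-bricks, and the second NUMAlink plane doubles it.  Absolute link
 * rates are deliberately omitted. */
#include <stdio.h>

static double relative_bisection(int cbricks, int planes)
{
    return (double)cbricks * planes;
}

int main(void)
{
    for (int bricks = 16; bricks <= 128; bricks *= 2)
        printf("%3d C-bricks: single plane %5.0f, dual plane %5.0f (relative units)\n",
               bricks, relative_bisection(bricks, 1), relative_bisection(bricks, 2));
    return 0;
}
```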
[Figure 2: Fat-tree NUMAlink topology of a 512-processor Altix 3000 configuration]
Table 1 lists the main memory
characteristics of Altix 3000. The two numbers in the
bandwidth-per-processor column correspond to NUMAlink 3 and NUMAlink
4. The two numbers in the maximum local memory column correspond to
using 512-Mbyte and 1-Gbyte DIMMs. Because designers can add memory
DIMMs to the system using M-bricks, they can scale memory independently of processors. Hence, it is entirely possible to build a
system with 16 processors and 4 Tbytes of shared cache-coherent
memory.
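Reduced to arithmetic, the 16-processor, 4-Tbyte example above works out as follows (assuming the 1-Gbyte DIMMs mentioned for Table 1):

```c
/* The 16-processor / 4-Tbyte example above, reduced to arithmetic. */
#include <stdio.h>

int main(void)
{
    const double total_tb = 4.0;
    const int    cpus     = 16;
    printf("Coherent shared memory per processor: %.0f Gbytes\n",
           total_tb * 1024.0 / cpus);                 /* 256 Gbytes each */
    printf("DIMMs needed at 1 Gbyte each: %.0f\n",
           total_tb * 1024.0);                        /* 4096 DIMMs */
    return 0;
}
```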
[Table 1: Altix 3000 main memory characteristics]
Reliability, availability, and serviceability (RAS)
SGI designed the Altix
3000 family to provide a robust operating environment. The design
protects data flowing around the system using a number of
techniques.
Memory is protected by an error-correcting code that can correct all single-bit errors, detect multiple-bit errors, and provide chip-kill error detection.
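As a purely illustrative example of single-error-correct, double-error-detect (SECDED) protection, here is a toy Hamming(7,4)-plus-parity code for 4 data bits. The real memory ECC operates on much wider words and adds the chip-kill capability; this sketch only demonstrates the correct/detect behavior described above.

```c
/* Toy SECDED sketch: Hamming(7,4) plus an overall parity bit.
 * NOT the Altix memory-controller ECC -- illustration only. */
#include <stdio.h>

/* Encode 4 data bits d[0..3] into an 8-bit codeword c[0..7]: c[0] is the
 * overall parity bit, c[1..7] use the classic Hamming layout with parity
 * bits at positions 1, 2, and 4. */
static void encode(const int d[4], int c[8])
{
    c[3] = d[0]; c[5] = d[1]; c[6] = d[2]; c[7] = d[3];   /* data bits      */
    c[1] = d[0] ^ d[1] ^ d[3];                            /* covers 1,3,5,7 */
    c[2] = d[0] ^ d[2] ^ d[3];                            /* covers 2,3,6,7 */
    c[4] = d[1] ^ d[2] ^ d[3];                            /* covers 4,5,6,7 */
    c[0] = c[1]^c[2]^c[3]^c[4]^c[5]^c[6]^c[7];            /* overall parity */
}

/* Decode in place.  Returns 0 = clean, 1 = single-bit error corrected,
 * 2 = double-bit error detected (uncorrectable). */
static int decode(int c[8])
{
    int s = (c[1]^c[3]^c[5]^c[7])
          | (c[2]^c[3]^c[6]^c[7]) << 1
          | (c[4]^c[5]^c[6]^c[7]) << 2;                   /* error position */
    int overall = c[0]^c[1]^c[2]^c[3]^c[4]^c[5]^c[6]^c[7];

    if (s == 0 && overall == 0) return 0;   /* no error                     */
    if (overall == 1) {                     /* single-bit error             */
        c[s] ^= 1;                          /* s == 0 means the parity bit  */
        return 1;
    }
    return 2;                               /* two bits flipped: detect only */
}

int main(void)
{
    int d[4] = {1, 0, 1, 1}, c[8];
    encode(d, c);
    c[5] ^= 1;                              /* inject a single-bit error */
    printf("decode status: %d (expect 1)\n", decode(c));
    printf("recovered d[1]: %d (expect 0)\n", c[5]);
    return 0;
}
```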
The NUMAlink channels protect the messages flowing across them with a CRC code. If the system detects a CRC error at the receiving
end of a NUMAlink channel, it resends the message, enabling the
system to provide reliable communications. The dual-plane NUMAlink
interconnect fabric also provides enhanced availability since the
system design allows it to remain fully operational if one of the
planes fails.
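The sketch below illustrates the check-and-resend idea with a generic CRC-16/CCITT; the actual NUMAlink CRC polynomial and retry protocol are implemented in hardware and are not described in this article.

```c
/* Check-and-resend sketch using a generic CRC-16/CCITT.  Illustrative
 * only; the real NUMAlink link-level protocol is done in hardware. */
#include <stdint.h>
#include <stdio.h>

static uint16_t crc16(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)buf[i] << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Receiver side: accept only if the appended CRC matches;
 * otherwise request a retransmission. */
static int receive(const uint8_t *msg, size_t len, uint16_t sent_crc)
{
    return crc16(msg, len) == sent_crc;   /* 1 = accept, 0 = ask for resend */
}

int main(void)
{
    uint8_t msg[] = "cache line 0x42";
    uint16_t crc = crc16(msg, sizeof msg);

    msg[3] ^= 0x01;                       /* simulate a bit error in flight */
    if (!receive(msg, sizeof msg, crc)) {
        msg[3] ^= 0x01;                   /* "retransmission" restores the data */
        printf("CRC mismatch -> resent: %s\n",
               receive(msg, sizeof msg, crc) ? "accepted" : "failed");
    }
    return 0;
}
```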
The system can run multiple instances of the Linux kernel on separate nodes, so even if one node suffers a fatal error, other nodes may
continue operating while the system repairs or reboots the failed
node. The Altix 3000 packaging provides N+1 hot-swap fans and N+1
redundant power supplies on each of the C-bricks, R-bricks, and I/O
bricks in the system.
Linux superclusters
SGI
designers have scaled the standard Linux operating system to a new peak of 64 processors within a single system image on the Altix
3000 family. In a traditional cluster, designers must provide each
host with the maximum memory that any process may ever need. In
addition, each host needs enough memory to run its own copy of the operating system. If a process needs more memory
than a single host can provide, then an engineer must rework the
code to spread that load across multiple hosts, if
possible.
With a large host size, a larger pool of memory is
available to each individual process running on that host. For
example, a single 64-processor Linux kernel system will have up to 4
Tbytes of memory available for a single process to use. This gives
application developers an extraordinarily large sandbox to work in,
so they can concentrate on the demands of their applications without
worrying about arbitrary node configuration limits.
The SGI
NUMAflex architecture supports the capability of having multiple
nodes on a single NUMAlink network that are independent systems,
each running its own copy of the operating system. The physical memory of an Altix 3000 system can be separated by firewalls, which can be raised or lowered to prevent or allow memory, CPU, and I/O access across the node boundary by processes on the other side. SGI NUMAflex architectures also contain Block Transfer
Engines (BTEs), which can operate as cache-coherent DMA engines.
BTEs are used to copy data from one physical memory range to another
at very high bandwidth, even across node
boundaries.
Internode memory access allows a process to access memory belonging to other processes on the same Altix 3000 system. This
memory can reside within the same node or on a separate node. The
system can access memory by data copies utilizing the BTE or by
directly sharing the underlying physical memory. Figure 3 depicts
the software stacks that allow designers to build internode
shared-memory (XPMEM) and networking (XPNET) software layers for SGI
Linux.
[Figure 3: Software stacks for the internode shared-memory (XPMEM) and networking (XPNET) layers]
The XP and
XPC kernel modules provide a reliable and fault-tolerant internode
communication channel that transfers data over the NUMAlink
interconnect. XPNET utilizes the NUMAlink interconnect to provide
high-speed TCP and UDP protocols for applications. The libxpmem user
library and the XPMEM kernel module provide internode memory access
to user applications. XPMEM allows a source process to make regions
of its virtual address space accessible to processes within or
across node boundaries. The source process can define a permission
requirement to limit which processes can access its underlying
memory. Other processes can then request access to this region of
memory and can attach to it if they satisfy the permission
requirement.
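A hypothetical usage sketch follows. The article does not spell out the libxpmem interface, so the call names and signatures below follow the later open-source xpmem library (xpmem_make, xpmem_get, xpmem_attach) and should be read as illustrative rather than as SGI’s 2003 API; in real use the segment ID would be passed to the consumer process through an out-of-band channel rather than used inline.

```c
/* Hypothetical XPMEM usage sketch.  Call names and signatures follow the
 * later open-source xpmem library, not necessarily SGI's 2003 libxpmem.
 * Error handling omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <xpmem.h>                        /* assumed header name */

int main(void)
{
    size_t len = 1 << 20;
    void *buf;
    if (posix_memalign(&buf, 4096, len))  /* page-aligned source region */
        return 1;

    /* Source process: expose a region of its virtual address space. */
    xpmem_segid_t seg = xpmem_make(buf, len, XPMEM_PERMIT_MODE, (void *)0600);

    /* Consumer process (shown inline here for brevity): gain permission to
     * the segment, then attach it into its own address space. */
    xpmem_apid_t ap = xpmem_get(seg, XPMEM_RDWR, XPMEM_PERMIT_MODE, (void *)0600);
    struct xpmem_addr at = { .apid = ap, .offset = 0 };
    char *mapped = xpmem_attach(at, len, NULL);

    /* From here on, ordinary cache-coherent loads and stores do the work. */
    mapped[0] = 42;
    printf("source now sees %d\n", ((char *)buf)[0]);
    return 0;
}
```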
Once the region is attached and the underlying physical memory is faulted in, the remote process operates on it via
cache-coherent loads and stores just as the source process does.
XPMEM locks physical pages shared across node boundaries in memory so they cannot be swapped out. The XPMEM kernel module does
this dynamically and only when a physical page is first used across
node boundaries. Before that, the system does not require locking of
the physical page in memory.
The SGI Message Passing Toolkit
(MPI + SHMEM) is optimized to use XPMEM via the process-to-process
interfaces outlined above. Future enhancements for XPMEM may include
interfaces similar to existing System V shared-memory
interfaces.
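For flavor, here is a one-sided SHMEM put of the kind MPT layers over XPMEM. It is written against the generic OpenSHMEM interface; SGI’s 2003 MPT spelled some of these calls differently (for example, start_pes()), so the exact names are an assumption.

```c
/* One-sided SHMEM put, the style of process-to-process transfer MPT layers
 * over XPMEM.  Uses generic OpenSHMEM call names (an assumption). */
#include <stdio.h>
#include <shmem.h>

long target;                      /* symmetric variable, exists on every PE */

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    long value = 1000 + me;
    /* PE 0 writes directly into PE 1's copy of 'target' -- no receive
     * call is needed on the other side. */
    if (me == 0 && npes > 1)
        shmem_long_put(&target, &value, 1, 1);

    shmem_barrier_all();          /* make the put visible everywhere */
    if (me == 1)
        printf("PE 1 received %ld from PE 0\n", target);

    shmem_finalize();
    return 0;
}
```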
The global shared-memory capabilities of the
Altix 3000 create a powerful new server and supercluster
architecture for HPC applications. With a fast, extensible, globally
shared, cache-coherent memory, and an array of APIs, users will be
able to solve computational problems of greater mathematical
complexity at finer resolution in a shorter time.
. . . . .
Michael Woodacre is chief engineer of the
Server Products Division for SGI. He is responsible for future
system architecture. Michael’s interests include cache-coherence
protocols, microprocessor architecture, scalable system design, and
verification. He received a BS in computer systems engineering from
the University of Kent, Canterbury, UK.
SGI, also
known as Silicon Graphics, Inc., specializes in high-performance
computing, visualization, and storage. SGI’s vision is to provide
technology that enables the most significant scientific and creative
breakthroughs of the 21st century. Whether it’s sharing images to
aid in brain surgery, finding oil more efficiently, studying global
climate, or enabling the transition from analog to digital
broadcasting, SGI is dedicated to addressing the next class of
challenges for scientific, engineering, and creative users. For
further information about the company and its products or services,
contact:
SGI
1600 Amphitheatre Parkway
Mountain View, CA 94043
Tel: 650-960-1980
Web site: http://www.sgi.com/
. . . . .
Copyright © 2003
Embedded Computing Design. Reproduction in whole or part
without permission is prohibited. All rights reserved. 6/2003
www.embedded-computing.com