SGI global shared-memory architecture: Enabling system-wide shared memory

By Michael Woodacre
High-performance computing (HPC) users face the ever-increasing size and complexity of enormous data sets. They continually push the limits of the operating system by requiring larger numbers of CPUs, higher I/O bandwidth, and faster and more efficient parallel programming support. These requirements are changing the way OEMs design supercomputer systems – creating more efficient, cost-effective systems that must increasingly leverage embedded capabilities for specialized breakthrough functionality.

SGI meets these demands with technical compute servers and supercomputers based on its SGI NUMAflex cache-coherent, non-uniform memory architecture (SGI ccNUMA). A key enabler of NUMAflex is SGI NUMAlink, an embedded interconnect technology that dramatically reduces the time and resources required to run technical applications by managing extremely large data sets in a single, system-wide, shared-memory space called global shared memory.

In January of this year, SGI introduced SGI Altix 3000, a supercluster based on the unique NUMAflex system architecture used in SGI Origin 3000 systems. While Origin systems are based on the MIPS microprocessor and the IRIX operating system, the SGI Altix 3000 family combines NUMAflex with industry-standard components, such as the Intel Itanium 2 microprocessor and the Linux operating system.

With this embedded system interconnect, the Altix 3000 family enables new capabilities and time-to-solution breakthroughs that neither traditional Linux OS-based clusters nor competing SMP architectures can tackle. The NUMAflex architecture allows the system to scale application performance to up to 512 processors, all working together in a cache-coherent manner.

Global shared memory means that a single memory address space is visible to all system resources, including microprocessors and I/O, across all nodes. Systems with global shared memory allow access to all data in the system’s memory directly and extremely quickly, without having to go through I/O or networking bottlenecks. Systems with multiple nodes without global shared memory instead must pass copies of data, often in the form of messages, which can greatly complicate programming and slow down performance by significantly increasing the time processors must wait for data. Global shared memory requires a sophisticated system memory interconnect like NUMAlink.
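To make the contrast concrete, here is a generic shared-memory sketch (plain C with OpenMP, not SGI-specific code): every processor can update any part of one large array directly, whereas a cluster without global shared memory would first have to partition the array and pass copies between nodes as messages.

/*
 * Generic illustration (not SGI-specific code): on a global shared-memory
 * system every processor can address the whole data set directly, so a
 * parallel update needs no explicit data movement.
 */
#include <omp.h>
#include <stdlib.h>

#define N (1 << 24)                /* one large array, visible to all CPUs */

int main(void)
{
    double *data = malloc(N * sizeof *data);

    /* Each thread touches its share of the single address space directly;
     * on a message-passing cluster the same update would require the data
     * to be partitioned and copies exchanged explicitly between nodes.   */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        data[i] = 2.0 * i;

    free(data);
    return 0;
}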

NUMAflex architecture
NUMAflex uses an SGI NUMA protocol implemented directly in hardware for performance and a modular packaging scheme. NUMAflex gets its name from the flexibility it has to scale independently in the three dimensions of processor count, memory capacity, and I/O capacity.

The key to the NUMAflex design of the Altix 3000 system is a controller ASIC, referred to as the SHUB, that interfaces to the Itanium 2 front-side bus, the memory DIMMs, the I/O subsystem, and the other NUMAflex components in the system. Altix 3000 is built from a number of component modules, or bricks, most of which are shared with Origin 3000. The C-brick, or compute brick, is the module that customizes the system to a given processor architecture. The Altix C-brick consists of four processors connected to two SHUBs and up to 32 Gbytes of memory, implemented on two equal “node” boards in a 3U brick. Figure 1 shows a schematic of the Altix 3000 C-brick.

The remaining components of the system are:
  • The R-brick, an eight-port NUMAlink 3 router brick, which architects use to build the interconnect fabric between the C-bricks
  • The M-bricks, memory bricks, for independent memory scaling on the same embedded interconnect fabric
  • The IX-brick, the base I/O brick, and the PX-brick, a PCI-X expansion brick, which attach to the C-brick via the I/O channel
  • The D-brick2, a second-generation JBOD brick

SGI supplies a variety of networking, Fibre Channel SAN, RAID, and offline storage products to complete the Altix 3000 offering.

Figure 1: Altix 3000 C-brick schematic

Cache coherency
The cache-coherency protocol on the Altix 3000 family is implemented in the SHUB ASIC, which interfaces to both the snooping operations of the Itanium 2 processor and the directory-based scheme used across the NUMAlink interconnect fabric. If the contents of a neighboring cache can satisfy a processor’s request, data flows directly from one processor cache to the other without the extra latency of sending the request to memory. The directory-based cache coherence used in Altix requires the system to inform only the processors currently playing an active role in the use of a given cache line about an operation on that cache line. This approach reduces the amount of coherence information that must flow around the system, resulting in lower system overhead, reduced latency, and higher delivered bandwidth for actual data operations.
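The sketch below illustrates only the bookkeeping idea behind a directory entry; the actual SHUB protocol states and encodings are considerably more elaborate. Each line of memory has a directory entry recording which nodes hold copies, so a write triggers invalidations only to those nodes rather than a broadcast to the whole system.

/* Hedged sketch of directory-based coherence bookkeeping only; the real
 * SHUB protocol, states, and encodings are more elaborate than shown.    */
#include <stdint.h>

#define MAX_NODES 64

typedef enum { UNOWNED, SHARED, EXCLUSIVE } dir_state_t;

struct dir_entry {                 /* one entry per memory line            */
    dir_state_t state;
    uint64_t    sharers;           /* bit i set => node i holds a copy     */
    int         owner;             /* valid when state == EXCLUSIVE        */
};

/* A write request only disturbs the nodes actually caching the line,
 * rather than broadcasting an invalidation to every node in the system.  */
static void handle_write(struct dir_entry *e, int requester,
                         void (*send_invalidate)(int node))
{
    if (e->state == SHARED) {
        for (int n = 0; n < MAX_NODES; n++)
            if (((e->sharers >> n) & 1) && n != requester)
                send_invalidate(n);
    } else if (e->state == EXCLUSIVE && e->owner != requester) {
        send_invalidate(e->owner); /* recall/invalidate the dirty copy     */
    }
    e->state   = EXCLUSIVE;
    e->owner   = requester;
    e->sharers = 1ull << requester;
}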

Interconnection network
Altix 3000 makes use of a new advance in NUMAflex technology: the NUMAlink 4 communications channel. Developers have employed prior generations of NUMAlink in SGI scalable systems. NUMAlink 4 provides double the bandwidth of NUMAlink 3 while maintaining compatibility with NUMAlink 3 physical connections. NUMAlink 4 is able to achieve this performance boost by employing advanced bidirectional signaling technology.

Engineers configured the NUMAflex network for Altix in a fat-tree topology. Figure 2 shows this topology for a 512-processor configuration. The circles in the figure represent R-bricks, the lines represent NUMAlink cables, and the 128 small squares across the center of the diagram represent C-bricks.

The fat-tree topology enables system performance to scale well by providing a linear increase in bisection bandwidth as systems grow in size. Altix 3000 also provides dual-plane, or parallel, networks for increased bisection bandwidth; this dual-plane configuration is made possible by the two NUMAlink ports on the Altix C-brick. Because Altix is initially deployed with the NUMAlink 3 router brick, the system will be able to double its bisection bandwidth when the NUMAlink 4 router brick becomes available, allowing system capabilities to grow along with the demands of new generations of Itanium 2 family microprocessors.

Figure 2: NUMAlink fat-tree topology for a 512-processor Altix 3000 configuration
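The back-of-the-envelope model below shows why this matters: in a full fat tree, bisection bandwidth grows linearly with the number of C-bricks, and a second plane doubles it again. The per-link bandwidth used here is a placeholder, not a figure taken from Table 1.

/* Rough model of fat-tree bisection-bandwidth scaling.  The per-link
 * bandwidth below is a placeholder value, not an SGI specification.      */
#include <stdio.h>

int main(void)
{
    const double link_gb_s = 1.0;   /* hypothetical GB/s per NUMAlink cable */
    const int    planes    = 2;     /* dual-plane interconnect              */

    for (int cbricks = 32; cbricks <= 128; cbricks *= 2) {
        /* In a full fat tree, roughly half the endpoints' links cross the
         * bisection, and each plane contributes independently.            */
        double bisection = (cbricks / 2.0) * link_gb_s * planes;
        printf("%3d C-bricks: ~%.0f GB/s bisection bandwidth\n",
               cbricks, bisection);
    }
    return 0;
}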

Table 1 lists the main memory characteristics of Altix 3000. The two numbers in the bandwidth-per-processor column correspond to NUMAlink 3 and NUMAlink 4. The two numbers in the maximum local memory column correspond to using 512-Mbyte and 1-Gbyte DIMMs. Because designers can add memory DIMMs to the system using M-bricks, they can scale the memory independent of processors. Hence, it is entirely possible to build a system with 16 processors and 4 Tbytes of shared cache-coherent memory.

Table 1: Main memory characteristics of Altix 3000

Reliability, availability, and serviceability (RAS)
SGI designed the Altix 3000 family to provide a robust operating environment. The design protects data flowing around the system using a number of techniques.

Memory is protected by an error-correcting code that can correct all single-bit errors, detect multiple-bit errors, and provide chip-kill error detection.

Messages flowing across the NUMAlink channels are protected with a CRC code. If the receiving end of a NUMAlink channel detects a CRC error, the message is resent, enabling the system to provide reliable communications. The dual-plane NUMAlink interconnect fabric also provides enhanced availability, since the system design allows it to remain fully operational if one of the planes fails.
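Conceptually, the link-level retry works like the toy program below, in which a checksum failure at the receiver causes the sender to retransmit; a placeholder checksum and a simulated lossy channel stand in for the real NUMAlink hardware, which implements this entirely in silicon.

/* Conceptual sketch, not the hardware implementation: the receiving end of
 * a link checks a CRC and the sender retransmits until the message arrives
 * intact.  A toy checksum and a simulated lossy channel stand in for the
 * real NUMAlink hardware. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct message {
    char     payload[64];
    uint32_t crc;
};

static uint32_t toy_crc(const char *buf, size_t len)
{
    uint32_t c = 0xffffffffu;
    for (size_t i = 0; i < len; i++)
        c = (c << 5) + c + (uint8_t)buf[i];  /* placeholder, not a real CRC */
    return c;
}

/* Simulated channel: occasionally flips a bit, as a hardware fault might. */
static void channel_transfer(const struct message *src, struct message *dst)
{
    *dst = *src;
    if (rand() % 4 == 0)
        dst->payload[0] ^= 0x01;
}

int main(void)
{
    struct message tx = { "coherence traffic", 0 }, rx;
    tx.crc = toy_crc(tx.payload, sizeof tx.payload);

    int attempts = 0;
    do {                                    /* sender retries on CRC failure */
        channel_transfer(&tx, &rx);
        attempts++;
    } while (toy_crc(rx.payload, sizeof rx.payload) != rx.crc);

    printf("delivered intact after %d attempt(s)\n", attempts);
    return 0;
}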

The system can run multiple nodes, each with its own Linux kernel, so even if one node suffers a fatal error, other nodes may continue operating while the failed node is repaired or rebooted. The Altix 3000 packaging provides N+1 hot-swap fans and N+1 redundant power supplies on each of the C-bricks, R-bricks, and I/O bricks in the system.

Linux superclusters
SGI designers have scaled the standard Linux operating system to a new peak of 64 processors within a single system image on the Altix 3000 family. In a traditional cluster, designers must provide each host with the maximum memory that any process may ever need, plus enough memory to run a copy of the operating system on each host. If a process needs more memory than a single host can provide, an engineer must rework the code to spread that load across multiple hosts, if that is possible at all.

With a large host size, a larger pool of memory is available to each individual process running on that host. For example, a single 64-processor Linux kernel system will have up to 4 Tbytes of memory available for a single process to use. This gives application developers an extraordinarily large sandbox to work in, so they can concentrate on the demands of their applications without worrying about arbitrary node configuration limits.
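As a minimal illustration of that programming model, the fragment below simply allocates and touches a very large array from one process. The 512-Gbyte figure is an arbitrary example, and success naturally depends on how much memory is actually installed in the system.

/* Illustration of the programming-model point only: on a large single-
 * system-image machine, one process can simply request a very large
 * allocation and address it directly, with no sharding across hosts.     */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* An arbitrary, very large working set -- far beyond what a single
     * node of a conventional cluster could hold.                         */
    size_t nbytes = 512ull * 1024 * 1024 * 1024;   /* 512 Gbytes */
    double *field = malloc(nbytes);

    if (field == NULL) {
        fprintf(stderr, "allocation failed: not enough memory installed\n");
        return 1;
    }

    /* The process addresses the whole data set through ordinary loads and
     * stores; no explicit distribution across hosts is needed.           */
    size_t nelems = nbytes / sizeof *field;
    for (size_t i = 0; i < nelems; i += nelems / 8)
        field[i] = 0.0;

    free(field);
    return 0;
}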

The SGI NUMAflex architecture supports having multiple nodes on a single NUMAlink network that are independent systems, each running its own copy of the operating system. The physical memory of an Altix 3000 system can be separated by firewalls, which can be raised or lowered to prevent or allow memory, CPU, and I/O access across the node boundary by processes on the other side. SGI NUMAflex architectures also contain Block Transfer Engines (BTEs), which operate as cache-coherent DMA engines and copy data from one physical memory range to another at very high bandwidth, even across node boundaries.

Internode memory access allows a process to access memory belonging to other processes on the same Altix 3000 system. This memory can reside within the same node or on a separate node. Memory can be accessed either by data copies using the BTE or by directly sharing the underlying physical memory. Figure 3 depicts the software stacks that provide the internode shared-memory (XPMEM) and networking (XPNET) layers for SGI Linux.

Figure 3: Software stacks for the internode shared-memory (XPMEM) and networking (XPNET) layers on SGI Linux

The XP and XPC kernel modules provide a reliable and fault-tolerant internode communication channel that transfers data over the NUMAlink interconnect. XPNET utilizes the NUMAlink interconnect to provide high-speed TCP and UDP protocols for applications. The libxpmem user library and the XPMEM kernel module provide internode memory access to user applications. XPMEM allows a source process to make regions of its virtual address space accessible to processes within or across node boundaries. The source process can define a permission requirement to limit which processes can access its underlying memory. Other processes can then request access to this region of memory and can attach to it if they satisfy the permission requirement.
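The usage pattern looks roughly like the sketch below. It is written against the XPMEM interface as it appears in later open-source releases of the library, so the exact libxpmem calls shipped with Altix may have differed in detail; the exchange of the segment ID between processes (for example over a file or a socket) is omitted.

/* Hedged sketch of the XPMEM make/get/attach pattern; written against the
 * interface found in later open-source XPMEM releases, which may differ in
 * detail from the libxpmem shipped with Altix at the time. */
#include <xpmem.h>

#define REGION_SIZE (1 << 20)

/* Source process: expose a region of its own virtual address space.      */
xpmem_segid_t export_region(void *buf)
{
    /* Only processes that satisfy the 0600 mode check may attach.        */
    return xpmem_make(buf, REGION_SIZE, XPMEM_PERMIT_MODE, (void *)0600);
}

/* Remote process (possibly on another node): attach and use the memory.  */
double *import_region(xpmem_segid_t segid)
{
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR,
                                  XPMEM_PERMIT_MODE, (void *)0600);
    if (apid == -1)
        return NULL;

    struct xpmem_addr addr = { .apid = apid, .offset = 0 };
    void *mapped = xpmem_attach(addr, REGION_SIZE, NULL);

    /* From here on, ordinary cache-coherent loads and stores reach the
     * source process's physical pages, even across node boundaries.      */
    return (mapped == (void *)-1) ? NULL : (double *)mapped;
}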

Once the memory is attached and the underlying physical pages are faulted in, the remote process operates on them via cache-coherent loads and stores, just as the source process does. XPMEM locks shared physical pages in memory across node boundaries so they cannot be swapped out. The XPMEM kernel module does this dynamically, and only when a physical page is first used across node boundaries; before that, the page does not need to be locked in memory.

The SGI Message Passing Toolkit (MPI and SHMEM) is optimized to use XPMEM via the process-to-process interfaces outlined above. Future enhancements for XPMEM may include interfaces similar to existing System V shared-memory interfaces.
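For illustration, here is a minimal SHMEM-style program in the spirit of that toolkit, written against the classic SHMEM interface (start_pes, shmem_long_put); the exact calls available in a given MPT release may differ. One processing element deposits data directly into another's memory with a single one-sided put, which over XPMEM becomes direct access to the remote pages rather than a message exchange.

/* Minimal one-sided SHMEM sketch, using the classic SHMEM interface;
 * available calls may differ between MPT releases. */
#include <mpp/shmem.h>
#include <stdio.h>

#define N 8

long target[N];                    /* symmetric: exists on every PE        */

int main(void)
{
    long source[N];

    start_pes(0);                  /* join the SHMEM job                   */
    int me   = _my_pe();
    int npes = _num_pes();

    for (int i = 0; i < N; i++)
        source[i] = me * 100 + i;

    /* PE 0 writes straight into PE 1's copy of 'target'.  Over XPMEM this
     * becomes a direct access to the remote pages rather than a message. */
    if (me == 0 && npes > 1)
        shmem_long_put(target, source, N, 1);

    shmem_barrier_all();

    if (me == 1)
        printf("PE 1 received target[0] = %ld\n", target[0]);

    return 0;
}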

The global shared-memory capabilities of the Altix 3000 create a powerful new server and supercluster architecture for HPC applications. With a fast, extensible, globally shared, cache-coherent memory, and an array of APIs, users will be able to solve computational problems of greater mathematical complexity at finer resolution in a shorter time.

. . . . .

Michael Woodacre is chief engineer of the Server Products Division for SGI. He is responsible for future system architecture. Michael’s interests include cache-coherence protocols, microprocessor architecture, scalable system design, and verification. He received a BS in computer systems engineering from the University of Kent, Canterbury, UK.

SGI, also known as Silicon Graphics, Inc., specializes in high-performance computing, visualization, and storage. SGI’s vision is to provide technology that enables the most significant scientific and creative breakthroughs of the 21st century. Whether it’s sharing images to aid in brain surgery, finding oil more efficiently, studying global climate, or enabling the transition from analog to digital broadcasting, SGI is dedicated to addressing the next class of challenges for scientific, engineering, and creative users. For further information about the company and its products or services, contact:

SGI
1600 Amphitheatre Parkway
Mountain View, CA 94043
Tel: 650-960-1980
Web site: http://www.sgi.com/

. . . . .

Copyright © 2003 Embedded Computing Design. Reproduction in whole or part without permission is prohibited. All rights reserved. 6/2003 www.embedded-computing.com


