Stepping Back to Move Forward: Breaking the rules could offer some new avenues for the future of HPC.

The History of Cluster HPC

The history of cluster HPC is rather interesting. In the early days, the late 1990s, HPC clusters, or “Beowulfs” as they were called, were often cobbled together in inventive ways. The standard 1U server was just starting to emerge, and many systems had a real DIY look and feel. HPC clusters at that point in time were a pioneering proposition. Indeed, many of the first cluster users actually built systems themselves, using shelves and tower cases. Fast Ethernet was the interconnect of choice, and if you could afford it, there was this small company called Myricom that made something much faster called Myrinet. To offer some perspective, this formative time period was the age of the DEC Alpha and Intel Pentium Pro processors.

As with all pioneers, there was no trail to follow. Wrong turns were common, but at the same time, there could be immense rewards for “doing it different.” The rule of thumb for early clusters was a factor of 10 (or more) reduction in price-to-performance ratio, coupled with a similar reduction in entry price. Additionally, the entry-level price was scalable: you could tailor the machine to your budget. Almost overnight, clusters of all sizes started showing up in homes, labs, offices, and datacenters.

In those early days, some people suggested that cluster builders and users were doing it the wrong way. Historically, HPC required RISC processors, exotic memory designs, and plenty of proprietary glue to hold it all together. These machines were traditional supercomputers. How could anyone get usable results with some x86 processors and Ethernet in the back of their lab? There was also this knock-off Unix OS called Linux. To many people, the whole thing smacked of a short-term hobbyist craze.

The story ends with a complete takeover of the HPC market by the Linux x86 cluster. The predominant interconnects included Gigabit Ethernet, Myrinet, and InfiniBand (IB). Rows of rack-mounted 1U servers have become the new supercomputers and are now called HPC systems.

In the mid-2000s, things began to change a bit. First, processors hit the megahertz wall; instead of growing faster, processors grew wider, and the multicore age began. Second, the use of GP-GPUs for HPC started to take hold. These changes created some challenges for HPC users. Cluster nodes are now powerful eight-way (or higher) SMP islands, with optional GP-GPU enhancement, connected to similar nodes.

Cramming more CPU cores into a single node seems to work well for web servers and virtualization, but for HPC, the payoff is not as clear. The first issue to consider is memory bandwidth. Cores need access to memory, and the more cores there are, the more traffic jams can result. Both AMD HyperTransport and Intel QuickPath Interconnect help mitigate this issue by creating several connected memory domains (one for each CPU socket), but data locality has now become a major performance issue. The second question concerns how far multicore can go before memory bandwidth stifles everything.
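
To make the data locality point concrete, the short sketch below (a hedged illustration, not production code) uses the common “first touch” technique with OpenMP: each thread initializes the portion of the array it will later compute on, so the operating system places those pages in that thread’s local memory domain. The array size and compiler flags are only examples; it assumes a Linux node and a compiler with OpenMP support (e.g., gcc with -fopenmp).

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (64 * 1024 * 1024)   /* illustrative array size */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double sum = 0.0;

        /* First touch: each thread initializes the pages it will later use,
           so the OS places them in that thread's local memory domain. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 1.0;

        /* The compute loop uses the same static schedule, so each thread
           reads mostly from local memory rather than a remote socket. */
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += 2.0 * a[i];

        printf("sum = %f with %d threads\n", sum, omp_get_max_threads());
        free(a);
        return 0;
    }

On a dual-socket node, pinning the threads to cores (for example, with numactl or the OMP_PROC_BIND environment variable) keeps the first-touch placement from being undone by thread migration.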

In a similar vein, GP-GPUs have gotten much faster and much better at doing HPC on a single cluster node. The concerns with this approach include rewriting software and shuffling data back and forth across the PCIe bus, but for some applications, GP-GPUs are a big win. Some efforts have been made to use IB to connect GPUs across nodes. Similar to the growth of multicore, GP-GPU-enabled cluster nodes have become extremely powerful systems but might be limited as to how many more cores can effectively operate in a single box.

A new design or trend called “Manycore” has entered the market. The only rule as to what constitutes a manycore processor is that the design is scalable beyond the limits seen for traditional multicore processors. Some may consider GP-GPUs to be manycore; however, GP-GPUs need a host CPU to work effectively. As promising as the manycore approach might be, you still face the issue of programming. In some respects, specialized manycore processors might never enjoy the advantages of the commodity market, at least in the near term.

With the cost of a single CPU or GPU core dropping each year, the HPC market should be celebrating each time more cores show up in the cluster node. Looking forward, however, there are some real issues that suggest other paths might be worth exploring. In particular, the software issue is troubling. Most traditional HPC code uses MPI (Message Passing Interface) to communicate between cores. Although MPI will work on multicore nodes, it might not be the most efficient approach. Other programming models, such as OpenMP, have been used to program multicore nodes, but such programs are limited to the local node. Similarly, programming GP-GPUs requires rewriting code to use the GPU cores. Some new methods (OpenACC) could help with this issue, but the traditional MPI model is being stressed by the new hardware. As in the early days of Beowulf cluster HPC, following market trends might not be the most effective path in all cases. Instead of pushing more and more cores into nodes, other methods and approaches might deserve some consideration in the HPC world.
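
To illustrate how the programming models differ, the sketch below shows the same simple loop written two ways: an OpenMP directive spreads it across the cores of one node, while an OpenACC directive asks the compiler to offload it (and move the data) to an attached accelerator. This is only a minimal illustration; the function names are made up, and it assumes compilers that support the respective directives (for OpenACC, typically a vendor compiler). Without such support, the pragmas are simply ignored and the loops run serially.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    /* Multicore version: one OpenMP directive, otherwise plain serial C. */
    static void scale_omp(double *x, double a)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            x[i] *= a;
    }

    /* Accelerator version: OpenACC offloads the loop to a GPU, and the
       copy clause describes the data that must move to and from it. */
    static void scale_acc(double *restrict x, double a)
    {
        #pragma acc parallel loop copy(x[0:N])
        for (int i = 0; i < N; i++)
            x[i] *= a;
    }

    int main(void)
    {
        double *x = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++)
            x[i] = 1.0;

        scale_omp(x, 2.0);   /* runs on the node's CPU cores */
        scale_acc(x, 0.5);   /* runs on an attached GP-GPU, if present */

        printf("x[0] = %f\n", x[0]);   /* should print 1.000000 */
        free(x);
        return 0;
    }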

Single Socket Redux

The latest server might offer a large number of cores, but the important question for HPC is how well they perform in parallel. This question might not have been asked because it is assumed that all modern servers have at least two CPU sockets and at least eight cores (two quad-core processors). Little effort has been made to assess the “effective cores” delivered by multicore systems. Results for a 12-core (dual-socket with six-core processors) server can be found here. These tests, which used a variety of numerical benchmarks, resulted in a range from 41% to 98% efficiency, with an average utilization for all tests of 64%. Thus, on average, you can expect to effectively use 7.7 cores out of the 12 present in the server.

In contrast to these results, similar tests done on a number of four-core single-socket processors showed that efficiency ranged from 50% to 100%, with an average utilization of 74%. On average, one can expect to effectively use three out of four cores.
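
The “effective cores” figure can be read as the number of physical cores multiplied by the measured parallel efficiency. A trivial sketch of that arithmetic, using the averages quoted above:

    #include <stdio.h>

    /* Effective cores: the cores' worth of useful work a node delivers,
       i.e., physical cores times measured parallel efficiency. */
    static double effective_cores(int cores, double efficiency)
    {
        return cores * efficiency;
    }

    int main(void)
    {
        /* Average efficiencies quoted in the text for each node type. */
        printf("12-core dual-socket node:  %.1f effective cores\n",
               effective_cores(12, 0.64));   /* about 7.7 */
        printf("4-core single-socket node: %.1f effective cores\n",
               effective_cores(4, 0.74));    /* about 3.0 */
        return 0;
    }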

The variation is due to the memory bandwidth of each system. In general, more cores means more sharing of memory and more possible contention. Cache-friendly programs usually scale well on multicore, whereas those that rely on heavy access to main memory have the most difficulty with large multicore systems.

A valid argument for high-density multicore nodes is the cost amortization of power supplies, hard drives, interconnects, and case/rack hardware across the large number of cores in a single node. This does make sense, but unless the amortization is based on effective cores, the assumed savings might not accurately reflect the reality of the situation. Using a single-socket node also reduces the MPI messaging and I/O load on the interconnect but does increase the number of switch ports and network cards needed. In some cases, lower-cost Gigabit Ethernet might be adequate for single-socket nodes, thus offsetting the increase in interconnect costs. Furthermore, it is possible to build nodes that contain multiple single-socket motherboards that share power supply and packaging costs, gaining back some of the lost amortization.
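
A back-of-the-envelope sketch of that amortization argument is shown below. The prices are purely hypothetical, chosen only to show the mechanics: two node types can look similar on a per-physical-core basis yet diverge once the cost is divided by effective cores instead.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical node prices, for illustration only. */
        double dual_socket_price   = 3000.0;   /* 12-core dual-socket node  */
        double single_socket_price =  900.0;   /* 4-core single-socket node */

        printf("dual socket:   $%.0f per core, $%.0f per effective core\n",
               dual_socket_price / 12.0, dual_socket_price / 7.7);
        printf("single socket: $%.0f per core, $%.0f per effective core\n",
               single_socket_price / 4.0, single_socket_price / 3.0);
        return 0;
    }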

Single-socket nodes also provide a more MPI-friendly environment. There is no data locality issue, and programmers can continue to use existing code with little loss of efficiency and no re-programming.

Slice Up The GP-GPU

While GP-GPUs have made great strides in the last several years, one glaring bottleneck impedes some GP-GPU applications: the PCIe bus. Essentially, using GP-GPUs creates two memory domains. The first is the CPU memory on the main board, and the second is the GP-GPU memory on the PCIe board (see the sketch after the list below). Managing these two domains is not easy, and transferring between the two can totally swamp any computational advantage the GP-GPU might offer. Obviously, moving the GP-GPU directly onto the processor die would have several benefits:

  • No need to transfer data across the PCIe bus to and from the GP-GPU
  • No need for two memory regions (CPU vs. GPU)
  • Tighter integration between CPU and GPU (i.e., sharing caches and power control)
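
For reference, the sketch below shows the current two-domain arrangement that the first two points would eliminate. It uses standard CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaFree) from C; the kernel itself is omitted because the point is the data shuffle, not the computation.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define N (1 << 20)

    int main(void)
    {
        size_t bytes = N * sizeof(float);

        /* Memory domain 1: CPU memory on the main board. */
        float *host = (float *)malloc(bytes);
        for (int i = 0; i < N; i++)
            host[i] = 1.0f;

        /* Memory domain 2: GP-GPU memory out on the PCIe card. */
        float *device;
        cudaMalloc((void **)&device, bytes);

        /* Every input must cross the PCIe bus before the GPU can use it... */
        cudaMemcpy(device, host, bytes, cudaMemcpyHostToDevice);

        /* ...a kernel launch would go here (omitted)... */

        /* ...and every result must cross back again. */
        cudaMemcpy(host, device, bytes, cudaMemcpyDeviceToHost);

        printf("host[0] = %f\n", host[0]);
        cudaFree(device);
        free(host);
        return 0;
    }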

The engineering required to do this is not trivial, but it is not impossible. Because of power and heat issues, the full GP-GPU cannot live on the CPU die. It has to be smaller, which is fine because the GP-GPU will be smeared across the processors and not limited to a single large and hot PCIe card. Additionally, as with all new CPU hardware (the i8087, SSE, etc.), the compilers eventually hide these features from the programmer. When the GP-GPU or “SIMD unit” is part of the CPU, the compiler writers can go to work using this new hardware.
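
The SSE analogy can be seen in today's compilers: the loop below says nothing about SIMD hardware, yet a vectorizing compiler will map it onto whatever SIMD units the CPU exposes (for gcc, something like "gcc -O3 -march=native"). Presumably an on-die GPU could be exploited in the same way; the example is only an illustration of compiler-hidden hardware.

    #include <stdio.h>

    /* The C source mentions no SSE or AVX instructions; a vectorizing
       compiler (e.g., gcc -O3 -march=native) finds the SIMD hardware. */
    static void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float y[8] = {0};

        saxpy(8, 2.0f, x, y);
        printf("y[0] = %.1f, y[7] = %.1f\n", y[0], y[7]);   /* 2.0 and 16.0 */
        return 0;
    }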

Fortunately, AMD has recognized the advantages of this approach and has introduced processors based on their AMD Fusion technology. In the Fusion design, there is one memory bank for both the CPU and GPU. The design is so different that the term Accelerated Processing Unit, or APU, is used to describe these new processors. AMD has already released some desktop and laptop versions of these processors, but the sharing of memory is somewhat restricted (i.e., it looks like a separate video unit). The next-generation Trinity APU is said to have a shared L3 cache and should allow the CPU and GPU to work together better than previous generations.

The AMD Fusion APUs are not considered server processors. Right now, they are designed for the consumer laptop and desktop markets. Similar to the early days of HPC clusters, these processors might be “politically incorrect,” but they work rather well. They are designed for consumer single-socket operation and thus will not be found in off-the-shelf servers any time soon. It could be up to the pioneers to try these “wrong” processors for the right application.

Cheaper, Better, Faster

One of the driving slogans of early Beowulf clusters was “cheaper, better, faster.” To a large degree, this has been the case. Because of the many changes in the market, it might be worthwhile to rethink how best to use mainstream hardware for HPC. In many respects, the x86 HPC market has become legitimized, and it now shows up on many marketing pie charts. Other components of the HPC market, including interconnect, storage, and software, have also helped move HPC clusters from the back of labs and server rooms to the front of a respectable and sizable market. Given the hardware pressures facing the market, however, it might be time to set up some “wrong” hardware on those old shelves and see what happens.