A tour through AMD’s Opteron 6200 Series Linux Tuning Guide provides a practical look at some important HPC startup tasks.

Exploring AMD’s Opteron 6200 Series Linux Tuning Guide

AMD’s Opteron 6200 Series processor, which is based on the “Bulldozer” architecture, has been in production since October 2011, and potential users are still learning about the powers of the next-generation Opteron chip. Bulldozer is already beginning to appear in some high-profile HPC systems (such as the NCSA’s “Blue Waters” supercomputer in development at the University of Illinois), and AMD expects to deliver many more chips in 2012 for high-end, high-performance computer systems winding their way to market through an extensive network of partner vendors.

The “AMD Opteron 6200 Series Processors Linux Tuning Guide” is a hands-on, practical document from a team of AMD engineers spelling out a best-practice approach for configuring and tuning an AMD Opteron 6200 Series processor-based system. The Tuning Guide, which is available online, is intended for readers who are shopping for an HPC or industrial-strength IT server system, as well as readers who have already purchased an AMD Opteron 6200 Series processor-based system and are preparing for the configuration and testing phase. The authors provide a real-world summary of configuration tips and considerations, which makes the Tuning Guide essential reading for anyone who expects to guide an HPC system from the drawing board to the real world.

Bulldozer Architecture at a Glance

Three new features of the AMD Opteron 6200 Series processor are:

  • A new architecture designed to make optimal use of resources and provide high efficiency in HPC environments.
  • A system for hardware-based power optimization, including a new “boost” feature that will increase the processor speed by as much as 300MHz to make use of unused capacity.
  • Support for new instructions, including the fused multiply-add operation defined in the IEEE 754-2008 standard.

Figure 1 is a diagram of the AMD Opteron 6200 Series processor (courtesy of the AMD Opteron 6200 Series Processors Linux Tuning Guide).

Figure 1: A close look at Bulldozer’s next-generation architecture.

According to the Guide, the AMD Opteron 6200 Series processor is “… based on a building block called a module. Each module has two tightly coupled x86 processing engines that are called cores. The architecture is based on a ‘share what makes sense’ approach, with each core having its own dedicated resources, such as integer schedule, execution engine, and L1 cache.” Other resources, such as a floating-point unit, are shared between the two cores within the same module.

The AMD Opteron 6200 Series processors come in 4-, 8-, 12-, and 16-core models. The 16-core chip (Figure 1) comprises two 8-core dies, each with its own memory controller and two memory channels.
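
The arithmetic behind the 16-core layout can be written out as a short sketch. It uses only the figures quoted above; the function name and dictionary keys are illustrative and do not come from the Tuning Guide, and the last entry assumes that each die's memory controller shows up as its own NUMA node.

```python
# Toy arithmetic for the 16-core part, using only figures stated in the text.
CORES_PER_MODULE = 2   # two cores per "Bulldozer" module
CORES_PER_DIE = 8      # each die in the 16-core chip has 8 cores
DIES_PER_CHIP = 2      # the 16-core chip comprises two dies
CHANNELS_PER_DIE = 2   # two memory channels per die


def chip_summary():
    """Derive module, FPU, and memory-channel counts for the 16-core chip."""
    modules_per_die = CORES_PER_DIE // CORES_PER_MODULE
    modules = modules_per_die * DIES_PER_CHIP
    return {
        "cores": CORES_PER_DIE * DIES_PER_CHIP,
        "modules": modules,
        "shared_fpus": modules,  # one floating-point unit shared per module
        "memory_channels": CHANNELS_PER_DIE * DIES_PER_CHIP,
        "numa_nodes": DIES_PER_CHIP,  # assumption: one NUMA node per die
    }


print(chip_summary())
```

The point to notice is that a 16-core chip has eight shared floating-point units and two separate memory controllers, which is why the memory checks described later look at each node separately.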

Bulldozer’s instruction set extensions include FMA4, a four-operand implementation of the fused multiply-add operation defined in IEEE 754-2008. FMA4 performs the following in a single step:

A = B * C + D

Because fused multiply-add does in one step what used to take two (a multiply followed by an add), it offers a large performance edge for many floating-point workloads. Other enhancements include the new XOP instructions, which are intended for graphics and multimedia applications.
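
Besides speed, a fused multiply-add differs numerically from a separate multiply and add, because the result is rounded once instead of twice. The following sketch illustrates that difference in plain Python. It does not execute the FMA4 instruction itself; the fused result is emulated with exact rational arithmetic, and the function names and input values are chosen purely for illustration.

```python
from fractions import Fraction


def unfused(b, c, d):
    """Multiply and round to double, then add and round again."""
    return b * c + d


def fused(b, c, d):
    """Emulate a fused multiply-add: exact b*c + d, rounded to double once."""
    exact = Fraction(b) * Fraction(c) + Fraction(d)
    return float(exact)


# The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the separate multiply loses the small term before the add happens.
b = 1.0 + 2.0**-30
c = 1.0 - 2.0**-30
d = -1.0

print(unfused(b, c, d))  # 0.0
print(fused(b, c, d))    # -2**-60, about -8.67e-19
```

Code recompiled to use fused operations can therefore produce slightly different (usually more accurate) results than the same code built with separate multiplies and adds.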

Getting Ready

Preparing your AMD Opteron 6200 Series processor-based system for action starts with checking and configuring memory. According to the Tuning Guide, “Many performance issues are caused by poorly configured memory. Because HPC performance is strongly dependent on memory performance, we describe how to verify that the machine is configured to achieve the maximum memory performance possible with the AMD Opteron 4200/6200 Series Processors.”

This task begins with a physical memory configuration check. Make sure all DIMMs are installed and identical. The Tuning Guide also gives tips for alternative physical memory configurations, such as when only one DIMM is used per memory channel.

The command

numactl --hardware

checks whether the “size of memory on each node is as expected.”
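
On a machine with many nodes, the comparison can be automated. The sketch below parses the "node N size: X MB" lines that numactl --hardware prints and flags nodes that are noticeably smaller than the largest one. The sample text, the function names, and the 5 percent tolerance are assumptions made for illustration; they are not taken from the Tuning Guide.

```python
import re

# Hypothetical sample output for a two-node system; real output will differ.
SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16384 MB
node 0 free: 15000 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 8192 MB
node 1 free: 7900 MB
"""


def node_sizes(text):
    """Return {node_id: size_in_MB} parsed from `numactl --hardware` output."""
    pattern = re.compile(r"^node (\d+) size: (\d+) MB", re.MULTILINE)
    return {int(node): int(mb) for node, mb in pattern.findall(text)}


def undersized_nodes(sizes, tolerance=0.05):
    """Return nodes more than `tolerance` smaller than the largest node."""
    if not sizes:
        return []
    biggest = max(sizes.values())
    return sorted(n for n, mb in sizes.items() if mb < biggest * (1 - tolerance))


sizes = node_sizes(SAMPLE)
print(sizes)
print(undersized_nodes(sizes))
```

On a live system, the text could come from subprocess.run(["numactl", "--hardware"], capture_output=True, text=True).stdout. A non-empty result suggests a missing or mismatched DIMM on the listed nodes.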

The Guide also recommends the STREAM benchmark to verify memory bandwidth, and it advises on how to configure STREAM to analyze a Bulldozer-based system.
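
When reading STREAM results, it helps to know how the reported bandwidth is derived. The sketch below reproduces the benchmark's accounting for 8-byte double-precision elements: each kernel is credited with a fixed number of bytes per loop iteration, and 1 MB is counted as 10^6 bytes. The function name and the sample timing are hypothetical.

```python
# Bytes moved per loop iteration for each STREAM kernel with 8-byte doubles:
# copy and scale touch two arrays, add and triad touch three.
BYTES_PER_ITERATION = {"copy": 16, "scale": 16, "add": 24, "triad": 24}


def stream_mb_per_s(kernel, n_elements, seconds):
    """Bandwidth in MB/s (1 MB = 10**6 bytes), as STREAM counts it."""
    return BYTES_PER_ITERATION[kernel] * n_elements / seconds / 1.0e6


# Hypothetical timing: triad over 20 million elements finishing in 0.012 s.
print(stream_mb_per_s("triad", 20_000_000, 0.012))
```

Comparing the figure STREAM reports against the value expected for the installed memory is what reveals an empty channel or a mismatched DIMM.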

Lastly, the authors recommend running the demanding LINPACK test to check power usage, cooling, floating-point performance, and other parameters important to HPC environments.

Choices

Tuning and configuring memory is only the first step. To get the best performance, you will also need:

  • An operating system with full support for the “Bulldozer” enhanced features.
  • A compiler (for compiling any code that will run on your system) that supports the new “Bulldozer” instructions and optimizations.
  • Program libraries with “Bulldozer” support.

The Tuning Guide lists several operating systems with varying levels of support for “Bulldozer” chips. AMD distinguishes between compatible systems that boot and run on “Bulldozer” but do not take advantage of the full range of new features (e.g., Red Hat Enterprise Linux 6.1, SLES 11 SP1, and Ubuntu 10.10) and enabled systems that support some or all new features of the “Bulldozer” core architecture (e.g., RHEL 6.2, SLES 11 SP2, and Ubuntu 11.04). See the sidebar titled “Bulldozer-Ready Kernels” for a list of Linux kernels that support some or all “Bulldozer” features. Other operating systems will also boot on “Bulldozer”, including Windows Server 2008 and Solaris 10u9 and 11, although these alternative systems do not support the new instructions or offer “Bulldozer”-specific optimizations.
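
One quick sanity check on a running Linux system is to look for the new instruction sets in the flags line of /proc/cpuinfo, where they appear as fma4 and xop. The sketch below does this for a block of text. The sample excerpt is hypothetical and heavily shortened, the function names are illustrative, and a listed flag only indicates that the kernel reports the feature; it does not show that the distribution's compiler or libraries use it.

```python
# Flag names as they appear in the "flags" line of /proc/cpuinfo on Linux.
WANTED = ("fma4", "xop")


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text, wanted=WANTED):
    """Return the wanted flags that the cpuinfo text does not list."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in wanted if name not in flags]


# Hypothetical, heavily shortened cpuinfo excerpt.
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse2 avx xop\n"
print(missing_features(SAMPLE))
```

On a live system, pass in the contents of /proc/cpuinfo, for example open("/proc/cpuinfo").read().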

For compilers, the Tuning Guide recommends GCC 4.6.2, GCC 4.7, Open64 4.5, or PGI 11.9. As noted previously, the new floating-point instructions offer a significant performance edge, so if your program is heavy on floating-point operations, recompiling it with a compiler that supports the new instructions should pay off.

The libraries, especially math libraries, are another important consideration. Some users have experienced slower-than-expected performance from their AMD Opteron 6200 Series processor-based systems, only to discover later that the underlying libraries lacked support for the new FMA4 floating-point instruction and other “Bulldozer” features.

AMD recommends its own AMD Core Math Library (ACML) version 5.1.0 or later. See your Linux vendor documentation and release notes for more on FMA4 support.

Power Management

Power management is especially significant in HPC environments, and the Tuning Guide offers some tips for getting the system ready for real-world HPC workloads.

The Application Power Management (APM) system “… allows the processor to provide maximum performance while remaining within the specified power delivery and removal envelope.” “Bulldozer” adds a pair of additional boosted power states to the P-state power performance system familiar in x86 environments. You need to enable APM to take advantage of the frequency boost options. Also, be sure to enable HPC P-state mode if the BIOS option exists.

“Bulldozer” also supports an innovative new feature that lets you cap power used by the CPU at less than the full Thermal Design Power (TDP). If the power usage exceeds the cap, the APM system limits the P-state to stay below the cap. If the CPU is running at the highest conventional P-state frequency and the power usage is still below the TDP (or cap, if lower than the TDP), APM increases the frequency to one of the boost states.
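
The capping and boosting rules just described can be summarized as a small decision function. This is a toy model of the behavior as the article describes it, not AMD's actual hardware algorithm: the state labels, the one-step-at-a-time transitions, and the wattages in the usage lines are all assumptions made for illustration.

```python
# Fastest to slowest. "Pb0" and "Pb1" are hypothetical labels standing in
# for the two boost states; "P0" is the highest conventional P-state.
STATES = ["Pb0", "Pb1", "P0", "P1", "P2"]
HIGHEST_CONVENTIONAL = STATES.index("P0")


def next_state(state, power_w, tdp_w, cap_w=None):
    """Pick the next state from measured power, the TDP, and an optional cap."""
    # The effective limit is the cap if one is set below the TDP.
    limit = tdp_w if cap_w is None else min(tdp_w, cap_w)
    i = STATES.index(state)
    if power_w > limit:
        # Over the limit: step down to a slower state.
        i = min(i + 1, len(STATES) - 1)
    elif i <= HIGHEST_CONVENTIONAL and power_w < limit:
        # At P0 or already boosted, with headroom left: step up.
        i = max(i - 1, 0)
    return STATES[i]


print(next_state("P0", power_w=90, tdp_w=115))              # headroom: boost
print(next_state("P0", power_w=110, tdp_w=115, cap_w=100))  # over cap: slow down
```

In this model, lower conventional states such as P1 and P2 are left alone when there is headroom, on the assumption that choosing among them remains the operating system's job.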

Conclusions

The AMD Opteron 6200 Series processor offers some exciting features for HPC environments – as long as your software supports these enhanced features and your system is configured for optimum performance. Ensure that your memory is configured properly; get a “Bulldozer”-ready OS, compiler, and libraries; and spend some time studying and testing the advanced power management features to achieve optimal performance for HPC workloads.

If you are testing, configuring, planning, or shopping for an AMD Opteron 6200 Series processor HPC system based on the “Bulldozer” core architecture, download the “AMD Opteron 6200 Series Processors Linux Tuning Guide” for a practical introduction to the AMD Opteron 6200 Series processor in the real world.

Author Talk

ADMIN magazine spoke with AMD engineers Bill Brantley and Chip Freitag, two authors of the “AMD Opteron 6200 Series Processors Linux Tuning Guide,” about the goals for the guide and the best features of the AMD Opteron 6200 Series processors.

Q. Your team at AMD recently completed the AMD Opteron 6200 Series Processors Linux Tuning Guide. What is the Tuning Guide?

The Tuning Guide is a technical document on the Opteron 6200 “Bulldozer” chip series that puts the emphasis on the end user. Opteron 6200 chips have lots of great new features, but it is possible to do things in a way that doesn’t result in good performance. The Tuning Guide is intended to help users get the most from the potential of the new chips.
For instance, we knew we needed a section on getting started, even at the hardware level. With lots of memory channels, it is easy to deploy the “Bulldozer” chip in a way that leaves a memory channel empty. We put the beginning steps in to verify that the system is properly configured.
Later sections look at tuning. One of the features of the Tuning Guide is a section describing how you can verify that you are indeed getting the performance you desire.
The guide is targeted at the HPC audience. You’ll find information on the ability to boost or change the core frequency, as well as details on how to run LINPACK.

Q. The Tuning Guide includes lots of references to other documents for information on the AMD Opteron 6200 Series processors. In many ways, the real story with the Tuning Guide is the AMD Opteron 6200 Series processor itself and the advanced features that make it so tunable. What’s different about the Opteron 6200? What are some of the distinguishing features that make the 6200 series more advanced than your previous AMD chips? Maybe you could describe your personal list of favorite new features?

The AMD Opteron 6200 Series processors provide high-end peak power, but at the same time, they are flexible enough for general use. Floating-point operations are much faster with the fused multiply-add instruction – up to twice the floating-point performance over our previous generation.
The fused multiply-add instruction is defined in the IEEE 754-2008 standard, but Opteron 6200 is the first chip series to include the instruction in an x86 instruction set.
Also, cores are power gated in pairs: Bulldozer will boost the frequency of highly used cores and shut down unused or underused compute units. The hardware is capable of measuring the power used by all these cores; if a core is underused, the hardware scales up the others. If the operating system halts a core, Bulldozer scales up the remaining cores.