Computing hardware is constantly changing, with new CPUs and accelerators, and the integration of both. How do you know which processors are right for your code?

Profiling Is the Key to Survival

Know your enemy and know yourself, and you can fight a hundred battles without disaster. – Sun Tzu

There are three things extremely hard: steel, a diamond, and to know one’s self. – Benjamin Franklin

New Processors Are Happening

Some big changes are happening in the processor world right now. For the last 15 years or so, both the HPC world and the enterprise world have settled, for the most part, on x86 as their processor technology. Other processor technologies, such as IBM’s Power architecture, are still in use, but far less widely than x86.

In the last few years, accelerators have risen in popularity: GPUs (graphics processing units); dedicated processors that use lots of classic x86 cores, such as the Intel Xeon Phi; FPGAs (field-programmable gate arrays); and DSPs (digital signal processors). For applications and algorithms that can take advantage of them, these accelerators have the potential to provide much better performance than conventional CPUs. Plus, the performance per watt of these accelerators makes them compelling for a range of application classes.

A secondary theme in this trend is the rise of processors that are not based on x86. The best known alternative is ARM, which powers products such as the Raspberry Pi and some of the server-oriented products that Dell, HP, and Boston Limited have discussed. Even AMD has announced that it will create 64-bit ARM processors under its Opteron brand. These ARM-based processors focus on very power-efficient computing; that is, doing reasonable amounts of computing at very low power levels. Integrators take the ARM processors and create systems on a chip (SoCs) that combine the processor with the ancillary chips and interfaces in a very small package that typically uses less than 15W under load.

The Chinese government is also investing in new processors that use the MIPS64 instruction set. The processor family, called Loongson, has been under development for several years. In early 2013, the developers from the Institute of Computing Technology (ICT) at the Chinese Academy of Sciences will present the latest development, called the Godson-3B. This processor has eight cores running at 1.35GHz, reaching a theoretical peak of 172.8GFLOPS while using only 40W of power.

Other interesting new chips combine a classic CPU with an accelerator on a single chip. The first example is AMD’s Fusion product line of APUs (Accelerated Processing Units), which combine a CPU and a GPU on a single chip. For example, the recently announced AMD A10-5800K has the following specifications:

  • 4 cores at 3.8GHz (turbo to 4.2GHz)
  • 4MB L2 cache
  • 384 Radeon cores
  • 800MHz GPU clock speed
  • DDR3 1866MHz memory
  • 100W

Putting both the CPU and the GPU on the same processor allows the GPU to have access to system memory; even though it’s slower than typical GPU memory, it allows much larger GPU memory capacities. Moreover, because the memory is “unified” between the CPU and the GPU, data transfer between the two parts simply involves an exchange of pointers.

Texas Instruments has also very recently announced a chip that pairs an ARM processor with a TI DSP. It combines an ARM Cortex-A15 CPU (four cores), a Texas Instruments KeyStone DSP, a shared memory controller, an integrated fabric, and an I/O interface. The integrated fabric connects the ARM processor, the DSP, and the memory controller.

With so many changes happening – accelerators, non-x86 processors, and coupled CPUs and accelerators – how does one decide which processor is best or which ones work well with a given application or algorithm? In my opinion, the answer is given in the quotes at the beginning of the article but adapted for HPC – “know your application.” To know your application, develop a complete profile of it.

Profiling and Tracing

Two main types of tools can be used to develop a complete profile of an application. One is called profiling and the other is called tracing. It is important to differentiate between them because they are different but complementary. Profiling an application typically means aggregating or summarizing statistics about the application when it runs. Tracing, on the other hand, gathers data, also referred to as event histories, while the application runs and presents it as a time history. Profiling typically produces a fairly small amount of data, whereas tracing can produce a great deal of information.

Profiling will produce data such as how much wall clock time was spent in a routine or a set of nested loops. Tracing goes beyond this to monitor the system while the application is running, which really means monitoring “events” that happen on the system. For example, you could record the various kinds of cache misses or hits, translation lookaside buffer (TLB) misses, branch mispredictions, the number of instructions executed, the number of memory loads/stores, the number of floating-point operations per second, and so on. Typically, this data is presented as a time history (i.e., a plot versus time).
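To make the distinction concrete, here is a minimal Python sketch of the two views. The language choice, routine names, and timings are all assumptions made up for illustration; no real tool produces exactly this data. The same list of timed events is reduced to a profile (totals per routine) and also kept as a trace (events ordered in time).

```python
from collections import defaultdict

# Hypothetical event records: (start_time_s, routine_name, duration_s).
# The routine names and timings are invented for illustration only.
EVENTS = [
    (0.00, "read_input", 0.50),
    (0.50, "solve", 2.00),
    (2.50, "exchange_halo", 0.25),
    (2.75, "solve", 2.25),
    (5.00, "write_output", 0.75),
]

def profile(events):
    """Profiling view: aggregate total time and call count per routine."""
    totals = defaultdict(lambda: [0.0, 0])
    for _start, name, duration in events:
        totals[name][0] += duration
        totals[name][1] += 1
    return {name: (seconds, calls) for name, (seconds, calls) in totals.items()}

def trace(events):
    """Tracing view: keep every event, ordered by start time."""
    return sorted(events, key=lambda event: event[0])

if __name__ == "__main__":
    by_time = sorted(profile(EVENTS).items(), key=lambda kv: -kv[1][0])
    for name, (seconds, calls) in by_time:
        print(f"{name:14s} {seconds:6.2f} s in {calls} call(s)")
    for start, name, duration in trace(EVENTS):
        print(f"t={start:5.2f} s  {name} ran for {duration:.2f} s")
```

The profile tells you that the hypothetical solve routine dominates the run; only the trace tells you that it ran twice with a communication step in between.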

You can also use tools to gather more “global” information about the system during the run. You can gather information about general CPU load, networking information and statistics, I/O information, and OS information, including process scheduling, I/O scheduling, context switches, and so on.

Remember that the overall goal is to take all of this information – the timing information for various portions of the code, event histories, and system information – and create a picture of the application. I think of this picture as a true “profile” that can be used to understand how the application functions. This is very important in today’s climate, where processors and computing technologies are changing rapidly, because we need to identify parallel regions of the code that might be good for accelerators or modify the application to reduce cache misses, change the I/O pattern, and so on. All of this is focused on improving the performance of the application. (And who doesn’t like performance?)

Wikipedia has an extensive list of performance analysis tools. In this article, I’m only going to cover a few tool sets for profiling and tracing applications.

Application Profiling Tools

The first class of tools I will cover in this article is application profiling tools. Just about any compiler comes with a profiling tool. The GNU compilers come with a basic profiler called gprof. To prep the application for profiling when building code with GNU compilers, you use the -pg option. When you execute the application, it creates an output file called gmon.out or progname.gmon. Then, you can use gprof to analyze this file, producing two sets of information: (1) timing information that consists of the execution time spent in every function and the equivalent percentage of total run time, and (2) a call graph that shows, for each function in the program, which functions called it and which functions it called (its children).

The timing information created by gprof is probably the most immediately useful data, allowing you to see where the application spends most of its time. However, remember that it measures execution time on the basis of subroutines or functions, which makes it more difficult to find time “hotspots” in the middle of code. Usually, to do this, you need to instrument your code to record the time spent in various portions of it. This does mean modifying your code, but it might be worth the time.
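As a rough illustration of what instrumenting code by hand can look like, here is a minimal Python sketch. The timed helper, the region labels, and the work routine are all hypothetical names of my own; the same pattern applies in any language that offers a wall-clock timer.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds and pass counts per labeled region.
region_time = defaultdict(float)
region_hits = defaultdict(int)

@contextmanager
def timed(label):
    """Add the wall-clock time spent inside the with-block to `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        region_time[label] += time.perf_counter() - start
        region_hits[label] += 1

def work(n):
    """Stand-in for an application routine with two distinct phases."""
    with timed("setup"):
        data = [float(i) for i in range(n)]
    with timed("inner_loop"):
        total = 0.0
        for x in data:
            total += x * x
    return total

if __name__ == "__main__":
    for _ in range(3):
        work(200_000)
    for label, seconds in sorted(region_time.items(), key=lambda kv: -kv[1]):
        print(f"{label:12s} {seconds:8.4f} s over {region_hits[label]} pass(es)")
```

Because the timers wrap regions inside a single function, this exposes hotspots that a function-level profile would lump together.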

One thing to remember about gprof is that it is a “sample”-based profile tool. It uses system interrupts to take snapshots of the application’s progress, so it doesn’t precisely measure timings but instead uses statistical sampling to measure time.
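The idea behind statistical sampling can be sketched in a few lines of Python. This is not how gprof is implemented: gprof relies on timer interrupts, whereas this sketch swaps in a background thread that periodically looks at what the main thread is executing (using sys._current_frames, which is specific to CPython). The function names and the 5ms interval are assumptions for illustration.

```python
import collections
import sys
import threading
import time

def sample_profile(target, interval=0.005):
    """Run target() while a background thread samples its call stack.

    Returns (samples, result), where samples counts how often each function
    name was the innermost Python frame when a snapshot was taken.
    """
    samples = collections.Counter()
    target_thread = threading.get_ident()
    done = threading.Event()

    def sampler():
        # Event.wait doubles as the sleep between snapshots.
        while not done.wait(interval):
            frame = sys._current_frames().get(target_thread)
            if frame is not None:
                samples[frame.f_code.co_name] += 1

    watcher = threading.Thread(target=sampler, daemon=True)
    watcher.start()
    try:
        result = target()
    finally:
        done.set()
        watcher.join()
    return samples, result

def hot():
    """Deliberately expensive: busy loop for about 0.3 s of wall time."""
    end = time.perf_counter() + 0.3
    count = 0
    while time.perf_counter() < end:
        count += 1
    return count

def cold():
    """Cheap routine that a sampler will rarely, if ever, catch."""
    return sum(range(1000))

def main_work():
    cold()
    return hot()

if __name__ == "__main__":
    counts, _ = sample_profile(main_work)
    total = sum(counts.values())
    for name, hits in counts.most_common():
        print(f"{name:12s} {hits:4d} samples ({100.0 * hits / total:.0f}%)")
```

Note that the short-lived cold routine might never show up at all, which is the price of sampling: the numbers are estimates, and they only become trustworthy when the run is long relative to the sampling interval.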

A very good online tutorial on using gprof has lots of postprocessing examples, along with an explanation of the output. Another gprof tutorial and gprof quick-start guide can help you interpret results.

Another well-used profiling tool is Valgrind, a framework for analyzing applications as well as detecting memory and threading bugs, but I’m interested in it for its ability to do application profiling. Valgrind uses “dynamic binary instrumentation,” which allows it to work with precompiled binaries so you don’t have to recompile your applications. The Valgrind distribution comes with six production-quality tools:

  • Memory error detector (Memcheck)
  • Two thread error detectors (Helgrind and DRD)
  • Cache and branch-prediction profiler (Cachegrind)
  • Call-graph-generating cache and branch-prediction profiler (Callgrind)
  • Heap profiler (Massif)

It also includes three experimental tools:

  • Heap/stack/global array overrun detector
  • Second heap profiler that examines how heap blocks are used
  • SimPoint basic block vector generator

The Valgrind tools of interest for profiling applications are Cachegrind and Callgrind. Cachegrind simulates how the application interacts with a system’s cache hierarchy. It models separate first-level instruction (I1) and data (D1) caches, as found on typical processors. If the processor has three levels of cache, as many processors have, Cachegrind simulates the first-level and last-level caches, because the last level has the most influence on run time: A miss there means a trip to main memory. So in Valgrind-speak, it looks at the I1, D1, and LL (last-level) caches.
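To give a feel for what simulating a cache means, here is a toy Python sketch of a single direct-mapped cache. It is far simpler than what Cachegrind does, and the cache size (32KB), line size (64 bytes), and array dimensions are assumptions chosen for illustration. It counts misses for the same 512x512 array of 8-byte values walked in two different orders.

```python
def count_misses(addresses, cache_bytes=32 * 1024, line_bytes=64):
    """Count misses in a direct-mapped cache for a sequence of byte addresses."""
    num_slots = cache_bytes // line_bytes
    tags = [None] * num_slots        # memory line currently held by each slot
    misses = 0
    for addr in addresses:
        line = addr // line_bytes    # memory line containing this byte
        slot = line % num_slots      # direct-mapped: exactly one candidate slot
        if tags[slot] != line:
            tags[slot] = line        # evict whatever was there and load the line
            misses += 1
    return misses

def row_major(rows, cols, elem_bytes=8):
    """Addresses touched when walking a row-major array row by row."""
    return ((r * cols + c) * elem_bytes
            for r in range(rows) for c in range(cols))

def column_major(rows, cols, elem_bytes=8):
    """Same array and storage order, but walked column by column."""
    return ((r * cols + c) * elem_bytes
            for c in range(cols) for r in range(rows))

if __name__ == "__main__":
    n = 512
    print("row by row:      ", count_misses(row_major(n, n)), "misses")
    print("column by column:", count_misses(column_major(n, n)), "misses")
```

Under these assumptions, the row-by-row walk misses once per cache line (one access in eight), whereas the column-by-column walk misses on every access. This is the kind of pattern Cachegrind’s counts are meant to expose.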

The cache statistics that Cachegrind gathers are:

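  • I cache reads, which equals the number of instructions executed (Ir)
  • I1 cache read misses (I1mr)
  • LL cache instruction read misses (ILmr)
  • D cache reads, which equals the number of memory reads (Dr)
  • D1 cache read misses (D1mr)
  • LL cache data read misses (DLmr)
  • D cache writes, which equals the number of memory writes (Dw)
  • D1 cache write misses (D1mw)
  • LL cache data write misses (DLmw)
  • Conditional branches executed (Bc)
  • Conditional branches mispredicted (Bcm)
  • Indirect branches executed (Bi)
  • Indirect branches mispredicted (Bim)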
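Raw counts are easier to interpret as rates. The following Python sketch derives miss rates from the cache counters, assuming the counts are available in the “events:” and “summary:” lines of a Cachegrind output file under the event names Ir, I1mr, ILmr, Dr, D1mr, DLmr, Dw, D1mw, and DLmw. The sample text and its numbers are invented for illustration, and the exact file layout can vary between Valgrind versions, so treat the parser as a starting point rather than a reference.

```python
# Invented sample in the style of a Cachegrind output file (not real results).
SAMPLE = """\
events: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
summary: 1000000 1200 800 400000 20000 5000 100000 2500 1000
"""

def parse_summary(text):
    """Map event names to totals using the 'events:' and 'summary:' lines."""
    names, totals = [], []
    for line in text.splitlines():
        if line.startswith("events:"):
            names = line.split()[1:]
        elif line.startswith("summary:"):
            totals = [int(value) for value in line.split()[1:]]
    return dict(zip(names, totals))

def miss_rates(counts):
    """Derive miss rates (as fractions) from the raw event counts."""
    data_refs = counts["Dr"] + counts["Dw"]
    all_refs = counts["Ir"] + data_refs
    ll_misses = counts["ILmr"] + counts["DLmr"] + counts["DLmw"]
    return {
        "I1": counts["I1mr"] / counts["Ir"],
        "D1": (counts["D1mr"] + counts["D1mw"]) / data_refs,
        "LL": ll_misses / all_refs,
    }

if __name__ == "__main__":
    for cache, rate in miss_rates(parse_summary(SAMPLE)).items():
        print(f"{cache} miss rate: {100.0 * rate:.2f}%")
```

The last-level rate deserves the most attention because each of those misses goes to main memory.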

Most good Valgrind tutorials focus on its memory-checking features, which are not so useful for profiling, but at least one Valgrind tutorial covers profiling.

Processor Tracing

A second useful tool set for analyzing applications performs “tracing.” These tools capture “processor events” while the application is running to create time histories. Most modern processors have the ability to expose certain processor events, such as the number of different cache misses, TLB misses, branch mispredictions, instructions executed, memory loads/stores, floating point operations per second, and so on. Linux kernel modules can capture these events and make them generally available to userspace, then userspace tools can capture and manipulate the data and use it for analysis and profiling of the application. However, these events or counters vary from processor to processor, so what is really needed is a standard set of tools to capture a common set of counters that will help application tracing. Fortunately, such a thing exists: PAPI.

PAPI (Performance Application Programming Interface) is a cross-platform interface to hardware counters. PAPI has defined a standard set of events or counters across a number of platforms that have relevance to application tracing. It has a set of routines for accessing the counters both from a low-level perspective to control where certain events are recorded, and a high-level perspective that starts, stops, and accesses the counters.

PAPI relies on Linux kernel support for hardware performance counters. All kernels after about 2.6.32 should include this support (the perf_events interface); if your kernel is older than that, you can download the perfctr patch and apply it to your kernel. Some distributions enable performance counter support in their default kernels, but some do not, so you might have to rebuild the kernel to enable it.

A number of tools use PAPI, such as TAU.

Another tool that uses the Linux performance counters is simply called perf. Its code is maintained in the Linux kernel source tree (under tools/perf), and many Linux distributions package it. A nice tutorial on the perf wiki also explains how to use it.

System Profiling

At this point, you can glean some timing numbers from applications and some tracing information from the hardware performance counters that are common in modern processors. However, often you want to learn how your application is running from the system itself. This is similar to profiling the application, but it is extended to profile the system as a whole and can create quite a bit more data; however, it also gives you more information about how your application is performing.

Probably the most common system profiling tool is called OProfile. OProfile gathers system statistics using sampling techniques (similar to gprof and other profiling tools) over a period of time. It collects processor data from the hardware performance counters along with other data; then, you can run reports against the gathered data to analyze whatever you want. Using OProfile is fairly easy. Begin by starting up OProfile (opcontrol), then run your application (preferably on a quiet system, so you don’t get the effects of other applications running). After the application is done, you run a report to summarize the information in which you are interested. The steps are summarized in a cheat sheet online.

For help on using OProfile, look at the GitHub tutorial, the Red Hat tutorial, or the IBM tutorial. Also, a nice GUI for interacting with OProfile, called Visual OProfile, comes with a distribution called STLinux, but I can’t tell whether it’s open source or not (little information about it can be found).

System Tracing

In addition to system profiling, you can “trace” or watch system events or values over time. This generally applies to the system as a whole, but you can gain some reasonable insight into what the system is doing to support the application.
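As a minimal illustration of watching a system value over time, the following Python sketch samples the aggregate “cpu” line of /proc/stat at a fixed interval and reports the percentage of CPU time spent busy in each interval. This is only a sketch under the assumption of a Linux system; the function names, sample count, and interval are my own choices, and dedicated tools do this far more thoroughly.

```python
import time

def parse_cpu_line(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # user, nice, system, idle, iowait, irq, softirq, steal; the guest
            # fields that may follow are already counted in user and nice.
            values = [int(v) for v in fields[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values)
            return total - idle, total
    raise ValueError("no aggregate cpu line found")

def utilization(before, after):
    """Percent of CPU time spent busy between two (busy, total) samples."""
    busy = after[0] - before[0]
    total = after[1] - before[1]
    return 100.0 * busy / total if total else 0.0

def trace_cpu(samples=5, interval=1.0, path="/proc/stat"):
    """Yield (elapsed_seconds, percent_busy) pairs: a small time history."""
    start = time.time()
    with open(path) as stat_file:
        previous = parse_cpu_line(stat_file.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as stat_file:
            current = parse_cpu_line(stat_file.read())
        yield time.time() - start, utilization(previous, current)
        previous = current

if __name__ == "__main__":
    for elapsed, busy in trace_cpu():
        print(f"t={elapsed:5.1f} s  cpu busy {busy:5.1f}%")
```

Run it in a second terminal while your application executes, and you get a crude time history of how hard the system is working on its behalf.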

Probably the most common set of system tracing tools is the Sysstat utilities. Sysstat comprises several tools, such as sar, iostat, mpstat, and pidstat, that come with almost all Linux distributions and are very useful for tracing system activities.

Some very good tutorials on using the Sysstat tools, particularly sar, can be found on the Sysstat page, The Geek Stuff website, IBM developerWorks, HowtoForge (which also talks about ksar, a tool that plots the sar data graphically), a tutorial that also discusses how to use MySQL to store the sar data, and the Make Tech Easier website (a fairly simple tutorial).

A commercial tool, SarCheck, acts like sar but it goes further by creating a report of the monitored data and making recommendations on system tuning parameters to improve performance. Although I don’t have any experience with this tool or know anyone who has used it, it might be worth investigating.

Another tool that can be very useful for system tracing is called collectl, a fairly comprehensive package that is oriented a little more toward HPC because it adds the ability to trace InfiniBand and Lustre components. I have written about collectl in the past on the ADMIN HPC website, so I won’t cover it here.

MPI Profiling and Tracing

For HPC, it’s appropriate to discuss how to profile and trace MPI (Message-Passing Interface) applications. A number of MPI profiling tools are available, but you should check that they work with the MPI library you are using. Examples of tracing tools include MPE, which is associated with MPICH but works with other MPI implementations, TAU (mentioned under Processor Tracing), and Paraver. Remember, tracing really refers to collecting event histories so you can get an idea of how performance or events evolve over time. Although MPI profiling tools such as mpiP, FPMPI-2, and IPM focus more on aggregating statistics during run time to get totals rather than time histories, some of these tools add tracing as well.

Scalasca (see System Tracing) is a bit more than a profiling or tracing tool for MPI applications because it goes beyond the normal set of tools to provide guidance on the causes of performance bottlenecks and how they might be addressed.


Knowing your application is one of the keys to being able to improve it and, perhaps most importantly, being able to judge which architecture (or architectures) you should think about using. In essence, this is “knowing yourself” from an application perspective. This ability is very important in the current climate, where non-x86 processors are on the rise and where accelerators are also becoming more commonplace and diverse.

Two basic approaches are available to help you understand your application: profiling, which gathers summary data when an application is run, and tracing, which presents a history of events as a function of time when the application is executed. I believe both tools can be used to gather information about your application so that you can begin to paint a picture of how your application behaves and how it interacts with the system. In my opinion, just application profiling or tracing is not enough: You also need to profile and trace the system while the application is running so you get a much more complete picture of what the application is doing and what the system is doing to support the application or in response to it.

I hope this article has given you some starting points in finding tools and techniques to learn more about your application. My advice is to not be overwhelmed by the choices; rather, pick one or two of the tools and start using them. Once you become adept at using the tools, you will learn more about your application and start using the other tools to further develop a behavior profile. It will truly be worth the effort.