© fauxware, fotolia.com

Profiling application resource usage

Inside View

Article from ADMIN 13/2013
By
Computing hardware is constantly changing, with new CPUs and accelerators, and the integration of both. How do you know which processors are right for your code?

Know your enemy and know yourself, and you can fight a hundred battles without disaster. – Sun Tzu

There are three things extremely hard: steel, a diamond, and to know one's self. – Benjamin Franklin

Some big changes are happening in the processor world right now. For the past 15 years or so, both the HPC world and the enterprise world have settled, for the most part, on x86 as the processor technology. Other processor technologies, such as IBM's Power architecture, are still in use, but far less widely than x86.

In the past few years, accelerators have risen in popularity: GPUs (graphics processing units); dedicated processors that use lots of classic x86 cores, such as the Intel Xeon Phi; FPGAs (field-programmable gate arrays); and DSPs (digital signal processors). Compared with conventional CPUs, these accelerators have the potential to provide much better performance in applications and algorithms that can take advantage of them. Their performance per watt also makes them compelling for a range of application classes.

Among processors that are not based on x86, the best-known alternative is ARM [1], which powers products such as the Raspberry Pi and some of the server-oriented products that Dell, HP, and Boston Limited have discussed. Even AMD has announced that it will create 64-bit ARM-based Opteron processors.

These ARM-based processors are focused on very power-efficient computing; that is, doing reasonable amounts of computing at very low power levels. Integrators take the ARM processors and create Systems on a Chip (SoCs) that combine the processor, the ancillary chips, and the interfaces in a very small package that typically uses less than 15W under load.

The Chinese government is also investing in some new processors that use the MIPS64 definition. This processor family, called Loongson [2], has been under development for several years. In early 2013, the developers from the Institute of Computing Technology (ICT) at the Chinese Academy of Sciences will present the latest development, called the Godson-3B [3]. This processor has eight cores running at 1.35GHz, reaching a theoretical peak of 172.8GFLOPS but using only 40W of power.
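
The quoted Godson-3B figures can be sanity-checked with a little arithmetic. The following is only an illustrative sketch: the function names are made up for this example, and the per-core operations-per-cycle value is inferred from the numbers quoted above, not taken from a datasheet.

```python
def flops_per_cycle_per_core(peak_gflops, cores, clock_ghz):
    """Infer how many floating-point operations each core must
    complete per clock cycle to reach the quoted peak."""
    return peak_gflops / (cores * clock_ghz)


def gflops_per_watt(peak_gflops, watts):
    """Theoretical peak performance per watt."""
    return peak_gflops / watts


if __name__ == "__main__":
    # Figures quoted in the text: 8 cores, 1.35GHz, 172.8GFLOPS, 40W
    print(round(flops_per_cycle_per_core(172.8, 8, 1.35), 2))
    print(round(gflops_per_watt(172.8, 40), 2))
```

With the figures above, this works out to about 16 floating-point operations per core per cycle and roughly 4.3GFLOPS per watt at theoretical peak.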

Other interesting new chips combine a classic CPU with an accelerator on a single chip. The first example of this is AMD's Fusion product line called APU (Accelerated Processing Unit) [4].

This processor combines a CPU with a GPU on a single chip. For example, the recently announced AMD A10-5800K has the following specifications:

  • 4 cores at 3.8GHz (turbo to 4.2GHz)
  • 4MB L2 cache
  • 384 Radeon cores
  • 800MHz GPU clock speed
  • DDR3 1866MHz memory
  • 100W

Putting both the CPU and the GPU on the same processor allows the GPU to have access to system memory; even though it's slower than typical GPU memory, it allows much larger GPU memory capacities. Moreover, because the memory is "unified" between the CPU and the GPU, data transfer between the two parts simply involves an exchange of pointers.

Texas Instruments has also very recently announced a chip that pairs an ARM processor with a TI DSP. It combines an ARM Cortex-A15 CPU (four cores), a Texas Instruments KeyStone DSP, a shared memory controller, an integrated fabric, and an I/O interface. The integrated fabric connects the ARM processor, the DSP, and the memory controller.

With so many changes happening – accelerators, non-x86 processors, and coupled CPUs and accelerators – how do you decide which processor is best or which ones work well with a given application or algorithm? In my opinion, the answer is given in the quotes at the beginning of the article but adapted for HPC – "know your application." To know your application, develop a complete profile of it.

Profiling and Tracing

Two main types of tools can be used to develop a complete profile of an application: profiling tools and tracing tools. It is important to differentiate between profiling and tracing [5] because they are different but complementary. Profiling an application typically means aggregating or summarizing statistics of the application when it runs. Tracing, on the other hand, gathers data, also referred to as event histories, while the application runs and presents it as a time history. Profiling typically produces a fairly small amount of data, whereas tracing can produce a great deal of information.

Profiling will produce data such as how much wall clock time was spent in a routine or a set of nested loops. Tracing goes beyond this to monitor the system while the application is running, which is really monitoring "events" that happen on the system. For example, you could measure the number of different cache misses or hits, translation lookaside buffer (TLB) misses, branch mispredictions, instructions executed, memory loads/stores, floating-point operations, and so on. Typically, this data is presented as a time history (i.e., a plot versus time).
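
To make the distinction concrete, the sketch below starts from a small, made-up event history (a trace, where every event keeps its timestamp) and collapses it into per-region totals (a profile). The region names and timings are hypothetical and serve only to show the difference in shape and size between the two kinds of data.

```python
from collections import defaultdict

# Hypothetical trace: one record per event, in time order.
# Each record is (start_time_s, region_name, duration_s).
trace = [
    (0.00, "read_input", 0.20),
    (0.20, "solve", 1.50),
    (1.70, "write_output", 0.10),
    (1.80, "solve", 1.20),
    (3.00, "write_output", 0.10),
]


def summarize(events):
    """Collapse a trace into a profile: call count and total time
    per region. The timestamps (the time history) are discarded."""
    profile = defaultdict(lambda: {"calls": 0, "total_s": 0.0})
    for _start, region, duration in events:
        profile[region]["calls"] += 1
        profile[region]["total_s"] += duration
    return dict(profile)


if __name__ == "__main__":
    for region, stats in summarize(trace).items():
        print(region, stats["calls"], round(stats["total_s"], 2))
```

The trace grows with every event the application generates, whereas the profile has one entry per region no matter how long the run is.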

You can also use tools to gather more "global" information about the system during the run. You can gather information about general CPU load, networking information and statistics, I/O information, and OS information, including process scheduling, I/O scheduling, context switches, and so on.
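
On Linux, much of this "global" information is exposed as text under /proc. As a minimal sketch, assuming the usual /proc/stat layout (an aggregate "cpu" line of tick counters plus single-value lines such as "ctxt"), the helper below extracts CPU time and context-switch counters. The function name is made up for this example; taking a snapshot before and after a run and subtracting gives the activity during the run.

```python
from pathlib import Path

# Leading fields of the aggregate "cpu" line, in order.
CPU_FIELDS = ("user", "nice", "system", "idle",
              "iowait", "irq", "softirq", "steal")
SCALAR_KEYS = ("ctxt", "processes", "procs_running", "procs_blocked")


def parse_proc_stat(text):
    """Parse /proc/stat-style text into a dictionary of counters."""
    info = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "cpu":
            # Aggregate CPU time over all cores, in clock ticks
            ticks = [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]
            info["cpu"] = dict(zip(CPU_FIELDS, ticks))
        elif parts[0] in SCALAR_KEYS:
            info[parts[0]] = int(parts[1])
    return info


if __name__ == "__main__":
    stat_file = Path("/proc/stat")
    if stat_file.exists():  # only present on Linux
        print(parse_proc_stat(stat_file.read_text()))
```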

The overall goal is to take all of this information – the timing information for various portions of the code, event histories, and system information – and create a picture of the application. I think of this picture as a true "profile" that can be used to understand how it functions. This information is very important in today's climate, where processors and computing technologies are changing rapidly, because you need to identify parallel regions of the code that might be good for accelerators or modify the application to reduce cache misses, change the I/O pattern, and so on. All of this is focused on improving the performance of the application. (And who doesn't like performance?)

Wikipedia provides an extensive list of performance analysis tools [6]. In this article, I'm only going to cover a few toolsets for profiling and tracing applications.

Application Profiling Tools

The first class of tools I will cover in this article is application profiling tools. Most compilers come with a profiling tool. The GNU compilers come with a basic profiler called gprof [7]. To prep the application for profiling when building code with GNU compilers, you use the -pg option. When you execute the application, it creates an output file called gmon.out or progname.gmon. Then, you can use gprof to analyze this file, producing two sets of information: (1) timing information that consists of the execution time spent in every function and the equivalent percentage of total run time, and (2) a call graph showing, for each function in the program, which functions called it and which functions it called (its children).
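
gprof itself works on compiled code, so it cannot be shown in a few self-contained lines here. As a rough analogue only, the sketch below uses Python's built-in cProfile and pstats modules to produce the same two kinds of output for a toy program: a flat timing profile and caller information. Note that cProfile is a deterministic profiler, not gprof, and the functions inner and outer are made up for this example.

```python
import cProfile
import io
import pstats


def inner(n):
    return sum(i * i for i in range(n))


def outer():
    return [inner(20_000) for _ in range(20)]


def profile_report(func):
    """Run func under cProfile and return a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # (1) Flat profile: time per function, sorted by self time
    stats.sort_stats("tottime").print_stats(5)
    # (2) Call-graph information: who called inner()
    stats.print_callers("inner")
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(outer))
```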

The timing information created by gprof is probably the most immediately useful data, allowing you to see where the application spends most of its time. However, it measures execution time on the basis of subroutines or functions, which makes it more difficult to find "hotspots" in the middle of code. For this, you usually need to instrument your code to add the time spent in various portions of the code. This does mean modifying your code, but it might be worth the time.
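
Manual instrumentation of this kind can be sketched in a few lines. The example below is an illustration in Python, not a gprof feature: the timed helper and the region names are made up, and the same idea applies in any language by reading a clock before and after the region of interest and accumulating the difference.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall clock time per named region, in seconds
region_totals = defaultdict(float)


@contextmanager
def timed(region):
    """Add the wall clock time spent in the with-block to region_totals."""
    start = time.perf_counter()
    try:
        yield
    finally:
        region_totals[region] += time.perf_counter() - start


def run():
    with timed("setup"):
        data = list(range(200_000))
    with timed("hot_loop"):
        total = 0
        for x in data:
            total += x * x
    return total


if __name__ == "__main__":
    run()
    for region, seconds in region_totals.items():
        print(f"{region}: {seconds:.4f}s")
```

Because the timers wrap arbitrary blocks and not whole functions, they can isolate a hotspot in the middle of a routine.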

One thing to remember about gprof is that it is a "sample"-based profile tool. It uses system interrupts to take snapshots of the application's progress, so it doesn't precisely measure timings but instead uses statistical sampling to measure time.
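
The sampling idea can be illustrated with a toy profiler. This is a sketch of the general technique and not gprof's implementation: it assumes a Unix-like system and must run in the main thread. A CPU-time interval timer delivers SIGPROF periodically, and the handler simply counts which function was executing at that moment. All function names here are made up.

```python
import signal
import time
from collections import Counter

samples = Counter()


def _on_sample(signum, frame):
    # Record the function that was running when the timer fired.
    if frame is not None:
        samples[frame.f_code.co_name] += 1


def busy(cpu_seconds):
    """Burn roughly cpu_seconds of CPU time."""
    start = time.process_time()
    n = 0
    while time.process_time() - start < cpu_seconds:
        n += 1
    return n


def sample_profile(func, *args, interval=0.005):
    """Run func while sampling every `interval` seconds of CPU time.
    The handler is deliberately left installed afterward."""
    samples.clear()
    signal.signal(signal.SIGPROF, _on_sample)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        func(*args)
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0)  # stop the timer
    return dict(samples)


if __name__ == "__main__":
    print(sample_profile(busy, 0.3))
```

The sample counts are estimates: a function that runs for less than one sampling interval might not show up at all, which is why sampled timings are statistical and not exact.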

A very good online tutorial on using gprof [8] has lots of postprocessing examples, along with an explanation of the output. The IBM developerWorks gprof tutorial is also good [9], and a gprof quick-start guide [10] can help you interpret results.

Another widely used profiling tool is Valgrind [11], a framework for analyzing applications as well as detecting memory and threading bugs, but I'm interested in it for its ability to do application profiling. Valgrind uses "dynamic binary instrumentation," which allows it to work with precompiled binaries so you don't have to recompile your applications. The Valgrind distribution comes with six tools:

  • Memory error detector
  • Two thread error detectors
  • Cache and branch-prediction profiler (cachegrind)
  • Call-graph-generating cache and branch-prediction profiler (callgrind)
  • Heap profiler

and three experimental tools:

  • Heap/stack/global array overrun detector
  • Second heap profiler that examines how heap blocks are used
  • SimPoint basic block vector generator

The tools of interest for profiling applications within Valgrind are cachegrind [12] and callgrind [13]. The cachegrind tool primarily simulates how the application interacts with a system's cache hierarchy. It models separate first-level instruction (I1) and data (D1) caches, as found in typical processors.

If the processor has more than two levels of cache, as most processors have, then Cachegrind simulates only the first-level and last-level caches, because the last-level cache has the most influence on run time. So, in Valgrind-speak, it looks at I1, D1, and LL (last-level) caches.

The cache statistics that Cachegrind gathers are:

  • I1 cache reads (instructions executed)
  • I1 cache read misses
  • LL cache instruction read misses
  • D1 cache reads
  • D1 cache read misses
  • LL cache data read misses
  • D1 cache writes
  • D1 cache write misses
  • LL cache data write misses
  • Conditional branches executed
  • Conditional branches mispredicted
  • Indirect branches executed
  • Indirect branches mispredicted
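
To give a feel for what simulating a cache means, here is a deliberately tiny model of a single data cache that counts reads and read misses. It is a toy sketch, not Cachegrind's implementation: the class name, the cache geometry (32KB, 64-byte lines, 8-way, LRU replacement), and the 256x256 matrix of 8-byte elements are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement."""

    def __init__(self, size_bytes=32 * 1024, line_bytes=64, ways=8):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.reads = 0
        self.read_misses = 0

    def read(self, address):
        self.reads += 1
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # mark as most recently used
            return
        self.read_misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)   # evict least recently used
        lines[tag] = True


def count_misses(row_major, rows=256, cols=256, elem_bytes=8):
    """Read every element of a row-major matrix once, in the given
    traversal order, and return the number of read misses."""
    cache = SetAssociativeCache()
    outer, inner = (rows, cols) if row_major else (cols, rows)
    for i in range(outer):
        for j in range(inner):
            r, c = (i, j) if row_major else (j, i)
            cache.read((r * cols + c) * elem_bytes)
    return cache.read_misses


if __name__ == "__main__":
    print("row-major misses:   ", count_misses(True))
    print("column-major misses:", count_misses(False))
```

In this toy model, both traversals issue 65,536 reads, but the row-major traversal misses only once per cache line (8,192 misses), whereas the column-major traversal misses on every read. Differences of that kind are what the read-miss counters in the list above are meant to expose.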

Most good tutorials on Valgrind focus primarily on its memory-checking features, which are not so useful for profiling, but one Valgrind tutorial exists [14].

Processor Tracing

A second useful tool set for analyzing applications performs "tracing." These tools capture "processor events" while the application is running to create time histories. Most modern processors have the ability to expose certain processor events, such as the number of different cache misses, TLB misses, branch mispredictions, instructions executed, memory loads/stores, floating-point operations, and so on.

Linux kernel modules can capture these events and make them generally available to userspace. Then, userspace tools can capture and manipulate the data and use it for analysis and profiling of the application. However, these events or counters vary from processor to processor, so what is really needed is a standard set of tools to capture a common set of counters that will help application tracing. Fortunately, such a thing exists: PAPI [15].

PAPI (Performance Application Programming Interface) is a cross-platform interface to hardware counters. PAPI has defined a standard set of events or counters across a number of platforms that have relevance to application tracing. It provides routines for accessing the counters at two levels: a low-level interface that controls where certain events are recorded, and a high-level interface that starts, stops, and reads the counters.
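
The high-level start/stop/read pattern can be sketched without PAPI itself. The class below is hypothetical and is not PAPI's API: because the Python standard library has no binding to hardware counters, it substitutes operating-system software counters from resource.getrusage (Unix only) purely to show how such an interface brackets a region of code.

```python
import resource


class SoftwareCounters:
    """Hypothetical start/stop/read wrapper around getrusage.
    These are OS software counters, NOT hardware counters."""

    FIELDS = {
        "user_time_s": "ru_utime",
        "minor_page_faults": "ru_minflt",
        "major_page_faults": "ru_majflt",
        "voluntary_ctx_switches": "ru_nvcsw",
        "involuntary_ctx_switches": "ru_nivcsw",
    }

    def __init__(self):
        self._start = None
        self._values = None

    def _snapshot(self):
        usage = resource.getrusage(resource.RUSAGE_SELF)
        return {name: getattr(usage, attr)
                for name, attr in self.FIELDS.items()}

    def start(self):
        self._start = self._snapshot()
        self._values = None

    def stop(self):
        if self._start is None:
            raise RuntimeError("start() must be called before stop()")
        now = self._snapshot()
        self._values = {k: now[k] - self._start[k] for k in now}

    def read(self):
        if self._values is None:
            raise RuntimeError("stop() must be called before read()")
        return dict(self._values)


if __name__ == "__main__":
    counters = SoftwareCounters()
    counters.start()
    sum(i * i for i in range(500_000))  # region being measured
    counters.stop()
    print(counters.read())
```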

PAPI relies on kernel support for hardware counters: either the Linux-perfctr patch [16] or the Linux "performance counters" interface, which all kernels after about 2.6.32 should have [17]. If your kernel is older, you can download the perfctr patch and apply it to your kernel. Some distributions enable performance counter support in their default kernels, but others do not, so you might have to rebuild the kernel to enable it.

A number of tools use PAPI, such as:

  • HPCView
  • TAU
  • Scalasca
  • HPCToolkit
  • IPM
  • Open|SpeedShop
  • PerfSuite
  • SCALEA
  • Titanium
  • Vampir

Another tool that uses the Linux performance counters is simply called perf. Its source code is maintained in the Linux kernel tree (under tools/perf), and many Linux distributions package it. A nice tutorial on the wiki [18] also explains how to use it.
