 
        		HPC Storage is arguably one of the most pressing issues in HPC. Selecting various HPC Storage solutions is a problem that requires some research, study, and planning to be effective – particularly cost-effective. Getting this process started usually means understanding how your applications perform I/O. This article presents some techniques for examining I/O patterns in HPC applications.
HPC Storage – Getting Started with I/O Profiling
It's fairly obvious to say that storage is one of the largest issues, if not the largest issue, in HPC. All aspects of HPC storage are becoming critical to the overall success or productivity of systems: high performance, high reliability, access protocols, scalability, ease of management, price, power, and so on. These aspects, and perhaps more importantly, combinations of these aspects, are key drivers in HPC systems and HPC performance.
With so many options and so many key aspects to HPC storage, a logical question one can ask is: Where should one start? Will a NAS (Network Attached Storage) solution work for my system? Do I need a high-performance parallel file system? Do I use a SAN (Storage Area Network) as a back end for my storage, or can I use less expensive DAS (Direct Attached Storage)? Should I use InfiniBand for the compute node storage traffic or will GigE or 10GigE be sufficient? Do I use 15,000 rpm drives or should I use 7,200 rpm drives? Do I need SSDs (Solid State Disks)? Which file system should I use? Which I/O scheduler should I use within Linux? Should I tune my network interfaces? How can I take snapshots of the storage and do I really need snapshots? How can I tell if my storage is performing well enough? How do I manage my storage? How do I monitor my storage? Do I need a backup or do I really only need a copy of the data? How can I monitor the state of my storage? Do I need quotas and how do I enforce them? How can I scale my storage in terms of performance and capacity? Do I need a single namespace? How can I do a filesystem check, how long will it take, and do I need one? Do I need cold spare drives or storage chassis? What RAID level do I need (file or object)? How many hot spares is appropriate? SATA versus SAS? And on, and on. When designing or selecting HPC storage, you have many questions to consider, but you might have noticed one item that I left out of the laundry list of questions. If you noticed I didn't discuss applications, you are correct.
Designing HPC storage, just as with designing the compute nodes and network layout, starts with the applications. Although designing hardware and software for storage is very important, without considering the I/O needs of your application mix, the entire process is just an exercise with no real value. You are designing a solution without understanding the problem. So the first thing you really should consider is the I/O pattern and I/O needs of your applications.
Some people might think that understanding application I/O patterns is a futile effort because, with thousands of applications, they think it's impossible to understand the I/O pattern of them all. However, it is possible to stack-rank the applications beginning with the few that use the most compute time or seemingly use a great deal of I/O. Then, you can begin to examine their I/O patterns.
Tools to Help Determine I/O Usage
You have several options for determining the I/O patterns of your applications. One obvious way is to run the application and monitor storage while it is running. Measuring I/O usage on your entire cluster might not always be the easiest thing to do, but some basic tools can help you. For example, tools such as sar, iotop, iostat, nfsiostat, and collectl can be used to help you measure your I/O usage. (Note: This is by all means not an exhaustive list of possible tools.)
iotop
Using iotop you can get quite a few I/O stats for a particular system. It runs on a single system, such as a storage server or a compute node, and measures the I/O on that system for all processes, but on a per-process basis, allowing you to watch particular processes (such as an HPC application). One way to use this tool is to run it on all of the compute nodes that are running a particular application, perhaps as part of a job script. When the MPI job runs, you will get an output file for each node, but before collecting them together, be sure you use a unique name for each node. Once you have gathered all of the output files, you should sort the data to get only the information for the MPI application. Finally, you can take the data for each MPI process and create a time history of the I/O usage of the application and perhaps find which process is creating the most I/O.
Depending on the storage solution you are using, you might be able to use iotop to measure I/O usage on the data servers. For example, if you are using NFS on a single server, you could run iotop on that server and measure I/O usage for the nfsd processes.
However, using iotop to measure I/O patterns really only gives you an overall picture without a great deal of detail. For example, it is probably impossible to determine whether the application is doing I/O to a central storage server or to the local node. It is impossible to use iotop to determine how the application is doing the I/O. Moreover, you really only get an idea of the throughput and not the IOPS that the application is generating.
iostat
Iostat allows you to collect quite a few I/O statistics, as well, and even allows you to specify a particular device on the node, but it does not separate out I/O usage on a per-process basis (i.e., you get an aggregate view of all I/O usage on a particular compute node). However, iostat gives you a much larger array of statistics than iotop. In general, you get two types of reports with iostat. (You can use options to get one report or the other, but the default is both types of reports.) The first report has CPU usage, and the second report shows device utilization. The CPU report contains the following information:
- %user: Percentage of CPU utilization that occurred while executing at the user level (this is application usage).
- %nice: Percentage of CPU utilization that occurred while executing at the user level with nice priority.
- %system: Percentage of CPU utilization that occurred while executing at the system level (kernel).
- %iowait: Percentage of time the CPU or CPUs were idle, during which the system had an outstanding disk I/O request.
- %steal: Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
- %idle: Percentage of time the CPU or CPUs were idle and the systems did not have an outstanding disk I/O request.
The values are computed as system-wide averages for all processors when your system has more than one core (which is pretty much everything today).
The second report prints out all kinds of details about device utilization (can be a physical device or a partition). If you don't use a device on the command line, then iostat will print out values for all devices (alternatively, you can use ALL as the device). Typically the report output includes the following:
- Device: Device name.
- rrqm/s: Number of read requests merged per second that were queued to the device.
- wrqm/s: Number of write requests merged per second that were queued to the device.
- r/s: Number of read requests issued to the device per second.
- w/s: Number of write requests issued to the device per second.
- rMB/s: Number of megabytes read from the device per second.
- wMB/s: Number of megabytes written to the device per second.
- avgrq-sz: Average size (in sectors) of the requests issued to the device.
- avgqu-sz: Average queue length of the requests issued to the device.
- await: Average time (milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in a queue and the time spent servicing them.
- svctm: Average service time (milliseconds) for I/O requests issued to the device.
- %util: Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this values is close to 100%.
These fields are specific to the set of options used, as well as to the device.
Relative to iotop, iostat gives more detail on the actual I/O, but it does it in an aggregate manner (i.e., not a per-process basis). Perhaps more importantly, iostat reports on a device basis, not a mountpoint basis. Because the HPC application is likely writing to a central storage server mounted on the compute node, it is not very likely that the application is writing to a device on each node (however, you can capture local I/O using iostat), so the most likely use of iostat is on the storage server or servers.
nfsiostat
A tool called nfsiostat, which is very similar to iostat, is part of the sysstat collection, but it allows you to monitor NFS-mounted filesystems, so this tool can be used for MPI applications that are using NFS as a central filesystem. It produces the same information as iostat (see above). In the case of HPC applications, you would run nfsiostat on each compute node and collect the information and process it in much the way you did with iotop.
sar
sar is one of the most common tools for gathering information about system performance. Many admins commonly use sar to gather information about CPU utilization and I/O and network information on systems. I won't go into detail about using sar because there are so many articles around the web about it, but it can be used to examine the I/O pattern of HPC applications at a higher level.
Like iotop and nfsiostat, you would run sar on each compute node and gather the statistical I/O information, or you could just let it gather all information and sort out the I/O statistics. Then, you could gather all of that information together and create a time history of overall application I/O usage.
collectl
collectl is sort of an uber-tool for collecting performance stats on a system. It is more HPC oriented than sar, iotop, iostat, or nfsiostat because it has hooks to monitor NFS and Lustre. But it too requires that you run it on all nodes to get an understanding of what I/O load is imposed throughput the cluster.
Like sar, you run collectl on each compute node and gather the I/O information for that node. However, it does allow you to gather information about processes and threads, so you can capture a bit more information than sar. Then, you have to gather all of the information for each MPI process or compute node and create a time history of the I/O pattern of the application.
A similar tool called collectd is run as a daemon to collect performance statistics much like collectl.
The above tools can help you understand what is happening on individual systems, but you have to gather the information on each compute node or for each MPI process and create your time history or statistical analysis of the I/O usage pattern. Moreover, they don't do a good job, if at all, of watching IOPS usage on systems, and IOPS can be a very important requirement for HPC systems. Other tools can help you understand more detailed I/O usage, but at a block level, allowing you to capture more information, such as IOPS.
blktrace
For example, blktrace can watch what is happening on specific block devices, so you can use this tool on storage servers to watch I/O usage. For example, if you are using a Linux NFS server, you could watch the block devices underlying the NFS file system, or if you are using Lustre, you could use blktrace to monitor block devices on the OSS nodes.
Blktrace can be a very useful tool because it also allows you to compute IOPS (I think it's only "Total IOPS"). Also, a tool called seekwatcher can be used to plot results from blktrace. An example on the website illustrates this.
Obviously, no one tool can give you all the information you need across a range of nodes. Perhaps a combination of iotop, iostat, nfsiostat, and collectl coupled with blktrace can give you a better picture of what your HPC storage is doing as a whole. Coordinating this data to generate a good picture of I/O patterns is not easy and will likely involve some coding. But if you assume that you can create this picture of I/O usage, you have to correlate it with the job history from the job scheduler to help determine which applications are using more I/O than others.
However, these tools only tell you what is happening in aggregate, focusing primarily on throughput, although blktrace can give you some IOPS information. They can't really tell you what the application is doing in more detailed, such as the order of reads and writes, the amount of data in each read or write function, and information on lseeks or other I/O functions. In essence, what is missing is the ability to look at I/O patterns from the application level. In the next section, I'll present a technique and application that can be used to help you understand the I/O pattern of your application.
Determining I/O Patterns
One of the keys to selecting HPC storage is to understand the I/O pattern of your application. This isn't an easy task to accomplish overall and several attempts have been made over the years to help you understand I/O patterns. One method I have been using is to use strace (system trace) to profile the I/O pattern of an application.
Because virtually all I/O from an application will use system libraries, you can use strace to capture the I/O profile of an application. For example, using the command
strace -T -ttt -o strace.out [application]
on an [application] doing I/O, might output a line like this:
1269925389.847294 write(17, " 37989 2136595 2136589 213"..., 3850) = 3850 <0.000004>
This single line has several interesting pieces. First, the amount of data written appears after the = sign (in this case, 3,850 bytes). Second, the amount of time used to complete the operation is on the far right in the angle brackets (< >) (in this case, 0.000004 seconds). Third, the data sent to the function is listed inside the quotes and is abbreviated, so you don’t see all of the data. This can be useful if you are examining an strace of an application that has sensitive data. The fourth piece of useful information is the first argument to the write() function, which contains the file descriptor to the specific file (in this case, it is fd 17). If you track the file associated with open() functions (and close() functions), you can tell which file the I/O function is operating upon.
From this information, you can start to gather all kinds of statistics from the application. For example, you can count the number of read() or write() operations and how much data is in each IO operation. This data can then be converted into throughput data in MB/s (or more). You can also count the number of I/O functions in a period of time to estimate the IOPS needed by the application. Additionally, you can do all of this on a per-file basis to find which files have more I/O operations than others. Then, you can perform a statistical analysis of this information, including histograms of the I/O functions. This information can be used as a good starting point for understanding the I/O pattern of your application.
However, if you run strace against your application, you are liable to end up with thousands, if not hundreds of thousands, of lines of output. To make things a bit easier, I have developed a simple program in Python that goes through the strace output and does this statistical analysis for you. The application, with the ingenious name "strace_analyzer," scans the entire strace output and creates an HTML report of the I/O pattern of the application, including plots (using matplotlib).
To give you an idea of what the HTML output from strace_analyzer looks like, you can look at this snippet of the first part of the major output (without plots).
This is only the top portion of the report; the rest of the report contains plots – sometimes quite a few of them. The analysis is done for a single process (without threading at this time), and it gives you all sorts of useful statistical information about the application.
One subtle thing to notice about using strace to analyze I/O patterns is that you are getting the strace output from a single application. Because I’m interested in HPC Storage here, many of the applications will be MPI applications, for which you get one strace output per MPI process. Getting strace output for each MPI process isn't difficult and doesn't require that the application be changed. You can get a unique strace output file for each MPI process, but again, you will get a great deal of output from each strace output file. To help with this, strace_analyzer creates an output file (a “pickle” in Python-speak), and you can take the collection of these files for an entire MPI application and perform the same statistical analysis across the MPI processes. The tool, called "MPI Strace Analyzer", also produces an HTML report across the MPI processes. If you look in the Appendix to this article, you will see a full report from an example application (eight-core LS-Dyna run that uses a simple central NFS storage system).
What Do I Do with This Information?
Now that you have all of this statistical information about your applications, what do you do with it? I'm glad you asked. The answer is that you can do quite a bit. The first thing I always look for is how many processes in the MPI application actually do I/O, primarily write(). You will be very surprised that many applications have only a single MPI process, typically the rank-0 process, doing all of the I/O for the entire application. However, there is a second class of applications in which a fixed number of MPI processes perform I/O, and this number is less than the total number of MPI processes. Finally, a third class of applications have all, or virtually all, processes writing to the same file at the same time (MPI-IO). If you don't know whether your applications fall into one of these three classes, you can use mpi_strace_analyzer to help determine that.
Just knowing whether your application has a single process doing I/O or whether it uses MPI-IO is a huge step toward making an informed decision about HPC storage. The simple reason is that running MPI-IO codes on NFS storage is really not recommended, although it is possible. Instead, the general rule of thumb is to run MPI-IO codes on parallel distributed storage that have tuned MPI-IO implementations. (Please note that these are general rules of thumb, and it is possible to run MPI-IO codes on NFS storage and non-MPI-IO codes on parallel distributed storage).
Other useful information obtained from examining the application I/O pattern, such as
- throughput requirements (read and write),
- IOPS requirements (write IOPS, read IOPS, total IOPS),
- sizes of read() and write() function calls (i.e., the distribution of data sizes),
- time spent doing I/O versus total run time, and
- lseek information,
can be used to determine not only what kind of HPC storage you need, but also how much performance you need from the storage. Is your application dominated by throughput performance or IOPS? What is the peak IOPS obtained from the strace output? How much time is spent doing I/O versus total run time?
If you look at the specific example in the Appendix, you could make the following observations from the HTML report.
- The MPI process associated with file_18597 spends the largest amount of time doing I/O. However, it is only 1.15% of the total run time. At best, I could only ever improve the wall clock time by 1.15% by adding more I/O capability to the system. This can also be seen in Appendix Figure 1.
- If you examine Appendix Table 2, which counts the number of times a specific I/O function is called, you can see that the MPI processes associated with file_18597 have the largest number of lseek(), write(), open(), fstat(), and stat() function calls of all of the processes. However, the MPI process associated with file_18590 also does quite a bit of I/O, as can be seen in Appendix Figure 2, which plots the major I/O functions ( read(), write(), lseek(), open(), and close() ).
- To help resolve whether file_18590 or file_18597 is a more dominant I/O process, examine Appendix Figure 8. This figure plots the total amount of data written by each function with the average (± standard deviation) plotted as well. From this figure, it is easy to see that file_18597 did the vast majority of data writing for this application (1.75GB out of about 3GB, or almost 60 percent).
- In Appendix Table 4, examine the write() functions and how much data use per function call is tabulated. You'll see that most of the data is passed in 1 to 8KB chunks, with the vast majority in 1KB or smaller chunks. This indicates that the application is doing a very large number of small writes (which could influence storage design).
- If you look at the write() summary right after Appendix Figure 7, you can also see that the whole application only wrote a little more than 3GB of data, with an average data size of a little more than 10KB per write.
- The same thing can be done for read() functions. Appendix Table 7 tabulates the data size passed to the read() function. You can see that the majority of data is read in the 1 to 10MB range. It turns out that a number of these read() function calls are the result of loading shared objects (.so) files, which are read and loaded into memory.
- If you look at the last section that covers IOPS, you can see that the peak Write IOPS is about 2,444, which is fairly large considering that a typical 7,200 rpm SAS or SATA drive can only handle about 100 IOPS. The Read IOPS number is low and the Total IOPS number is high as well (2,444). You would think this application needs a fair amount of IOPS performance because of the very small writes being performed, and you might presume that you need to have 25, 7,200 rpm SAS or SATA drives to meet the IOPS requirement of the application. However, don't forget that the best wall clock improvement possible for this application is 1.15 percent, so spending so much on drives will only improve the overall performance of the application by a small amount. It might be better to use that money to buy another compute node, if the application scales well, to improve performance.
Summary
HPC storage is definitely a difficult problem for the industry right now. Designing systems to meet our storage needs has become a headache that is difficult to cure. However, you can make this headache easier to manage if you understand the I/O patterns of your applications.
In this article, I talk about different ways to measure the performance of your current HPC storage system, your applications, or both, but this requires a great deal of coordination to capture all of the relevant information and then piece it together. Some of the tools, such as iotop, iostat, nfsiostat, collectl, collectd, and blktrace, can be used to understand what is happening with your storage and your applications. However, they don't really give you the details of what is going from the perspective of the application. These tools are all focused on what is happening on a particular server (compute node). For HPC, you would have to gather this information for all nodes and then coordinate it to understand what is happening at an application (MPI) level.
Using strace can give you more information from the perspective of the application, although it also requires you to gather all of this information on each node in the job run and coordinate it. To help with this process, two applications – strace_analyzer and mpi_strace_analyzer – have been written to help sort through the mounds of strace data and produce some useful statistical information.
The tools were applied to a LS-Dyna run over eight cores that used an NFS filesystem (NFS over GigE). Portions of the strace analysis of a single process was presented, and the entire MPI strace analysis was presented in an Appendix, showing the sort of information produced by the analysis tools to help you better understand the I/O pattern of your application.
I hope this article has presented some ideas about how to analyze your I/O needs from the perspective of an application. After all, making your applications run more efficiently, and hopefully faster, is the whole point of HPC.
