 
The IO500 is a lesser-known but very useful HPC benchmark with innovative applications.
IO500
After the advent of the TOP500, several other TOP benchmarks followed, such as the Green500, High-Performance Conjugate Gradients (HPCG), and the now-defunct High-Performance Computing (HPC) Challenge (HPCC). Each of these benchmarks searches for a single number that can be used to rate or compare systems, with each focusing on a different aspect, such as absolute performance or energy efficiency. One list that people don’t always consider is the IO500, perhaps because it is storage focused (not one of the more exciting topics in HPC, according to many people) or perhaps because it doesn’t get much press. However, this benchmark is important because it allows you to contrast storage solutions and filesystems. (The IO500 website uses the word “compare,” which to me implies a conclusion. I like “contrast” because it points out differences and does not imply a conclusion.)
Goals and Workloads
At the start of the IO500, the founding members created a set of goals that drove its creation:
- representative
- understandable
- scalable
- portable
- inclusive
- lightweight
- trustworthy
From these goals, they defined several workloads from which a single IO500 number is derived:
- IOEasy: applications with well optimized I/O patterns
- IOHard: applications that require a random workload
- MDEasy: metadata/small objects
- MDHard: small files (3901 bytes) in a shared directory
- Find: relevant objects found by patterns
As you can probably tell, two tests underlie these workloads. The first is IOR, which measures read and write performance, and the second is MDTest, which is a metadata benchmark. Both of these tools use MPI to run tests and have been around for quite a while, so they are robust, well-seasoned, and reasonably well understood.
IOR has been used for many years in the HPC space as an I/O test. I don’t know the history of the test and can’t really find much on it, but it has become the de facto HPC I/O benchmark. One reason is its flexibility, which allows you to test various interfaces and data access patterns. You can test sequential or random data access with some control over the exact random pattern. You can look through the list of options and see the breadth of control available to you, but don’t be put off by the number of options: You can pick just a few and start running quickly.
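To illustrate the "pick just a few" point, here is a minimal sketch that only assembles an IOR command line and does not run anything. The launcher (mpirun), the process count, the sizes, and the /mnt/scratch path are my assumptions for illustration. The options shown are common IOR options as I understand them, so check them against the help output of your installed version.

```python
import shlex


def build_ior_command(num_procs, test_file, random_access=False,
                      transfer_size="1m", block_size="16m",
                      file_per_process=True):
    """Return an argv list for an MPI-launched ior run (nothing is executed)."""
    cmd = [
        "mpirun", "-np", str(num_procs), "ior",
        "-a", "POSIX",        # I/O interface to exercise
        "-w", "-r",           # run a write phase and a read phase
        "-t", transfer_size,  # size of each I/O request
        "-b", block_size,     # contiguous bytes handled per process
        "-o", test_file,      # path of the test file (hypothetical here)
    ]
    if file_per_process:
        cmd.append("-F")      # one file per process instead of a shared file
    if random_access:
        cmd.append("-z")      # random rather than sequential offsets
    return cmd


sequential = build_ior_command(4, "/mnt/scratch/ior_testfile")
random_io = build_ior_command(4, "/mnt/scratch/ior_testfile", random_access=True)
print(shlex.join(sequential))
print(shlex.join(random_io))
```

The only difference between the two command lines is the random-offset flag, which is often all you need to contrast sequential and random behavior on the same storage.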
MDTest is also a popular benchmark for metadata tests. It performs several tests and computes a “rate” – that is, how many operations per second (OPS) can be performed. For example, it measures creation, stat, rename, and removal operations and computes a rate (OPS) for each. It does this for directories and files and then tests the creation rate and removal rate of a tree. The tree tests are focused not on a particular file, but on part of a directory tree or the entire directory tree.
The IO500 maintains four lists to which you can submit:
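The idea of a metadata "rate" is easy to see in a toy example. The sketch below is not MDTest: It is a single-process Python illustration that times create, stat, and remove phases on files in one temporary directory and reports operations per second. The function names and the file count are my own, and the numbers it prints say more about Python and your local disk than about any HPC filesystem.

```python
import os
import tempfile
import time


def measure_rate(operation, paths):
    """Apply operation to every path and return operations per second."""
    start = time.perf_counter()
    for path in paths:
        operation(path)
    elapsed = time.perf_counter() - start
    return len(paths) / elapsed if elapsed > 0 else float("inf")


def create_file(path):
    with open(path, "wb") as handle:
        handle.write(b"x" * 3901)  # the file size cited for MDHard


def metadata_rates(num_files=1000):
    """Time create, stat, and remove phases for files in one directory."""
    with tempfile.TemporaryDirectory() as workdir:
        paths = [os.path.join(workdir, f"file.{i}") for i in range(num_files)]
        # The phases run in order: every file is created before any is
        # stat-ed, and every file is stat-ed before any is removed.
        rates = {
            "create": measure_rate(create_file, paths),
            "stat": measure_rate(os.stat, paths),
            "remove": measure_rate(os.remove, paths),
        }
    return rates


for phase, ops_per_second in metadata_rates().items():
    print(f"{phase:>6}: {ops_per_second:,.0f} OPS")
```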
- Production
- 10 Node Production
- Research
- 10 Node Research
Obviously, the two categories with “10 Node” in their title allow only 10 clients, but with a variable number of client processes. In the other two categories, you can use whatever number of clients you want and whatever total client processes you want.
Scoring
IO500 takes the various scores and combines them into a single number, which is then used to rank the solutions. The combination is documented in the IO500 rules, and the project also publishes the raw scores, which can be used to rank the systems on individual tests. If you look at the main IO500 page, you will see a table. Look for the heading that says Score, which is the overall score for each system in whichever of the four lists previously mentioned you have selected.
To the right of the overall score are the aggregate results for BW (bandwidth; GBps) and MD (metadata rate, thousands of I/O OPS; KIOPS). These two columns aggregate the throughput and metadata results, respectively, each combining the Easy and Hard workloads for the selected list.
Just above the table are buttons for the four lists. You can click on these to get the table you want. The systems are ranked from the highest to lowest score, but you can reverse that order if you like. I would recommend looking at the Production systems and then the Research systems. You will see quite a difference between the two.
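For readers who want the arithmetic, my understanding of the published IO500 rules is that the BW and MD columns are each a geometric mean of their component results and that the overall score is the geometric mean of those two values. The sketch below works under that assumption. The input numbers are invented for illustration and do not come from any submission.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def io500_style_score(bandwidth_results, metadata_results):
    """Return (BW, MD, overall) using geometric means at each level."""
    bw = geometric_mean(bandwidth_results)   # GiB/s-style results
    md = geometric_mean(metadata_results)    # kIOPS-style results
    return bw, md, math.sqrt(bw * md)


# Invented example values, not real measurements.
example_bw = [4.0, 1.0, 4.0, 1.0]
example_md = [100.0, 1.0, 100.0, 1.0]

bw, md, score = io500_style_score(example_bw, example_md)
print(f"BW={bw:.2f}  MD={md:.2f}  Score={score:.2f}")
```

A geometric mean punishes a system that is very fast on one test and very slow on another, which is one reason the Hard workloads matter so much to the final ranking.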
Applying IO500
Benchmarks can be very useful tools, but only if you can apply them to your own situation and applications, and that is especially true of the IO500. You can run four different benchmarks covering sequential and random throughput and metadata performance. If you are not submitting the results to the IO500, you can run as many or as few clients as you want and as many processes per client as you want.
The key consideration is the I/O pattern(s) of your application(s). Is the I/O primarily sequential or random? How much I/O is done by the application? Does the data have quite a bit of metadata? Do all of the clients participate in performing I/O? Knowing some of these things will allow you to use the IO500 results to contrast the various submissions in light of your applications. It’s not simple, but it’s worth the effort if for nothing else but to understand the I/O patterns of your applications.
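As a starting point for the sequential-versus-random question, the sketch below estimates how sequential an access stream is. It assumes you already have a trace of (offset, size) pairs for one file in the order the application issued them. The trace format and the function name are hypothetical, and real tracing tools will give you something richer that you would need to reduce to this form.

```python
def sequential_fraction(accesses):
    """Fraction of accesses that start exactly where the previous one ended.

    accesses: list of (offset, size) tuples for one file, in issue order.
    """
    if len(accesses) < 2:
        return 1.0
    sequential = 0
    for (prev_offset, prev_size), (offset, _size) in zip(accesses, accesses[1:]):
        if offset == prev_offset + prev_size:
            sequential += 1
    return sequential / (len(accesses) - 1)


# Hypothetical traces: one streaming reader and one that jumps around.
streaming = [(i * 4096, 4096) for i in range(8)]
scattered = [(0, 4096), (40960, 4096), (8192, 4096), (12288, 4096)]

print(f"streaming: {sequential_fraction(streaming):.2f}")
print(f"scattered: {sequential_fraction(scattered):.2f}")
```

A value near 1 suggests the Easy throughput results are the more relevant ones to contrast, whereas a low value points you toward the Hard results.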
IO500 at SC24
I attended the IO500 BoF (birds of a feather) at SC24, and I really enjoyed this meeting. Each year at the Supercomputing Conference, the organizing members present the new list, discuss the significant changes to the list, and perhaps present a statistical analysis of the list, including a little history that covers the last one or two lists. I always find this presentation interesting, particularly the results of the Production list, because this is what systems are using today. The group then has an invited speaker who presents a topic relevant to the IO500.
This year's speaker was Mark Nelson from Clyso, Ceph specialists, who gave a fantastic talk about using the IO500 benchmarks to help develop CephFS. Mark is a contributor to the CephFS filesystem, and he discussed how they run the IO500 benchmarks after a group of patch commits (theirs and others) to Ceph. They run the IO500 benchmarks against a version of Ceph with certain commits. If they see performance degradation, they can then start bisecting the commits according to the IO500 benchmarks to locate the specific patch that caused the performance problem. This process helps them understand what the patch did to cause the problem or perhaps allows them to look for ways to recode the patch to avoid the performance regression. He did say that in some cases they can’t avoid the performance regression, but then they can look for other ways to gain back the performance by creating other patches to Ceph.
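The bisection idea can be sketched in a few lines. This is not the Ceph team's tooling, which I have not seen. It is a generic binary search that assumes a list of commits ordered oldest to newest, a known-good baseline score, and a hypothetical run_benchmark function that builds a commit and returns a score. It also assumes that once the regression appears, every later commit shows it too.

```python
def find_first_regression(commits, run_benchmark, baseline_score, tolerance=0.05):
    """Return the first commit scoring more than `tolerance` below baseline.

    Returns None if the newest commit is still within tolerance.
    """
    if not commits:
        return None
    threshold = baseline_score * (1.0 - tolerance)
    if run_benchmark(commits[-1]) >= threshold:
        return None                      # no regression at the tip
    low, high = 0, len(commits) - 1      # invariant: commits[high] is bad
    while low < high:
        mid = (low + high) // 2
        if run_benchmark(commits[mid]) < threshold:
            high = mid                   # regression is at mid or earlier
        else:
            low = mid + 1                # regression is after mid
    return commits[low]


# Stand-in for a real benchmark run: invented scores keyed by commit name.
fake_scores = {f"c{i}": (100.0 if i < 5 else 80.0) for i in range(8)}
history = [f"c{i}" for i in range(8)]

culprit = find_first_regression(history, fake_scores.get, baseline_score=100.0)
print("first regressing commit:", culprit)
```

Because each benchmark run on a real cluster is expensive, the appeal of bisection is that it needs only about log2(N) runs for N commits rather than one run per commit.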
I found the talk really interesting because Ceph is using an IO500 benchmark as a standard for analyzing patches to a filesystem. I have no idea whether other contributors or development teams do this, but I’m willing to bet they use their own internal benchmarks. Now, an open source filesystem has contributors that use the standard IO500 benchmarks when writing patches that affect performance.
Summary
In the world of HPC you have the TOP500, Green500, HPCG, and IO500. Others have come and gone, but this seems to be the current collection. All four produce a single number that is used to stack rank the systems that are submitted to the respective lists. I’m not a big fan of single numbers, but I understand their usefulness with the great unwashed that fund these systems. Much is written about the first two projects, something is written about HPCG, but not much is written about IO500.
IO500 produces a single number, but the project also publishes the individual benchmark results that go into that single number. Currently, the project uses two different tests, each with two different workloads, producing four results. These results of course depend on the number of clients (number of nodes) used, and each client can run several processes during the tests. Two different classes of results, Production and Research, can be used (something like MLPerf in the artificial intelligence world). I find the varied tests and workloads useful in understanding how storage systems behave.
These results are just benchmarks, and you need to understand the I/O pattern(s) of your applications to apply the IO500 results to your case, essentially correlating between the applications and the benchmarks. This process is non-trivial but is nonetheless a very useful exercise, even if a good correlation is not found, because you will better understand the I/O patterns of your applications. If you can find a good correlation, you can then examine the effect of the storage system on your application.
IO500 lends itself to innovative uses that I haven’t seen from TOP500, Green500, or HPCG. The invited talk during the IO500 BoF at SC24 presented a very innovative use. Mark Nelson showed how the IO500 can be used to test various groups of commits (patches) to Ceph for performance regressions, with a way to bisect the patches with the IO500 benchmarks to find the one that caused the performance regression. Although you might not be able to change the patch to improve performance, you now know where sensitivities lie within the code, and you can perhaps look for other avenues to gain back performance.
