Photo by Barnabas Davoti on Unsplash

File access optimization discovery and visualization

Fast and Lean

Article from ADMIN 87/2025

By Andreas Scheumann, Philipp Koester, and Rainer Keller

File I/O is a limiting factor for many applications and can become a bottleneck for data-intensive computing. We analyze different file access functions and present a script that visualizes and optimizes inefficient file I/O.

Analyzing program file I/O is difficult and time consuming, especially if you have to change an existing program with a long history of different developers. Therefore, assistance from tools is important.

Most existing tools (e.g., Darshan [1], Vampir [2], libiotrace [3]) are profilers that collect data at runtime and provide it for later analysis. Building on the libiotrace profiling data, we developed an automatic analysis tool that finds crucial access patterns and offers instructions for improving them for POSIX [4] and MPI [5] file I/O (see the "Basics of File Access" box).

Basics of File Access

Before getting into the details, a good starting point is to describe how file access works internally and how several simultaneous file accesses on one file can interfere with each other.

Opening a file with low-level file access (e.g., in C) creates a file handle with a unique ID representing the opened file. In POSIX, the handles are called file descriptors or streams, whereas in MPI they are called MPI file handles. Because the underlying functionality is the same, the tool presented in this article addresses the different types of handles alike, and we generalize them by calling them active file accesses.

Both MPI and POSIX processes can open multiple files simultaneously, and a single process can hold multiple concurrent accesses to the same file (Figure 1). What matters then is which functions are called on these active file accesses: Do they interfere with each other, or can they, for example, be combined without changing the program behavior?

Figure 1: Illustration of POSIX and MPI application file handles.

The underlying problem is inefficient file access, which can cause a program to use more resources and time than necessary. One type of suboptimal file I/O is an inefficient file access pattern. Our script [6] can help with the problem of unnecessarily frequent, small reads and writes of data inside a file. In this article, we describe the methodology used for devising the script, which serves as an example of the kinds of optimizations you can apply to your own homegrown applications to improve performance.
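To make the idea of frequent small writes concrete, the following minimal Python sketch writes the same data once with one call per record and once as a single merged buffer, then compares the number of calls and the resulting file contents. It is not taken from the authors' script; the helper names (write_all, write_records, compare_small_vs_merged) and the 8-byte record format are assumptions chosen for illustration.

```python
import os
import tempfile


def write_all(fd, data):
    """Write all of data to fd; return how many os.write calls were needed."""
    view = memoryview(data)
    calls = 0
    while view:
        written = os.write(fd, view)  # may write fewer bytes than requested
        view = view[written:]
        calls += 1
    return calls


def write_records(path, records, merge):
    """Write records to path, one call per record or merged into one buffer."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        if merge:
            return write_all(fd, b"".join(records))
        return sum(write_all(fd, record) for record in records)
    finally:
        os.close(fd)


def compare_small_vs_merged(count=1000):
    """Return (calls_small, calls_merged, contents_equal)."""
    # Hypothetical 8-byte records, used only for this illustration.
    records = [b"%07d;" % i for i in range(count)]
    with tempfile.TemporaryDirectory() as tmp:
        small = os.path.join(tmp, "small.bin")
        merged = os.path.join(tmp, "merged.bin")
        calls_small = write_records(small, records, merge=False)
        calls_merged = write_records(merged, records, merge=True)
        with open(small, "rb") as a, open(merged, "rb") as b:
            return calls_small, calls_merged, a.read() == b.read()


if __name__ == "__main__":
    print(compare_small_vs_merged())
```

Both variants produce the same file content; only the number of write calls differs, which is exactly the property a merge must preserve.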

Optimization Possibilities

To reduce the number of function calls per active file access, you must determine whether merging function calls is possible while taking parallel active file accesses on the same file into account. Three factors are important: the type of function used to access the file area, the chronological sequence of function calls on a file, and the affected byte areas inside the file.

With this information, you can recognize whether the same function is called more than once successively and then check whether those calls can be merged into a single call. In the following section, we explain the theory of this check.

Sequential Condition

When the script finds two function calls of the same type issued by one file access, the first step is to check whether a different file access calls a function on the same file between the repeated calls (henceforth called the intermediate function). If no such intermediate function call occurs, you can merge the calls into one, because the resulting file content remains the same (Figure 2). If function calls by other file accesses do occur between the potentially optimizable calls, you need to investigate those cases further.
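The sequential condition can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the Call record, its field names, and the mergeable_pairs function are hypothetical, and the sketch assumes that each active file access refers to exactly one file and that the trace is already sorted by time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    access: int  # hypothetical ID of the active file access (handle)
    file: str    # file the access refers to
    func: str    # function type, e.g. "read" or "write"


def mergeable_pairs(trace):
    """Return index pairs (i, j) of successive same-type calls by one access
    with no call by a different access on the same file in between."""
    pairs = []
    last_by_access = {}  # access ID -> index of its most recent call
    for j, call in enumerate(trace):
        i = last_by_access.get(call.access)
        if i is not None and trace[i].func == call.func:
            between = trace[i + 1:j]
            interfering = any(
                c.file == call.file and c.access != call.access
                for c in between
            )
            if not interfering:
                pairs.append((i, j))
        last_by_access[call.access] = j
    return pairs


if __name__ == "__main__":
    trace = [
        Call(1, "a.dat", "write"),
        Call(1, "a.dat", "write"),  # no intermediate call: merge candidate
        Call(2, "a.dat", "read"),   # different access on the same file
        Call(1, "a.dat", "write"),  # needs further investigation
        Call(3, "b.dat", "write"),  # different file, does not interfere
        Call(1, "a.dat", "write"),  # merge candidate again
    ]
    print(mergeable_pairs(trace))
```

Pairs that are not returned are not necessarily unmergeable; they are the cases that need the further checks described in the rest of this section.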

Figure 2: Merging two calls of the same type.

For simple cases (i.e., when the intermediate function is only an open or close function), you can quickly rule out function interference: The intermediate function neither changes the file nor depends on the repeated function calls. In this case, you can merge the repeated function calls and perform them in a single call before or after the intermediate function call.

Conditions Dependent on the Offset

If the intermediate function is instead a seek, read, or write, you need to consider additional factors. You can resolve some cases easily, but the sequence of function calls alone is often not sufficient to decide whether optimization is possible. In those cases, the function calls and the corresponding byte areas within the file require a detailed look.
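The distinction between the two kinds of intermediate functions can be expressed as a small triage step. The following Python sketch is only an illustration under assumptions: the function name classify_intermediates, the lowercase labels for the function types, and the three result strings are hypothetical and do not come from the script.

```python
# Function types that neither change the file nor depend on the repeated calls.
NON_INTERFERING = frozenset({"open", "close"})
# Function types whose effect depends on offsets and byte areas.
OFFSET_DEPENDENT = frozenset({"seek", "read", "write"})


def classify_intermediates(intermediate_funcs):
    """Decide how to treat the calls found between two repeated calls."""
    funcs = set(intermediate_funcs)
    if funcs <= NON_INTERFERING:
        return "merge"          # includes the case of no intermediate call
    if funcs <= NON_INTERFERING | OFFSET_DEPENDENT:
        return "check offsets"  # byte areas must be inspected first
    return "unknown"            # function type not covered by this sketch


if __name__ == "__main__":
    print(classify_intermediates(["open", "close"]))
    print(classify_intermediates(["open", "write"]))
```

A result of "check offsets" does not mean the merge is impossible, only that the sequence alone does not settle the question.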

To get a better understanding of what is going on inside the file, it is important to know that each active file access has a so-called offset, which describes the current position of the cursor of the file access. Functions called by a process change the offset of the file handle according to the function type.
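The per-access offset can be observed directly with POSIX file descriptors. The following Python sketch, which assumes a POSIX-like system and uses a throwaway temporary file, opens the same file twice and shows that reading or seeking on one descriptor leaves the offset of the other untouched. The helper names current_offset and demo_offsets are illustrative.

```python
import os
import tempfile


def current_offset(fd):
    """Return the current offset of fd without moving it."""
    return os.lseek(fd, 0, os.SEEK_CUR)


def demo_offsets():
    """Return (offset_a, offset_b) after each step on two accesses."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        with open(path, "wb") as f:
            f.write(b"0123456789")
        # Two independent active file accesses on the same file.
        fd_a = os.open(path, os.O_RDONLY)
        fd_b = os.open(path, os.O_RDONLY)
        try:
            observed = [(current_offset(fd_a), current_offset(fd_b))]
            os.read(fd_a, 4)                # read advances only fd_a
            observed.append((current_offset(fd_a), current_offset(fd_b)))
            os.lseek(fd_b, 7, os.SEEK_SET)  # seek repositions only fd_b
            observed.append((current_offset(fd_a), current_offset(fd_b)))
            os.read(fd_b, 2)                # read continues at fd_b's offset
            observed.append((current_offset(fd_a), current_offset(fd_b)))
            return observed
        finally:
            os.close(fd_a)
            os.close(fd_b)


if __name__ == "__main__":
    print(demo_offsets())  # [(0, 0), (4, 0), (4, 7), (4, 9)]
```

Because each access carries its own offset, the byte area touched by a call depends on the earlier calls of that same access, not on the calls of other accesses to the file.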
