Lead Image © Kheng Ho Toh, 123RF.com

Fundamentals of I/O benchmarking

Measure for Measure

Article from ADMIN 32/2016
Admins often want to know how to measure the performance of a specific solution. Care is needed, however: Where there are benchmarks, there are pitfalls.

Administrators wanting to examine a specific storage solution will have many questions. To begin: What is storage performance? Most admins will think of several I/O key performance indicators (KPIs), focusing on one or another depending on the situation. However, these metrics describe different things: Sometimes they relate to the filesystem, sometimes to raw storage, sometimes to read performance, and sometimes to write performance. Sometimes a cache is involved and sometimes not. Moreover, the various indicators are measured by different tools.

Once you have battled through to this point and clarified what you are measuring and with which tool, the next questions are just around the corner: Which component is the bottleneck that is impairing the performance of the system? What storage performance does your application actually need? In this article, we will help you answer all these questions.

Fundamentals

An I/O request passes through several layers (Figure 1) in the operating system, each of which builds on the one below. For example, the application and filesystem layers are based on the block virtualization layer (with technologies such as LVM, DRBD, mdadm, multipathing, device mapper, etc.). Closer to the hardware, you will find the block layer, the SCSI layers, and finally the devices themselves (RAID controllers, HBAs, etc.).

Figure 1: Layers relevant for an I/O benchmark.

Some of the layers in Figure 1 have their own cache. Each can perform two tasks: Buffering data to offer better performance than would be possible with the physical device alone, and collating requests to create a few large requests from a large number of smaller ones. This is true both for reading and writing. For read operations there is another cache function, that is, proactive reading (read ahead), which retrieves neighboring blocks of data without a specific request and stores them in the cache because they will probably be needed later on.

The processor and filesystem caches are volatile; data is only guaranteed to be stored safely after a sync operation. How the block virtualization layer's cache behaves depends on the technique used (DRBD, LVM, or similar). The cache on RAID controllers and external storage devices is typically battery buffered. Data that has arrived there can therefore be considered safe and will survive a reboot.

In this article, the term "storage benchmark" refers to an I/O benchmark that bypasses the filesystem and accesses the underlying layer (e.g., device files such as /dev/sda or /dev/dm-0) directly.

In comparison, a filesystem benchmark addresses the filesystem but does not necessarily use the filesystem cache. The use of this cache is known as buffered I/O, and bypassing it is known as direct I/O. A pure filesystem benchmark therefore tries to factor out the layers below the filesystem, which makes it possible to compare filesystems with one another.

Indicators

Throughput is certainly the most prominent indicator. The terms "bandwidth" or "transfer rate" are considered synonyms. They describe the number of bytes an actor can read or write per unit of time. Therefore, the throughput for copying large files, for example, determines the duration of the operation.

IOPS (I/O operations per second) is a measure of the number of read/write operations per second that a storage system can accommodate. Such an operation for SCSI storage (e.g., "read block 0815 from LUN 4711") takes time, which limits the possible number of operations per unit time, even if the theoretically possible maximum throughput has not been reached. IOPS are particularly interesting for cases in which there are many relatively small blocks to process, as is often the case with databases.

Latency is the delay between triggering an I/O operation and the following commit, which confirms that the data has actually reached the storage medium.
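
The link between latency and IOPS can be made concrete with a little arithmetic. The following sketch assumes that only one request is outstanding at a time (queue depth 1); the latency and block size are illustrative values, not measurements from this article.

```shell
#!/bin/sh
# Assumed, illustrative values (not measurements from the article)
latency_us=5000      # time per I/O operation in microseconds
block_kib=4          # block size in KiB

# With one outstanding request, at most 1 s / latency operations fit in a second
max_iops=$((1000000 / latency_us))

# Throughput follows from operations per second times block size
throughput_kib=$((max_iops * block_kib))

echo "max IOPS: $max_iops"
echo "throughput: $throughput_kib KiB/s"
```

With these numbers, the device tops out at 200 operations and 800KiB per second, far below what the same device could deliver with large sequential transfers.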

CPU cycles per I/O operation is a rarely used counter but an important one all the same, because it indicates the extent to which the CPU is stressed by I/O operations. This is best illustrated by the example of software RAID: For software RAID 5, parity information has to be calculated for the data written, which consumes CPU cycles. Also, faulty drivers are notorious for burning computational performance.

Two mnemonics will help you remember this. The first is: "Throughput is equivalent to the amount of water flowing into a river every second. Latency is the time needed for a stick to cover a defined distance on the water." The second is: "Block size is equivalent to the cargo capacity of a vehicle." In this case, a sports car is excellent for taking individual items from one place to another as quickly as possible (low latency), but in practice, you will choose a slower truck for moving a house: It offers you more throughput because of its larger cargo volume.

Influencing Factors

The indicators depend on several factors (Figure 2):

  • Hardware
  • Block size
  • Access patterns (read/write portions and the portion of sequential and random access)
  • Use of the various caches
  • Benchmark tool used
  • Parallelization of I/O operations

Figure 2: Mindmap with influencing factors in the benchmark and their associations.

Storage hardware naturally determines the test results, but the performance of processors and RAM can also cause a bottleneck in a storage benchmark. This effect can never be entirely excluded when programming the benchmark tools, which is why caution is advisable in comparing I/O benchmark results run with different CPUs or RAM configurations.

Block size determines the volume of data read or written by an I/O operation. Halving the block size at the same data rate doubles the number of I/O operations required. Neighboring I/O operations are merged where possible. Access within a block is always sequential; with very large block sizes, the random nature of the access requests therefore becomes secondary, and the measured values approach those for sequential access.
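
The relationship between block size and the number of operations can be sketched as follows. The helper name required_iops and the data rate of 100MiB/s are assumptions chosen purely for illustration.

```shell
#!/bin/sh
# Hypothetical helper: I/O operations per second needed to sustain a data rate
# $1 = data rate in KiB/s, $2 = block size in KiB
required_iops() {
    echo $(( $1 / $2 ))
}

rate_kib=102400   # assumed data flow: 100 MiB/s
for bs in 64 32 16 8 4; do
    echo "block size ${bs} KiB -> $(required_iops "$rate_kib" "$bs") IOPS"
done
```

Each halving of the block size doubles the figure in the right-hand column, from 1,600 operations per second at 64KiB to 25,600 at 4KiB.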

Each operation costs a certain amount of overhead; therefore, throughput is better for larger blocks. Figure 3 shows the throughput as a function of block size for a hard disk. The statistics were created with the iops program, carrying out random reads. Figure 4 shows the same for an SSD, which has no read head and therefore no seek time. The block size cannot be arbitrarily large: Linux limits it to the value in /sys/block/<device>/queue/max_sectors_kb.
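
To see which limit applies on a given machine, the sysfs file can simply be read for every block device. The following is a minimal sketch; the function name show_max_request_size and its optional directory argument are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the request size limit for every block device.
# $1 = sysfs directory to scan (defaults to /sys/block)
show_max_request_size() {
    sysfs_root=${1:-/sys/block}
    for limit in "$sysfs_root"/*/queue/max_sectors_kb; do
        [ -r "$limit" ] || continue          # skip if the glob matched nothing
        dev=${limit#"$sysfs_root"/}          # strip the leading directory
        dev=${dev%%/*}                       # keep only the device name
        echo "$dev: $(cat "$limit") KiB"
    done
}

show_max_request_size
```

Reading these files does not require root privileges; each line of output gives a device name followed by its limit in KiB.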

Figure 3: Operations and throughput with increasing block sizes on a magnetic disk.
Figure 4: Operations and throughput with increasing block sizes on an SSD.

Reading is generally faster than writing; however, you need to note the influence of caching. During reading, the storage and filesystem caches fill up, and repeated access is then faster. In production, this is desirable, but in a benchmark, this history distorts the measurements. During writes, the caches behave exactly the other way around: At first, writing to the cache achieves high throughput, but later the content needs to be written out to disk. This results in an inevitable slump in performance, which can also be observed in the benchmark.

Read-ahead access is another form of optimization for read operations. The speculative operations read more than requested, because the following block will probably be used next anyway. Elevator scheduling during writing will result in sectors reaching the medium in the optimized order.
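
Both settings can be inspected per device through sysfs, which is worth doing before a benchmark so that the results can be interpreted later. The sketch below reads the read-ahead size and the I/O scheduler; the function name show_queue_settings and the device name sda are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: show read-ahead size and I/O scheduler of one device.
# $1 = device name (e.g. sda), $2 = sysfs directory (defaults to /sys/block)
show_queue_settings() {
    queue=${2:-/sys/block}/$1/queue
    if [ ! -d "$queue" ]; then
        echo "no such device: $1" >&2
        return 1
    fi
    echo "read-ahead: $(cat "$queue/read_ahead_kb") KiB"
    # The active scheduler is the entry shown in square brackets
    echo "scheduler:  $(cat "$queue/scheduler")"
}

# sda is an assumed device name; adjust it to your system
show_queue_settings sda || true
```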

If a filesystem cache is used, one speaks of buffered I/O; direct I/O bypasses the cache. With buffered I/O, data that is still in the filesystem cache can be lost if the computer fails. Direct I/O avoids this particular risk, but the effect on write performance is drastic, as the dd runs in Listing 1 demonstrate.

Listing 1

Buffered and Direct I/O

# dd of=file if=/dev/zero bs=512 count=1000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB) copied, 1.58155 s, 324 MB/s
# dd of=file if=/dev/zero bs=512 count=1000000 oflag=direct
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB) copied, 49.1424 s, 10.4 MB/s
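
The buffered result in Listing 1 largely reflects the speed of the cache, because dd finishes before the data has been written out. One way to include the flush in the measurement is the conv=fsync option of GNU dd, which calls fsync() on the output file before dd exits. The following sketch assumes GNU coreutils; the function name and the small file size are chosen for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: buffered write whose timing includes the final flush.
# conv=fsync makes dd call fsync() on the output file before it exits.
# $1 = output file, $2 = amount of data in MiB
buffered_write_with_flush() {
    dd if=/dev/zero of="$1" bs=1M count="$2" conv=fsync 2>&1 | tail -n 1
}

testfile=$(mktemp)                      # temporary file on the filesystem under test
buffered_write_with_flush "$testfile" 8 # prints the dd summary line
rm -f "$testfile"
```

For a meaningful result, the amount of data should be considerably larger than the 8MiB used here.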

If you want to empty the read and write cache for benchmark purposes, you can do so using:

sync; echo 3 > /proc/sys/vm/drop_caches
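
Writing to drop_caches requires root privileges. Whether the command had the desired effect can be checked by comparing the Cached value in /proc/meminfo before and after. The helper name page_cache_kb in this sketch is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: report the size of the page cache in kB.
# $1 = meminfo file to read (defaults to /proc/meminfo)
page_cache_kb() {
    # Match the field name exactly so that SwapCached is not picked up
    awk '$1 == "Cached:" { print $2 }' "${1:-/proc/meminfo}"
}

echo "page cache: $(page_cache_kb) kB"
```

Called once before and once after emptying the caches, the function should report a clearly smaller value the second time.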

Sequential access is faster than random access, because access is always adjacent, allowing multiple operations of small block size to combine into a few operations of large block size in the filesystem cache. Also, the time needed to reposition the read head is eliminated. The admin can monitor merging of I/O blocks with iostat -x. This assumes the use of the filesystem cache.

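In Listing 2, wrqm/s stands for write requests merged per second, and w/s stands for the write requests that reach the device per second after merging. In the example shown here, 28 write requests per second are merged (wrqm/s), and 259 write operations per second are issued to the device (w/s).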

Listing 2

Merging I/O

Device:    rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s  avgrq-sz  ...
sdb          0.00    28.00    1.00  259.00     0.00   119.29    939.69  ...
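
The values in Listing 2 can be cross-checked against each other. The sketch below assumes the column layout shown in the listing and that avgrq-sz is given in 512-byte sectors, as documented for iostat; newer sysstat releases arrange the columns differently. The function name analyze_iostat_line is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: derive request size and throughput from one iostat -x line.
# Assumes the column layout of Listing 2 (avgrq-sz in 512-byte sectors).
analyze_iostat_line() {
    echo "$1" | awk '{
        reqs    = $4 + $5                 # r/s + w/s: requests reaching the device
        avg_kib = $8 * 512 / 1024         # avgrq-sz in sectors -> KiB
        printf "average request: %.1f KiB\n", avg_kib
        printf "implied throughput: %.1f MiB/s\n", reqs * avg_kib / 1024
    }'
}

analyze_iostat_line "sdb 0.00 28.00 1.00 259.00 0.00 119.29 939.69"
```

For the line from Listing 2, this yields an average request size of about 469.8KiB and an implied throughput of about 119.3MiB/s, which agrees with the rMB/s and wMB/s columns.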
