ZFS Tuning for HPC

The ZFS filesystem and volume manager simplifies data storage management and offers advanced features that allow it to perform in mission-critical or high-performance environments.

If you manage storage servers, chances are you are already aware of ZFS and some of the features and functions it boasts. In short, ZFS is a combined all-purpose filesystem and volume manager that simplifies data storage management while offering some advanced features, including drive pooling with software RAID support, file snapshots, in-line data compression, data deduplication, built-in data integrity, advanced caching (to DRAM and SSD), and more.

ZFS is licensed under the Common Development and Distribution License (CDDL), a weak copyleft license based on the Mozilla Public License (MPL). Although open source, ZFS and anything else under the CDDL were, and reportedly still are, considered incompatible with the GNU General Public License (GPL). This hasn’t stopped ZFS enthusiasts from porting it to the Linux kernel, where it remains a side project maintained by the ZFS on Linux (ZoL) project.

The ZoL project not only helped introduce the advanced filesystem to Linux users, it also attracted its fair share of users, some developers, and an entire community to support it. With a significant user base and the filesystem’s use for a wide variety of applications (HPC included), it often becomes necessary to know how to tune the filesystem and understand which knobs to turn.

Apply the methods described in this article with caution, and test them in dry runs before rolling them out into production.

Creating the Test Environment

To begin, you need a server (or virtual machine) with one or more spare drives. I advise more than one because when it comes to performance, spreading I/O load across more disk drives instead of bottlenecking a single drive helps significantly. Therefore, I use four local drives – sdc, sdd, sde, and sdf – in this article:

$ cat /proc/partitions|grep sd
   8        0  488386584 sda
   8        1       1024 sda1
   8        2  488383488 sda2
   8       16   39078144 sdb
   8       32 6836191232 sdc
   8       64 6836191232 sde
   8       48 6836191232 sdd
   8       80 6836191232 sdf

Make sure to load the ZFS modules,

$ sudo modprobe zfs

and verify that they are loaded:

$ lsmod|grep zfs
zfs                  3039232  3
zunicode              331776  1 zfs
zavl                   16384  1 zfs
icp                   253952  1 zfs
zcommon                65536  1 zfs
znvpair                77824  2 zfs,zcommon
spl                   102400  4 zfs,icp,znvpair,zcommon

With the four drives identified above, I create a ZFS RAIDZ pool, which is comparable to RAID5 (single parity),

$ sudo zpool create -f myvol raidz sdc sdd sde sdf

and verify the status of the pool (Listing 1) and that it has been mounted (Listing 2).

Listing 1: Pool Status

$ zpool status
  pool: myvol
 state: ONLINE
  scan: none requested
  myvol       ONLINE       0     0     0
    raidz1-0  ONLINE       0     0     0
      sdc     ONLINE       0     0     0
      sdd     ONLINE       0     0     0
      sde     ONLINE       0     0     0
      sdf     ONLINE       0     0     0
errors: No known data errors

Listing 2: Pool Mounted

$ df -ht zfs
Filesystem      Size  Used Avail Use% Mounted on
myvol            18T  128K   18T   1% /myvol
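
The usable size reported by df is smaller than the sum of the four drives because single-parity RAIDZ gives up roughly one drive's worth of space to parity. The following is a rough sketch of that arithmetic, not a ZFS tool: the function names are made up for illustration, the drive size is the 1KiB block count that /proc/partitions shows for sdc through sdf, and the estimate ignores metadata, padding, and reserved space, so df will report somewhat less.

```shell
#!/bin/sh
# Estimate RAIDZ capacity from drive count, parity level, and drive size.
# Sizes are in 1KiB blocks, the unit /proc/partitions reports.
# Hypothetical helper names; overhead inside ZFS is NOT accounted for.
raidz_usable_kib() {
    drives=$1
    parity=$2
    size_kib=$3
    echo $(( (drives - parity) * size_kib ))
}

kib_to_tib() {
    # Integer division, so the result is truncated.
    echo $(( $1 / 1024 / 1024 / 1024 ))
}

raw=$(raidz_usable_kib 4 0 6836191232)      # all four drives, no parity
usable=$(raidz_usable_kib 4 1 6836191232)   # minus one drive for parity
echo "raw:    ${raw} KiB (about $(kib_to_tib "$raw") TiB)"
echo "usable: ${usable} KiB (about $(kib_to_tib "$usable") TiB)"
```

For a RAIDZ2 or RAIDZ3 pool, the same estimate would use a parity value of 2 or 3.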

Some Basic Tuning

A few general procedures can tune a ZFS filesystem for performance, such as disabling file access time updates in the file metadata. Historically, filesystems have tracked when a user or application accesses a file and logged the most recent time of access, even if that file was only read and not modified. Updating this field generates metadata writes that can affect performance. To avoid this unnecessary I/O, simply turn off the atime property:

$ sudo zfs set atime=off myvol

To verify that it has been turned off, use the zfs get atime command:

$ sudo zfs get atime myvol
myvol  atime     off    local

Another property that can affect performance is compression. Although some algorithms (e.g., LZ4) are known to perform extremely well, compression still consumes some CPU time compared with writing data uncompressed. Therefore, disable filesystem compression,

$ sudo zfs set compression=off myvol

and verify that compression has been turned off:

$ sudo zfs get compression myvol
myvol  compression  off       default
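
If you tune more than a couple of properties, checking them by eye gets tedious. The sketch below is one way to script the check. The check_props function is a hypothetical helper, not part of ZFS; it assumes tab-separated name, property, and value lines on stdin, which is the shape that zfs get -H -o name,property,value produces. The sample data here is piped in by hand so the example runs without a pool.

```shell
#!/bin/sh
# Compare ZFS property values against the values you expect.
# stdin: tab-separated "name<TAB>property<TAB>value" lines.
# args:  property=value pairs, e.g. atime=off compression=off
# Prints one line per mismatch and returns non-zero if any were found.
check_props() {
    awk -F '\t' -v want="$*" '
        BEGIN {
            bad = 0
            n = split(want, pairs, " ")
            for (i = 1; i <= n; i++) {
                split(pairs[i], kv, "=")
                expect[kv[1]] = kv[2]
            }
        }
        ($2 in expect) && $3 != expect[$2] {
            printf "%s: %s is %s, expected %s\n", $1, $2, $3, expect[$2]
            bad = 1
        }
        END { exit bad }
    '
}

# Demonstration with hand-written sample data instead of a live pool.
printf 'myvol\tatime\toff\nmyvol\tcompression\toff\n' |
    check_props atime=off compression=off && echo "properties match"
```

On a live system, you would replace the printf line with zfs get -H -o name,property,value atime,compression myvol.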

To view all available parameters, use zfs get all (Listing 3).

Listing 3: View Parameters

$ zfs get all myvol
NAME   PROPERTY              VALUE                  SOURCE
myvol  type                  filesystem             -
myvol  creation              Sat Feb 22 22:09 2020  -
myvol  used                  471K                   -
[ ... ]


Adaptive Replacement Cache

Operating systems commonly rely on local (and volatile) memory (e.g., DRAM) to cache file data and have done so for decades, with the ultimate goal of not having to touch the back-end storage device. Waiting for a disk drive to read the requested data can be painfully slow, so operating systems (and, in turn, filesystems) attempt to cache data in the hope of not accessing the underlying device. ZFS implements its own cache, referred to as the adaptive replacement cache (ARC), which is not a plain least-recently-used (LRU) design. In a standard LRU cache, the least recently used entry is replaced with new data. ZFS is a bit more intelligent than this and maintains lists for:

  1. recently cached entries,
  2. recently cached entries that have been accessed more than once,
  3. entries evicted from the list of (1) recently cached entries, and
  4. entries evicted from the list of (2) recently cached entries that have been accessed more than once.

Caching reads is difficult to do well. You cannot predict with certainty which data needs to remain in cache, and with randomized read I/O profiles, the likelihood that data is evicted before it is needed again, and then has to be reread into cache, is very high.

The amount of memory the ARC can use on your local system can be managed in multiple ways. For instance, if you want to cap it at 4GB, you can pass that limit (in bytes) to the ZFS module with the zfs_arc_max parameter when loading it:

$ sudo modprobe zfs zfs_arc_max=4294967296

Or, you can create a configuration file for modprobe called /etc/modprobe.d/zfs.conf and save the following content in it:

options zfs zfs_arc_max=4294967296

You can verify the current setting of this parameter by viewing it under sysfs:

$ cat /sys/module/zfs/parameters/zfs_arc_max

Also, you can modify the parameter at runtime through the same sysfs interface:

$ echo 4294967296 |sudo tee /sys/module/zfs/parameters/zfs_arc_max
$ cat /sys/module/zfs/parameters/zfs_arc_max 
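
The value 4294967296 is simply 4 x 1024 x 1024 x 1024 bytes. If you want a different cap, a small helper saves you from counting digits. This is a minimal sketch with made-up function names; it only prints the modprobe options line and leaves writing it to /etc/modprobe.d/zfs.conf to you.

```shell
#!/bin/sh
# Convert a size in GiB to bytes, the unit zfs_arc_max expects.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Build the modprobe options line for a given ARC cap in GiB.
# Hypothetical helper; it prints the line and changes nothing on disk.
arc_max_option() {
    echo "options zfs zfs_arc_max=$(gib_to_bytes "$1")"
}

# Print the line for a 4GiB cap.
arc_max_option 4
```

Redirecting the output to /etc/modprobe.d/zfs.conf as root makes the cap apply the next time the module is loaded.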

If you are ever interested in viewing the ARC statistics, they are all available in procfs (Listing 4).

Listing 4: ARC Statistics

$ cat /proc/spl/kstat/zfs/arcstats
13 1 0x01 96 26112 26975127196 517243166877
name                            type data
hits                            4    691
misses                          4    254
demand_data_hits                4    0
demand_data_misses              4    0
demand_metadata_hits            4    691
demand_metadata_misses          4    254
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    0
prefetch_metadata_misses        4    0
mru_hits                        4    88
mru_ghost_hits                  4    0
mfu_hits                        4    603
mfu_ghost_hits                  4    0
deleted                         4    0
mutex_miss                      4    0
access_skip                     4    0
evict_skip                      4    0
[ ... ]
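
The raw counters are easier to interpret as a ratio. A minimal sketch follows, assuming the three-column name, type, and value layout shown in Listing 4; the arc_hit_ratio function is a hypothetical helper, not something ZFS ships. It divides hits by the sum of hits and misses, which for the values in Listing 4 (691 and 254) works out to roughly 73 percent. The mru and mfu counters appear to track the first two lists described earlier, and the ghost counters the two eviction lists.

```shell
#!/bin/sh
# Print the overall ARC hit ratio from an arcstats-style file.
# Each data line has three columns: name, type, value.
# With no argument, the live kstat file is read.
arc_hit_ratio() {
    awk '
        $1 == "hits"   { hits = $3 }
        $1 == "misses" { misses = $3 }
        END {
            total = hits + misses
            if (total == 0) { print "n/a"; exit }
            printf "%.2f%%\n", 100 * hits / total
        }
    ' "${1:-/proc/spl/kstat/zfs/arcstats}"
}

# Only query the live file when the ZFS module has created it.
if [ -r /proc/spl/kstat/zfs/arcstats ]; then
    arc_hit_ratio
fi
```

A low ratio on a warmed-up system with a repetitive read workload can be a hint that the ARC cap is too small, although a freshly created pool such as this one tells you very little.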


The L2ARC

ZFS provides another, larger secondary layer for read caching. A larger cache increases your chances of rereading valuable data without hitting the slower device underneath. In ZFS, this is accomplished by adding an SSD to your pool. The Level 2 ARC (L2ARC) hosts entries that are scanned from the “primary” ARC and are next in line to be evicted.

In my configuration, I have created two partitions on a local NVMe device:

$ cat /proc/partitions|grep nvme
 259        0  244198584 nvme0n1
 259        3   97654784 nvme0n1p1
 259        4   96679936 nvme0n1p2

I will be using partition 1 for the L2ARC read cache, so to enable it, I enter:

$ sudo zpool add myvol cache nvme0n1p1

Then, I verify that the cache volume has been added to the pool configuration (Listing 5).

Listing 5: Verify Pool Config 1

$ sudo zpool status
  pool: myvol
 state: ONLINE
  scan: none requested
  myvol        ONLINE       0     0     0
    raidz1-0   ONLINE       0     0     0
      sdc      ONLINE       0     0     0
      sdd      ONLINE       0     0     0
      sde      ONLINE       0     0     0
      sdf      ONLINE       0     0     0
    nvme0n1p1  ONLINE       0     0     0
errors: No known data errors

At the time of writing, the L2ARC contents do not survive a reboot; updates that enable a persistent L2ARC are expected to land in the mainline ZFS code soon.


ZFS Intent Log

The purpose of the ZFS Intent Log (ZIL) is to persistently log synchronous I/O operations to disk before they are written to the main pool. Logging synchronous operations this way ensures that they complete and are persisted to disk before an I/O completion status is returned to the application. You can think of it as a sort of “write cache.” The separate intent log (SLOG) is intended to give this write log a bit of a boost by placing it on an SSD.

Remember the two separate partitions on the local NVMe device? The first partition was used for the L2ARC read cache; now, I will use the second partition for the SLOG write cache.

To add the NVMe device partition as the SLOG to the pool, enter:

$ sudo zpool add myvol log nvme0n1p2

Then, verify that the log device has been added to the pool configuration (Listing 6).

Listing 6: Verify Pool Config 2

$ sudo zpool status
  pool: myvol
 state: ONLINE
  scan: none requested
  myvol        ONLINE       0     0     0
    raidz1-0   ONLINE       0     0     0
      sdc      ONLINE       0     0     0
      sdd      ONLINE       0     0     0
      sde      ONLINE       0     0     0
      sdf      ONLINE       0     0     0
    nvme0n1p2  ONLINE       0     0     0
    nvme0n1p1  ONLINE       0     0     0
errors: No known data errors

Now that you have added the NVMe partitions as the caches for both reads and writes, you can view general and basic metrics of those devices with the zpool iostat command (Listing 7).

Listing 7: View Metrics

$ zpool iostat -v myvol
               capacity     operations     bandwidth 
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
myvol        1.62M  25.2T      0      1     84  15.2K
  raidz1     1.62M  25.2T      0      1     67  13.5K
    sdc          -      -      0      0     16  3.40K
    sdd          -      -      0      0     16  3.39K
    sde          -      -      0      0     16  3.38K
    sdf          -      -      0      0     16  3.37K
logs             -      -      -      -      -      -
  nvme0n1p2      0    92G      0      0    586  56.6K
cache            -      -      -      -      -      -
  nvme0n1p1  16.5K  93.1G      0      0  1.97K    636
-----------  -----  -----  -----  -----  -----  -----


Conclusion

As you can see, ZFS is equipped with an entire arsenal of features that allow it to perform better in mission-critical or demanding high-performance environments. With an active community supporting ZFS, the filesystem is also very likely to continue to see additional features and improvements in the near future.

The Author

Petros Koutoupis is currently a senior performance software engineer at Cray for its Lustre High Performance File System division. He is also the creator and maintainer of the RapidDisk Project. Petros has worked in the data storage industry for well over a decade and has helped to pioneer the many technologies unleashed in the wild today.
