ZFS Tuning for HPC


Operating systems commonly rely on local, volatile memory (e.g., DRAM) to cache file data and have done so for decades, with the ultimate goal of not having to touch the back-end storage device. Waiting for a disk drive to return the requested data can be painfully slow, so operating systems, and in turn filesystems, cache data in the hope of not accessing the underlying device. ZFS implements its own cache, the adaptive replacement cache (ARC), which is not a plain least recently used (LRU) cache. In a standard LRU cache, the least recently used entry is replaced with new cache data. ZFS is a bit more intelligent than this and maintains lists for:

  1. recently cached entries,
  2. recently cached entries that have been accessed more than once,
  3. entries evicted from the list of (1) recently cached entries, and
  4. entries evicted from the list of (2) recently cached entries that have been accessed more than once.
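On ZFS on Linux, each of these four lists has a hit counter in /proc/spl/kstat/zfs/arcstats: mru_hits and mfu_hits for lists 1 and 2, and mru_ghost_hits and mfu_ghost_hits for the evicted ("ghost") lists 3 and 4. The following is a minimal sketch that labels those counters. The function name arc_list_hits and the output labels are my own and not part of ZFS.

```shell
#!/bin/sh
# arc_list_hits is a hypothetical helper (not part of ZFS): read
# arcstats-formatted text on stdin and print the hit counter that
# belongs to each of the four ARC lists, in list order.
arc_list_hits() {
    awk '$1 == "mru_hits"       { mru  = $3 }
         $1 == "mfu_hits"       { mfu  = $3 }
         $1 == "mru_ghost_hits" { mrug = $3 }
         $1 == "mfu_ghost_hits" { mfug = $3 }
         END {
             print "list1_mru_hits="       mru  + 0
             print "list2_mfu_hits="       mfu  + 0
             print "list3_mru_ghost_hits=" mrug + 0
             print "list4_mfu_ghost_hits=" mfug + 0
         }'
}

# Only query the live counters when the ZFS module is loaded.
ARCSTATS=/proc/spl/kstat/zfs/arcstats
if [ -r "$ARCSTATS" ]; then
    arc_list_hits < "$ARCSTATS"
fi
```

A ghost hit means a request arrived for data that had recently been evicted, so rising ghost counters suggest the cache is too small for the working set.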

Caching reads is an extremely difficult task. It is not possible to predict which data will need to remain in cache, and because read I/O profiles are often random, data is very likely to be evicted before it is needed again and then reread into cache.

The amount of memory the ARC can use on your local system can be managed in multiple ways. For instance, to cap it at 4GiB (4,294,967,296 bytes), you can pass that value to the ZFS module with the zfs_arc_max parameter when you load it:

$ sudo modprobe zfs zfs_arc_max=4294967296

Or, you can create a configuration file for modprobe called /etc/modprobe.d/zfs.conf and save the following content in it:

options zfs zfs_arc_max=4294967296
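The zfs_arc_max value is given in bytes, so it can be derived with shell arithmetic rather than typed by hand. The following is a small sketch; the helper name gib_to_bytes is hypothetical and not part of ZFS.

```shell
#!/bin/sh
# gib_to_bytes is a hypothetical helper (not part of ZFS):
# convert a whole number of GiB into bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the options line for a 4GiB cap; this only writes to stdout,
# so redirect it into /etc/modprobe.d/zfs.conf yourself if it looks right.
echo "options zfs zfs_arc_max=$(gib_to_bytes 4)"
```

For a 4GiB cap, this prints the same options line as above.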

You can verify the current setting of this parameter by viewing it under sysfs:

$ cat /sys/module/zfs/parameters/zfs_arc_max

You can also modify that parameter at runtime through the same sysfs interface:

$ echo 4294967296 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
$ cat /sys/module/zfs/parameters/zfs_arc_max
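The raw byte count is hard to read at a glance. The following sketch prints the current limit in GiB; the helper name arc_max_human is hypothetical. As far as I know, a value of 0 means the module is using its built-in default rather than an explicit cap.

```shell
#!/bin/sh
# Path taken from the sysfs commands above.
ARC_MAX_FILE=/sys/module/zfs/parameters/zfs_arc_max

# arc_max_human is a hypothetical helper (not part of ZFS):
# print a byte count as GiB with two decimal places.
arc_max_human() {
    awk -v b="$1" 'BEGIN { printf "%.2f GiB\n", b / (1024 * 1024 * 1024) }'
}

# Only read the parameter when the ZFS module is loaded.
if [ -r "$ARC_MAX_FILE" ]; then
    arc_max_human "$(cat "$ARC_MAX_FILE")"
else
    echo "zfs module not loaded: $ARC_MAX_FILE not found" >&2
fi
```

A value written through sysfs applies only to the running module. The modprobe configuration file is what carries the setting across reboots.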

If you are ever interested in viewing the statistics of the ARC, they are all available in procfs (Listing 4).

Listing 4: ARC Statistics

$ cat /proc/spl/kstat/zfs/arcstats
13 1 0x01 96 26112 26975127196 517243166877
name                            type data
hits                            4    691
misses                          4    254
demand_data_hits                4    0
demand_data_misses              4    0
demand_metadata_hits            4    691
demand_metadata_misses          4    254
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    0
prefetch_metadata_misses        4    0
mru_hits                        4    88
mru_ghost_hits                  4    0
mfu_hits                        4    603
mfu_ghost_hits                  4    0
deleted                         4    0
mutex_miss                      4    0
access_skip                     4    0
evict_skip                      4    0
[ ... ]
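The hits and misses counters are enough to work out an overall ARC hit ratio. The following is a minimal sketch, assuming the three-column name/type/data layout shown in Listing 4; the function name arc_hit_ratio is hypothetical.

```shell
#!/bin/sh
# arc_hit_ratio is a hypothetical helper (not part of ZFS): read
# arcstats-formatted text on stdin and print hits / (hits + misses)
# as a percentage, or "n/a" when no requests have been counted.
arc_hit_ratio() {
    awk '$1 == "hits"   { hits = $3 }
         $1 == "misses" { misses = $3 }
         END {
             total = hits + misses
             if (total > 0) printf "%.2f\n", 100 * hits / total
             else print "n/a"
         }'
}

# Only query the live counters when the ZFS module is loaded.
ARCSTATS=/proc/spl/kstat/zfs/arcstats
if [ -r "$ARCSTATS" ]; then
    arc_hit_ratio < "$ARCSTATS"
fi
```

With the counters from Listing 4 (691 hits, 254 misses), this works out to 691 / 945, or roughly 73 percent.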


ZFS also provides a larger, secondary layer for read caching. A larger cache increases your chances of rereading valuable data without hitting the slower device underneath. In ZFS, this is accomplished by adding an SSD to your pool as a cache device. The Level 2 ARC (L2ARC) hosts entries that are scanned from the “primary” ARC and are next in line to be evicted.

In my configuration, I have created two partitions on a local NVMe device:

$ cat /proc/partitions|grep nvme
 259        0  244198584 nvme0n1
 259        3   97654784 nvme0n1p1
 259        4   96679936 nvme0n1p2

I will use partition 1 for the L2ARC read cache, so to enable it, I enter:

$ sudo zpool add myvol cache nvme0n1p1

Then, I verify that the cache volume has been added to the pool configuration (Listing 5).

Listing 5: Verify Pool Config 1

$ sudo zpool status
  pool: myvol
 state: ONLINE
  scan: none requested
  myvol        ONLINE       0     0     0
    raidz1-0   ONLINE       0     0     0
      sdc      ONLINE       0     0     0
      sdd      ONLINE       0     0     0
      sde      ONLINE       0     0     0
      sdf      ONLINE       0     0     0
    nvme0n1p1  ONLINE       0     0     0
errors: No known data errors
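Once the cache device is in the pool, the same arcstats file reports how the L2ARC is doing. The following sketch assumes the l2_hits, l2_misses, and l2_size counters that ZFS on Linux exposes in /proc/spl/kstat/zfs/arcstats (they fall in the part of that file not shown in the listing above); the function name l2arc_summary is hypothetical.

```shell
#!/bin/sh
# l2arc_summary is a hypothetical helper (not part of ZFS): read
# arcstats-formatted text on stdin and print the L2ARC hit ratio and
# the amount of data the L2ARC currently holds.
# Assumption: the counters are named l2_hits, l2_misses, and l2_size.
l2arc_summary() {
    awk '$1 == "l2_hits"   { hits = $3 }
         $1 == "l2_misses" { misses = $3 }
         $1 == "l2_size"   { size = $3 }
         END {
             total = hits + misses
             if (total > 0) printf "l2 hit ratio: %.2f%%\n", 100 * hits / total
             else print "l2 hit ratio: n/a"
             printf "l2 size: %.2f GiB\n", size / (1024 * 1024 * 1024)
         }'
}

# Only query the live counters when the ZFS module is loaded.
ARCSTATS=/proc/spl/kstat/zfs/arcstats
if [ -r "$ARCSTATS" ]; then
    l2arc_summary < "$ARCSTATS"
fi
```

A freshly added cache device starts out empty, so expect a low ratio until the ARC has had time to feed it.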

Updates that enable a persistent L2ARC, one that survives system reboots, are expected to reach the mainline ZFS code soon.