Isolating and diagnosing the root causes of your performance troubles.

When I/O Workloads Don’t Perform

Every now and then, you find yourself in a situation where you expect better performance from your data storage drives. Either they once performed very well and one day just stopped, or they came straight out of the box underperforming. I explore a few of the reasons why this might happen.

Sometimes, the easiest and quickest way to determine the root cause of a slow drive is to check its local logging data. The method by which this log data is stored will differ by the drive type, but in the end, the results are generally the same. For instance, a SCSI-based drive such as a Serial Attached SCSI (SAS) drive collects drive log data and general metrics in something called the SCSI log pages (plural because each page separates the collected data into its respective category). The easiest way to access this data is by using the sg3_utils package available for Linux. To find out what categories the drive supports, execute the sg_logs binary with the SAS drive or SCSI generic identifier in which you are interested (Listing 1).

Listing 1: sg_logs

$ sudo sg_logs /dev/sdc
    SEAGATE   ST14000NM0001     K001
Supported log pages  [0x0]:
    0x00        Supported log pages [sp]
    0x02        Write error [we]
    0x03        Read error [re]
    0x05        Verify error [ve]
    0x06        Non medium [nm]
    0x08        Format status [fs]
    0x0d        Temperature [temp]
    0x0e        Start-stop cycle counter [sscc]
    0x0f        Application client [ac]
    0x10        Self test results [str]
    0x15        Background scan results [bsr]
    0x18        Protocol specific port [psp]
    0x1a        Power condition transitions [pct]
    0x2f        Informational exceptions [ie]
    0x37        Cache (seagate) [c_se]
    0x38        
    0x3e        Factory (seagate) [f_se]

As you can see, the drive exposes log pages for write and read errors, drive temperature, and more. To select a specific page, use the -p parameter followed by the page code. For instance, look at the log page for write errors (i.e., 0x2; Listing 2).

Listing 2: Log Page for Write Errors

$ sudo sg_logs -p 0x2 /dev/sdc
    SEAGATE   ST14000NM0001     K001
Write error counter page  [0x2]
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 3951500537856
  Total uncorrected errors = 0

Seemingly, this drive has no write errors (neither corrected nor uncorrected by the drive firmware), so it looks to be in good shape. Typically, if you see errors, especially of the uncorrected type, the printout will include failing logical block addresses (LBAs). If the failed LBA regions (i.e., sectors) were listed under the read error category, they would likely be in a pending reallocation state (waiting for a future write to the same address). A sector pending reallocation is a sector that cannot be read from or written to and must be reallocated elsewhere on the disk drive. This reallocation only happens on the next write operation to the failed sector, and only if the drive has spare sectors it can use to relocate the data. A failing sector or a sector pending reallocation by the drive’s firmware will affect overall drive performance, and if enough of them accumulate, you should replace the drive as soon as possible.
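The check described above lends itself to scripting. What follows is a minimal sketch, not a definitive tool: it assumes the counter page prints "name = value" lines in the format of Listing 2, and the function name check_error_counters is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read an sg_logs error counter page on stdin and
# report the corrected and uncorrected totals.
# Assumes the "name = value" line format shown in Listing 2.
check_error_counters() {
    awk -F' = ' '
        /Total errors corrected/   { corrected = $2 }
        /Total uncorrected errors/ { uncorrected = $2 }
        END {
            printf "corrected=%d uncorrected=%d\n", corrected, uncorrected
            # Exit non-zero when the drive reports uncorrected errors.
            exit (uncorrected + 0 > 0 ? 1 : 0)
        }'
}
```

One way to use it would be to pipe a page into the function, for example the output of sudo sg_logs -p 0x2 /dev/sdc, and act on the exit status in a monitoring script.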

Also keep in mind that if a log page starts to list a significant count of corrected read or write errors, the disk drive’s surrounding environment may be at fault. For instance, vibration will often cause a disk drive’s head to misread or miswrite a run of sectors on a drive track, which results in the firmware taking action to correct it. This process alone introduces unwanted I/O latencies and reduces drive performance.

If you’d like to list all of the log pages at once, use the -a parameter (Listing 3). (Warning: You will get a lot of information.)

Listing 3: List All Log Pages

$ sudo sg_logs -a /dev/sdc
    SEAGATE   ST14000NM0001     K001
 
Supported log pages  [0x0]:
    0x00        Supported log pages [sp]
    0x02        Write error [we]
    0x03        Read error [re]
    0x05        Verify error [ve]
    0x06        Non medium [nm]
    0x08        Format status [fs]
    0x0d        Temperature [temp]
    0x0e        Start-stop cycle counter [sscc]
    0x0f        Application client [ac]
    0x10        Self test results [str]
    0x15        Background scan results [bsr]
    0x18        Protocol specific port [psp]
    0x1a        Power condition transitions [pct]
    0x2f        Informational exceptions [ie]
    0x37        Cache (seagate) [c_se]
    0x38
    0x3e        Factory (seagate) [f_se]
 
Write error counter page  [0x2]
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 3951500537856
  Total uncorrected errors = 0
  Reserved or vendor specific [0xf800] = 0
  Reserved or vendor specific [0xf801] = 0
  Reserved or vendor specific [0xf802] = 0
  Reserved or vendor specific [0xf803] = 0
  Reserved or vendor specific [0xf804] = 0
  Reserved or vendor specific [0xf805] = 0
  Reserved or vendor specific [0xf806] = 0
  Reserved or vendor specific [0xf807] = 0
  Reserved or vendor specific [0xf810] = 0
  Reserved or vendor specific [0xf811] = 0
  Reserved or vendor specific [0xf812] = 0
  Reserved or vendor specific [0xf813] = 0
  Reserved or vendor specific [0xf814] = 0
  Reserved or vendor specific [0xf815] = 0
  Reserved or vendor specific [0xf816] = 0
  Reserved or vendor specific [0xf817] = 0
  Reserved or vendor specific [0xf820] = 0
 
Read error counter page  [0x3]
  Errors corrected without substantial delay = 0
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 35801804845056
  Total uncorrected errors = 0
 
Verify error counter page  [0x5]
  Errors corrected without substantial delay = 0
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 0
  Total times correction algorithm processed = 0
  Total bytes processed = 0
  Total uncorrected errors = 0
 
Non-medium error page  [0x6]
  Non-medium error count = 0
 
Format status page  [0x8]
  Format data out: <not available>
  Grown defects during certification <not available>
  Total blocks reassigned during format <not available>
  Total new blocks reassigned <not available>
  Power on minutes since format <not available>
 
Temperature page  [0xd]
  Current temperature = 28 C
  Reference temperature = 60 C
 
Start-stop cycle counter page  [0xe]
  Date of manufacture, year: 2019, week: 26
  Accounting date, year:     , week:   
  Specified cycle count over device lifetime = 50000
  Accumulated start-stop cycles = 498
  Specified load-unload count over device lifetime = 600000
  Accumulated load-unload cycles = 553
 
[ … ]

Other tools, such as smartctl, extract similar and sometimes identical data from a SAS drive. If a drive supports the industry-standard Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.), you can use the smartctl binary from the smartmontools package (Listing 4).

Listing 4: smartctl on SAS Drive

$ sudo smartctl -a /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-66-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST14000NM0001
Revision:             K001
Compliance:           SPC-5
User Capacity:        7,000,259,821,568 bytes [7.00 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x6000c500a7b3ceeb0000000000000000
Serial number:        ZKL00CYG0000G925020A
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Sun Mar 21 15:00:00 2021 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported
 
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
 
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     28 C
Drive Trip Temperature:        60 C
 
Manufactured in week 26 of year 2019
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  498
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  553
Elements in grown defect list: 0
 
Vendor (Seagate Cache) information
  Blocks sent to initiator = 150743545
  Blocks received from initiator = 964465354
  Blocks read from cache and sent to initiator = 1014080851
  Number of read and write commands whose size <= segment size = 8318611
  Number of read and write commands whose size > segment size = 12
 
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 264.67
  number of minutes until next internal SMART test = 48
 
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      35801.805           0
write:         0        0         0         0          0       3951.501           0
 
Non-medium error count:        0
 
No Self-tests have been logged

The smartmontools package is most beneficial for Serial ATA (SATA) drives, because most SATA drives support S.M.A.R.T. out of the box. Note that the S.M.A.R.T. output, such as the type of data and the way it is formatted, differs between SATA drives and their SAS counterparts (Listing 5).

Listing 5: smartctl on SATA Drive

$ sudo smartctl -a /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-66-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
 
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES (SATA 6Gb/s)
Device Model:     ST500NM0011
Serial Number:    Z1M11WAJ
LU WWN Device Id: 5 000c50 04edcb79a
Add. Product Id:  DELL(tm)
Firmware Version: PA08
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Mar 21 15:00:38 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
          was completed without error.
          Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
          without error or no self-test has ever 
          been run.
Total time to complete Offline 
data collection:    (  609) seconds.
Offline data collection
capabilities:        (0x7b) SMART execute Offline immediate.
          Auto Offline data collection on/off support.
          Suspend Offline collection upon new
          command.
          Offline surface scan supported.
          Self-test supported.
          Conveyance Self-test supported.
          Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
          power-saving mode.
          Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
          General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (  75) minutes.
Conveyance self-test routine
recommended polling time:    (   3) minutes.
SCT capabilities:          (0x10bd) SCT Status supported.
          SCT Error Recovery Control supported.
          SCT Feature Control supported.
          SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   077   063   ---    Pre-fail  Always       -       56770409
  3 Spin_Up_Time            0x0003   096   092   ---    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   ---    Old_age   Always       -       137
  5 Reallocated_Sector_Ct   0x0033   100   100   ---    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   067   060   ---    Pre-fail  Always       -       5578572
  9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       400
 10 Spin_Retry_Count        0x0013   100   099   ---    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       135
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   ---    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   058   ---    Old_age   Always       -       28 (Min/Max 22/28)
191 G-Sense_Error_Rate      0x0032   100   100   ---    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   ---    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   100   100   ---    Old_age   Always       -       487
194 Temperature_Celsius     0x0022   028   042   ---    Old_age   Always       -       28 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   113   099   ---    Old_age   Always       -       56770409
197 Current_Pending_Sector  0x0012   100   100   ---    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   ---    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   ---    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   ---    Old_age   Offline      -       344 (218 109 0)
241 Total_LBAs_Written      0x0000   100   253   ---    Old_age   Offline      -       1389095282
242 Total_LBAs_Read         0x0000   100   253   ---    Old_age   Offline      -       619165492
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         2         -
# 2  Extended offline    Completed without error       00%         2         -
 
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

For the most part, the information is the same. For instance, in the drive attributes table, attribute 197 (Current_Pending_Sector) counts the sectors pending reallocation discussed earlier. Again, you can gather drive temperature information, lifetime hours, and more.
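To keep an eye on the reallocation-related attributes without reading the whole report, a small filter can help. This is a sketch under assumptions: it expects the ten-column attribute table of Listing 5, the function name pending_sector_report is hypothetical, and the choice of attributes 5, 197, and 198 is mine rather than a rule from the drive vendor.

```shell
#!/bin/sh
# Hypothetical helper: read "smartctl -a" output on stdin and print the raw
# values of attributes 5, 197, and 198.
# Assumes the ten-column attribute table layout shown in Listing 5.
pending_sector_report() {
    awk '
        # NF >= 10 skips shorter lines that also start with 5, such as
        # the rows of the selective self-test table.
        NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198) {
            printf "%s %s=%s\n", $1, $2, $10
            if ($10 + 0 > 0) bad = 1
        }
        # Exit non-zero if any of the raw counts is above zero.
        END { exit bad + 0 }'
}
```

Piping the output of sudo smartctl -a /dev/sda into the function would print one line per matching attribute; a non-zero exit status suggests the drive deserves a closer look.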

How About CPU and Drive Utilization?

Now you have checked all your drives, but for some reason, they are still not performing as expected. The next step should be to determine whether drive utilization is too high or the CPU is too busy and is having a difficult time keeping up with I/O requests. The sysstat package provides a nice little utility called iostat that gathers both sets of data. In the example in Listing 6, iostat is showing an extended set of metrics and the CPU utilization at two-second intervals.

Listing 6: iostat Output

$ iostat -x -d 2 -c
Linux 5.4.12-050412-generic (dev-machine)     03/14/2021     _x86_64_    (4 CPU)
 
avg-cpu:  %user   %nice %system %iowait  %steal  %idle
           0.79    0.07    1.19    2.89    0.00  95.06
 
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda             10.91    6.97    768.20    584.64     4.87    18.20  30.85  72.31   13.16   20.40   0.26    70.44    83.89   1.97   3.52
nvme0n1         58.80   12.22  17720.47     48.71   230.91     0.01  79.70   0.08    0.42    0.03   0.00   301.34     3.98   1.02   7.24
sdb              0.31   55.97      4.13  17676.32     0.00   231.64   0.00  80.54    2.50    8.47   0.32    13.45   315.84   1.30   7.32
sdc              0.24    0.00      3.76      0.00     0.00     0.00   0.00   0.00    2.47    0.00   0.00    15.64     0.00   1.03   0.02
sde              2.47    0.00     62.57      0.00     0.00     0.00   0.00   0.00    0.63    0.00   0.00    25.34     0.00   0.29   0.07
sdf              1.51    0.00     32.42      0.00     0.00     0.00   0.00   0.00    0.69    0.00   0.00    21.40     0.00   0.31   0.05
sdd              1.42    0.00     50.96      0.00     0.00     0.00   0.00   0.00    0.44    0.00   0.00    35.83     0.00   0.38   0.05
md0             12.43   12.17     54.39     48.68     0.00     0.00   0.00   0.00    0.00    0.00   0.00     4.37     4.00   0.00   0.00
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.76    0.00    3.03    1.26    0.00   94.95
 
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda              0.00    9.00      0.00     88.00     0.00     8.00   0.00  47.06    0.00   30.83   0.26     0.00     9.78   0.67   0.60
nvme0n1       2769.50 2682.00  29592.00  10723.25   241.00     0.00   8.01   0.00    0.11    0.02   0.01    10.68     4.00   0.14  77.60
sdb              0.00 2731.00      0.00  27814.00     0.00   241.00   0.00   8.11    0.00   12.20  30.13     0.00    10.18   0.29  79.40
sdc              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sde              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdf              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdd              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
md0           2717.50 2679.00  10870.00  10716.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     4.00     4.00   0.00   0.00
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.51    0.00    2.42    0.00    0.00   97.07
 
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
nvme0n1       2739.00 2747.50  27336.00  10988.50   210.00     0.00   7.12   0.00    0.12    0.02   0.00     9.98     4.00   0.14  77.20
sdb              0.00 2797.50      0.00  28270.00     0.00   210.00   0.00   6.98    0.00   11.75  29.38     0.00    10.11   0.28  78.80
sdc              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sde              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdf              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdd              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
md0           2688.00 2746.50  10752.00  10986.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     4.00     4.00   0.00   0.00

The first report should probably be ignored: it shows averages accumulated since the system booted rather than activity during the sampling interval, so the numbers look a bit off. From the second interval onward, you see a more accurate picture of each drive, including reads per second (r/s), writes per second (w/s), the average time in milliseconds that read (r_await) and write (w_await) requests take to complete, the percentage of time the drive was busy servicing requests (%util), and more. The higher the %util number, the busier the drive is likely to be completing I/O requests. If that number is high, you might need to consider methods either to throttle the amount of I/O sent to the drive or to balance the same I/O across multiple drives (e.g., in a RAID 0, 5, or 6 configuration).
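Spotting busy drives in a long iostat run can be automated. The sketch below is illustrative only: the function name busy_devices and the default threshold of 75 are assumptions, and the %util column is looked up by name in each header because its position differs between sysstat versions.

```shell
#!/bin/sh
# Hypothetical helper: read "iostat -x" output on stdin and print every
# device line whose %util exceeds a limit (first argument, default 75).
busy_devices() {
    awk -v limit="${1:-75}" '
        # A blank line ends a device table, so forget the column position.
        NF == 0 { col = 0; next }
        # Locate the %util column by name in the device header.
        $1 == "Device" || $1 == "Device:" {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 > limit { print $1, $col }'
}
```

Fed with the output of iostat -x -d 2, the function would report nvme0n1 and sdb for the second and third intervals of Listing 6. Note that it does not skip the first, since-boot report on its own.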

Also, notice the average CPU metrics at the top of each interval. Here, you will find a breakdown of how much CPU time is spent performing tasks, waiting on outstanding I/O (%iowait), idling, and so on. The less idle time the CPUs have, the more likely drive performance is to suffer.
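If you only care about the I/O wait and idle figures, they can be pulled out of the avg-cpu blocks. This is a rough sketch that assumes the two-line avg-cpu layout of Listing 6; the function name cpu_wait_idle is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read iostat output on stdin and print the %iowait
# and %idle values of every avg-cpu block.
cpu_wait_idle() {
    awk '
        $1 == "avg-cpu:" {
            # The header has one extra leading field ("avg-cpu:"), so the
            # value columns on the next line are shifted left by one.
            for (i = 2; i <= NF; i++) {
                if ($i == "%iowait") w = i - 1
                if ($i == "%idle")   d = i - 1
            }
            # The values follow on the line after the header.
            if ((getline) > 0) printf "iowait=%s idle=%s\n", $w, $d
        }'
}
```

Run against Listing 6, this would print one line per interval, which makes a trend in %iowait easier to see than scanning the full report.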

You can view a real-time, per-core breakdown of CPU usage with the top utility. After opening the top application at the command line, press the 1 key (Listing 7).

Listing 7: top Output

top - 19:44:01 up 15 min,  3 users,  load average: 1.08, 0.68, 0.42
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.7 us,  1.4 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu1  :  0.3 us,  3.1 sy,  0.0 ni, 95.9 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu2  :  0.3 us,  1.7 sy,  0.0 ni, 97.6 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu3  :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7951.2 total,   6269.5 free,    210.9 used,   1470.8 buff/cache
MiB Swap:   3934.0 total,   3934.0 free,      0.0 used.   7079.6 avail Mem 
 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 3294 root      20   0  748016   4760    988 S  13.3   0.1   0:01.19 fio
 3155 root      20   0       0      0      0 D   2.3   0.0   0:14.55 md0_resync
   18 root      20   0       0      0      0 S   0.3   0.0   0:00.05 ksoftirqd/1
 3152 root      20   0       0      0      0 S   0.3   0.0   0:05.38 md0_raid1
 3284 petros    20   0    9496   4080   3356 R   0.3   0.1   0:00.04 top
 3286 root      20   0  813548 428684 424916 S   0.3   5.3   0:00.37 fio

Enough Free Memory?

If the CPU is not the problem and the drives are being underutilized, do you have a constraint on memory resources? The easiest and quickest way to check is with free:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7951         201        7037           1         712        7493
Swap:          4095           0        4095

The free utility dumps the amount of total, used, and free memory on the system, but it also shows how much is used for buffers and temporary caches (buff/cache) and how much is available to applications, including reclaimable memory (available). With -m, the output is in mebibytes (the -g argument reports gibibytes and -k kibibytes) and matches the data found in /proc/meminfo. Note that the /proc/meminfo values only look different because they are reported in units of 1,024 bytes (labeled kB):

$ grep "^Mem" /proc/meminfo
MemTotal:        8142012 kB
MemFree:         7204048 kB
MemAvailable:    7672148 kB
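To confirm that the two outputs agree, divide the /proc/meminfo values by 1,024. The following sketch does the conversion; the function name meminfo_mib is hypothetical, and the optional file argument exists only so the function can be tried on saved output.

```shell
#!/bin/sh
# Hypothetical helper: print the Mem* lines of /proc/meminfo (or of the
# file given as first argument) converted from kB to whole MiB.
meminfo_mib() {
    awk '/^Mem/ { printf "%s %d MiB\n", $1, $2 / 1024 }' "${1:-/proc/meminfo}"
}
```

For the values shown above, MemTotal works out to 8142012 / 1024 = 7951 MiB, which matches the total column of free -m. The free and available figures differ slightly between the two outputs because they were sampled at different moments.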

When free memory drops drastically while available memory remains comparatively high, a lot of memory is being consumed by the system and its applications, but a share of it sits in temporary caches that can be reclaimed when the operating system is under memory pressure. If the system begins to reclaim memory, it will affect overall system performance. You may also observe kswapd or swapper messages in the dmesg or syslog output, indicating that the kernel is hard at work freeing up reclaimable memory. If both free and available memory decrease, the system has less memory to reclaim, so when an application asks for more memory, the page allocation can fail, likely forcing the application to exit early. This condition is typically accompanied by an entry in dmesg or syslog notifying the system administrator or user that a page allocation failure has occurred.
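A single number that tracks this condition is the share of total memory that is still available. The sketch below computes it from /proc/meminfo; the function name available_pct is hypothetical, and any alert threshold you attach to it is a judgment call for your workload.

```shell
#!/bin/sh
# Hypothetical helper: print MemAvailable as a percentage of MemTotal,
# read from /proc/meminfo or from the file given as first argument.
available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { if (total > 0) printf "%.1f\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}
```

Sampling this value periodically shows whether available memory is shrinking over time, which is the situation that precedes allocation failures.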

If the system is struggling to find memory resources to serve applications and I/O requests, you need to identify the greatest offender(s) and attempt to resolve the problem. In a worst-case scenario, an application may contain a memory leak and not properly free unused memory, consuming more and more system memory that cannot be reclaimed. This scenario needs to be addressed in the application’s code. A simple way to observe the greatest consumers of your memory resources is with the ps utility (Listing 8).

Listing 8: ps Output

$ ps aux | head -1; ps aux | sort -rnk 4 | head -9
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        5088  5.0  5.2 815852 429928 pts/2   Sl+  15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root        5097  2.2  0.4 783092 38248 ?        Ds   15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root        5096  2.1  0.4 783088 38168 ?        Ds   15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root        5095  2.0  0.4 783084 38204 ?        Ds   15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root        5094  2.1  0.4 783080 38236 ?        Ds   15:35   0:00 fio --bs=1M --ioengine=libaio --iodepth=32 --size=10g --direct=0 --runtime=60 --filename=/dev/sdf --rw=randread --numjobs=4 --name=test
root        1421  0.2  0.3 1295556 29472 ?       Ssl  14:51   0:05 /usr/lib/snapd/snapd
root         990  0.0  0.2 107888 16912 ?        Ssl  14:50   0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root         844  0.0  0.2 280452 18256 ?        SLsl 14:50   0:00 /sbin/multipathd -d -s
systemd+     912  0.0  0.1  24092 10612 ?        Ss   14:50   0:00 /lib/systemd/systemd-resolved

The fourth column (%MEM) is the column on which you should focus. In this example, the fio utility is consuming 5.2% of the system memory, and as soon as it exits, it will free that memory back into the larger pool of available memory for future use.
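When an application runs as several processes, as fio does here, it can help to add up resident memory per command name. The following is a sketch only: the function name rss_by_command is hypothetical, and it expects two columns on stdin, the resident set size in kilobytes followed by the command name.

```shell
#!/bin/sh
# Hypothetical helper: read "RSS command" pairs on stdin and print the
# summed RSS per command, largest first.
rss_by_command() {
    awk '{
            r = $1
            $1 = ""              # drop the RSS field, keep the command name
            sub(/^ +/, "")
            rss[$0] += r
         }
         END { for (c in rss) printf "%d %s\n", rss[c], c }' | sort -rn
}
```

Suitable input comes from ps -eo rss=,comm=, where the trailing equals signs suppress the header line. Summing RSS counts shared pages once per process, so treat the totals as an upper bound.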

Another option worth considering is tuning the kernel’s virtual memory subsystem with the sysctl utility. A guide to the tunable parameters is found in the Documentation/admin-guide/sysctl/vm.rst file of the Linux kernel source. Tunables include filesystem buffering and caching thresholds, memory swapping, and others.
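Before changing anything, it is worth recording the current values. The sketch below reads a handful of commonly discussed tunables; the selection of four parameters and the function name show_vm_tunables are my own illustration, not a recommendation from the kernel documentation.

```shell
#!/bin/sh
# Hypothetical helper: print a few vm.* tunables from /proc/sys/vm (or from
# the directory given as first argument). Entries that are missing or
# unreadable are skipped.
show_vm_tunables() {
    dir="${1:-/proc/sys/vm}"
    for name in swappiness dirty_ratio dirty_background_ratio vfs_cache_pressure; do
        if [ -r "$dir/$name" ]; then
            printf 'vm.%s = %s\n' "$name" "$(cat "$dir/$name")"
        fi
    done
}
```

A value can then be changed at run time with sysctl -w (e.g., sudo sysctl -w vm.swappiness=10, where 10 is only an example) and made persistent through a file under /etc/sysctl.d/. Test any change under a representative workload before keeping it.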