Most storage devices have SMART capability, but can it help you predict failure? We look at ways to take advantage of this built-in monitoring technology with the smartctl utility from the Linux smartmontools package.

SMART Devices

S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) is a monitoring system for storage devices that reports information about the status of a device and lets you run self-tests. Administrators can use it to check the health of their storage devices and periodically run self-tests to determine the state of a drive.

IBM was the first company to add some monitoring and information capability to their drives in 1992. Other vendors followed suit, and Compaq led an effort to standardize the approach to monitoring drive health and reporting it. This push for standardization led to S.M.A.R.T. (Although S.M.A.R.T. is the correct abbreviation, it’s not nearly as easy to type, so I will be using SMART throughout the remainder of the article.)

Over time, SMART capability has been added to many drives, including PATA, SATA, and the many varieties of SCSI, SAS, and solid-state drives, as well as NVM Express (commonly referred to as NVMe) and even eMMC drives. The standard specifies that the drive measures the appropriate health parameters and then makes the results available to the operating system or other monitoring tools. However, each drive vendor is free to decide which parameters are monitored and what their thresholds are (i.e., the points at which the drive has “failed”). Note that I use “drive” as a generic term for a storage device in this article.

For a drive to be considered “SMART,” all it needs is the ability to communicate between its internal sensors and the host computer. Nothing in the standard defines which sensors are in the drive or how the data is exposed to the user. At the lowest level, however, SMART provides a single binary piece of information: the drive is OK or the drive has failed. This bit of information is called the SMART status. Often, the output DISK FAILING doesn’t indicate that the drive has actually failed, only that it might no longer meet its specifications.
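If you want a quick look at this pass/fail status on a Linux system, smartctl (discussed in more detail below) can report it directly. A minimal example, assuming the drive shows up as /dev/sda (adjust the device name for your system):

   # Query the overall SMART health status (the pass/fail bit)
   sudo smartctl -H /dev/sda

A healthy drive reports PASSED; a FAILED result means the drive has tripped one of its internal thresholds, and its data should be backed up without delay.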

It is fairly safe to assume that all modern drives have SMART attributes in addition to the SMART status. These attributes are completely up to the drive manufacturers and consequently are not standardized, so each type of drive has to be queried for its particular SMART attributes and possible values. Drives can also run self-tests and store the results in a self-test log, which can be read to track the state of the drive over time. You can trigger these self-tests yourself, and they report whether the drive PASSED or FAILED (more on this later).
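As a preview of the smartctl commands covered later in the article, the following sketch shows one way to pull each of these pieces of information from a drive (again assuming the device is /dev/sda):

   # Dump the drive's vendor-specific SMART attribute table
   sudo smartctl -A /dev/sda

   # Kick off a short self-test (it runs in the background on the drive)
   sudo smartctl -t short /dev/sda

   # Read the self-test log once the test has completed
   sudo smartctl -l selftest /dev/sda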

For some SMART attributes, a lower value is better; for others, a higher value is better. You have to examine each attribute and decide which is the case (or consult the drive manufacturer’s specifications). One difficulty in reading SMART attributes is that the threshold values beyond which the drive will not pass under ordinary conditions might not be published by the manufacturer. Moreover, each attribute returns both a raw measurement value, whose meaning is determined by the drive manufacturer, and a normalized value in the range 1 to 253. How the raw value is normalized is also completely up to the manufacturer, so you can see that it’s not always easy to collect SMART attributes from various drives or to interpret their values. Examples of some SMART attributes are listed in the article about SMART on Wikipedia, along with the typical meaning of their raw values.
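Because the normalized value and its threshold appear side by side in the attribute table, a little shell can flag any attribute that is getting close to its threshold. The snippet below is only a sketch: it assumes the typical ATA attribute layout printed by smartctl -A (columns ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, ...), and the 10-point margin is an arbitrary choice, not a vendor recommendation.

   # Flag attributes whose normalized VALUE is within 10 points of THRESH
   sudo smartctl -A /dev/sda | awk '
       $1 ~ /^[0-9]+$/ && $6 > 0 && ($4 - $6) <= 10 {
           printf "attribute %s (%s): value %s, threshold %s\n", $1, $2, $4, $6
       }'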

S.M.A.R.T. Attribute Drive Failure

One would think that you could predict failure with many of the SMART attributes. For example, if a drive was running too hot or if bad sectors were developing quickly, you might think the drive would be more susceptible to failure. Perhaps, then, you can use the attributes with some general models of drive failure to predict when drives might fail and then work to minimize the damage.

However, the use of SMART attributes for predicting drive failure has been a difficult proposition. In 2007, a Google study examined more than 100,000 drives of various types for correlations between failure and SMART values. The disks were a combination of consumer-grade drives (SATA and PATA) with speeds from 5,400 to 7,200rpm and capacities ranging from 80 to 400GB. Several drive manufacturers were represented in the population, with at least nine different models in total. The data in the study was collected over a nine-month window.

In the study, the authors monitored the SMART attributes of the population of drives and noted which drives failed. Google chose the word “fail” to mean that the drive was not suitable for use in production, even if the drive tests were good (sometimes the drive would test fine but immediately fail in production). From their study, the authors concluded:

Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.

However, despite the overall message that they had difficulty developing correlations, they did find some interesting trends:

  • In discussing the correlation between SMART attributes and failure rates, one of the best summaries in the paper stated, “Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives.”
  • Temperature effects are interesting, in that high temperatures start affecting older drives (3–4 years old or older), but lower temperatures can also increase the failure rate of drives, regardless of age.
  • A section of the final paragraph of the paper bears repeating here: “We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors.”
  • Summing all observed factors that contributed to drive failure, such as errors or temperature, they still missed about 36% of drive failures.

The paper provides some good insight into the failure behavior of a large population of drives. As mentioned previously, the authors did observe some correlation between scan errors and drive failure, but that didn’t account for all failures, a large fraction of which showed no SMART error signals at all. It’s also worth repeating the observation in the last paragraph that “... SMART models are more useful in predicting trends for large aggregate populations than for individual components.” However, this should not deter you from watching SMART error signals and attributes to track the history of the drives in your systems. Again, there appears to be some correlation between scan errors and drive failure, which might be useful in your environment.
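If you want to keep an eye on the attributes most often associated with the paper’s “strong signals,” a small sketch like the one below can help. Note that the mapping of the paper’s signal names to specific attribute IDs is my own assumption: IDs 5 (Reallocated_Sector_Ct), 196 (Reallocated_Event_Count), 197 (Current_Pending_Sector), and 198 (Offline_Uncorrectable) are the attributes most drives use for reallocation and pending-sector counts, but not every drive reports all of them, and /dev/sda is just an example device.

   # Print the raw values of the reallocation- and pending-sector attributes;
   # any non-zero value is worth watching over time
   sudo smartctl -A /dev/sda | awk '$1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 {
       printf "%-4s %-25s raw=%s\n", $1, $2, $10
   }'

Running such a check periodically (e.g., from cron) and logging the output gives you the kind of per-drive history mentioned above.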

In a more recent study, published in 2016, Microsoft and Pennsylvania State University examined SSD failures in data centers. Over nearly three years, they examined about 500,000 SSDs from five very large data centers and several edge data centers. The drives were used in a variety of workloads, including big data analytics, content distribution caches, data center management software, and web search functions (indexing, multimedia, object store, advertisement, etc.). The big data analytics workload was more write heavy than read heavy, and the other three workloads were more read heavy than write heavy.

For all the drives, failure data was gathered, along with other possible influencing factors, including design, provisioning, and workload evolution data (read/write volumes, write amplification, etc.), captured at fine spatial (data center, rack, and server location) and temporal resolution. SMART attributes for the drives were also captured.

Some of the primary conclusions included:

  • The annualized failure rate (AFR) for some drive models is much higher than quoted in SSD specifications – as much as 70% higher.
  • Four SMART attributes are most important in determining drive failure:
    • Data errors (uncorrectable and cyclic redundancy check [CRC])
    • Sector reallocations
    • Program/erase failures
    • SATA downshift (a downgrade to a lower signaling rate with an increase in errors)
  • Uncorrectable bit errors are at least an order of magnitude higher than the target rates.
  • Symptoms captured by SMART are more likely to precede SSD failure, “with an intense manifestation preventing their survivability beyond a few months. However, our analysis shows that these symptoms are not a sufficient indicator for diagnosing failures.”
  • Drive symptoms (i.e., data errors and reallocated sectors) have a direct effect on failures.
  • Design/provisioning factors (e.g., device model) can affect failure rates.
  • Devices are more likely to fail in less than a month after their symptoms match failure signatures.
  • The AFR increases two to four times with an increase in average writes per day for some drive types.

Using machine learning techniques, the researchers were able to rank the importance of SMART parameters (Table 1).

Table 1: SMART Parameter Ranking

Category         Feature           Importance
Symptom          DataErrors        1
Symptom          ReallocSectors    0.943
Device workload  TotalNANDWrite    0.526
Device workload  HostWrite         0.517
Device workload  TotalReads+Write  0.516
Device workload  AvgMemory         0.504
Device workload  AvgSSDSpace       0.493
Device workload  UsagePerDay       0.491
Device workload  TotalReads        0.475
Device workload  ReadsPerDay       0.469

Getting to SMART Data and Self-Tests

Fortunately, Linux has a great tool, smartmontools, that takes advantage of SMART by letting you interact with storage devices that implement the protocol. Smartmontools lets you collect SMART attribute information, control self-tests on a drive, and read its logs. Derived from and expanding on an earlier project, smartsuite, from the University of California at Santa Cruz, smartmontools incorporated later standards and additional features. The tool is compatible with all SMART features and supports ATA, ATAPI, and SATA-3 to -8 disks, as well as SCSI disks and tape devices, NVMe, solid-state, and eMMC devices. It also supports the major Linux RAID cards, which sometimes present difficulties because they require vendor-specific I/O control commands. Check the smartmontools web page for details on your specific RAID card.
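If you are not sure how your devices are presented, smartctl can enumerate them, and for drives hidden behind a RAID controller you pass the controller type with -d. The example below is a sketch: the megaraid type and the drive number 0 are placeholders that depend on your hardware, so check the smartmontools RAID pages for the right syntax for your card.

   # List the storage devices smartctl can see, with a suggested -d type for each
   sudo smartctl --scan

   # Query the first physical drive behind a MegaRAID controller
   sudo smartctl -i -d megaraid,0 /dev/sda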

Smartmontools is easy to build and easy to use. In the interest of brevity, I just downloaded the latest 64-bit binary from the website. In the smartmontools package, the key binary is smartctl, which allows interaction with the SMART attributes of drives. For this article, I tested a Samsung 840 SSD on an office desktop running Ubuntu 18.04.
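If you would rather use your distribution’s package manager than download a binary, smartmontools is packaged for most major distributions; on Ubuntu 18.04, for example, it is a one-line install:

   # Install smartmontools from the Ubuntu repositories and check the version
   sudo apt install smartmontools
   smartctl --version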

Once smartmontools is installed, the first thing to do is query each drive with the -i (info) option (Listing 1), which reports whether the drive is SMART capable.