Finding and recording memory errors


Analysis Tool

The second tool can be as simple or as complex as desired. A basic function would plot the error rate for each DIMM, grouped by host and memory controller, as a function of time. Additionally, the memory errors could be summed for each host and the per-host error rate plotted versus time.
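
The per-host summation could be sketched as follows. This is only an illustration: the row layout (timestamp, host, memory controller, DIMM, error count) and the function name sum_errors_by_host are assumptions made for the example, not the format produced by the collection tool.

```python
from collections import defaultdict


def sum_errors_by_host(rows):
    """Sum error counts over all DIMMs for each (host, timestamp) pair.

    Assumed (hypothetical) row layout:
    (timestamp, host, memory_controller, dimm, error_count), all strings,
    as they would come out of csv.reader.
    """
    totals = defaultdict(int)
    for timestamp, host, _mc, _dimm, count in rows:
        totals[(host, timestamp)] += int(count)
    return dict(totals)


# Made-up sample data for illustration only
sample_rows = [
    ("1000", "node01", "mc0", "dimm0", "2"),
    ("1000", "node01", "mc0", "dimm1", "1"),
    ("1000", "node02", "mc0", "dimm0", "0"),
    ("2000", "node01", "mc0", "dimm0", "5"),
]
host_totals = sum_errors_by_host(sample_rows)
for (host, timestamp), total in sorted(host_totals.items()):
    print(host, timestamp, total)
```

The resulting dictionary can be handed to any plotting library, with one curve per host.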

Another use of the tool would be to conduct a statistical analysis of the error rates to uncover trends in the historical data. It could be as simple as computing the average and standard deviation of the error rate over time (looking to see if the error rates are increasing or decreasing) or as complex as examining the error rates as functions of time or location in the data center.
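
A minimal sketch of the simple end of that range follows, using only the Python standard library. The function name summarize_rates and the sample values are hypothetical; the slope of a least-squares line through the points is used here as one possible indicator of whether the error rate is rising or falling.

```python
import statistics


def summarize_rates(times, rates):
    """Return (mean, sample standard deviation, least-squares slope).

    A positive slope suggests the error rate is increasing over time,
    a negative slope that it is decreasing. Needs at least two points
    with distinct times.
    """
    mean_rate = statistics.mean(rates)
    stdev_rate = statistics.stdev(rates)
    mean_time = statistics.mean(times)
    numerator = sum((t - mean_time) * (r - mean_rate)
                    for t, r in zip(times, rates))
    denominator = sum((t - mean_time) ** 2 for t in times)
    return mean_rate, stdev_rate, numerator / denominator


# Made-up sample data: error rate observed at four scan times
times = [0, 1, 2, 3]
rates = [1.0, 2.0, 3.0, 4.0]
mean_rate, stdev_rate, slope = summarize_rates(times, rates)
print(mean_rate, stdev_rate, slope)
```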

The code in Listing 5 is a very simple Python script that reads the CSV file and creates a list of lists (like a 2D array). Although the code is short, it illustrates how easy it is to read the CSV data. From this point, error rates can be computed along with all sorts of statistical analyses and graphing.

Listing 5

Reading the Scanned Data

import csv

# ===================
# Main Python section
# ===================
if __name__ == '__main__':
    # newline='' lets the csv module handle line endings itself (Python 3)
    with open('file.txt', newline='') as f:
        reader = csv.reader(f)
        data_list = list(reader)  # one list of strings per CSV row
    # end with
    print(data_list)
# end if
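
Once the data is in memory, an error rate can be derived from successive scans. The sketch below assumes that each sample is a (timestamp in seconds, cumulative error count) pair for a single DIMM and that the counter restarts from zero after a reboot. Both are assumptions for illustration, as is the function name error_rates.

```python
def error_rates(samples):
    """Convert cumulative counts into errors per second between scans.

    samples: list of (timestamp_seconds, cumulative_count), sorted by time.
    Intervals where the counter went backward (for example, after a
    reboot) or where no time elapsed are skipped.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or c1 < c0:
            continue
        rates.append((t1, (c1 - c0) / elapsed))
    return rates


# Made-up hourly samples; the count drops at 10800 s to mimic a reboot
samples = [(0, 0), (3600, 2), (7200, 2), (10800, 1), (14400, 4)]
rates = error_rates(samples)
for timestamp, rate in rates:
    print(timestamp, rate)
```

Skipping the interval that spans a counter reset loses one data point but avoids reporting a negative rate.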

Parting Words

As mentioned in the article about how to kill a supercomputer, memory errors, whether correctable or uncorrectable, can lead to problems. Tracking error rates over time is therefore an important part of system monitoring.

A huge "thank you" is owed to Dr. Tommy Minyard at the University of Texas Advanced Computing Center (TACC) and to Dr. James Cuff and Dr. Scott Yockel at Harvard University, Faculty of Arts and Sciences Research Computing (FAS RC), for their help with access to systems used for testing.
