Finding and Recording Memory Errors

Analysis Tool

The second tool can be as simple or as complex as desired. A basic function would plot the error rate for each DIMM, grouped by host and memory controller, as a function of time. Additionally, the memory errors could be summed per host and the resulting per-host error rate plotted against time.
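The per-host summation could be sketched as follows. This is only an illustration: the column order (timestamp, host, memory controller, DIMM, error count) and the sample values are assumptions, and the real layout depends on what the collection tool writes.

```python
from collections import defaultdict

# Hypothetical rows as csv.reader would return them (all fields are strings).
# Assumed column order: timestamp, host, memory controller, DIMM, error count.
rows = [
    ["1000", "node1", "mc0", "dimm0", "2"],
    ["1000", "node1", "mc0", "dimm1", "1"],
    ["1000", "node2", "mc0", "dimm0", "0"],
    ["2000", "node1", "mc0", "dimm0", "5"],
    ["2000", "node1", "mc0", "dimm1", "1"],
    ["2000", "node2", "mc0", "dimm0", "3"],
]


def sum_errors_per_host(rows):
    """Return {host: {timestamp: total error count across all DIMMs}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for timestamp, host, _mc, _dimm, count in rows:
        totals[host][int(timestamp)] += int(count)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {host: dict(series) for host, series in totals.items()}


host_totals = sum_errors_per_host(rows)
for host, series in sorted(host_totals.items()):
    print(host, sorted(series.items()))
```

Each per-host series can then be handed to whatever plotting library is preferred, with the timestamps on the x-axis.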

Another useful function of the analysis tool would be to conduct a statistical analysis of the error rates to uncover trends in the historical data. It could be as simple as computing the average and standard deviation of the error rate over time (looking to see if the error rates are increasing or decreasing) or as complex as examining the error rates as functions of time or location in the data center.
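As a minimal sketch of the simple end of that range, the following computes the mean and standard deviation of a series of error rates. Because those two numbers alone do not show direction, it also fits a least-squares slope, which is one possible way (not the only one) to judge whether the rate is rising or falling. The function name and the sample values are hypothetical.

```python
import statistics


def summarize(times, rates):
    """Return mean, sample standard deviation, and least-squares slope."""
    mean_rate = statistics.mean(rates)
    stdev_rate = statistics.stdev(rates)  # needs at least two samples
    mean_time = statistics.mean(times)
    # Ordinary least-squares slope: covariance(t, r) / variance(t).
    numerator = sum((t - mean_time) * (r - mean_rate)
                    for t, r in zip(times, rates))
    denominator = sum((t - mean_time) ** 2 for t in times)
    slope = numerator / denominator
    return mean_rate, stdev_rate, slope


# Hypothetical samples: time in days, error rate in errors per day.
times = [0, 1, 2, 3, 4]
rates = [1.0, 2.0, 2.0, 3.0, 4.0]
mean_rate, stdev_rate, slope = summarize(times, rates)
print(mean_rate, stdev_rate, slope)
```

A positive slope suggests an increasing error rate and a negative slope a decreasing one; a slope that is small relative to the standard deviation is probably just noise.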

The code in Listing 2 is a simple Python script that reads the CSV file into a list of lists (like a 2D array), with one inner list per row and every field stored as a string.

Listing 2: Reading the Scanned Data

import csv

# ===================
# Main Python section
# ===================
if __name__ == '__main__':
    # newline='' lets the csv module handle line endings itself
    with open('file.txt', newline='') as f:
        reader = csv.reader(f)
        data_list = list(reader)  # one inner list per CSV row
    print(data_list)

Although the code is short, it illustrates how easy it is to read the CSV data. From this point, error rates can be computed along with all sorts of statistical analyses and graphing.
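One way to get from counts to rates is sketched below. It assumes the recorded counts are cumulative counters sampled at known times, so a rate is the change in count divided by the elapsed time; if the CSV file already holds per-interval counts, the differencing step is unnecessary. The function name, the reset handling, and the sample data are illustrative assumptions.

```python
def error_rates(samples):
    """Turn (timestamp_seconds, cumulative_count) samples into errors per hour.

    Returns a list of (timestamp_seconds, rate) pairs, one per interval,
    stamped with the end of the interval.
    """
    rates = []
    ordered = sorted(samples)
    for (t0, c0), (t1, c1) in zip(ordered, ordered[1:]):
        if t1 == t0:
            continue  # skip duplicate timestamps to avoid dividing by zero
        # A counter that goes down is assumed to have been reset (e.g., by a
        # reboot), so the new value is taken as the errors since the reset.
        delta = c1 - c0 if c1 >= c0 else c1
        rates.append((t1, delta * 3600.0 / (t1 - t0)))
    return rates


# Hypothetical samples for one DIMM, already converted from the CSV strings.
samples = [(0, 0), (3600, 2), (7200, 5), (10800, 1)]
rates = error_rates(samples)
print(rates)
```

The resulting (time, rate) pairs are in a convenient form for plotting or for further statistical analysis.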

Parting Words

As mentioned in the article about how to kill a supercomputer, memory errors, whether correctable or uncorrectable, can lead to problems. Tracking error rates over time is therefore an important part of system monitoring.

A huge “thank you” is owed to Dr. Tommy Minyard at the Texas Advanced Computing Center (TACC) at the University of Texas at Austin and to Dr. James Cuff and Dr. Scott Yockel at Harvard University, Faculty of Arts and Sciences Research Computing (FAS RC), for their help with access to systems used for testing.