Hierarchical Data Storage for HPC

Listing 1: Describing a Dataset (h5dump output)

HDF5 "dset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
      DATA {
         1, 2, 3, 4, 5, 6,
         7, 8, 9, 10, 11, 12,
         13, 14, 15, 16, 17, 18,
         19, 20, 21, 22, 23, 24
      }
   }
}
}

HDF5 has many ways to represent the same datatype. For example, H5T_NATIVE_INT corresponds to a C integer type. On Linux (Intel architecture), this could also be written H5T_STD_I32LE (standard integer, 32-bit, little-endian), and on a MIPS system it would be H5T_STD_I32BE (standard integer, 32-bit, big-endian).
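
To make the datatype distinction concrete, Listing 2 is a minimal C sketch (HDF5 1.8+ API; error checking omitted for brevity) that would produce a file like the one in Listing 1. The data in memory is described as H5T_NATIVE_INT, and HDF5 converts it to the big-endian file datatype H5T_STD_I32BE as it writes.

Listing 2: Writing the Dataset in Listing 1

#include "hdf5.h"

int main(void)
{
   hsize_t dims[2] = {4, 6};
   int data[4][6];

   /* Fill the array with 1..24, as in Listing 1. */
   for (int i = 0; i < 4; i++)
      for (int j = 0; j < 6; j++)
         data[i][j] = i * 6 + j + 1;

   hid_t file  = H5Fcreate("dset.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
   hid_t space = H5Screate_simple(2, dims, dims);  /* current = maximum = (4,6) */

   /* File datatype is big-endian; the memory buffer holds native ints. */
   hid_t dset = H5Dcreate2(file, "dset", H5T_STD_I32BE, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
   H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

   H5Dclose(dset);
   H5Sclose(space);
   H5Fclose(file);
   return 0;
}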

Compound datatypes refer to collections of several datatypes that are presented as a single unit. In C, this is similar to a struct. The various parts of a compound datatype are called members and may be of any datatype, including another compound datatype. One of the fancy features of HDF5 is that it is possible to read members from a compound datatype without reading the whole type.
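
Listing 3 sketches the idea in C (HDF5 1.8+ API; the record layout and names are illustrative, and error checking is omitted). Each struct field becomes a member of the compound datatype.

Listing 3: A Compound Datatype

#include "hdf5.h"
#include <stddef.h>   /* offsetof */

/* A hypothetical record type; the fields are illustrative. */
typedef struct {
   int    serial;
   double temperature;
   double pressure;
} sensor_t;

int main(void)
{
   sensor_t data[3] = {{1, 20.5, 101.3}, {2, 21.0, 101.1}, {3, 19.8, 101.6}};

   hid_t file = H5Fcreate("compound.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

   /* One H5Tinsert() call per struct field. */
   hid_t ctype = H5Tcreate(H5T_COMPOUND, sizeof(sensor_t));
   H5Tinsert(ctype, "serial",      offsetof(sensor_t, serial),      H5T_NATIVE_INT);
   H5Tinsert(ctype, "temperature", offsetof(sensor_t, temperature), H5T_NATIVE_DOUBLE);
   H5Tinsert(ctype, "pressure",    offsetof(sensor_t, pressure),    H5T_NATIVE_DOUBLE);

   hsize_t dims[1] = {3};
   hid_t space = H5Screate_simple(1, dims, NULL);
   hid_t dset  = H5Dcreate2(file, "sensors", ctype, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
   H5Dwrite(dset, ctype, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

   H5Dclose(dset);
   H5Sclose(space);
   H5Tclose(ctype);
   H5Fclose(file);
   return 0;
}

To read just one member, you build a compound datatype in memory that contains only that member and pass it to H5Dread(); HDF5 then pulls only that field from each record.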

In Listing 1, you can see a class of information labeled DATASPACE, which describes the layout of a dataset's data elements. A dataspace can contain no elements (null), a single element (a scalar), or a simple array. It can also be fixed in size or have unlimited dimensions, which makes it extensible (i.e., it can grow larger).

Dataspace properties include rank (the number of dimensions), actual size (the current dimensions), and maximum size (the size to which an array may grow). In Listing 1, the dataspace is defined as SIMPLE, a multidimensional array of elements. The dimensionality (rank) of the dataspace is fixed when the array is created. In this case, it is defined as a 4x6 integer array (the first set of dimensions, ( 4, 6 )). The second set of dimensions, also ( 4, 6 ), defines the maximum size to which each dimension can grow during the lifetime of the dataspace. Because this array is fixed in size, the current dimensions and the maximum dimensions are the same.
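
Listing 4 shows what the three dataspace classes look like in C (a sketch; error checking omitted), including a SIMPLE dataspace matching Listing 1.

Listing 4: Dataspace Classes

#include "hdf5.h"

int main(void)
{
   hid_t null_space   = H5Screate(H5S_NULL);    /* no elements */
   hid_t scalar_space = H5Screate(H5S_SCALAR);  /* a single element */

   /* SIMPLE dataspace as in Listing 1: current dimensions (4,6) and
    * maximum dimensions (4,6), so the array cannot grow. */
   hsize_t dims[2]    = {4, 6};
   hsize_t maxdims[2] = {4, 6};
   hid_t simple_space = H5Screate_simple(2, dims, maxdims);

   H5Sclose(simple_space);
   H5Sclose(scalar_space);
   H5Sclose(null_space);
   return 0;
}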

If you do not know how large a dimension may need to grow, you can use the HDF5 predefined constant H5S_UNLIMITED as its maximum size.
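
Listing 5 sketches an extensible dataset (file and dataset names are illustrative; error checking omitted). Note that a dataset with an unlimited dimension must use a chunked storage layout, which you request through a dataset creation property list.

Listing 5: An Extensible Dataset

#include "hdf5.h"

int main(void)
{
   hid_t file = H5Fcreate("extend.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

   /* Current size 4x6; the first dimension may grow without bound. */
   hsize_t dims[2]    = {4, 6};
   hsize_t maxdims[2] = {H5S_UNLIMITED, 6};
   hid_t space = H5Screate_simple(2, dims, maxdims);

   /* Extensible datasets require chunked storage. */
   hsize_t chunk[2] = {4, 6};
   hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
   H5Pset_chunk(dcpl, 2, chunk);

   hid_t dset = H5Dcreate2(file, "growing", H5T_STD_I32BE, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);

   /* Later, H5Dset_extent() can enlarge the dataset up to maxdims. */

   H5Pclose(dcpl);
   H5Dclose(dset);
   H5Sclose(space);
   H5Fclose(file);
   return 0;
}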

Attributes

One of the fundamental objects in HDF5 is the attribute, which is how you store metadata inside an HDF5 file. Attributes are not independent objects; they are optionally attached to other HDF5 objects, such as groups, datasets, or named datatypes. As such, attributes are accessed by opening the object to which they are attached. As the user, you define the attributes (make them meaningful), and you can delete and overwrite them as you see fit.

Attributes have two parts: a name and a value. Typically, the value is a string that describes the data to which it is attached. Attributes can be extremely useful in a data file. With them, you can describe the data, including information such as when the data was collected, who collected it, what applications or sensors were used in its creation, a description with as much detail as you can include, and so on. The lack of useful metadata is one of the biggest problems in HPC data today, and attributes can help alleviate the problem. You just have to use them.
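
Listing 6 is a small C sketch of attaching a string attribute to a dataset (HDF5 1.8+ API; the attribute name and value are illustrative, and error checking is omitted).

Listing 6: Attaching a String Attribute

#include "hdf5.h"
#include <string.h>

int main(void)
{
   hid_t file = H5Fcreate("attr.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

   hsize_t dims[2] = {4, 6};
   hid_t space = H5Screate_simple(2, dims, NULL);
   hid_t dset  = H5Dcreate2(file, "dset", H5T_STD_I32BE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

   /* The attribute value: a fixed-length C string. */
   const char *desc = "Sample data for the HDF5 introduction; units are arbitrary";
   hid_t atype = H5Tcopy(H5T_C_S1);
   H5Tset_size(atype, strlen(desc) + 1);

   hid_t aspace = H5Screate(H5S_SCALAR);
   hid_t attr   = H5Acreate2(dset, "description", atype, aspace,
                             H5P_DEFAULT, H5P_DEFAULT);
   H5Awrite(attr, atype, desc);

   H5Aclose(attr);
   H5Sclose(aspace);
   H5Tclose(atype);
   H5Dclose(dset);
   H5Sclose(space);
   H5Fclose(file);
   return 0;
}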

Summary and Next Steps

One option for improving I/O performance and data organization is an external library such as HDF5. HDF5 uses a hierarchical approach to storing data that is similar to directories and files. You can store almost any data you want in an HDF5 file, including user-defined datatypes; integer, floating-point, and string data; and binary data such as images, PDFs, and Excel spreadsheets.

HDF5 also allows metadata (attributes) to be associated with virtually any object in the data file. Taking advantage of attributes is the key to usable data files in the future, because they make HDF5 files self-describing. As with a database, you can access data within the file randomly.

HDF5 has interfaces for C, C++, Fortran, and Java out of the box, and a large number of other tools and languages can read and write HDF5 files, including Python, Matlab, Octave, Scilab, Mathematica, R, Julia, Perl, Lua, Node.js, Erlang, and Haskell.

This article is just a quick introduction to the concepts in HDF5. To get started, you only have to know a few concepts and remember a few key words, such as group (a directory), dataset (a file holding the actual data), attribute, datatype, and dataspace (the layout of the data elements). In an upcoming article, I will present some simple code for creating HDF5 files.