    Using the popular HDF5 I/O library with Python and Fortran.

    Simple HDF5 in Python and Fortran

    HDF5 is one of the most popular I/O libraries in HPC. It uses a familiar filesystem-like hierarchy; is flexible, self-describing, and portable across operating systems and hardware; can store text and binary data; can be used by parallel applications (MPI); has a large number of language plugins; and is fairly easy to use.

    In a previous article, I introduced HDF5, focusing on the concepts and strengths. In this article, I want to give a quick introduction to HDF5 through some simple code examples. The goal is not to dive deep into HDF5 but to illustrate the basics of using it. I'll start with Python because it is a widely used language and the HDF5 Python library h5py is easy to use and understand. I also want to illustrate how a compiled program works with HDF5, and for that I use Fortran.

    h5py Python Library

    H5py is the dominant Python interface to HDF5. It is included with many Python distributions and with most Linux distributions. For the examples here, I use the Anaconda Python distribution for Python 2.7.

    The examples I use in this article are fairly simple and are derived from the Quick Start page on the h5py website. The first example simply illustrates a few concepts, such as:

    • Opening an HDF5 file for writing
    • Creating data sets
    • Creating groups

    The simple Python script in Listing 1 incorporates these concepts.

    Listing 1: Starting Out with h5py

    01   #!/home/laytonjb/anaconda2/bin/python
    02 
    03   import h5py
    04   import numpy as np
    05 
    06   # ===================
    07   # Main Python section
    08   # ===================
    09   #
    10  if __name__ == '__main__':
    11 
    12      f = h5py.File("mytestfile.hdf5", "w")
    13  
    14      dset = f.create_dataset("mydataset", (100,), dtype='i')
    15    
    16      dset[...] = np.arange(100)
    17 
    18      print "dset.shape = ",dset.shape
    19 
    20      print "dset.dtype = ",dset.dtype
    21 
    22      print "dset.name = ",dset.name
    23   
    24      print "f.name = ",f.name
    25   
    26      grp = f.create_group("subgroup")
    27
    28      dset2 = grp.create_dataset("another_dataset", (50,), dtype='f')
    29      print "dset2.name = ",dset2.name
    30    
    31      dset3 = f.create_dataset('subgroup2/dataset_three', (10,), dtype='i')
    32      print "dset3.name = ",dset3.name
    33   
    34  # end if

    The first h5py command is on line 12, which opens a file for writing. If the file exists, it is overwritten; if it doesn't exist, it is created. Remember that HDF5 is really a container for data objects. When you create a file, the library creates a number of defaults, such as the root group (/). Therefore, the file will be non-zero in size, even if no data or attributes are written into it.
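
A minimal sketch of this behavior, assuming h5py is installed and using a hypothetical scratch file name (empty.hdf5) in a temporary directory. The print calls take a single argument so the code should run under both Python 2.7 and Python 3:

```python
import os
import tempfile

import h5py

# Hypothetical scratch location so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "empty.hdf5")

# Mode "w" creates the file, or overwrites it if it already exists.
f = h5py.File(path, "w")
root_name = f.name      # name of the root group
n_objects = len(f)      # number of objects directly under the root group
f.close()

size_bytes = os.path.getsize(path)

print("root group = %s" % root_name)
print("objects in root group = %d" % n_objects)
print("file size in bytes = %d" % size_bytes)
```

The exact size depends on the HDF5 library version and settings, so only the fact that it is greater than zero should be relied on.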

    After the file is opened and created, a dataset with 100 integers is created (mydataset in line 14). At this point, only the dataset object and its dataspace exist in the file; no data has been written yet. Line 16 puts data into the dataset using NumPy. Notice that you simply assign data to the object, and the h5py library takes care of updating the HDF5 file. You can modify data that is already in the file in the same way.
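
As a sketch of modifying data after it has been written, the following assumes the same dataset layout as Listing 1 but uses a hypothetical scratch file (modify.hdf5). It overwrites part of the dataset with slice assignment and then reopens the file to confirm that the changes were stored:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch location so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "modify.hdf5")

f = h5py.File(path, "w")
dset = f.create_dataset("mydataset", (100,), dtype='i')
dset[...] = np.arange(100)   # fill the whole dataset with 0, 1, ..., 99

dset[0:5] = -1               # overwrite the first five elements
dset[10] = 1000              # overwrite a single element
f.close()

# Reopen read-only and read the values back from the file.
f = h5py.File(path, "r")
first_six = f["mydataset"][0:6].tolist()   # slicing returns a NumPy array
tenth = int(f["mydataset"][10])
f.close()

print("first six values = %s" % (first_six,))
print("element 10 = %d" % tenth)
```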

    Recall that in Python, almost everything is an object, so it has properties. Lines 18, 20, 22, and 24 print out some of the properties of the HDF5 file (line 24) as well as the first data set (lines 18, 20, and 22). Because HDF5 is object based, it fits well with the object nature of Python.

    On line 26, a group named subgroup is created under the root group; then, on line 28, a new dataset with 50 elements of a float data type is created in this group. Notice that a method of the group object is used for this.

    On line 31, a new dataset is created. What is unique is that the dataset is created in a new subgroup named subgroup2. H5py will automatically create the subgroup if it doesn't exist.
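
The automatic group creation can be checked directly. This sketch, which assumes a hypothetical scratch file (groups.hdf5), tests for the group before and after the dataset is created:

```python
import os
import tempfile

import h5py

# Hypothetical scratch location so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "groups.hdf5")

f = h5py.File(path, "w")
existed_before = "subgroup2" in f    # nothing has been created yet

# Only the dataset is requested; the missing group is created on the way.
dset3 = f.create_dataset("subgroup2/dataset_three", (10,), dtype='i')

exists_after = "subgroup2" in f
is_group = isinstance(f["subgroup2"], h5py.Group)
members = list(f["subgroup2"])       # names of the objects inside the group
f.close()

print("subgroup2 existed before = %s" % existed_before)
print("subgroup2 exists after = %s" % exists_after)
print("subgroup2 is a group = %s" % is_group)
print("members of subgroup2 = %s" % (members,))
```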

    The output from this example Python script is shown below:

    [laytonjb@laytonjb-Lenovo-G50-45 PYTHON]$ ./test.py
    dset.shape =  (100,)
    dset.dtype =  int32
    dset.name =  /mydataset
    f.name =  /
    dset2.name =  /subgroup/another_dataset
    dset3.name =  /subgroup2/dataset_three

    Notice the size of the integers: the 'i' data type requested on line 14 is stored as a 32-bit integer (int32).
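
If the integer size matters, the data type can be stated explicitly instead of relying on the one-letter code. The sketch below uses hypothetical dataset names and a hypothetical scratch file (dtypes.hdf5), and records the data type h5py reports for each dataset. The width of 'i' follows the platform's C int, which is an assumption worth checking on unusual systems:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch location so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "dtypes.hdf5")

f = h5py.File(path, "w")
f.create_dataset("c_int", (10,), dtype='i')             # one-letter NumPy code
f.create_dataset("explicit64", (10,), dtype=np.int64)   # explicit 64-bit integer
f.create_dataset("single", (10,), dtype='f')            # single-precision float

# Iterating over the file yields the names of the objects in the root group.
dtypes = {name: str(f[name].dtype) for name in f}
f.close()

for name in sorted(dtypes):
    print("%s -> %s" % (name, dtypes[name]))
```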

    Another short Python script reads the HDF5 file and prints the names of the objects in it. This can be done fairly easily using the h5py method visit, which recursively walks the HDF5 file so you can discover the objects in it, including groups and datasets. Listing 2 is a simple script for walking the HDF5 file and printing the names of the objects.

    Listing 2: Walking the HDF5 File

    01   #!/home/laytonjb/anaconda2/bin/python
    02 
    03   import h5py
    04   import numpy as np
    05 
    06   def printname(name):
    07       print name
    08 
    09   # ===================
    10  # Main Python section
    11  # ===================
    12  #
    13  if __name__ == '__main__':
    14 
    15      f = h5py.File("mytestfile.hdf5", "r")
    16 
    17      for name in f:
    18          print name
    19      # end for
    20 
    21      f.visit(printname)
    22 
    23  # end if

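    The output from the script is below. The first three lines come from the loop over the root group (lines 17-18), and the remaining five come from visit (line 21):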

    [laytonjb@laytonjb-Lenovo-G50-45 PYTHON]$ ./test2.py
    mydataset
    subgroup
    subgroup2
    mydataset
    subgroup
    subgroup/another_dataset
    subgroup2
    subgroup2/dataset_three

    You can find more information in the HDF5 documentation. The Quick Start guide also has more examples of accessing HDF5 files from Python.
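
When names alone are not enough, h5py also provides visititems, which passes both the name and the object to the callback. The following sketch rebuilds the layout from Listing 1 in a hypothetical scratch file (walk.hdf5) and records whether each object is a group or a dataset; the describe function and the lines list are illustrative names only:

```python
import os
import tempfile

import h5py

# Hypothetical scratch location so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "walk.hdf5")

# Recreate the layout used in Listing 1.
f = h5py.File(path, "w")
f.create_dataset("mydataset", (100,), dtype='i')
f.create_dataset("subgroup/another_dataset", (50,), dtype='f')
f.create_dataset("subgroup2/dataset_three", (10,), dtype='i')
f.close()

lines = []

def describe(name, obj):
    # Called once per object; returning None lets the walk continue.
    if isinstance(obj, h5py.Dataset):
        lines.append("%s: dataset, shape=%s, dtype=%s"
                     % (name, obj.shape, obj.dtype))
    else:
        lines.append("%s: group" % name)

f = h5py.File(path, "r")
f.visititems(describe)
f.close()

for line in lines:
    print(line)
```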

    Fortran and HDF5

    H5py is a very Python-centric library allowing HDF5 to be used in a very flexible manner. Compiled languages are a little different. Using HDF5 with compiled languages is not quite as easy as with Python, but it is not difficult. The developers of HDF5 have created a number of functions and subroutines to be used for manipulating data and objects in an HDF5 file that make programming straightforward.

    For this article, a CentOS 7.3 OS was used with the default Fortran compiler (gfortran) and the HDF5 library that is part of the distribution. It's not difficult to build a Fortran executable with gfortran and the HDF5 library that comes with the distribution. The generic command line below illustrates how to accomplish this,

    $ gfortran code.f90 -fintrinsic-modules-path /usr/lib64/gfortran/modules \
       -lhdf5_fortran -o exe

    where code.f90 is the source file and exe is the resultant binary.

    The HDF Group has provided some sample Fortran 90 code to get started, as well as more complex examples. With the use of these examples, Listing 3 shows a Fortran 90 version of the first sample Python code.

    Listing 3: Sample Fortran 90 Code

    001   PROGRAM TEST
    002  
    003       USE HDF5 ! This module contains all necessary HDF5 modules
    004  
    005       IMPLICIT NONE
    006
    007       ! Names (file and HDF5 objects)
    008       CHARACTER(LEN=15), PARAMETER :: filename = "mytestfile.hdf5" ! File name
    009       CHARACTER(LEN=9), PARAMETER :: dsetname1 = "mydataset" ! Dataset name
    010      CHARACTER(LEN=8), PARAMETER :: groupname = "subgroup" ! Sub-Group 1 name
    011      CHARACTER(LEN=9), PARAMETER :: groupname3 = "subgroup2" ! Sub-Group 3 name
    012      ! Dataset 2 name
    013      CHARACTER(LEN=24), PARAMETER :: dsetname2 = "subgroup/another_dataset"
    014      ! Dataset 3 name
    015      CHARACTER(LEN=23), PARAMETER :: dsetname3 = "subgroup2/dataset_three"
    016      
    017      ! Identifiers
    018      INTEGER(HID_T) :: file_id       ! File identifier
    019      INTEGER(HID_T) :: group_id      ! Group identifier
    020      INTEGER(HID_T) :: group3_id     ! Group 3 identifier
    021      INTEGER(HID_T) :: dset1_id      ! Dataset 1 identifier
    022      INTEGER(HID_T) :: dset2_id      ! Dataset 2 identifier
    023      INTEGER(HID_T) :: dset3_id      ! Dataset 3 identifier
    024      INTEGER(HID_T) :: dspace1_id    ! Dataspace 1 identifier
    025      INTEGER(HID_T) :: dspace2_id    ! Dataspace 2 identifier
    026      INTEGER(HID_T) :: dspace3_id    ! Dataspace 3 identifier
    027    
    028      ! Integer array
    029      INTEGER :: rank                 ! Dataset rank
    030      INTEGER(HSIZE_T), DIMENSION(1) :: dims1 = (/100/) ! Dataset dimensions
    031      INTEGER(HSIZE_T), DIMENSION(1) :: data_dims1
    032      INTEGER, DIMENSION(100) :: dset_data1   ! Data buffers
    033 
    034      ! FP array
    035      INTEGER(HSIZE_T), DIMENSION(1) :: dims2 = (/50/)
    036      INTEGER(HSIZE_T), DIMENSION(1) :: data_dims2
    037      REAL, DIMENSION(50) :: dset_data2
    038    
    039      ! Integer array for dataset_three
    040      INTEGER(HSIZE_T), DIMENSION(1) :: dims3 = (/10/) ! Dataset dimensions
    041      INTEGER(HSIZE_T), DIMENSION(1) :: data_dims3     ! Data dimensions
    042      INTEGER, DIMENSION(10) :: dset_data3
    043    
    044      ! Misc variables (e.g. loop counters)
    045      INTEGER :: error ! Error flag
    046      INTEGER :: i,j
    047  ! =====================================================================
    048 
    049      ! Initialize the dset_data array 
    050      data_dims1(1) = 100
    051      rank = 1
    052      DO i = 1, 100
    053          dset_data1(i) = i
    054      END DO
    055    
    056      ! Initialize Fortran interface
    057      CALL h5open_f(error)   
    058      ! Create a new file
    059      CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error)
    060 
    061      ! Create dataspace 1 (the dataset is next); "dspace1_id" is returned
    062      CALL h5screate_simple_f(rank, dims1, dspace1_id, error)
    063      ! Create dataset 1 with default properties; "dset1_id" is returned
    064      CALL h5dcreate_f(file_id, dsetname1, H5T_NATIVE_INTEGER, dspace1_id, &
    065                       dset1_id, error)
    066      ! Write dataset 1
    067      CALL h5dwrite_f(dset1_id, H5T_NATIVE_INTEGER, dset_data1, data_dims1, &
    068                      error)
    069      ! Close access to dataset 1
    070      CALL h5dclose_f(dset1_id, error)
    071      ! Close access to data space 1
    072      CALL h5sclose_f(dspace1_id, error)
    073    
    074      ! Create a group in the HDF5 file
    075      CALL h5gcreate_f(file_id, groupname, group_id, error)
    076      ! Close the group
    077      CALL h5gclose_f(group_id, error)
    078 
    079      ! Initialize the data for dataset 2
    080      data_dims2(1) = 50
    081      DO i = 1, 50
    082          dset_data2(i) = 1.0
    083      END DO
    084      ! Create dataspace 2
    085      CALL h5screate_simple_f(rank, dims2, dspace2_id, error)
    086      ! Create dataset 2 with default properties
    087      CALL h5dcreate_f(file_id, dsetname2, H5T_NATIVE_REAL, dspace2_id, &
    088                       dset2_id, error)
    089      ! Write dataset 2
    090      CALL h5dwrite_f(dset2_id, H5T_NATIVE_REAL, dset_data2, data_dims2, &
    091                      error)
    092      ! Close access to dataset 2
    093      CALL h5dclose_f(dset2_id, error)
    094      ! Close access to data space 2
    095      CALL h5sclose_f(dspace2_id, error)
    096    
    097      ! Create a group in the HDF5 file
    098      CALL h5gcreate_f(file_id, groupname3, group3_id, error)
    099      ! Close the group
    100     CALL h5gclose_f(group3_id, error)
    101    
    102     ! Initialize the data for dataset 3
    103     data_dims3(1) = 10
    104     DO i = 1, 10
    105         dset_data3(i) = i + 3
    106     END DO
    107     ! Create dataspace 3
    108     CALL h5screate_simple_f(rank, dims3, dspace3_id, error)
    109     ! Create dataset 3 with default properties
    110     CALL h5dcreate_f(file_id, dsetname3, H5T_NATIVE_INTEGER,  &
    111                      dspace3_id, dset3_id, error)
    112     ! Write dataset 3
    113     CALL h5dwrite_f(dset3_id, H5T_NATIVE_INTEGER, dset_data3, data_dims3, &
    114                     error)
    115     ! Close access to dataset 3
    116     CALL h5dclose_f(dset3_id, error)
    117     ! Close access to data space 3
    118     CALL h5sclose_f(dspace3_id, error)
    119    
    120     ! Close the file
    121     CALL h5fclose_f(file_id, error)
    122     ! Close FORTRAN interface
    123     CALL h5close_f(error)
    124  END PROGRAM TEST

    Notice that the code uses some predefined HDF5 names that are necessary to use the library, such as the integer kinds HID_T and HSIZE_T, the file access flag H5F_ACC_TRUNC_F, and the data types H5T_NATIVE_INTEGER and H5T_NATIVE_REAL. Also note that this isn't “good” coding, in that the error variable is not checked when returning from a subroutine call. This code is just an example, and I wanted to keep it short in the interest of space.

    The basic process of using HDF5 in Fortran is pretty logical. To begin, you initialize or enable the Fortran interface (line 57); then, you create a file (line 59) and start creating objects.

    The first object to be created is a dataset in the root (/) group (lines 61–72), but first, you have to create the dataspace (line 62) and then the dataset (lines 64–65). Lines 67–68 write the data to the dataset. To finish, reverse the process: first close the dataset (line 70) and then the dataspace (line 72).

    The general approach for writing a dataset to an HDF5 file using Fortran is the following:

    • Create a dataspace
    • Create a dataset that uses the dataspace
    • Write the data to the dataset
    • Close the dataset
    • Close the dataspace

    You could easily write a function in Fortran 90 for all these steps if you desired.
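
For comparison with the Python section, the same sequence can be sketched with the low-level h5py modules, which sit closer to the HDF5 C API than the Dataset objects used in Listing 1. This is only an illustration, assuming the h5s.create_simple, h5d.create, and DatasetID.write calls behave as documented for your h5py version; the scratch file name (lowlevel.hdf5) is hypothetical:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch location so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "lowlevel.hdf5")

data = np.arange(1, 101, dtype=np.int32)   # 1..100, as in the Fortran code

f = h5py.File(path, "w")

# 1. Create a dataspace describing a 1-D array of 100 elements.
space_id = h5py.h5s.create_simple((100,))

# 2. Create a dataset that uses the dataspace (names are byte strings here).
dset_id = h5py.h5d.create(f.id, b"mydataset", h5py.h5t.NATIVE_INT32, space_id)

# 3. Write the whole array to the dataset.
dset_id.write(h5py.h5s.ALL, h5py.h5s.ALL, data)

# 4./5. Release the dataset and dataspace identifiers, then close the file.
del dset_id, space_id
f.close()

# Read the data back with the high-level interface as a check.
f = h5py.File(path, "r")
values = f["mydataset"][...]
f.close()

print("shape = %s" % (values.shape,))
print("first = %d, last = %d" % (values[0], values[-1]))
```

In everyday Python code the high-level create_dataset call covers all of these steps; the low-level form is shown only because it lines up with the Fortran subroutine calls one for one.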

    In the rest of the code, the other datasets are written to the file. One interesting thing to note is that this code passes the file identifier to h5dcreate_f, so the dataset name has to include the full path to the group where the dataset is written. With the h5py Python module, you can write to a group by using the method associated with the specific group.

    After running the Fortran code, which has no output, a quick experiment is to run the test2.py script from the Python section against the Fortran output:

    [laytonjb@laytonjb-Lenovo-G50-45 FORTRAN]$ ./test2.py
    mydataset
    subgroup
    subgroup2
    mydataset
    subgroup
    subgroup/another_dataset
    subgroup2
    subgroup2/dataset_three

    If you compare this to the output from the Python code, you will see that they are the same.
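
Rather than comparing the two listings by eye, the check can be scripted. The sketch below is self-contained, so instead of the actual Python and Fortran output files it builds two hypothetical scratch files with the same layout, one using group methods and one using full paths, and compares the sorted object names collected with visit. The object_names helper is an illustrative name, not part of h5py:

```python
import os
import tempfile

import h5py

def object_names(path):
    """Return the sorted names of all groups and datasets in an HDF5 file."""
    names = []
    f = h5py.File(path, "r")
    f.visit(names.append)   # append returns None, so the walk continues
    f.close()
    return sorted(names)

# Hypothetical scratch files standing in for the two output files.
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "a.hdf5")
path_b = os.path.join(tmpdir, "b.hdf5")

# File A: datasets created through group methods.
fa = h5py.File(path_a, "w")
fa.create_dataset("mydataset", (100,), dtype='i')
fa.create_group("subgroup").create_dataset("another_dataset", (50,), dtype='f')
fa.create_group("subgroup2").create_dataset("dataset_three", (10,), dtype='i')
fa.close()

# File B: datasets created through full paths from the root group.
fb = h5py.File(path_b, "w")
fb.create_dataset("mydataset", (100,), dtype='i')
fb.create_dataset("subgroup/another_dataset", (50,), dtype='f')
fb.create_dataset("subgroup2/dataset_three", (10,), dtype='i')
fb.close()

names_a = object_names(path_a)
names_b = object_names(path_b)
same_layout = (names_a == names_b)

print("same layout = %s" % same_layout)
```

To compare the real files, call object_names on the file written by Listing 1 and on the file written by the Fortran program. This checks names only; it does not compare the stored values.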