User's Guide

If you are new to HDF5, begin by reading about the HDF5 data model and file structure in the HDF5 User’s Guide.

If you are new to LuaJIT, read about C data structures to pass data between a program and the HDF5 library.

1 Datasets

In this example we store particle positions to a dataset.

We begin by filling an array of length N of vectors with 3 components:

Next we create an empty HDF5 file:

An HDF5 file has a hierarchical structure analogous to a filesystem. Where a filesystem consists of directories that contain subdirectories and files, an HDF5 file consists of groups that contain subgroups and datasets. In an HDF5 file, data is stored in datasets and attributes. Datasets store large data and may be read and written partially. Attributes store small metadata that is attached to groups or datasets and may be read and written only atomically.

A dataset is described by a location in the file, a datatype and a dataspace:

The location of the dataset is specified as a path relative to the file root; its absolute path is /particles/solvent/position. The group /particles and the subgroup /particles/solvent are implicitly created since they do not exist yet. We pass the datatype hdf5.double to store data in double-precision floating point format. The dataspace describes the dimensions of the dataset. A dataset may be created with fixed or variable-size dimensions. We choose a dataspace of fixed dimensions N×3 equal to the dimensions of the array.

Next we write the contents of the pos array to the dataset:

Note that datatype and dataspace are also passed to the write function. These tell the HDF5 library about the memory layout of the pos array, which happens to match the file layout of the dataset. For advanced use cases file and memory layouts may differ. For example, the data could be stored in memory with double precision and stored in the file with single precision to halve disk I/O, in which case the HDF5 library would automatically convert between double and single precision when writing the data.

We close the HDF5 file, which flushes the written data to disk:

Any open objects in the file, which includes groups, datasets, attributes and committed datatypes, must be closed before closing the file itself; otherwise, the HDF5 library will raise an error when attempting to close the file.

To inspect the HDF5 file, we use the h5dump tool:

h5dump dataset.h5

To show only the metadata of the file:

h5dump -A dataset.h5
HDF5 "dataset.h5" {
GROUP "/" {
   GROUP "particles" {
      GROUP "solvent" {
         DATASET "position" {
            DATATYPE  H5T_IEEE_F64LE
            DATASPACE  SIMPLE { ( 1000, 3 ) / ( 1000, 3 ) }

The dimensions of the dataset are shown twice, as the current dimensions and the maximum dimensions. A dataset may be resized at any point in time, as long as the new dimensions fit within the maximum dimensions specified upon creation. The maximum dimensions may further be specified as variable, in which case the dataset may be resized to arbitrary dimensions.

Next we open the HDF5 file and the dataset for reading:

We retrieve and verify the dimensions of the dataset:

We read the data from the dataset into a newly allocated array:

The array stores elements with single precision, while we created the dataset with double precision. In passing a different memory datatype to the read function, we instruct the HDF5 library to convert from double precision to single precision when reading.

Finally we verify the data read from the dataset:

The values are compared with a tolerance corresponding to single precision.

2 Attributes

In this example we store a sequence of strings to an attribute.

We create an empty HDF5 file with a group /molecules:

To the group we attach an attribute for a given set of strings:

Like a dataset, an attribute is described by a datatype and a dataspace. The HDF5 library defines a special datatype hdf5.c_s1 for C strings, which needs to be copied and adjusted for the length of the strings to be written. The size of a string datatype must be equal to or greater than the number of characters of the longest string plus a terminating null character. We choose 100 characters including a null terminator, to avoid having to determine an exact size at the expense of padding with null characters.

Next we write the strings to the attribute:

For writing, the Lua strings in the species table are converted to an array of C strings that match the string datatype.

We inspect the contents of the HDF5 file:

h5dump -A attribute.h5
HDF5 "attribute.h5" {
GROUP "/" {
   GROUP "molecules" {
      ATTRIBUTE "species" {
            STRSIZE 100;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
         DATA {
         (0): "H2O", "C8H10N4O2", "C6H12O6"

The strings are shown according to their actual lengths and the padding null characters discarded.

Next we open the file, the group, and the attached attribute:

We retrieve and verify the dimensions of the attribute:

We allocate an array for the strings and read the attribute:

The size of the memory datatype (50 characters) need not match the size of the file datatype (100 characters). If a string exceeds the size of the memory datatype, the HDF5 library truncates and terminates the string safely with a null character.

Finally we verify the data read from the attribute:

Using ffi.string, the C strings are converted to Lua strings for comparison.

Array indices begin at 0, and table indices begin at 1.

3 Dataspaces

In this example we use hyperslab selections to store time-dependent data to a dataset.

We simulate a system of particles over a number of steps in which the particles move randomly in 3‑dimensional space.

We begin by creating a dataset for the particle positions:

Here we specify the dataspace of the dataset using two sets of dimensions. The dimensions {0, N, 3} specify the sizes of the dataset after creation. The dimensions {nil, N, 3} specify the maximum sizes to which the dataset may be resized, where nil stands for a variable size in that dimension. Here the dataset is empty after creation and resizable to an arbitrary number of steps. When a dataset is resizable, i.e. when the current and maximum dimensions of a dataset differ upon creation, the data is stored using a chunked layout, rather than a contiguous layout as for a fixed-size dataset. The size of the chunks needs to be specified upon creation of the dataset using a dataset creation property list.

The choice of chunk size affects write and read performance, since each chunk is written or read as a whole. Here we choose a chunk size of {1, 1000, 3}. To write or read the positions of all 4200 particles at a given step, the HDF5 library needs to write or read 5 chunks altogether, of which the fifth chunk is partially filled. Tracing the position of a single particle through all 100 steps, however, would require the HDF5 library to fetch 100 chunks. As a general guideline, the chunk size in bytes should be well above 1 kB and below 1 MB, and the chunk dimensions should match the write and read access patterns as closely as possible.

Chunks may be individually compressed by specifying a compression algorithm and level in the dataset creation property list. Here we use the deflate algorithm with a compression level of 6, which is the default level of the gzip program. Note for this example compression is pointless since we store random data. Compression is sensible when the data within each chunk is correlated.

After creating the dataset, we prepare a file dataspace for writing:

The file dataspace specifies a selection within the dataset that data is written to or read from. We select a 3‑dimensional slab from the file dataspace that has its origin at {0, 0, 0} and a size of {1, N, 3}. As shown in the figure below, at each step we extend the dimensions of the dataset by one slab, and move the hyperslab selection by one in the dimension of steps.

Dataset dimensions with hyperslab selection at steps 1, 2, 3.
Dataset dimensions with hyperslab selection at steps 1, 2, 3.

Similarly we prepare a memory dataspace for writing:

The memory dataspace specifies the dimensions of and a selection within the array of particle positions. Each 3‑dimensional vector array element is padded with an unused fourth component to ensure properly aligned memory access. By selecting a hyperslab of dimensions {N, 3} from the memory dataspace of dimensions {N, 4}, we tell the HDF5 library to discard the unused fourth component of each vector.

We conclude with a loop that writes the particle positions at each step:

Before writing the positions, we append an additional slab of dimensions {1, N, 3} to the dataset by extending its dimensions to {step, N, 3}. We pass the memory dataspace to select the region for reading from the array and the file dataspace to select the region for writing to the dataset. Recall the file dataspace has rank 3 and the memory dataspace has rank 2; the dimensionality of file and memory dataspaces may differ as long as the number of selected elements is equal. After writing the positions, we move the origin of the hyperslab selection within the file dataspace to offset {step, 0, 0} to prepare the next step.

At the end of the simulation the HDF5 file contains:

h5dump -A dataspace.h5
HDF5 "dataspace.h5" {
GROUP "/" {
   GROUP "particles" {
      GROUP "solvent" {
         GROUP "position" {
            DATASET "value" {
               DATATYPE  H5T_IEEE_F64LE
               DATASPACE  SIMPLE { ( 100, 4200, 3 ) / ( H5S_UNLIMITED, 4200, 3 ) }

To check the stored particle positions, we open the dataset for reading:

We retrieve and verify the dimensions of the dataset:

For reading individual slabs, we can use the dataspace obtained from the dataset:

As above we prepare a memory dataspace that describes the array:

We read the slabs for each step and compare the positions for equality: