User's Guide

If you are new to HDF5, begin by reading about the HDF5 data model and file structure in the HDF5 User’s Guide.

If you are new to LuaJIT, read about the FFI library and its C data structures, which are used to pass data between a Lua program and the HDF5 library.

1 Datasets

In this example we store particle positions to a dataset.

We begin by filling an array of N vectors, each with 3 components:

local ffi = require("ffi")
local hdf5 = require("hdf5")

local N = 1000
local pos = ffi.new("struct { double x, y, z; }[?]", N)
math.randomseed(42)
for i = 0, N - 1 do
  pos[i].x = math.random()
  pos[i].y = math.random()
  pos[i].z = math.random()
end

Next we create an empty HDF5 file:

local file = hdf5.create_file("dataset.h5")

An HDF5 file has a hierarchical structure analogous to a filesystem. Where a filesystem consists of directories that contain subdirectories and files, an HDF5 file consists of groups that contain subgroups and datasets. In an HDF5 file, data is stored in datasets and attributes. Datasets store large data and may be read and written partially. Attributes store small metadata that is attached to groups or datasets and may be read and written only atomically.
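
Groups can also be created explicitly before any datasets exist. The following is a minimal sketch using the create_group call shown in the attributes section; it is not part of the running example, which relies on implicit group creation instead:

-- Sketch only: create the group hierarchy explicitly, one level at a time.
local particles = file:create_group("particles")
local solvent = file:create_group("particles/solvent")
solvent:close()
particles:close()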

A dataset is described by a location in the file, a datatype and a dataspace:

local datatype = hdf5.double
local dataspace = hdf5.create_simple_space({N, 3})
local dataset = file:create_dataset("particles/solvent/position", datatype, dataspace)

The location of the dataset is specified as a path relative to the file root; its absolute path is /particles/solvent/position. The group /particles and the subgroup /particles/solvent are implicitly created since they do not exist yet. We pass the datatype hdf5.double to store data in double-precision floating point format. The dataspace describes the dimensions of the dataset. A dataset may be created with fixed or variable-size dimensions. We choose a dataspace of fixed dimensions N×3 equal to the dimensions of the array.

Next we write the contents of the pos array to the dataset:

dataset:write(pos, datatype, dataspace)

Note that datatype and dataspace are also passed to the write function. They tell the HDF5 library about the memory layout of the pos array, which here happens to match the file layout of the dataset. For advanced use cases, the file and memory layouts may differ. For example, the data could be held in memory in double precision but stored in the file in single precision to halve disk I/O, in which case the HDF5 library would automatically convert from double to single precision when writing the data.
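
A sketch of such a case, not part of this example (the dataset path is hypothetical): the dataset is created with a single-precision file datatype while the memory datatype remains double precision.

-- Sketch only: the file stores floats, memory holds doubles; the HDF5 library
-- converts from double to single precision during the write.
local filetype = hdf5.float
local dataset = file:create_dataset("particles/solvent/position_single", filetype, dataspace)
dataset:write(pos, hdf5.double, dataspace)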

We close the HDF5 file, which flushes the written data to disk:

dataset:close()
file:close()

Any open objects in the file, including groups, datasets, attributes and committed datatypes, must be closed before the file itself; otherwise, the HDF5 library will raise an error when attempting to close the file.

To inspect the HDF5 file, we use the h5dump tool:

h5dump dataset.h5

To show only the metadata of the file:

h5dump -A dataset.h5
HDF5 "dataset.h5" {
GROUP "/" {
   GROUP "particles" {
      GROUP "solvent" {
         DATASET "position" {
            DATATYPE  H5T_IEEE_F64LE
            DATASPACE  SIMPLE { ( 1000, 3 ) / ( 1000, 3 ) }
         }
      }
   }
}
}

The dimensions of the dataset are shown twice, as the current dimensions and the maximum dimensions. A dataset may be resized at any point in time, as long as the new dimensions fit within the maximum dimensions specified upon creation. The maximum dimensions may further be specified as variable, in which case the dataset may be resized to arbitrary dimensions.
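
A minimal sketch of a resizable dataset, previewing section 3 and not part of this example (the path and chunk size are chosen for illustration): variable maximum dimensions require a chunked layout, which is specified through a dataset creation property list.

-- Sketch only: current dimensions {N, 3}, first dimension resizable (nil).
local dcpl = hdf5.create_plist("dataset_create")
dcpl:set_chunk({1000, 3})
local space = hdf5.create_simple_space({N, 3}, {nil, 3})
local dset = file:create_dataset("resizable_example", hdf5.double, space, nil, dcpl)
dset:set_extent({2 * N, 3})  -- grow the first dimension from N to 2 * N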

Next we open the HDF5 file and the dataset for reading:

local file = hdf5.open_file("dataset.h5")
local dataset = file:open_dataset("particles/solvent/position")

We retrieve and verify the dimensions of the dataset:

local filespace = dataset:get_space()
local dims = filespace:get_simple_extent_dims()
local N, D = dims[1], dims[2]
assert(N == 1000)
assert(D == 3)

We read the data from the dataset into a newly allocated array:

local pos = ffi.new("struct { float x, y, z; }[?]", N)
local memtype = hdf5.float
local memspace = hdf5.create_simple_space({N, 3})
dataset:read(pos, memtype, memspace)
dataset:close()
file:close()

The array stores elements with single precision, while we created the dataset with double precision. By passing a different memory datatype to the read function, we instruct the HDF5 library to convert from double to single precision when reading.

Finally we verify the data read from the dataset:

math.randomseed(42)
for i = 0, N - 1 do
  assert(math.abs(pos[i].x - math.random()) < 1e-7)
  assert(math.abs(pos[i].y - math.random()) < 1e-7)
  assert(math.abs(pos[i].z - math.random()) < 1e-7)
end

The values are compared with a tolerance corresponding to single precision.

2 Attributes

In this example we store a sequence of strings to an attribute.

We create an empty HDF5 file with a group /molecules:

local file = hdf5.create_file("attribute.h5")
local group = file:create_group("molecules")

To the group we attach an attribute for a given set of strings:

local species = {"H2O", "C8H10N4O2", "C6H12O6"}
local dataspace = hdf5.create_simple_space({#species})
local datatype = hdf5.c_s1:copy()
datatype:set_size(100)
local attr = group:create_attribute("species", datatype, dataspace)

Like a dataset, an attribute is described by a datatype and a dataspace. The HDF5 library defines a special datatype hdf5.c_s1 for C strings, which needs to be copied and adjusted to the length of the strings to be written. The size of a string datatype must be equal to or greater than the number of characters of the longest string plus a terminating null character. We choose a size of 100 characters, including the null terminator, which avoids having to determine the exact size in advance at the expense of padding the shorter strings with null characters.
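
Alternatively, the exact size could be computed from the longest string; a sketch (the running example keeps the fixed size of 100):

-- Sketch only: size the string datatype to fit the longest string exactly.
local maxlen = 0
for _, s in ipairs(species) do
  maxlen = math.max(maxlen, #s + 1)  -- +1 for the terminating null character
end
datatype:set_size(maxlen)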

Next we write the strings to the attribute:

attr:write(ffi.new("char[?][100]", #species, species), datatype)
attr:close()
group:close()
file:close()

For writing, the Lua strings in the species table are converted to an array of C strings that match the string datatype.
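
The ffi.new call above performs this conversion implicitly by initializing each char[100] element from the corresponding Lua string. A more explicit equivalent, shown as a sketch (the buf name is introduced here for illustration):

-- Sketch only: the explicit equivalent of the conversion performed by ffi.new
-- above; ffi.copy copies each Lua string plus its terminating null byte.
local buf = ffi.new("char[?][100]", #species)
for i, s in ipairs(species) do
  ffi.copy(buf[i - 1], s)
end
-- buf could then be passed to attr:write(buf, datatype) as above.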

We inspect the contents of the HDF5 file:

h5dump -A attribute.h5
HDF5 "attribute.h5" {
GROUP "/" {
   GROUP "molecules" {
      ATTRIBUTE "species" {
         DATATYPE  H5T_STRING {
            STRSIZE 100;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
         DATA {
         (0): "H2O", "C8H10N4O2", "C6H12O6"
         }
      }
   }
}
}

The strings are shown at their actual lengths, with the padding null characters discarded.

Next we open the file, the group, and the attached attribute:

local file = hdf5.open_file("attribute.h5")
local group = file:open_group("molecules")
local attr = group:open_attribute("species")

We retrieve and verify the dimensions of the attribute:

local filespace = attr:get_space()
local dims = filespace:get_simple_extent_dims()
local N = dims[1]
assert(N == #species)

We allocate an array for the strings and read the attribute:

local buf = ffi.new("char[?][50]", N)
local memtype = hdf5.c_s1:copy()
memtype:set_size(50)
attr:read(buf, memtype)
attr:close()
group:close()
file:close()

The size of the memory datatype (50 characters) need not match the size of the file datatype (100 characters). If a string exceeds the size of the memory datatype, the HDF5 library truncates and terminates the string safely with a null character.

Finally we verify the data read from the attribute:

for i = 0, N - 1 do
  assert(ffi.string(buf[i]) == species[i + 1])
end

Using ffi.string, the C strings are converted to Lua strings for comparison.

Note that C array indices begin at 0, while Lua table indices begin at 1.

3 Dataspaces

In this example we use hyperslab selections to store time-dependent data to a dataset.

We simulate a system of particles over a number of steps in which the particles move randomly in 3‑dimensional space.

local N = 4200
local nstep = 100

We begin by creating a dataset for the particle positions:

local file = hdf5.create_file("dataspace.h5")
local dcpl = hdf5.create_plist("dataset_create")
dcpl:set_chunk({1, 1000, 3})
dcpl:set_deflate(6)
local dataspace = hdf5.create_simple_space({0, N, 3}, {nil, N, 3})
local dataset = file:create_dataset("particles/solvent/position/value", hdf5.double, dataspace, nil, dcpl)

Here we specify the dataspace of the dataset using two sets of dimensions. The dimensions {0, N, 3} specify the sizes of the dataset after creation. The dimensions {nil, N, 3} specify the maximum sizes to which the dataset may be resized, where nil stands for a variable size in that dimension. Here the dataset is empty after creation and resizable to an arbitrary number of steps. When a dataset is resizable, i.e. when the current and maximum dimensions of a dataset differ upon creation, the data is stored using a chunked layout, rather than a contiguous layout as for a fixed-size dataset. The size of the chunks needs to be specified upon creation of the dataset using a dataset creation property list.

The choice of chunk size affects write and read performance, since each chunk is written or read as a whole. Here we choose a chunk size of {1, 1000, 3}. To write or read the positions of all 4200 particles at a given step, the HDF5 library needs to write or read 5 chunks altogether, of which the fifth chunk is partially filled. Tracing the position of a single particle through all 100 steps, however, would require the HDF5 library to fetch 100 chunks. As a general guideline, the chunk size in bytes should be well above 1 kB and below 1 MB, and the chunk dimensions should match the write and read access patterns as closely as possible.
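
For reference, the numbers behind this choice (plain arithmetic, no HDF5 calls):

-- One chunk of {1, 1000, 3} double-precision values:
local chunk_bytes = 1 * 1000 * 3 * 8         -- 24000 bytes, i.e. about 24 kB per chunk
-- Chunks needed to cover all particles at a single step:
local chunks_per_step = math.ceil(N / 1000)  -- 5 chunks for N = 4200, the last one partially filled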

Chunks may be individually compressed by specifying a compression algorithm and level in the dataset creation property list. Here we use the deflate algorithm with a compression level of 6, which is the default level of the gzip program. Note that for this example compression is pointless, since we store random data; compression is sensible when the data within each chunk is correlated.

After creating the dataset, we prepare a file dataspace for writing:

local filespace = hdf5.create_simple_space({nstep, N, 3})
filespace:select_hyperslab("set", {0, 0, 0}, nil, {1, N, 3})

The file dataspace specifies a selection within the dataset that data is written to or read from. We select a 3‑dimensional slab from the file dataspace that has its origin at {0, 0, 0} and a size of {1, N, 3}. As shown in the figure below, at each step we extend the dimensions of the dataset by one slab, and move the hyperslab selection by one in the dimension of steps.

[Figure: Dataset dimensions with hyperslab selection at steps 1, 2, and 3.]

Similarly we prepare a memory dataspace for writing:

local pos = ffi.new("struct { double x, y, z, w; }[?]", N)
local memspace = hdf5.create_simple_space({N, 4})
memspace:select_hyperslab("set", {0, 0}, nil, {N, 3})

The memory dataspace specifies the dimensions of and a selection within the array of particle positions. Each 3‑dimensional vector array element is padded with an unused fourth component to ensure properly aligned memory access. By selecting a hyperslab of dimensions {N, 3} from the memory dataspace of dimensions {N, 4}, we tell the HDF5 library to discard the unused fourth component of each vector.

We conclude with a loop that writes the particle positions at each step:

math.randomseed(42)
for step = 1, nstep do
  dataset:set_extent({step, N, 3})
  for i = 0, N - 1 do
    pos[i].x = math.random()
    pos[i].y = math.random()
    pos[i].z = math.random()
  end
  dataset:write(pos, hdf5.double, memspace, filespace)
  filespace:offset_simple({step, 0, 0})
end
dataset:close()
file:close()

Before writing the positions, we append an additional slab of dimensions {1, N, 3} to the dataset by extending its dimensions to {step, N, 3}. We pass the memory dataspace to select the region to be read from the array and the file dataspace to select the region to be written in the dataset. Recall that the file dataspace has rank 3 and the memory dataspace has rank 2; the dimensionality of file and memory dataspaces may differ as long as the number of selected elements is equal. After writing the positions, we move the origin of the hyperslab selection within the file dataspace to offset {step, 0, 0} in preparation for the next step.

At the end of the simulation the HDF5 file contains:

h5dump -A dataspace.h5
HDF5 "dataspace.h5" {
GROUP "/" {
   GROUP "particles" {
      GROUP "solvent" {
         GROUP "position" {
            DATASET "value" {
               DATATYPE  H5T_IEEE_F64LE
               DATASPACE  SIMPLE { ( 100, 4200, 3 ) / ( H5S_UNLIMITED, 4200, 3 ) }
            }
         }
      }
   }
}
}

To check the stored particle positions, we open the dataset for reading:

local file = hdf5.open_file("dataspace.h5")
local dataset = file:open_dataset("particles/solvent/position/value")

We retrieve and verify the dimensions of the dataset:

local filespace = dataset:get_space()
local dims = filespace:get_simple_extent_dims()
local nstep, N, D = dims[1], dims[2], dims[3]
assert(nstep == 100)
assert(N == 4200)
assert(D == 3)

For reading individual slabs, we can use the dataspace obtained from the dataset:

filespace:select_hyperslab("set", {0, 0, 0}, nil, {1, N, 3})

As above we prepare a memory dataspace that describes the array:

local pos = ffi.new("struct { double x, y, z, w; }[?]", N)
local memspace = hdf5.create_simple_space({N, 4})
memspace:select_hyperslab("set", {0, 0}, nil, {N, 3})

We read the slabs for each step and compare the positions for equality:

math.randomseed(42)
for step = 1, nstep do
  dataset:read(pos, hdf5.double, memspace, filespace)
  for i = 0, N - 1 do
    assert(pos[i].x == math.random())
    assert(pos[i].y == math.random())
    assert(pos[i].z == math.random())
  end
  filespace:offset_simple({step, 0, 0})
end
dataset:close()
file:close()