User's Guide
If you are new to HDF5, begin by reading about the HDF5 data model and file structure in the HDF5 User’s Guide.
If you are new to LuaJIT, read about C data structures, which are used to pass data between a program and the HDF5 library.
1 Datasets
In this example we store particle positions to a dataset.
We begin by filling an array of length N of vectors with 3 components:
local N = 1000
local pos = ffi.new("struct { double x, y, z; }[?]", N)
math.randomseed(42)
for i = 0, N - 1 do
  pos[i].x = math.random()
  pos[i].y = math.random()
  pos[i].z = math.random()
end
Next we create an empty HDF5 file:
local file = hdf5.create_file("dataset.h5")
An HDF5 file has a hierarchical structure analogous to a filesystem. Where a filesystem consists of directories that contain subdirectories and files, an HDF5 file consists of groups that contain subgroups and datasets. In an HDF5 file, data is stored in datasets and attributes. Datasets store large data and may be read and written partially. Attributes store small metadata that is attached to groups or datasets and may be read and written only atomically.
A dataset is described by a location in the file, a datatype and a dataspace:
local datatype = hdf5.double
local dataspace = hdf5.create_simple_space({N, 3})
local dataset = file:create_dataset("particles/solvent/position", datatype, dataspace)
The location of the dataset is specified as a path relative to the file root; its absolute path is /particles/solvent/position. The group /particles and the subgroup /particles/solvent are implicitly created since they do not exist yet. We pass the datatype hdf5.double to store data in double-precision floating point format. The dataspace describes the dimensions of the dataset. A dataset may be created with fixed or variable-size dimensions. We choose a dataspace of fixed dimensions N×3 equal to the dimensions of the array.
Next we write the contents of the pos array to the dataset:
dataset:write(pos, datatype, dataspace)
Note that datatype and dataspace are also passed to the write function. These tell the HDF5 library about the memory layout of the pos array, which happens to match the file layout of the dataset. For advanced use cases file and memory layouts may differ. For example, the data could be stored in memory with double precision and stored in the file with single precision to halve disk I/O, in which case the HDF5 library would automatically convert between double and single precision when writing the data.
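Such a conversion could be sketched as follows, using the same binding API as the examples in this guide; the file name and dataset path here are illustrative:

```lua
local ffi = require("ffi")
local hdf5 = require("hdf5")

local N = 1000
local pos = ffi.new("struct { double x, y, z; }[?]", N)  -- memory: double precision

local file = hdf5.create_file("convert.h5")
local dataspace = hdf5.create_simple_space({N, 3})
-- File datatype: single precision, halving the dataset's size on disk.
local dataset = file:create_dataset("position", hdf5.float, dataspace)
-- Memory datatype: double precision; the library converts while writing.
dataset:write(pos, hdf5.double, dataspace)
dataset:close()
file:close()
```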
We close the HDF5 file, which flushes the written data to disk:
dataset:close()
file:close()
Any open objects in the file, which includes groups, datasets, attributes and committed datatypes, must be closed before closing the file itself; otherwise, the HDF5 library will raise an error when attempting to close the file.
To inspect the HDF5 file, we use the h5dump tool:
h5dump dataset.h5
To show only the metadata of the file:
h5dump -A dataset.h5
HDF5 "dataset.h5" {
GROUP "/" {
GROUP "particles" {
GROUP "solvent" {
DATASET "position" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( 1000, 3 ) / ( 1000, 3 ) }
}
}
}
}
}
The dimensions of the dataset are shown twice, as the current dimensions and the maximum dimensions. A dataset may be resized at any point in time, as long as the new dimensions fit within the maximum dimensions specified upon creation. The maximum dimensions may further be specified as variable, in which case the dataset may be resized to arbitrary dimensions.
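Resizing could look like the following sketch, built from the same API calls used in this guide; note that a resizable dataset requires a chunked layout, which is explained in the Dataspaces section below:

```lua
local hdf5 = require("hdf5")

local file = hdf5.create_file("resize.h5")
-- Current dimensions {0, 3}; maximum dimensions {unlimited, 3},
-- where nil stands for a variable size in that dimension.
local dataspace = hdf5.create_simple_space({0, 3}, {nil, 3})
-- A resizable dataset must be stored with a chunked layout.
local dcpl = hdf5.create_plist("dataset_create")
dcpl:set_chunk({100, 3})
local dataset = file:create_dataset("samples", hdf5.double, dataspace, nil, dcpl)
-- Grow the dataset; the new dimensions must fit within the maximum dimensions.
dataset:set_extent({100, 3})
dataset:close()
file:close()
```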
Next we open the HDF5 file and the dataset for reading:
local file = hdf5.open_file("dataset.h5")
local dataset = file:open_dataset("particles/solvent/position")
We retrieve and verify the dimensions of the dataset:
local filespace = dataset:get_space()
local dims = filespace:get_simple_extent_dims()
local N, D = dims[1], dims[2]
assert(N == 1000)
assert(D == 3)
We read the data from the dataset into a newly allocated array:
local pos = ffi.new("struct { float x, y, z; }[?]", N)
local memtype = hdf5.float
local memspace = hdf5.create_simple_space({N, 3})
dataset:read(pos, memtype, memspace)
dataset:close()
file:close()
The array stores elements with single precision, while we created the dataset with double precision. By passing a different memory datatype to the read function, we instruct the HDF5 library to convert from double precision to single precision when reading.
Finally we verify the data read from the dataset:
math.randomseed(42)
for i = 0, N - 1 do
assert(math.abs(pos[i].x - math.random()) < 1e-7)
assert(math.abs(pos[i].y - math.random()) < 1e-7)
assert(math.abs(pos[i].z - math.random()) < 1e-7)
end
The values are compared with a tolerance corresponding to single precision.
2 Attributes
In this example we store a sequence of strings to an attribute.
We create an empty HDF5 file with a group /molecules:
local file = hdf5.create_file("attribute.h5")
local group = file:create_group("molecules")
To the group we attach an attribute for a given set of strings:
local species = {"H2O", "C8H10N4O2", "C6H12O6"}
local dataspace = hdf5.create_simple_space({#species})
local datatype = hdf5.c_s1:copy()
datatype:set_size(100)
local attr = group:create_attribute("species", datatype, dataspace)
Like a dataset, an attribute is described by a datatype and a dataspace. The HDF5 library defines a special datatype hdf5.c_s1 for C strings, which needs to be copied and adjusted for the length of the strings to be written. The size of a string datatype must be equal to or greater than the number of characters of the longest string plus a terminating null character. We choose 100 characters including a null terminator, to avoid having to determine an exact size at the expense of padding with null characters.
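Alternatively, the exact size could be computed from the strings themselves; a sketch in plain Lua:

```lua
-- Size the string datatype from the longest string plus a null terminator,
-- instead of a fixed 100 characters.
local species = {"H2O", "C8H10N4O2", "C6H12O6"}
local maxlen = 0
for _, s in ipairs(species) do
  maxlen = math.max(maxlen, #s)
end
local datatype = hdf5.c_s1:copy()
datatype:set_size(maxlen + 1)  -- 10 characters: "C8H10N4O2" plus a null
```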
Next we write the strings to the attribute:
attr:write(ffi.new("char[?][100]", #species, species), datatype)
attr:close()
group:close()
file:close()
For writing, the Lua strings in the species table are converted to an array of C strings that match the string datatype.
We inspect the contents of the HDF5 file:
h5dump -A attribute.h5
HDF5 "attribute.h5" {
GROUP "/" {
GROUP "molecules" {
ATTRIBUTE "species" {
DATATYPE H5T_STRING {
STRSIZE 100;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): "H2O", "C8H10N4O2", "C6H12O6"
}
}
}
}
}
The strings are shown at their actual lengths, with the padding null characters discarded.
Next we open the file, the group, and the attached attribute:
local file = hdf5.open_file("attribute.h5")
local group = file:open_group("molecules")
local attr = group:open_attribute("species")
We retrieve and verify the dimensions of the attribute:
local filespace = attr:get_space()
local dims = filespace:get_simple_extent_dims()
local N = dims[1]
assert(N == #species)
We allocate an array for the strings and read the attribute:
local buf = ffi.new("char[?][50]", N)
local memtype = hdf5.c_s1:copy()
memtype:set_size(50)
attr:read(buf, memtype)
attr:close()
group:close()
file:close()
The size of the memory datatype (50 characters) need not match the size of the file datatype (100 characters). If a string exceeds the size of the memory datatype, the HDF5 library truncates and terminates the string safely with a null character.
Finally we verify the data read from the attribute:
for i = 0, N - 1 do
assert(ffi.string(buf[i]) == species[i + 1])
end
Using ffi.string, the C strings are converted to Lua strings for comparison.
Note that C array indices begin at 0, while Lua table indices begin at 1.
3 Dataspaces
In this example we use hyperslab selections to store time-dependent data to a dataset.
We simulate a system of particles over a number of steps in which the particles move randomly in 3‑dimensional space.
local N = 4200
local nstep = 100
We begin by creating a dataset for the particle positions:
local file = hdf5.create_file("dataspace.h5")
local dcpl = hdf5.create_plist("dataset_create")
dcpl:set_chunk({1, 1000, 3})
dcpl:set_deflate(6)
local dataspace = hdf5.create_simple_space({0, N, 3}, {nil, N, 3})
local dataset = file:create_dataset("particles/solvent/position/value", hdf5.double, dataspace, nil, dcpl)
Here we specify the dataspace of the dataset using two sets of dimensions. The dimensions {0, N, 3} specify the sizes of the dataset after creation. The dimensions {nil, N, 3} specify the maximum sizes to which the dataset may be resized, where nil stands for a variable size in that dimension. Here the dataset is empty after creation and resizable to an arbitrary number of steps. When a dataset is resizable, i.e. when the current and maximum dimensions of a dataset differ upon creation, the data is stored using a chunked layout, rather than a contiguous layout as for a fixed-size dataset. The size of the chunks needs to be specified upon creation of the dataset using a dataset creation property list.
The choice of chunk size affects write and read performance, since each chunk is written or read as a whole. Here we choose a chunk size of {1, 1000, 3}. To write or read the positions of all 4200 particles at a given step, the HDF5 library needs to write or read 5 chunks altogether, of which the fifth chunk is partially filled. Tracing the position of a single particle through all 100 steps, however, would require the HDF5 library to fetch 100 chunks. As a general guideline, the chunk size in bytes should be well above 1 kB and below 1 MB, and the chunk dimensions should match the write and read access patterns as closely as possible.
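The guideline can be checked with a quick calculation; for the chunk size chosen above:

```lua
-- Chunk size in bytes: product of the chunk dimensions times sizeof(double).
local chunk = {1, 1000, 3}
local bytes = 8  -- sizeof(double)
for _, d in ipairs(chunk) do
  bytes = bytes * d
end
print(bytes)  -- 24000 bytes, i.e. 24 kB: well above 1 kB and below 1 MB
```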
Chunks may be individually compressed by specifying a compression algorithm and level in the dataset creation property list. Here we use the deflate algorithm with a compression level of 6, which is the default level of the gzip program. Note that for this example compression is pointless, since we store random data; compression is sensible when the data within each chunk is correlated.
After creating the dataset, we prepare a file dataspace for writing:
local filespace = hdf5.create_simple_space({nstep, N, 3})
filespace:select_hyperslab("set", {0, 0, 0}, nil, {1, N, 3})
The file dataspace specifies a selection within the dataset that data is written to or read from. We select a 3‑dimensional slab from the file dataspace that has its origin at {0, 0, 0} and a size of {1, N, 3}. As shown in the figure below, at each step we extend the dimensions of the dataset by one slab, and move the hyperslab selection by one in the dimension of steps.
Similarly we prepare a memory dataspace for writing:
local pos = ffi.new("struct { double x, y, z, w; }[?]", N)
local memspace = hdf5.create_simple_space({N, 4})
memspace:select_hyperslab("set", {0, 0}, nil, {N, 3})
The memory dataspace specifies the dimensions of and a selection within the array of particle positions. Each 3‑dimensional vector array element is padded with an unused fourth component to ensure properly aligned memory access. By selecting a hyperslab of dimensions {N, 3} from the memory dataspace of dimensions {N, 4}, we tell the HDF5 library to discard the unused fourth component of each vector.
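The padded layout can be verified with the LuaJIT FFI; a short sketch:

```lua
local ffi = require("ffi")

-- Each padded element occupies 4 doubles, of which w is unused.
local vec = ffi.typeof("struct { double x, y, z, w; }")
assert(ffi.sizeof(vec) == 32)         -- 4 * sizeof(double)
assert(ffi.offsetof(vec, "w") == 24)  -- the component the hyperslab discards
```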
We conclude with a loop that writes the particle positions at each step:
math.randomseed(42)
for step = 1, nstep do
  dataset:set_extent({step, N, 3})
  for i = 0, N - 1 do
    pos[i].x = math.random()
    pos[i].y = math.random()
    pos[i].z = math.random()
  end
  dataset:write(pos, hdf5.double, memspace, filespace)
  filespace:offset_simple({step, 0, 0})
end
dataset:close()
file:close()
Before writing the positions, we append an additional slab of dimensions {1, N, 3} to the dataset by extending its dimensions to {step, N, 3}. We pass the memory dataspace to select the region for reading from the array and the file dataspace to select the region for writing to the dataset. Recall that the file dataspace has rank 3 and the memory dataspace has rank 2; the dimensionality of file and memory dataspaces may differ as long as the number of selected elements is equal. After writing the positions, we move the origin of the hyperslab selection within the file dataspace to offset {step, 0, 0} to prepare for the next step.
At the end of the simulation the HDF5 file contains:
h5dump -A dataspace.h5
HDF5 "dataspace.h5" {
GROUP "/" {
GROUP "particles" {
GROUP "solvent" {
GROUP "position" {
DATASET "value" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( 100, 4200, 3 ) / ( H5S_UNLIMITED, 4200, 3 ) }
}
}
}
}
}
}
To check the stored particle positions, we open the dataset for reading:
local file = hdf5.open_file("dataspace.h5")
local dataset = file:open_dataset("particles/solvent/position/value")
We retrieve and verify the dimensions of the dataset:
local filespace = dataset:get_space()
local dims = filespace:get_simple_extent_dims()
local nstep, N, D = dims[1], dims[2], dims[3]
assert(nstep == 100)
assert(N == 4200)
assert(D == 3)
For reading individual slabs, we can use the dataspace obtained from the dataset:
filespace:select_hyperslab("set", {0, 0, 0}, nil, {1, N, 3})
As above we prepare a memory dataspace that describes the array:
local pos = ffi.new("struct { double x, y, z, w; }[?]", N)
local memspace = hdf5.create_simple_space({N, 4})
memspace:select_hyperslab("set", {0, 0}, nil, {N, 3})
We read the slabs for each step and compare the positions for equality:
math.randomseed(42)
for step = 1, nstep do
  dataset:read(pos, hdf5.double, memspace, filespace)
  for i = 0, N - 1 do
    assert(pos[i].x == math.random())
    assert(pos[i].y == math.random())
    assert(pos[i].z == math.random())
  end
  filespace:offset_simple({step, 0, 0})
end
dataset:close()
file:close()