
Direct Chunk Read and Write Questions

The Direct Chunk Read and Write APIs are for users with specialized data processing pipelines, e.g., those who compress their data in hardware or do something else highly unusual. They were moved from the high-level library to the main library so that they would work with the Virtual Object Layer (VOL) and for performance reasons. Anyone who is not an obvious power user should almost certainly be steered away from these functions.

They map the user's buffer directly to a single entry in the chunk index with no interpretation. There is no filter pipeline. No type conversion. No dataspace manipulations. No spanning chunk boundaries. Nothing. Just 'this buffer I'm providing = what should be stored for this entire chunk' (and vice-versa).

Can this functionality be used on datasets that are not chunked (compact or contiguous)?

This will not work, since non-chunked datasets do not have chunks. HDF5 explicitly tests for the dataset being chunked and returns an error if it is not.

Is it possible to use this functionality with parallel HDF5 at the same time?

Reading will definitely work, but writing may have issues. It is not tested, so it should be considered unsupported.

Is it possible to use this functionality with hyperslab/point selections at the same time?

Neither direct chunk call takes a dataspace, so it's unclear how this would even work. Once a chunk is written via H5Dwrite_chunk(), dataspaces, filter pipelines, etc. will all work normally, as if the data were written using H5Dwrite().

Does direct chunk write/read not work with variable-length datatypes?

Correct, this functionality cannot work with variable-length types in its current form. Variable-length data is stored on the global heap, with the chunks holding only references to it, so there are no raw chunk bytes that a direct transfer could meaningfully move.

Is there a function that one can call to retrieve the filter mask used to write a chunk directly in a dataset?

Yes, see H5Dget_chunk_info_by_coord.

Why the need to provide the size of buf through the data_size parameter? The documentation says that data_size specifies the size of buf, but isn't it implicit that the size of buf is/should be the size of the chunk (making the parameter redundant)? Otherwise, what are the implications of passing a data_size value smaller or larger than the size of the chunk?

If buf holds filtered (e.g., compressed) data, its size cannot be computed from the chunk dimensions and the dataset's datatype, so it must be supplied explicitly through data_size.

Can we write more than one chunk with just one call of the function H5DOwrite_chunk (now H5Dwrite_chunk)?

No. Each call maps the buffer to a single entry in the chunk index, so one call transfers exactly one chunk; to write several chunks, call the function once per chunk.


Assuming you have a one-dimensional dataset of type int32 with 10 elements, chunked with a chunk size of 2 elements:

a. If you call H5Dwrite_chunk with data_size (much) greater than the size of the dataset, e.g., 2048, the HDF5 file bloats by around 2048 bytes. Why not limit this size to the size of the chunk (8 bytes = 2 x int32)? The dataset's space on disk increases, yet you can never reach more than the first 40 bytes (10 x int32) of the dataset anyway.

We just believe the user that their chunk is that size. Filters can increase the size of the stored data (consider a checksum filter), so there are no obvious size limits that we could enforce.

b. If you call H5Dwrite_chunk with data_size smaller than the size of the chunk, all of the chunk's bytes are still written, even those past data_size. Why this behavior? I would expect that if data_size is, e.g., 4, a direct chunk write would store only the first 4 bytes and leave the other 4 bytes (i.e., the other int32) already written in the dataset intact.

This API call is for writing a chunk in its entirety; it is not for doing partial I/O.