This page describes the new HDF5 Partial Edge Chunk option and provides links to available reference manual entries.
Prior to HDF5 Release 1.10.0, data chunks on the edge of a dataset were stored with the same size as other chunks, even if the logical size of the chunk was smaller. With this feature, HDF5 adds an option controlling whether filters are applied to partial edge chunks.
Background and the New Option
Data chunking can, in many cases, greatly improve dataset I/O performance. In some cases, however, chunking can result in performance degradation.
Consider an extensible dataset that will be opened, extended, and closed many times. If the dataset is large, compression and chunked storage can yield substantial file size and I/O performance benefits. However, since the number of elements per extension may vary, it is unlikely that the dataset size will always be a multiple of the chunk size, so partial edge chunks will be present. Compressing the partial edge chunks in this usage model may introduce a substantial, and sometimes unacceptable, performance penalty each time the dataset is extended. The penalty has two causes. First, the filter must be applied twice to every edge chunk after each extension: the original compressed partial chunk is uncompressed, new data is added to the chunk, and the extended chunk is recompressed. Second, the compressed size of the edge chunks changes as the dataset grows, so the chunks must be placed anew in the file. This movement of chunks degrades write performance and can also cause fragmentation, which wastes space in the file. Without this option, the only way to extend datasets as quickly as possible has been to store the entire dataset uncompressed.
A new option to control the filtering of partial edge chunks overcomes the performance degradation described above. With this option, partial edge chunks are stored without compression. If the dataset is subsequently extended, any partial edge chunk that becomes a complete chunk is then compressed for storage, while new partial edge chunks remain uncompressed in storage. The double filtering of partial edge chunks is eliminated.
By disabling filters on partial edge chunks, this option not only reduces filtering overhead but also reduces the fragmentation that occurs when datasets are extended and chunks must be moved, while still allowing complete chunks in the dataset to be compressed. When a dataset expands or shrinks, one or more chunks may go from partial edge to complete, or from complete to partial edge. When filters are disabled for partial edge chunks in a dataset and a chunk in that dataset changes classification, the HDF5 Library reallocates storage for the chunk and applies or disables filters according to the final classification of the chunk.
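The classification described above can be illustrated with a small, library-independent sketch. The function name `is_partial_edge_chunk` and its parameters are hypothetical and are not part of the HDF5 API; the sketch assumes that a chunk counts as a partial edge chunk whenever it extends past the dataset's current extent in at least one dimension.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not an HDF5 function.
 * Returns true when the chunk at chunk-grid coordinates `idx` extends past
 * the dataset's current extent in at least one dimension, i.e. when it is
 * a partial edge chunk. Every array argument holds `rank` elements. */
bool is_partial_edge_chunk(size_t rank,
                           const unsigned long long *dset_dims,
                           const unsigned long long *chunk_dims,
                           const unsigned long long *idx)
{
    for (size_t d = 0; d < rank; d++) {
        assert(chunk_dims[d] > 0); /* a chunk dimension must be positive */

        /* One past the last element index the chunk covers in dimension d. */
        unsigned long long chunk_end = (idx[d] + 1) * chunk_dims[d];

        if (chunk_end > dset_dims[d])
            return true; /* the chunk hangs over the dataset edge */
    }
    return false; /* the chunk lies entirely inside the dataset extent */
}
```

For a one-dimensional dataset of 250 elements with a chunk size of 100, the chunk at index 2 covers elements 200 through 299 and is a partial edge chunk. Once the dataset is extended to 300 elements, the same chunk is complete, which is the change of classification that causes the library to reallocate its storage and apply the filters.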
This option is controlled by a bit flag, H5D_CHUNK_DONT_FILTER_PARTIAL_CHUNKS in the C API, in a function parameter. The parameter is manipulated by two API functions, H5P_SET_CHUNK_OPTS and H5P_GET_CHUNK_OPTS (H5Pset_chunk_opts and H5Pget_chunk_opts in C), which act on a dataset creation property list.
Disabling filters for partial edge chunks was not possible in HDF5 releases prior to 1.10.0, and its implementation required a modification to the HDF5 file format specification. Therefore, datasets created with this option will not be accessible using earlier HDF5 releases.