What happens if a process crashes when writing data in parallel?
In general, HDF5 does not handle system crashes gracefully. If metadata has not been written to the file when a crash occurs, the file can be corrupted or data can be lost. Flushing data to the file regularly mitigates the problem but does not solve it, and frequent flushing can slow performance.
With Parallel HDF5, objects are created collectively, and once they are created you can write to a dataset either collectively or independently. As with the serial version of HDF5, any metadata that has not been written to the file at the time of a crash can be lost. Typically, for best performance, one process creates all of the objects in the file (object creation must be done collectively) and then closes the file. All of the processes then open the file for writing. If one process crashes, the APIs that must be called collectively (H5Fclose, H5Fflush, etc.) cannot complete, because the crashed process would also need to call them. In that case the needed metadata may never be written, resulting in a corrupted file.