We are seeing a problem very similar to the one described in the HDF forum entry below.
Right now, running our application for any length of time while writing data to an h5 file inevitably results in the system running out of memory.
I have already tried all of the things suggested in the forum entry, with the same results as the other HDF user (no success).
We have thousands of sensors whose measurements are sent in data packets, similar to Ethernet packets. Part of each packet is the data. The length of the data portion varies on a sensor-by-sensor basis, and so does the content/structure of the data. When the user runs our software, the user decides which sensor(s)/measurement(s) are important for the test session. It is not until the test session starts that we finally know what the potential structure of the data might be if we were to choose compound types in a table or packet table in an h5 file.
Our chosen option was to treat each individual measurement as a dataset with a single column. The number of rows in the column grows over time as the measurements come in from the sensor. We don't know ahead of time how many rows are going to be in the dataset because we don't know how long the test session is going to run. When a new measurement comes in, we extend the dataset by one, add the new measurement, and close the dataset until the next measurement comes in.
We have a feature called checkpointing that closes the current h5 file and starts writing to a new one. The first file had the property H5F_CLOSE_STRONG set, so closing it should have closed everything in the file. However, closing it changed neither the memory usage nor the processor usage of that thread.
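For reference, setting H5F_CLOSE_STRONG maps onto h5py's low-level API roughly like this (a sketch; the file name is made up, and note that the close degree only controls closing of open HDF5 objects, not whether the C library's allocator returns memory to the OS):

```python
import h5py
from h5py import h5f, h5p

# Build a file-access property list with H5F_CLOSE_STRONG, so that
# closing the file also closes every object still open inside it.
fapl = h5p.create(h5p.FILE_ACCESS)
fapl.set_fclose_degree(h5f.CLOSE_STRONG)

fid = h5f.create(b"ckpt.h5", h5f.ACC_TRUNC, fapl=fapl)
f = h5py.File(fid)  # wrap the low-level id in a high-level File
ds = f.create_dataset("x", data=[1.0, 2.0])
f.close()           # CLOSE_STRONG: the still-open dataset is closed too
```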
It is hard to answer this because there is a lot going on here. Here are some thoughts:
- Performance is probably going to be terrible. Variable-length data are bad for I/O and cannot be compressed. We would have to see the code or the output from h5dump to be certain, but the scenario described sounds terrible.
- In terms of memory: if a process on a POSIX-y system consumes a bunch of memory, it is almost certain that the memory won't be released back to the system until the process ends. That is just how sbrk and malloc interact in libc, and there is no way around it. When you call malloc, the C library calls sbrk to get a block of memory from the kernel (if needed). Once the memory manager has handed out a bunch of memory from that allocation, it is usually fragmented, with holes all over the place, so the memory manager cannot really "give back" the memory to the OS, even if you free it. Therefore, if your program has ever grown to a very large size in memory, it is going to stay that way until the process ends.
- Do not open and close the dataset for each measurement unless the data arrive very infrequently. That operation has overhead.
- We would need to see how large variable-length types affect memory usage via the global heap and metadata cache. You may be running into trouble there.
- An alternative structure is to use two datasets: a 1-D dataset holding the data and a 2-D index dataset of the start/end points marking where each element lies in the 1-D dataset. It is hard to guess a good chunk size for either without more information, but that might give better I/O, compression, etc. than using VL datatypes.
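The two-dataset layout could be sketched like this (assuming h5py; names, chunk sizes, and the sample records are illustrative):

```python
import h5py
import numpy as np

# Three variable-length records, instead of a VL datatype.
records = [np.array([1.0, 2.0]),
           np.array([3.0]),
           np.array([4.0, 5.0, 6.0])]

with h5py.File("indexed.h5", "w") as f:
    # 1-D "data" dataset: all elements concatenated end to end.
    # Being fixed-type and chunked, it can be compressed.
    flat = np.concatenate(records)
    f.create_dataset("data", data=flat, maxshape=(None,),
                     chunks=(1024,), compression="gzip")
    # 2-D "index" dataset: one (start, end) row per record.
    lengths = np.array([len(r) for r in records])
    ends = np.cumsum(lengths)
    starts = ends - lengths
    f.create_dataset("index", data=np.column_stack([starts, ends]),
                     maxshape=(None, 2), dtype="i8")

# Reading record i back is a slice of the data dataset.
with h5py.File("indexed.h5", "r") as f:
    s, e = f["index"][1]
    print(list(f["data"][s:e]))
```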
So, to sum up some obvious potential performance improvements:
- Use the latest file format.
- Try not opening and closing datasets.
- Try out the 1-D "data" dataset + 2-D "index" dataset scheme to see how it works (you may have to tune the chunk sizes to get good performance, and good values will probably differ between the two datasets).
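The first suggestion, opting into the latest file format, is a one-line change in h5py (a sketch; the file and dataset names are made up):

```python
import h5py

# libver="latest" tells the library to use the newest file-format
# features it supports, which can reduce metadata overhead in files
# with many objects (e.g. thousands of per-measurement datasets).
with h5py.File("latest.h5", "w", libver="latest") as f:
    f.create_dataset("x", data=[1, 2, 3])

with h5py.File("latest.h5", "r") as f:
    print(list(f["x"][:]))
```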