The HDF Group has a strong interest in the use of HDF5 in bioinformatics. The links to the left will direct you to several projects that demonstrate the use of HDF5 to store biological data.
The title of each list item is a link to the company or project website.
- Applied Biosystems The primary image data from Applied Biosystems' next generation DNA sequencers are stored in HDF5.
- Pacific Biosciences The primary image data from Pacific Biosciences' next generation DNA sequencers are stored in HDF5.
- SDCubes Semantically Typed Data Cubes (SDCubes) combine XML and HDF5 to leverage the strengths of each for storing numerical data and metadata. Millard BL, Niepel M, Menden MP, Muhlich JL, Sorger PK. Adaptive informatics for multifactorial and high-content biological data. Nature Meth. 2011 8:487-492. LINK
- Unifying Biological Image Formats with HDF5. This article describes the advantages of HDF5 as an image format and how it could be used to unify the many disparate biological image formats. Dougherty MT, Folk MJ, Zadok E, Bernstein HJ, Bernstein FC, Eliceiri KW, Benger W, Best C. Unifying Biological Image Formats with HDF5. Commun ACM. 2009 1;52(10):42-47. LINK
- Quantitative data: learning to share. This article describes the use of HDF5 and several other technologies when dealing with large, heterogeneous datasets which need to be shared among researchers. Baker M. Quantitative data: learning to share. Nature Meth. 2012 9:39-41. LINK
If you use HDF5 in a bioinformatics context and would like your project listed here, please email derobins at hdfgroup dot org.
Linkage Disequilibrium (LD) deals with finding non-random allele associations at different chromosome loci in population genetics. The complexity of an LD calculation is O(mn^2), where m is the size of the population being studied and n is the number of loci being considered. Due to the large number of SNPs, genome-level LD analysis is extremely time-consuming. On a single-processor workstation, this computation could take months, but the same computation could be done on a large supercomputer in a matter of days. Our experiments showed that for chromosome 22, a complete LD analysis could be done within 4 minutes on a 32-node cluster.
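To make the O(mn^2) cost concrete, here is a minimal sketch of the pairwise r^2 statistic over phased 0/1 haplotypes. The function and variable names are illustrative, not taken from the project's actual code, and real pipelines would handle missing data and monomorphic loci.

```python
# Sketch of the r^2 LD statistic for a pair of biallelic loci.
# Haplotypes are encoded as 0/1 lists (one entry per chromosome copy).

def r_squared(hap_a, hap_b):
    """Compute r^2 between two loci from phased 0/1 haplotypes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n                 # allele frequency at locus A
    p_b = sum(hap_b) / n                 # allele frequency at locus B
    p_ab = sum(1 for x, y in zip(hap_a, hap_b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b                 # LD coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def ld_matrix(haps):
    """All-pairs r^2: n^2 pairs, each costing O(m) in the m haplotypes,
    which is the O(mn^2) step that motivates parallelization."""
    n = len(haps)
    return [[r_squared(haps[i], haps[j]) for j in range(n)] for i in range(n)]
```

Two loci in perfect association give r^2 = 1.0, and statistically independent loci give r^2 = 0.0.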
The storage requirements for the linkage array are also very high - O(n^2). In order to store, exchange, and visualize this data, we need a data storage technology, such as HDF5, that was designed to handle large quantities of data. With HDF5 compression enabled, we can fit the main LD dataset into 4.5 MB, and the HDF5 data structures allow us to easily store and retrieve lower-resolution data for visualization with minimal impact on the overall file size (5.5 MB total).
The main contributions of this work are:
- Proposing a parallelization of the LD algorithm so that the results can be generated quickly on large supercomputers.
- Proposing compression and chunking for storing the entire array.
- Proposing a hierarchy of images of decreasing resolution to allow for efficient interactive visualization.
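The second and third contributions can be sketched with h5py: write the LD matrix as a chunked, gzip-compressed dataset, then store a block-averaged copy as a low-resolution overview. The dataset name mirrors the LD_22.h5 layout described below, but the chunk shape, compression level, and reduction factor here are assumptions for illustration only.

```python
import numpy as np
import h5py

# Illustrative LD matrix (real files hold r^2 values for chromosome 22).
n = 256
ld = np.random.default_rng(0).random((n, n)).astype(np.float32)

with h5py.File("ld_demo.h5", "w") as f:
    # Chunking lets HDF5 read/write small tiles; gzip shrinks the file.
    f.create_dataset("Chromosome22", data=ld,
                     chunks=(64, 64), compression="gzip", compression_opts=6)
    # 4x4 block averages give one level of a reduced-resolution hierarchy.
    low = ld.reshape(n // 4, 4, n // 4, 4).mean(axis=(1, 3))
    f.create_dataset("Chromosome22_lowres", data=low,
                     chunks=(32, 32), compression="gzip")
```

A viewer can display the small overview dataset first and fetch only the chunks of the full-resolution matrix that the user zooms into.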
The LD_22.h5 file contains the LD values, calculated using the r^2 metric, for chromosome 22. The file has 3 different datasets. The Chromosome22 dataset contains the entire LD matrix and the other two contain the matrix at lower resolutions. In the future HDFView could be extended so that scientists could make selections in the lower resolution datasets and directly zoom into the higher resolution datasets. The LD_19.h5 file uses the same structure for chromosome 19.
You can inspect the files with HDFView after downloading them to your computer.
CAUTION: Attempting to open any dataset except the lowest-resolution one might cause the viewer to crash. This is a known problem, as HDFView has trouble opening extremely large datasets; it will be addressed in a future version of HDFView.
As of now, the only way to access the higher-resolution images is to use the "Open As" functionality and manually subset the data. While the file size is small (achieved through compression), the actual number of elements is still very large. DO NOT CLICK on the Chromosome22 dataset to see the data after opening the file with HDFView. Instead, right-click on the dataset and choose "Open As" from the menu; a dialog window with a small image will appear. You may wish to choose "Image" to display the dataset; use the left mouse button to select a small region.
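An alternative to manual subsetting in HDFView is to read a hyperslab programmatically. The sketch below builds a small stand-in file (the real LD_22.h5 is much larger) and then reads only a 10x10 region of the Chromosome22 dataset; with chunked storage, HDF5 touches only the chunks that overlap the requested slab.

```python
import numpy as np
import h5py

# Stand-in for LD_22.h5: a small chunked, compressed dataset.
with h5py.File("ld_slab_demo.h5", "w") as f:
    f.create_dataset("Chromosome22",
                     data=np.arange(10000.0).reshape(100, 100),
                     chunks=(20, 20), compression="gzip")

# Read only a small hyperslab; the full dataset is never loaded.
with h5py.File("ld_slab_demo.h5", "r") as f:
    region = f["Chromosome22"][10:20, 30:40]
print(region.shape)   # (10, 10)
```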
The hapmap*.h5 files contain all the SNP data for Chromosome 22. In the file hapmap_num_indexes.h5, we have created projection indexes on rs# and position, so that efficient subsetting based on these attributes can be supported.
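The projection-index idea can be sketched as follows: the position column is stored as its own sorted dataset, so a range query scans only the small index and then reads just the matching rows of the genotype table. The dataset names and values below are illustrative, not the actual layout of hapmap_num_indexes.h5.

```python
import numpy as np
import h5py

# Toy data: 5 SNPs x 3 samples, with a sorted position index.
positions = np.array([14430, 15010, 15790, 16210, 17530], dtype=np.int64)
genotypes = np.arange(5 * 3).reshape(5, 3).astype(np.int8)

with h5py.File("hapmap_demo.h5", "w") as f:
    f.create_dataset("genotypes", data=genotypes)
    f.create_dataset("index/position", data=positions)  # projection index

with h5py.File("hapmap_demo.h5", "r") as f:
    pos = f["index/position"][:]                   # small index, cheap to scan
    lo, hi = np.searchsorted(pos, [15000, 17000])  # rows with 15000 <= pos < 17000
    subset = f["genotypes"][lo:hi]                 # read only the matching rows
```

The same pattern works for the rs# index: a binary search over the index dataset replaces a scan of the full genotype table.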
Use HDFView to inspect the files after downloading them to your computer.
This is an example of storing genotype data in HDF5. The file contains data for the adrb2 beta-2 adrenergic receptor gene.
Use HDFView to inspect the file after you download it to your computer.
Early Perl Work
Sample Perl wrappers to HDF5 were created to illustrate how one might store genomic sequence data in HDF5, and to engage the bioinformatics community in these investigations.
The software distribution contains two modules: HDFPerl (wrappers) and BioHDF_Perl (high level APIs).
HDFPerl. Wrappers for a subset of the HDF5 functions have been developed to provide a simple Perl interface to HDF5.
BioHDF_Perl. A second Perl API has been implemented to illustrate how one might import genomic sequence data from FASTA format files into the HDF5 format. This API also creates indexes in HDF5 that allow limited search operations on data.
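To illustrate the kind of import BioHDF_Perl performs, here is a hedged Python/h5py sketch that parses a small FASTA string into two variable-length string datasets, one for identifiers and one for sequences. This mirrors the idea, not BioHDF_Perl's actual API or file layout.

```python
import h5py

fasta_text = """>seq1
ACGTACGT
>seq2
GGGTTTCC
"""

# Minimal FASTA parse: collect ids and concatenated sequence lines.
ids, seqs, cur = [], [], []
for line in fasta_text.strip().splitlines():
    if line.startswith(">"):
        if cur:
            seqs.append("".join(cur))
            cur = []
        ids.append(line[1:])
    else:
        cur.append(line)
if cur:
    seqs.append("".join(cur))

# Store as variable-length strings in HDF5.
with h5py.File("sequences_demo.h5", "w") as f:
    str_t = h5py.string_dtype()
    f.create_dataset("ids", data=ids, dtype=str_t)
    f.create_dataset("sequences", data=seqs, dtype=str_t)
```

An index dataset over the ids (as in the projection-index example above for SNP data) would then support the limited search operations the API describes.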
Performance Study: HDF5 vs. FASTA
A performance study was conducted in which we compare HDF5 with the FASTA format in terms of (a) storage use and (b) time to access genomic sequence data using traditional text-management tools for FASTA and BioHDF_Perl for HDF5. Results show that HDF5 can provide storage efficiency through its use of compression and still allow fast random access through its ability to store indexes along with compressed, chunked data.