Motivation: The past decade has witnessed a rapid development of data acquisition technologies that enable integrative genomic and proteomic analysis. One such technology is chromatin immunoprecipitation sequencing (ChIP-seq), developed for analyzing interactions between proteins and DNA via next-generation sequencing technologies. As ChIP-seq experiments are inexpensive and time-efficient, massive datasets from this domain have been acquired, introducing significant storage and maintenance challenges. To address the resulting Big Data problems, we propose a state-of-the-art lossless and lossy compression framework specifically designed for ChIP-seq Wig data, termed ChIPWig. Wig is a standard file format, which in this setting contains relevant read density information crucial for visualization and downstream processing. ChIPWig may be executed in two different modes: lossless and lossy. Lossless ChIPWig compression allows for random access and fast queries in the file through careful variable-length block-wise encoding. ChIPWig also stores the summary statistics of each block needed for guided access. Lossy ChIPWig, in contrast, performs quantization of the read density values before feeding them into the lossless ChIPWig compressor. Nonuniform lossy quantization leads to further reductions in the file size, while maintaining the same accuracy of the ChIP-seq peak calling and motif discovery pipeline based on the NarrowPeaks method tailor-made for Wig files. The compressors are designed using new statistical modeling approaches coupled with delta and arithmetic encoding. Results: We tested the ChIPWig compressor on a number of ChIP-seq datasets generated by the ENCODE project. Lossless ChIPWig reduces the file sizes to merely 6% of the original, and offers an average 6-fold compression rate improvement compared to bigWig. The running times for compression and decompression are comparable to those of bigWig. The compression and decompression speed rates are of the order of 0.2 MB/sec using general purpose computers. ChIPWig with random access only slightly degrades the performance and running time when compared to the standard mode. In the lossy mode, the average file sizes reduce by 2-fold compared to the lossless mode. Most importantly, near-optimal nonuniform quantization with respect to mean-square distortion does not affect peak calling and motif discovery results on the data tested.
- Downloaded 542 times
- Download rankings, all-time:
- Site-wide: 60,360
- In bioinformatics: 5,935
- Year to date:
- Site-wide: 143,846
- Since beginning of last month:
- Site-wide: 123,254
Downloads over time
Distribution of downloads per paper, site-wide
- 27 Nov 2020: The website and API now include results pulled from medRxiv as well as bioRxiv.
- 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!