Rxivist logo

The GCTx format and cmap{Py, R, M} packages: resources for the optimized storage and integrated traversal of dense matrices of data and annotations

By Oana M. Enache, David L. Lahr, Ted E. Natoli, Lev Litichevskiy, David Wadden, Corey Flynn, Joshua Gould, Jacob K. Asiedu, Rajiv Narayan, Aravind Subramanian

Posted 30 Nov 2017
bioRxiv DOI: 10.1101/227041

Motivation: Computational analysis of datasets generated by treating cells with pharmacological and genetic perturbagens has proven useful for the discovery of functional relationships. Facilitated by technological improvements, perturbational datasets have grown in recent years to include millions of experiments. While initial studies, such as our work on Connectivity Map, used gene expression readouts, recent studies from the NIH LINCS consortium have expanded to a more diverse set of molecular readouts, including proteomic and cell morphological signatures. Sharing these diverse data creates many opportunities for research and discovery, but the unprecedented size of data generated and the complex metadata associated with experiments have also created fundamental technical challenges regarding data storage and cross-assay integration. Results: We present the GCTx file format and a suite of open-source packages for the efficient storage, serialization, and analysis of dense two-dimensional matrices. The utility of this format is not just theoretical; we have extensively used the format in the Connectivity Map to assemble and share massive data sets comprising 1.7 million experiments. We anticipate that the generalizability of the GCTx format, paired with code libraries that we provide, will stimulate wider adoption and lower barriers for integrated cross-assay analysis and algorithm development. Availability: Software packages (available in Matlab, Python, and R) are freely available at https://github.com/cmap

Download data

  • Downloaded 2,393 times
  • Download rankings, all-time:
    • Site-wide: 7,241
    • In bioinformatics: 780
  • Year to date:
    • Site-wide: 24,975
  • Since beginning of last month:
    • Site-wide: 37,156

Altmetric data

Downloads over time

Distribution of downloads per paper, site-wide