Reproducible big data science: A case study in continuous FAIRness

By Ravi Madduri, Kyle Chard, Mike D’Arcy, Segun C Jung, Alexis Rodriguez, Dinanath Sulakhe, Eric W. Deutsch, Cory Funk, Ben Heavner, Matthew Richards, Paul Shannon, Gustavo Glusman, Nathan Price, Carl Kesselman, Ian Foster

Posted 27 Feb 2018
bioRxiv DOI: 10.1101/268755 (published DOI: 10.1371/journal.pone.0213013)

Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We compare and contrast our approach with other approaches to big data analysis and reproducibility.

Download data

  • Downloaded 1,968 times
  • Download rankings, all-time:
    • Site-wide: 10,668
    • In bioinformatics: 1,186
  • Year to date:
    • Site-wide: 32,901
  • Since beginning of last month:
    • Site-wide: 96,083
