Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects

By Allison A Regier, Yossi Farjoun, David Larson, Olga Krasheninina, Hyun Min Kang, Daniel P Howrigan, Bo-Juen Chen, Manisha Kher, Eric Banks, Darren C Ames, Adam C English, Heng Li, Jinchuan Xing, Yeting Zhang, Tara Matise, the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program, Goncalo R. Abecasis, Will Salerno, Michael C. Zody, Benjamin M Neale, Ira M Hall

Posted 22 Feb 2018
bioRxiv DOI: 10.1101/269316 (published DOI: 10.1038/s41467-018-06159-4)

Hundreds of thousands of human whole genome sequencing (WGS) datasets will be generated over the next few years to interrogate a broad range of traits, across diverse populations. These data are more valuable in aggregate: joint analysis of genomes from many sources increases sample size and statistical power for trait mapping, and will enable studies of genome biology, population genetics and genome function at unprecedented scale. A central challenge for joint analysis is that different WGS data processing and analysis pipelines cause substantial batch effects in combined datasets, necessitating computationally expensive reprocessing and harmonization prior to variant calling. This approach is no longer tenable given the scale of current studies and data volumes. Here, in a collaboration across multiple genome centers and NIH programs, we define WGS data processing standards that allow different groups to produce "functionally equivalent" (FE) results suitable for joint variant calling with minimal batch effects. Our approach promotes broad harmonization of upstream data processing steps, while allowing for diverse variant callers. Importantly, it allows each group to continue innovating on data processing pipelines, as long as results remain compatible. We present initial FE pipelines developed at five genome centers and show that they yield similar variant calling results — including single nucleotide (SNV), insertion/deletion (indel) and structural variation (SV) — and produce significantly less variability than sequencing replicates. Residual inter-pipeline variability is concentrated at low quality sites and repetitive genomic regions prone to stochastic effects. This work alleviates a key technical bottleneck for genome aggregation and helps lay the foundation for broad data sharing and community-wide "big-data" human genetics studies.

Download data

  • Downloaded 2,490 times
  • Download rankings, all-time:
    • Site-wide: 7,492
    • In bioinformatics: 777
  • Year to date:
    • Site-wide: None
  • Since beginning of last month:
    • Site-wide: 67,266
