The variant call format provides efficient and robust storage of GWAS summary statistics

By Matthew Lyon, Shea J Andrews, Benjamin Elsworth, Tom R Gaunt, Gib Hemani, Edoardo Marcora

Posted 30 May 2020
bioRxiv DOI: 10.1101/2020.05.29.115824

Genome-wide association study (GWAS) summary statistics are a fundamental resource for a variety of research applications [1-6]. Yet despite their widespread utility, no common storage format has been widely adopted, which hinders tool development as well as data sharing, analysis and integration. Existing tabular formats [7,8] often store information about genetic variants and their associations ambiguously or incompletely, and they lack essential metadata, increasing the possibility of errors in data interpretation and post-GWAS analyses. Additionally, data in these formats are typically not indexed, so the whole file must be read to answer a query, which is computationally inefficient. To address these issues, we propose an adaptation of the variant call format [9] (GWAS-VCF) and have produced a suite of open-source tools for using this format in downstream analyses. Simulation studies show that GWAS-VCF is 9-46x faster than tabular alternatives when extracting variant(s) by genomic position. Our results demonstrate that GWAS-VCF provides a robust and performant solution for sharing, analysis and integration of GWAS data. We provide open access to over 10,000 complete GWAS summary datasets converted to this format (available from: https://gwas.mrcieu.ac.uk).

### Competing Interest Statement

TRG receives funding from GlaxoSmithKline and Biogen for unrelated research.
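The abstract's point about indexing can be illustrated with a minimal sketch. It contrasts a full scan of unindexed rows with a binary search over position-sorted keys. This is a toy in-memory illustration only: the records, field layout, and the function names `scan_region` and `indexed_region` are assumptions made for the example, and it is neither the authors' tooling nor the on-disk index used with real VCF files.

```python
import bisect

# Toy summary statistics (hypothetical values): (chromosome, position, effect size),
# sorted by chromosome and then by position.
records = [
    ("1", 1000, 0.02),
    ("1", 2500, -0.01),
    ("1", 4000, 0.05),
    ("2", 1500, 0.03),
    ("2", 3000, -0.04),
]


def scan_region(rows, chrom, start, end):
    """Unindexed lookup: inspect every row, as with a plain tabular file."""
    return [r for r in rows if r[0] == chrom and start <= r[1] <= end]


# The sorted (chromosome, position) keys act as a minimal in-memory index.
keys = [(r[0], r[1]) for r in records]


def indexed_region(rows, keys, chrom, start, end):
    """Indexed lookup: binary search for the region bounds, then slice."""
    lo = bisect.bisect_left(keys, (chrom, start))
    hi = bisect.bisect_right(keys, (chrom, end))
    return rows[lo:hi]


region = ("1", 2000, 4000)
# Both strategies return the same variants; only the work done differs.
assert scan_region(records, *region) == indexed_region(records, keys, *region)
print(indexed_region(records, keys, *region))
```

The scan touches every row regardless of the query, whereas the indexed lookup needs only a logarithmic number of comparisons to find the region bounds. The same principle, applied to a file on disk, is what lets a position query avoid reading the whole file.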

Download data

  • Downloaded 819 times
  • Download rankings, all-time:
    • Site-wide: 34,980
    • In genetics: 1,573
  • Year to date:
    • Site-wide: 26,781
  • Since beginning of last month:
    • Site-wide: 33,384
