Sparse Project VCF: efficient encoding of population genotype matrices

By Michael F Lin, Xiaodong Bai, William J Salerno, Jeffrey G. Reid

Posted 17 Apr 2019
bioRxiv DOI: 10.1101/611954

Summary

Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10X size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access.

Availability and Implementation

Freely available at github.com/mlin/spVCF

Contact

dna{at}mlin.net

Competing Interest Statement

The authors have declared no competing interest.

Download data

  • Downloaded 1,603 times
  • Download rankings, all-time:
    • Site-wide: 18,291
    • In bioinformatics: 1,980
  • Year to date:
    • Site-wide: 70,950
  • Since beginning of last month:
    • Site-wide: 34,573
