Hardy-Weinberg Exact Test in Large Scale Variant Calling Quality Control

By Zhuoyi Huang, Navin Rustagi, Degui Zhi, L. Adrienne Cupples, Richard Gibbs, Eric Boerwinkle, Fuli Yu

Posted 19 Dec 2016
bioRxiv DOI: 10.1101/095521

The Hardy-Weinberg equilibrium (HWE) test is widely used as a quality control measure to detect sequencing artifacts such as mismapping, allelic dropout, and biases. However, in the high-throughput sequencing era, where sample sizes exceed a thousand, the utility of the HWE test in reducing the false positive rate remains unclear. In this paper, we demonstrate that the HWE test has limited power to identify sequencing artifacts when the variant allele frequency is lower than 1%, using a variant call set produced from more than five thousand whole-genome-sequenced samples from two homogeneous populations. We develop a novel strategy for HWE filtering that incorporates site frequency spectrum information to determine the p-value cutoff that optimizes the tradeoff between sensitivity and specificity. This strategy is shown to outperform the exact test of HWE with an empirical constant p-value cutoff, regardless of the sequencing sample size. We also present best-practice recommendations for identifying possible sources of false positives in large sequencing datasets, based on an analysis of intrinsic biases in the variant calling process. Our strategy of determining the HWE test p-value cutoff and applying the test to common variants provides a practical approach to variant-level quality control in upcoming sequencing projects with tens to hundreds of thousands of samples.
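
For readers unfamiliar with the test itself, the following is a minimal Python sketch of a standard exact test of HWE at a biallelic site, combined with a constant p-value cutoff applied only to common variants. It is an illustration under stated assumptions, not the authors' implementation: the function names, the cutoff of 1e-6, and the 1% minor allele frequency threshold are hypothetical placeholders, and the site-frequency-spectrum-based choice of cutoff described in the abstract is not reproduced here.

```python
from math import exp, lgamma, log


def hwe_exact_pvalue(n_aa, n_ab, n_bb):
    """Two-sided exact HWE p-value from genotype counts at a biallelic site.

    Sums the probabilities of all heterozygote counts that are no more
    likely than the observed one, conditional on the allele counts.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    n_a = 2 * n_aa + n_ab  # copies of allele A
    n_b = 2 * n_bb + n_ab  # copies of allele B

    def log_prob(het):
        # log P(het | n, n_a) under HWE, via log-gamma to avoid overflow
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2)
                + lgamma(n + 1) - lgamma(hom_a + 1)
                - lgamma(het + 1) - lgamma(hom_b + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    observed = exp(log_prob(n_ab))
    # Heterozygote counts must share the parity of the allele count.
    probs = (exp(log_prob(h)) for h in range(n_a % 2, min(n_a, n_b) + 1, 2))
    # Small relative tolerance guards against floating-point ties.
    p = sum(q for q in probs if q <= observed * (1 + 1e-7))
    return min(1.0, p)


def passes_hwe_filter(n_aa, n_ab, n_bb, p_cutoff=1e-6, min_maf=0.01):
    """Hypothetical filter: test only common variants against a fixed cutoff.

    p_cutoff and min_maf are illustrative assumptions, not values from the paper.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return True
    maf = min(2 * n_aa + n_ab, 2 * n_bb + n_ab) / (2 * n)
    if maf < min_maf:
        return True  # test not applied to rare variants
    return hwe_exact_pvalue(n_aa, n_ab, n_bb) >= p_cutoff
```

In this sketch a site with genotype counts (25, 50, 25) is consistent with HWE and is kept, whereas a common site with no heterozygotes, such as (50, 0, 50), is rejected. Rare variants bypass the test entirely, reflecting the abstract's observation that the test has limited power below 1% allele frequency.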

