Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 62,198 bioRxiv papers from 276,130 authors.

Correcting subtle stratification in summary association statistics

By Gaurav Bhatia, Nicholas A. Furlotte, Po-Ru Loh, Xuanyao Liu, Hilary K Finucane, Alexander Gusev, Alkes L. Price

Posted 19 Sep 2016
bioRxiv DOI: 10.1101/076133

Population stratification is a well-documented confounder in GWASes, and is often addressed by including principal component (PC) covariates computed from common SNPs (SNP-PCs). In our analyses of summary statistics from 36 GWASes (mean n=88k), including 20 GWASes using 23andMe data that included SNP-PC covariates, we observed a significantly inflated LD score regression (LDSC) intercept for several traits, suggesting that residual stratification remains a concern even when SNP-PC covariates are included. Here we propose a new method, PC loading regression, to correct for stratification in summary statistics by leveraging SNP loadings for PCs computed in a large reference panel. In addition to SNP-PCs, the method can be applied to haploSNP-PCs, i.e. PCs computed from a larger number of rare haplotype variants that better capture subtle structure. Using simulations based on real genotypes from 54,000 individuals of diverse European ancestry from the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort, we show that PC loading regression effectively corrects for stratification along top PCs. We applied PC loading regression to several traits with inflated LDSC intercepts. Correcting for the top four SNP-PCs in GERA data, we observe a significant reduction in the LDSC intercept for height summary statistics from the Genetic Investigation of ANthropometric Traits (GIANT) consortium, but not for 23andMe summary statistics, which already included SNP-PC covariates. However, when correcting for additional haploSNP-PCs in 23andMe GWASes, inflation in the LDSC intercept was eliminated for eye color, hair color, and skin color and substantially reduced for height (1.41 to 1.16; n=430k). Correcting for haploSNP-PCs in GIANT height summary statistics eliminated inflation in the LDSC intercept (from 1.35 to 1.00; n=250k), eliminating 27 significant association signals including one at the LCT locus, which is highly differentiated among European populations and widely known to produce spurious signals. Overall, our results suggest that uncorrected population stratification is a concern in GWASes of large sample size and that PC loading regression can correct for this stratification.
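
The abstract does not spell out the estimator, so the following is only a minimal sketch of the general idea, assuming that the correction amounts to an ordinary least-squares regression of per-SNP association z-scores on per-SNP PC loadings from a reference panel, followed by removal of the fitted component. The function names (`pc_loading_regression`, `simulate`), the absence of an intercept, and the simulated effect sizes are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def pc_loading_regression(z, loadings):
    """Hypothetical sketch: regress z-scores on PC loadings, return residuals.

    z        : array of shape (n_snps,), summary association z-scores
    loadings : array of shape (n_snps, n_pcs), SNP loadings for each PC
    Returns (corrected z-scores, fitted coefficient per PC).
    """
    z = np.asarray(z, dtype=float)
    V = np.asarray(loadings, dtype=float)
    if V.ndim == 1:
        V = V[:, None]  # treat a single PC as a one-column matrix
    # Ordinary least squares without an intercept (an assumption of this sketch).
    coef, *_ = np.linalg.lstsq(V, z, rcond=None)
    # Subtract the part of the signal that lies along the PC loadings.
    corrected = z - V @ coef
    return corrected, coef


def simulate(seed=0, n_snps=5000, n_pcs=4):
    """Toy data: null z-scores plus a stratification term along the first two PCs."""
    rng = np.random.default_rng(seed)
    loadings = rng.normal(size=(n_snps, n_pcs))
    noise = rng.normal(size=n_snps)
    # Illustrative stratification strengths; the last two PCs carry none.
    strat = loadings @ np.array([0.8, 0.5, 0.0, 0.0])
    return noise + strat, loadings, noise


if __name__ == "__main__":
    z, loadings, _ = simulate()
    corrected, coef = pc_loading_regression(z, loadings)
    print("mean chi-square before:", np.mean(z ** 2))
    print("mean chi-square after: ", np.mean(corrected ** 2))
    print("fitted PC coefficients:", coef)
```

In this toy setting the mean chi-square statistic drops toward its null value after correction, which loosely mirrors the reduction in inflation described above. Real summary statistics would also contain LD and true polygenic signal, which this sketch ignores.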

Download data

  • Downloaded 765 times
  • Download rankings, all-time:
    • Site-wide: 10,545 out of 62,198
    • In genetics: 762 out of 3,537
  • Year to date:
    • Site-wide: 30,614 out of 62,198
  • Since beginning of last month:
    • Site-wide: 25,727 out of 62,198
