Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 73,772 bioRxiv papers from 321,037 authors.
Population stratification is a well-documented confounder in GWASes, and is often addressed by including principal component (PC) covariates computed from common SNPs (SNP-PCs). In our analyses of summary statistics from 36 GWASes (mean n=88k), including 20 GWASes using 23andMe data that included SNP-PC covariates, we observed a significantly inflated LD score regression (LDSC) intercept for several traits−suggesting that residual stratification remains a concern, even when SNP-PC covariates are included. Here we propose a new method, PC loading regression, to correct for stratification in summary statistics by leveraging SNP loadings for PCs computed in a large reference panel. In addition to SNP-PCs, the method can be applied to haploSNP-PCs, i.e. PCs computed from a larger number of rare haplotype variants that better capture subtle structure. Using simulations based on real genotypes from 54,000 individuals of diverse European ancestry from the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort, we show that PC loading regression effectively corrects for stratification along top PCs. We applied PC loading regression to several traits with inflated LDSC intercepts. Correcting for the top four SNP-PCs in GERA data, we observe a significant reduction in LDSC intercept height summary statistics from the Genetic Investigation of ANthropometric Traits (GIANT) consortium, but not for 23andMe summary statistics, which already included SNP-PC covariates. However, when correcting for additional haploSNP-PCs in 23andMe GWASes, inflation in the LDSC intercept was eliminated for eye color, hair color, and skin color and substantially reduced for height (1.41 to 1.16; n=430k). Correcting for haploSNP-PCs in GIANT height summary statistics eliminated inflation in the LDSC intercept (from 1.35 to 1.00; n=250k), eliminating 27 significant association signals including one at the LCT locus, which is highly differentiated among European populations and widely known to produce spurious signals. Overall, our results suggest that uncorrected population stratification is a concern in GWASes of large sample size and that PC loading regression can correct for this stratification.
- Downloaded 827 times
- Download rankings, all-time:
- Site-wide: 11,860 out of 73,782
- In genetics: 836 out of 4,046
- Year to date:
- Site-wide: 33,590 out of 73,782
- Since beginning of last month:
- Site-wide: 33,590 out of 73,782
Downloads over time
Distribution of downloads per paper, site-wide
- 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!