Machine learning identifies SNPs predictive of advanced coronary artery calcium in ClinSeq® and Framingham Heart Study cohorts
One goal of personalized medicine is leveraging the emerging tools of data science to guide medical decision-making. Achieving this using disparate data sources is most daunting for polygenic traits and requires systems-level approaches. To this end, we employed random forests (RF) and neural networks (NN) for predictive modeling of coronary artery calcification (CAC), which is an intermediate endo-phenotype of coronary artery disease (CAD). Model inputs were derived from advanced cases in the ClinSeq® discovery cohort (n=16) and the FHS replication cohort (n=36), drawn from the 89th-99th CAC score percentile range, and from age-matched controls (ClinSeq® n=16, FHS n=36) with no detectable CAC (all subjects were Caucasian males). These inputs included clinical variables (CLIN), genotypes of 57 SNPs associated with CAC in past GWAS (SNP Set-1), and an alternative set of 56 SNPs (SNP Set-2) ranked highest in terms of their nominal correlation with advanced CAC state in the discovery cohort. Predictive performance was assessed by computing the areas under receiver operating characteristic curves (AUC). Within the discovery cohort, RF models generated AUC values of 0.69 with CLIN, 0.72 with SNP Set-1, and 0.77 with their combination. In the replication cohort, SNP Set-1 was again more predictive (AUC=0.78) than CLIN (AUC=0.61), but also more predictive than the combination (AUC=0.75). In contrast, in both cohorts, SNP Set-2 generated enhanced predictive performance with or without CLIN (AUC>0.8). Using the 21 SNPs of SNP Set-2 that produced optimal predictive performance in both cohorts, we developed NN models trained with ClinSeq® data and tested with FHS data, and replicated the high predictive accuracy (AUC>0.8) with several topologies, thereby identifying several potential susceptibility loci for advanced CAD. Several CAD-related biological processes were found to be enriched in the network of genes constructed from these loci.
In both cohorts, SNP Set-1 derived from past CAC GWAS yielded lower performance than SNP Set-2 derived from "extreme" CAC cases within the discovery cohort. Machine learning tools hold promise for surpassing the capacity of conventional GWAS-based approaches for creating predictive models utilizing the complex interactions between disease predictors intrinsic to the pathogenesis of polygenic disorders.
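Because every comparison above is reported as an AUC, it may help to see how that number can be computed from a model's predicted scores. The following is a minimal sketch, not the authors' pipeline: it uses the pairwise (Mann-Whitney) formulation, in which AUC is the fraction of case/control pairs where the case receives the higher score. The function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Return the AUC as the probability that a random case outscores a random control.

    Tied scores count as half a win, which matches the trapezoidal area
    under the ROC curve.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical toy data (not from the study):
# 1 = advanced CAC case, 0 = control with no detectable CAC.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.55, 0.4, 0.6, 0.3, 0.2, 0.1]

toy_auc = roc_auc(toy_labels, toy_scores)
print(f"toy AUC = {toy_auc:.4f}")  # 14 of 16 case/control pairs are ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls. With cohorts as small as those described here (16 or 36 subjects per group), each case/control pair carries visible weight in the estimate, so AUC differences of a few hundredths should be read with caution.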