Rxivist logo

Polygenic risk scores (PRS) are commonly used to quantify the inherited susceptibility for a given trait. However, the standard PRS fail to account for non-linear and interaction effects between single nucleotide polymorphisms (SNPs). Machine learning algorithms can be used to account for such non-linearities and interactions. We trained and validated polygenic prediction models for five complex phenotypes in a multi-ancestry population: total cholesterol, triglycerides, systolic blood pressure, sleep duration, and height. We used an ensemble method of LASSO for feature selection and gradient boosted trees (XGBoost) for non-linearities and interaction effects. In an independent test set, we found that combining a standard PRS as a feature in the XGBoost model increases the percentage variance explained (PVE) of the prediction model compared to the standard PRS by 25% for sleep duration, 26% for height, 44% for systolic blood pressure, 64% for triglycerides, and 85% for total cholesterol. Machine learning models trained in specific racial/ethnic groups performed similarly in multi-ancestry trained models, despite smaller sample sizes. The predictions of the machine learning models were superior to the standard PRS in each of the racial/ethnic groups in our study. However, among Blacks the PVE was substantially lower than for other groups. For example, the PVE for total cholesterol was 8.1%, 12.9%, and 17.4% for Blacks, Whites, and Hispanics/Latinos, respectively. This work demonstrates an effective method to account for non-linearities and interaction effects in genetics-based prediction models.

Download data

  • Downloaded 881 times
  • Download rankings, all-time:
    • Site-wide: 46,570
    • In genetic and genomic medicine: 285
  • Year to date:
    • Site-wide: 26,102
  • Since beginning of last month:
    • Site-wide: 94,781

Altmetric data

Downloads over time

Distribution of downloads per paper, site-wide