Learning from Longitudinal Data in Electronic Health Record and Genetic Data to Improve Cardiovascular Event Prediction

By Juan Zhao, Qiping Feng, Patrick Wu, Roxana Lupu, Russell A. Wilke, Quinn S Wells, Joshua C Denny, Wei-Qi Wei

Posted 11 Jul 2018
bioRxiv DOI: 10.1101/366682 (published DOI: 10.1038/s41598-018-36745-x)

Background: Current approaches to predicting cardiovascular disease (CVD) rely on conventional risk factors and cross-sectional data. In this study, we asked whether: i) machine learning and deep learning models with longitudinal EHR information can improve the prediction of 10-year CVD risk, and ii) incorporating genetic data adds predictive value. Methods: We conducted two experiments. In the first experiment, we modeled longitudinal EHR data with aggregated features and temporal features. We applied logistic regression (LR), random forests (RF), gradient boosting trees (GBT), convolutional neural networks (CNN), and recurrent neural networks with long short-term memory (LSTM) units. In the second experiment, we proposed a late-fusion framework to incorporate genetic features. Results: Our study cohort included 109,490 individuals (9,824 cases and 99,666 controls) from Vanderbilt University Medical Center (VUMC) de-identified EHRs. The American College of Cardiology and American Heart Association (ACC/AHA) Pooled Cohort Risk Equations had an area under the receiver operating characteristic curve (AUROC) of 0.732 and an area under the precision-recall curve (AUPRC) of 0.187. LSTM, CNN, and GBT with temporal features achieved the best results, with AUROCs of 0.789, 0.790, and 0.791 and AUPRCs of 0.282, 0.280, and 0.285, respectively. The late-fusion approach significantly improved prediction performance. Conclusions: Machine learning and deep learning with longitudinal features improved 10-year CVD risk prediction. Incorporating genetic features further enhanced 10-year CVD prediction performance, underscoring the importance of integrating relevant genetic data whenever available in the context of routine care.
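The two metrics reported in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `auroc` and `average_precision` and the toy labels and scores are assumptions made up for illustration, and average precision is used here as one common step-wise estimate of AUPRC. The last lines use only the cohort counts from the abstract to compute the case prevalence, which is roughly the AUPRC a model with no discriminative ability would be expected to reach.

```python
def auroc(labels, scores):
    """Probability that a random case scores above a random control (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each case (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data, invented for illustration: 2 cases among 10 individuals.
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1, 0.4, 0.35, 0.25, 0.15]

toy_auroc = auroc(toy_labels, toy_scores)
toy_auprc = average_precision(toy_labels, toy_scores)

# Cohort counts taken from the abstract.
cases, controls = 9824, 99666
prevalence = cases / (cases + controls)

print(f"toy AUROC = {toy_auroc:.4f}, toy AUPRC = {toy_auprc:.4f}")
print(f"case prevalence = {prevalence:.3f}")
```

Because cases make up about 9% of the cohort, AUPRC values that look low in absolute terms (0.187 to 0.285) are still well above the prevalence baseline, whereas the AUROC baseline for a random ranking is 0.5 regardless of class balance.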

Download data

  • Downloaded 2,725 times
  • Download rankings, all-time:
    • Site-wide: 5,667
    • In epidemiology: 507
  • Year to date:
    • Site-wide: 6,556
  • Since beginning of last month:
    • Site-wide: 4,113
