Using Topic Modeling via Non-negative Matrix Factorization to Identify Relationships between Genetic Variants and Disease Phenotypes: A Case Study of Lipoprotein(a) (LPA)

By Juan Zhao, Qiping Feng, Patrick Wu, Jeremy L Warner, Joshua C Denny, Wei-Qi Wei

Posted 31 May 2018
bioRxiv DOI: 10.1101/335745 (published DOI: 10.1371/journal.pone.0212112)

Genome-wide and phenome-wide association studies are commonly used to identify important relationships between genetic variants and phenotypes. Most of these studies treat diseases as independent variables and incur a heavy multiple-testing correction burden because of the large number of genetic variants and disease phenotypes. In this study, we propose using topic modeling via non-negative matrix factorization (NMF) to identify associations between disease phenotypes and genetic variants. Topic modeling is an unsupervised machine learning approach that can learn semantic patterns from electronic health record data. We chose rs10455872 in LPA as the predictor because it has been shown to be associated with increased risk of hyperlipidemia and cardiovascular diseases (CVD). Using data from 12,759 individuals in the biobank at Vanderbilt University Medical Center, we trained an NMF topic model on 1,853 distinct phecodes extracted from the cohort's electronic health records and generated six topics. We then quantified the association of each topic with rs10455872 in LPA. Topics indicating CVD were positively correlated with rs10455872 (P < 0.001), replicating a previous finding. We also identified a negative correlation between LPA and a topic representing lung cancer (P < 0.001). Our results demonstrate the applicability of topic modeling in exploring the relationship between the genome and clinical diseases.
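The pipeline outlined in the abstract (factor a patient-by-phecode matrix into topics, then relate each patient's topic weights to a genotype) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the sizes are hypothetical, NMF is done with plain NumPy multiplicative updates rather than whatever library the study used, and Pearson correlation is an assumed stand-in for the study's association measure, which the abstract does not specify.

```python
import numpy as np

def nmf(X, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (patients x phecodes) as W @ H.

    Uses multiplicative updates that minimise the Frobenius error.
    W holds per-patient topic weights, H holds per-topic phecode loadings.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    W = rng.random((n_rows, n_topics)) + eps
    H = rng.random((n_topics, n_cols)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic stand-in data (assumption: nothing here comes from the study).
rng = np.random.default_rng(42)
n_patients, n_phecodes, n_topics = 200, 30, 3

# Hypothetical genotype coding: 0, 1, or 2 copies of the risk allele.
genotype = rng.integers(0, 3, size=n_patients)

# Phecode count matrix, with extra counts in the first five phecodes
# for allele carriers so that one topic has something to pick up.
X = rng.poisson(0.3, size=(n_patients, n_phecodes)).astype(float)
X[:, :5] += genotype[:, None] * rng.poisson(1.0, size=(n_patients, 5))

W, H = nmf(X, n_topics)

# Pearson correlation between each topic's patient weights and genotype.
corrs = [np.corrcoef(W[:, k], genotype)[0, 1] for k in range(n_topics)]
for k, r in enumerate(corrs):
    top_codes = np.argsort(H[k])[::-1][:5]  # highest-loading phecode columns
    print(f"topic {k}: r = {r:+.2f}, top phecode columns = {top_codes.tolist()}")
```

Testing one genotype against a handful of topics, instead of against each of the 1,853 phecodes separately, is what reduces the number of comparisons that need correction.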

Download data

  • Downloaded 482 times
  • Download rankings, all-time:
    • Site-wide: 65,100
    • In genetics: 2,863
  • Year to date:
    • Site-wide: 83,716
  • Since beginning of last month:
    • Site-wide: 118,480
