Objective: Large-scale next-generation sequencing of population cohorts paired with patients' electronic health records (EHR) provides an excellent resource for the study of gene-disease associations. To validate those associations, researchers often consult databases that identify relationships between genes of interest and relevant disease phenotypes, which we refer to simply as "phenotypes". However, most of these databases contain phenotypes that are not suited for automated analysis of EHR data, which often capture these phenotypes in the form of International Classification of Diseases (ICD) codes. There is a need for a resource that comprehensively provides gene-phenotype mappings in a format that can be used to evaluate phenotypes from EHR.

Methods: We built a directed graph database of genes, medical concepts and ICD codes based on a subset of the National Library of Medicine's Unified Medical Language System (UMLS) and other resources. To obtain associations between genes and ICD codes, we traversed the defined relationships from gene, variant and disease concepts to ICD codes, resulting in a set of mappings that link specific genes and variants to these ICD codes.

Results: Our method created 249,764 mappings between genes and ICD codes, including 27,226 "disease" phenotypes and 222,538 "symptom" phenotypes, and provided mappings for 4,456 unique genes. We validated the mappings by manually reviewing a diverse sample of paths. In a cohort of 92,455 samples, we used these mappings to validate gene-phenotype associations in 32,786 samples where a person had a potentially disease-causing genetic mutation and at least one corresponding diagnosis in their EHR.

Conclusion: The concepts and relationships in the UMLS can be used to generate gene-ICD phenotype mappings that are not explicit in the source vocabularies. We were able to use these mappings to validate gene-disease associations in a large cohort of sequenced exomes paired with EHR.
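The traversal described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the node names, relationship labels, ICD-style codes, the `EDGES` structure, the `gene_to_icd_paths` function and the `max_hops` limit are all hypothetical placeholders, and a plain breadth-first search stands in for whatever query mechanism the graph database actually uses.

```python
from collections import deque

# Toy directed graph (all identifiers are made-up placeholders, not UMLS data).
# Each node maps to a list of (relationship, target) edges.
EDGES = {
    "gene:GENE_A": [
        ("has_variant", "variant:VAR_1"),
        ("associated_with", "concept:DISEASE_X"),
    ],
    "variant:VAR_1": [("causes", "concept:DISEASE_Y")],
    "concept:DISEASE_X": [
        ("mapped_to", "icd:X00.0"),
        ("has_manifestation", "concept:SYMPTOM_Z"),
    ],
    "concept:DISEASE_Y": [("mapped_to", "icd:Y11.1")],
    "concept:SYMPTOM_Z": [("mapped_to", "icd:Z22.2")],
}


def gene_to_icd_paths(graph, gene, max_hops=4):
    """Breadth-first search from a gene node.

    Returns {icd_node: path}, where path is the first (shortest) list of
    nodes found from the gene to that ICD node within max_hops edges.
    """
    found = {}
    seen = {gene}
    queue = deque([[gene]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node.startswith("icd:"):
            # ICD codes are terminal: record the path and stop expanding.
            found[node] = path
            continue
        if len(path) - 1 >= max_hops:
            # Hop budget exhausted for this branch.
            continue
        for _relationship, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(path + [target])
    return found


if __name__ == "__main__":
    for icd, path in sorted(gene_to_icd_paths(EDGES, "gene:GENE_A").items()):
        print(icd, "<-", " -> ".join(path))
```

Keeping the full path for each mapping, rather than only the gene-ICD pair, is what would make a manual review of sampled paths possible in a setup like this one.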
### Competing Interest Statement

All individual authors are full-time employees of the Regeneron Genetics Center and receive stock in Regeneron Pharmaceuticals, Inc. as part of compensation. This research was funded by the Regeneron Genetics Center. No other conflicts are reported.

Download data

  • Downloaded 212 times
  • Download rankings, all-time:
    • Site-wide: 116,971
    • In bioinformatics: 9,581
  • Year to date:
    • Site-wide: 75,756
  • Since beginning of last month:
    • Site-wide: 119,144
