
Gene pathogenicity prediction of Mendelian diseases via the Random Forest algorithm

By Sijie He, Weiwei Chen, Hankui Liu, Shengting Li, Dongzhu Lei, Xiao Dang, Yulan Chen, Xiuqing Zhang, Jianguo Zhang

Posted 18 Feb 2019
bioRxiv DOI: 10.1101/553362 (published DOI: 10.1007/s00439-019-02021-9)

The study of Mendelian diseases and the identification of their causative genes are of great significance in genetics. Two questions are central: how to evaluate the pathogenicity of a gene, and how many Mendelian disease genes exist in total. Very few studies have addressed either question to date, so we attempt to answer both here. We calculated a gene pathogenicity prediction (GPP) score using a machine learning approach (the random forest algorithm) to evaluate the pathogenicity of genes. When we applied the GPP score to the testing gene set, we obtained an accuracy of 80%, a recall of 93% and an area under the curve (AUC) of 0.87. Based on our results, we estimated that a total of 10,399 protein-coding genes are Mendelian disease genes. Furthermore, we found that the GPP score was positively correlated with the severity of disease. Our results indicate that the GPP score may provide a robust and reliable guideline for predicting the pathogenicity of protein-coding genes. To our knowledge, this is the first attempt to estimate the total number of Mendelian disease genes.
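The three figures quoted for the testing gene set (accuracy, recall and AUC) can be illustrated with a minimal sketch. It is not the authors' pipeline: the labels, scores and the 0.5 cutoff below are made-up toy values, and the helper names are hypothetical. It only shows how such metrics are typically derived from per-gene scores and known labels.

```python
# Toy illustration only: labels, scores and threshold are invented,
# not data or settings from the study.

def accuracy(labels, preds):
    """Fraction of genes whose predicted class matches the true class."""
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)

def recall(labels, preds):
    """Fraction of true disease genes (label 1) that are predicted as 1."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp / (tp + fn)

def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive gene scores higher than a random negative gene (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1 = known disease gene, 0 = non-disease gene (hypothetical test set)
labels = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical pathogenicity scores in [0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

threshold = 0.5  # assumed cutoff for calling a gene pathogenic
preds = [1 if s >= threshold else 0 for s in scores]

print(accuracy(labels, preds), recall(labels, preds), auc(labels, scores))
```

Accuracy and recall depend on the chosen cutoff, whereas AUC is computed from the ranking of the scores alone, which is why the three numbers are usually reported together.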

Download data

  • Downloaded 387 times
  • Download rankings, all-time:
    • Site-wide: 95,184
    • In genetics: 3,951
  • Year to date:
    • Site-wide: 90,823
  • Since beginning of last month:
    • Site-wide: 35,021
