MACE2K: A Text-Mining Tool to Extract Literature-based Evidence for Variant Interpretation using Machine Learning

By Samir Gupta, Shruti Rao, Trisha Miglani, Yasaswini Iyer, Junxia Lin, Ahson M. Saiyed, Ifeoma Ikwuemesi, Shannon McNulty, Courtney Thaxton, Subha Madhavan

Posted 04 Dec 2020
bioRxiv DOI: 10.1101/2020.12.03.409094

Interpreting a given variant's pathogenicity is one of the most profound challenges to realizing the promise of genomic medicine. Much of the information on variant-disease associations that curators and researchers use to interpret variant pathogenicity is buried in the biomedical literature. Text-mining tools that extract relevant information from the literature can speed up and assist the variant interpretation curation process. In this work, we present a text-mining tool, MACE2K, that extracts evidence sentences containing associations between variants and diseases from full-length PMC Open Access articles. We use different machine learning models (classical and deep learning) to identify evidence sentences with variant-disease associations. Evaluation shows promising results, with a best F1-score of 82.9% and AUC-ROC of 73.9%. Classical ML models had better recall (96.6% for Random Forest) than deep learning models. The deep learning model, a Convolutional Neural Network, had the best precision (75.6%), which is essential for any curation task.
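For readers less familiar with the metrics quoted in the abstract, the following is a minimal sketch of how precision, recall, and F1 are computed for a binary evidence-sentence classifier. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the paper's evaluation.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: sentences correctly flagged as evidence
    fp: sentences flagged as evidence that are not
    fn: evidence sentences the classifier missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts (NOT from the paper): 90 true positives,
# 30 false positives, 10 false negatives.
precision, recall, f1 = precision_recall_f1(tp=90, fp=30, fn=10)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A high-recall model surfaces most evidence sentences but leaves curators more false positives to discard, whereas a high-precision model returns fewer irrelevant sentences at the cost of missing some; F1 summarizes the balance between the two.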

Download data

  • Downloaded 232 times
  • Download rankings, all-time:
    • Site-wide: 138,710
    • In bioinformatics: 10,905
  • Year to date:
    • Site-wide: 91,651
  • Since beginning of last month:
    • Site-wide: 91,767