Use of unstructured text in prognostic clinical prediction models: a systematic review

By Tom M Seinen, Egill Fridgeirsson, Solomon Ioannou, Daniel Jeannetot, Luis H. John, Jan A. Kors, Aniek F. Markus, Victor Pera, Alexandros Rekkas, Ross D. Williams, Erik van Mulligen, Peter R. Rijnbeek

Posted 18 Jan 2022
medRxiv DOI: 10.1101/2022.01.17.22269400

Objective: This systematic review assesses how information from unstructured clinical text is used to develop and validate prognostic risk prediction models. We summarize the prediction problems and the methodological landscape, and assess whether using unstructured clinical text data in addition to the more commonly used structured data improves prediction performance.

Materials and Methods: We searched Embase, MEDLINE, Web of Science, and Google Scholar to identify studies, published from January 2005 to March 2021, that developed prognostic risk prediction models using unstructured clinical text data. Data items were extracted and analyzed, and a meta-analysis of model performance was carried out to assess the added value of text to structured-data models.

Results: We identified 126 studies that described 145 clinical prediction problems. Combining text and structured data improved model performance compared with using only text or only structured data. In these studies, a wide variety of dense and sparse numeric text representations were combined with both deep learning and more traditional machine learning methods. External validation, public availability, and explainability of the developed models were limited.

Conclusion: Overall, most studies found that using unstructured clinical text data in addition to structured data was beneficial when developing prognostic prediction models. Electronic health record (EHR) text data are a valuable source of information for prediction model development and should not be neglected. We suggest a future focus on explainability and external validation of the developed models to promote robust and trustworthy prediction models in clinical practice.

Download data

  • Downloaded 458 times
  • Download rankings, all-time:
    • Site-wide: 110,096
    • In health informatics: 501
  • Year to date:
    • Site-wide: 12,383
  • Since beginning of last month:
    • Site-wide: 52,122
