A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data
Rae Woong Park,
Anthony G. Sena,
Marc A Suchard,
Seng Chan You,
Jenna M. Reps
Posted 26 Mar 2021
medRxiv DOI: 10.1101/2021.03.23.21254098
Posted 26 Mar 2021
Background and Objective: As a response to the ongoing COVID-19 pandemic, several prediction models have been rapidly developed, with the aim of providing evidence-based guidance. However, no COVID-19 prediction model in the existing literature has been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation and publicly providing all analytical source code). Methods: We show step-by-step how to implement the pipeline for the question: In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization. We develop models using six different machine learning methods in a US claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the US. Results: Our open-source tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized for COVID-19 adaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated. Conclusion: Our results show that following the OHDSI analytics pipeline for patient-level prediction can enable the rapid development towards reliable prediction models. The OHDSI tools and pipeline are open source and available to researchers around the world.
- Downloaded 161 times
- Download rankings, all-time:
- Site-wide: 128,631
- In health informatics: 524
- Year to date:
- Site-wide: 34,876
- Since beginning of last month:
- Site-wide: 34,654
Downloads over time
Distribution of downloads per paper, site-wide
- 27 Nov 2020: The website and API now include results pulled from medRxiv as well as bioRxiv.
- 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!