Cohort Design and Natural Language Processing to Reduce Bias in Electronic Health Records Research: The Community Care Cohort Project
Lia X Harrington,
Samuel F Friedman,
Paolo Di Achille,
Jonathan W Cunningham,
Ashby C Turner,
Emily S Lau,
Julian S Haimovich,
Mostafa A Al-Alusi,
Marcus D. R. Klarqvist,
Jeffrey M Ashburner,
Hanna M Eilken,
Steven J Atlas,
Patrick T Ellinor,
Anthony A. Philippakis,
Christopher D. Anderson,
Jennifer E. Ho,
Steven A Lubitz
Posted 30 May 2021
medRxiv DOI: 10.1101/2021.05.26.21257872
Background: Electronic health records (EHRs) promise to enable broad-ranging discovery with power exceeding that of conventional research cohort studies. However, research using EHR datasets may be subject to selection bias, which can be compounded by missing data, limiting the generalizability of derived insights.
Methods: Mass General Brigham (MGB) is a large New England-based healthcare network comprising seven tertiary care and community hospitals with associated outpatient practices. Within an MGB-based EHR warehouse of >3.5 million individuals with at least one ambulatory care visit, we approximated a community-based cohort study by selectively sampling individuals who longitudinally attended primary care practices between 2001 and 2018 (n=520,868). We named this sample the Community Care Cohort Project (C3PO). We also used pre-trained deep natural language processing (NLP) models to recover vital signs (i.e., height, weight, and blood pressure) from unstructured notes in the EHR. We assessed the validity of C3PO by deploying established risk models, including the Pooled Cohort Equations (PCE) and the Cohorts for Aging and Genomic Epidemiology Atrial Fibrillation (CHARGE-AF) score, and compared model performance in C3PO to that observed within typical EHR Convenience Samples, which included all individuals from the same parent EHR with sufficient data to calculate each score but without a requirement for longitudinal primary care. All analyses were facilitated by the JEDI Extractive Data Infrastructure pipeline, which we designed to efficiently aggregate EHR data within a unified framework conducive to regular updates.
Results: C3PO includes 520,868 individuals (mean age 48 years, 61% women, median follow-up 7.2 years, median of 13 primary care visits per individual). As estimated from reports, C3PO contains over 2.9 million electrocardiograms, 450,000 echocardiograms, 12,000 cardiac magnetic resonance images, and 75 million narrative notes. Using tabular data alone, 286,009 individuals (54.9%) had all vital signs available at baseline, which increased to 358,411 (68.8%) after NLP recovery (a 31% reduction in missingness). Among individuals with both NLP and tabular data available, NLP-extracted and tabular vital signs obtained on the same day were highly correlated (Pearson r range 0.95-0.99, p<0.01 for all). Both the PCE models (c-index range 0.724-0.770) and CHARGE-AF (c-index 0.782, 95% CI 0.777-0.787) demonstrated good discrimination. Compared with the Convenience Samples, incidence rates of atrial fibrillation (AF) and myocardial infarction (MI)/stroke in C3PO were lower, and calibration error was smaller for both PCE (integrated calibration index range 0.012-0.030 vs. 0.028-0.046) and CHARGE-AF (0.028 vs. 0.036).
Conclusions: Intentional sampling of individuals receiving regular ambulatory care and use of NLP to recover missing data have the potential to reduce bias in EHR research and maximize generalizability of insights.
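The 31% figure reported in the Results is a relative reduction in missingness, not the 13.9 percentage-point gain in completeness. A minimal sketch of that arithmetic follows, using only the counts stated in the abstract; the variable and function names are illustrative and do not come from the authors' pipeline.

```python
# Counts taken from the abstract; names here are illustrative only.
cohort_size = 520_868          # individuals in C3PO
complete_tabular = 286_009     # all vital signs available from tabular data
complete_after_nlp = 358_411   # all vital signs available after NLP recovery


def missing_fraction(complete: int, total: int) -> float:
    """Share of the cohort lacking at least one baseline vital sign."""
    return 1 - complete / total


before = missing_fraction(complete_tabular, cohort_size)
after = missing_fraction(complete_after_nlp, cohort_size)

# Relative reduction: the share of originally missing individuals recovered.
relative_reduction = (before - after) / before

print(f"missing before NLP: {before:.1%}")
print(f"missing after NLP:  {after:.1%}")
print(f"relative reduction: {relative_reduction:.1%}")
```

Under these counts, missingness falls from about 45.1% to about 31.2% of the cohort, which is the roughly 31% relative reduction quoted above.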