
Fast and simple comparison of semi-structured data, with emphasis on electronic health records

By Max Robinson, Jennifer Hadlock, Jiyang Yu, Alireza Khatamian, Aleksandr Y Aravkin, Eric W. Deutsch, Nathan D Price, Sui Huang, Gustavo Glusman

Posted 02 Apr 2018
bioRxiv DOI: 10.1101/293183

We present a locality-sensitive hashing strategy for summarizing semi-structured data (e.g., in JSON or XML formats) into 'data fingerprints': highly compressed representations that cannot be used to reconstruct details of the data, yet simplify and greatly accelerate the comparison and clustering of semi-structured data by preserving similarity relationships. Computation on data fingerprints is fast: in one example involving complex simulated medical records, the average time to encode one record was 0.53 seconds, and the average pairwise comparison time was 3.75 microseconds. Both processes are trivially parallelizable. Applications include duplicate detection, clustering, and classification of semi-structured data, which support larger goals including summarizing large and complex data sets, quality assessment, and data mining. We illustrate use cases with three analyses of electronic health records (EHRs): (1) pairwise comparison of patient records, (2) analysis of cohort structure, and (3) evaluation of methods for generating simulated patient data.
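To make the idea of a 'data fingerprint' concrete, the following is a minimal sketch and not the authors' algorithm, whose details are not given in the abstract. It assumes a simple stand-in technique, signed feature hashing: each leaf of a JSON-like record is hashed into one slot of a fixed-length vector, and two records are compared by the cosine similarity of their vectors. The function names (`flatten`, `fingerprint`, `cosine`) and the fingerprint length of 64 are hypothetical choices made for illustration only.

```python
import hashlib
import math


def flatten(obj, path=""):
    """Yield (path, value) pairs for every leaf of a nested JSON-like object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{path}/{key}")
    elif isinstance(obj, list):
        for item in obj:
            # Assumption: list positions are ignored, so item order does not matter.
            yield from flatten(item, path)
    else:
        yield path, str(obj)


def fingerprint(record, length=64):
    """Summarize a record as a fixed-length vector (hypothetical scheme).

    Each path=value leaf is hashed; the hash selects one slot and a sign.
    The original values cannot be read back from the resulting vector.
    """
    vec = [0.0] * length
    for path, value in flatten(record):
        digest = hashlib.sha256(f"{path}={value}".encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % length
        sign = 1.0 if digest[4] & 1 else -1.0
        vec[index] += sign
    return vec


def cosine(a, b):
    """Cosine similarity of two fingerprints; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two made-up records that share most of their content.
record_a = {"sex": "F", "conditions": ["asthma", "hypertension"], "visits": {"count": 3}}
record_b = {"sex": "F", "conditions": ["asthma"], "visits": {"count": 3}}
similarity = cosine(fingerprint(record_a), fingerprint(record_b))
```

Because each fingerprint has a fixed length, a pairwise comparison costs the same regardless of how large the original records are, and both encoding and comparison can be run independently per record or per pair, which is consistent with the abstract's remark that both steps parallelize trivially.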

Download data

  • Downloaded 1,342 times
  • Download rankings, all-time:
    • Site-wide: 17,008
    • In bioinformatics: 1,944
  • Year to date:
    • Site-wide: 36,669
  • Since beginning of last month:
    • Site-wide: 39,787
