Maximizing the Reusability of Gene Expression Data by Predicting Missing Metadata

By Pei-Yau Lung, Xiaodong Pang, Yan Li, Jinfeng Zhang

Posted 03 Oct 2019
bioRxiv DOI: 10.1101/792382 (published DOI: 10.1371/journal.pcbi.1007450)

Reusability is one of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One current effort to increase the reusability of public genomics data focuses on including high-quality metadata with the data. When necessary metadata are missing, most researchers consider the data useless. In this study, we develop a framework that predicts the missing metadata of gene expression datasets to maximize their reusability. We propose a new metric, Proportion of Cases Accurately Predicted (PCAP), which our specifically designed machine learning pipeline optimizes. The new approach maximized the reusability of data with missing values better than pipelines that optimize commonly used metrics such as the F1-score. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we show that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.
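
The abstract names PCAP but does not define it, so the following is only an illustrative sketch under an assumed reading: PCAP is taken to be the largest fraction of cases, ranked by the classifier's confidence, that can be kept while the accuracy among the kept cases stays at or above a target. The function name `pcap`, its parameters, the 0.95 default, and the example values are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def pcap(confidences: Sequence[float],
         correct: Sequence[bool],
         target_accuracy: float = 0.95) -> float:
    """Assumed PCAP: the largest fraction of cases, kept by a confidence
    cutoff, whose accuracy is at least `target_accuracy`.

    This is an illustrative reading of the metric's name, not the
    authors' published definition.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Rank cases from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)

    best = 0  # size of the largest qualifying set found so far
    hits = 0  # correct predictions among the top-k cases
    for k, i in enumerate(order, start=1):
        hits += bool(correct[i])
        # A confidence cutoff cannot separate tied cases, so only
        # evaluate at positions where the next case is less confident.
        at_boundary = k == n or confidences[order[k]] < confidences[i]
        if at_boundary and hits / k >= target_accuracy:
            best = k
    return best / n


# Hypothetical predictions for five samples with a missing variable.
example_conf = [0.99, 0.95, 0.90, 0.80, 0.60]
example_correct = [True, True, True, False, True]
print(pcap(example_conf, example_correct, target_accuracy=0.95))
```

Under this reading, a pipeline tuned for PCAP is rewarded for the share of samples it can label reliably, whereas the F1-score averages performance over all samples regardless of confidence.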

Download data

  • Downloaded 270 times
  • Download rankings, all-time:
    • Site-wide: 124,500
    • In bioinformatics: 10,008
  • Year to date:
    • Site-wide: 154,850
  • Since beginning of last month:
    • Site-wide: 158,701
