
Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 72,924 bioRxiv papers from 317,458 authors.

Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies

By Kieu Trinh Do, Simone Wahl, Johannes Raffler, Sophie Molnos, Michael Laimighofer, Jerzy Adamski, Karsten Suhre, Konstantin Strauch, Annette Peters, Christian Gieger, Claudia Langenberg, Isobel D Stewart, Fabian J Theis, Harald Grallert, Gabi Kastenmüller, Jan Krumsiek

Posted 11 Feb 2018
bioRxiv DOI: 10.1101/260281 (published DOI: 10.1007/s11306-018-1420-2)

BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in epidemiological studies. However, the various sources of missing values and the strategies for handling them have rarely been assessed systematically. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD), or randomly, for instance as a consequence of sample preparation.

METHODS: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and its ability to increase statistical power while preserving the strength of established metabolic quantitative trait loci.

RESULTS: Run day-dependent LOD-based missing data account for most missing values in the metabolomics dataset. Although multiple imputation by chained equations (MICE) performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.

CONCLUSION: Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend KNN-based imputation on observations with variable pre-selection, since it showed robust results in all evaluation schemes.

NOTE: Kieu Trinh Do and Simone Wahl are co-first authors, and Gabi Kastenmueller and Jan Krumsiek are co-last authors.
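To make the recommended strategy concrete, the following is a minimal sketch of KNN imputation on observations with variable pre-selection. It is not the authors' implementation: the function names `pearson` and `knn_impute_obs`, the correlation threshold `min_abs_corr`, the default `k`, the root-mean-square distance, and the unweighted neighbor mean are all illustrative assumptions. The sketch also assumes the data are already transformed and scaled so that distances across metabolites are comparable, and it marks missing values with `None`.

```python
import math


def pearson(x, y):
    """Pearson correlation over pairs where both values are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 3:
        return 0.0  # too few shared observations to estimate a correlation
    mx = sum(a for a, _ in pairs) / len(pairs)
    my = sum(b for _, b in pairs) / len(pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def knn_impute_obs(data, k=3, min_abs_corr=0.2):
    """Impute None entries in `data` (rows = samples, columns = metabolites).

    For each metabolite with missing values:
      1. pre-select the other metabolites whose absolute correlation with it
         exceeds `min_abs_corr` (hypothetical threshold),
      2. find the k nearest samples that have the metabolite observed, using
         only the pre-selected metabolites to compute distances,
      3. fill the gap with the mean of those neighbors' values.
    Returns a new table; the input is left unchanged.
    """
    n_rows, n_cols = len(data), len(data[0])
    cols = [[row[j] for row in data] for j in range(n_cols)]
    imputed = [list(row) for row in data]

    for j in range(n_cols):
        missing = [i for i in range(n_rows) if data[i][j] is None]
        if not missing:
            continue
        # Step 1: variable pre-selection by pairwise-complete correlation.
        selected = [
            c for c in range(n_cols)
            if c != j and abs(pearson(cols[j], cols[c])) > min_abs_corr
        ]
        donors = [i for i in range(n_rows) if data[i][j] is not None]

        for i in missing:
            scored = []
            for d in donors:
                # Compare only on selected metabolites observed in both samples.
                shared = [
                    c for c in selected
                    if data[i][c] is not None and data[d][c] is not None
                ]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((data[i][c] - data[d][c]) ** 2 for c in shared)
                    / len(shared)
                )
                scored.append((dist, data[d][j]))
            # Step 2 and 3: average the k closest donor values.
            scored.sort(key=lambda pair: pair[0])
            nearest = scored[:k]
            if nearest:  # otherwise the value stays missing
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed
```

Distances are computed from the original table only, so values imputed earlier in the loop never influence later neighbor searches. A value stays `None` when no donor sample shares an observed pre-selected metabolite with it; a real pipeline would need a fallback for that case.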

Download data

  • Downloaded 877 times
  • Download rankings, all-time:
    • Site-wide: 10,633 out of 72,942
    • In systems biology: 299 out of 1,999
  • Year to date:
    • Site-wide: 21,270 out of 72,942
  • Since beginning of last month:
    • Site-wide: 21,270 out of 72,942
