A proteomics sample metadata representation for multiomics integration and big data analysis
Deepti Jaiswal Kundu,
Melanie Christine Föll,
Tim Van Den Bossche,
Vinay Kumar Duggineni,
Eric W. Deutsch,
Juan Antonio Vizcaíno,
Lev I. Levitsky,
Posted 23 May 2021
bioRxiv DOI: 10.1101/2021.05.21.445143
The amount of public proteomics data is increasing at an extraordinary rate. Hundreds of datasets are submitted each month to ProteomeXchange repositories, representing many types of proteomics studies and focusing on different aspects such as quantitative experiments, post-translational modifications, protein-protein interactions, or subcellular localization, among many others. For every proteomics dataset, two levels of data are captured: the dataset description and the data files (encoded in different file formats). Whereas the dataset description and data file formats are supported by all ProteomeXchange partner repositories, there is no standardized format to properly describe the sample metadata and their relationship with the dataset files in a way that fully allows their understanding or re-analysis. It is left to the user's choice whether or not to provide an ad hoc document containing this information. Therefore, in many cases, understanding the study design and data requires going back to the associated publication. This can be tedious and may be restricted in the case of non-open access publications. In many cases, this problem limits the generalization and reuse of public proteomics data. Here we present a standard representation for sample metadata tailored to proteomics datasets, produced by the HUPO Proteomics Standards Initiative and supported by ProteomeXchange resources. We repurposed the existing data format MAGE-TAB, used routinely in the transcriptomics field, to represent and annotate proteomics datasets. MAGE-TAB-Proteomics defines a set of annotation rules that the datasets submitted to ProteomeXchange should follow, ranging from sample properties to data analysis protocols. We also introduce a crowdsourcing project that enabled the manual curation of over 200 public datasets using MAGE-TAB-Proteomics.
In addition, we describe an ecosystem of tools and libraries that were developed to validate and submit sample metadata-related information to ProteomeXchange. We expect that these tools will improve the reproducibility of published results and facilitate the reanalysis and integration of public proteomics datasets.
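To make the idea of a tab-delimited table linking samples to data files more concrete, the following is a minimal sketch in Python. It is not the MAGE-TAB-Proteomics specification and not one of the tools described above: the column names, the example rows, and the functions `validate_sample_table` and `files_by_sample` are all illustrative assumptions, chosen only to show what checking required annotations and resolving sample-to-file relationships might look like.

```python
import csv
import io

# Illustrative column names only (an assumption of this sketch); the actual
# annotation rules are defined by the MAGE-TAB-Proteomics specification.
REQUIRED_COLUMNS = [
    "source name",
    "characteristics[organism]",
    "assay name",
    "comment[data file]",
]

# Hypothetical example table: one row per sample-to-file relationship.
EXAMPLE_TABLE = (
    "source name\tcharacteristics[organism]\tassay name\tcomment[data file]\n"
    "sample 1\tHomo sapiens\trun 1\tsample1_run1.raw\n"
    "sample 2\tHomo sapiens\trun 2\tsample2_run2.raw\n"
    "sample 3\t\trun 3\tsample3_run3.raw\n"
)


def validate_sample_table(text, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems found in a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    # Report required columns that are absent from the header.
    problems = [f"missing column: {c}" for c in required if c not in header]
    # Report empty cells in required columns; the header is line 1.
    for line_number, row in enumerate(reader, start=2):
        for column in required:
            if column in header and not (row[column] or "").strip():
                problems.append(f"line {line_number}: empty value in '{column}'")
    return problems


def files_by_sample(text):
    """Map each sample name to the data files annotated for it."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        mapping.setdefault(row["source name"], []).append(row["comment[data file]"])
    return mapping


if __name__ == "__main__":
    for problem in validate_sample_table(EXAMPLE_TABLE):
        print(problem)
    print(files_by_sample(EXAMPLE_TABLE))
```

In this sketch the third sample lacks an organism annotation, so the check flags that row. A real validator would also need to check values against controlled vocabularies and ontologies, which this sketch does not attempt.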