Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 73,086 bioRxiv papers from 318,192 authors.

Quality control guidelines and machine learning predictions for next generation sequencing data

By Steffen Albrecht, Miguel A Andrade-Navarro, Jean-Fred Fontaine

Posted 14 Sep 2019
bioRxiv DOI: 10.1101/768713

Quality control of next generation sequencing (NGS) data files is usually not fully automated: the procedure is complex and involves strong assumptions and arbitrary choices. We statistically characterized common NGS quality features across a large set of files and optimized the complex quality control procedure using a machine learning approach that includes tree-based algorithms and deep learning. Predictive models were validated using internal and external data, including applications to disease diagnosis datasets. The models are unbiased, accurate and to some extent generalizable to unseen data types and species. Given enough labelled data for training, this approach could potentially work for any type of NGS assay or species. The derived statistical guidelines and predictive models represent a valuable resource for NGS specialists to better understand quality issues and perform automatic quality control of their own files. Our guidelines and software are available at https://github.com/salbrec/seqQscorer.

Download data

  • Downloaded 341 times
  • Download rankings, all-time:
    • Site-wide: 36,430 out of 73,086
    • In bioinformatics: 4,481 out of 7,136
  • Year to date:
    • Site-wide: 1,507 out of 73,086
  • Since beginning of last month:
    • Site-wide: 1,507 out of 73,086

