Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 70,441 bioRxiv papers from 307,584 authors.

Genomic loci susceptible to systematic sequencing bias in clinical whole genomes

By Timothy M. Freeman, Genomics England Research Consortium, Dennis Wang, Jason Harris

Posted 22 Jun 2019
bioRxiv DOI: 10.1101/679423

Background

Highly accurate next-generation sequencing (NGS) of genetic variants is key to many areas of science and medicine, such as cataloguing population genetic variation and diagnosing patients with genetic diseases. Certain genomic loci and regions can be prone to higher rates of systematic sequencing and alignment bias that pose a challenge to achieving high accuracy, resulting in false positive variant calls. Current standard practices to differentiate between loci that can and cannot be sequenced with high confidence utilise consensus between different sequencing methods as a proxy for sequencing confidence. This assumption is not accurate in cases where all sequencing pipelines have consensus on the same errors due to similar systematic biases in sequencing. Alternative methods are therefore required to identify systematic biases.

Methods

We have developed a novel statistical method based on summarising sequenced reads from whole genome clinical samples and cataloguing them in “Incremental Databases” (IncDBs) that maintain individual confidentiality. Variant statistics were analysed and catalogued for each genomic position that consistently showed systematic biases with the corresponding sequencing pipeline.

Results

We have demonstrated that systematic errors in NGS data are widespread, with persistent low-fraction alleles present at 1.26-2.43% of the human autosomal genome across three different Illumina-based pipelines, each consisting of at least 150 patient samples. We have identified a variety of genomic regions that are more or less prone to systematic biases, such as GC-rich regions (OR = 6.47-8.19) and the NIST high-confidence genomic regions (OR = 0.154-0.191). We have verified our predictions on a gold-standard reference genome and have shown that these systematic biases can lead to suspect variant calls at clinically important loci, including within introns and exons.
Conclusions

Our results call for increased caution to minimise the effect of systematic biases in whole genome sequencing and alignment. This study supports the utility of a statistical approach to enhance quality control of clinically sequenced samples: variant calls made at known suspect loci can be flagged for further analysis or exclusion, using anonymised summary databases from which individual patients cannot be re-identified, so that results can be shared more widely.

Abbreviations

  • BAM: Binary Alignment Map (file format)
  • BED: Browser Extensible Data (file format)
  • cfDNA: Cell-free DNA
  • ctDNA: Circulating tumour DNA
  • GIAB: Genome in a Bottle (consortium)
  • gnomAD: Genome Aggregation Database
  • IGV: Integrative Genomics Viewer (software tool)
  • IncDB: Incremental Database
  • MC: Monte-Carlo
  • NGS: Next-Generation Sequencing
  • NIST: National Institute of Standards and Technology (organisation)
  • SD: Standard Deviation
  • SNPs: Single-Nucleotide Polymorphisms
  • SNVs: Single-Nucleotide Variants
  • WGS: Whole-Genome Sequencing
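
As a rough illustration of the two ideas in the abstract (keeping only per-position summary statistics across samples, and reporting region enrichment as an odds ratio), here is a minimal Python sketch. It is not the authors' implementation: the function names, the flagging rule, its thresholds (`min_mean`, `max_sd`) and all the numbers are hypothetical, chosen only to make the example concrete.

```python
from statistics import mean, pstdev


def summarise_position(allele_fractions):
    """Collapse per-sample non-reference allele fractions at one genomic
    position into summary statistics, so no individual value is retained."""
    return {
        "n_samples": len(allele_fractions),
        "mean": mean(allele_fractions),
        "sd": pstdev(allele_fractions),  # population SD across samples
    }


def is_suspect(summary, min_mean=0.01, max_sd=0.05):
    """Hypothetical rule (an assumption, not from the paper): flag a position
    when a low-fraction allele appears consistently across samples, i.e. the
    mean fraction is non-negligible and the spread between samples is small."""
    return summary["mean"] >= min_mean and summary["sd"] <= max_sd


def odds_ratio(suspect_in, clean_in, suspect_out, clean_out):
    """Odds of a position being suspect inside a region class, divided by
    the odds of it being suspect outside that class."""
    return (suspect_in / clean_in) / (suspect_out / clean_out)


# Toy data: allele fractions for four samples at three positions.
persistent = summarise_position([0.04, 0.05, 0.06, 0.05])  # same low signal in everyone
one_carrier = summarise_position([0.0, 0.0, 0.5, 0.0])     # likely a real variant in one sample
clean_site = summarise_position([0.0, 0.0, 0.0, 0.0])      # no alternative allele at all

flags = [is_suspect(s) for s in (persistent, one_carrier, clean_site)]

# Toy counts: 30 suspect and 70 clean positions inside a region class,
# 10 suspect and 190 clean positions outside it.
toy_or = odds_ratio(30, 70, 10, 190)
```

An odds ratio above 1 means suspect positions are over-represented in the region class, as the abstract reports for GC-rich regions, and a value below 1 means they are under-represented, as reported for the NIST high-confidence regions.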

Download data

  • Downloaded 411 times
  • Download rankings, all-time:
    • Site-wide: 28,396 out of 70,441
    • In bioinformatics: 3,765 out of 6,902
  • Year to date:
    • Site-wide: 27,792 out of 70,441
  • Since beginning of last month:
    • Site-wide: 34,291 out of 70,441
