Discordant genotype calls across technology platforms elucidate variants with systematic errors in next-generation sequencing
By Elizabeth G. Atkinson, Mykyta Artomov, Konrad J. Karczewski, Alexander A. Loboda, Heidi L. Rehm, Daniel G. MacArthur, Benjamin M. Neale, Mark J. Daly
Posted 27 Mar 2022
bioRxiv DOI: 10.1101/2022.03.24.485707
Large-scale next-generation sequencing (NGS) datasets have been transformative for informing clinical variant interpretation and as reference panels for statistical and population genetic efforts. While such resources are often treated as ground truth, we find that in widely used reference datasets such as the Genome Aggregation Database (gnomAD), some variants pass gold-standard filters yet differ systematically in their genotype calls across sequencing technologies. The inclusion of such discordant sites in study designs involving multiple sequencing platforms (e.g., whole genome and/or different whole-exome captures) could bias results and lead to false-positive hits in association studies that reflect technological artifacts rather than a true relationship to the phenotype. Here, we describe this phenomenon of discordant genotype calls across sequencing technologies, characterize the error mode of wrong calls, provide a blacklist of discordant sites identified in gnomAD that should be treated with caution in analyses, and present a metric and machine learning classifier trained on gnomAD data to identify likely discordant variants in other datasets. We find that different NGS technologies have different sets of variants at which this problem occurs, but that there are characteristic variant features that can be used to predict discordant behavior. Discordant sites are largely shared across ancestry groups, though different populations are powered for discovery of different variants. We find that the most common error mode is a variant being called heterozygous on one platform and homozygous on the other, with heterozygous in the genomes and homozygous reference in the exomes making up the majority of miscalls.
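As a rough illustration of what a systematic difference in genotype calls across platforms looks like, the sketch below compares genotype counts at a single site between two platforms with a chi-square test of homogeneity. This is not the metric or classifier described in the abstract, whose details are not given here. The function name `platform_discordance`, the input layout (homozygous reference, heterozygous, homozygous alternate), and the example counts are all assumptions made for illustration.

```python
import math


def platform_discordance(exome_counts, genome_counts):
    """Chi-square test of homogeneity on genotype counts from two platforms.

    Hypothetical helper, not the preprint's metric. Each argument is a
    (hom_ref, het, hom_alt) tuple of sample counts at one site.
    Returns (chi2, p_value).
    """
    # Drop genotype classes that were not observed on either platform.
    cols = [(e, g) for e, g in zip(exome_counts, genome_counts) if e + g > 0]
    n_exome = sum(e for e, _ in cols)
    n_genome = sum(g for _, g in cols)
    total = n_exome + n_genome

    # With fewer than two observed classes, or an empty platform,
    # there is nothing to compare.
    if len(cols) < 2 or n_exome == 0 or n_genome == 0:
        return 0.0, 1.0

    chi2 = 0.0
    for e, g in cols:
        col_total = e + g
        for observed, row_total in ((e, n_exome), (g, n_genome)):
            expected = row_total * col_total / total
            chi2 += (observed - expected) ** 2 / expected

    # Degrees of freedom for a 2 x k table are k - 1, so df is 1 or 2 here.
    # Both cases have closed-form chi-square survival functions.
    df = len(cols) - 1
    if df == 1:
        p_value = math.erfc(math.sqrt(chi2 / 2))
    else:
        p_value = math.exp(-chi2 / 2)
    return chi2, p_value


# Illustrative counts only: heterozygous calls appear in the genomes but
# not in the exomes, the error mode the abstract reports as most common.
chi2_bad, p_bad = platform_discordance((1000, 0, 0), (900, 100, 0))

# Illustrative counts only: the same genotype proportions on both platforms.
chi2_ok, p_ok = platform_discordance((900, 90, 10), (90, 9, 1))
```

A small p-value flags a site whose genotype distribution differs between platforms, but a real analysis would also need to account for ancestry composition and sample overlap between the two call sets, which this sketch ignores.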