Benchmarking challenging small variants with linked and long reads
Nathan D. Olson,
Aaron M. Wenger,
William J. Rowell,
Uday S Evani,
Alvaro Martinez Barrio,
Ian T. Fiddes,
Fritz J. Sedlazeck,
Justin M. Zook,
Genome in a Bottle Consortium
Posted 25 Jul 2020
bioRxiv DOI: 10.1101/2020.07.24.212712
Posted 25 Jul 2020
Genome in a Bottle (GIAB) benchmarks have been widely used to help validate clinical sequencing pipelines and develop new variant calling and sequencing methods. Here we use accurate long and linked reads to expand the prior benchmark to include difficult-to-map regions and segmental duplications that are not readily accessible to short reads. Our new benchmark adds more than 300,000 SNVs, 50,000 indels, and 16 % new exonic variants, many in challenging, clinically relevant genes not previously covered (e.g., PMS2). We increase coverage of the autosomal GRCh38 assembly from 85 % to 92 %, while excluding problematic regions for benchmarking small variants (e.g., copy number variants and assembly errors) that should not have been in the previous version. Our new benchmark reliably identifies both false positives and false negatives across multiple short-, linked-, and long-read based variant calling methods. As an example of its utility, this benchmark identifies eight times more false negatives in a short read variant call set relative to our previous benchmark, mostly in difficult-to-map regions. To enable robust small variant benchmarking, we still exclude 3.6% of GRCh37 and 5.0% of GRCh38 in (1) highly repetitive regions such as large, highly similar segmental duplications and the centromere not accessible to our data and (2) regions where our sample is highly divergent from the reference due to large indels, structural variation, copy number variation, and/or errors in the reference (e.g., some KIR genes that have duplications in HG002). We have demonstrated the utility of this benchmark to assess performance in more challenging regions, which enables benchmarking in more difficult genes and continued technology and bioinformatics development. The v4.2.1 benchmarks are available under ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/.
- Downloaded 2,040 times
- Download rankings, all-time:
- Site-wide: 8,231
- In genomics: 862
- Year to date:
- Site-wide: 3,070
- Since beginning of last month:
- Site-wide: 2,803
Downloads over time
Distribution of downloads per paper, site-wide
- 27 Nov 2020: The website and API now include results pulled from medRxiv as well as bioRxiv.
- 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!