Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 57,789 bioRxiv papers from 265,997 authors.

Minor allele frequency thresholds strongly affect population structure inference with genomic datasets

By Ethan B. Linck, C. J. Battey

Posted 14 Sep 2017
bioRxiv DOI: 10.1101/188623 (published DOI: 10.1111/1755-0998.12995)

Across the genome, the effects of different evolutionary processes and historical events can result in different classes of genetic variants (or alleles) characterized by their relative frequency in a given population. As a result, population genetic inference can be strongly affected by biases in laboratory and bioinformatics treatments that affect the site frequency spectrum, or SFS. Yet despite the widespread use of reduced-representation genomic datasets with nonmodel organisms, the potential consequences of these biases for downstream analyses remain poorly examined. Here, we assess the influence of minor allele frequency (MAF) thresholds implemented during variant detection on inference of population structure. We use simulated and empirical datasets to evaluate the effect of MAF thresholds on the ability to discriminate among populations and quantify admixture with both model-based and non-model-based clustering methods. We find model-based inference of population structure is highly sensitive to choice of MAF, and may be confounded by either including singletons or excluding all rare alleles. In contrast, non-model-based clustering is largely robust to MAF choice. Our results suggest that model-based inference of population structure can fail due to either natural demographic processes or assembly artifacts, with broad consequences for phylogeographic and population genetic studies using NGS data. We propose a simple hypothesis to explain this behavior and recommend a set of best practices for researchers seeking to describe population structure using reduced-representation libraries.
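To make the filtering step concrete, the following is a minimal sketch of how a MAF or minor allele count threshold could be applied to a genotype matrix. It is not the authors' pipeline. The function names (`minor_allele_count`, `minor_allele_frequency`, `filter_sites`), the 0/1/2 genotype encoding, and the toy data are all assumptions made for illustration. It assumes diploid, biallelic sites, with `None` marking a missing call.

```python
from typing import List, Optional, Sequence

# Assumed encoding: copies of the alternate allele (0, 1, or 2); None = missing call.
Genotype = Optional[int]


def minor_allele_count(site: Sequence[Genotype]) -> int:
    """Count copies of the less common allele at one diploid, biallelic site."""
    called = [g for g in site if g is not None]
    alt_copies = sum(called)
    total_copies = 2 * len(called)
    return min(alt_copies, total_copies - alt_copies)


def minor_allele_frequency(site: Sequence[Genotype]) -> float:
    """Minor allele count divided by the number of called allele copies."""
    called = [g for g in site if g is not None]
    if not called:
        return 0.0  # no data at this site
    return minor_allele_count(site) / (2 * len(called))


def filter_sites(
    sites: Sequence[Sequence[Genotype]],
    min_maf: float = 0.0,
    min_mac: int = 1,
) -> List[int]:
    """Return indices of sites passing both the count and frequency thresholds."""
    return [
        i
        for i, site in enumerate(sites)
        if minor_allele_count(site) >= min_mac
        and minor_allele_frequency(site) >= min_maf
    ]


# Toy data: each inner list is one site across five individuals.
sites = [
    [0, 0, 0, 1, 0],     # singleton: one copy of the minor allele
    [0, 1, 1, 2, 0],     # common variant
    [2, 2, 2, 2, 2],     # monomorphic: no minor allele present
    [0, 1, None, 0, 0],  # singleton with one missing call
]

keep_all_variable = filter_sites(sites, min_mac=1)  # keeps singletons
drop_singletons = filter_sites(sites, min_mac=2)    # removes singletons
```

In this sketch, raising `min_mac` from 1 to 2 removes singletons, while a high `min_maf` would remove all rare alleles; these are the two extremes the abstract identifies as potentially confounding for model-based clustering.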

Download data

  • Downloaded 6,154 times
  • Download rankings, all-time:
    • Site-wide: 248 out of 57,789
    • In genomics: 72 out of 4,057
  • Year to date:
    • Site-wide: 139 out of 57,789
  • Since beginning of last month:
    • Site-wide: 282 out of 57,789
