Rxivist logo

The Multispecies Coalescent Model Outperforms Concatenation across Diverse Phylogenomic Data Sets

By Xiaodong Jian, Scott V. Edwards, Liang Liu

Posted 02 Dec 2019
bioRxiv DOI: 10.1101/860809

A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests are here applied and developed to evaluate and compare the adequacy of substitution, concatenation, and multispecies coalescent (MSC) models across 47 phylogenomic data sets collected across tree of life. Tests for substitution models and the concatenation assumption of topologically concordant gene trees suggest that a poor fit of substitution models (44% of loci rejecting the substitution model) and concatenation models (38% of loci rejecting the hypothesis of topologically congruent gene trees) is widespread. Logistic regression shows that the proportions of GC content and informative sites are both negatively correlated with the fit of substitution models across loci. Moreover, a substantial violation of the concatenation assumption of congruent gene trees is consistently observed across 6 major groups (birds, mammals, fish, insects, reptiles, and others, including other invertebrates). In contrast, among those loci adequately described by a given substitution model, the proportion of loci rejecting the MSC model is 11%, significantly lower than those rejecting the substitution and concatenation models, and Bayesian model comparison strongly favors the MSC over concatenation across all data sets. Species tree inference suggests that loci rejecting the MSC have little effect on species tree estimation. Due to computational constraints, the Bayesian model validation and comparison analyses were conducted on the reduced data sets. A complete analysis of phylogenomic data requires the development of efficient algorithms for phylogenetic inference. Nevertheless, the concatenation assumption of congruent gene trees rarely holds for phylogenomic data with more than 10 loci. Thus, for large phylogenomic data sets, model comparison analyses are expected to consistently and more strongly favor the coalescent model over the concatenation model. Our analysis reveals the value of model validation and comparison in phylogenomic data analysis, as well as the need for further improvements of multilocus models and computational tools for phylogenetic inference.

Download data

  • Downloaded 296 times
  • Download rankings, all-time:
    • Site-wide: 98,569
    • In evolutionary biology: 5,052
  • Year to date:
    • Site-wide: 144,037
  • Since beginning of last month:
    • Site-wide: None

Altmetric data

Downloads over time

Distribution of downloads per paper, site-wide