Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. Because these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. We therefore also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time-efficient than conventional approaches.
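To make the idea of a random-effects extension concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes that each off-diagonal rate of a base HKY model is multiplied by exp(ε_ij) for a pairwise random effect ε_ij; the abstract does not state this parameterization, so treat it as an assumption. The function names (`hky_rates`, `apply_random_effects`, `kolmogorov_gap`) are hypothetical. The sketch then checks Kolmogorov's criterion on three-state cycles: a nonzero gap on any cycle rules out reversibility.

```python
import math

NUCLEOTIDES = "ACGT"
# Transitions are purine<->purine (A, G) and pyrimidine<->pyrimidine (C, T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rates(kappa, pi):
    """Unnormalized off-diagonal HKY rates: kappa * pi_j for transitions, pi_j otherwise."""
    n = len(NUCLEOTIDES)
    q = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(NUCLEOTIDES):
        for j, b in enumerate(NUCLEOTIDES):
            if i != j:
                q[i][j] = (kappa if (a, b) in TRANSITIONS else 1.0) * pi[j]
    return q


def apply_random_effects(base, eps):
    """Assumed form: q_ij = base_ij * exp(eps_ij); the diagonal makes each row sum to zero."""
    n = len(base)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = base[i][j] * math.exp(eps[i][j])
        q[i][i] = -sum(q[i][j] for j in range(n) if j != i)
    return q


def kolmogorov_gap(q):
    """Largest |q_ij q_jk q_ki - q_ik q_kj q_ji| over distinct triples (i, j, k)."""
    n = len(q)
    gap = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) == 3:
                    forward = q[i][j] * q[j][k] * q[k][i]
                    backward = q[i][k] * q[k][j] * q[j][i]
                    gap = max(gap, abs(forward - backward))
    return gap


if __name__ == "__main__":
    pi = [0.3, 0.2, 0.2, 0.3]  # illustrative base frequencies for A, C, G, T
    base = hky_rates(kappa=2.0, pi=pi)

    no_effects = [[0.0] * 4 for _ in range(4)]
    one_effect = [[0.0] * 4 for _ in range(4)]
    one_effect[0][2] = 1.0  # inflate only the A -> G rate

    print("gap without random effects:", kolmogorov_gap(apply_random_effects(base, no_effects)))
    print("gap with one random effect:", kolmogorov_gap(apply_random_effects(base, one_effect)))
```

With all random effects at zero, the matrix reduces to the base HKY model and the cycle gap vanishes up to floating-point rounding. A single nonzero effect on one direction of a pair already breaks the cycle condition, which illustrates how such a model can represent both negligible and substantial departures from the base model.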