Random-effects substitution models for phylogenetics via scalable gradient approximations

Andrew F. Magee, Andrew J. Holbrook, Jonathan E. Pekar, Itzue W. Caviedes-Solis, Fredrick A. Matsen IV, Guy Baele, Joel O. Wertheim, Xiang Ji, Philippe Lemey, Marc A. Suchard

Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. Because these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. We therefore also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of both sampling-based inference (Bayesian inference via Hamiltonian Monte Carlo, HMC) and maximization-based inference (maximum a posteriori, MAP, estimation) under random-effects substitution models across large trees and state spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is more adequate than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. On a dataset of 28 taxa spanning the Metazoa, a random-effects amino acid substitution model finds evidence of notable departures from the current best-fit amino acid model in seconds. We show that our gradient-based inference approach is over an order of magnitude more time-efficient than conventional approaches.
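To make the idea of a random-effects extension concrete, the following minimal Python sketch builds an HKY rate matrix and perturbs each off-diagonal rate by a multiplicative factor exp(eps_ij). The abstract does not state the parameterization, so the log-scale multiplicative form used here is an assumption made for illustration. The function names (`hky_rates`, `add_random_effects`, `max_cycle_log_imbalance`) and the numerical values of kappa and the base frequencies are also hypothetical. The sketch then checks Kolmogorov's cycle condition on all three-state cycles, a necessary condition for reversibility: the plain HKY matrix satisfies it, while a single nonzero random effect breaks it.

```python
import math
from itertools import permutations

NUCS = "ACGT"
PURINES = {"A", "G"}

def hky_rates(kappa, freqs):
    """Unnormalized HKY rate matrix: q_ij = pi_j, times kappa for transitions."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(NUCS):
        for j, b in enumerate(NUCS):
            if i == j:
                continue
            # A<->G and C<->T are transitions; all other changes are transversions.
            is_transition = (a in PURINES) == (b in PURINES)
            q[i][j] = (kappa if is_transition else 1.0) * freqs[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so this sums off-diagonals
    return q

def add_random_effects(base, eps):
    """Assumed form: scale each off-diagonal base rate by exp(eps_ij)."""
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i][j] = base[i][j] * math.exp(eps[i][j])
        q[i][i] = -sum(q[i])  # rows of a rate matrix sum to zero
    return q

def max_cycle_log_imbalance(q):
    """Largest |log(forward product) - log(reverse product)| over 3-state cycles.

    A reversible chain gives 0 for every cycle (Kolmogorov's criterion),
    so a clearly positive value implies nonreversibility.
    """
    worst = 0.0
    for i, j, k in permutations(range(4), 3):
        forward = math.log(q[i][j]) + math.log(q[j][k]) + math.log(q[k][i])
        reverse = math.log(q[i][k]) + math.log(q[k][j]) + math.log(q[j][i])
        worst = max(worst, abs(forward - reverse))
    return worst

# Hypothetical parameter values, chosen only for illustration.
base = hky_rates(kappa=2.0, freqs=[0.3, 0.2, 0.2, 0.3])

# All random effects zero except an elevated A->C rate.
eps = [[0.0] * 4 for _ in range(4)]
eps[0][1] = 0.5
perturbed = add_random_effects(base, eps)

print("HKY cycle imbalance:           ", max_cycle_log_imbalance(base))
print("Random-effects cycle imbalance:", max_cycle_log_imbalance(perturbed))
```

With all random effects set to zero, the perturbed matrix reduces to the base HKY matrix. In a Bayesian analysis the random effects would receive a prior that shrinks them toward zero, although the specific prior is not described in the abstract.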
