In Bayesian phylogenetics, the goal is to estimate the posterior distribution over phylogenetic trees. Markov chain Monte Carlo (MCMC) methods are widely used to approximate phylogenetic posterior distributions. For large-scale sequence data, repeated evaluation of the likelihood function incurs a high computational cost. In this article, we propose a machine-learning algorithm that uses more than 35 topological and branch-length features to predict the changes in the likelihood function caused by the tree moves (e.g., eSPR, stNNI) used in standard MCMC approaches. This algorithm is then used to design a delayed-acceptance MCMC kernel, which uses the predicted surrogate function for preliminary rejection to accelerate tree-space searches. Furthermore, we integrate the proposed MCMC kernel into the sequential Monte Carlo sampler framework. We validate the proposed delayed-acceptance sequential Monte Carlo approach (DA-SMC) on simulated and real data sets. Our delayed-acceptance kernel maintains robust estimation while significantly reducing the number of likelihood evaluations, yielding substantial savings in computation time. We develop a Python package that is available at https://github.com/wentYu/DAphyloSMC.
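To illustrate the preliminary-rejection idea behind a delayed-acceptance kernel, the following is a minimal sketch of a generic two-stage Metropolis-Hastings step with a symmetric proposal. It is not the DAphyloSMC implementation: the one-dimensional Gaussian target, the deliberately mismatched surrogate, and all function names (`log_target`, `log_surrogate`, `da_mh_step`, `run_chain`) are assumptions made for illustration, standing in for the expensive tree likelihood and the learned predictor. A proposal is first screened with the cheap surrogate; only proposals that pass are evaluated with the expensive target, and the second-stage ratio divides out the surrogate ratio so that the chain still targets the exact distribution.

```python
import math
import random


def log_target(x):
    # Hypothetical stand-in for the expensive log-posterior
    # (here a standard normal, up to an additive constant).
    return -0.5 * x * x


def log_surrogate(x):
    # Hypothetical stand-in for a cheap learned predictor;
    # deliberately mismatched (standard deviation 1.2 instead of 1).
    return -0.5 * (x / 1.2) ** 2


def log_uniform(rng):
    # Log of a draw from (0, 1]; avoids taking log(0).
    return math.log(1.0 - rng.random())


def da_mh_step(x, logp_x, rng, step=1.0):
    """One delayed-acceptance step with a symmetric random-walk proposal.

    Returns (state, log_target(state), expensive_evaluation_made).
    """
    y = x + rng.gauss(0.0, step)

    # Stage 1: screen the proposal using only the cheap surrogate.
    log_r1 = log_surrogate(y) - log_surrogate(x)
    if log_uniform(rng) > min(0.0, log_r1):
        return x, logp_x, False  # rejected early, no expensive evaluation

    # Stage 2: evaluate the expensive target and correct for the surrogate.
    logp_y = log_target(y)
    log_r2 = (logp_y - logp_x) - log_r1
    if log_uniform(rng) <= min(0.0, log_r2):
        return y, logp_y, True
    return x, logp_x, True


def run_chain(n_steps=20000, seed=0):
    rng = random.Random(seed)
    x = 0.0
    logp_x = log_target(x)
    samples, expensive_calls = [], 0
    for _ in range(n_steps):
        x, logp_x, used = da_mh_step(x, logp_x, rng)
        expensive_calls += used  # True counts as 1
        samples.append(x)
    return samples, expensive_calls
```

Calling `run_chain()` returns the samples together with the number of expensive evaluations, which is smaller than the number of steps whenever the first stage rejects some proposals. The size of the saving in a real phylogenetic setting depends on how well the surrogate tracks the true likelihood change, which this toy example does not attempt to model.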