Phylogenetic inference, grounded in molecular evolution models, is essential for understanding evolutionary relationships in biological data. Accounting for the uncertainty of phylogenetic tree variables, which include tree topologies and evolutionary distances on branches, is crucial both for accurately inferring species relationships from molecular data and for tasks that require marginalizing over these variables. Variational Bayesian methods are key to developing scalable, practical models; however, it remains challenging to conduct phylogenetic inference without restricting the combinatorially vast set of possible tree topologies. In this work, we introduce a novel, fully differentiable formulation of phylogenetic inference that leverages a unique representation of topological distributions in continuous geometric spaces. Through practical choices of design spaces and of control variates for gradient estimation, our approach, GeoPhy, enables variational inference without limiting the topological candidates. In experiments using real benchmark datasets, GeoPhy significantly outperformed other approximate Bayesian methods that likewise consider the full set of topologies.