Phylogenetics is a branch of computational biology that studies the evolutionary relationships among biological entities. Its long history and numerous applications notwithstanding, inference of phylogenetic trees from sequence data remains challenging: the high complexity of tree space poses a significant obstacle for current combinatorial and probabilistic techniques. In this paper, we adopt the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex combinatorial structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies and evolutionary distances. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets. PhyloGFN is competitive with prior work in marginal likelihood estimation and achieves a closer fit to the target distribution than state-of-the-art variational inference methods. Our code is available at https://github.com/zmy1116/phylogfn.