Rooted bifurcating trees are mathematical objects used to model evolutionary relationships, and they arise naturally in both coalescent theory and phylogenetics. Recent numerical representations of tree topologies, known as F-matrices, make it possible to summarize a sample of trees via Fréchet means and provide new measures of tree balance. However, the number of ranked unlabelled trees grows super-exponentially with the number of leaves. This growth makes computation intensive, and current methods rely on mixed-integer programming and simulation. Moreover, F-matrices are difficult to interpret, and their distribution has been described only through first- and second-order moments under neutral branching. In this paper, we introduce a Markov chain embedding of ranked unlabelled trees that drastically reduces the size of the state space. Leveraging this embedding, we develop an algorithm that efficiently computes all Fréchet means, and we use discrete phase-type theory to obtain the joint distribution of tree balance indices. We also use discrete phase-type theory to generalize previous results on the moments of F-matrices to arbitrary order for any time-homogeneous, bifurcating coalescent model. Using this framework, we construct three tests for neutrality and demonstrate, on simulated data, that they have greater power than previous methods.
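As a rough illustration of the discrete phase-type machinery referred to above, the following sketch computes the probability mass function of the absorption time of a small absorbing Markov chain, using the standard identity P(τ = n) = α T^(n-1) t with t = (I - T)1. The two-state chain, the function name `dph_pmf`, and the truncation at 200 steps are assumptions made purely for illustration; this is not the tree embedding constructed in the paper.

```python
def dph_pmf(alpha, T, n_max):
    """Return [P(tau = 1), ..., P(tau = n_max)] for tau ~ DPH(alpha, T).

    alpha: initial distribution over the transient states (list of floats).
    T:     sub-stochastic transition matrix among transient states (list of rows).
    """
    m = len(alpha)
    # Exit probabilities t = (I - T) 1: one minus each row sum of T.
    exit_probs = [1.0 - sum(row) for row in T]
    v = list(alpha)  # v holds alpha T^(n-1) at the start of step n
    pmf = []
    for _ in range(n_max):
        # P(tau = n) = alpha T^(n-1) t
        pmf.append(sum(v[i] * exit_probs[i] for i in range(m)))
        # Advance one step: v <- v T
        v = [sum(v[i] * T[i][j] for i in range(m)) for j in range(m)]
    return pmf


# Hypothetical toy chain: two transient states visited in sequence,
# each left with probability 0.5 per step (tau is a sum of two geometrics).
alpha = [1.0, 0.0]
T = [[0.5, 0.5],
     [0.0, 0.5]]

pmf = dph_pmf(alpha, T, 200)
# Truncated mean; the tail beyond 200 steps is negligible for this chain.
mean = sum(n * p for n, p in enumerate(pmf, start=1))
```

For this toy chain the absorption time is the sum of two independent geometric waiting times, so the mean computed from the truncated mass function should be close to 4. The same recursion applies to any sub-stochastic matrix T, which is why shrinking the state space matters for computations of this kind.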