Statistics for Phylogenetic Trees in the Presence of Stickiness

Samples of phylogenetic trees arise in a variety of evolutionary and biomedical applications, and the Fr\'echet mean in Billera-Holmes-Vogtmann tree space is a summary tree shown to have advantages over other mean or consensus trees. However, use of the Fr\'echet mean raises computational and statistical issues, which we explore in this paper. The Fr\'echet sample mean is known to often contain fewer internal edges than the trees in the sample, and in this circumstance calculating the mean by iterative schemes can be problematic due to slow convergence. We present new methods for identifying edges which must lie in the Fr\'echet sample mean and apply these to a data set of gene trees relating organisms from the apicomplexa, which cause a variety of parasitic infections. When a sample of trees contains a significant level of heterogeneity in the branching patterns, or topologies, displayed by the trees, the Fr\'echet mean is often a star tree, lacking any internal edges. In this situation, and in others, the population Fr\'echet mean is affected by a non-Euclidean phenomenon called stickiness, which impacts upon asymptotics, and we examine two data sets for which the mean tree is a star tree. The first consists of trees representing the physical shape of artery structures in a sample of medical images of human brains, in which the branching patterns are very diverse. The second consists of gene trees from a population of baboons in which there is evidence of substantial hybridization. We develop hypothesis tests which work in the presence of stickiness. The first is a test for the presence of a given edge in the Fr\'echet population mean; the second is a two-sample test for differences between two distributions which share the same sticky population mean.
