Several factors can cause phylogenetic inference to produce two or more conflicting hypotheses about the evolutionary history of a set X of biological entities, that is, phylogenetic trees with the same set of leaf labels X but distinct topologies. This leads naturally to the goal of quantifying the difference between two such trees T_1 and T_2. Here we introduce the problem of computing a 'maximum relaxed agreement forest' (MRAF) and use it as a proxy for the dissimilarity of T_1 and T_2, which in this article we assume to be unrooted binary phylogenetic trees. MRAF asks for a partition of the leaf labels X into a minimum number of blocks S_1, S_2, ..., S_k such that, for each i, the subtrees induced in T_1 and T_2 by S_i are isomorphic up to suppression of degree-2 nodes and taking the labels X into account. Unlike in the earlier maximum agreement forest (MAF) model, the subtrees induced by the S_i are allowed to overlap. We prove that it is NP-hard to compute MRAF, by reducing from the problem of partitioning a permutation into a minimum number of monotonic subsequences (PIMS). Furthermore, we show that MRAF has a polynomial-time O(log n)-approximation algorithm, where n=|X|, and permits exact algorithms with single-exponential running time. When at least one of the two input trees has a caterpillar topology, we prove that testing whether an MRAF has size at most k can be done in polynomial time when k is fixed. We also note that on two caterpillars the approximability of MRAF is related to that of PIMS. Finally, we establish a number of bounds on MRAF, compare its behaviour to that of MAF both in theory and in an experimental setting, and discuss a number of open problems.
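The agreement condition in the MRAF definition can be illustrated on very small instances with the following minimal sketch. It is not the algorithm of the article: it assumes trees are given as edge lists, it tests whether a block S induces the same labelled topology in both trees by comparing quartet splits (a standard characterisation of unrooted binary trees, swapped in here for an explicit isomorphism test), and it finds a minimum partition by exhaustive search, which is only feasible for a handful of leaves. All function names (adjacency, leaf_distances, quartet_partner, agrees, min_relaxed_agreement_forest) and the internal node names "u" and "v" are hypothetical.

```python
from collections import deque
from itertools import combinations


def adjacency(edges):
    """Build an undirected adjacency map from a list of (node, node) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def leaf_distances(adj, leaves):
    """Edge-count distance from every leaf to every node, via BFS."""
    dist = {}
    for s in leaves:
        d = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[s] = d
    return dist


def quartet_partner(dist, a, b, c, d):
    """Leaf paired with `a` in the split the tree induces on {a, b, c, d}.

    Four-point condition: in a binary tree the true pairing has the
    strictly smallest sum of within-pair distances.
    """
    sums = {
        b: dist[a][b] + dist[c][d],
        c: dist[a][c] + dist[b][d],
        d: dist[a][d] + dist[b][c],
    }
    return min(sums, key=sums.get)


def agrees(dist1, dist2, block):
    """True if both trees induce the same topology on `block`.

    Blocks with fewer than four leaves contain no quartet and always agree.
    """
    return all(
        quartet_partner(dist1, *q) == quartet_partner(dist2, *q)
        for q in combinations(sorted(block), 4)
    )


def min_relaxed_agreement_forest(adj1, adj2, leaves):
    """Exhaustive search for a minimum partition into agreeing blocks."""
    leaves = sorted(leaves)
    d1, d2 = leaf_distances(adj1, leaves), leaf_distances(adj2, leaves)
    best = [[x] for x in leaves]  # singletons are always a valid solution

    def extend(i, blocks):
        nonlocal best
        if len(blocks) >= len(best):
            return  # cannot improve on the best partition found so far
        if i == len(leaves):
            best = [list(b) for b in blocks]
            return
        x = leaves[i]
        for b in blocks:  # try adding x to each existing block
            b.append(x)
            if agrees(d1, d2, b):
                extend(i + 1, blocks)
            b.pop()
        blocks.append([x])  # or open a new block for x
        extend(i + 1, blocks)
        blocks.pop()

    extend(0, [])
    return best


# Two quartet trees on X = {a, b, c, d}: T_1 is ab|cd, T_2 is ac|bd.
X = ["a", "b", "c", "d"]
T1 = adjacency([("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"), ("d", "v")])
T2 = adjacency([("a", "u"), ("c", "u"), ("u", "v"), ("b", "v"), ("d", "v")])
forest = min_relaxed_agreement_forest(T1, T2, X)
print(len(forest), forest)
```

On this pair the two trees disagree on the only quartet, so a single block is impossible, while any three leaves together with the remaining singleton form a valid partition; the search therefore needs two blocks.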