Many tools are available for inferring phylogenetic trees, which depict the evolutionary relationships among biological entities such as viral and bacterial strains in infectious outbreaks, or cancerous cells in tumor progression trees. These tools rely on different inference methods, and the resulting trees are not unique. Methods for comparing phylogenies that reveal where two phylogenetic trees agree or differ are therefore required. One approach is to compute a similarity or dissimilarity measure between trees; the Robinson-Foulds distance is one of the most widely used and can be computed in linear time and space. Nevertheless, given the large and increasing volume of phylogenetic data, phylogenetic trees are becoming very large, with hundreds of thousands of leaves. In this context, space requirements become an issue both when computing tree distances and when storing trees. We therefore propose an efficient implementation of the Robinson-Foulds distance over succinct representations of trees. Our implementation also generalizes the Robinson-Foulds distance to labelled phylogenetic trees, i.e., trees with labels on all nodes rather than only on the leaves. Experimental results show that we still achieve linear time while requiring less space. Our implementation is available as an open-source tool at https://github.com/pedroparedesbranco/TreeDiff.
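To make the two central notions concrete, the following minimal Python sketch computes the rooted (cluster-based) Robinson-Foulds distance between two small trees and encodes a tree shape as a balanced-parentheses string, a common basis for succinct tree representations. It is an illustration only and not the TreeDiff implementation: the nested-tuple tree encoding and the function names `clusters`, `robinson_foulds`, and `balanced_parentheses` are assumptions made for this example, and the set-based comparison does not attain the linear time and space bounds discussed above.

```python
from typing import FrozenSet, Set, Tuple, Union

# Hypothetical tree encoding for this sketch: a leaf is a string label,
# an internal node is a tuple of child subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def clusters(tree: Tree) -> Set[FrozenSet[str]]:
    """Collect the set of leaf labels (cluster) below every internal node."""
    found: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found


def robinson_foulds(t1: Tree, t2: Tree) -> int:
    """Rooted RF distance: number of clusters present in exactly one tree."""
    return len(clusters(t1) ^ clusters(t2))


def balanced_parentheses(tree: Tree) -> str:
    """Encode the tree shape with one '(' and one ')' per node."""
    if isinstance(tree, str):
        return "()"
    return "(" + "".join(balanced_parentheses(child) for child in tree) + ")"


if __name__ == "__main__":
    t1 = (("A", "B"), ("C", "D"))
    t2 = (("A", "C"), ("B", "D"))
    print(robinson_foulds(t1, t2))   # clusters {A,B}, {C,D}, {A,C}, {B,D} differ
    print(balanced_parentheses(t1))  # two symbols (bits) per node
```

The parentheses string uses two symbols per node, so a tree with n nodes needs about 2n bits for its shape, whereas a pointer-based representation needs several machine words per node. The recursive traversal is kept for readability; trees with hundreds of thousands of leaves would call for an iterative traversal and a bit vector instead of a Python string.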