The mode of a collection of values (i.e., the most frequent value in the collection) is a key summary statistic. Finding the mode in a given range of an array of values is thus of great importance, and constructing a data structure to solve this problem is the well-known Range Mode problem. In this work, we introduce the Subtree Mode (SM) problem, the analogous problem in a leaf-colored tree, where the task is to compute the most frequent color among the leaves of the subtree of a given node. SM is motivated by several applications in domains such as text analytics and biology, where the data are hierarchical and can thus be represented as a (leaf-colored) tree. Our central contribution is a time-optimal algorithm for SM that computes the answer for every node of an input $N$-node tree in $O(N)$ time. We further show how our solution can be adapted to node-colored trees, or to computing the $k$ most frequent colors, for any given $k=O(1)$, in the optimal $O(N)$ time. Moreover, we prove that a similarly fast solution is highly unlikely when the input is a sink-colored directed acyclic graph instead of a leaf-colored tree. Our experiments on real datasets with trees of up to $7.3$ billion nodes demonstrate that our algorithm is faster than baselines by at least one order of magnitude and much more space-efficient. They also show that it is effective in pattern mining, sequence-to-database search, and biology applications.
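To make the SM problem statement concrete, the following is a minimal Python sketch of a simple baseline that computes a most frequent leaf color for every node. It is not the $O(N)$ algorithm of this work: it uses the standard small-to-large merging of per-subtree color counts, which performs $O(N \log N)$ dictionary updates. The function name `subtree_modes`, the child-list input format, and the `leaf_color` mapping are assumptions made for illustration, and ties between equally frequent colors are broken arbitrarily.

```python
def subtree_modes(children, leaf_color, root=0):
    """Baseline sketch (not the paper's O(N) method).

    children[v]   : list of the children of node v (empty for a leaf)
    leaf_color[v] : color of leaf v (hypothetical input format)
    Returns a list whose entry v is a most frequent leaf color
    in the subtree rooted at v.
    """
    n = len(children)

    # Iterative DFS: every node appears after its parent in `order`.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])

    counts = [None] * n  # counts[v]: color -> number of leaves below v
    best = [None] * n    # best[v]: (frequency, color) of a mode below v
    leaves = [0] * n     # leaves[v]: number of leaves below v

    for v in reversed(order):  # children are handled before their parent
        if not children[v]:
            color = leaf_color[v]
            counts[v], best[v], leaves[v] = {color: 1}, (1, color), 1
            continue

        leaves[v] = sum(leaves[c] for c in children[v])
        # Reuse the count map of the child with the most leaves and
        # merge the maps of all other children into it.
        big = max(children[v], key=lambda c: leaves[c])
        acc, top = counts[big], best[big]
        for c in children[v]:
            if c == big:
                continue
            for color, freq in counts[c].items():
                acc[color] = acc.get(color, 0) + freq
                if acc[color] > top[0]:
                    top = (acc[color], color)
            counts[c] = None  # the merged map is no longer needed
        counts[big] = None
        counts[v], best[v] = acc, top

    return [b[1] for b in best]


# Example tree: root 0 has children 1 and 2; node 1 has leaves 3, 4;
# node 2 has leaves 5, 6, 7.
children = [[1, 2], [3, 4], [5, 6, 7], [], [], [], [], []]
leaf_color = {3: "a", 4: "a", 5: "b", 6: "b", 7: "b"}
modes = subtree_modes(children, leaf_color)
```

In this example, node 1 has mode `"a"`, node 2 has mode `"b"`, and the root has mode `"b"`, since three of its five leaves are colored `"b"`. Only the count map of the current subtree survives each merge, so the sketch answers the query for all nodes in one bottom-up pass but does not support later changes to the tree.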