The Complexity of Maximal/Closed Frequent Tree Mining for Bounded Height Trees

Frequent tree mining asks for the enumeration of all tree patterns that occur frequently in a database of rooted trees. The problem is motivated by tree-structured data in bioinformatics, such as glycans and pseudoknot-free RNA secondary structures. Enumerating all frequent trees directly is often highly redundant, because every subtree of a frequent tree is itself frequent. Closed and maximal frequent trees are standard ways to reduce this redundancy, but enumerating them can still be computationally hard. In this paper, we study the effect of bounding the height of the input trees, a natural restriction for rooted trees because the height is the depth of the hierarchy. We ask whether closed and maximal frequent tree mining remain hard when every input tree has small height. Our results show that the answer depends sharply on the model. For rooted unordered trees of height at most 2, we give a polynomial-delay algorithm for enumerating closed frequent trees. In contrast, for rooted ordered trees of height at most 2, we show that an output-polynomial time algorithm for enumerating closed frequent trees would imply an output-polynomial time algorithm for Dualization. For maximal frequent tree enumeration, we prove that, unless P = NP, no output-polynomial time algorithm exists, even for rooted ordered trees of height at most 2 and for rooted unordered trees of height at most 3. Thus, even very small height bounds do not make these enumeration problems easy in general, with closed frequent tree enumeration on unordered trees of height at most 2 as the tractable exception. Together, these results give a height-based classification of the complexity of closed and maximal frequent tree mining on shallow rooted trees.
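To make the notions of frequent, closed, and maximal patterns concrete, the following toy sketch may help. It is not the algorithm of this paper. It assumes a strongly simplified setting: every tree is unordered, has height at most 1, and shares the same root label, so a tree is just a multiset of leaf labels and a pattern occurs in a tree when its multiset is contained in the tree's. The function names (`occurs`, `support`, `frequent_patterns`, `closed_and_maximal`) and the sample database are hypothetical, and the enumeration is brute force, so it is exponential in general.

```python
from collections import Counter
from itertools import product

# Assumption: a tree is a Counter of leaf labels below a fixed root label.

def occurs(pattern, tree):
    """Pattern occurs in tree if every leaf label is present often enough."""
    return all(tree[label] >= k for label, k in pattern.items())

def support(pattern, db):
    """Number of database trees in which the pattern occurs."""
    return sum(occurs(pattern, t) for t in db)

def frequent_patterns(db, minsup):
    """Brute-force list of all patterns with support >= minsup."""
    labels = sorted(set().union(*db))
    caps = [max(t[l] for t in db) for l in labels]  # max useful multiplicity
    result = []
    for counts in product(*(range(c + 1) for c in caps)):
        p = Counter({l: k for l, k in zip(labels, counts) if k > 0})
        if support(p, db) >= minsup:
            result.append(p)
    return result

def extensions(pattern, labels):
    """All patterns obtained by adding exactly one leaf."""
    for l in labels:
        q = pattern.copy()
        q[l] += 1
        yield q

def closed_and_maximal(db, minsup):
    labels = sorted(set().union(*db))
    freq = frequent_patterns(db, minsup)
    # Support never increases when a leaf is added, so checking one-leaf
    # extensions is enough for both definitions in this toy setting.
    closed = [p for p in freq
              if all(support(q, db) < support(p, db)
                     for q in extensions(p, labels))]
    maximal = [p for p in freq
               if all(support(q, db) < minsup
                      for q in extensions(p, labels))]
    return freq, closed, maximal

# Hypothetical database of three trees with leaf multisets {a,a,b}, {a,b}, {a,c}.
db = [Counter("aab"), Counter("ab"), Counter("ac")]
freq, closed, maximal = closed_and_maximal(db, minsup=2)
print(len(freq), closed, maximal)
```

With a minimum support of 2, this database has four frequent patterns (the bare root, `a`, `b`, and `ab`), of which `a` and `ab` are closed and only `ab` is maximal. Even in this tiny case, the closed and maximal sets are smaller than the full frequent set, which is the redundancy reduction the abstract refers to.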
