Frequent pattern mining is a relevant method for analysing structured data such as sequences, trees or graphs. It consists of identifying characteristic substructures of a dataset. This paper deals with a new type of pattern for tree data: common subtrees with identical label distribution. Detecting them is far from obvious, since the underlying isomorphism problem is graph-isomorphism-complete. An elaborate search algorithm is developed and analysed from both theoretical and numerical perspectives. Building on this algorithm, patterns are enumerated through a new lossless compression scheme for trees, called DAG-RW, whose complexity is investigated as well. The method performs very well, both in computation time and in the analysis of real datasets from the literature. Compared to other substructures, such as topological subtrees and labelled subtrees, for which the isomorphism problem is linear, the patterns found provide a more parsimonious representation of the data.