Sequential pattern mining (SPM) has broad application prospects and is widely used in many fields. Non-overlapping SPM, one such data mining technique, is used to discover patterns subject to gap constraints in specific mining tasks, such as biological data mining. For non-overlapping sequential patterns with gap constraints, the Nettree structure has been proposed to compute pattern support efficiently. In pattern mining, users usually have to choose a minimum support threshold (\textit{minsup}), which is especially difficult for large databases. Although some existing algorithms can mine the top-$k$ patterns, they are approximate and restricted to patterns of fixed length. In this paper, we propose an exact algorithm for mining \underline{T}op-$k$ \underline{N}on-\underline{O}verlapping \underline{S}equential \underline{P}atterns (TNOSP). The top-$k$ formulation of SPM is an effective way to discover the most frequent non-overlapping sequential patterns without setting \textit{minsup}. TNOSP finds the exact top-$k$ non-overlapping sequential patterns under different gap constraints. We further propose a pruning strategy named \underline{Q}ueue \underline{M}eta \underline{S}et \underline{P}runing (QMSP) to improve the performance of TNOSP. TNOSP reduces redundancy in non-overlapping sequential pattern mining and mines exact non-overlapping sequential patterns more efficiently. Experimental results and comparisons on several datasets show that TNOSP outperforms existing algorithms in terms of precision, efficiency, and scalability.
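To make the idea of top-$k$ mining without \textit{minsup} concrete, the following Python sketch keeps the $k$ best patterns in a min-heap and uses the smallest support in the heap as a threshold that rises during the search. It is an illustration under stated assumptions, not the TNOSP algorithm and not QMSP: the function names `nonoverlap_support` and `top_k_patterns` are hypothetical, the input is a single string, one gap range applies to every pattern, and the support counter is a simple greedy leftmost search standing in for the Nettree-based computation. The pruning step additionally assumes that extending a pattern never increases its support.

```python
import heapq
from collections import deque


def nonoverlap_support(seq, pattern, min_gap, max_gap):
    """Greedily count non-overlapping occurrences of pattern in seq.

    Consecutive matched positions must be separated by min_gap..max_gap
    characters. A position may be reused by another occurrence only at a
    different pattern index. Simplified stand-in, not the Nettree method.
    """
    m = len(pattern)
    used = [set() for _ in range(m)]  # used[j]: positions taken at index j

    def extend(j, prev):
        # Place pattern[j:] after position prev, trying leftmost positions first.
        if j == m:
            return []
        lo = prev + min_gap + 1
        hi = min(prev + max_gap + 1, len(seq) - 1)
        for pos in range(lo, hi + 1):
            if seq[pos] == pattern[j] and pos not in used[j]:
                rest = extend(j + 1, pos)
                if rest is not None:
                    return [pos] + rest
        return None  # no completion from here; the caller backtracks

    count = 0
    for start in range(len(seq)):
        if seq[start] != pattern[0] or start in used[0]:
            continue
        rest = extend(1, start)
        if rest is not None:
            for j, pos in enumerate([start] + rest):
                used[j].add(pos)
            count += 1
    return count


def top_k_patterns(seq, k, min_gap, max_gap):
    """Breadth-first pattern growth with a rising support threshold."""
    alphabet = sorted(set(seq))
    heap = []  # min-heap of (support, pattern); heap[0] is the k-th best so far
    queue = deque(alphabet)
    while queue:
        pattern = queue.popleft()
        sup = nonoverlap_support(seq, pattern, min_gap, max_gap)
        if sup == 0:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (sup, pattern))
        elif sup > heap[0][0]:
            heapq.heapreplace(heap, (sup, pattern))
        else:
            continue  # cannot enter the top-k, and neither can its extensions
        # Extend only while an extension could still beat the current threshold.
        if len(heap) < k or sup > heap[0][0]:
            queue.extend(pattern + c for c in alphabet)
    return sorted(heap, key=lambda t: (-t[0], t[1]))


if __name__ == "__main__":
    for sup, pat in top_k_patterns("abababab", 3, 0, 1):
        print(pat, sup)
```

Ties at the threshold are resolved by arrival order in this sketch, so which equally frequent patterns are kept depends on the traversal order.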