With the widespread application of efficient pattern mining algorithms, sequential patterns with gap constraints have become a valuable tool for discovering knowledge from biological data such as DNA and protein sequences. Among the various kinds of gap-constrained mining, non-overlapping sequential pattern mining discovers interesting patterns while satisfying the anti-monotonic (Apriori) property. However, existing algorithms do not search for targeted sequential patterns, so they generate many unnecessary and redundant patterns. Targeted pattern mining not only finds patterns that are more interesting to users but also reduces the number of redundant patterns generated, thereby avoiding much irrelevant computation. In this paper, we define and formalize the problem of targeted non-overlapping sequential pattern mining and propose an algorithm named TALENT (TArgeted mining of sequentiaL pattErN with consTraints). Two search methods, breadth-first and depth-first, are designed to handle pattern generation. Furthermore, we present several pruning strategies that reduce the number of sequences and items read from the data and terminate redundant pattern extensions. Finally, we select a series of datasets with different characteristics and conduct extensive experiments to compare TALENT with existing algorithms for mining non-overlapping sequential patterns. The experimental results demonstrate that the proposed targeted mining algorithm, TALENT, is highly efficient and handles many different query settings well.
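To make the notion of non-overlapping occurrences under gap constraints concrete, the following minimal sketch counts them greedily in a single sequence. It is not the TALENT algorithm and is not taken from the paper: the function names, the gap convention (`min_gap <= l[j+1] - l[j] - 1 <= max_gap`), and the leftmost-first greedy strategy are all assumptions made for illustration. Two occurrences are treated as non-overlapping when they never use the same sequence position at the same pattern index; the greedy count is always a valid set of such occurrences, but it may undercount the true maximum support.

```python
from typing import List, Optional, Sequence, Set


def find_occurrence(seq: Sequence[str], pattern: Sequence[str],
                    min_gap: int, max_gap: int,
                    used: List[Set[int]]) -> Optional[List[int]]:
    """Return the leftmost occurrence of `pattern` in `seq` (as a list of
    positions) that reuses no position at the same pattern index, or None.
    Hypothetical helper for illustration only."""
    m = len(pattern)

    def extend(j: int, prev: int) -> Optional[List[int]]:
        # Positions allowed for pattern[j], given where pattern[j-1] matched.
        if j == 0:
            candidates = range(len(seq))
        else:
            lo = prev + min_gap + 1
            hi = min(prev + max_gap + 1, len(seq) - 1)
            candidates = range(lo, hi + 1)
        for pos in candidates:
            if seq[pos] != pattern[j] or pos in used[j]:
                continue
            if j == m - 1:
                return [pos]
            rest = extend(j + 1, pos)  # backtrack if the suffix cannot match
            if rest is not None:
                return [pos] + rest
        return None

    return extend(0, -1)


def greedy_nonoverlap_support(seq: Sequence[str], pattern: Sequence[str],
                              min_gap: int, max_gap: int) -> int:
    """Greedily count non-overlapping occurrences (a lower bound on support)."""
    if not pattern:
        return 0
    used: List[Set[int]] = [set() for _ in pattern]
    count = 0
    while True:
        occ = find_occurrence(seq, pattern, min_gap, max_gap, used)
        if occ is None:
            return count
        # Mark each position as consumed for its own pattern index only,
        # so the same position may still serve a different pattern index.
        for j, pos in enumerate(occ):
            used[j].add(pos)
        count += 1


if __name__ == "__main__":
    print(greedy_nonoverlap_support("ATATAT", "AT", 0, 1))
    print(greedy_nonoverlap_support("AAA", "AA", 0, 0))
```

In the second call, position 1 is used once as the second `A` and once as the first `A`, which the non-overlapping condition permits because the pattern indices differ.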