Efficient Mining of Low-Utility Sequential Patterns

Discovering valuable insights from rich data is a crucial task in exploratory data analysis. Sequential pattern mining (SPM) has found widespread applications across various domains. In recent years, low-utility sequential pattern mining (LUSPM) has shown strong potential in applications such as intrusion detection and genomic sequence analysis. However, existing research in utility-based SPM focuses on high-utility sequential patterns, and the definitions and strategies used in high-utility SPM cannot be directly applied to LUSPM. Moreover, no algorithms have yet been developed specifically for mining low-utility sequential patterns. To address these problems, we formalize the LUSPM problem, redefine sequence utility, and introduce a compact data structure called the sequence-utility chain to efficiently record utility information. Furthermore, we propose three novel algorithms (LUSPM_b, LUSPM_s, and LUSPM_e) to discover the complete set of low-utility sequential patterns. LUSPM_b serves as an exhaustive baseline, while LUSPM_s and LUSPM_e build upon it, generating subsequences through shrinkage and extension operations, respectively. In addition, we introduce the maximal non-mutually contained sequence set and incorporate multiple pruning strategies, which significantly reduce redundant operations in both LUSPM_s and LUSPM_e. Finally, extensive experimental results demonstrate that both LUSPM_s and LUSPM_e substantially outperform LUSPM_b and exhibit excellent scalability. Notably, LUSPM_e is the more efficient of the two, requiring less runtime and memory than LUSPM_s. Our code is available at https://github.com/Zhidong-Lin/LUSPM.
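To make the problem setting concrete, the following is a minimal brute-force sketch of low-utility pattern enumeration on a toy database. It is not the authors' LUSPM_b and does not use the sequence-utility chain. Everything in it is an assumption made for illustration: each sequence is a list of single-item elements with an attached utility, the utility of a pattern within one sequence is taken as the maximum over its occurrences (the convention of conventional high-utility SPM, which the paper explicitly replaces with its own redefinition), a pattern's total utility is the sum over sequences, and a pattern counts as low-utility when that total does not exceed a user-supplied threshold `max_utility`. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pattern_utilities(database):
    """Return {pattern: total utility} for every subsequence in the database.

    ASSUMPTION: utility within one sequence is the maximum over occurrences;
    the paper redefines sequence utility and may differ.
    """
    totals = defaultdict(int)
    for sequence in database:
        best = {}  # pattern -> highest occurrence utility in this sequence
        for length in range(1, len(sequence) + 1):
            # Every increasing index tuple is one occurrence of a subsequence.
            for positions in combinations(range(len(sequence)), length):
                pattern = tuple(sequence[i][0] for i in positions)
                utility = sum(sequence[i][1] for i in positions)
                if utility > best.get(pattern, float("-inf")):
                    best[pattern] = utility
        for pattern, utility in best.items():
            totals[pattern] += utility
    return dict(totals)


def low_utility_patterns(database, max_utility):
    """Keep only patterns whose assumed total utility is <= max_utility."""
    return {
        pattern: utility
        for pattern, utility in pattern_utilities(database).items()
        if utility <= max_utility
    }


# Toy database: each element is an (item, utility) pair.
database = [
    [("a", 2), ("b", 5), ("a", 1)],
    [("b", 3), ("c", 4)],
]

print(low_utility_patterns(database, max_utility=4))
```

Because the sketch visits every index combination of every sequence, its cost grows exponentially with sequence length, which illustrates why an exhaustive baseline benefits from pruning strategies.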
