Sequential pattern mining (SPM) is an important branch of knowledge discovery that aims to mine frequent subsequences (patterns) in a sequence database. Many SPM methods have been investigated, and most of them are classical SPM methods, which consider only whether or not a given pattern occurs within a sequence. Classical SPM can therefore find the common features of sequences, but it ignores the number of occurrences of a pattern in each sequence, which reflects the degree of interest of specific users. To solve this problem, this paper addresses repetitive nonoverlapping sequential pattern (RNP) mining and proposes the RNP-Miner algorithm. To reduce the number of candidate patterns, RNP-Miner adopts an itemset pattern join strategy. To improve the efficiency of support calculation, RNP-Miner uses a candidate support calculation algorithm based on a position dictionary. To validate the performance of RNP-Miner, 10 competitive algorithms and 20 sequence databases were selected. The experimental results verify that RNP-Miner outperforms the other algorithms, and that clustering on RNPs achieves better performance than clustering on raw data or on classical frequent patterns. All the algorithms were developed in the PyCharm environment and can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/RNP-Miner.
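To make the difference between classical and repetitive support concrete, the following is a minimal Python sketch of counting repeated occurrences of a pattern within one sequence by means of a position dictionary. It is not the RNP-Miner implementation. It assumes, for simplicity, that every element of the sequence is a single item rather than an itemset, and that two occurrences are nonoverlapping when they never use the same sequence position at the same pattern index. The names `build_position_dict` and `nonoverlapping_support` are hypothetical, and the leftmost-first greedy search is an illustrative choice.

```python
from bisect import bisect_right
from collections import defaultdict


def build_position_dict(sequence):
    """Map each item to the sorted list of positions where it occurs."""
    positions = defaultdict(list)
    for index, item in enumerate(sequence):
        positions[item].append(index)
    return positions


def nonoverlapping_support(sequence, pattern):
    """Greedily count occurrences of `pattern` in `sequence`.

    Assumption (simplified): a position may not be reused at the same
    pattern index by two different occurrences.
    """
    if not pattern:
        return 0
    positions = build_position_dict(sequence)
    # last_used[j] is the position matched to pattern[j] in the previous occurrence.
    last_used = [-1] * len(pattern)
    count = 0
    while True:
        previous = -1  # position matched to the preceding pattern item
        chosen = []
        for j, item in enumerate(pattern):
            candidates = positions.get(item, [])
            # The next match must lie after the preceding item of this
            # occurrence and after the position used at index j before.
            lower_bound = max(previous, last_used[j])
            k = bisect_right(candidates, lower_bound)
            if k == len(candidates):
                return count  # no further occurrence can be completed
            previous = candidates[k]
            chosen.append(previous)
        last_used = chosen
        count += 1
```

For example, `nonoverlapping_support(list("aabbab"), list("ab"))` returns 3, whereas a classical SPM method would only record that the pattern occurs in this sequence and count it once.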