Sequential pattern mining (SPM) with gap constraints (also known as repetitive SPM, or tandem repeat discovery in bioinformatics) finds frequent repetitive subsequences that satisfy gap constraints; these are called positive sequential patterns with gap constraints (PSPGs). However, classical SPM with gap constraints cannot find items that are frequently missing from PSPGs. To tackle this issue, this paper explores negative sequential patterns with gap constraints (NSPGs). We propose an efficient algorithm, NSPG-Miner, that mines frequent PSPGs and NSPGs simultaneously. To reduce the number of candidate patterns, we propose a pattern join strategy with negative patterns that generates positive and negative candidate patterns at the same time. To calculate the support (frequency of occurrence) of a pattern in each sequence, we introduce the NegPair algorithm, which employs a key-value pair array structure to handle the gap constraints and the negative items simultaneously and avoids redundant rescanning of the original sequence, thus improving efficiency. To evaluate the performance of NSPG-Miner, 11 competitive algorithms and 11 datasets are employed. The experimental results not only validate the effectiveness of the strategies adopted by NSPG-Miner, but also verify that NSPG-Miner can discover more valuable information than the state-of-the-art algorithms. Algorithms and datasets can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/NSPG-Miner.
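To make the notion of support under gap constraints and negative items concrete, the following is a minimal sketch and not the NegPair algorithm itself. It assumes a simplified semantics that the abstract does not spell out: an occurrence is any tuple of positions matching the positive items in order, each gap between consecutive positions must contain between `min_gap` and `max_gap` characters, and a negative item must not appear anywhere inside the gap it is attached to. The function name `count_occurrences` and the `(item, forbidden)` pattern encoding are hypothetical, chosen only for illustration.

```python
from typing import List, Set, Tuple

# (item, forbidden): `forbidden` holds items that must NOT occur
# in the gap immediately before `item` (ignored for the first element).
Element = Tuple[str, Set[str]]


def count_occurrences(seq: str, pattern: List[Element],
                      min_gap: int, max_gap: int) -> int:
    """Count occurrences of `pattern` in `seq` under the assumed semantics."""
    if not pattern:
        return 0
    first_item, _ = pattern[0]
    # ends[j] = number of occurrences of the current pattern prefix
    # whose last positive item sits at index j.
    ends = [1 if ch == first_item else 0 for ch in seq]
    for item, forbidden in pattern[1:]:
        nxt = [0] * len(seq)
        for j, ch in enumerate(seq):
            if ch != item:
                continue
            # The previous position i must satisfy min_gap <= j - i - 1 <= max_gap.
            lo = max(0, j - 1 - max_gap)
            hi = j - 1 - min_gap
            for i in range(lo, hi + 1):
                gap_items = set(seq[i + 1:j])
                if ends[i] and not (forbidden & gap_items):
                    nxt[j] += ends[i]
        ends = nxt
    return sum(ends)


seq = "acbab"
positive = [("a", set()), ("b", set())]   # a followed by b
negative = [("a", set()), ("b", {"c"})]   # a followed by b, with no c in between
print(count_occurrences(seq, positive, 0, 2))  # 2
print(count_occurrences(seq, negative, 0, 2))  # 1
```

In this toy sequence the positive pattern occurs at positions (0, 2) and (3, 4), while the negative variant drops the first occurrence because `c` falls inside its gap. The sketch rescans each gap for every candidate pair, which is the kind of redundant work the abstract says the key-value pair array structure is meant to avoid.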