As modern data sets continue to grow exponentially in size, the demand for efficient mining algorithms capable of handling them becomes increasingly pressing. This paper develops a memory-efficient approach for Sequential Pattern Mining (SPM), a fundamental topic in knowledge discovery that faces a well-known memory bottleneck for large data sets. Our methodology involves a novel hybrid trie data structure that exploits recurring patterns to store the data set compactly in memory, and a corresponding mining algorithm designed to extract patterns effectively from this compact representation. Numerical results on real-life test instances show that, compared to the state of the art, our approach reduces memory consumption by 88% and computation time by 41% on average for small to medium-sized data sets. Furthermore, our algorithm is the only SPM approach capable of handling large data sets within 256 GB of system memory.
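To make the idea of storing a sequence database in a trie concrete, the following is a minimal sketch and not the hybrid trie proposed in the paper, whose details the abstract does not give. It assumes a plain prefix trie, events consisting of a single item each, and illustrative names (`TrieNode`, `build_trie`, `node_count`, `support`) that do not come from the paper. Sequences sharing a prefix share nodes, and the support of a pattern is counted directly on the trie.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Number of sequences whose prefix reaches this node.
    count: int = 0
    # Maps an item to the child node reached by appending that item.
    children: dict = field(default_factory=dict)


def build_trie(sequences):
    """Insert every sequence into a prefix trie; shared prefixes share nodes."""
    root = TrieNode()
    for seq in sequences:
        node = root
        node.count += 1
        for item in seq:
            node = node.children.setdefault(item, TrieNode())
            node.count += 1
    return root


def node_count(node):
    """Number of item nodes stored below `node` (excluding `node` itself)."""
    return sum(1 + node_count(child) for child in node.children.values())


def support(root, pattern):
    """Count sequences that contain `pattern` as a (possibly gapped) subsequence.

    The pattern is matched greedily along each root-to-leaf path. As soon as it
    is fully matched at a node, every sequence passing through that node
    contains it, so the node's count is added and the subtree is skipped.
    """
    if not pattern:
        return root.count
    total = 0
    stack = [(root, 0)]  # (node, number of pattern items matched so far)
    while stack:
        node, matched = stack.pop()
        for item, child in node.children.items():
            m = matched + 1 if item == pattern[matched] else matched
            if m == len(pattern):
                total += child.count
            else:
                stack.append((child, m))
    return total


# Hypothetical toy database, used only for illustration.
db = [
    ["a", "b", "c", "d"],
    ["a", "b", "c", "e"],
    ["a", "b", "d"],
    ["b", "c", "d"],
]
trie = build_trie(db)
stored_items = sum(len(s) for s in db)  # items in the flat representation
trie_items = node_count(trie)           # item nodes in the trie
```

On this toy database the flat representation holds 14 items while the trie holds 9 nodes, and `support(trie, ["b", "d"])` returns 3. The savings depend entirely on how much the sequences overlap; the figures reported in the abstract refer to the authors' hybrid structure, not to this sketch.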