This paper develops a memory-efficient approach for Sequential Pattern Mining (SPM), a fundamental topic in knowledge discovery that faces a well-known memory bottleneck for large data sets. Our methodology involves a novel hybrid trie data structure that exploits recurring patterns to store the data set compactly in memory, and a corresponding mining algorithm designed to extract patterns effectively from this compact representation. Numerical results on small to medium-sized real-life test instances show an average improvement of 85% in memory consumption and 49% in computation time compared to the state of the art. For large data sets, our algorithm is the only SPM approach capable of completing within 256 GB of system memory, potentially saving 1.7 TB in memory consumption.
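To illustrate why a trie can store a sequence database compactly, the following is a minimal sketch of a plain prefix trie, not the hybrid trie proposed in the paper. It assumes each sequence is a list of single items (real SPM data often uses itemsets), and the names `TrieNode`, `build_trie`, `count_nodes`, and `support` are hypothetical. Sequences that share a prefix share nodes, and the support of a pattern can be counted directly on the trie.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Child nodes keyed by item (hypothetical layout, for illustration only).
    children: dict = field(default_factory=dict)
    # Number of database sequences whose path passes through this node.
    passing: int = 0


def build_trie(sequences):
    """Insert every sequence into a prefix trie; shared prefixes share nodes."""
    root = TrieNode()
    for seq in sequences:
        node = root
        for item in seq:
            node = node.children.setdefault(item, TrieNode())
            node.passing += 1
    return root


def count_nodes(node):
    """Count the nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


def support(node, pattern, matched=0):
    """Count sequences that contain the non-empty `pattern` as a subsequence.

    The pattern is matched greedily along each root-to-leaf path. Once it is
    fully matched at a node, every sequence passing through that node
    contains it, so the subtree below need not be visited.
    """
    if matched == len(pattern):
        return node.passing
    total = 0
    for item, child in node.children.items():
        step = 1 if item == pattern[matched] else 0
        total += support(child, pattern, matched + step)
    return total


if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"], ["b", "c"]]
    root = build_trie(db)
    print("raw items stored:", sum(len(s) for s in db))
    print("trie nodes:", count_nodes(root))
    print("support of <a, c>:", support(root, ["a", "c"]))
```

In this toy database the trie needs fewer nodes than the raw item count because three sequences share the prefix `a`. The savings reported in the abstract come from the paper's hybrid structure, which this sketch does not reproduce.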