In various real-world applications, ranging from the Internet of Things (IoT) to social media and financial systems, data stream clustering is a critical operation. This paper introduces Benne, a modular and highly configurable data stream clustering algorithm designed to offer a nuanced balance between clustering accuracy and computational efficiency. Benne distinguishes itself by clearly demarcating four pivotal design dimensions: the summarizing data structure, the window model for handling data temporality, the outlier detection mechanism, and the refinement strategy for improving cluster quality. This clear separation not only facilitates a granular understanding of the impact of each design choice on the algorithm's performance but also enhances the algorithm's adaptability to a wide array of application contexts. We provide a comprehensive analysis of these design dimensions, elucidating the challenges and opportunities inherent to each. Furthermore, we conduct a rigorous performance evaluation of Benne, employing diverse configurations and benchmarking it against existing state-of-the-art data stream clustering algorithms. Our empirical results substantiate that Benne either matches or surpasses competing algorithms in terms of clustering accuracy, processing throughput, and adaptability to varying data stream characteristics. This establishes Benne as a valuable asset for both practitioners and researchers in the field of data stream mining.
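The four design dimensions named above can be pictured as separable parts of one streaming loop. The following is a minimal Python sketch under stated assumptions, not Benne's actual implementation. The names `MicroCluster` and `ToyStreamClusterer`, the parameters `radius`, `window_size`, and `min_points`, the restriction to one-dimensional points, and the specific choice made for each dimension (count and linear-sum summaries, a sliding window, a density threshold for outliers, and a merge pass for refinement) are all hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass(eq=False)  # identity-based equality, so list removal targets the exact object
class MicroCluster:
    """Summarizing data structure (assumed): count and linear sum of 1-D points."""
    n: int = 0
    linear_sum: float = 0.0

    @property
    def center(self) -> float:
        return self.linear_sum / self.n

    def add(self, x: float) -> None:
        self.n += 1
        self.linear_sum += x

    def remove(self, x: float) -> None:
        self.n -= 1
        self.linear_sum -= x


class ToyStreamClusterer:
    """Hypothetical illustration of the four dimensions; not the Benne algorithm."""

    def __init__(self, radius: float, window_size: int, min_points: int) -> None:
        self.radius = radius            # max distance for absorbing a point
        self.window_size = window_size  # window model: number of points retained
        self.min_points = min_points    # outlier detection: minimum cluster weight
        self.window: Deque[Tuple[float, MicroCluster]] = deque()
        self.clusters: List[MicroCluster] = []

    def insert(self, x: float) -> None:
        # Window model (assumed: sliding window): forget the oldest point when full.
        if len(self.window) == self.window_size:
            old_x, old_mc = self.window.popleft()
            old_mc.remove(old_x)
            if old_mc.n == 0:
                self.clusters.remove(old_mc)

        # Summarizing data structure: absorb into the nearest micro-cluster,
        # or open a new one if none lies within `radius`.
        nearest = min(self.clusters, key=lambda c: abs(c.center - x), default=None)
        if nearest is None or abs(nearest.center - x) > self.radius:
            nearest = MicroCluster()
            self.clusters.append(nearest)
        nearest.add(x)
        self.window.append((x, nearest))

    def centers(self) -> List[float]:
        # Outlier detection (assumed: density threshold): ignore light micro-clusters.
        dense = [c for c in self.clusters if c.n >= self.min_points]

        # Refinement (assumed: one merge pass): fuse neighbours whose centers
        # are within `radius` of each other, scanning in sorted order.
        merged: List[MicroCluster] = []
        for c in sorted(dense, key=lambda c: c.center):
            if merged and c.center - merged[-1].center <= self.radius:
                last = merged[-1]
                merged[-1] = MicroCluster(last.n + c.n, last.linear_sum + c.linear_sum)
            else:
                merged.append(MicroCluster(c.n, c.linear_sum))
        return [m.center for m in merged]


if __name__ == "__main__":
    clusterer = ToyStreamClusterer(radius=0.5, window_size=100, min_points=2)
    for point in [1.0, 1.1, 0.9, 5.0, 5.2, 4.9, 50.0, 1.05, 5.1]:
        clusterer.insert(point)
    print(clusterer.centers())  # two dense groups; the isolated point 50.0 is dropped
```

Because each dimension occupies its own few lines in this sketch, swapping one choice (for example, replacing the sliding window with a damped window) would leave the others untouched, which is the kind of separation the abstract describes.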