Data stream clustering is a critical operation in many real-world applications, ranging from the Internet of Things (IoT) to social media and financial systems. Existing data stream clustering algorithms, while effective to varying extents, often lack the flexibility and self-optimization capabilities needed to adapt to diverse workload characteristics such as outliers, cluster evolution, and changing dimensionality of data points. These limitations manifest as suboptimal clustering accuracy and computational inefficiency. In this paper, we introduce MOStream, a modular and self-optimizing data stream clustering algorithm designed to dynamically balance clustering accuracy and computational efficiency at runtime. MOStream distinguishes itself by its adaptivity, clearly separating four pivotal design dimensions: the summarizing data structure, the window model for handling data temporality, the outlier detection mechanism, and the refinement strategy for improving cluster quality. This separation allows design choices to be varied flexibly and improves MOStream's adaptability to a wide range of application contexts. We conduct a rigorous performance evaluation of MOStream under diverse configurations, benchmarking it against 9 representative data stream clustering algorithms on 4 real-world datasets and 3 synthetic datasets. Our empirical results demonstrate that MOStream consistently surpasses competing algorithms in terms of clustering accuracy, processing throughput, and adaptability to varying data stream characteristics.
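To make the four design dimensions concrete, the following is a minimal sketch of how a stream clusterer might expose them as interchangeable components. It is not MOStream's implementation: the class `ModularStreamClusterer`, the helpers `damped_window`, `distance_outlier`, and `prune_light`, and all parameter values are hypothetical names chosen for illustration, and the runtime self-optimization that selects among components is not modeled here.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class MicroCluster:
    """Summarizing data structure: a weight and a per-dimension linear sum."""
    weight: float
    linear_sum: List[float]

    def centroid(self) -> List[float]:
        return [s / self.weight for s in self.linear_sum]

    def absorb(self, point: Sequence[float]) -> None:
        self.weight += 1.0
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]

    def scale(self, factor: float) -> None:
        self.weight *= factor
        self.linear_sum = [s * factor for s in self.linear_sum]


def damped_window(decay: float) -> Callable[[List[MicroCluster]], None]:
    """Window model: fade every summary by a constant factor per arrival."""
    def apply(clusters: List[MicroCluster]) -> None:
        for c in clusters:
            c.scale(decay)
    return apply


def distance_outlier(radius: float) -> Callable[[float], bool]:
    """Outlier detection: flag a point farther than `radius` from every centroid."""
    return lambda nearest_distance: nearest_distance > radius


def prune_light(min_weight: float) -> Callable[[List[MicroCluster]], List[MicroCluster]]:
    """Refinement: drop summaries whose faded weight fell below `min_weight`."""
    return lambda clusters: [c for c in clusters if c.weight >= min_weight]


@dataclass
class ModularStreamClusterer:
    """Hypothetical clusterer assembled from three pluggable strategies."""
    window: Callable[[List[MicroCluster]], None]
    is_outlier: Callable[[float], bool]
    refine: Callable[[List[MicroCluster]], List[MicroCluster]]
    clusters: List[MicroCluster] = field(default_factory=list)

    def insert(self, point: Sequence[float]) -> None:
        self.window(self.clusters)  # age the existing summaries
        nearest, best = None, math.inf
        for c in self.clusters:
            d = math.dist(c.centroid(), point)
            if d < best:
                nearest, best = c, d
        if nearest is None or self.is_outlier(best):
            # In this sketch an outlier simply seeds a new summary.
            self.clusters.append(MicroCluster(1.0, list(point)))
        else:
            nearest.absorb(point)
        self.clusters = self.refine(self.clusters)


model = ModularStreamClusterer(
    window=damped_window(0.9),
    is_outlier=distance_outlier(1.0),
    refine=prune_light(0.1),
)
for p in [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]:
    model.insert(p)
print(len(model.clusters))
```

Because each dimension is an independent callable, swapping the window model or the outlier rule changes one constructor argument and leaves the insertion logic untouched, which is the kind of separation the abstract describes.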