Systematic theory for data stream clustering algorithms remains relatively scarce. Density-based methods have recently gained attention, but existing algorithms struggle to handle arbitrarily shaped, multi-density, high-dimensional data while remaining robust to outliers, and clustering quality deteriorates significantly when data density varies in complex ways. This paper proposes a clustering algorithm based on the novel concept of Tightest Neighbors and introduces a data stream clustering theory based on the Skeleton Set. Building on both, it develops TNStream, a fully online algorithm. TNStream adaptively determines the clustering radius from local similarity and summarizes the evolution of multi-density data streams in micro-clusters, then applies the Tightest Neighbors-based clustering algorithm to form the final clusters. To improve efficiency in high-dimensional settings, Locality-Sensitive Hashing (LSH) is used to organize the micro-clusters, addressing the challenge of storing k-nearest neighbors. TNStream is evaluated on a range of synthetic and real-world datasets using several clustering metrics. The experimental results demonstrate its effectiveness in improving clustering quality on multi-density data and validate the proposed data stream clustering theory.
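As an illustration of how LSH can organize micro-clusters so that nearest-neighbor candidates are looked up per bucket rather than stored for every micro-cluster, the following is a minimal sketch and not TNStream's implementation. The abstract does not specify the hash family, so sign-random-projection hashing is assumed here (it approximates angular rather than Euclidean similarity), and the names `make_hyperplanes`, `lsh_key`, `build_index`, and `candidate_neighbors` are hypothetical.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals (assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(point, hyperplanes):
    """One bit per hyperplane: the side of the plane the point falls on."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, point)) >= 0.0 else 0
        for plane in hyperplanes
    )


def build_index(centers, hyperplanes):
    """Group micro-cluster center indices by their hash key."""
    buckets = defaultdict(list)
    for idx, center in enumerate(centers):
        buckets[lsh_key(center, hyperplanes)].append(idx)
    return buckets


def candidate_neighbors(query, centers, buckets, hyperplanes, k):
    """Return up to k nearest centers among those sharing the query's bucket."""
    ids = buckets.get(lsh_key(query, hyperplanes), [])
    return sorted(ids, key=lambda i: math.dist(query, centers[i]))[:k]


# Hypothetical micro-cluster centers in three dimensions.
centers = [(1.0, 0.2, 0.0), (0.9, 0.3, 0.1), (-1.0, 0.5, 0.4), (0.0, -1.0, 2.0)]
planes = make_hyperplanes(dim=3, n_bits=4, seed=42)
index = build_index(centers, planes)
neighbors = candidate_neighbors(centers[0], centers, index, planes, k=2)
```

Because only one bucket is searched, the result is approximate: a true neighbor that hashes to a different bucket is missed. Practical schemes usually reduce this risk by querying several independent hash tables.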