Streaming data clustering is an active research topic in data mining and machine learning. Compared with static data, streaming data, which is usually analyzed in chunks, is more prone to the dynamic cluster imbalance problem: the degree of imbalance among clusters varies from one chunk to the next, which degrades either the accuracy or the efficiency of existing clustering methods. We therefore propose an efficient approach, Learning Self-Refined Organizing Map (LSROM), for imbalanced streaming data clustering, built on an advanced self-organizing map (SOM) that represents the global data distribution. The constructed SOM is first refined to guide the partition of the dataset into many micro-clusters, so that small clusters in imbalanced data are not missed. The micro-clusters are then merged efficiently through quick retrieval based on the SOM, which automatically yields the true number of imbalanced clusters. Compared with existing imbalanced data clustering approaches, LSROM has a lower time complexity of $O(n\log n)$ while achieving very competitive clustering accuracy. Moreover, LSROM is interpretable and insensitive to hyper-parameters. Extensive experiments verify its efficacy.
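The three stages named above (fit a SOM, partition the data into micro-clusters, merge the micro-clusters) can be illustrated with a small generic sketch. This is not the LSROM algorithm: the abstract does not specify the refinement step or the SOM-based retrieval, so the sketch swaps in a classic 1-D online SOM and a simple distance-threshold merge, which is quadratic in the number of prototypes rather than $O(n\log n)$ overall. The function names and the parameters `n_units`, `lr0`, `sigma0`, and `threshold` are all hypothetical.

```python
import numpy as np


def train_som(X, n_units=8, n_epochs=10, lr0=0.5, sigma0=2.0, seed=0):
    """Fit a 1-D chain of SOM prototypes with the classic online update.

    Generic textbook SOM, used here only as a stand-in for the paper's
    self-refined map. Requires len(X) >= n_units.
    """
    rng = np.random.default_rng(seed)
    # Initialise prototypes from randomly chosen data points.
    W = X[rng.choice(len(X), size=n_units, replace=False)].astype(float)
    grid = np.arange(n_units)  # positions of the units on the 1-D chain
    total_steps = n_epochs * len(X)
    step = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            decay = 1.0 - step / total_steps
            lr = lr0 * decay                  # learning rate shrinks over time
            sigma = max(sigma0 * decay, 0.5)  # neighbourhood width shrinks too
            # Best-matching unit: the prototype closest to the sample.
            bmu = np.argmin(((W - X[i]) ** 2).sum(axis=1))
            # Gaussian neighbourhood on the chain, centred at the BMU.
            h = np.exp(-((grid - bmu) ** 2) / (2.0 * sigma ** 2))
            W += lr * h[:, None] * (X[i] - W)
            step += 1
    return W


def assign_micro_clusters(X, W):
    """Label each point with the index of its nearest prototype."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def merge_micro_clusters(W, micro, threshold):
    """Merge non-empty micro-clusters whose prototypes lie within `threshold`.

    Assumed merge rule (single linkage via union-find); the paper's
    SOM-based retrieval is not described in the abstract.
    """
    parent = list(range(len(W)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    used = np.unique(micro)  # ignore prototypes that attracted no points
    for a in used:
        for b in used:
            if a < b and np.linalg.norm(W[a] - W[b]) <= threshold:
                parent[find(b)] = find(a)
    roots = np.array([find(m) for m in micro])
    # Relabel merged groups as consecutive integers 0, 1, 2, ...
    _, final = np.unique(roots, return_inverse=True)
    return final
```

For one data chunk `X`, a possible call sequence is `W = train_som(X)`, then `micro = assign_micro_clusters(X, W)`, then `labels = merge_micro_clusters(W, micro, threshold)`. The number of distinct values in `labels` is the cluster count this sketch arrives at; how well it recovers small clusters depends on `n_units` and `threshold`, both of which would need tuning per dataset.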