Clustering large, high-dimensional datasets with diverse variables is essential for extracting high-level latent information from them. Here, we develop an unsupervised clustering algorithm that we call "Village-Net". Village-Net is designed to cluster high-dimensional data effectively without a priori knowledge of the number of clusters. The algorithm operates in two phases. First, it uses K-Means clustering to divide the dataset into distinct subsets that we refer to as "villages". Next, it constructs a weighted network in which each node represents a village and the edges capture the proximity relationships between villages. To achieve optimal clustering, we process this network with the Walk-likelihood Community Finder (WLCF), a community detection algorithm developed by one of our team members. A salient feature of Village-Net Clustering is its ability to autonomously determine an optimal number of clusters for further analysis, based on inherent characteristics of the data. We present extensive benchmarking on existing real-world datasets with known ground-truth labels to showcase its competitive performance relative to other state-of-the-art methods, particularly in terms of the normalized mutual information (NMI) score. The algorithm is computationally efficient, with a time complexity of O(N*k*d), where N is the number of instances, k is the number of villages, and d is the dimension of the dataset, which makes it well suited to large-scale datasets.
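The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions. The function name `village_net_sketch` is hypothetical. The abstract does not say how edge weights are computed, so the sketch assumes that the weight between two villages counts the points whose two nearest village centroids are that pair. WLCF is not available in standard libraries, so networkx's greedy modularity routine stands in for it. The sketch relies on scikit-learn and networkx being installed.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def village_net_sketch(X, n_villages, seed=0):
    """Two-phase sketch: K-Means villages, then communities of the village graph."""
    # Phase 1: over-partition the data into k "villages" with K-Means.
    km = KMeans(n_clusters=n_villages, n_init=10, random_state=seed).fit(X)
    dist = km.transform(X)                         # (N, k) distances to centroids
    two_nearest = np.argsort(dist, axis=1)[:, :2]  # two closest villages per point
    village = two_nearest[:, 0]                    # nearest village of each point

    # Phase 2a: weighted village network.
    # ASSUMPTION (not from the abstract): the weight of edge (a, b) is the number
    # of points whose nearest and second-nearest centroids are a and b.
    G = nx.Graph()
    G.add_nodes_from(range(n_villages))
    for a, b in two_nearest:
        a, b = int(a), int(b)
        w = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

    # Phase 2b: community detection on the village network.
    # STAND-IN: greedy modularity maximization is used here instead of WLCF.
    communities = greedy_modularity_communities(G, weight="weight")
    village_to_cluster = np.full(n_villages, -1, dtype=int)
    for cluster_id, members in enumerate(communities):
        village_to_cluster[list(members)] = cluster_id

    # Each point inherits the cluster of its village; the number of clusters
    # is chosen by the community detection step, not supplied by the caller.
    return village_to_cluster[village], len(communities)


if __name__ == "__main__":
    X, _ = make_blobs(n_samples=600, centers=3, n_features=10, random_state=0)
    labels, n_clusters = village_net_sketch(X, n_villages=30)
    print(n_clusters, np.bincount(labels))
```

In this sketch the K-Means assignment step dominates the cost, since it compares N points with k centroids in d dimensions, which is in line with the O(N*k*d) scaling stated above. The community detection step operates only on the k village nodes. Note that the stand-in community finder may split the data differently from WLCF, so the cluster counts it produces should not be read as results of Village-Net.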