Selecting a clustering algorithm and its hyperparameters without labels is a common difficulty in engineering machine learning pipelines that perform unsupervised analysis of sensor, image, or process data. Clustering validation indices (CVIs) provide internal scores for ranking candidate clusterings, but most popular CVIs are built from Euclidean compactness and separation terms and therefore tend to favour compact, convex partitions. Their performance is known to degrade on non-convex, irregular, or variable-density data, where kernel transformations or alternative distance measures are typically used at the cost of additional tuning and computation. This paper introduces the Central Description Length clustering validation index (CDL-CVI). CDL-CVI uses the observed within-cluster compactness, the estimated cluster centers, and the estimated cluster covariances to compute a probabilistic upper bound on the description length associated with the unobservable true cluster centers. The bound condenses intra-cluster compactness and centroid displacement into a single computable quantity and can be evaluated on the partition produced by any clustering algorithm. The implementation uses only observable quantities (the data, the partition, the estimated centers, and the estimated covariances) and requires no ground-truth labels. On synthetic benchmarks with non-convex and arbitrary-shape clusters, CDL-CVI selected the reference number of clusters more often and reached higher Adjusted Rand Index (ARI) values than the conventional CVIs we tested, without an additional kernel preprocessing stage. On image benchmarks (MNIST, CIFAR-10, STL-10) clustered from frozen unsupervised embeddings, CDL-CVI returned cluster numbers close to the reference class counts across K-means, DBSCAN, and spectral clustering in the reported trials.
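For readers unfamiliar with CVI-based model selection, the following minimal sketch illustrates the workflow that the abstract refers to: score each candidate partition with an internal index, keep the best-scoring one, and (only where reference labels exist, as in a benchmark) report the ARI. It is not the CDL-CVI itself, whose formula is not given in this abstract. It uses the silhouette coefficient as a stand-in for a conventional Euclidean CVI, and the toy data, the candidate partitions, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (a conventional CVI).

    Assumes at least two clusters are present in `labels`.
    """
    scores = []
    for i, p in enumerate(points):
        # Accumulate distances from p to every other point, grouped by cluster.
        sums, counts = defaultdict(float), defaultdict(int)
        for j, q in enumerate(points):
            if i != j:
                sums[labels[j]] += math.dist(p, q)
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(sums[c] / counts[c] for c in counts if c != own)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the pair-counting contingency table (needs reference labels)."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(math.comb(c, 2) for c in cells.values())
    sum_rows = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / math.comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy data: three compact, convex blobs of four points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]
reference = [0] * 4 + [1] * 4 + [2] * 4

# Hypothetical candidate partitions, as if produced by runs with k = 2, 3, 4.
candidates = {
    2: [0] * 4 + [1] * 8,                    # two blobs merged
    3: [0] * 4 + [1] * 4 + [2] * 4,          # matches the reference
    4: [0, 0, 3, 3] + [1] * 4 + [2] * 4,     # one blob split in half
}

# Rank the candidates by the internal score; no labels are used here.
scores = {k: silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)

# External check, possible only because this toy example has reference labels.
ari = adjusted_rand_index(reference, candidates[best_k])
print(best_k, round(scores[best_k], 3), ari)
```

On compact, convex blobs like these, a Euclidean index of this kind ranks the reference partition first. The abstract's point is that the same kind of score can misrank partitions of non-convex or variable-density data, which is the setting CDL-CVI targets.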