Clustering is a well-established technique in machine learning and data analysis, widely used across various domains. Cluster validity indices, such as the Average Silhouette Width, Calinski-Harabasz, and Davies-Bouldin indices, play a crucial role in assessing clustering quality when external ground truth labels are unavailable. However, these measures can be affected by different degrees of feature relevance, potentially leading to unreliable evaluations in high-dimensional or noisy data sets. We introduce a theoretically grounded Feature Importance Rescaling (FIR) method that enhances the quality of clustering validation by adjusting feature contributions based on their dispersion. It attenuates noise features, clarifies clustering compactness and separation, and thereby aligns clustering validation more closely with the ground truth. Through extensive experiments on synthetic data sets under different configurations and a case study on real-world data, we demonstrate that FIR consistently improves the correlation between the values of cluster validity indices and the ground truth, particularly in settings with noisy or irrelevant features. The results show that FIR increases the robustness of clustering evaluation, reduces variability in performance across different data sets, and remains effective even when clusters exhibit significant overlap. These findings highlight the potential of FIR as a valuable enhancement of clustering validation, making it a practical tool for unsupervised learning tasks where labelled data is unavailable.
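The core idea of dispersion-based rescaling can be illustrated with a minimal sketch. The exact FIR formula is not given in the abstract, so the weight used below (per-feature ratio of between-cluster variance to total variance) is a hypothetical instantiation; the Calinski-Harabasz index is computed from its standard definition. The function names `dispersion_weights` and `calinski_harabasz` are illustrative, not from the paper.

```python
import numpy as np

def dispersion_weights(X, labels):
    """Hypothetical dispersion-based feature weights (not the paper's
    exact FIR formula): between-cluster variance / total variance,
    computed independently per feature. Noise features score near 0."""
    grand = X.mean(axis=0)
    total = ((X - grand) ** 2).sum(axis=0)
    between = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        between += len(Xc) * (Xc.mean(axis=0) - grand) ** 2
    return between / np.maximum(total, 1e-12)

def calinski_harabasz(X, labels):
    """Standard Calinski-Harabasz index: ratio of between-cluster to
    within-cluster dispersion, normalised by degrees of freedom."""
    n, k = len(X), len(np.unique(labels))
    grand = X.mean(axis=0)
    B = W = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu = Xc.mean(axis=0)
        B += len(Xc) * ((mu - grand) ** 2).sum()
        W += ((Xc - mu) ** 2).sum()
    return (B / (k - 1)) / (W / (n - k))

rng = np.random.default_rng(0)
# Two informative features (well-separated clusters) plus eight
# pure-noise features, mimicking the noisy settings in the experiments.
a = rng.normal(0.0, 1.0, size=(100, 2))
b = rng.normal(5.0, 1.0, size=(100, 2))
noise = rng.normal(0.0, 1.0, size=(200, 8))
X = np.hstack([np.vstack([a, b]), noise])
labels = np.array([0] * 100 + [1] * 100)

w = dispersion_weights(X, labels)
ch_raw = calinski_harabasz(X, labels)
ch_fir = calinski_harabasz(X * w, labels)  # attenuate noise features
print(ch_raw, ch_fir)
```

Rescaling shrinks the noise dimensions (whose weights are near zero) while leaving the informative ones largely intact, so the index computed on `X * w` reflects the true cluster structure more faithfully than the raw index.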