Cluster Metric Sensitivity to Irrelevant Features

Clustering algorithms are used extensively in data analysis for data exploration and discovery. Technological advancements lead to continual growth of data in terms of volume, dimensionality and complexity. This provides great opportunities in data analytics, as the data can be interrogated for many different purposes. However, it also creates challenges, such as identifying the features relevant to a given task. In supervised tasks, one can utilise a number of methods to optimise the input features for the task objective (e.g. classification accuracy). In unsupervised problems, such tools are not readily available, in part due to an inability to quantify feature relevance in unlabeled tasks. In this paper, we investigate the sensitivity of clustering performance to noisy, uncorrelated variables that are iteratively added to baseline datasets with well-defined clusters. We show how different types of irrelevant variables can impact the outcome of a clustering result from $k$-means in different ways. We observe a resilience to very high proportions of irrelevant features for the adjusted Rand index (ARI) and normalised mutual information (NMI) when the irrelevant features are Gaussian distributed. For uniformly distributed irrelevant features, we notice that the resilience of ARI and NMI depends on the dimensionality of the data and exhibits tipping points between high scores and near-zero scores. Our results show that the Silhouette Coefficient and the Davies-Bouldin score are the most sensitive to added irrelevant features, exhibiting large changes in score for comparatively low proportions of irrelevant features, regardless of the underlying distribution or data scaling. As such, the Silhouette Coefficient and the Davies-Bouldin score are good candidates for optimising feature selection in unsupervised clustering tasks.
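The kind of experiment described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's experimental setup: it uses scikit-learn, and the function name `cluster_scores`, the three blob centres, the sample size, the cluster spread and the noise ranges are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    davies_bouldin_score,
    normalized_mutual_info_score,
    silhouette_score,
)

# Hypothetical baseline: three well-separated clusters in two dimensions.
CENTERS = np.array([[-5.0, -5.0], [0.0, 5.0], [5.0, -5.0]])


def cluster_scores(n_irrelevant, kind="gaussian", seed=0):
    """Cluster a baseline dataset padded with irrelevant features.

    Returns the four metrics named in the abstract for a single
    k-means run. All parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    X, y = make_blobs(
        n_samples=300, centers=CENTERS, cluster_std=0.5, random_state=seed
    )

    # Irrelevant features are drawn independently of the cluster labels.
    shape = (X.shape[0], n_irrelevant)
    if kind == "gaussian":
        noise = rng.normal(loc=0.0, scale=1.0, size=shape)
    elif kind == "uniform":
        noise = rng.uniform(low=-1.0, high=1.0, size=shape)
    else:
        raise ValueError(f"unknown noise kind: {kind!r}")
    X_aug = np.hstack([X, noise])

    labels = KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X_aug)

    return {
        # External metrics: compare predicted labels with the ground truth.
        "ari": adjusted_rand_score(y, labels),
        "nmi": normalized_mutual_info_score(y, labels),
        # Internal metrics: computed from the augmented data and labels only.
        "silhouette": silhouette_score(X_aug, labels),
        "davies_bouldin": davies_bouldin_score(X_aug, labels),
    }


if __name__ == "__main__":
    for n in (0, 5, 20):
        print(n, cluster_scores(n, kind="gaussian"))
```

Calling `cluster_scores` with an increasing `n_irrelevant` gives one way to see how each metric responds as irrelevant features are added. Note that ARI and NMI require ground-truth labels, whereas the Silhouette Coefficient and the Davies-Bouldin score do not, which is why the latter two are usable in a genuinely unlabeled setting.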
