A major limitation of clustering approaches is their lack of explainability: existing methods rarely provide insight into which features drive the grouping of similar observations. To address this limitation, we propose an ensemble-based clustering framework that integrates bagging and feature dropout to generate feature importance scores, in analogy with feature importance mechanisms in supervised random forests. By leveraging multiple bootstrap resampling schemes and aggregating the resulting partitions, the method improves the stability and robustness of the cluster definition, particularly in small-sample or noisy settings. Feature importance is assessed through an information-theoretic approach: at each step, the mutual information between each feature and the estimated cluster labels is computed and weighted by a measure of clustering validity, so that well-formed partitions contribute more, and the weighted values are then aggregated into a final score. The method outputs both a consensus partition and a corresponding measure of feature importance, enabling a unified interpretation of clustering structure and variable relevance. Its effectiveness is demonstrated on multiple simulated and real-world datasets.
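The pipeline described in the abstract (bootstrap resampling, feature dropout, validity-weighted mutual information, consensus partition) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not name a base clusterer, validity measure, mutual-information estimator, or consensus rule, so the choices below are assumptions made for illustration. The sketch uses Lloyd's k-means as the base clusterer, the mean silhouette width (clipped at zero) as the validity weight, a plug-in mutual information estimate on quantile-binned features, and k-means on the rows of a co-association matrix for the consensus step. All function and parameter names (`ensemble_importance`, `keep_frac`, `n_rounds`, and so on) are hypothetical.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means on the rows of X; returns integer labels."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old center if a cluster becomes empty.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def silhouette(X, labels):
    """Mean silhouette width, used here as an assumed validity measure."""
    ks = np.unique(labels)
    if len(ks) < 2:
        return 0.0
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() < 2:
            continue  # singleton cluster: silhouette defined as 0
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, labels == c].mean() for c in ks if c != labels[i])
        s[i] = (b - a) / max(a, b, 1e-12)
    return float(s.mean())

def mutual_info(x, labels, bins=5):
    """Plug-in mutual information (nats) between a quantile-binned feature and labels."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    xb = np.digitize(x, edges)  # bin index in 0..bins-1
    ks, lab = np.unique(labels, return_inverse=True)
    joint = np.zeros((bins, len(ks)))
    np.add.at(joint, (xb, lab), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def ensemble_importance(X, k=2, n_rounds=50, keep_frac=0.7, seed=0):
    """Return (consensus labels, feature importance) for data matrix X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_keep = max(1, int(keep_frac * p))
    score, weight_sum = np.zeros(p), 0.0
    together, sampled = np.zeros((n, n)), np.zeros((n, n))
    for _ in range(n_rounds):
        # Simplification: use the distinct in-bag rows of a bootstrap draw.
        rows = np.unique(rng.integers(0, n, size=n))
        cols = rng.choice(p, size=n_keep, replace=False)  # feature dropout
        sub = X[np.ix_(rows, cols)]
        labels = kmeans(sub, k, rng)
        w = max(silhouette(sub, labels), 0.0)  # validity weight
        # Assumption: MI is scored for every feature, including dropped ones.
        score += w * np.array([mutual_info(X[rows, j], labels) for j in range(p)])
        weight_sum += w
        together[np.ix_(rows, rows)] += labels[:, None] == labels[None, :]
        sampled[np.ix_(rows, rows)] += 1
    importance = score / max(weight_sum, 1e-12)
    coassoc = together / np.maximum(sampled, 1)
    consensus = kmeans(coassoc, k, rng)  # cluster co-association profiles
    return consensus, importance

# Toy data: features 0 and 1 separate two groups, features 2 and 3 are noise.
rng = np.random.default_rng(1)
truth = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 4))
X[:, :2] += 6 * truth[:, None]
consensus, importance = ensemble_importance(X)
print(np.round(importance, 2))
```

On this toy data the two informative features should receive clearly higher scores than the two noise features, because rounds that cluster on noise alone yield low silhouette values and are down-weighted. Whether the original method scores dropped features, and how it combines partitions, is not stated in the abstract, so those parts of the sketch may differ from the paper.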