Clustering is widely used for exploratory analysis and scientific discovery, driving insights from market segmentation to biological data analysis, but its outputs can be difficult to interpret, audit, and reproduce as modern datasets become increasingly large and complex. Reliable use of clustering requires understanding which features drive the discovered structure, yet feature-level explanations for clustering remain scarce compared with methods in supervised learning. Furthermore, existing clustering feature importance scores are often tied to specific algorithms and data assumptions. To address these challenges, we propose Cluster LOCO (Leave-One-Covariate-Out), a family of model-agnostic feature importance scores for clustering. Cluster LOCO is built on feature occlusion and clustering generalizability, defined as whether cluster labels learned on one subset of the data can be accurately predicted on held-out samples. For any chosen clustering algorithm, Cluster LOCO quantifies a feature's importance by measuring how much its removal degrades generalizability. We first introduce Cluster LOCO-Split, which relies on data splitting, and then extend it to Cluster LOCO-MP, a minipatch ensemble-based version designed for large-scale data. Across synthetic simulations and an application to cell-type discovery in single-cell transcriptomics, we show that Cluster LOCO more reliably recovers informative features than existing clustering feature importance methods.
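As a rough illustration of the split-based idea described above, the following sketch scores each feature by how much its removal lowers clustering generalizability. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `generalizability` and `cluster_loco_split` are hypothetical, k-means stands in for "any chosen clustering algorithm", a k-nearest-neighbors classifier is assumed as the label predictor, and the adjusted Rand index is assumed as the agreement measure between predicted and independently discovered held-out labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def generalizability(X, n_clusters, seed=0):
    """Agreement between held-out cluster labels and labels predicted
    from a clustering fitted on the other half (assumed definition)."""
    X_fit, X_held = train_test_split(X, test_size=0.5, random_state=seed)
    # Cluster each half independently; k-means is only a stand-in here.
    labels_fit = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(X_fit)
    labels_held = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=seed).fit_predict(X_held)
    # Learn to predict the first half's cluster labels, then apply to held-out data.
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_fit, labels_fit)
    predicted = clf.predict(X_held)
    # ARI is invariant to label permutations, so no label matching is needed.
    return adjusted_rand_score(labels_held, predicted)


def cluster_loco_split(X, n_clusters, seed=0):
    """Leave-one-covariate-out scores: drop in generalizability per removed feature."""
    full = generalizability(X, n_clusters, seed)
    return np.array([
        full - generalizability(np.delete(X, j, axis=1), n_clusters, seed)
        for j in range(X.shape[1])
    ])


# Toy data (assumed for illustration): feature 0 separates two clusters,
# features 1-3 are pure noise.
rng = np.random.default_rng(0)
signal = np.where(rng.random(400) < 0.5, -5.0, 5.0) + rng.normal(size=400)
X = np.column_stack([signal, rng.normal(size=(400, 3))])
scores = cluster_loco_split(X, n_clusters=2)
print(scores.round(2))  # feature 0 should receive the largest score
```

A single split is noisy, so in practice one would likely average the scores over several random splits; the minipatch variant mentioned above is not covered by this sketch.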