Clustering is widely used across the sciences as the foundation for downstream data-driven scientific discoveries. However, clustering results are highly sensitive to the choice of algorithm, preprocessing, and the number of clusters $k$, so the resulting scientific claims are often not reproducible. The current state of the art for validating clustering solutions consists of clustering validation indices (CVIs) such as Silhouette, Davies-Bouldin, and Calinski-Harabasz, which rely on geometric assumptions that break down on the heavy-tailed, high-dimensional, and nonlinearly structured data encountered in biomedical research. Resampling-based alternatives, grounded in the ideas of clustering stability and generalizability, have been proposed but remain scattered across specialized tools with no unified, accessible software. We fill this gap with CARVE (Cluster Analysis with Resampling for Validation and Exploration), an open-source Python and R package that jointly evaluates multiple clustering algorithms and hyperparameters, returning stability and generalizability diagnostics at the global, cluster, and sample level together with principled selection rules and consensus-based cluster labels. Across six synthetic benchmarks, CARVE consistently recovers near-optimal clusterings where classical indices degrade substantially. On experimental genomics and proteomics data sets, CARVE recovers finer biological structure where classical CVIs collapse entirely. CARVE is available with a scikit-learn-compatible Python API and an analogous R interface compatible with Seurat workflows.
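To make the notion of resampling-based stability concrete, the following is a minimal sketch written directly against scikit-learn. It is not CARVE's API: the function name `subsample_stability`, the use of k-means as the clusterer, the adjusted Rand index as the agreement measure, and the toy data are all assumptions chosen for illustration. The idea is to cluster pairs of random subsamples independently and measure how well the two labelings agree on the samples they share.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def subsample_stability(X, k, n_pairs=10, frac=0.8, seed=0):
    """Hypothetical helper: mean adjusted Rand index between k-means
    clusterings of pairs of random subsamples, compared on shared samples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = int(frac * n)
    scores = []
    for _ in range(n_pairs):
        # Draw two subsamples without replacement.
        idx_a = rng.choice(n, size=m, replace=False)
        idx_b = rng.choice(n, size=m, replace=False)
        # Cluster each subsample independently.
        labels_a = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx_a])
        labels_b = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx_b])
        # Locate the samples present in both subsamples and compare their labels.
        _, pos_a, pos_b = np.intersect1d(idx_a, idx_b, return_indices=True)
        scores.append(adjusted_rand_score(labels_a[pos_a], labels_b[pos_b]))
    return float(np.mean(scores))


# Toy data (an assumption for this sketch): three well-separated Gaussian blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Stability score for each candidate number of clusters k.
scores = {k: subsample_stability(X, k) for k in range(2, 6)}
for k, s in scores.items():
    print(f"k={k}: mean ARI across subsample pairs = {s:.3f}")
```

A score near 1 means the partition is reproduced almost exactly across subsamples. Unlike the geometric CVIs named above, this criterion makes no assumption about cluster shape beyond what the chosen clustering algorithm itself imposes.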