In cancer research, clustering techniques are widely used for exploratory analysis and dimensionality reduction. They play a critical role in identifying novel cancer subtypes, often with direct implications for patient management. As the volume of data collected by multiple research groups grows, it is increasingly feasible to investigate the replicability of clustering procedures, that is, their ability to consistently recover biologically meaningful clusters across several datasets. In this paper, we review existing methods for assessing the replicability of clustering analyses and discuss a framework for evaluating cross-study clustering replicability, which applies when two or more studies are available. These approaches can be applied to any clustering algorithm and can employ different measures of similarity between partitions to quantify replicability, both globally (i.e., for the whole sample) and locally (i.e., for individual clusters). Using experiments on synthetic and real gene expression data, we illustrate how replicability metrics can be used to assess whether the same clusters are identified consistently across a collection of datasets.
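To make the distinction between global and local replicability concrete, the following minimal sketch compares two partitions of the same samples. It assumes the two partitions have already been obtained, for example one by clustering a study directly and one by transferring cluster labels learned on another study. The adjusted Rand index and the per-cluster best Jaccard overlap are chosen here only as illustrative similarity measures, and the function names `adjusted_rand_index` and `per_cluster_jaccard` and the toy labels are hypothetical, not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Global agreement between two partitions of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same samples")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two samples")
    # Count sample pairs that fall in the same cluster in both, A only, B only.
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


def per_cluster_jaccard(labels_ref, labels_other):
    """Local agreement: best Jaccard overlap for each reference cluster."""
    members_ref, members_other = {}, {}
    for i, (r, o) in enumerate(zip(labels_ref, labels_other)):
        members_ref.setdefault(r, set()).add(i)
        members_other.setdefault(o, set()).add(i)
    return {
        r: max(len(s & t) / len(s | t) for t in members_other.values())
        for r, s in members_ref.items()
    }


# Toy labels (assumed for illustration): nine samples, two partitions.
labels_direct = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_transferred = [1, 1, 1, 0, 0, 2, 2, 2, 2]

print("global ARI:", adjusted_rand_index(labels_direct, labels_transferred))
print("local Jaccard:", per_cluster_jaccard(labels_direct, labels_transferred))
```

The global score summarizes agreement over the whole sample, while the per-cluster dictionary shows which individual clusters are recovered well and which are not. Both measures are invariant to how cluster labels are numbered, which matters because labels from separate clustering runs are arbitrary.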