In consensus clustering, a clustering algorithm is combined with a subsampling procedure to detect stable clusters. Previous studies on both simulated and real data suggest that consensus clustering outperforms native algorithms. Here, we extend consensus clustering to allow for attribute weighting in the calculation of pairwise distances, using existing regularised approaches. We propose a procedure for calibrating the number of clusters (and the regularisation parameter) by maximising a novel consensus score. Because this score is calculated directly from consensus clustering outputs, the procedure is extremely computationally competitive. Our simulation study shows better clustering performance for (i) models calibrated by maximising our consensus score, compared with existing calibration scores, and (ii) weighted approaches, compared with unweighted ones, in the presence of features that do not contribute to cluster definition. Application to real gene expression data measured in lung tissue reveals clear clusters corresponding to different lung cancer subtypes. The R package sharp (version 1.4.0) is available on CRAN.
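The combination of a clustering algorithm with subsampling described above can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the interface of the sharp package and not the consensus score proposed here: the function names (`kmeans_labels`, `consensus_matrix`), the choice of k-means as the base algorithm, the 80% subsample fraction and the toy data are all hypothetical. The sketch only computes co-membership proportions, that is, how often two items fall in the same cluster across the subsamples in which both were drawn.

```python
import random
from itertools import combinations


def kmeans_labels(points, k, rng, n_steps=20):
    """Plain Lloyd's k-means (assumed base algorithm) with farthest-point initialisation."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Start from one random point, then add the point farthest from the chosen centres.
    centres = [rng.choice(points)]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(d2(p, c) for c in centres)))

    labels = [0] * len(points)
    for _ in range(n_steps):
        # Assign each point to its nearest centre, then recompute the centres.
        labels = [min(range(k), key=lambda j: d2(p, centres[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centre if a cluster is empty
                centres[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def consensus_matrix(points, k, n_subsamples=50, fraction=0.8, seed=0):
    """Proportion of subsamples in which each pair of items is clustered together."""
    rng = random.Random(seed)
    n = len(points)
    size = max(k, int(round(fraction * n)))
    together = [[0] * n for _ in range(n)]  # times a pair shared a cluster
    sampled = [[0] * n for _ in range(n)]   # times a pair was drawn together

    for _ in range(n_subsamples):
        idx = sorted(rng.sample(range(n), size))  # subsample without replacement
        labels = kmeans_labels([points[i] for i in idx], k, rng)
        for (a, i), (b, j) in combinations(enumerate(idx), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[a] == labels[b]:
                together[i][j] += 1
                together[j][i] += 1

    # Pairs never drawn together get 0.0; the diagonal is set to 1.0 by convention.
    return [
        [1.0 if i == j else (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


# Toy data (hypothetical): two well-separated groups of five points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
cm = consensus_matrix(points, k=2)
print(cm[0][1], cm[0][5])  # a within-group pair and a between-group pair
```

For well-separated groups such as these, within-group proportions should be at or near 1 and between-group proportions at or near 0. Proportions that stay close to 0 or 1 are what marks a clustering as stable under subsampling.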