Clustering is a central tool for discovering latent structure in unlabeled data, yet modern clustering pipelines typically end with a hard assignment of each observation to a cluster, with no rigorous measure of assignment uncertainty. We propose a weighted conformal approach for constructing valid confidence sets for cluster labels. The key difficulty is that the labels available for calibration are not observed ground-truth labels but synthetic labels produced by a data-dependent clustering algorithm. We formulate conformal clustering as a conditional label-distribution shift problem and develop a conformal inference algorithm that uses weights to correct the resulting mismatch between the synthetic labels and the latent target labels. We first derive an oracle procedure that attains finite-sample marginal coverage, and then develop a computationally tractable version based on estimated conditional label probabilities and a novel augmented calibration step. We show that the coverage of the estimated-weight procedure depends on the quality of the estimator, and we give an explicit bound on the coverage loss relative to the nominal level. Empirical studies show that the proposed weighted approach improves on the recently proposed split conformal clustering procedure in terms of confidence set size and informativeness, especially in nonlinear and high-dimensional clustering applications.
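To make the idea of weighted conformal confidence sets for labels concrete, the following is a minimal sketch of a generic weighted split conformal construction. It is not the procedure proposed in this work: the augmented calibration step and the way the weights are derived are not reproduced here. The function names `weighted_quantile` and `weighted_conformal_set`, the nonconformity score (one minus the estimated label probability), and the assumption that the calibration and test weights are supplied by the user are all illustrative choices.

```python
import numpy as np


def weighted_quantile(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration scores.

    The test point's weight is treated as a point mass at +infinity,
    as in standard weighted conformal prediction.
    """
    order = np.argsort(scores)
    sorted_scores = scores[order]
    total = weights.sum() + test_weight
    # Normalized cumulative weight of the sorted calibration scores.
    cdf = np.cumsum(weights[order]) / total
    # First index whose cumulative weight reaches 1 - alpha.
    idx = np.searchsorted(cdf, 1.0 - alpha, side="left")
    if idx >= len(sorted_scores):
        return np.inf  # the remaining mass sits on the test point
    return sorted_scores[idx]


def weighted_conformal_set(cal_probs, cal_labels, cal_weights,
                           test_probs, test_weights, alpha=0.1):
    """Return the candidate labels retained for one test point.

    cal_probs:    (n, K) estimated label probabilities on calibration data
    cal_labels:   (n,) integer calibration labels (e.g. synthetic cluster labels)
    cal_weights:  (n,) nonnegative weights for the calibration points
    test_probs:   (K,) estimated label probabilities at the test point
    test_weights: (K,) weight of the test point under each candidate label
    """
    n = len(cal_labels)
    # Illustrative score: one minus the probability of the assigned label.
    cal_scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    kept = []
    for k in range(len(test_probs)):
        q = weighted_quantile(cal_scores, cal_weights, test_weights[k], alpha)
        if 1.0 - test_probs[k] <= q:
            kept.append(k)
    return kept
```

With all weights equal to one, this sketch reduces to ordinary split conformal prediction. Increasing the test weight for a candidate label moves the threshold toward infinity, so that label is more likely to be retained in the confidence set.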