The clustering task consists of partitioning the elements of a sample into homogeneous groups. Most datasets contain individuals that are ambiguous and intrinsically difficult to attribute to one cluster or another. In practical applications, however, misclassifying individuals can be disastrous and should be avoided. To keep the misclassification rate small, one can decide to classify only part of the sample. In the supervised setting, this approach is well known and is referred to as classification with an abstention option. In this paper, the approach is revisited in an unsupervised mixture model framework, with the aim of developing a method that guarantees that the false clustering rate (FCR) does not exceed a pre-defined nominal level $\alpha$. A new procedure is proposed and shown to be optimal up to a remainder term, in the sense that the FCR is controlled while the number of classified items is maximized. Bootstrap versions of the procedure are shown to improve performance in numerical experiments. An application to breast cancer data illustrates the practical benefits of the new approach.
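To make the abstention idea concrete, the following is a minimal sketch of a plug-in selection rule. It is an illustration under stated assumptions, not the procedure proposed in the paper, whose details are not given in this abstract. It assumes that posterior cluster-membership probabilities have already been estimated from a fitted mixture model, treats one minus the largest posterior probability as each item's estimated misclustering risk, and classifies as many items as possible while keeping the average estimated risk of the classified items at or below $\alpha$. The function name `select_with_fcr_budget` and the convention of marking abstentions with `-1` are hypothetical.

```python
import numpy as np


def select_with_fcr_budget(posteriors, alpha):
    """Classify as many items as possible under an estimated-error budget.

    posteriors: (n, K) array of estimated cluster-membership probabilities
                (assumed to come from a mixture model fitted beforehand).
    alpha:      nominal level for the average estimated error.

    Returns (labels, selected), where labels[i] is the most probable
    cluster for a selected item and -1 for an item left unclassified.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    labels = posteriors.argmax(axis=1)

    # Estimated probability that the most probable cluster is wrong.
    err = 1.0 - posteriors.max(axis=1)

    # Visit items from least to most ambiguous.
    order = np.argsort(err, kind="stable")

    # Average estimated error of the k least ambiguous items, k = 1..n.
    # The errors are sorted, so this running mean is non-decreasing and
    # the admissible values of k form a prefix.
    running_mean = np.cumsum(err[order]) / np.arange(1, len(err) + 1)
    k = int(np.sum(running_mean <= alpha))

    selected = np.zeros(len(err), dtype=bool)
    selected[order[:k]] = True
    return np.where(selected, labels, -1), selected


# Toy example with hand-written posteriors for two clusters.
toy_posteriors = [
    [0.99, 0.01],
    [0.95, 0.05],
    [0.60, 0.40],
    [0.50, 0.50],
    [0.02, 0.98],
]
toy_labels, toy_selected = select_with_fcr_budget(toy_posteriors, alpha=0.05)
```

In this toy example the two ambiguous items (rows three and four) are left unclassified. Because the posteriors are themselves estimates, a rule of this kind controls only an estimate of the FCR; a guarantee on the actual FCR requires the kind of analysis the paper describes.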