Constrained clustering trains classification models from pairwise constraints alone. Such constraints are a weak form of supervision and relatively easy to mine, yet still yield full-supervision-level model performance. Although these models perform well in the absence of the true underlying class labels, they still require large amounts of binary constraint annotations for training. In this paper, we introduce a semi-supervised setting in which a large amount of \textit{unconstrained} data is available alongside a smaller set of constraints, and we propose \textit{ConstraintMatch} to leverage such unconstrained data. While semi-supervised learning with full labels has seen a great deal of progress, several challenges prevent a naive application of the resulting methods in the constraint-based label setting. We therefore analyze these challenges, specifically by 1) proposing a \textit{pseudo-constraining} mechanism to overcome confirmation bias, a major weakness of pseudo-labeling, 2) developing new pseudo-labeling methods for the selection of \textit{informative} unconstrained samples, and 3) showing that this also allows the use of pairwise loss functions for the initial and auxiliary losses, which facilitates semi-constrained model training. In extensive experiments on five challenging benchmarks, we demonstrate the effectiveness of ConstraintMatch over relevant baselines in both the regular clustering and overclustering scenarios and analyze its individual components.
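To make the notions of a pairwise constraint loss and a pseudo-constraint concrete, the following is a minimal sketch and not the ConstraintMatch implementation. It assumes the model outputs a softmax distribution over clusters for each sample, scores a pair by the inner product of the two distributions (the probability that both samples fall in the same cluster), and applies binary cross-entropy against the must-link / cannot-link annotation. The function names and the confidence threshold `tau` are hypothetical choices made for illustration; the abstract does not specify how pseudo-constraints are selected.

```python
import math


def same_cluster_prob(p_i, p_j):
    """Inner product of two cluster distributions: P(both samples share a cluster)."""
    return sum(a * b for a, b in zip(p_i, p_j))


def pairwise_loss(p_i, p_j, must_link, eps=1e-7):
    """Binary cross-entropy on the same-cluster probability of one constrained pair."""
    # Clamp to avoid log(0) for fully confident predictions.
    s = min(max(same_cluster_prob(p_i, p_j), eps), 1.0 - eps)
    return -math.log(s) if must_link else -math.log(1.0 - s)


def pseudo_constraint(p_i, p_j, tau=0.9):
    """Hypothetical rule: derive a constraint for an unconstrained pair.

    Returns True (pseudo must-link), False (pseudo cannot-link),
    or None when the model is not confident enough and the pair is skipped.
    """
    s = same_cluster_prob(p_i, p_j)
    if s >= tau:
        return True
    if s <= 1.0 - tau:
        return False
    return None


if __name__ == "__main__":
    p_a = [0.97, 0.02, 0.01]
    p_b = [0.96, 0.03, 0.01]
    p_c = [0.01, 0.01, 0.98]
    print(pairwise_loss(p_a, p_b, must_link=True))   # small: the pair agrees with must-link
    print(pairwise_loss(p_a, p_b, must_link=False))  # large: the pair violates cannot-link
    print(pseudo_constraint(p_a, p_b), pseudo_constraint(p_a, p_c))
```

Because both the annotated pairs and the pseudo-constrained pairs are scored by the same pairwise loss in this sketch, no class labels are needed at any point, which is the property the abstract attributes to using pairwise losses for both the initial and auxiliary objectives.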