In this paper, we address a complex but practical scenario in semi-supervised learning (SSL) named open-set SSL, where unlabeled data contain both in-distribution (ID) and out-of-distribution (OOD) samples. Unlike previous methods, which consider only ID samples useful and aim to filter out OOD samples completely during training, we argue that exploring and exploiting both ID and OOD samples can benefit SSL. To support our claim, i) we propose a prototype-based clustering and identification algorithm that explores the inherent similarities and differences among samples at the feature level and effectively clusters them around several predefined ID and OOD prototypes, thereby enhancing feature learning and facilitating ID/OOD identification; ii) we propose an importance-based sampling method that exploits the difference in importance of each ID and OOD sample to SSL, thereby reducing sampling bias and improving training. Our proposed method achieves state-of-the-art performance on several challenging benchmarks and improves upon existing SSL methods even when ID samples are entirely absent from the unlabeled data.
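The two ideas named in the abstract, clustering samples around ID and OOD prototypes and sampling by importance, can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the similarity measure, how prototypes are defined, or how importance is computed. The sketch assumes cosine similarity, fixed prototype vectors whose first `num_id` rows are ID prototypes, and the maximum similarity as a stand-in importance score. The function names `assign_to_prototypes` and `importance_sample` are hypothetical.

```python
import numpy as np


def assign_to_prototypes(features, prototypes, num_id):
    """Assign each feature vector to its most similar prototype.

    Assumption of this sketch: the first `num_id` rows of `prototypes`
    are ID prototypes and the remaining rows are OOD prototypes.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = f @ p.T                      # shape: (num_samples, num_prototypes)
    nearest = sim.argmax(axis=1)       # index of the closest prototype
    is_id = nearest < num_id           # True if the closest prototype is ID
    return nearest, is_id, sim.max(axis=1)


def importance_sample(importance, batch_size, rng):
    """Draw sample indices with probability proportional to `importance`.

    `importance` must be non-negative with a positive sum; how it is
    computed in the paper is not stated in the abstract.
    """
    probs = importance / importance.sum()
    return rng.choice(len(importance), size=batch_size, replace=True, p=probs)


# Toy example: 2 ID prototypes followed by 2 OOD prototypes in a 4-d space.
rng = np.random.default_rng(0)
prototypes = np.eye(4)
features = np.array([
    [0.9, 0.1, 0.0, 0.0],   # closest to ID prototype 0
    [0.0, 0.0, 0.2, 0.8],   # closest to OOD prototype 3
    [0.1, 0.7, 0.1, 0.1],   # closest to ID prototype 1
])
nearest, is_id, score = assign_to_prototypes(features, prototypes, num_id=2)

# Stand-in importance: the similarity to the nearest prototype.
batch = importance_sample(score, batch_size=2, rng=rng)
```

In this toy setup both ID-like and OOD-like samples remain eligible for sampling, which reflects the abstract's claim that OOD samples are kept and used rather than filtered out; only their sampling probability differs.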