Semi-supervised learning (SSL) methods effectively leverage unlabeled data to improve model generalization. However, SSL models often underperform in open-set scenarios, where the unlabeled data contain outliers from novel categories that do not appear in the labeled set. In this paper, we study the challenging and realistic open-set SSL setting, where the goal is both to classify inliers correctly and to detect outliers. Intuitively, the inlier classifier should be trained on inlier data only. However, we find that inlier classification performance can be substantially improved by incorporating high-confidence pseudo-labeled data, regardless of whether they are inliers or outliers. We also propose using non-linear transformations to separate the features used for inlier classification and outlier detection within the multi-task learning framework, which prevents the two tasks from adversely affecting each other. Additionally, we introduce pseudo-negative mining, which further boosts outlier detection performance. Together, these three ingredients form what we call a Simple but Strong Baseline (SSB) for open-set SSL. In experiments, SSB greatly improves both inlier classification and outlier detection performance, outperforming existing methods by a large margin. Our code will be released at https://github.com/YUE-FAN/SSB.
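To make the first ingredient concrete, the following is a minimal sketch of confidence-based pseudo-label selection that applies no inlier/outlier filtering, together with a simple thresholded form of pseudo-negative selection. It is an illustration under stated assumptions, not the released SSB implementation: the function names, the array shapes, and the thresholds 0.95 and 0.05 are hypothetical choices made for this example.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def select_pseudo_labels(logits, threshold=0.95):
    """Pick unlabeled samples whose top inlier-class probability is high.

    logits: (N, K) inlier-classifier outputs on N unlabeled samples.
    Returns (indices, labels). No outlier score is consulted, so a
    confidently classified sample is kept whether it is an inlier or not.
    The threshold value is an assumption made for this sketch.
    """
    probs = softmax(logits)
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = np.flatnonzero(confidence >= threshold)
    return keep, labels[keep]


def mine_pseudo_negatives(inlier_scores, threshold=0.05):
    """Mark (sample, class) pairs that look like confident negatives.

    inlier_scores: (N, K) hypothetical per-class scores in [0, 1], where a
    low value means the sample is unlikely to belong to class k.
    Returns a boolean (N, K) mask; True entries could serve as negatives
    when training an outlier detector. This thresholding rule is an
    assumed simplification, not a statement of the paper's exact procedure.
    """
    return inlier_scores <= threshold


# Usage: rows 0 and 2 are confident, row 1 is uniform and is dropped.
logits = np.array([[10.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0],
                   [0.0, 0.0, 8.0]])
indices, labels = select_pseudo_labels(logits)

scores = np.array([[0.99, 0.01],
                   [0.50, 0.50]])
negatives = mine_pseudo_negatives(scores)
```

The point of the sketch is the absence of any outlier check in `select_pseudo_labels`: selection depends only on classifier confidence, which mirrors the finding stated in the abstract.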