Self-training and contrastive learning have emerged as leading techniques for incorporating unlabeled data, both under distribution shift (unsupervised domain adaptation) and when it is absent (semi-supervised learning). However, despite the popularity and compatibility of these techniques, their efficacy in combination remains unexplored. In this paper, we undertake a systematic empirical investigation of this combination, finding that (i) in domain adaptation settings, self-training and contrastive learning offer significant complementary gains; and (ii) in semi-supervised learning settings, surprisingly, the benefits are not synergistic. Across eight distribution shift datasets (e.g., BREEDs, WILDS), we demonstrate that the combined method obtains 3--8% higher accuracy than either approach independently. We then theoretically analyze these techniques in a simplified model of distribution shift, demonstrating scenarios under which the features produced by contrastive learning can yield a good initialization for self-training to further amplify gains and achieve optimal performance, even when either method alone would fail.