We consider a setting in which a model must adapt to a new domain under distribution shift, given that only unlabeled test samples from the new domain are accessible at test time. A common idea in most related work is to construct pseudo-labels for the unlabeled test samples and apply gradient descent (GD) to a loss function with the pseudo-labels. Recently, \cite{GSRK22} proposed conjugate labels, a new kind of pseudo-label for self-training at test time. They empirically show that conjugate labels outperform other ways of pseudo-labeling on many domain adaptation benchmarks. However, provably showing that GD with conjugate labels learns a good classifier for test-time adaptation remains open. In this work, we aim to theoretically understand GD with hard and conjugate labels for a binary classification problem. We show that for square loss, GD with conjugate labels converges to an $\epsilon$-optimal predictor under a Gaussian model for any arbitrarily small $\epsilon$, while GD with hard pseudo-labels fails in this task. We also analyze both under different loss functions for the update. Our results shed light on when and why GD with hard labels or conjugate labels works in test-time adaptation.
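As a rough illustration of the two update rules compared above, the following sketch runs GD with hard and with conjugate pseudo-labels on unlabeled samples from a two-component Gaussian mixture. Everything specific in it is an assumption made for illustration and is not taken from the text: the mean `mu`, the identity covariance, the step size, the initialization, and in particular the form of the objectives. It assumes a linear predictor $w^\top x$, a hard-label objective $\frac{1}{2}(\mathrm{sign}(w^\top x) - w^\top x)^2$ with the pseudo-label held constant, and a conjugate-label objective that reduces to $-\frac{1}{2}(w^\top x)^2$ for square loss.

```python
import numpy as np

def make_gaussian_data(n=2000, d=5, seed=0):
    """Unlabeled test samples x = y * mu + noise; the labels y are discarded."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(d)
    mu[0] = 2.0  # hypothetical class mean, chosen only for illustration
    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * mu + rng.standard_normal((n, d))
    return X, mu

def gd_hard(X, w, lr=0.1, steps=50):
    """GD on 0.5 * (sign(w.x) - w.x)^2, treating the pseudo-label as a constant."""
    for _ in range(steps):
        h = X @ w
        grad = -((np.sign(h) - h)[:, None] * X).mean(axis=0)
        w = w - lr * grad
    return w

def gd_conjugate(X, w, lr=0.1, steps=50):
    """GD on -0.5 * (w.x)^2, the assumed conjugate-label objective for square loss."""
    for _ in range(steps):
        h = X @ w
        grad = -(h[:, None] * X).mean(axis=0)
        w = w - lr * grad
    return w

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

if __name__ == "__main__":
    X, mu = make_gaussian_data()
    w0 = np.random.default_rng(1).standard_normal(X.shape[1])  # arbitrary start
    print("hard      |cos(w, mu)|:", abs(cosine(gd_hard(X, w0), mu)))
    print("conjugate |cos(w, mu)|:", abs(cosine(gd_conjugate(X, w0), mu)))
```

Under these assumptions the conjugate-label step is $w \leftarrow (I + \eta \hat{\Sigma}) w$, where $\hat{\Sigma}$ is the empirical second-moment matrix of the samples. It therefore behaves like power iteration: the norm of $w$ grows and its direction aligns with the top eigenvector of $\hat{\Sigma}$, which in this synthetic setup lies close to $\pm\mu$. The sketch makes no claim about how the hard-label iterate behaves beyond what is stated above.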