Out-Of-Domain Unlabeled Data Improves Generalization

We propose a novel framework for incorporating unlabeled data into semi-supervised classification problems, covering the minimization of both i) adversarially robust and ii) non-robust loss functions. Notably, we allow the unlabeled samples to deviate slightly (in the total variation sense) from the in-domain distribution. The core idea behind our framework is to combine Distributionally Robust Optimization (DRO) with self-supervised training, which also lets us leverage efficient polynomial-time algorithms for the training stage. From a theoretical standpoint, we apply our framework to the classification problem of a mixture of two Gaussians in $\mathbb{R}^d$, where, in addition to the $m$ independent labeled samples from the true distribution, a set of $n$ (usually with $n\gg m$) out-of-domain, unlabeled samples is given as well. Using only the labeled data, the generalization error is known to be bounded by a quantity $\propto\left(d/m\right)^{1/2}$. However, using our method on both isotropic and non-isotropic Gaussian mixture models, one can derive a new set of analytically explicit and non-asymptotic bounds that show a substantial improvement in the generalization error compared to empirical risk minimization (ERM). Our results underscore two significant insights: 1) out-of-domain samples, even when unlabeled, can be harnessed to narrow the generalization gap, provided that the true data distribution adheres to a form of the ``cluster assumption'', and 2) the semi-supervised learning paradigm can be regarded as a special case of our framework when there are no distributional shifts. We validate our claims through experiments conducted on a variety of synthetic and real-world datasets.
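
The two-Gaussian setting described above can be illustrated with a minimal simulation sketch. It is not the DRO-based method proposed here: as a stand-in, it uses one round of plain pseudo-labeling (self-training) on the unlabeled samples. The names `gmm_error` and `run_trial`, the mean-averaging estimator, and all parameter values ($d=50$, $m=10$, $n=5000$, the mean shift of norm $0.2$) are assumptions chosen purely for illustration.

```python
import math
import numpy as np


def gmm_error(theta, mu, sigma):
    """Exact in-domain error of x -> sign(<theta, x>) when x = y*mu + sigma*z."""
    margin = float(theta @ mu) / (sigma * float(np.linalg.norm(theta)))
    # Standard normal CDF evaluated at -margin.
    return 0.5 * (1.0 + math.erf(-margin / math.sqrt(2.0)))


def run_trial(d=50, m=10, n=5000, sigma=1.0, shift=0.2, seed=0):
    """Return (error of labeled-only estimate, error after pseudo-labeling)."""
    rng = np.random.default_rng(seed)
    mu = np.full(d, 1.5 / math.sqrt(d))  # assumed true mean, ||mu|| = 1.5

    # m labeled in-domain samples from the isotropic two-component mixture.
    y = rng.choice([-1.0, 1.0], size=m)
    x = y[:, None] * mu + sigma * rng.standard_normal((m, d))
    theta_lab = (y[:, None] * x).mean(axis=0)  # labeled-only mean estimate

    # n unlabeled out-of-domain samples: the component means are shifted
    # by a small random vector delta (an assumed, illustrative shift model).
    delta = rng.standard_normal(d)
    delta *= shift / np.linalg.norm(delta)
    y_u = rng.choice([-1.0, 1.0], size=n)  # hidden labels, never used below
    x_u = y_u[:, None] * (mu + delta) + sigma * rng.standard_normal((n, d))

    # One round of pseudo-labeling (a stand-in, not the paper's DRO method).
    pseudo = np.sign(x_u @ theta_lab)
    theta_ss = (pseudo[:, None] * x_u).mean(axis=0)

    return gmm_error(theta_lab, mu, sigma), gmm_error(theta_ss, mu, sigma)
```

Calling `run_trial()` returns the pair of in-domain classification errors, so the effect of the unlabeled, slightly shifted samples can be inspected for different choices of `m`, `n`, and `shift`.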
