Many machine learning models appear to deploy effortlessly under distribution shift, performing well on a target distribution that differs considerably from the training distribution. Yet the learning theory of distribution shift typically bounds performance on the target distribution as a function of the discrepancy between source and target, and rarely guarantees high target accuracy. Motivated by this gap, this work takes a closer look at the theory of distribution shift for a classifier moving from a source to a target distribution. Instead of relying on the discrepancy, we adopt an Invariant-Risk-Minimization (IRM)-like assumption connecting the distributions, and characterize conditions under which data from the source distribution is sufficient for accurate classification of the target. When these conditions are not met, we show when unlabeled data from the target alone is sufficient, and when labeled target data is needed. In all cases, we provide rigorous theoretical guarantees in the large-sample regime.