This paper establishes the generalization error of pooled min-$\ell_2$-norm interpolation in transfer learning where data from diverse distributions are available. Min-norm interpolators emerge naturally as implicit regularized limits of modern machine learning algorithms. Previous work characterized their out-of-distribution risk when samples from the test distribution are unavailable during training. However, in many applications, a limited amount of test data may be available during training, yet properties of min-norm interpolation in this setting are not well-understood. We address this gap by characterizing the bias and variance of pooled min-$\ell_2$-norm interpolation under covariate and model shifts. The pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications: under model shift, for low signal-to-noise ratio (SNR), adding data always hurts. For higher SNR, transfer learning helps as long as the shift-to-signal ratio (SSR) lies below a threshold that we characterize explicitly. By consistently estimating these ratios, we provide a data-driven method to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes the generalization error. Under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between domains improves the risk, and vice versa. We establish a novel anisotropic local law to achieve these characterizations, which may be of independent interest in random matrix theory. We supplement our theoretical characterizations with comprehensive simulations that demonstrate the finite-sample efficacy of our results.
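The pooled interpolator studied above can be sketched concretely. The following is a minimal numpy illustration, not the paper's experimental setup: all dimensions, sample sizes, and noise levels are assumed for demonstration. It fits the minimum-$\ell_2$-norm solution interpolating the stacked (early-fused) source and target data, alongside the target-only interpolator it is compared against.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50                       # ambient dimension (overparameterized: p > n_s + n_t)
n_s, n_t = 20, 10            # source and target sample sizes (assumed values)
sigma = 0.1                  # noise level (assumed)

beta = rng.normal(size=p) / np.sqrt(p)   # shared signal; no model shift in this toy setup
X_s = rng.normal(size=(n_s, p))          # source covariates
X_t = rng.normal(size=(n_t, p))          # target covariates
y_s = X_s @ beta + sigma * rng.normal(size=n_s)
y_t = X_t @ beta + sigma * rng.normal(size=n_t)

# Pooled min-l2-norm interpolator: the minimum-norm beta satisfying
# X beta = y on the stacked source + target data. For p > n with full
# row rank, pinv(X) @ y equals X^T (X X^T)^{-1} y.
X = np.vstack([X_s, X_t])
y = np.concatenate([y_s, y_t])
beta_pooled = np.linalg.pinv(X) @ y

# It interpolates every pooled sample exactly (up to numerical error).
assert np.allclose(X @ beta_pooled, y)

# Target-only min-norm interpolator, for comparison.
beta_target = np.linalg.pinv(X_t) @ y_t
assert np.allclose(X_t @ beta_target, y_t)
```

Whether `beta_pooled` beats `beta_target` in generalization error depends on the SNR and shift regimes characterized in the paper; this sketch only exhibits the two estimators.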