Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and is a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g., data collected under different circumstances or synthesized by generative models. We refer to such data as `surrogate data.' We introduce a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze this method mathematically under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original data. We trace this behavior back to the classical Stein's paradox. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme and to choose the amount of surrogate data to add.
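For concreteness, a minimal form of the weighted ERM objective (our notation; the weighting scheme considered in the paper may be more general) combines the $n$ target samples $\{z_i\}_{i \le n}$ with $m$ surrogate samples $\{\tilde{z}_j\}_{j \le m}$ through a single mixing weight $\alpha \in [0,1]$:
$$
\hat{\theta}_\alpha \;=\; \arg\min_{\theta} \left\{ \frac{1-\alpha}{n} \sum_{i=1}^{n} \ell(\theta; z_i) \;+\; \frac{\alpha}{m} \sum_{j=1}^{m} \ell(\theta; \tilde{z}_j) \right\},
$$
where $\ell$ is the loss function. Setting $\alpha = 0$ recovers ERM on the target data alone, while a well-chosen $\alpha$ trades off the bias introduced by the surrogate distribution against the variance reduction from the additional samples.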