Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and is a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as `surrogate data'. We study a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze this method mathematically under several classical statistical models and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original data. We trace this behavior back to the classical Stein's paradox. $(ii)$ To reap the benefits of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law, which can be used to predict the optimal weighting scheme and to choose the amount of surrogate data to add.
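For concreteness, one common way to instantiate weighted ERM over a mixture of $n$ real samples $x_1, \dots, x_n$ and $m$ surrogate samples $\tilde{x}_1, \dots, \tilde{x}_m$ is sketched below; the mixing weight $\alpha$ and this particular parametrization are illustrative notation on our part, not necessarily the formulation used in the paper:
$$
\hat{\theta}_\alpha \in \arg\min_{\theta} \; \frac{1-\alpha}{n} \sum_{i=1}^{n} \ell(\theta; x_i) \;+\; \frac{\alpha}{m} \sum_{j=1}^{m} \ell(\theta; \tilde{x}_j), \qquad \alpha \in [0, 1].
$$
Under this parametrization, $\alpha = m/(n+m)$ recovers unweighted pooling of the two samples; findings $(i)$–$(iii)$ concern choosing $\alpha$ (e.g. via the fitted scaling law) so as to minimize test error on the target distribution.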
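A minimal runnable sketch of weighted ERM on mixed real and surrogate data is given below, assuming a binary classification toy setup; the data-generating process, the value of $\alpha$, and the use of scikit-learn's `sample_weight` argument are our own illustrative choices, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: n real points from the target distribution,
# m surrogate points from a shifted distribution (illustrative only).
n, m, d = 50, 500, 20
theta_true = rng.normal(size=d)
X_real = rng.normal(size=(n, d))
y_real = (X_real @ theta_true + 0.5 * rng.normal(size=n) > 0).astype(int)
X_surr = rng.normal(loc=0.3, size=(m, d))  # mimics, e.g., synthetic data
y_surr = (X_surr @ theta_true + rng.normal(size=m) > 0).astype(int)

# Weighted ERM: real points get weight (1 - alpha)/n, surrogate alpha/m,
# matching the objective sketched above. In practice alpha would be tuned,
# e.g., via the fitted scaling law.
alpha = 0.3
X = np.vstack([X_real, X_surr])
y = np.concatenate([y_real, y_surr])
w = np.concatenate([np.full(n, (1 - alpha) / n), np.full(m, alpha / m)])

model = LogisticRegression().fit(X, y, sample_weight=w)
```

Sweeping `alpha` over $[0, 1]$ and evaluating held-out error on the real distribution reproduces the qualitative trade-off described in findings $(i)$ and $(ii)$: some surrogate weight helps, but the unweighted choice is generally suboptimal.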