Collecting large quantities of high-quality data is often prohibitively expensive or impractical, which makes data a crucial bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, such as public datasets, data collected under different circumstances, or data synthesized by generative models. Blurring these distinctions, we refer to all such data as `surrogate data'. We define a simple scheme for integrating surrogate data into training and use both theoretical models and empirical studies to explore its behavior. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution; $(ii)$ In order to reap this benefit, it is crucial to use optimally weighted empirical risk minimization; $(iii)$ The test error of models trained on mixtures of real and surrogate data is well described by a scaling law. This scaling law can be used to predict the optimal weighting and the gain from surrogate data.
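To make the idea of weighted empirical risk minimization over real and surrogate data concrete, the following is a minimal sketch under stated assumptions. It assumes a linear model with squared loss and an objective of the form $(1-\alpha)\,\widehat{R}_{\rm real}(\theta) + \alpha\,\widehat{R}_{\rm surr}(\theta)$ with $\alpha \in [0,1]$; the abstract does not spell out the objective, so this form, the function names `weighted_erm_linear` and `pick_alpha`, the small `ridge` stabilizer, and the held-out grid search are all illustrative choices rather than the authors' procedure. In particular, the grid search is a simple stand-in for choosing the weight and is not the scaling-law prediction the abstract refers to.

```python
import numpy as np


def weighted_erm_linear(X_real, y_real, X_surr, y_surr, alpha, ridge=1e-8):
    """Fit a linear model by minimizing an assumed weighted objective:
    (1 - alpha) * MSE on real data + alpha * MSE on surrogate data.

    `ridge` is a tiny stabilizer (an assumption of this sketch) that keeps
    the linear system solvable when one data source gets zero weight.
    """
    n, d = X_real.shape
    m = X_surr.shape[0]
    # Normal equations of the weighted least-squares objective.
    A = ((1.0 - alpha) / n) * (X_real.T @ X_real) \
        + (alpha / m) * (X_surr.T @ X_surr) \
        + ridge * np.eye(d)
    b = ((1.0 - alpha) / n) * (X_real.T @ y_real) \
        + (alpha / m) * (X_surr.T @ y_surr)
    return np.linalg.solve(A, b)


def pick_alpha(X_real, y_real, X_surr, y_surr, X_val, y_val, grid=None):
    """Hypothetical helper: choose alpha by held-out error on real data."""
    grid = np.linspace(0.0, 1.0, 11) if grid is None else np.asarray(grid)
    errs = []
    for a in grid:
        theta = weighted_erm_linear(X_real, y_real, X_surr, y_surr, a)
        errs.append(np.mean((X_val @ theta - y_val) ** 2))
    best = int(np.argmin(errs))
    return float(grid[best]), float(errs[best])
```

Setting `alpha=0` recovers ordinary least squares on the real data alone and `alpha=1` trains on surrogate data only; intermediate values trade the variance reduction from the larger surrogate set against the bias from its distribution shift.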