Recent breakthroughs in synthetic data generation have made it possible to produce highly photorealistic images that are hardly distinguishable from real ones. Furthermore, synthetic generation pipelines have the potential to generate an unlimited number of images. The combination of high photorealism and scale turns synthetic data into a promising candidate for improving various machine learning (ML) pipelines. Thus far, a large body of research in this field has focused on using synthetic images for training, by augmenting and enlarging training data. In contrast, in this work we explore whether synthetic data can be beneficial for model selection. Considering the task of image classification, we demonstrate that when data is scarce, synthetic data can be used to replace the held-out validation set, thus allowing the model to be trained on a larger dataset. We also introduce a novel method to calibrate the synthetic error estimation to fit that of the real domain. We show that such calibration significantly improves the usefulness of synthetic data for model selection.
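As a rough illustration of what model selection on a synthetic validation set can look like, the following minimal sketch ranks candidate classifiers by their error rate on synthetic examples and picks the one with the lowest error. Everything in it is an assumption made for illustration: the toy one-dimensional data, the threshold "models", and the helper names `error_rate` and `select_model` are hypothetical and do not come from the work summarized above. The calibration step is not shown, since its details are not given here.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical toy types: a 1-D feature with a binary label, and a model
# that maps a feature to a predicted label.
Example = Tuple[float, int]
Model = Callable[[float], int]


def error_rate(model: Model, data: Sequence[Example]) -> float:
    """Fraction of examples in `data` that `model` misclassifies."""
    wrong = sum(1 for x, y in data if model(x) != y)
    return wrong / len(data)


def select_model(
    candidates: Dict[str, Model], synthetic_val: Sequence[Example]
) -> Tuple[str, Dict[str, float]]:
    """Pick the candidate with the lowest error on the synthetic set."""
    errors = {name: error_rate(m, synthetic_val) for name, m in candidates.items()}
    best = min(errors, key=errors.get)
    return best, errors


# Stand-in for generated validation data; real data would be synthetic images.
synthetic_val = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

# Stand-ins for trained classifiers with different hyperparameters.
candidates = {
    "threshold_0.5": lambda x: int(x > 0.5),
    "threshold_0.3": lambda x: int(x > 0.3),
    "always_one": lambda x: 1,
}

best_name, synthetic_errors = select_model(candidates, synthetic_val)
print(best_name, synthetic_errors)
```

Because no real examples are consumed for validation in this setup, all available real data can go to training, which is the benefit the abstract describes for the scarce-data regime.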