Deep learning in computer vision has achieved great success at the price of large-scale labeled training data. However, exhaustive data annotation is impractical for every task in every domain of interest, owing to high labor costs and unguaranteed labeling accuracy. Besides, the uncontrollable data collection process produces non-IID training and test data, in which undesired duplication may exist. All these nuisances may hinder the verification of typical theories and the discovery of new findings. To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization. In this work, we push forward along this line with in-depth and extensive research on bare supervised learning and downstream domain adaptation. Specifically, under the well-controlled, IID data setting enabled by 3D rendering, we systematically verify typical, important learning insights, e.g., shortcut learning, and discover new laws of generalization across various data regimes and network architectures. We further investigate the effect of image formation factors on generalization, e.g., object scale, material texture, illumination, camera viewpoint, and background in a 3D scene. Moreover, we use simulation-to-reality adaptation as a downstream task to compare the transferability of synthetic and real data when used for pre-training, which demonstrates that synthetic data pre-training is also promising for improving real test results. Lastly, to promote future research, we develop a new large-scale synthetic-to-real benchmark for image classification, termed S2RDA, which poses more significant challenges for transfer from simulation to reality. The code and datasets are available at https://github.com/huitangtang/On_the_Utility_of_Synthetic_Data.