Deep learning in computer vision has achieved great success at the price of large-scale labeled training data. However, exhaustive data annotation is impractical for every task in every domain of interest, owing to high labor costs and unguaranteed labeling accuracy. Moreover, the uncontrollable data collection process produces non-IID training and test data, which may contain undesired duplication. All these nuisances may hinder the verification of typical theories and the discovery of new findings. To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization. In this work, we push forward along this line with an extensive study of bare supervised learning and downstream domain adaptation. Specifically, under the well-controlled, IID data setting enabled by 3D rendering, we systematically verify typical, important learning insights, e.g., shortcut learning, and discover new laws of generalization across various data regimes and network architectures. We further investigate the effect of image formation factors on generalization, e.g., object scale, material texture, illumination, camera viewpoint, and background in a 3D scene. Moreover, we use simulation-to-reality adaptation as a downstream task to compare the transferability of synthetic and real data when used for pre-training, which demonstrates that synthetic data pre-training is also promising for improving real test results. Lastly, to promote future research, we develop a new large-scale synthetic-to-real benchmark for image classification, termed S2RDA, which poses more significant challenges for transfer from simulation to reality. The code and datasets are available at https://github.com/huitangtang/On_the_Utility_of_Synthetic_Data.