Synthetic data has become increasingly popular for training computer vision algorithms because it is cost-effective, scalable, and able to provide accurate multi-modality labels. Although recent studies have demonstrated impressive results when training networks solely on synthetic data, a performance gap remains between models trained on synthetic data and models trained on real data, and this gap is commonly attributed to a lack of photorealism. This study investigates the gap in greater detail for the face parsing task. We differentiate between three types of gap: the distribution gap, the label gap, and the photorealism gap. Our findings show that the distribution gap is the largest contributor, accounting for over 50% of the performance gap. By addressing the distribution gap and accounting for the label gap, we demonstrate that a model trained on synthetic data achieves results comparable to those of a model trained on a similar amount of real data. This suggests that synthetic data is a viable alternative to real data, especially when real data is limited or difficult to obtain. Our study highlights the importance of content diversity in synthetic datasets and challenges the notion that the photorealism gap is the most critical factor affecting the performance of computer vision models trained on synthetic data.