Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following. But do they perform well as synthetic vision data generators? In this work, we revisit the promise of synthetic data as a scalable substitute for real training sets and uncover a surprising performance regression. We generate large-scale synthetic datasets using state-of-the-art T2I models released between 2022 and 2025, train standard classifiers solely on this synthetic data, and evaluate them on real test data. Despite clear advances in visual fidelity and prompt adherence, classification accuracy on real test data consistently declines as newer T2I models are used as training data generators. Our analysis reveals a hidden trend: these models collapse to a narrow, aesthetic-centric distribution that undermines diversity and label-image alignment. Overall, our findings challenge a growing assumption in vision research, namely that progress in generative realism implies progress in data realism. We thus highlight an urgent need to rethink the capabilities of modern T2I models as reliable training data generators.