The emergence of generative AI models has dramatically expanded the availability and use of synthetic data across scientific, industrial, and policy domains. While these developments open new possibilities for data analysis, they also raise fundamental statistical questions about when synthetic data can be used in a valid, reliable, and principled manner. This paper reviews the current landscape of synthetic data generation and use from a statistical perspective, with the goal of clarifying the assumptions under which synthetic data can meaningfully support downstream discovery, inference, and prediction. We survey major classes of modern generative models, their intended use cases, and the benefits they offer, while also highlighting their limitations and characteristic failure modes. We additionally examine common pitfalls that arise when synthetic data are treated as surrogates for real observations, including biases from model misspecification, attenuated uncertainty, and difficulties in generalization. Building on these insights, we discuss emerging frameworks for the principled use of synthetic data. We conclude with practical recommendations, open problems, and cautions intended to guide both method developers and applied researchers.