Generative artificial intelligence (AI) technologies and large models now produce realistic outputs across domains such as images, text, speech, and music. Building these advanced generative models demands substantial resources, particularly large, high-quality datasets. To reduce training costs, many algorithm developers use data generated by the models themselves as an inexpensive training source. However, not all synthetic data improves model performance, so the use of real versus synthetic data must be balanced strategically to optimize outcomes. The once well-controlled integration of real and synthetic data is now becoming uncontrollable: synthetic data spreads widely and without regulation online, contaminating datasets traditionally compiled through web scraping with unlabeled synthetic content. This trend portends a future in which generative AI systems increasingly, and blindly, consume their own outputs, raising concerns about both model performance and ethics. What will happen if generative AI continuously consumes itself without discernment? What measures can we take to mitigate the potential adverse effects? There is a significant gap in the scientific literature regarding the impact of synthetic data use in generative AI, particularly concerning the fusion of multimodal information. To address this gap, this review investigates the consequences of blindly incorporating synthetic data into the training of generative AI in both the image and text modalities, and explores strategies to mitigate these effects. The goal is to offer a comprehensive view of synthetic data's role, advocating a balanced approach to its use and exploring practices that promote the sustainable development of generative AI technologies in the era of large models.