Generative Artificial Intelligence (AI) technologies and large models are producing realistic outputs across various domains, such as images, text, speech, and music. Creating these advanced generative models requires significant resources, particularly large, high-quality datasets. To minimise training expenses, many algorithm developers use data created by the models themselves as a cost-effective training solution. However, not all synthetic data effectively improve model performance, so the use of real versus synthetic data must be balanced strategically to optimise outcomes. The once well-controlled integration of real and synthetic data is now becoming uncontrollable: the widespread, unregulated dissemination of synthetic data online contaminates datasets traditionally compiled through web scraping, mixing them with unlabelled synthetic data. This trend, known as the AI autophagy phenomenon, suggests a future where generative AI systems may increasingly consume their own outputs without discernment, raising concerns about model performance, reliability, and ethical implications. What will happen if generative AI continuously consumes itself without discernment? What measures can we take to mitigate the potential adverse effects? To address these research questions, this study examines the existing literature, delving into the consequences of AI autophagy, analysing the associated risks, and exploring strategies to mitigate its impact. Our aim is to provide a comprehensive perspective on this phenomenon, advocating for a balanced approach that promotes the sustainable development of generative AI technologies in the era of large models.
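The self-consuming loop described above can be illustrated with a deliberately simplified simulation (not taken from the study itself): a one-dimensional Gaussian "generator" is repeatedly refit on its own samples, with no real data re-entering the pipeline. Estimation noise compounds across generations, and the learned distribution's spread collapses. All parameter choices here (sample size, number of generations) are illustrative assumptions.

```python
import numpy as np

# Toy model of AI autophagy: at each generation the "model" (a Gaussian
# with parameters mu, sigma) generates synthetic data, then is retrained
# on that synthetic data alone -- no fresh real data is ever added.
rng = np.random.default_rng(0)

n_samples = 25        # small synthetic training set per generation
n_generations = 1000  # how many self-training rounds to simulate
mu, sigma = 0.0, 1.0  # the original "real data" distribution N(0, 1)

for _ in range(n_generations):
    synthetic = rng.normal(mu, sigma, n_samples)   # sample from current model
    mu, sigma = synthetic.mean(), synthetic.std()  # refit on synthetic data only

# Sampling noise in the refitted std accumulates multiplicatively, so the
# spread of the learned distribution shrinks far below the original 1.0.
print(f"std after {n_generations} generations: {sigma:.3e}")
```

The steady loss of variance here is a toy analogue of the degradation the abstract warns about: once a generator trains only on its own outputs, the diversity of what it can produce narrows with each generation. Mixing in even a fixed pool of real data at each refit would slow or halt this collapse, which is the "strategic balance" of real and synthetic data the abstract calls for.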