Generative Artificial Intelligence (AI) technologies and large models are producing realistic outputs across various domains, such as images, text, speech, and music. Creating these advanced generative models requires significant resources, particularly large, high-quality datasets. To minimise training expenses, many algorithm developers use data created by the models themselves as a cost-effective training solution. However, not all synthetic data effectively improve model performance, so the use of real versus synthetic data must be balanced strategically to optimise outcomes. The once well-controlled integration of real and synthetic data is now becoming uncontrollable: the widespread, unregulated dissemination of synthetic data online contaminates datasets traditionally compiled through web scraping, mixing them with unlabelled synthetic data. This trend, known as the AI autophagy phenomenon, suggests a future where generative AI systems may increasingly consume their own outputs without discernment, raising concerns about model performance, reliability, and ethical implications. What will happen if generative AI continuously consumes itself without discernment? What measures can we take to mitigate the potential adverse effects? To address these research questions, this study examines the existing literature, delving into the consequences of AI autophagy, analysing the associated risks, and exploring strategies to mitigate its impact. Our aim is to provide a comprehensive perspective on this phenomenon, advocating for a balanced approach that promotes the sustainable development of generative AI technologies in the era of large models.
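The self-consuming loop described above can be illustrated with a deliberately simplified simulation (not taken from the study itself): a one-dimensional Gaussian "generator" is repeatedly refit on its own samples, with no real data re-entering the pipeline. Estimation noise compounds across generations, and the learned distribution's spread collapses. All parameter choices here (sample size, number of generations) are illustrative assumptions.

```python
import numpy as np

# Toy model of AI autophagy: at each generation the "model" (a Gaussian
# with parameters mu, sigma) generates synthetic data, then is retrained
# on that synthetic data alone -- no fresh real data is ever added.
rng = np.random.default_rng(0)

n_samples = 25        # small synthetic training set per generation
n_generations = 1000  # how many self-training rounds to simulate
mu, sigma = 0.0, 1.0  # the original "real data" distribution N(0, 1)

for _ in range(n_generations):
    synthetic = rng.normal(mu, sigma, n_samples)   # sample from current model
    mu, sigma = synthetic.mean(), synthetic.std()  # refit on synthetic data only

# Sampling noise in the refitted std accumulates multiplicatively, so the
# spread of the learned distribution shrinks far below the original 1.0.
print(f"std after {n_generations} generations: {sigma:.3e}")
```

The steady loss of variance here is a toy analogue of the degradation the abstract warns about: once a generator trains only on its own outputs, the diversity of what it can produce narrows with each generation. Mixing in even a fixed pool of real data at each refit would slow or halt this collapse, which is the "strategic balance" of real and synthetic data the abstract calls for.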