Generative artificial intelligence (AI) technologies and large models now produce realistic outputs across domains such as images, text, speech, and music. Building these advanced generative models demands substantial resources, particularly large, high-quality datasets. To reduce training costs, many algorithm developers use data generated by the models themselves as an inexpensive training source. However, not all synthetic data improves model performance, so the use of real versus synthetic data must be balanced strategically to optimize outcomes. The once well-controlled integration of real and synthetic data is now becoming uncontrollable: synthetic data spreads widely and without regulation online, contaminating datasets traditionally compiled through web scraping with unlabeled synthetic content. This trend portends a future in which generative AI systems increasingly, and blindly, consume their own outputs, raising concerns about both model performance and ethics. What will happen if generative AI continuously consumes itself without discernment? What measures can we take to mitigate the potential adverse effects? There is a significant gap in the scientific literature regarding the impact of synthetic data use in generative AI, particularly concerning the fusion of multimodal information. To address this gap, this review investigates the consequences of blindly incorporating synthetic data into the training of generative AI in both the image and text modalities, and explores strategies to mitigate these effects. The goal is to offer a comprehensive view of synthetic data's role, advocating a balanced approach to its use and exploring practices that promote the sustainable development of generative AI technologies in the era of large models.