In the rapidly evolving field of large language models (LLMs), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without additional data collection. This survey explores the transformative impact of LLMs on DA, addressing the unique challenges and opportunities they present in natural language processing (NLP) and beyond. From both data and learning perspectives, we examine strategies that leverage LLMs for data augmentation, including a novel exploration of learning paradigms in which LLM-generated data is used for diverse forms of further training. We also discuss the primary open challenges in this domain, ranging from controllable data augmentation to multimodal data augmentation. By highlighting the paradigm shift that LLMs introduce in DA, this survey aims to serve as a comprehensive guide for researchers and practitioners.