This work examines the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly used to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. Because these forms of LLM-generated data often intersect in application, they influence one another and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, giving rise to an artificial data ecosystem. To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data, from tightly constrained data such as "task labels" to lightly constrained "free-form text". We then stress test the quality and implications of LLM-generated artificial data, comparing it with human data across various existing benchmarks. Although artificial data can match human performance, this paper reveals significant hidden disparities, especially in complex tasks where LLMs often miss the nuanced understanding intrinsic to human-generated content. This study critically examines diverse LLM-generated data and emphasizes the need for ethical practices in creating data and in using LLMs. It highlights LLMs' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing the biases and artifacts in LLM-generated content in future research and development. All data and code are available on our project page.