This work examines the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. Because these forms of LLM-generated data often intersect in their application, they influence one another and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, giving rise to an artificial data ecosystem. To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data, from more tightly constrained data such as "task labels" to more lightly constrained "free-form text". We then stress test the quality and implications of LLM-generated artificial data, comparing it with human data across various existing benchmarks. Although artificial data can match human performance, this paper reveals significant hidden disparities, especially in complex tasks where LLMs often miss the nuanced understanding intrinsic to human-generated content. This study critically examines diverse LLM-generated data and emphasizes the need for ethical practices both in data creation and in the use of LLMs. It highlights LLMs' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing the biases and artifacts in LLM-generated content in future research and development. All data and code are available on our project page.