Under the Surface: Tracking the Artifactuality of LLM-Generated Data

Debarati Das, Karin De Langis, Anna Martin-Boyle, Jaehyung Kim, Minhwa Lee, Zae Myung Kim, Shirley Anugrah Hayati, Risako Owan, Bin Hu, Ritik Parkar, Ryan Koo, Jonginn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy, Vincent Liu, Dongyeop Kang

Source: arXiv | Core authors: Debarati Das, Karin De Langis, Anna Martin-Boyle, Jaehyung Kim, Minhwa Lee, and Zae Myung Kim | Project lead: Debarati Das | PI: Dongyeop Kang

This work examines the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly used to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. Because these forms of LLM-generated data often intersect in application, they influence one another and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, giving rise to an artificial data ecosystem. To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data, from more tightly constrained data such as "task labels" to more lightly constrained "free-form text". We then stress-test the quality and implications of LLM-generated artificial data, comparing it with human data across various existing benchmarks. Although artificial data can match human performance, this paper reveals significant hidden disparities, especially in complex tasks where LLMs often miss the nuanced understanding intrinsic to human-generated content. This study critically examines diverse LLM-generated data and emphasizes the need for ethical practices both in data creation and in the use of LLMs. It highlights LLMs' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing the biases and artifacts in LLM-generated content in future research and development. All data and code are available on our project page.

