Under the Surface: Tracking the Artifactuality of LLM-Generated Data

Debarati Das, Karin De Langis, Anna Martin, Jaehyung Kim, Minhwa Lee, Zae Myung Kim, Shirley Hayati, Risako Owan, Bin Hu, Ritik Parkar, Ryan Koo, Jonginn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy, Vincent Liu, Dongyeop Kang

From arXiv | Core authors: Debarati Das, Karin De Langis, Anna Martin, Jaehyung Kim, Minhwa Lee, and Zae Myung Kim | Project lead: Debarati Das | PI: Dongyeop Kang

This work examines the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly used to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. Because these forms of LLM-generated data often intersect in application, they influence one another and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, giving rise to an artificial data ecosystem. To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data, from more tightly constrained data such as "task labels" to more lightly constrained "free-form text". We then stress test the quality and implications of LLM-generated artificial data, comparing it with human data across various existing benchmarks. Although artificial data can match human performance, this paper reveals significant hidden disparities, especially in complex tasks where LLMs often miss the nuanced understanding intrinsic to human-generated content. This study critically examines diverse LLM-generated data and emphasizes the need for ethical practices in data creation and in the use of LLMs. It highlights LLMs' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing the biases and artifacts in LLM-generated content in future research and development. All data and code are available on our project page.

