The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising yet methodologically risky paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, in which they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel history-conditioned reply prediction task on authentic X (formerly Twitter) data, yielding a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.