The increasing use of Large Language Models (LLMs) as proxies for human participants in social science research presents a promising yet methodologically risky paradigm shift. While LLMs offer scalability and cost-efficiency, their "naive" application, in which they are prompted to generate content without explicit behavioral constraints, introduces significant linguistic discrepancies that challenge the validity of research findings. This paper addresses these limitations by introducing a novel history-conditioned reply prediction task on authentic X (formerly Twitter) data, yielding a dataset designed to evaluate the linguistic output of LLMs against human-generated content. We analyze these discrepancies using stylistic and content-based metrics, providing a quantitative framework for researchers to assess the quality and authenticity of synthetic data. Our findings highlight the need for more sophisticated prompting techniques and specialized datasets to ensure that LLM-generated content accurately reflects the complex linguistic patterns of human communication, thereby improving the validity of computational social science studies.