Because large language models (LLMs) can produce natural language that is sometimes indistinguishable from text written by people, some researchers are starting to consider replacing human participants with LLM simulations. In this study, we test the extent to which the findings of a simulation with an LLM prompted to act as a synthetic participant match those obtained from 30 human participants. In our experiments, we evaluate how well writing-style preference inference algorithms adapt to a participant over repeated interactions, compared to a baseline. We find hints of bias and a lack of depth in GPT-4o's text generation and judgement that prevent it from accurately simulating people's behavior. Our results also hint at human biases that highlight the importance of considering human factors in the evaluation of systems that depend on human-automation interaction. Rather than treating these discrepancies as evidence for or against the validity of LLM-simulated participants, we present this study as a case analysis of methodological and design challenges.