We examine whether large language models (LLMs) can predict biased decision-making in conversational settings, and whether their predictions capture not only human cognitive biases but also how those effects change under cognitive load. In a pre-registered study (N = 1,648), participants completed six classic decision-making tasks via a chatbot, with dialogues of varying complexity. Participants exhibited two well-documented cognitive biases: the Framing Effect and the Status Quo Bias. Increasing dialogue complexity led participants to report higher mental demand, and this increase in cognitive load selectively but significantly amplified both biases, demonstrating a load-bias interaction. We then evaluated whether LLMs (GPT-4, GPT-5, and open-source models) could predict individual decisions given demographic information and the prior dialogue. Although results were mixed across choice problems, LLM predictions that incorporated dialogue context were significantly more accurate in several key scenarios. Importantly, these predictions reproduced the same bias patterns and load-bias interactions observed in humans. Across all models tested, the GPT-4 family aligned most consistently with human behavior, outperforming GPT-5 and open-source models in both predictive accuracy and fidelity to human-like bias patterns. These findings advance our understanding of LLMs as tools for simulating human decision-making and inform the design of conversational agents that adapt to user biases.