Large language models (LLMs) are increasingly used to simulate human behavior in social settings such as legal mediation, negotiation, and dispute resolution. However, it remains unclear whether these simulations reproduce the personality-behavior patterns observed in humans. Human personality, for instance, shapes how individuals navigate social interactions, including the strategies and behaviors they adopt in emotionally charged situations. This raises the question: can LLMs, when prompted with personality traits, reproduce personality-driven differences in human conflict behavior? To explore this, we introduce an evaluation framework that enables direct comparison of human-human and LLM-LLM behavior in dispute resolution dialogues with respect to Big Five Inventory (BFI) personality traits. The framework provides a set of interpretable metrics covering strategic behavior and conflict outcomes. We additionally contribute a novel methodology for constructing LLM dispute resolution dialogue datasets whose scenarios and personality traits are matched to those of human conversations. Finally, we demonstrate our evaluation framework on three contemporary closed-source LLMs and find significant divergences between how personality manifests in conflict across these LLMs and in human data, challenging the assumption that personality-prompted agents can serve as reliable behavioral proxies in socially impactful applications. Our work highlights the need for psychological grounding and validation of AI simulations before real-world use.