As large language models (LLMs) are increasingly embedded in collaborative human activities such as business negotiations and group coordination, it becomes critical to evaluate both the performance gains they can achieve and how they interact in dynamic, multi-agent environments. Unlike traditional statistical agents such as Bayesian models, which may excel under well-specified conditions, LLMs can generalize across diverse, real-world scenarios, raising new questions about how their strategies and behaviors compare with those of humans and other agent types. In this work, we compare outcomes and behavioral dynamics across humans (N = 216), LLMs (GPT-4o, Gemini 1.5 Pro), and Bayesian agents in a dynamic negotiation setting under identical conditions. Bayesian agents extract the highest surplus through aggressive optimization, at the cost of frequent trade rejections. Humans and LLMs achieve similar overall surplus, but through distinct behaviors: LLMs favor conservative, concessionary trades with few rejections, while humans employ more strategic, risk-taking, and fairness-oriented behaviors. We thus find that performance parity, a common benchmark in agent evaluation, can conceal fundamental differences in process and alignment that are critical for practical deployment in real-world coordination tasks. By establishing behavioral baselines under matched conditions, this work lays a foundation for future studies in more applied, variable-rich environments.