Markets increasingly accommodate large language models (LLMs) as autonomous decision-making agents. As this transition occurs, it becomes critical to evaluate how these agents behave relative to their human and task-specific statistical predecessors. In this work, we present results from an empirical study comparing humans (N=216), multiple frontier LLMs, and customized Bayesian agents in dynamic multi-player bargaining games under identical conditions. Bayesian agents extract the highest surplus with aggressive trade proposals that are frequently rejected. Humans and LLMs achieve comparable aggregate surplus within their groups, but exhibit different trading strategies. LLMs favor conservative, concessionary proposals that are usually accepted by other LLMs, while humans propose trades that are consistent with fairness norms but are more likely to be rejected. These findings highlight that performance parity -- a common benchmark in agent evaluation -- can mask substantive procedural differences in how LLMs behave in complex multi-agent interactions.