Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strategic communication, and binding constraints. These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier. Existing LLM negotiation evaluations rely on LLM-vs.-LLM interaction or aggregate outcomes such as deal rate, leaving failures opaque. We introduce Terms-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure. We instantiate it in bilateral price negotiation, where the counterpart's private state and simulator policy are hidden from the agent but observable to the evaluator. This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps. Evaluating 13 LLM agents spanning frontier systems from major providers, Terms-Bench turns negotiation evaluation from aggregate ranking into actionable diagnosis: where agents fail, why they fail, and what to strengthen. Empirically, frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.