Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strategic communication, and binding constraints. These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier. Existing LLM negotiation evaluations rely on LLM-vs.-LLM interaction or aggregate outcomes such as deal rate, leaving failures opaque. We introduce Terms-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure. We instantiate it in bilateral price negotiation, where the counterpart's private state and simulator policy are hidden from the agent but observable to the evaluator. This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps. In an evaluation of 13 LLM agents spanning frontier systems from major providers, Terms-Bench shifts negotiation evaluation from aggregate ranking to actionable diagnosis: where agents fail, why they fail, and what to strengthen. Empirically, frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.
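To make the "environment as verifier" idea concrete, the following is a minimal sketch of how a simulated counterpart with a known latent type could yield an oracle-reference optimality gap. It is not the Terms-Bench implementation: the `SellerType` fields, the geometric concession policy, the fixed round horizon, and the gap formula are all illustrative assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SellerType:
    """Hypothetical latent type: hidden from the agent, visible to the evaluator."""
    reservation_price: float  # lowest price the seller would ever accept
    opening_price: float      # seller's ask in round 0
    concession_rate: float    # fraction of the remaining gap conceded each round


def seller_ask(seller: SellerType, round_idx: int) -> float:
    """Assumed simulator policy: the ask decays geometrically toward the reservation price."""
    gap = seller.opening_price - seller.reservation_price
    return seller.reservation_price + gap * (1 - seller.concession_rate) ** round_idx


def negotiate(seller: SellerType, buyer_offers: Sequence[float]) -> Optional[float]:
    """Play the buyer's offers in order; return the deal price, or None if no deal."""
    for round_idx, offer in enumerate(buyer_offers):
        if offer >= seller_ask(seller, round_idx):
            return offer
    return None


def oracle_price(seller: SellerType, max_rounds: int) -> float:
    """Best price for a buyer who knows the type and policy: wait for the last-round ask."""
    return seller_ask(seller, max_rounds - 1)


def optimality_gap(seller: SellerType, buyer_value: float,
                   deal_price: Optional[float], max_rounds: int) -> float:
    """Share of the oracle's buyer surplus that the agent failed to capture."""
    oracle_surplus = buyer_value - oracle_price(seller, max_rounds)
    if oracle_surplus <= 0:
        raise ValueError("no profitable deal exists for this buyer value")
    achieved = 0.0 if deal_price is None else buyer_value - deal_price
    return (oracle_surplus - achieved) / oracle_surplus


# The evaluator knows the seller's type; the agent only sees the asks.
seller = SellerType(reservation_price=60.0, opening_price=100.0, concession_rate=0.5)
deal = negotiate(seller, buyer_offers=[50.0, 75.0, 80.0])
gap = optimality_gap(seller, buyer_value=90.0, deal_price=deal, max_rounds=4)
print(deal, gap)
```

In this toy run the asks are 100, 80, 70, and 65 over four rounds, so the buyer closes at 80 in round 2 while the oracle would have closed at 65, leaving a gap of 0.6. Because the evaluator observes the seller's type and policy, the shortfall is attributable to the agent (it settled too early) rather than to an opaque opponent, which is the diagnostic property the abstract describes.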