The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models

Negotiation is a fundamental challenge for AI agents because it requires reasoning strategically, modelling opponents, and balancing cooperation with competition. We present the first comprehensive study that systematically evaluates how explicit reasoning training affects the negotiation abilities of both commercial and open-weight large language models, comparing these models with their vanilla counterparts across three languages. Using a self-play setup across three diverse dialogue games, we analyse the trade-off between performance and cost, the language consistency of reasoning processes, and the nature of the strategic adaptation that models exhibit. Our findings show that enabling reasoning (that is, scaling test-time compute) significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but at a substantial computational cost: reasoning improves GPT-5's performance by 31.4% while increasing its cost by nearly 400%. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian, which may undermine the explainability gains expected from disclosing reasoning traces, whereas a leading commercial model keeps its reasoning and final output in the same language.
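To make the reported trade-off concrete, the short sketch below computes performance per unit cost for the reasoning variant relative to its non-reasoning baseline. It is an illustration only and rests on two assumptions not stated in the abstract: the 31.4% figure is read as a relative gain (not percentage points), and "nearly 400%" is rounded to exactly 400%. The function name `relative_efficiency` is hypothetical.

```python
def relative_efficiency(perf_gain_pct: float, cost_increase_pct: float) -> float:
    """Performance per unit cost of the reasoning variant, with the baseline set to 1.0."""
    perf_ratio = 1.0 + perf_gain_pct / 100.0      # e.g. +31.4% -> 1.314x performance
    cost_ratio = 1.0 + cost_increase_pct / 100.0  # e.g. +400%  -> 5.0x cost
    return perf_ratio / cost_ratio


# Figures from the abstract; the cost increase is rounded to 400% (assumption).
ratio = relative_efficiency(31.4, 400.0)
print(f"{ratio:.3f}")  # prints 0.263
```

Under these assumptions, each unit of spend buys roughly a quarter of the performance it buys without reasoning, even though absolute performance is higher.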
