When Reasoning Models Hurt Behavioral Simulation: A Solver-Sampler Mismatch in Multi-Agent LLM Negotiation

Behavioral simulation and strategic problem solving are different tasks. Large language models are increasingly explored as agents in policy-facing institutional simulations, but stronger reasoning need not improve behavioral sampling. We study this solver-sampler mismatch in three multi-agent negotiation environments: two trading-limits scenarios with different authority structures and a grid-curtailment case in emergency electricity management. Across two primary model families, native reasoning, and often the no-reflection condition as well, collapses toward authority-heavy outcomes. The sharpest case is DeepSeek native reasoning in the grid-curtailment transfer: it reaches an action entropy of 1.256 and a concession-arc rate of 0.933, yet still ends in an authority decision in 15 of 15 runs. An extension to OpenAI models shows the same pressure across providers: GPT-5.2 native reasoning ends in authority decisions in 45 of 45 runs across the three environments. Budget-matched no-reflection controls and orthogonal private-state controls remain rigid, while the negotiation-structured scaffold is the only condition that consistently opens negotiated outcomes. These diagnostics are failure screens within a fixed negotiation grammar, not evidence of external behavioral realism or policy-forecasting validity. The results show that neither more output space nor generic extra private state rescues solver-like sampler failure. For institutional simulation, solver strength and sampler qualification are different objectives: models should be evaluated for the behavioral role they are meant to play, not only for strategic capability.
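To make the contrast between surface diversity and outcome collapse concrete, the following is a minimal sketch of two such diagnostics. It assumes action entropy means the Shannon entropy of the empirical action distribution and that each run ends with a single outcome label; the abstract does not give the authors' exact definitions, logarithm base, or label names, so the function names, the `"authority_decision"` label, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def action_entropy(actions, base=math.e):
    """Shannon entropy of the empirical distribution over action labels.

    Assumption: the paper's "action entropy" is of this form; the log base
    (nats here by default) is not stated in the abstract.
    """
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log(c / total, base) for c in counts.values()
    )


def authority_rate(outcomes, label="authority_decision"):
    """Fraction of runs whose final outcome equals the (hypothetical) label."""
    if not outcomes:
        return 0.0
    return sum(o == label for o in outcomes) / len(outcomes)


# Toy data, not taken from the paper: varied actions, uniform outcomes.
toy_actions = ["propose", "concede", "reject", "escalate", "concede", "propose"]
toy_outcomes = ["authority_decision"] * 15

entropy = action_entropy(toy_actions)
rate = authority_rate(toy_outcomes)
```

In this toy case the entropy is well above zero while the authority rate is 1.0, which mirrors the pattern the abstract reports: varied moves within a run do not by themselves imply varied endings.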

