Behavioral simulation and strategic problem solving are different tasks. Large language models are increasingly explored as agents in policy-facing institutional simulations, but stronger reasoning need not improve behavioral sampling. We study this solver-sampler mismatch in three multi-agent negotiation environments: two trading-limits scenarios with different authority structures and a grid-curtailment case in emergency electricity management. Across two primary model families, native-reasoning conditions, and often no-reflection conditions as well, collapse toward authority-heavy outcomes. The sharpest case is DeepSeek with native reasoning in the grid-curtailment transfer: it reaches an action entropy of 1.256 and a concession-arc rate of 0.933, yet still ends in an authority decision in all 15 of 15 runs. A direct OpenAI extension broadens provider coverage and shows the same pressure: GPT-5.2 with native reasoning ends in authority decisions in all 45 of 45 runs across the three environments. Budget-matched no-reflection controls and orthogonal private-state controls remain rigid, while the negotiation-structured scaffold is the only condition that consistently opens negotiated outcomes. These diagnostics are failure screens within a fixed negotiation grammar, not evidence of external behavioral realism or policy-forecasting validity. The results show that neither more output space nor generic extra private state rescues solver-like sampler failure. For institutional simulation, solver strength and sampler qualification are different objectives: models should be evaluated for the behavioral role they are meant to play, not only for strategic capability.
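The DeepSeek grid-curtailment case pairs a varied action distribution with a uniform terminal outcome, which suggests that in-run diversity and end-state diversity should be screened separately. The following is a minimal sketch of such a screen, not the authors' implementation. It assumes that action entropy is the Shannon entropy of the empirical action frequencies in nats; the abstract does not state the definition or the logarithm base. The action labels, the `authority_decision` outcome label, and the run records are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Shannon entropy (in nats) of the empirical action distribution.

    Assumption: the abstract does not define its entropy measure, so the
    natural-log Shannon entropy is used here for illustration only.
    """
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def authority_rate(outcomes, authority_label="authority_decision"):
    """Fraction of runs whose terminal outcome is an authority decision.

    The label string is a hypothetical placeholder.
    """
    if not outcomes:
        return 0.0
    return sum(1 for o in outcomes if o == authority_label) / len(outcomes)


# Hypothetical run records: varied actions, identical terminal outcome.
runs = [
    {"actions": ["propose", "concede", "reject", "escalate"],
     "outcome": "authority_decision"},
    {"actions": ["propose", "propose", "concede", "escalate"],
     "outcome": "authority_decision"},
]

all_actions = [a for run in runs for a in run["actions"]]
entropy = action_entropy(all_actions)
rate = authority_rate([run["outcome"] for run in runs])

# High action entropy together with an authority rate of 1.0 is the
# pattern the abstract describes: diverse moves, collapsed end state.
print(f"action entropy: {entropy:.3f} nats, authority rate: {rate:.2f}")
```

Reporting the two quantities side by side keeps a high-entropy condition from being read as a qualified sampler when every run still terminates the same way.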