What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control

LLM agents are known to deviate from Nash equilibria in strategic interactions, but nobody has looked inside the model to understand why, or asked whether the deviation can be reversed. We do both. Working with four open-source models (from the Llama-3 and Qwen2.5 families, 8B to 72B parameters) playing four canonical two-player games, we first establish the behavioral picture through self-play and cross-play experiments, then open up the 32-layer Llama-3-8B model and examine what happens during a strategic decision. The mechanistic findings are clear. Opponent history is encoded with near-perfect fidelity at the first layer (96% probe accuracy) and consumed progressively in later layers, while Nash-action encoding is weak throughout, never exceeding 56%. There is no dedicated Nash module. Instead, the model privately favors the Nash action through most of its forward pass, but a prosocial override, rooted in pretraining on human text and concentrated in the final layers, reverses this preference; the probability of cooperation reaches 84% at layer 30. Injecting a learned Nash direction into the residual stream shifts behavior in both directions, and concept clamping confirms that the effect is causal. The behavioral experiments surface six scale- and architecture-dependent findings, the most notable being that chain-of-thought reasoning worsens Nash play in small models but yields near-perfect Nash play above 70B parameters. The cross-play experiments reveal three phenomena invisible in self-play: a small model can unravel any partner's cooperation by defecting early; two large models reinforce each other's cooperative instincts indefinitely; and who moves first determines which Nash equilibrium the system reaches. LLMs do not lack Nash-playing competence. They compute the Nash action, then suppress it.
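To make the residual-stream intervention concrete, the following is a minimal toy sketch, not the procedure used in this work. It assumes the Nash direction is a normalized difference of mean hidden states (the abstract says only that the direction is learned), and it stands in for the model's output head with a hypothetical logistic readout. All vectors, the `readout` weights, and the steering strength `alpha` are illustrative values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def mean_difference_direction(nash_states, coop_states):
    # Assumption: direction = mean(Nash states) - mean(cooperative states), normalized.
    dim = len(nash_states[0])
    mean_nash = [sum(s[i] for s in nash_states) / len(nash_states) for i in range(dim)]
    mean_coop = [sum(s[i] for s in coop_states) / len(coop_states) for i in range(dim)]
    return unit([a - b for a, b in zip(mean_nash, mean_coop)])

def steer(hidden, direction, alpha):
    # Add alpha * direction to the hidden state; negative alpha steers the other way.
    return [h + alpha * d for h, d in zip(hidden, direction)]

def p_nash(hidden, readout):
    # Hypothetical logistic readout standing in for the model's output head.
    return 1.0 / (1.0 + math.exp(-dot(readout, hidden)))

# Toy 3-dimensional "residual stream" states (illustrative values only).
nash_states = [[1.0, 0.2, -0.5], [0.8, 0.0, -0.3]]
coop_states = [[-0.9, 0.1, 0.4], [-1.1, 0.3, 0.6]]
direction = mean_difference_direction(nash_states, coop_states)

readout = [1.5, 0.0, -1.0]
hidden = [-0.4, 0.2, 0.3]  # a state that initially leans toward cooperation

p_base = p_nash(hidden, readout)
p_plus = p_nash(steer(hidden, direction, 2.0), readout)
p_minus = p_nash(steer(hidden, direction, -2.0), readout)
print(f"baseline={p_base:.3f}  +direction={p_plus:.3f}  -direction={p_minus:.3f}")
```

In this toy setting, adding the direction raises the readout probability of the Nash action and subtracting it lowers that probability, which is the bidirectional pattern the intervention is meant to test. In a real model the same addition would be applied to the hidden state of a chosen layer during the forward pass.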
