What Suppresses Nash Equilibrium Play in Large Language Models? Mechanistic Evidence and Causal Control

LLM agents are known to deviate from Nash equilibria in strategic interactions, but nobody has looked inside the model to understand why, or asked whether the deviation can be reversed. We do both. Working with four open-source models (Llama-3 and Qwen2.5, 8B to 72B parameters) playing four canonical two-player games, we establish the behavioral picture through self-play and cross-play experiments, then open up the 32-layer Llama-3-8B model and examine what actually happens during a strategic decision. The mechanistic findings are clear. Opponent history is encoded with near-perfect fidelity at the first layer (96% probe accuracy) and consumed progressively by later layers, while Nash action encoding is weak throughout, with probe accuracy never exceeding 56%. There is no dedicated Nash module. Instead, the model privately favors the Nash action through most of its forward pass, but a prosocial override concentrated in the final layers reverses this preference, with the probability of cooperation reaching 84% at layer 30. When we inject a learned Nash direction into the residual stream, behavior shifts in both directions, a result we confirm through concept clamping. The behavioral experiments surface six scale- and architecture-dependent findings, the most notable being that chain-of-thought reasoning worsens Nash play in small models but yields near-perfect Nash play in models above 70B parameters. The cross-play experiments reveal three phenomena invisible in self-play: a small model can unravel any partner's cooperation by defecting early; two large models reinforce each other's cooperative instincts indefinitely; and who moves first in a coordination game determines which Nash equilibrium the system reaches. LLMs do not lack Nash-playing competence. They compute it, then suppress it.
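To make the two interventions named above concrete, here is a minimal sketch of direction injection and clamping on a toy residual vector. It is an illustration under stated assumptions, not the procedure used in the experiments: the vector `h`, the `nash_direction`, the `readout` weights, and the helper names `steer`, `clamp`, and `p_nash` are all hypothetical, and a logistic readout stands in for the model's output head.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def steer(h, direction, alpha):
    """Additive injection: h' = h + alpha * d_hat (alpha may be negative)."""
    d = unit(direction)
    return [x + alpha * y for x, y in zip(h, d)]

def clamp(h, direction, target):
    """Clamping: fix the component of h along d_hat at `target`,
    leaving the orthogonal part of h unchanged."""
    d = unit(direction)
    current = dot(h, d)
    return [x + (target - current) * y for x, y in zip(h, d)]

def p_nash(h, readout):
    """Toy readout (assumption): logistic of a linear score."""
    return 1.0 / (1.0 + math.exp(-dot(h, readout)))

# Hypothetical 4-dimensional values, chosen only for illustration.
h = [0.2, -1.0, 0.5, 0.1]
nash_direction = [1.0, 1.0, 0.0, 0.0]
readout = [0.8, 0.6, -0.1, 0.0]

baseline = p_nash(h, readout)
toward = p_nash(steer(h, nash_direction, 3.0), readout)   # push along the direction
away = p_nash(steer(h, nash_direction, -3.0), readout)    # push against it
clamped = clamp(h, nash_direction, 2.0)

print(f"baseline={baseline:.3f} toward={toward:.3f} away={away:.3f}")
```

In this toy setting, a positive coefficient raises the readout probability and a negative one lowers it, which is the sense of "bidirectional" used here. In a real model the same arithmetic would typically be applied to a layer's hidden states during the forward pass rather than to a standalone vector.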
