The Price of Paranoia: Robust Risk-Sensitive Cooperation in Non-Stationary Multi-Agent Reinforcement Learning

Cooperative equilibria are fragile. When agents learn alongside each other rather than in a fixed environment, the process of learning destabilizes the cooperation they are trying to sustain: every gradient step an agent takes shifts the distribution of actions its partner will play, turning a cooperative partner into a source of stochastic noise precisely where the cooperation decision is most sensitive. We study how this co-learning noise propagates through the structure of coordination games and find that the cooperative equilibrium, even when strongly Pareto-dominant, is exponentially unstable under standard risk-neutral learning, collapsing irreversibly once partner noise crosses the game's critical cooperation threshold. The natural response, applying distributional robustness to hedge against partner uncertainty, makes things strictly worse: risk-averse return objectives penalize the high-variance cooperative action relative to defection, widening the instability region rather than shrinking it. This paradox reveals a fundamental mismatch between the domain where robustness is applied and the domain where instability originates. We resolve it by showing that robustness should target the variance of the policy gradient update induced by partner uncertainty, not the return distribution. This distinction yields an algorithm whose gradient updates are modulated by an online measure of partner unpredictability, provably expanding the cooperation basin in symmetric coordination games. To unify the stability, sample complexity, and welfare consequences of this approach, we introduce the Price of Paranoia as the structural dual of the Price of Anarchy. Together with a novel Cooperation Window, it precisely characterizes how much welfare learning algorithms can recover under partner noise, pinning down the optimal degree of robustness as a closed-form balance between equilibrium stability and sample efficiency.
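As a purely illustrative sketch of the paradox described above (not the paper's algorithm or its formal definitions), consider a symmetric 2x2 stag-hunt game in which the partner cooperates with probability p. The payoff values, the function names, and the use of a mean-variance objective as a stand-in for a "risk-averse return objective" are all assumptions made for this example. Under these assumptions, the risk-neutral agent cooperates once p exceeds a threshold p*, and penalizing return variance pushes that threshold upward, shrinking the region in which cooperation is preferred.

```python
# Illustrative sketch only: payoffs, names, and the mean-variance objective
# are assumptions for this example, not definitions taken from the paper.

def cooperation_threshold(a, b, c, d):
    """Risk-neutral threshold p*: cooperating is a best response iff p >= p*.

    Payoffs to the row agent: a = (C, C), b = (C, D), c = (D, C), d = (D, D),
    with a > c and d > b (a coordination game).
    Derived from p*a + (1-p)*b >= p*c + (1-p)*d.
    """
    return (d - b) / ((a - c) + (d - b))


def mean_variance_value(hi, lo, p, beta):
    """Mean minus beta times variance of a return that is `hi` with
    probability p and `lo` with probability 1 - p."""
    mean = p * hi + (1 - p) * lo
    var = p * (1 - p) * (hi - lo) ** 2
    return mean - beta * var


def risk_averse_threshold(a, b, c, d, beta, grid=10_001):
    """Smallest grid value of p such that cooperation is weakly preferred
    under the mean-variance objective for every grid point from p up to 1."""
    threshold = 1.0
    for i in range(grid - 1, -1, -1):  # scan from p = 1 downward
        p = i / (grid - 1)
        coop = mean_variance_value(a, b, p, beta)
        defect = mean_variance_value(c, d, p, beta)
        if coop < defect:
            break
        threshold = p
    return threshold


# Hypothetical stag-hunt payoffs: (C, C) = 4, (C, D) = 0, (D, C) = 3, (D, D) = 2.
neutral = cooperation_threshold(4, 0, 3, 2)
averse = risk_averse_threshold(4, 0, 3, 2, beta=0.1)
print(f"risk-neutral threshold: {neutral:.4f}")
print(f"risk-averse threshold:  {averse:.4f}")
```

With these hypothetical payoffs the cooperative action has the larger payoff spread (4 versus 0, compared with 3 versus 2 for defection), so any variance penalty on returns falls mainly on cooperation. This is the mechanism the abstract attributes to return-level robustness; the gradient-level remedy it proposes is not reproduced here.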
