How Exploration Breaks Cooperation in Shared-Policy Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning (MARL) in dynamic social dilemmas commonly relies on parameter sharing to enable scalability. We show that in shared-policy Deep Q-Network (DQN) learning, standard exploration can induce a robust and systematic collapse of cooperation, even in environments where fully cooperative equilibria are stable and payoff-dominant. Through controlled experiments, we demonstrate that shared DQN converges to stable but persistently low-cooperation regimes. This collapse is caused not by reward misalignment, noise, or insufficient training, but by a representational failure that arises from partial observability combined with parameter coupling across heterogeneous agent states. Exploration-driven updates bias the shared representation toward locally dominant defection responses, which then propagate across agents and suppress cooperative learning. We confirm that the failure persists across network sizes, exploration schedules, and payoff structures, and that it disappears when parameter sharing is removed or when agents maintain independent representations. These results identify a fundamental failure mode of shared-policy MARL and establish structural conditions under which scalable learning architectures can systematically undermine cooperation. Our findings provide concrete guidance for designing multi-agent learning systems for social and economic environments in which collective behavior is critical.
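
The parameter-coupling mechanism described above can be illustrated with a minimal sketch. It assumes a tabular Q-function as a stand-in for the shared Deep Q-Network, a two-action cooperate/defect game, epsilon-greedy exploration, and hypothetical names (`TabularQ`, `MultiAgentLearner`, the observation label `"partner_defected"`, the reward value). It is not the authors' implementation and does not reproduce the reported experiments or the collapse itself; it shows only that, with parameter sharing, an update driven by one agent's experience changes the greedy action of another agent that receives the same partial observation, whereas with independent models it does not.

```python
import random
from collections import defaultdict

COOPERATE, DEFECT = 0, 1
ACTIONS = (COOPERATE, DEFECT)


class TabularQ:
    """Hypothetical tabular stand-in for a Q-network: observation -> one value per action."""

    def __init__(self):
        # Unseen observations start at zero for both actions.
        self.q = defaultdict(lambda: [0.0, 0.0])

    def values(self, obs):
        return self.q[obs]

    def update(self, obs, action, reward, next_obs, alpha=0.1, gamma=0.95):
        # One-step Q-learning target r + gamma * max_a' Q(o', a').
        # No terminal masking here; unseen next observations simply bootstrap from zero.
        target = reward + gamma * max(self.q[next_obs])
        self.q[obs][action] += alpha * (target - self.q[obs][action])


class MultiAgentLearner:
    """Hypothetical helper holding one shared model or one model per agent."""

    def __init__(self, n_agents, shared, epsilon=0.1, seed=0):
        self.shared = shared
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        n_models = 1 if shared else n_agents
        self.models = [TabularQ() for _ in range(n_models)]

    def model_for(self, agent_id):
        # With parameter sharing, every agent reads and writes the same model.
        return self.models[0] if self.shared else self.models[agent_id]

    def act(self, agent_id, obs):
        # Epsilon-greedy: uniform random action with probability epsilon, else greedy.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        q = self.model_for(agent_id).values(obs)
        # max() returns the first maximal action, so ties break toward COOPERATE.
        return max(ACTIONS, key=lambda a: q[a])

    def learn(self, agent_id, obs, action, reward, next_obs):
        self.model_for(agent_id).update(obs, action, reward, next_obs)


# Usage: agent 0 is rewarded (hypothetical payoff) for defecting after an observation
# that agent 1 can also receive, even if agent 1's underlying state differs
# (the aliasing that partial observability introduces).
shared = MultiAgentLearner(n_agents=2, shared=True, epsilon=0.0)
independent = MultiAgentLearner(n_agents=2, shared=False, epsilon=0.0)
for learner in (shared, independent):
    learner.learn(0, "partner_defected", DEFECT, reward=1.0, next_obs="end")

print(shared.act(1, "partner_defected"))       # 1 (DEFECT): agent 0's update reached agent 1's policy
print(independent.act(1, "partner_defected"))  # 0 (COOPERATE): agent 1's own model is untouched
```

Setting `epsilon=0.0` in the usage lines makes action selection purely greedy so the effect of the single update is visible; during training a nonzero epsilon would generate the exploratory defections whose updates, under sharing, are written into the one model every agent uses.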