ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi-agent setting it must preserve the strategic structure of the problem rather than merely improve short-term optimization. We propose Automatic Reward-shaping in Multi-agent Systems (ARMS), a self-supervised reward shaping framework for MARL that learns dense shaping signals from sparse environmental rewards through trajectory ranking. Because single-agent trajectory-ranking guarantees do not transfer directly to MARL, we reformulate policy invariance through conditional best-response reasoning and show that, under certain conditions, the shaping rewards preserve each agent's best-response set under fixed opponent policies and consequently preserve the set of Nash equilibria. Guided by this perspective, ARMS alternates between policy learning and reward learning while sharing shaping parameters across agents for efficiency. Experiments in a partially observable multi-agent pathfinding domain show that ARMS improves sample efficiency as reward sparsity and agent count increase and generalizes to unseen environments. They also reveal a MARL-specific failure mode in which limited exploration and coupled policy-reward dynamics induce oscillatory behavior; increasing exploration mitigates this effect and stabilizes learning. To the best of our knowledge, ARMS is the first automatic reward shaping framework for MARL whose design is motivated by a game-theoretic equilibrium-preservation result.
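To make the idea of learning a dense shaping signal from trajectory rankings concrete, the following is a minimal sketch and not the ARMS implementation. It assumes a linear shaping reward over hand-made per-step feature vectors and a Bradley-Terry pairwise ranking loss; the abstract specifies neither, and all names (`shaped_return`, `ranking_loss_and_grad`, `fit_shaping_weights`) and the toy data are hypothetical. Trajectories are ranked only by their sparse outcome (success versus failure), and the learned weights assign per-step rewards that reproduce that ranking.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def shaped_return(weights, trajectory):
    """Total shaping reward of a trajectory under an assumed linear model w . phi(s_t)."""
    return sum(
        sum(w * f for w, f in zip(weights, features)) for features in trajectory
    )


def ranking_loss_and_grad(weights, better, worse):
    """Bradley-Terry loss -log sigmoid(R(better) - R(worse)) and its gradient in w."""
    diff = shaped_return(weights, better) - shaped_return(weights, worse)
    loss = softplus(-diff)
    prob_wrong = math.exp(-softplus(diff))  # equals sigmoid(-diff)
    grad = []
    for i in range(len(weights)):
        # Derivative of diff w.r.t. w_i is the gap in summed feature i.
        feature_gap = sum(s[i] for s in better) - sum(s[i] for s in worse)
        grad.append(-prob_wrong * feature_gap)
    return loss, grad


def fit_shaping_weights(ranked_pairs, dim, lr=0.1, epochs=200):
    """Plain gradient descent over (better, worse) trajectory pairs."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for better, worse in ranked_pairs:
            _, grad = ranking_loss_and_grad(weights, better, worse)
            weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Toy data (hypothetical): each step is [progress_to_goal, distractor].
success = [[0.2, 0.5], [0.6, 0.1], [1.0, 0.3]]  # sparse return 1
failure = [[0.1, 0.4], [0.1, 0.6], [0.2, 0.2]]  # sparse return 0

initial_loss, _ = ranking_loss_and_grad([0.0, 0.0], success, failure)
weights = fit_shaping_weights([(success, failure)], dim=2)
final_loss, _ = ranking_loss_and_grad(weights, success, failure)
print(weights, initial_loss, final_loss)
```

In a multi-agent setting such as the one described above, a sketch like this would be applied with one set of weights shared across agents and interleaved with policy updates. Whether the resulting shaping preserves best-response sets depends on the conditions established in the paper, which this example does not check.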
