In multi-agent environments, agents often struggle to learn optimal policies from sparse or delayed global rewards, particularly in long-horizon tasks where actions at intermediate time steps are hard to evaluate. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), a novel approach to the agent-temporal credit assignment problem that redistributes sparse rewards both across time steps and across agents. TAR$^2$ decomposes a sparse global reward into time-step-specific rewards and computes each agent's contribution to them. We prove theoretically that TAR$^2$ is equivalent to potential-based reward shaping, so the optimal policy remains unchanged. Empirical results show that TAR$^2$ stabilizes and accelerates learning, and that TAR$^2$ integrated with single-agent reinforcement learning algorithms performs as well as or better than traditional multi-agent reinforcement learning methods.
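To make the two claims above concrete, one standard way to formalize them is as follows; the symbols $R(\tau)$, $r_t^i$, $F$, and $\Phi$ are our own notation for a plausible formalization, not necessarily the paper's. The redistributed per-agent, per-step rewards must sum back to the episodic return, and the shaping term must be potential-based:

$$\sum_{t=0}^{T-1} \sum_{i=1}^{n} r_t^i = R(\tau), \qquad F(s_t, s_{t+1}) = \gamma \, \Phi(s_{t+1}) - \Phi(s_t).$$

Because the potential-based term telescopes, $\sum_{t=0}^{T-1} \gamma^t F(s_t, s_{t+1}) = \gamma^T \Phi(s_T) - \Phi(s_0)$, every policy's return shifts by the same action-independent constant, which is why the set of optimal policies is preserved (Ng et al., 1999).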
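As a concrete illustration of the two-stage decomposition (first over time steps, then over agents), here is a minimal sketch, assuming softmax-normalized relevance scores produced by learned networks; the function `redistribute_return` and its parameterization are our own illustration, not the paper's exact architecture.

```python
import numpy as np

def redistribute_return(global_return, temporal_logits, agent_logits):
    """Redistribute one episodic return over T time steps and n agents.

    temporal_logits: shape (T,)   -- learned per-step relevance scores
    agent_logits:    shape (T, n) -- learned per-agent scores at each step
    Returns an array of shape (T, n) of dense rewards that sum exactly
    to `global_return`, so reward is reallocated, never created.
    """
    # Temporal decomposition: a softmax over time steps gives the
    # fraction of the return attributed to each step.
    tw = np.exp(temporal_logits - temporal_logits.max())
    tw /= tw.sum()                                   # shape (T,)

    # Agent decomposition: a softmax over agents splits each step's
    # share among the agents acting at that step.
    aw = np.exp(agent_logits - agent_logits.max(axis=1, keepdims=True))
    aw /= aw.sum(axis=1, keepdims=True)              # shape (T, n)

    return global_return * tw[:, None] * aw          # shape (T, n)

if __name__ == "__main__":
    T, n = 8, 3
    rng = np.random.default_rng(0)
    dense = redistribute_return(10.0, rng.normal(size=T),
                                rng.normal(size=(T, n)))
    assert np.isclose(dense.sum(), 10.0)  # total reward is preserved
```

In this sketch the two softmax normalizations make the sum-preservation constraint hold by construction, rather than being enforced through a regression penalty; the dense rewards can then be fed directly to a single-agent learner in place of the sparse global signal.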