Policy gradient methods have become one of the most popular classes of algorithms for multi-agent reinforcement learning. Many of these methods, however, do not address a key challenge: multi-agent credit assignment, that is, assessing each agent's contribution to the overall performance, which is crucial for learning good policies. We propose a novel algorithm called Dr.Reinforce that explicitly tackles this challenge by combining difference rewards with policy gradients, allowing decentralized policies to be learned when the reward function is known. By differencing the reward function directly, Dr.Reinforce avoids the difficulties associated with learning the Q-function, as done by Counterfactual Multiagent Policy Gradients (COMA), a state-of-the-art difference rewards method. For applications where the reward function is unknown, we show the effectiveness of a version of Dr.Reinforce that learns an additional reward network used to estimate the difference rewards.
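To make the idea of differencing a known reward function concrete, the following is a minimal sketch and not the authors' implementation. It assumes one common form of difference reward: the team reward for the joint action minus a counterfactual baseline in which agent i's action is replaced by each alternative, weighted by agent i's own policy. The names `difference_rewards`, `toy_reward`, and `policies` are hypothetical and introduced only for illustration.

```python
from typing import Any, Callable, List, Sequence, Tuple

JointAction = Tuple[int, ...]


def difference_rewards(
    reward_fn: Callable[[Any, JointAction], float],
    state: Any,
    joint_action: JointAction,
    policies: Sequence[Sequence[float]],
) -> List[float]:
    """Per-agent difference rewards for one (state, joint action) pair.

    Assumed form (not confirmed by the abstract):
        D_i = r(s, a) - sum_c pi_i(c | s) * r(s, (c, a_{-i}))
    policies[i][c] is agent i's probability of action c in this state.
    """
    team_reward = reward_fn(state, joint_action)
    diffs = []
    for i, probs in enumerate(policies):
        baseline = 0.0
        for c, p in enumerate(probs):
            # Swap only agent i's action; everyone else keeps theirs.
            counterfactual = joint_action[:i] + (c,) + joint_action[i + 1:]
            baseline += p * reward_fn(state, counterfactual)
        diffs.append(team_reward - baseline)
    return diffs


def toy_reward(state: Any, joint_action: JointAction) -> float:
    # Hypothetical team reward: only agent 0's choice matters.
    return float(joint_action[0] == 1)


uniform = [[0.5, 0.5], [0.5, 0.5]]
print(difference_rewards(toy_reward, None, (1, 0), uniform))  # [0.5, 0.0]
```

In this toy case agent 1 receives a difference reward of zero because its action never changes the team reward, while agent 0 is credited for choosing the rewarding action. In a REINFORCE-style update, each agent's log-policy gradient would be weighted by its own difference signal rather than by the shared team reward.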