Intrinsic motivation is a widely used tool for exploration in multi-agent reinforcement learning. Many intrinsic rewards are computed by estimating variational posteriors with neural network approximators, and the limited expressive capacity of these approximators creates a problem. We identify this problem as the "revisitation" issue, in which agents repeatedly explore confined areas of the task space. To address it, we propose a dynamic reward scaling approach that stabilizes the large fluctuations of intrinsic rewards in previously explored areas and promotes broader exploration, thereby mitigating revisitation. Experiments show improved performance in challenging environments such as Google Research Football and StarCraft II micromanagement tasks, especially in sparse-reward settings.
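The abstract does not specify the scaling rule, so the following is only a minimal sketch of what scaling an intrinsic reward dynamically could look like, not the authors' method. It assumes a count-based rule in which the weight on the intrinsic reward decays as one over the square root of the visit count for a discretized region. The class name `VisitScaledReward`, the parameter `beta`, and the region keys are all hypothetical.

```python
import math
from collections import defaultdict


class VisitScaledReward:
    """Hypothetical scaler that damps intrinsic reward in frequently visited regions."""

    def __init__(self, beta: float = 1.0):
        self.beta = beta                # assumed base weight on the intrinsic reward
        self.visits = defaultdict(int)  # visit count per discretized region

    def scale(self, region, intrinsic_reward: float) -> float:
        self.visits[region] += 1
        # Assumed rule: the weight decays as 1/sqrt(n), so repeated visits
        # to the same region contribute progressively less.
        weight = self.beta / math.sqrt(self.visits[region])
        return weight * intrinsic_reward


scaler = VisitScaledReward(beta=1.0)
# The same raw intrinsic reward, received four times in one region.
rewards = [scaler.scale("region_a", 2.0) for _ in range(4)]
# A first visit to a new region still receives the full weight.
fresh = scaler.scale("region_b", 2.0)
```

Under this assumed rule, the scaled reward in `region_a` shrinks with each revisit while the first visit to `region_b` is not damped, which is the qualitative behavior the abstract describes: less incentive to return to explored areas and more to reach new ones.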