Deep reinforcement learning in partially observable environments is difficult in itself, and a sparse reward signal complicates it further. Most tasks involving navigation in three-dimensional environments provide the agent with extremely limited information: typically, the agent receives a visual observation from the environment and is rewarded only once, at the end of the episode. A well-designed reward function could substantially improve the convergence of reinforcement learning algorithms on such tasks. The classic approach to increasing the density of the reward signal is to augment it with supplementary rewards, a technique called reward shaping. In this study, we propose two modifications of a recent reward shaping method based on graph convolutional networks: the first uses advanced aggregation functions, and the second uses an attention mechanism. We empirically validate the effectiveness of both modifications on a navigation task in a 3D environment with sparse rewards. For the attention-based modification, we also show that the learned attention concentrates on edges corresponding to important transitions in the 3D environment.