Designing appropriate reward functions for Reinforcement Learning (RL) remains a significant problem, especially in complex environments such as Atari games. Reward shaping, in which natural language instructions are used to provide intermediate rewards to an RL agent, can help the agent reach the goal state faster. In this work, we propose a natural language-based reward shaping approach that maps trajectories from the Montezuma's Revenge game environment to corresponding natural language instructions using an extension of the LanguagE-Action Reward Network (LEARN) framework, which we call Ext-LEARN. These trajectory-language mappings are then used to generate intermediate rewards, which are integrated into reward functions that any standard RL algorithm can use to learn an optimal policy. On a set of 15 tasks from Atari's Montezuma's Revenge, Ext-LEARN completes tasks successfully more often on average than reward shaping with the original LEARN framework, and it performs even better than the reward shaping framework without natural language-based rewards.
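To make the idea of integrating intermediate rewards into a reward function concrete, the following is a minimal sketch of potential-based reward shaping, a standard formulation in which the shaping term is the discounted change in a potential between consecutive steps. It is not necessarily the exact scheme used by LEARN or Ext-LEARN. The function name `shaped_rewards`, the `weight` parameter, and the idea that each potential is a language-relatedness score for the trajectory so far are assumptions made for illustration only.

```python
from typing import List, Sequence


def shaped_rewards(
    env_rewards: Sequence[float],
    potentials: Sequence[float],
    gamma: float = 0.99,
    weight: float = 1.0,
) -> List[float]:
    """Add a potential-based shaping term to each environment reward.

    `potentials` is assumed (hypothetically) to hold a language-relatedness
    score for the trajectory prefix before each step, plus one final score
    after the last step, so it has len(env_rewards) + 1 entries.
    """
    if len(potentials) != len(env_rewards) + 1:
        raise ValueError("potentials must have one more entry than env_rewards")

    shaped = []
    for t, reward in enumerate(env_rewards):
        # Shaping term: discounted potential after the step minus potential before it.
        bonus = gamma * potentials[t + 1] - potentials[t]
        shaped.append(reward + weight * bonus)
    return shaped


# Usage: a sparse task reward only at the final step, with made-up
# potentials that rise as the trajectory matches the instruction better.
example = shaped_rewards([0.0, 0.0, 1.0], [0.0, 0.5, 0.5, 1.0], gamma=1.0)
```

With `gamma=1.0` the shaping terms telescope, so the total bonus over an episode depends only on the first and last potentials; this property is the usual reason for choosing the potential-based form, since it leaves the optimal policy unchanged.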