Text-based games are a popular testbed for language-based reinforcement learning (RL). Previous work commonly uses deep Q-learning as the learning algorithm. However, Q-learning algorithms are challenging to apply to complex real-world domains, in part because of their instability in training. In this paper, we therefore adapt the soft actor-critic (SAC) algorithm to the text-based environment. To deal with sparse extrinsic rewards from the environment, we combine SAC with a potential-based reward shaping technique that provides more informative (dense) reward signals to the RL agent. In particular, we use a dynamically learned value function as the potential function for shaping the learner's original sparse reward signals. We apply our method to difficult text-based games. SAC achieves higher scores than the Q-learning methods on many games with only half the number of training steps, which indicates that it is well suited to text-based games. Moreover, we show that reward shaping helps the agent learn the policy faster and achieve higher scores.
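The shaping idea described above can be illustrated with a minimal sketch. In potential-based reward shaping, the agent receives r + γΦ(s') − Φ(s) in place of the sparse reward r, and here the potential Φ is a value function that is updated as training proceeds. The abstract does not say how the value function is represented or trained, so everything below is an assumption made for illustration: the class name `ShapedReward`, the tabular value estimate, the TD(0) update, and the hyperparameters `gamma` and `lr` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


class ShapedReward:
    """Potential-based reward shaping with a learned value function as potential.

    Assumption: a tabular V(s) trained with TD(0) stands in for whatever
    value estimator the actual method uses.
    """

    def __init__(self, gamma=0.9, lr=0.1):
        self.gamma = gamma            # discount factor (hypothetical value)
        self.lr = lr                  # step size for the TD(0) update (hypothetical value)
        self.V = defaultdict(float)   # potential Phi(s), initialised to 0

    def shape(self, state, reward, next_state, done):
        """Return r + gamma * Phi(s') - Phi(s), then update Phi(s)."""
        phi_s = self.V[state]
        # The potential of a terminal state is taken to be zero.
        phi_next = 0.0 if done else self.V[next_state]
        shaped = reward + self.gamma * phi_next - phi_s

        # TD(0) update of the potential, using the original (unshaped) reward.
        td_target = reward + self.gamma * phi_next
        self.V[state] += self.lr * (td_target - phi_s)
        return shaped


shaper = ShapedReward(gamma=0.5, lr=0.5)
first = shaper.shape("kitchen", 1.0, "hallway", done=False)
second = shaper.shape("kitchen", 0.0, "hallway", done=False)
```

In this toy run the first transition carries the environment reward through unchanged because every potential is still zero. The second transition, which has no environment reward, receives a nonzero shaped reward because the potential of `"kitchen"` has been learned in the meantime. That is the sense in which shaping turns a sparse signal into a denser one.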