Reinforcement learning can solve decision-making problems by training an agent to act in an environment according to a predesigned reward function. This approach becomes problematic when the reward is so sparse that the agent does not encounter it while exploring the environment. One remedy is to equip the agent with intrinsic motivation, which drives informed exploration during which the agent is also likely to encounter external reward. Novelty detection is one of the promising branches of intrinsic motivation research. We present Self-supervised Network Distillation (SND), a class of intrinsic motivation algorithms that use the distillation error as a novelty indicator, where the target model is trained using self-supervised learning. We adapted three existing self-supervised methods for this purpose and tested them experimentally on a set of ten environments that are considered difficult to explore. The results show that, compared to the baseline models, our approach achieves faster growth of external reward and higher external reward for the same training time, which implies improved exploration in very sparse reward environments.
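To make the idea of distillation error as a novelty indicator concrete, the following is a minimal sketch and not the SND implementation. It assumes a tiny linear target and predictor, and it keeps the target fixed at random weights, whereas in SND the target model is trained with self-supervised learning (that step is omitted here). All function and variable names are hypothetical.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a feature network (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, state):
    """Map a state vector to a feature vector."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def distillation_error(target_w, predictor_w, state):
    """Mean squared difference between target and predictor features.

    This value plays the role of the novelty signal (intrinsic reward).
    """
    t = apply_linear(target_w, state)
    p = apply_linear(predictor_w, state)
    return sum((a - b) ** 2 for a, b in zip(t, p)) / len(t)


def distill_step(target_w, predictor_w, state, lr=0.1):
    """One gradient step moving the predictor toward the target on `state`."""
    t = apply_linear(target_w, state)
    p = apply_linear(predictor_w, state)
    for i, row in enumerate(predictor_w):
        err = p[i] - t[i]  # gradient of 0.5 * err**2 w.r.t. p[i]
        for j in range(len(row)):
            row[j] -= lr * err * state[j]


rng = random.Random(0)
target_w = make_linear(4, 3, rng)                 # fixed here; trained in SND
predictor_w = [[0.0] * 4 for _ in range(3)]       # learns to imitate the target

familiar = [1.0, 0.0, 0.0, 0.0]  # state the agent visits repeatedly
novel = [0.0, 0.0, 1.0, 0.0]     # state the agent has never visited

for _ in range(200):
    distill_step(target_w, predictor_w, familiar)

r_familiar = distillation_error(target_w, predictor_w, familiar)
r_novel = distillation_error(target_w, predictor_w, novel)
print(f"intrinsic reward, familiar state: {r_familiar:.3e}")
print(f"intrinsic reward, novel state:    {r_novel:.3e}")
```

In this sketch the predictor has only been distilled on the familiar state, so its error there shrinks toward zero while the error on the unvisited state stays large. An exploring agent that receives this error as a bonus is therefore pushed toward states it has not yet seen.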