Reinforcement learning can solve decision-making problems by training an agent to act in an environment according to a predesigned reward function. However, this approach breaks down when the reward is so sparse that the agent never encounters it while exploring the environment. One solution is to equip the agent with intrinsic motivation, which drives informed exploration during which the agent is also likely to encounter the external reward. Novelty detection is one of the promising branches of intrinsic motivation research. We present Self-supervised Network Distillation (SND), a class of intrinsic motivation algorithms that use the distillation error as a novelty indicator, with the target model trained by self-supervised learning. We adapted three existing self-supervised methods for this purpose and tested them experimentally on a set of ten environments that are considered difficult to explore. The results show that, for the same training time, our approach achieves faster reward growth and higher external reward than the baseline models, which implies improved exploration in very sparse-reward environments.
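To make the idea of distillation error as a novelty indicator concrete, the following is a minimal NumPy sketch, not the SND implementation. It assumes linear models, and the target is a fixed random projection standing in for a model that SND would train with a self-supervised objective. All names (`W_target`, `W_learned`, `intrinsic_reward`, `distill_step`) and the toy state distributions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, FEATURE_DIM = 8, 4

# ASSUMPTION: a fixed random linear map stands in for the target model.
# In SND the target would be trained with a self-supervised objective.
W_target = rng.normal(size=(STATE_DIM, FEATURE_DIM))

# Learned model, distilled toward the target on states the agent visits.
W_learned = np.zeros((STATE_DIM, FEATURE_DIM))


def intrinsic_reward(states, w_learned):
    """Per-state novelty signal: half the squared distillation error."""
    diff = states @ w_learned - states @ W_target
    return 0.5 * np.sum(diff ** 2, axis=1)


def distill_step(states, w_learned, lr=0.05):
    """One gradient-descent step on the mean distillation error."""
    diff = states @ w_learned - states @ W_target
    return w_learned - lr * states.T @ diff / len(states)


# Toy data: "visited" states vary only in the first four dimensions,
# "novel" states only in the last four (a region never trained on).
visited = rng.normal(size=(256, STATE_DIM))
visited[:, 4:] = 0.0
novel = rng.normal(size=(256, STATE_DIM))
novel[:, :4] = 0.0

reward_visited_before = intrinsic_reward(visited, W_learned).mean()

# Distill only on visited states, as an agent would during exploration.
for _ in range(500):
    W_learned = distill_step(visited, W_learned)

reward_visited_after = intrinsic_reward(visited, W_learned).mean()
reward_novel_after = intrinsic_reward(novel, W_learned).mean()
```

In this toy setting, the distillation error on visited states shrinks as the learned model fits the target, while the error on the unvisited region stays high. That high error is the signal an agent would receive as intrinsic reward for reaching novel states.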