Reinforcement learning can solve decision-making problems by training an agent to act in an environment according to a predesigned reward function. This approach breaks down when the reward is so sparse that the agent does not encounter it during exploration. One possible solution is to equip the agent with intrinsic motivation, which drives informed exploration during which the agent is also likely to encounter the external reward. Novelty detection is one of the promising branches of intrinsic motivation research. We present Self-supervised Network Distillation (SND), a class of intrinsic motivation algorithms that use the distillation error as a novelty indicator and in which both the predictor model and the target model are trained. We adapted three existing self-supervised methods for this purpose and tested them experimentally on a set of ten environments that are considered difficult to explore. The results show that, for the same training time, our approach achieves faster growth of external reward and higher external reward than the baseline models, which implies improved exploration in very sparse reward environments. In addition, the analytical methods we applied provide valuable explanatory insights into our proposed models.
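The idea of using a distillation error as a novelty indicator can be illustrated with a minimal sketch. It is not the SND implementation: the linear predictor, the random tanh target, the dimensions, and all function names are assumptions made for illustration. In particular, the target is kept fixed here for brevity, whereas in SND the target model is also trained with a self-supervised method.

```python
import numpy as np

STATE_DIM, FEATURE_DIM = 8, 4  # hypothetical sizes, chosen for illustration
rng = np.random.default_rng(0)

# Hypothetical target model: a random linear map followed by tanh.
# Assumption: kept fixed here; in SND the target is trained as well.
W_TARGET = rng.normal(size=(STATE_DIM, FEATURE_DIM))


def target_features(states):
    """Features produced by the target model for a batch of states."""
    return np.tanh(states @ W_TARGET)


def distillation_error(w_pred, states):
    """Per-state squared error between predictor and target features.

    This error plays the role of the intrinsic (novelty) reward.
    """
    diff = states @ w_pred - target_features(states)
    return 0.5 * np.sum(diff ** 2, axis=1)


def train_predictor(states, lr=1.0, steps=300):
    """Fit a linear predictor to the target on visited states by gradient descent."""
    w_pred = np.zeros((STATE_DIM, FEATURE_DIM))
    for _ in range(steps):
        diff = states @ w_pred - target_features(states)
        w_pred -= lr * states.T @ diff / len(states)
    return w_pred


# States the agent has visited often, and states far outside that region.
familiar = rng.normal(scale=0.3, size=(256, STATE_DIM))
novel = rng.normal(scale=3.0, size=(256, STATE_DIM))

w_pred = train_predictor(familiar)
untrained_error = distillation_error(
    np.zeros((STATE_DIM, FEATURE_DIM)), familiar
).mean()
familiar_error = distillation_error(w_pred, familiar).mean()
novel_error = distillation_error(w_pred, novel).mean()

print(f"untrained: {untrained_error:.3f}")
print(f"familiar:  {familiar_error:.3f}")
print(f"novel:     {novel_error:.3f}")
```

In this sketch the predictor matches the target well only on states it was trained on, so the error stays low for familiar states and is larger for states drawn from an unvisited region. Adding that error to the external reward is what would push an agent toward unexplored states.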