Exploration is a fundamental aspect of reinforcement learning (RL), and its effectiveness is a deciding factor in the performance of RL algorithms, especially when extrinsic rewards are sparse. Recent studies have shown that exploration can be encouraged effectively with intrinsic rewards estimated from the novelty of observations. However, the novelty of an observation does not necessarily reflect exploration by the agent, because both stochasticity in the environment and the agent's own behavior may affect the observation. To evaluate exploratory behaviors accurately, we propose DEIR, a novel method in which we theoretically derive an intrinsic reward with a conditional mutual information term that principally scales with the novelty contributed by the agent's exploration, and then implement the reward with a discriminative forward model. Extensive experiments on both standard and advanced exploration tasks in MiniGrid show that DEIR quickly learns a better policy than the baselines. Our evaluations on ProcGen demonstrate both the generalization capability and the general applicability of our intrinsic reward. Our source code is available at https://github.com/swan-utokyo/deir.