Exploration algorithms for reinforcement learning typically replace or augment the reward function with an additional ``intrinsic'' reward that trains the agent to seek previously unseen states of the environment. Here, we consider an exploration algorithm that exploits meta-learning, or learning to learn, such that the agent learns to maximize its exploration progress within a single episode, even between epochs of training. The agent learns a policy that aims to minimize the probability density of new observations with respect to all of its memories. In addition, it receives the density of the current observation as feedback and retains that feedback in a recurrent network. By remembering trajectories of density, the agent learns to navigate a complex and growing landscape of familiarity in real time, allowing it to maximize its exploration progress even in completely novel states of the environment for which its policy has not been trained.
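The density-based intrinsic reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a Gaussian kernel density estimate over a memory buffer, and all names (`DensityExplorer`, `step`, the bandwidth value) are hypothetical. The negative density serves as the intrinsic reward, and the density value itself is appended to the observation as feedback for a (here omitted) recurrent policy.

```python
import numpy as np

class DensityExplorer:
    """Toy sketch of a density-based intrinsic reward (hypothetical names).

    Intrinsic reward = negative kernel density of the new observation
    with respect to all stored memories, so low-density (novel) states
    are rewarded more than familiar ones.
    """

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth
        self.memories = []  # every past observation

    def density(self, obs):
        # Gaussian kernel density estimate over the memory buffer.
        if not self.memories:
            return 0.0
        mem = np.stack(self.memories)
        sq_dists = np.sum((mem - obs) ** 2, axis=1)
        kernels = np.exp(-sq_dists / (2.0 * self.bandwidth ** 2))
        return float(kernels.mean())

    def step(self, obs):
        d = self.density(obs)
        intrinsic_reward = -d  # seek states where memories are sparse
        self.memories.append(obs)
        # The density evaluation is fed back to the agent alongside the
        # observation, so a recurrent policy can track density trajectories.
        policy_input = np.concatenate([obs, [d]])
        return intrinsic_reward, policy_input

explorer = DensityExplorer()
r_first, _ = explorer.step(np.zeros(2))        # empty memory: density 0
r_revisit, _ = explorer.step(np.zeros(2))      # revisit: high density, low reward
r_novel, _ = explorer.step(np.ones(2) * 5.0)   # novel state: near-zero density
```

Revisiting a remembered state yields a strongly negative intrinsic reward, while a distant novel state is penalized almost not at all, which is the gradient the exploration policy follows.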