Reinforcement learning (RL) algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. Most common RL algorithms use undirected exploration, i.e., they select random sequences of actions. Exploration can also be directed using intrinsic rewards, such as curiosity or model epistemic uncertainty. However, effectively balancing task and intrinsic rewards is challenging and often task-dependent. In this work, we introduce MaxInfoRL, a framework for balancing intrinsic and extrinsic exploration. MaxInfoRL steers exploration toward informative transitions by maximizing intrinsic rewards such as the information gain about the underlying task. When combined with Boltzmann exploration, this approach naturally trades off maximization of the value function against maximization of the entropy over states, rewards, and actions. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. We then apply this general formulation to a variety of off-policy model-free RL methods for continuous state-action spaces, yielding novel algorithms that achieve superior performance across hard exploration problems and complex scenarios such as visual control tasks.
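To make the idea concrete, here is a minimal sketch of information-directed Boltzmann exploration in the multi-armed bandit setting the abstract mentions. The count-based bonus below is only a crude stand-in for the information-gain intrinsic reward; all names, constants, and the bonus itself are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy bandit: 3 Gaussian arms with unknown means.
true_means = np.array([0.2, 0.5, 0.8])
n_arms = len(true_means)

counts = np.zeros(n_arms)      # pull counts per arm
q_values = np.zeros(n_arms)    # extrinsic value estimates
temperature = 0.5              # Boltzmann temperature
beta = 1.0                     # weight on the intrinsic bonus


def intrinsic_bonus(counts):
    # Crude proxy for information gain: rarely pulled arms are
    # more informative. (Assumption, not the paper's bonus.)
    return 1.0 / np.sqrt(counts + 1.0)


for t in range(2000):
    # Boltzmann exploration over extrinsic value + intrinsic bonus:
    # this trades off exploiting high q_values against visiting
    # under-explored, informative arms.
    logits = (q_values + beta * intrinsic_bonus(counts)) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    arm = rng.choice(n_arms, p=probs)

    # Observe a noisy reward and update the running mean estimate.
    reward = rng.normal(true_means[arm], 0.1)
    counts[arm] += 1
    q_values[arm] += (reward - q_values[arm]) / counts[arm]
```

As the counts grow, the intrinsic bonus decays and the softmax policy concentrates on the highest-value arm; early on, the bonus keeps probability mass on under-sampled arms.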