Exploration is a major challenge in reinforcement learning, especially for high-dimensional domains that require function approximation. We propose exploration objectives -- policy optimization objectives that enable downstream maximization of any reward function -- as a conceptual framework to systematize the study of exploration. Within this framework, we introduce a new objective, $L_1$-Coverage, which generalizes previous exploration schemes and supports three fundamental desiderata:

1. Intrinsic complexity control. $L_1$-Coverage is associated with a structural parameter, $L_1$-Coverability, which reflects the intrinsic statistical difficulty of the underlying MDP, subsuming Block and Low-Rank MDPs.
2. Efficient planning. For a known MDP, optimizing $L_1$-Coverage efficiently reduces to standard policy optimization, allowing flexible integration with off-the-shelf methods such as policy gradient and Q-learning approaches.
3. Efficient exploration. $L_1$-Coverage enables the first computationally efficient model-based and model-free algorithms for online (reward-free or reward-driven) reinforcement learning in MDPs with low coverability.

Empirically, we find that $L_1$-Coverage effectively drives off-the-shelf policy optimization algorithms to explore the state space.