Collecting and leveraging data with good coverage properties plays a crucial role in several aspects of reinforcement learning (RL), including reward-free exploration and offline learning. However, what counts as "good coverage" depends on the application at hand: data suitable for one context may not be suitable for another. In this paper, we formalize the problem of active coverage in episodic Markov decision processes (MDPs), where the goal is to interact with the environment so as to fulfill given sampling requirements. This framework is flexible enough to specify any desired coverage property, making it applicable to any problem that involves online exploration. Our main contributions are an instance-dependent lower bound on the sample complexity of active coverage and a simple game-theoretic algorithm, CovGame, that nearly matches it. We then show that CovGame can be used as a building block to solve different PAC RL tasks. In particular, we obtain a simple algorithm for PAC reward-free exploration whose instance-dependent sample complexity is, in certain MDPs that are "easy to explore", lower than the minimax one. By further coupling this exploration algorithm with a new technique for performing implicit eliminations in policy space, we obtain a computationally efficient algorithm for best-policy identification whose instance-dependent sample complexity scales with the gaps between policy values.