Simulators are a pervasive tool in reinforcement learning, but most existing algorithms cannot efficiently exploit simulator access, particularly in high-dimensional domains that require general function approximation. We explore the power of simulators through online reinforcement learning with local simulator access (or, local planning), an RL protocol where the agent is allowed to reset to previously observed states and follow their dynamics during training. We use local simulator access to unlock new statistical guarantees that were previously out of reach:
- We show that MDPs with low coverability (Xie et al. 2023), a general structural condition that subsumes Block MDPs and Low-Rank MDPs, can be learned in a sample-efficient fashion with only $Q^{\star}$-realizability (realizability of the optimal state-value function); existing online RL algorithms require significantly stronger representation conditions.
- As a consequence, we show that the notorious Exogenous Block MDP problem (Efroni et al. 2022) is tractable under local simulator access.
The results above are achieved through a computationally inefficient algorithm. We complement them with a more computationally efficient algorithm, RVFS (Recursive Value Function Search), which achieves provable sample complexity guarantees under a strengthened statistical assumption known as pushforward coverability. RVFS can be viewed as a principled, provable counterpart to a successful empirical paradigm that combines recursive search (e.g., MCTS) with value function approximation.
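To make the local simulator protocol concrete, the following is a minimal Python sketch of the interaction model only, not of RVFS or any algorithm from the paper. The toy `ChainMDP`, the `LocalSimulator` wrapper, the `reset_to` method, and the `estimate_q` helper are all hypothetical names introduced for illustration. The one property the sketch is meant to capture is that the agent may reset only to states it has already observed, and can then draw fresh samples of the dynamics from there.

```python
import random
from typing import Set, Tuple


class ChainMDP:
    """Hypothetical toy MDP: states 0..n-1 on a chain, actions 0 (left) / 1 (right)."""

    def __init__(self, n: int = 5):
        self.n = n

    def initial_state(self) -> int:
        return 0

    def step(self, state: int, action: int, rng: random.Random) -> Tuple[int, float]:
        # The chosen move succeeds with probability 0.9; otherwise the agent stays put.
        move = 1 if action == 1 else -1
        if rng.random() < 0.9:
            state = min(self.n - 1, max(0, state + move))
        reward = 1.0 if state == self.n - 1 else 0.0
        return state, reward


class LocalSimulator:
    """Wraps an MDP so that resets are allowed only to previously observed states."""

    def __init__(self, mdp: ChainMDP, seed: int = 0):
        self.mdp = mdp
        self.rng = random.Random(seed)
        self.observed: Set[int] = set()
        self.state = self.reset()

    def reset(self) -> int:
        # Ordinary online-RL reset to the initial state.
        self.state = self.mdp.initial_state()
        self.observed.add(self.state)
        return self.state

    def reset_to(self, state: int) -> int:
        # Local simulator access: refuse states the agent has never seen.
        if state not in self.observed:
            raise ValueError(f"state {state!r} has not been observed yet")
        self.state = state
        return state

    def step(self, action: int) -> Tuple[int, float]:
        self.state, reward = self.mdp.step(self.state, action, self.rng)
        self.observed.add(self.state)
        return self.state, reward


def estimate_q(sim: LocalSimulator, state: int, action: int,
               horizon: int, n_rollouts: int) -> float:
    """Monte Carlo return of taking `action` in `state`, then acting uniformly
    at random for the remaining steps (an illustrative policy only)."""
    total = 0.0
    for _ in range(n_rollouts):
        sim.reset_to(state)  # revisit the same observed state for each rollout
        _, reward = sim.step(action)
        ret = reward
        for _ in range(horizon - 1):
            _, reward = sim.step(sim.rng.choice([0, 1]))
            ret += reward
        total += ret
    return total / n_rollouts
```

In a purely online protocol, the repeated rollouts from one fixed state inside `estimate_q` would not be available, since the agent could only restart from the initial state. Local simulator access permits exactly this kind of repeated sampling, restricted to states already reached during training.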