Simulators are a pervasive tool in reinforcement learning, but most existing algorithms cannot efficiently exploit simulator access -- particularly in high-dimensional domains that require general function approximation. We explore the power of simulators through online reinforcement learning with local simulator access (or, local planning), an RL protocol where the agent is allowed to reset to previously observed states and follow their dynamics during training. We use local simulator access to unlock new statistical guarantees that were previously out of reach:

- We show that MDPs with low coverability (Xie et al., 2023) -- a general structural condition that subsumes Block MDPs and Low-Rank MDPs -- can be learned in a sample-efficient fashion with only $Q^{\star}$-realizability (realizability of the optimal state-action value function); existing online RL algorithms require significantly stronger representation conditions.
- As a consequence, we show that the notorious Exogenous Block MDP problem (Efroni et al., 2022) is tractable under local simulator access.

The results above are achieved through a computationally inefficient algorithm. We complement them with a more computationally efficient algorithm, RVFS (Recursive Value Function Search), which achieves provable sample complexity guarantees under a strengthened statistical assumption known as pushforward coverability. RVFS can be viewed as a principled, provable counterpart to a successful empirical paradigm that combines recursive search (e.g., MCTS) with value function approximation.
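To make the protocol concrete, here is a minimal sketch of what local simulator access adds on top of standard online interaction. All names (`LocalSimulator`, `reset_to`, the toy random-walk dynamics) are hypothetical illustrations, not the paper's construction: the key capability is that the agent may reset only to states it has already observed during training, rather than to arbitrary states.

```python
import random


class LocalSimulator:
    """Toy environment illustrating the local-simulator (local planning)
    protocol: in addition to standard episodic resets and steps, the agent
    may reset to any state it has previously observed."""

    def __init__(self, n_states=5, seed=0):
        self.rng = random.Random(seed)
        self.n_states = n_states
        self.observed = set()  # states visited so far this training run
        self.state = 0

    def reset(self):
        # Standard online-RL reset to the initial state.
        self.state = 0
        self.observed.add(self.state)
        return self.state

    def reset_to(self, state):
        # The extra power of local simulator access: revisit a seen state.
        # Resetting to unobserved states is disallowed by the protocol.
        if state not in self.observed:
            raise ValueError("may only reset to previously observed states")
        self.state = state
        return self.state

    def step(self, action):
        # Toy stochastic random-walk dynamics; reward at the last state.
        self.state = (self.state + action + self.rng.choice([0, 1])) % self.n_states
        self.observed.add(self.state)
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        return self.state, reward
```

A recursive-search-style algorithm such as RVFS would use `reset_to` to branch from an intermediate state and compare candidate actions from the same starting point, which standard online access does not permit.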