We introduce a novel framework for analyzing reinforcement learning (RL) in continuous state-action spaces, and use it to prove fast rates of convergence in both off-line and on-line settings. Our analysis highlights two key stability properties, relating to how changes in value functions and/or policies affect the Bellman operator and occupation measures. We argue that these properties are satisfied in many continuous state-action Markov decision processes, and demonstrate how they arise naturally when using linear function approximation methods. Our analysis offers fresh perspectives on the roles of pessimism and optimism in off-line and on-line RL, and highlights the connection between off-line RL and transfer learning.