Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging. In contrast to the tabular setting, one cannot enumerate all the states and then iteratively update the policy for each state. This prevents the application of many well-studied RL methods, especially those with provable convergence guarantees. In this paper, we first present a substantial generalization of the recently developed policy mirror descent method to deal with general state and action spaces. We introduce new approaches to incorporate function approximation into this method, so that we do not need to use explicit policy parameterization at all. Moreover, we present a novel policy dual averaging method for which possibly simpler function approximation techniques can be applied. We establish a linear rate of convergence to global optimality or sublinear convergence to stationarity for these methods applied to solve different classes of RL problems under exact policy evaluation. We then define proper notions of the approximation errors for policy evaluation and investigate their impact on the convergence of these methods applied to general-state RL problems with either finite-action or continuous-action spaces. To the best of our knowledge, the development of these algorithmic frameworks, as well as their convergence analysis, appears to be new in the literature.
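For orientation, the per-state update that the tabular setting permits can be sketched as follows. This is only a minimal illustration of policy mirror descent in the finite special case, not the general-state method developed in the paper. It assumes a cost-minimization MDP, a Kullback-Leibler prox term (which gives the closed-form multiplicative update used below), exact policy evaluation, and a constant step size. The names `evaluate_policy`, `pmd_step`, and `run_pmd`, and the two-state example MDP, are hypothetical and chosen purely for illustration.

```python
import numpy as np


def evaluate_policy(P, c, pi, gamma):
    """Exact policy evaluation for a finite MDP (an assumed, tabular setting).

    P has shape (S, A, S), c has shape (S, A), and pi has shape (S, A).
    """
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state transitions under pi
    c_pi = np.sum(pi * c, axis=1)          # expected one-step cost under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
    Q = c + gamma * (P @ V)                # action-value function, shape (S, A)
    return V, Q


def pmd_step(log_pi, Q, eta):
    """One mirror descent step with a KL prox term, done in log space.

    Closed form: pi_new(a|s) is proportional to pi(a|s) * exp(-eta * Q(s, a)).
    """
    log_pi = log_pi - eta * Q
    log_pi = log_pi - log_pi.max(axis=1, keepdims=True)  # numerical stability
    return log_pi - np.log(np.exp(log_pi).sum(axis=1, keepdims=True))


def run_pmd(P, c, gamma=0.9, eta=1.0, n_iters=100):
    """Loop over ALL states each iteration, which is what fails to scale."""
    n_states, n_actions = c.shape
    log_pi = np.full((n_states, n_actions), -np.log(n_actions))  # uniform start
    for _ in range(n_iters):
        _, Q = evaluate_policy(P, c, np.exp(log_pi), gamma)
        log_pi = pmd_step(log_pi, Q, eta)
    pi = np.exp(log_pi)
    V, _ = evaluate_policy(P, c, pi, gamma)
    return pi, V


# Hypothetical two-state example: action 0 stays, action 1 moves to the
# other state. Being in state 0 costs 1, being in state 1 costs 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, 0, 1] = 1.0
P[1, 1, 0] = 1.0
c = np.array([[1.0, 1.0], [0.0, 0.0]])

pi, V = run_pmd(P, c)
print(np.round(pi, 3))
print(np.round(V, 3))
```

In this toy problem the iterates concentrate on moving out of state 0 and staying in state 1. The sketch stores one probability vector per state and sweeps over every state in each iteration, which is the enumeration that becomes impossible over general state spaces and that motivates replacing the table with function approximation.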