One of the most natural approaches to reinforcement learning (RL) with function approximation is value iteration, which inductively generates approximations to the optimal value function by solving a sequence of regression problems. To ensure the success of value iteration, it is typically assumed that Bellman completeness holds, which guarantees that these regression problems are well-specified. We study the problem of learning an optimal policy under Bellman completeness in the online model of RL with linear function approximation. In the linear setting, while statistically efficient algorithms are known under Bellman completeness (e.g., Jiang et al. (2017); Zanette et al. (2020)), these algorithms all rely on the principle of global optimism, which requires solving a nonconvex optimization problem. In particular, it has remained open whether computationally efficient algorithms exist. In this paper we give the first polynomial-time algorithm for RL under linear Bellman completeness when the number of actions is any constant.
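To ground the phrase "solving a sequence of regression problems", here is a minimal sketch of classical least-squares value iteration (LSVI) with linear features. This is the standard template the abstract alludes to, not the paper's new polynomial-time algorithm; all identifiers (`lsvi`, `phi`, `reg`, the dataset layout) are illustrative assumptions, not from the source.

```python
# Illustrative sketch only: generic least-squares value iteration (LSVI)
# with linear function approximation. This is the classical "value iteration
# as a sequence of regressions" template, NOT the paper's algorithm.
import numpy as np

def lsvi(dataset, phi, num_actions, horizon, reg=1e-3):
    """Fit Q_h(s, a) ~= <theta_h, phi(s, a)> backward over the horizon.

    dataset: list of (state, action, reward, next_state) transitions.
    phi: feature map phi(state, action) -> np.ndarray of dimension d.
    Under (linear) Bellman completeness, each regression below is
    well-specified: the Bellman backup of a linear Q-function is again
    linear in phi, so the regression target lies in the function class.
    """
    d = phi(dataset[0][0], dataset[0][1]).shape[0]
    thetas = [np.zeros(d) for _ in range(horizon + 1)]  # theta_H = 0 (terminal)
    for h in reversed(range(horizon)):
        # Design matrix of features for the observed (state, action) pairs.
        X = np.stack([phi(s, a) for (s, a, r, s2) in dataset])
        # Regression targets: r + max_{a'} <theta_{h+1}, phi(s', a')>.
        y = np.array([
            r + max(thetas[h + 1] @ phi(s2, a2) for a2 in range(num_actions))
            for (s, a, r, s2) in dataset
        ])
        # Ridge-regularized least squares for theta_h.
        thetas[h] = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
    return thetas
```

A greedy policy at step h then plays argmax_a <theta_h, phi(s, a)>. The abstract's point is that making this template explore, via global optimism over all such linear solutions, has required solving a nonconvex problem, which is the computational barrier the paper removes for a constant number of actions.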