We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs), whose transition dynamics can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted linear regression scheme with a carefully designed weight, which depends on a new variance estimator that (1) directly estimates the variance of the \emph{optimal} value function, (2) decreases monotonically with the number of episodes to ensure better estimation accuracy, and (3) uses a rare-switching policy to update the value function estimator, which controls the complexity of the estimated value function class. Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
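To make the phrase "weighted linear regression scheme" concrete, the following is a minimal sketch of generic inverse-variance weighted ridge regression, assuming the per-sample variance estimates are already given. It is not the algorithm proposed here: the variance estimator and the rare-switching update rule are not implemented, and the names `weighted_ridge_regression`, `solve_linear_system`, `features`, `targets`, `variances`, and `lam` are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def solve_linear_system(a: List[Vector], b: Vector) -> Vector:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not mutated.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def weighted_ridge_regression(
    features: Sequence[Sequence[float]],
    targets: Sequence[float],
    variances: Sequence[float],
    lam: float = 1.0,
) -> Vector:
    """Return argmin_w  sum_k (phi_k . w - y_k)^2 / var_k + lam * ||w||^2.

    Assumption: `variances` holds positive variance estimates supplied by
    the caller; how they are estimated is outside this sketch.
    """
    d = len(features[0])
    # Regularized, weighted Gram matrix: lam * I + sum_k phi_k phi_k^T / var_k.
    gram = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    # Weighted moment vector: sum_k phi_k * y_k / var_k.
    moment = [0.0] * d
    for phi, y, var in zip(features, targets, variances):
        weight = 1.0 / var
        for i in range(d):
            moment[i] += weight * phi[i] * y
            for j in range(d):
                gram[i][j] += weight * phi[i] * phi[j]
    return solve_linear_system(gram, moment)
```

In this sketch, samples with a smaller variance estimate receive a larger weight and therefore pull the solution more strongly toward their targets, which is the general motivation for variance-aware weighting in regression.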