Mirror descent value iteration (MDVI), an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL), has served as the basis for recent high-performing practical RL algorithms. However, despite the use of function approximation in practice, the theoretical understanding of MDVI has been limited to tabular Markov decision processes (MDPs). We study MDVI with linear function approximation by analyzing the sample complexity required to identify an $\varepsilon$-optimal policy with probability $1-\delta$ in the setting of an infinite-horizon linear MDP with a generative model and G-optimal design. We demonstrate that least-squares regression weighted by the variance of an estimated optimal value function of the next state is crucial to achieving minimax optimality. Based on this observation, we present Variance-Weighted Least-Squares MDVI (VWLS-MDVI), the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs. Furthermore, we propose a practical VWLS algorithm for value-based deep RL, Deep Variance Weighting (DVW). Our experiments demonstrate that DVW improves the performance of popular value-based deep RL algorithms on a set of MinAtar benchmarks.
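To make the variance-weighting idea concrete, the following is a minimal sketch of a generic variance-weighted least-squares regression step in NumPy. It is not the VWLS-MDVI algorithm itself. The function name `variance_weighted_least_squares`, the `ridge` term, and the `min_variance` clipping threshold are illustrative assumptions that do not come from the abstract, and the per-sample variance estimates are taken as given inputs rather than computed from an estimated optimal value function.

```python
import numpy as np


def variance_weighted_least_squares(features, targets, variances,
                                    ridge=1e-6, min_variance=1e-3):
    """Solve min_theta sum_i (phi_i^T theta - y_i)^2 / sigma_i^2 + ridge * ||theta||^2.

    All names and default values here are hypothetical choices for illustration.
    """
    features = np.asarray(features, dtype=float)    # shape (n, d)
    targets = np.asarray(targets, dtype=float)      # shape (n,)
    variances = np.asarray(variances, dtype=float)  # shape (n,)

    # Clip variances from below so that no single sample receives an
    # unbounded weight (an assumption of this sketch).
    weights = 1.0 / np.maximum(variances, min_variance)

    # Weighted normal equations: (Phi^T W Phi + ridge * I) theta = Phi^T W y.
    dim = features.shape[1]
    gram = features.T @ (weights[:, None] * features) + ridge * np.eye(dim)
    rhs = features.T @ (weights * targets)
    return np.linalg.solve(gram, rhs)
```

Samples whose regression targets have a high estimated variance are down-weighted, so the noisiest targets contribute least to the fitted parameters. With all variances equal, the routine reduces to ordinary ridge regression.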