We propose a new method for optimistic planning in infinite-horizon discounted Markov decision processes based on the idea of adding regularization to the updates of an otherwise standard approximate value iteration procedure. This technique allows us to avoid contraction and monotonicity arguments that are typically required by existing analyses of approximate dynamic programming methods, and in particular to use approximate transition functions estimated via least-squares procedures in MDPs with linear function approximation. We use our method to provide a computationally efficient algorithm for learning near-optimal policies in discounted linear kernel MDPs from a single stream of experience, and show that it achieves near-optimal statistical guarantees.