We propose a new method for optimistic planning in infinite-horizon discounted Markov decision processes based on the idea of adding regularization to the updates of an otherwise standard approximate value iteration procedure. This technique allows us to avoid contraction and monotonicity arguments that are typically required by existing analyses of approximate dynamic programming methods, and in particular to use approximate transition functions estimated via least-squares procedures in MDPs with linear function approximation. We use our method to provide a computationally efficient algorithm for learning near-optimal policies in discounted linear kernel MDPs from a single stream of experience, and show that it achieves near-optimal statistical guarantees.
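To make the idea of a regularized value-iteration update concrete, the following is a minimal sketch on a small tabular MDP. It is not the algorithm proposed here: it assumes a known transition model, uses no optimism bonus and no linear function approximation, and picks one particular regularizer (a KL penalty toward the uniform policy, which replaces the hard max with a smoothed log-sum-exp). The names `soft_value_iteration`, `smoothed_max`, `softmax_policy`, and the parameter `eta` are hypothetical and introduced only for illustration.

```python
import math


def smoothed_max(q_values, eta):
    """KL-regularized (uniform reference) replacement for max over actions.

    Equals (1/eta) * log(mean(exp(eta * q))), computed stably.
    It lies in [max(q) - log(A)/eta, max(q)] and tends to max(q) as eta grows.
    """
    m = max(q_values)
    mean_exp = sum(math.exp(eta * (q - m)) for q in q_values) / len(q_values)
    return m + math.log(mean_exp) / eta


def softmax_policy(q_values, eta):
    """Policy attaining the regularized maximum: pi(a) proportional to exp(eta * q(a))."""
    m = max(q_values)
    weights = [math.exp(eta * (q - m)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def soft_value_iteration(P, r, gamma=0.9, eta=5.0, iters=200):
    """Value iteration with a regularized update in place of the hard max.

    P[s][a] is a list of next-state probabilities, r[s][a] is a reward.
    """
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        # Standard Bellman backup for the action values.
        Q = [[r[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_actions)] for s in range(n_states)]
        # Regularized update: smoothed max instead of max.
        V = [smoothed_max(Q[s], eta) for s in range(n_states)]
    policy = [softmax_policy(Q[s], eta) for s in range(n_states)]
    return V, Q, policy


# Toy two-state MDP (an assumption for illustration only):
# state 0: action 0 stays in state 0, action 1 moves to state 1, both with reward 0;
# state 1: both actions stay in state 1 with reward 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
r = [[0.0, 0.0],
     [1.0, 1.0]]

V, Q, policy = soft_value_iteration(P, r, gamma=0.9, eta=5.0)
```

In this sketch the regularized value in each state is within `log(A) / eta` of the largest action value in that state, so a larger `eta` brings the update closer to the unregularized one, while a smaller `eta` keeps the resulting policy closer to uniform.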