We propose a new method for optimistic planning in infinite-horizon discounted Markov decision processes (MDPs), based on adding regularization to the updates of an otherwise standard approximate value iteration procedure. This technique lets us avoid the contraction and monotonicity arguments typically required by existing analyses of approximate dynamic programming methods and, in particular, lets us use approximate transition functions estimated via least-squares procedures in MDPs with linear function approximation. We use our method to recover known guarantees in tabular MDPs and to obtain a computationally efficient algorithm for learning near-optimal policies in discounted linear mixture MDPs from a single stream of experience, and we show that it achieves near-optimal statistical guarantees.
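To make the idea of a regularized value iteration update concrete, the following is a minimal tabular sketch. It is not the algorithm analyzed in this work: the abstract does not specify the regularizer or the form of the optimism, so the sketch assumes a KL regularizer toward the uniform policy (which turns the usual max over actions into a scaled log-mean-exp) and an additive exploration bonus. The names `regularized_value_iteration`, `bonus`, and `eta` are hypothetical and chosen for illustration only.

```python
import math


def regularized_value_iteration(P, r, bonus, gamma, eta, num_iters):
    """Illustrative KL-regularized value iteration on a tabular MDP.

    P[s][a]     : list of next-state probabilities (assumed model estimate)
    r[s][a]     : reward for taking action a in state s
    bonus[s][a] : assumed additive exploration bonus (the "optimism")
    gamma       : discount factor in [0, 1)
    eta         : regularization strength; larger eta is closer to a hard max
    """
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(num_iters):
        new_V = []
        for s in range(n_states):
            # Optimistic action values under the current value estimate.
            q = [
                r[s][a] + bonus[s][a]
                + gamma * sum(p * v for p, v in zip(P[s][a], V))
                for a in range(len(P[s]))
            ]
            # Regularized backup: (1/eta) * log(mean(exp(eta * q))),
            # computed stably by subtracting the largest action value.
            m = max(q)
            mean_exp = sum(math.exp(eta * (x - m)) for x in q) / len(q)
            new_V.append(m + math.log(mean_exp) / eta)
        V = new_V
    return V


# One state, two self-looping actions with rewards 0 and 1, no bonus.
P = [[[1.0], [1.0]]]
r = [[0.0, 1.0]]
bonus = [[0.0, 0.0]]
V = regularized_value_iteration(P, r, bonus, gamma=0.5, eta=50.0, num_iters=60)
```

In this toy instance the unregularized optimal value is 1 / (1 - 0.5) = 2. The regularized backup never exceeds the hard max, so the computed value sits slightly below 2 and moves closer to it as `eta` grows.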