In this work, we study $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs) and introduce a framework called dynamic policy gradient (DynPG). The framework directly integrates dynamic programming with (any) policy gradient method, explicitly leveraging the Markovian property of the environment. DynPG dynamically adjusts the problem horizon during training, decomposing the original infinite-horizon MDP into a sequence of contextual bandit problems. By iteratively solving these contextual bandits, DynPG converges to the stationary optimal policy of the infinite-horizon MDP. To demonstrate the power of DynPG, we establish its non-asymptotic global convergence rate under the tabular softmax parametrization, focusing on the dependence on salient parameters of the MDP. By combining classical arguments from dynamic programming with more recent convergence arguments for policy gradient schemes, we prove that softmax DynPG scales polynomially in the effective horizon $(1-\gamma)^{-1}$. Our findings contrast with recent exponential lower-bound examples for vanilla policy gradient.
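To make the horizon-growing idea concrete, the following is a minimal, illustrative sketch of the decomposition under the tabular softmax parametrization. It assumes a known transition tensor and reward matrix and uses plain gradient ascent on each one-step (contextual-bandit) problem; the function and parameter names (`dynpg`, `pg_steps`, `lr`, etc.) are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(theta):
    """Row-wise softmax over the action dimension."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def dynpg(P, r, gamma, epochs, pg_steps, lr):
    """Toy DynPG-style sketch for a known tabular MDP (illustrative only).

    P : transition tensor, shape (S, A, S); r : reward matrix, shape (S, A).
    Returns the softmax policy trained in the last epoch.
    """
    S, A = r.shape
    V_tail = np.zeros(S)                # value of the already-trained tail of policies
    for _ in range(epochs):             # grow the horizon by one step per epoch
        # One-step (contextual-bandit) action values, tail value held fixed.
        Q = r + gamma * (P @ V_tail)    # shape (S, A)
        theta = np.zeros((S, A))        # fresh softmax logits for this epoch's policy
        for _ in range(pg_steps):       # any PG method could be plugged in; plain ascent here
            pi = softmax(theta)
            adv = Q - (pi * Q).sum(axis=1, keepdims=True)   # bandit advantage per state
            theta += lr * pi * adv      # softmax policy-gradient ascent step
        pi = softmax(theta)
        V_tail = (pi * Q).sum(axis=1)   # becomes the tail value for the next epoch
    return pi
```

Each epoch trains a new policy for the current first step while all previously trained policies are frozen inside `V_tail`, which is what reduces every stage to a contextual bandit; the last trained policy is returned as the approximation of the optimal stationary policy.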