Markov Decision Processes (MDPs) are a formal framework for modeling and solving sequential decision-making problems. Finite-horizon MDPs arise, for instance, in optimal stopping and certain supply chain problems, and also in the training of large language models. In contrast to infinite-horizon MDPs, optimal policies are not stationary in general, so a policy must be learned for every single epoch. In practice, all parameters are often trained simultaneously, ignoring the inherent structure suggested by dynamic programming. This paper introduces a combination of dynamic programming and policy gradient called dynamic policy gradient, in which the parameters are trained backwards in time. For the tabular softmax parametrisation, we carry out the convergence analysis for simultaneous and dynamic policy gradient towards global optima, both in the exact and sampled gradient settings and without regularisation. It turns out that dynamic policy gradient training exploits the structure of finite-time problems much better, which is reflected in improved convergence bounds.
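To make the idea of training backwards in time concrete, the following is a minimal sketch of dynamic policy gradient with a tabular softmax parametrisation and exact gradients. It is an illustration under stated assumptions, not the paper's implementation: the function names, the fixed step size `lr`, the number of gradient steps per epoch, the uniform state distribution `mu` used at every epoch, and the small random MDP are all hypothetical choices made here. Each epoch is trained by gradient ascent while the already-trained later epochs are held fixed, and the result is compared with the optimal value obtained by backward induction.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def dynamic_policy_gradient(P, r, H, mu, lr=1.0, steps=5000):
    """Train one softmax table per epoch, from the last epoch back to the first.

    P  : (S, A, S) transition probabilities (hypothetical toy model)
    r  : (S, A) one-step rewards
    H  : number of epochs
    mu : (S,) state distribution used to weight the objective at each epoch
         (assumed identical for all epochs in this sketch)
    """
    S, A = r.shape
    theta = np.zeros((H, S, A))   # one parameter table per epoch
    V_next = np.zeros(S)          # value after the final epoch is zero
    for h in reversed(range(H)):
        # Q-values at epoch h given the already-trained later policies.
        Q = r + P @ V_next
        for _ in range(steps):
            pi = softmax(theta[h])
            V = (pi * Q).sum(axis=1)
            # Exact gradient of sum_s mu(s) V(s) with respect to theta[h].
            grad = mu[:, None] * pi * (Q - V[:, None])
            theta[h] += lr * grad
        pi = softmax(theta[h])
        V_next = (pi * Q).sum(axis=1)  # value of the trained policy from epoch h on
    return theta, V_next


def optimal_values(P, r, H):
    """Optimal finite-horizon values via backward induction, for comparison."""
    V = np.zeros(r.shape[0])
    for _ in range(H):
        V = (r + P @ V).max(axis=1)
    return V


# Small random MDP, purely for illustration.
rng = np.random.default_rng(0)
S, A, H = 3, 2, 3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
mu = np.full(S, 1.0 / S)

theta, V_dpg = dynamic_policy_gradient(P, r, H, mu)
V_opt = optimal_values(P, r, H)
gap = float(mu @ (V_opt - V_dpg))
print("optimality gap under mu:", gap)
```

Because each epoch is optimised against fixed continuation values, every inner loop is a one-step (contextual bandit) problem. Simultaneous training would instead update all `H` parameter tables in every gradient step.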