Markov Decision Processes (MDPs) are a formal framework for modeling and solving sequential decision-making problems. Finite-horizon problems of this kind arise, for instance, in optimal stopping and in certain supply chain problems, but also in the training of large language models. In contrast to infinite-horizon MDPs, optimal policies are not stationary, so a policy must be learned for every single epoch. In practice, all parameters are often trained simultaneously, ignoring the inherent structure suggested by dynamic programming. This paper introduces a combination of dynamic programming and policy gradient called dynamic policy gradient, in which the parameters are trained backwards in time. For the tabular softmax parametrisation, we carry out the convergence analysis for simultaneous and dynamic policy gradient towards global optima, both in the exact and sampled gradient settings and without regularisation. It turns out that dynamic policy gradient training exploits the structure of finite-time problems much better, which is reflected in improved convergence bounds.
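To make the idea of training parameters backwards in time concrete, the following is a minimal sketch of one way such a scheme could look for a tabular softmax policy with exact gradients. It is not taken from the paper: the function names, the two-state toy MDP, the uniform state weighting `MU`, the step size `eta`, and the number of gradient steps are all assumptions chosen for illustration. For each epoch, starting from the last, the sketch fixes the already trained later policies, computes one-step lookahead values, and runs gradient ascent on that epoch's softmax parameters only.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def q_values(r, P, v_next):
    """One-step lookahead: Q(s, a) = r(s, a) + sum_s' P(s' | s, a) * v_next(s')."""
    return [[r[s][a] + sum(p * v for p, v in zip(P[s][a], v_next))
             for a in range(len(r[s]))]
            for s in range(len(r))]


def dynamic_policy_gradient(r, P, horizon, mu, eta=1.0, steps=2000):
    """Train one tabular softmax policy per epoch, from the last epoch to the first.

    Assumption: exact gradients and a fixed state weighting `mu` at every epoch.
    """
    n_states = len(r)
    thetas = [None] * horizon
    v_next = [0.0] * n_states  # value after the final epoch is zero
    for h in reversed(range(horizon)):
        # Later epochs are already trained, so their value v_next is held fixed.
        q = q_values(r, P, v_next)
        theta = [[0.0] * len(r[s]) for s in range(n_states)]
        for _ in range(steps):
            for s in range(n_states):
                pi = softmax(theta[s])
                v = sum(p * x for p, x in zip(pi, q[s]))
                for a in range(len(pi)):
                    # Exact softmax gradient: mu(s) * pi(a|s) * (Q(s,a) - V(s)).
                    theta[s][a] += eta * mu[s] * pi[a] * (q[s][a] - v)
        thetas[h] = theta
        v_next = [sum(p * x for p, x in zip(softmax(theta[s]), q[s]))
                  for s in range(n_states)]
    return thetas, v_next  # v_next now holds the value of the learned policy at epoch 0


def backward_induction(r, P, horizon):
    """Optimal values at epoch 0 via classical dynamic programming, for reference."""
    v = [0.0] * len(r)
    for _ in range(horizon):
        v = [max(row) for row in q_values(r, P, v)]
    return v


# Hypothetical toy MDP: action 0 moves to (or stays in) state 0, action 1 to state 1.
R = [[1.0, 0.0], [0.0, 2.0]]          # R[s][a]
P = [[[1.0, 0.0], [0.0, 1.0]],        # P[s][a][s']
     [[1.0, 0.0], [0.0, 1.0]]]
H = 3
MU = [0.5, 0.5]

thetas, v_learned = dynamic_policy_gradient(R, P, H, MU)
v_optimal = backward_induction(R, P, H)
print("learned:", v_learned, "optimal:", v_optimal)
```

In this toy problem the optimal policy is not stationary: in state 0 it is better to move to state 1 at the first epoch but to stay in state 0 at the last one, which is why a separate parameter table `thetas[h]` is kept for each epoch. A simultaneous variant would instead update all `thetas[h]` in every gradient step; the sketch does not attempt to reproduce the convergence bounds discussed above.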