In this paper, we introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distribution commonly arising in Markov decision processes (MDPs) from stochastic networks, queueing systems, and statistical mechanics. Specifically, when the stationary distribution of the MDP belongs to an exponential family parametrized by the policy parameters, we can improve existing policy-gradient methods for average-reward RL. Our key contribution is the identification of a family of gradient estimators, called score-aware gradient estimators (SAGEs), that enable policy-gradient estimation without relying on value-function estimation in this setting. We show that SAGE-based policy gradient converges locally and we derive a regret bound for it, including cases where the state space of the MDP is countable and unstable policies may exist. Under appropriate assumptions, such as starting sufficiently close to a maximizer and the existence of a local Lyapunov function, the policy under SAGE-based stochastic gradient ascent converges to the associated optimal policy with overwhelming probability. Furthermore, we conduct a numerical comparison between a SAGE-based policy-gradient method and an actor-critic method on several examples inspired by stochastic networks, queueing systems, and models derived from statistical physics. Our results demonstrate that the SAGE-based method finds close-to-optimal policies faster than the actor-critic method.
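As a brief illustration of why this structure helps, consider the following sketch; the notation is assumed for exposition and is not fixed by the abstract (a feature map $x(\cdot)$, a parameter map $\rho(\theta)$ with Jacobian $J_\rho(\theta)$, a base measure $\nu$, a normalizing constant $Z(\theta)$, and a state-only reward $r$), and it is a simplified identity rather than the paper's exact estimator. Suppose the stationary distribution has the exponential-family form
\[
  \pi_\theta(s) \;=\; \frac{\nu(s)}{Z(\theta)}\, e^{\rho(\theta)^{\top} x(s)}, \qquad s \in \mathcal{S}.
\]
Then the score is available in closed form,
\[
  \nabla_\theta \log \pi_\theta(s) \;=\; J_\rho(\theta)^{\top}\bigl( x(s) - \mathbb{E}_{\pi_\theta}[x(S)] \bigr),
\]
so the gradient of the average reward $J(\theta) = \mathbb{E}_{\pi_\theta}[r(S)]$ reduces to a covariance,
\[
  \nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\bigl[ r(S)\, \nabla_\theta \log \pi_\theta(S) \bigr]
  \;=\; J_\rho(\theta)^{\top}\, \mathrm{Cov}_{\pi_\theta}\!\bigl( x(S),\, r(S) \bigr),
\]
which can be estimated from a single trajectory by sample averages, with no value-function (critic) estimation.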