Stochastic networks and queueing systems often lead to Markov decision processes (MDPs) with large state and action spaces as well as nonconvex objective functions, which hinders the convergence of many reinforcement learning (RL) algorithms. Policy-gradient methods perform well on MDPs with large state and action spaces, but they sometimes converge slowly because of the high variance of the gradient estimator. In this paper, we show that some of these difficulties can be circumvented by exploiting the structure of the underlying MDP. We first introduce a new family of gradient estimators called score-aware gradient estimators (SAGEs). When the stationary distribution of the MDP belongs to an exponential family parametrized by the policy parameters, SAGEs allow us to estimate the policy gradient without relying on value-function estimation, unlike classical policy-gradient methods such as actor-critic. To demonstrate their applicability, we examine two common control problems arising in stochastic networks and queueing systems whose stationary distributions have a product form, a special case of exponential families. As a second contribution, we show that, under appropriate assumptions, the policy produced by a SAGE-based policy-gradient method converges to an optimal policy with high probability, provided that it starts sufficiently close to one, even when the objective function is nonconvex and has multiple maximizers. Our key assumptions are that, locally around a maximizer, a nondegeneracy property of the Hessian of the objective function holds and a Lyapunov function exists. Finally, we numerically compare a SAGE-based policy-gradient method with an actor-critic algorithm. In these experiments, the SAGE-based method finds close-to-optimal policies more rapidly than the actor-critic method.
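To make the idea of estimating a policy gradient without a value function concrete, the following is a minimal sketch on a toy example that is not taken from the paper. It assumes a single queue whose stationary queue length is geometric, P(X = x) = (1 - rho) rho^x, which is a one-parameter exponential family with sufficient statistic x. For such a family, and for a reward that does not depend on the parameter, the gradient of J(theta) = E[r(X)] equals (d log rho / d theta) Cov(X, r(X)), so it can be estimated from an empirical covariance alone. The sigmoid parametrization `load`, the reward function, and the use of i.i.d. stationary samples (rather than a single trajectory) are all illustrative assumptions.

```python
import math
import random


def load(theta):
    """Hypothetical parametrization: queue load rho(theta) in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-theta))


def dlog_load(theta):
    """d/dtheta log rho(theta), which equals 1 - rho(theta) for a sigmoid."""
    return 1.0 - load(theta)


def reward(x):
    """Illustrative reward: bonus when the queue is busy, minus a holding cost."""
    return min(x, 1) - 0.1 * x


def sample_queue_length(rho, rng):
    """Inverse-transform sample from P(X = x) = (1 - rho) * rho**x, x >= 0."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
    return int(math.log(u) / math.log(rho))


def sage_gradient(theta, n_samples, rng):
    """Covariance-based gradient estimate; no value function is involved.

    Assumption: samples are drawn i.i.d. from the stationary distribution,
    a simplification of sampling along a trajectory of the MDP.
    """
    rho = load(theta)
    xs = [sample_queue_length(rho, rng) for _ in range(n_samples)]
    rs = [reward(x) for x in xs]
    mean_x = sum(xs) / n_samples
    mean_r = sum(rs) / n_samples
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs))
    cov /= n_samples - 1
    return dlog_load(theta) * cov


def exact_objective(theta, x_max=2000):
    """J(theta) = E[r(X)], computed by truncating the sum at x_max."""
    rho = load(theta)
    return sum((1.0 - rho) * rho**x * reward(x) for x in range(x_max))


def exact_gradient(theta, eps=1e-5):
    """Central finite difference of the exact objective, used as a check."""
    return (exact_objective(theta + eps) - exact_objective(theta - eps)) / (2 * eps)


if __name__ == "__main__":
    rng = random.Random(0)
    print("covariance-based estimate:", sage_gradient(0.0, 100_000, rng))
    print("finite-difference reference:", exact_gradient(0.0))
```

With enough samples, the two printed values should agree closely. The estimator only needs the observed queue lengths and rewards, plus the known dependence of the load on the parameter.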