We derive the first large deviation rate function for the stochastic iterates generated by policy gradient methods with a softmax parametrization and an entropy-regularized objective. Leveraging the contraction principle from large deviations theory, we also develop a general recipe for deriving exponential convergence rates for a wide spectrum of other policy parametrizations. This approach unifies several results from the literature and simplifies existing proof techniques.
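For orientation, the contraction principle invoked in the abstract can be stated in its standard textbook form. The symbols $X_n$, $I$, $f$, $J$ and the spaces $\mathcal{X}$, $\mathcal{Y}$ are notation assumed here for illustration and are not taken from the paper, whose precise setting and assumptions may differ.

```latex
% Standard contraction principle (generic notation, assumed for illustration).
% Let \mathcal{X} and \mathcal{Y} be Hausdorff topological spaces and let
% f : \mathcal{X} \to \mathcal{Y} be continuous. If (X_n) satisfies a large
% deviation principle on \mathcal{X} with good rate function I, then (f(X_n))
% satisfies a large deviation principle on \mathcal{Y} with good rate function
\[
  J(y) \;=\; \inf\bigl\{\, I(x) \;:\; x \in \mathcal{X},\ f(x) = y \,\bigr\},
  \qquad y \in \mathcal{Y},
\]
% with the convention \inf \emptyset = +\infty.
```

Read this way, a rate function established for one parametrization can be transferred through a continuous map to another, which is presumably the mechanism behind the general recipe for other policy parametrizations.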