Motivated by policy gradient methods in reinforcement learning, we derive the first large deviation rate function for the iterates generated by stochastic gradient descent for possibly non-convex objectives satisfying a Polyak-Łojasiewicz condition. Leveraging the contraction principle from large deviations theory, we illustrate the potential of this result by showing how convergence properties of policy gradient with a softmax parametrization and an entropy-regularized objective can be naturally extended to a wide spectrum of other policy parametrizations.
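
For orientation, the two standard ingredients named above can be written out as follows. This is only a sketch in assumed notation: the symbols $f$, $f^\star$, $\mu$, $\alpha_k$, $g_k$, $I$, $J$ and $h$ are chosen here for illustration and need not match the paper's, and the paper's actual rate function is not reproduced.

```latex
% Polyak-Lojasiewicz (PL) condition for a differentiable objective f with
% infimum f^\star: there exists a constant \mu > 0 (assumed notation) with
\[
  \tfrac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu\,\bigl(f(x) - f^\star\bigr)
  \qquad \text{for all } x .
\]

% Stochastic gradient descent with step sizes \alpha_k and an unbiased
% gradient estimate g_k (assumed notation):
\[
  x_{k+1} \;=\; x_k - \alpha_k\, g_k ,
  \qquad \mathbb{E}\bigl[\, g_k \mid x_k \,\bigr] \;=\; \nabla f(x_k) .
\]

% Contraction principle: if (X_n) satisfies a large deviation principle with
% good rate function I and h is a continuous map, then (h(X_n)) satisfies a
% large deviation principle with rate function
\[
  J(y) \;=\; \inf \bigl\{\, I(x) \;:\; h(x) = y \,\bigr\} .
\]
```

Read this way, a rate function obtained for one parametrization can be pushed forward through a continuous map $h$, which is presumably the mechanism behind the extension to other policy parametrizations.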