Entropy regularization has been extensively used in policy optimization algorithms to regularize the optimization landscape and accelerate convergence; however, it comes at the cost of introducing an additional regularization bias. This work quantifies the impact of entropy regularization on the convergence of policy gradient methods for stochastic exit time control problems. We analyze a continuous-time policy mirror descent dynamics, which updates the policy based on the gradient of an entropy-regularized value function and adjusts the strength of entropy regularization as the algorithm progresses. We prove that with a fixed entropy level, the dynamics converges exponentially to the optimal solution of the regularized problem. We further show that when the entropy level decays at suitable polynomial rates, the annealed flow converges to the solution of the unregularized problem at a rate of $\mathcal O(1/S)$ for discrete action spaces and, under suitable conditions, at a rate of $\mathcal O(1/\sqrt{S})$ for general action spaces, with $S$ being the gradient flow time. This paper explains how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate.
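To make the two regimes above concrete, the following is a minimal sketch on a toy problem, not the exit time control setting analyzed in this work. It assumes a single state with three discrete actions and a hypothetical cost vector `cost`, so the entropy-regularized objective is $\sum_a \pi(a) c(a) + \tau \sum_a \pi(a) \log \pi(a)$ with minimizer $\pi^*_\tau \propto \exp(-c/\tau)$. Under these assumptions the mirror descent flow can be written in logit coordinates as $\dot z_s = -(c + \tau_s z_s)$ with $\pi_s = \mathrm{softmax}(z_s)$, which the code discretizes with an explicit Euler scheme. The function name, step size, and annealing schedule $\tau_s = (1+s)^{-1/2}$ are illustrative choices, not the schedules used in the paper.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def mirror_descent_flow(cost, tau_schedule, total_time, step=0.01):
    """Explicit Euler discretization of dz/ds = -(cost + tau(s) * z).

    The policy at flow time s is softmax(z_s). This is a toy single-state
    (bandit) model with hypothetical names, not the paper's control problem.
    """
    z = np.zeros_like(cost, dtype=float)  # zero logits: uniform initial policy
    n_steps = int(round(total_time / step))
    for k in range(n_steps):
        tau = tau_schedule(k * step)
        z = z - step * (cost + tau * z)
    return softmax(z)


# Hypothetical costs for three actions; the last action is optimal.
cost = np.array([1.0, 0.5, 0.2])

# Fixed entropy level: the flow approaches the regularized optimum softmax(-cost / tau).
tau = 0.5
pi_fixed = mirror_descent_flow(cost, lambda s: tau, total_time=40.0)
pi_reg_opt = softmax(-cost / tau)
print("fixed tau, distance to regularized optimum:", np.abs(pi_fixed - pi_reg_opt).max())

# Annealed entropy level (illustrative schedule): the policy concentrates on the
# minimizer of the unregularized cost as the flow time grows.
pi_annealed = mirror_descent_flow(cost, lambda s: 1.0 / np.sqrt(1.0 + s), total_time=200.0)
gap = pi_annealed @ cost - cost.min()
print("annealed tau, unregularized suboptimality gap:", gap)
```

In this toy model the logit equation with fixed $\tau$ is linear, so $z_s = e^{-\tau s} z_0 - (1 - e^{-\tau s})\, c/\tau$, and the exponential convergence to the regularized optimum can be read off directly. With a fixed $\tau > 0$ the limit policy stays biased away from the unregularized minimizer, whereas letting $\tau_s$ decay removes this bias. The sketch does not attempt to reproduce the $\mathcal O(1/S)$ or $\mathcal O(1/\sqrt{S})$ rates stated above.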