We study the error introduced by entropy regularization of infinite-horizon, discrete, discounted Markov decision processes. We show that this error decays exponentially in the inverse regularization strength, both in a weighted KL-divergence and in value, with a problem-specific exponent. We provide a lower bound matching our upper bound up to a polynomial factor. Our proof relies on the correspondence between the solutions of entropy-regularized Markov decision processes and gradient flows of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. Further, this correspondence allows us to identify the limit of the gradient flow as the generalized maximum entropy optimal policy, thereby characterizing the implicit bias of the Kakade gradient flow, which corresponds to a time-continuous version of the natural policy gradient method. We use this to show that, for entropy-regularized natural policy gradient methods, the overall error decays exponentially in the square root of the number of iterations, improving on existing sublinear guarantees.
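For context, the following is a minimal sketch of the standard entropy-regularized objective we have in mind; the notation, the weighting, and the constants below are illustrative rather than the precise statements of our results. For a discount factor $\gamma \in (0,1)$, regularization strength $\tau > 0$, and initial distribution $\mu$, the entropy-regularized value of a policy $\pi$ is
\[
V_\tau^\pi(\mu) \;=\; \mathbb{E}_{s_0 \sim \mu,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)}\Big[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t, a_t) + \tau\, H(\pi(\cdot \mid s_t))\big)\Big],
\qquad
H(p) \;=\; -\sum_a p(a) \log p(a).
\]
Writing $\pi_\tau^\ast$ for its maximizer and $\pi^\ast$ for an unregularized optimal policy, the regularization error described above takes the schematic form
\[
V^{\pi^\ast}(\mu) - V^{\pi_\tau^\ast}(\mu) \;\le\; C\, e^{-c/\tau}
\]
with problem-specific constants $C, c > 0$, and an analogous exponential bound holds for a weighted KL-divergence between $\pi_\tau^\ast$ and the maximum entropy optimal policy.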
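To illustrate the algorithmic setting, the following is a minimal numerical sketch of tabular entropy-regularized natural policy gradient on a small random MDP. It uses the standard multiplicative soft-NPG update known from the literature, not necessarily the exact scheme or constants analyzed here, and all sizes, hyperparameters, and helper names (the state and action counts, gamma, tau, eta, soft_values) are illustrative choices.

```python
# Illustrative sketch (not the paper's code): tabular entropy-regularized
# natural policy gradient on a random toy MDP, using the standard update
#   pi_{k+1}(a|s) ∝ pi_k(a|s)^{1 - eta*tau} * exp(eta * Q_tau^{pi_k}(s, a)).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, tau, eta = 5, 3, 0.9, 0.1, 0.5   # illustrative sizes and hyperparameters

# Random transition kernel P[s, a, s'] and reward table r[s, a].
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def soft_values(pi, iters=500):
    """Evaluate the entropy-regularized values of pi by fixed-point iteration."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * P @ V                           # Q(s,a) = r(s,a) + gamma * E[V(s')]
        V = (pi * (Q - tau * np.log(pi))).sum(axis=1)   # V = E_pi[Q] + tau * H(pi)
    return V, r + gamma * P @ V

# Start from the uniform policy and run the soft-NPG recursion.
pi = np.full((S, A), 1.0 / A)
for k in range(200):
    V, Q = soft_values(pi)
    logits = (1.0 - eta * tau) * np.log(pi) + eta * Q
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

print("soft-optimal state values (approx.):", np.round(V, 3))
```

For small step sizes this recursion can be read as a discretization of a natural-gradient flow of the regularized objective, which is the discrete-time counterpart of the continuous-time viewpoint taken above.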