Direct policy optimization in reinforcement learning is usually solved with policy-gradient algorithms, which optimize policy parameters via stochastic gradient ascent. This paper provides a new theoretical interpretation and justification of these algorithms. First, we formulate direct policy optimization in the optimization-by-continuation framework, a framework for optimizing nonconvex functions in which a sequence of surrogate objective functions, called continuations, is locally optimized. Second, we show that optimizing affine Gaussian policies and performing entropy regularization can be interpreted as implicitly optimizing deterministic policies by continuation. Based on these theoretical results, we argue that exploration in policy-gradient algorithms consists of computing a continuation of the return of the policy at hand, and that the variance of a policy should be a history-dependent function adapted to avoid local extrema rather than to maximize the return of the policy.
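As an illustration of the continuation idea, the following is a minimal sketch on a toy one-step problem, not the paper's algorithm. It assumes a hypothetical nonconvex return `ret(a)` of a deterministic action `a`, and a Gaussian policy with mean `mu` and standard deviation `sigma`, whose expected return is the Gaussian-smoothed return. The function names, the return function, the step size, and the schedule of `sigma` values are all assumptions chosen so that the smoothed gradient has a closed form.

```python
import math

def ret(a):
    """Hypothetical nonconvex return of a deterministic action a."""
    return math.cos(3.0 * a) - 0.1 * a * a

def smoothed_grad(mu, sigma):
    """Gradient w.r.t. mu of E[ret(mu + sigma * Z)] with Z ~ N(0, 1).

    Closed form for this toy return:
      E[cos(3 (mu + sigma Z))] = exp(-4.5 sigma^2) * cos(3 mu)
      E[(mu + sigma Z)^2]      = mu^2 + sigma^2
    With sigma = 0 this is the gradient of ret itself.
    """
    return -3.0 * math.exp(-4.5 * sigma ** 2) * math.sin(3.0 * mu) - 0.2 * mu

def ascend(mu, sigma, lr=0.1, steps=1000):
    """Local gradient ascent on the continuation with fixed sigma."""
    for _ in range(steps):
        mu += lr * smoothed_grad(mu, sigma)
    return mu

def optimize_by_continuation(mu, sigmas=(1.0, 0.5, 0.25, 0.0)):
    """Locally optimize a sequence of continuations, from smoothest to
    the original objective, warm-starting each stage from the last."""
    for sigma in sigmas:
        mu = ascend(mu, sigma)
    return mu

mu_plain = ascend(4.0, sigma=0.0)        # direct ascent on ret
mu_cont = optimize_by_continuation(4.0)  # ascent through continuations
print(f"plain:        mu={mu_plain:.3f}, return={ret(mu_plain):.3f}")
print(f"continuation: mu={mu_cont:.3f}, return={ret(mu_cont):.3f}")
```

In this toy setting, direct ascent from `mu = 4.0` settles in a nearby local maximum, while the run that starts with a large `sigma` and shrinks it reaches the global maximum at `mu = 0`. Here `sigma` follows a fixed hand-picked schedule; the abstract argues instead for a variance that is adapted as a function of the history.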