We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) through \emph{optimism} \& \emph{adaptivity}. Leveraging the deep connection between policy iteration and policy gradient methods, we recast seemingly unrelated policy optimization algorithms as the repeated application of two interleaved steps: (i) an \emph{optimistic policy improvement operator} that maps a prior policy $\pi_t$ to a hypothesis $\pi_{t+1}$ using a \emph{gradient ascent prediction}, followed by (ii) a \emph{hindsight adaptation} of the optimistic prediction based on a partial evaluation of the performance of $\pi_{t+1}$. We use this shared lens to jointly express other well-known algorithms, including soft and optimistic policy iteration, natural actor-critic methods, model-based policy improvement based on forward search, and meta-learning algorithms. By doing so, we shed light on collective theoretical properties related to acceleration via optimism \& adaptivity. Building on these insights, we design an \emph{adaptive \& optimistic policy gradient} algorithm via meta-gradient learning, and empirically highlight several design choices pertaining to optimism in an illustrative task.
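The two interleaved steps described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a generic predict-then-correct (optimistic) gradient ascent loop on a one-state softmax policy, written under several assumptions of our own. The prediction is simply the last observed gradient, the "partial evaluation" is replaced by the exact policy gradient at the hypothesis, and all names (`policy_gradient`, `optimistic_step`, `run`) are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient(logits, rewards):
    # Exact gradient of J(theta) = sum_a pi(a) * r(a) with respect to the
    # softmax logits: dJ/dtheta_a = pi(a) * (r(a) - J).
    pi = softmax(logits)
    baseline = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - baseline) for p, r in zip(pi, rewards)]

def optimistic_step(logits, prediction, rewards, lr):
    # (i) Optimistic improvement: form a hypothesis by ascending along the
    #     predicted gradient (assumption: prediction = last observed gradient).
    hypothesis = [x + lr * u for x, u in zip(logits, prediction)]
    # Evaluate the hypothesis (assumption: exact gradient stands in for the
    # partial evaluation mentioned in the abstract).
    observed = policy_gradient(hypothesis, rewards)
    # (ii) Hindsight adaptation: redo the update from the prior parameters
    #      using what was actually observed instead of the prediction.
    new_logits = [x + lr * g for x, g in zip(logits, observed)]
    return new_logits, observed  # observed gradient is the next prediction

def run(rewards, steps=200, lr=1.0):
    logits = [0.0] * len(rewards)
    prediction = [0.0] * len(rewards)
    for _ in range(steps):
        logits, prediction = optimistic_step(logits, prediction, rewards, lr)
    return softmax(logits)
```

Calling `run([0.1, 0.5, 1.0])` returns a policy that places most of its probability on the highest-reward action. The choice of predictor is where optimism enters: a better guess of the upcoming gradient makes the hypothesis closer to the corrected update.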