We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight into the policy improvement step via optimistic and adaptive updates. Leveraging the connection between policy iteration and policy gradient methods, we view policy optimization algorithms as iteratively solving a sequence of surrogate objectives, each a local lower bound on the original objective. We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors that accumulate from overshooting predictions or delayed responses to change. We use this shared lens to jointly express other well-known algorithms, including model-based policy improvement based on forward search and optimistic meta-learning algorithms. We analyze properties of this formulation and show connections to other accelerated optimization algorithms. We then design an optimistic policy gradient algorithm, made adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration in an illustrative task.
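To make the notion of an optimistic update concrete, the following is a minimal sketch, not the algorithm designed in this work. It assumes a softmax policy on a small multi-armed bandit with known rewards, and it assumes one common form of optimism from the optimization literature: the previous gradient serves as the prediction of the next one, so each step follows `2 * grad - prev_grad`. All function names, the reward vector, the step count, and the learning rate are hypothetical choices made for illustration. The adaptive part (tuning the update via meta-gradients) is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """Objective J(theta): expected reward under the softmax policy."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def policy_gradient(logits, rewards):
    """Exact gradient of J w.r.t. the logits: p_a * (r_a - J)."""
    probs = softmax(logits)
    j = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - j) for p, r in zip(probs, rewards)]


def optimistic_policy_gradient(rewards, steps=200, lr=0.5):
    """Gradient ascent with an optimistic correction (illustrative assumption).

    The previous gradient is used as a hint for the next one, so the
    step direction is grad + (grad - prev_grad).
    """
    logits = [0.0] * len(rewards)
    prev_grad = [0.0] * len(rewards)
    for _ in range(steps):
        grad = policy_gradient(logits, rewards)
        logits = [
            th + lr * (2.0 * g - pg)
            for th, g, pg in zip(logits, grad, prev_grad)
        ]
        prev_grad = grad
    return logits


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.5]  # hypothetical bandit rewards
    logits = optimistic_policy_gradient(rewards)
    print("policy:", [round(p, 3) for p in softmax(logits)])
    print("expected reward:", round(expected_reward(logits, rewards), 3))
```

In this toy setting the gradient is exact, so the optimistic correction changes the trajectory only slightly. The sketch is meant to show where a prediction of future behavior enters the update, not to demonstrate a speed-up.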