Warm-Start reinforcement learning (RL), aided by a prior policy obtained from offline training, is emerging as a promising RL approach for practical applications. Recent empirical studies have demonstrated that the performance of Warm-Start RL can improve \textit{quickly} in some cases but become \textit{stagnant} in others, especially when function approximation is used. The primary objective of this work is therefore to build a fundamental understanding of ``\textit{whether and when online learning can be significantly accelerated by a warm-start policy from offline RL?}'' Specifically, we consider the widely used Actor-Critic (A-C) method with a prior policy. We first quantify the approximation errors in the Actor update and the Critic update, respectively. Next, we cast the Warm-Start A-C algorithm as Newton's method with perturbation, and study the impact of the approximation errors on the finite-time learning performance with inaccurate Actor/Critic updates. Under some general technical conditions, we derive upper bounds, which shed light on achieving the desired finite-time learning performance in the Warm-Start A-C algorithm. In particular, our findings reveal that it is essential to reduce the algorithm bias in online learning. We also obtain lower bounds on the sub-optimality gap of the Warm-Start A-C algorithm to quantify the impact of the bias and error propagation.