Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes an immediate drop in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, the offline maxima found by prior algorithms and the online maxima are separated by low-performance valleys that gradient-based fine-tuning must traverse. Motivated by this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids the valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order equality between the score of the policy and the action-gradient of the Q-function. We demonstrate experimentally that SMAC converges to offline maxima that are connected to better online maxima via paths of monotonically increasing reward found by first-order optimization. SMAC transfers smoothly to Soft Actor-Critic and TD3 in 6/6 D4RL tasks, and in 4/6 environments it reduces regret by 34-58% over the best baseline.
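The first-order equality mentioned above can be made concrete for SAC-style policies: when the policy satisfies π(a|s) ∝ exp(Q(s,a)/α), the policy score and the scaled action-gradient of the Q-function coincide, ∇ₐ log π(a|s) = ∇ₐQ(s,a)/α. A minimal sketch of such a penalty follows; the function names and the use of PyTorch autograd are illustrative assumptions, not the authors' implementation.

```python
import torch

def score_matching_penalty(q_net, log_pi_fn, states, actions, alpha=0.2):
    """Mean squared gap between the policy score grad_a log pi(a|s)
    and the scaled action-gradient grad_a Q(s,a) / alpha.
    (Hypothetical sketch of SMAC's regularizer, not the original code.)"""
    actions = actions.detach().requires_grad_(True)
    # score of the policy: grad_a log pi(a|s)
    score = torch.autograd.grad(
        log_pi_fn(states, actions).sum(), actions, create_graph=True)[0]
    # action-gradient of the critic: grad_a Q(s,a)
    q_grad = torch.autograd.grad(
        q_net(states, actions).sum(), actions, create_graph=True)[0]
    return ((score - q_grad / alpha) ** 2).sum(dim=-1).mean()

# Toy check: a quadratic critic Q(s,a) = -0.5 * ||a||^2 and the matching
# Gaussian log-density log pi(a|s) = -0.5 * ||a||^2 / alpha satisfy the
# equality exactly, so the penalty vanishes.
q = lambda s, a: -0.5 * (a ** 2).sum(dim=-1)
log_pi = lambda s, a: -0.5 * (a ** 2).sum(dim=-1) / 0.2
states, actions = torch.zeros(4, 3), torch.randn(4, 2)
penalty = score_matching_penalty(q, log_pi, states, actions, alpha=0.2)
```

During offline training, a term like this penalty would be added to the critic loss so that the learned Q-function's action-gradients stay consistent with the policy's score, which is the consistency condition the online value-based updates rely on.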