Offline pretraining with a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm well matched to a real-world RL deployment process. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but we show this can unnecessarily limit policy performance if the behavior policy is far from optimal. Instead, we forgo constraints and frame OtO RL as an exploration problem that aims to maximize the benefit of online data collection. We first study the major online RL exploration methods based on intrinsic rewards and UCB in the OtO setting, showing that intrinsic rewards add training instability through reward-function modification, and that UCB methods are myopic and leave unclear which learned component's ensemble should be used for action selection. We then introduce an algorithm for planning to go out-of-distribution (PTGOOD) that avoids these issues. PTGOOD uses a non-myopic planning procedure that targets exploration in relatively high-reward regions of the state-action space that are unlikely to be visited by the behavior policy. By leveraging concepts from the Conditional Entropy Bottleneck, PTGOOD encourages data collected online to provide new information relevant to improving the final deployment policy without altering rewards. We show empirically in several continuous control tasks that PTGOOD significantly improves agent returns during online fine-tuning and avoids the suboptimal policy convergence that many of our baselines exhibit in several environments.
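The core idea of planning toward novel, high-reward regions can be illustrated with a minimal sketch. This is not the paper's actual PTGOOD procedure: the dynamics model, reward model, behavior-density estimate, candidate count, and the log-density novelty bonus below are all toy stand-ins chosen for illustration. The sketch rolls out random action sequences with a learned model and scores each by predicted return plus a bonus for visiting state-action pairs that are unlikely under the behavior policy's data, then executes the first action of the best sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical stand-ins for learned components (all assumptions) ---
def dynamics_model(state, action):
    """Toy learned dynamics: a noisy linear step."""
    return state + 0.1 * action + 0.01 * rng.normal(size=state.shape)

def reward_model(state, action):
    """Toy learned reward: prefer states near the origin, penalize effort."""
    return -np.sum(state**2) - 0.01 * np.sum(action**2)

def behavior_density(state, action):
    """Toy density of the behavior policy's data (Gaussian around zero actions).
    Low density means the (s, a) pair is unlikely in the offline dataset."""
    return np.exp(-0.5 * np.sum(action**2))

def plan_out_of_distribution(state, horizon=5, n_candidates=64, beta=1.0):
    """Score random action sequences by model-predicted return plus a novelty
    bonus (negative log behavior density); return the best first action."""
    best_score, best_action = -np.inf, None
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, state.shape[0]))
        s, score = state.copy(), 0.0
        for a in seq:
            score += reward_model(s, a) - beta * np.log(behavior_density(s, a) + 1e-8)
            s = dynamics_model(s, a)
        if score > best_score:
            best_score, best_action = score, seq[0]
    return best_action

action = plan_out_of_distribution(np.zeros(3))
print(action.shape)  # (3,)
```

Because the bonus is computed inside the planner rather than added to the environment's reward function, the value function being fine-tuned is never trained on modified rewards, which is the property the abstract emphasizes.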