Offline pretraining on a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm well matched to real-world RL deployment. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but we show this can unnecessarily limit policy performance if the behavior policy is far from optimal. Instead, we forgo constraints and frame OtO RL as an exploration problem that aims to maximize the benefit of online data collection. We first study the major online RL exploration methods based on intrinsic rewards and UCB in the OtO setting, showing that intrinsic rewards add training instability through reward-function modification and that UCB methods are myopic, with no clear choice of which learned component's ensemble to use for action selection. We then introduce an algorithm for planning to go out-of-distribution (PTGOOD) that avoids these issues. PTGOOD uses a non-myopic planning procedure that targets exploration in relatively high-reward regions of the state-action space that are unlikely to be visited by the behavior policy. By leveraging concepts from the Conditional Entropy Bottleneck, PTGOOD encourages data collected online to provide new information relevant to improving the final deployment policy, without altering rewards. We show empirically across several continuous control tasks that PTGOOD significantly improves agent returns during online fine-tuning and avoids the suboptimal policy convergence that many of our baselines exhibit in some environments.