Offline reinforcement learning (RL) methods can generally be categorized into two types: RL-based and imitation-based. RL-based methods can in principle generalize out of distribution but suffer from erroneous off-policy evaluation. Imitation-based methods avoid off-policy evaluation but are too conservative to surpass the dataset. In this study, we propose an alternative approach that inherits the training stability of imitation-style methods while still allowing logical out-of-distribution generalization. We decompose the conventional reward-maximizing policy in offline RL into a guide-policy and an execute-policy. During training, the guide-policy and execute-policy are learned using only data from the dataset, in a supervised and decoupled manner. During evaluation, the guide-policy guides the execute-policy by telling it where to go so that the reward can be maximized, serving as the \textit{Prophet}. By doing so, our algorithm allows \textit{state-compositionality} from the dataset, rather than the \textit{action-compositionality} of prior imitation-style methods. We dub this new approach Policy-guided Offline RL (\texttt{POR}). \texttt{POR} achieves state-of-the-art performance on D4RL, a standard benchmark for offline RL. We also highlight the benefits of \texttt{POR} in terms of improving with supplementary suboptimal data and easily adapting to new tasks by changing only the guide-policy.
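The guide/execute decomposition can be illustrated with a minimal toy sketch. It is not the \texttt{POR} training procedure, which the abstract does not specify: the 1-D chain dataset, the lookup-table "policies", the hand-written `state_score` functions, and all function names below are assumptions made purely for illustration. The guide-policy proposes the next state to reach, and the execute-policy, fitted from the same transitions in a supervised way, returns the action that reaches it.

```python
from collections import defaultdict

# Hypothetical offline dataset of (state, action, next_state) transitions,
# collected on a 1-D chain where the next state equals state + action.
DATASET = [(0, 1, 1), (1, 1, 2), (1, -1, 0), (2, 1, 3), (2, -1, 1)]


def fit_guide(dataset, state_score):
    """Guide-policy (toy): map each state to the best-scoring successor seen in the data.

    `state_score` is an assumed stand-in for whatever signal ranks states by reward.
    """
    successors = defaultdict(set)
    for state, _action, next_state in dataset:
        successors[state].add(next_state)
    return {s: max(nexts, key=state_score) for s, nexts in successors.items()}


def fit_execute(dataset):
    """Execute-policy (toy): supervised lookup from (state, target state) to action."""
    return {(state, next_state): action for state, action, next_state in dataset}


def act(state, guide, execute):
    """Evaluation-time composition: the guide picks where to go, the executor picks how."""
    target = guide[state]
    return execute[(state, target)]


def rollout(start, guide, execute, max_steps):
    """Roll out the composed policy under the toy chain dynamics."""
    state, visited = start, [start]
    for _ in range(max_steps):
        if state not in guide:  # no outgoing transition in the dataset
            break
        state = state + act(state, guide, execute)
        visited.append(state)
    return visited


# The execute-policy is fitted once and shared across tasks.
execute = fit_execute(DATASET)

# Two tasks that differ only in the guide-policy: reach state 3, or reach state 0.
guide_right = fit_guide(DATASET, lambda s: -abs(3 - s))
guide_left = fit_guide(DATASET, lambda s: -abs(0 - s))

print(rollout(0, guide_right, execute, max_steps=5))  # [0, 1, 2, 3]
print(rollout(2, guide_left, execute, max_steps=2))   # [2, 1, 0]
```

In this sketch the two rollouts reuse the same `execute` table and differ only in the guide, which mirrors the claim that a new task can be handled by changing only the guide-policy. Each visited state is one that appears in the dataset, so the resulting path is composed from dataset states rather than from dataset actions.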