Offline reinforcement learning (RL) addresses the problem of learning a performant policy from a fixed batch of data collected by following some behavior policy. Model-based approaches are particularly appealing in the offline setting because, by learning a model of the environment, they can extract more learning signal from the logged dataset. However, the performance of existing model-based approaches falls short of their model-free counterparts, owing to the compounding of estimation errors in the learned model. Driven by this observation, we argue that it is critical for a model-based method to understand when to trust the model and when to rely on model-free estimates, and how to act conservatively with respect to both. To this end, we derive an elegant and simple methodology called conservative Bayesian model-based value expansion for offline policy optimization (CBOP), which trades off model-free and model-based estimates during the policy evaluation step according to their epistemic uncertainties, and facilitates conservatism by taking a lower bound on the Bayesian posterior value estimate. On the standard D4RL continuous control tasks, we find that our method significantly outperforms previous model-based approaches: e.g., MOPO by $116.4$%, MOReL by $23.2$% and COMBO by $23.7$%. Further, CBOP achieves state-of-the-art performance on $11$ out of $18$ benchmark datasets while performing on par on the remaining datasets.
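To make the trade-off described above concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each candidate value target (index 0 for the model-free estimate, indices 1 and up for model-based rollouts of increasing horizon) is summarized by a mean and an epistemic variance, that the targets are fused as independent Gaussian observations (precision weighting), and that conservatism is a lower confidence bound with a hypothetical coefficient `psi`. The function name and all parameters are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def conservative_fused_target(means, variances, psi=1.0):
    """Fuse candidate value targets and return a conservative estimate.

    Assumptions (not taken from the abstract):
      - means[h], variances[h] are the mean and epistemic variance of the
        h-step target, with h = 0 being the model-free estimate;
      - targets are treated as independent Gaussian observations of the
        same underlying value, so the posterior is precision-weighted;
      - psi >= 0 is a hypothetical conservatism coefficient.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # Low-variance (trusted) targets receive high weight.
    precisions = 1.0 / variances
    post_var = 1.0 / precisions.sum()
    post_mean = post_var * (precisions * means).sum()

    # Conservative target: lower bound on the posterior value estimate.
    lcb = post_mean - psi * np.sqrt(post_var)
    return lcb, post_mean, post_var


# Usage: a long-horizon rollout with large uncertainty barely moves the estimate.
lcb, mean, var = conservative_fused_target(
    means=[10.0, 12.0, 20.0], variances=[1.0, 4.0, 100.0], psi=1.0
)
print(lcb, mean, var)
```

Under these assumptions, a target whose variance grows with the rollout horizon is automatically down-weighted, which is one simple way to encode "when to trust the model"; setting `psi` to zero recovers the plain posterior mean.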