Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting rollout horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale this principle to realistic tasks and show that long-horizon planning is critical for reducing value overestimation once conservatism is removed. To make this feasible, we introduce key design choices for performing and learning from long-horizon rollouts while controlling compounding errors. These choices yield our algorithm, NEUBAY, grounded in the neutral Bayesian principle. On the D4RL and NeoRL benchmarks, NEUBAY generally matches or surpasses leading conservative algorithms, achieving a new state of the art on 7 datasets. Notably, it succeeds with rollout horizons of several hundred steps, contrary to dominant practice. Finally, we characterize datasets by quality and coverage, showing when NEUBAY is preferable to conservative methods. Together, we argue that NEUBAY lays the foundation for a new, practical direction in offline and model-based RL.
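To make the principle described above concrete, the following is a minimal, hypothetical Python sketch of a neutral Bayesian training loop: a bootstrapped ensemble stands in for the posterior over world models, one member is sampled per rollout, and a history-conditioned policy is unrolled for hundreds of steps to estimate expected return. All names and quantities here (EnsembleDynamics, long_horizon_rollout, the toy dynamics and reward) are illustrative assumptions, not NEUBAY's actual implementation.

```python
# Hedged sketch of a neutral Bayesian rollout loop; NOT the authors' code.
import numpy as np

rng = np.random.default_rng(0)

class EnsembleDynamics:
    """Posterior over world models, approximated by a bootstrapped ensemble.

    Each member is 'fit' on a bootstrap resample of the offline dataset;
    sampling a member per rollout stands in for sampling from the posterior.
    """
    def __init__(self, dataset, n_members=5):
        self.members = [self._fit(self._bootstrap(dataset)) for _ in range(n_members)]

    def _bootstrap(self, dataset):
        idx = rng.integers(0, len(dataset), size=len(dataset))
        return [dataset[i] for i in idx]

    def _fit(self, data):
        # Toy stand-in for a learned dynamics model: a member-specific noise scale.
        return {"noise": 0.05 + 0.05 * rng.random()}

    def sample_member(self):
        return self.members[rng.integers(len(self.members))]

    def step(self, member, state, action):
        # Placeholder transition: linear drift plus member-specific noise.
        next_state = state + 0.1 * action + member["noise"] * rng.normal(size=state.shape)
        reward = -float(np.sum(next_state ** 2))  # toy reward
        return next_state, reward


def long_horizon_rollout(model, policy, history, horizon=400):
    """Unroll a history-conditioned policy inside ONE sampled world model.

    Conditioning on the growing history lets the agent implicitly infer which
    model it is acting in, which is what enables test-time generalization
    without an explicit conservatism penalty.
    """
    state = history[-1][0]
    member = model.sample_member()      # one posterior sample per rollout
    ret = 0.0
    for _ in range(horizon):            # hundreds of steps, per the abstract
        action = policy(history)
        state, reward = model.step(member, state, action)
        history.append((state, action, reward))
        ret += reward
    return ret


# Usage sketch: the training objective is the expected return across posterior samples.
dataset = [(rng.normal(size=2), rng.normal(size=2), 0.0) for _ in range(100)]
model = EnsembleDynamics(dataset)
policy = lambda hist: -0.5 * hist[-1][0]                 # stand-in for a recurrent policy
start = lambda: [(rng.normal(size=2), np.zeros(2), 0.0)]
returns = [long_horizon_rollout(model, policy, start()) for _ in range(8)]
print("expected return under the posterior:", np.mean(returns))
```

The design choice this sketch highlights is that disagreement among posterior samples is resolved by averaging returns over sampled world models rather than by penalizing uncertain actions, which is the contrast with conservative offline RL drawn in the abstract.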