Model-based offline reinforcement learning trains policies from pre-collected datasets and learned environment models, eliminating the need for direct interaction with the real environment. However, this paradigm is inherently challenged by distribution shift. Existing methods address the issue through off-policy mechanisms and model-uncertainty estimation, but they often pursue inconsistent objectives and lack a unified theoretical foundation. This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift. Our theoretical and empirical investigations reveal how these two factors distort value estimation and restrict policy optimization. To tackle these challenges, we derive a novel shifts-aware reward within a unified probabilistic inference framework; it modifies the vanilla reward to refine value learning and facilitate policy training. Building on this, we develop a practical implementation that uses classifier-based techniques to approximate the adjusted reward for effective policy optimization. Empirical results across multiple benchmarks demonstrate that the proposed approach mitigates distribution shift and achieves performance superior or comparable to existing methods, validating our theoretical insights.
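To make the shifts-aware reward concrete, one plausible schematic, consistent with the abstract's decomposition into model bias and policy shift but not taken from the paper's derivation (the coefficients $\lambda_m, \lambda_p$, the signs, and the exact ratio terms are assumptions), corrects the vanilla reward with one penalty per shift source:
\[
  \tilde{r}(s, a, s') \;=\; r(s, a)
  \;-\; \lambda_m \log \frac{p_{\hat{M}}(s' \mid s, a)}{p_{M}(s' \mid s, a)}
  \;-\; \lambda_p \log \frac{\pi(a \mid s)}{\pi_b(a \mid s)},
\]
where $p_{\hat{M}}$ and $p_{M}$ denote the learned and true transition distributions and $\pi_b$ the behavior policy. Neither ratio is available in closed form; the standard classifier-based trick estimates each from the logit of a binary discriminator, since an optimal classifier $D$ trained to separate samples of $p_1$ from samples of $p_2$ satisfies $\log \frac{p_1(x)}{p_2(x)} = \operatorname{logit} D(x)$. Under this reading, one discriminator distinguishes real from model-generated transitions, and the other distinguishes dataset actions from current-policy actions.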