Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism into policy optimization, usually by constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages policies from learning in out-of-distribution (OOD) regions beyond the support of the offline dataset, which can leave the generalization ability of dynamics models under-utilized. In contrast, we propose constructing an Optimistic MDP (O-MDP). We first observe the potential benefits of optimism that come from encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. We then relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
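The two reward signals described in the abstract (an optimistic one for the rollout policy in the O-MDP, a penalized one for the output policy in the P-MDP) can be illustrated with a minimal sketch. The abstract does not specify how optimism or the penalty is computed, so everything below is an assumption made for illustration: uncertainty is estimated as ensemble disagreement over next-state predictions, the optimistic reward adds a bonus proportional to it, and the pessimistic reward subtracts a penalty proportional to it. The names `ensemble_uncertainty`, `optimistic_reward`, `pessimistic_reward`, `bonus_coef`, and `penalty_coef` are hypothetical and do not come from the paper.

```python
import numpy as np


def ensemble_uncertainty(next_state_preds):
    """Assumed uncertainty estimate: ensemble disagreement.

    next_state_preds has shape (n_models, batch, state_dim).
    Returns an array of shape (batch,) holding, for each sample, the largest
    distance between any ensemble member's prediction and the ensemble mean.
    """
    mean = next_state_preds.mean(axis=0)                       # (batch, state_dim)
    dists = np.linalg.norm(next_state_preds - mean, axis=-1)   # (n_models, batch)
    return dists.max(axis=0)


def optimistic_reward(reward, uncertainty, bonus_coef=1.0):
    # Hypothetical O-MDP reward: a bonus draws the rollout policy
    # toward uncertain (likely OOD) state-action pairs.
    return reward + bonus_coef * uncertainty


def pessimistic_reward(reward, uncertainty, penalty_coef=1.0):
    # Hypothetical P-MDP reward: the same pairs are relabeled with a
    # penalty before the output policy is optimized on them.
    return reward - penalty_coef * uncertainty


# Toy data standing in for model rollouts: 5 ensemble members,
# 4 sampled state-action pairs, 3-dimensional states.
rng = np.random.default_rng(0)
preds = rng.normal(size=(5, 4, 3))
rewards = rng.normal(size=4)

u = ensemble_uncertainty(preds)
r_opt = optimistic_reward(rewards, u, bonus_coef=0.5)     # used to train the rollout policy
r_pes = pessimistic_reward(rewards, u, penalty_coef=2.0)  # used to train the output policy
```

In this sketch the same sampled pairs carry two different rewards: the rollout policy sees `r_opt` while collecting model rollouts, and the output policy only ever sees the relabeled `r_pes`. The coefficients are arbitrary placeholders, not values reported in the paper.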