Optimistic Model Rollouts for Pessimistic Offline Policy Optimization

Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism into policy optimization, usually by constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages policies from learning in out-of-distribution (OOD) regions beyond the support of the offline dataset, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We first observe the potential benefits of optimism, which encourages more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. We then relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on a widely used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
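The two-stage reward handling described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the optimistic bonus or the pessimistic penalty, so the code below assumes both are linear in a model-uncertainty estimate (here, disagreement among ensemble predictions), which is a common construction in model-based offline RL. All function names, the coefficients `bonus_coef` and `penalty_coef`, and the rollout tuple layout are hypothetical and are not taken from the paper.

```python
import statistics


def ensemble_uncertainty(predictions):
    """Assumed uncertainty estimate: population std of the ensemble's
    (scalar) next-state predictions for one state-action pair."""
    return statistics.pstdev(predictions)


def optimistic_reward(reward, uncertainty, bonus_coef):
    """Assumed O-MDP reward: the model reward plus an uncertainty bonus,
    which pushes the rollout policy toward OOD regions."""
    return reward + bonus_coef * uncertainty


def pessimistic_reward(reward, uncertainty, penalty_coef):
    """Assumed P-MDP reward: the model reward minus an uncertainty penalty."""
    return reward - penalty_coef * uncertainty


def relabel_rollouts(rollouts, penalty_coef):
    """Relabel rollouts sampled by the optimistic policy with penalized
    rewards before they are used to optimize the output policy.

    Each rollout entry is (state, action, model_reward, uncertainty).
    """
    return [
        (state, action, pessimistic_reward(reward, unc, penalty_coef))
        for (state, action, reward, unc) in rollouts
    ]


if __name__ == "__main__":
    # Two ensemble members disagree about the next state.
    unc = ensemble_uncertainty([0.0, 2.0])
    sampled = [(0.0, 1.0, 1.0, unc)]
    print("optimistic reward:", optimistic_reward(1.0, unc, bonus_coef=0.5))
    print("relabeled rollouts:", relabel_rollouts(sampled, penalty_coef=0.5))
```

The point of the sketch is the separation of roles: the bonus only shapes which state-action pairs get sampled, while the output policy sees the same pairs with penalized rewards.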

