Efficient exploration remains a central challenge in reinforcement learning (RL), particularly in sparse-reward environments. We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration that brings classical reward-biased maximum likelihood estimation (RBMLE) from adaptive control into deep RL. In contrast to upper confidence bound (UCB)-style exploration methods, OWMs incorporate optimism directly into model learning by augmenting the training objective with an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes. This fully gradient-based loss requires neither uncertainty estimates nor constrained optimization. Our approach is plug-and-play with existing world model frameworks, preserving scalability while requiring only minimal modifications to standard training procedures. We instantiate OWMs within two state-of-the-art world model architectures, yielding Optimistic DreamerV3 and Optimistic STORM, which deliver significant improvements in sample efficiency and cumulative return over their baseline counterparts.
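The abstract does not spell out the loss itself, but the general shape of an RBMLE-style objective is a standard likelihood term plus a reward-bias term that favors high-reward predictions. The snippet below is a minimal, illustrative sketch under that assumption: it uses a squared-error transition loss as a stand-in for the model's likelihood term, and `TinyWorldModel`, `optimistic_loss`, and the bias weight `eta` are all hypothetical names introduced here for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    """Toy deterministic latent dynamics plus reward head, standing in for a
    full world model such as DreamerV3 or STORM (hypothetical for this sketch)."""
    def __init__(self, state_dim=8, action_dim=2):
        super().__init__()
        self.dynamics = nn.Linear(state_dim + action_dim, state_dim)
        self.reward_head = nn.Linear(state_dim, 1)

    def forward(self, state, action):
        next_state = self.dynamics(torch.cat([state, action], dim=-1))
        reward = self.reward_head(next_state)
        return next_state, reward

def optimistic_loss(model, state, action, next_state_target, eta=0.1):
    """RBMLE-style objective: prediction loss minus a reward-bias term that
    nudges predicted transitions toward higher-reward outcomes.
    `eta` is an assumed bias-weight hyperparameter, not taken from the paper."""
    pred_next, pred_reward = model(state, action)
    # Stand-in for the model's likelihood loss (here: squared prediction error).
    nll = ((pred_next - next_state_target) ** 2).mean()
    # Purely gradient-based optimism: no uncertainty estimate, no constraint.
    optimism = -pred_reward.mean()
    return nll + eta * optimism
```

Because the bias enters as an additive, differentiable term, it trains with the same optimizer and loop as the base world-model loss, which is consistent with the abstract's claim that OWMs plug into existing frameworks with only minimal changes to standard training procedures.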