We introduce DreamerV3-XP, an extension of DreamerV3 that improves exploration and learning efficiency. It adds (i) a prioritized replay buffer that scores trajectories by return, reconstruction loss, and value error, and (ii) an intrinsic reward based on disagreement over predicted environment rewards from an ensemble of world models. DreamerV3-XP is evaluated on a subset of the Atari100k and DeepMind Control Visual Benchmark tasks, confirming the original DreamerV3 results and showing that our extensions lead to faster learning and lower dynamics model loss, particularly in sparse-reward settings.
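The two components can be illustrated with a minimal sketch. The abstract only states that trajectories are scored by return, reconstruction loss, and value error and that the intrinsic reward comes from ensemble disagreement on predicted rewards; the linear weighting of the priority score, the use of the per-step standard deviation as the disagreement measure, and all function names below are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the two mechanisms named in the abstract (hypothetical code).
import numpy as np

def trajectory_priority(episode_return, recon_loss, value_error,
                        w_return=1.0, w_recon=1.0, w_value=1.0):
    """Replay priority combining the three signals named in the abstract.

    The linear combination and the weights are assumptions; the abstract only
    says trajectories are scored by return, reconstruction loss, and value error.
    """
    return (w_return * episode_return
            + w_recon * recon_loss
            + w_value * np.abs(value_error))

def intrinsic_reward(ensemble_reward_preds):
    """Disagreement-based intrinsic reward.

    ensemble_reward_preds has shape (ensemble_size, horizon), holding each
    world model's predicted environment reward. Using the per-step standard
    deviation across the ensemble as the bonus is an assumption.
    """
    return np.std(ensemble_reward_preds, axis=0)

# Toy usage: 3 ensemble members, 3 imagined steps.
preds = np.array([[0.1, 0.0, 0.5],
                  [0.2, 0.0, 0.1],
                  [0.0, 0.0, 0.9]])
print(trajectory_priority(episode_return=12.0, recon_loss=0.3, value_error=-0.8))
print(intrinsic_reward(preds))  # bonus is largest where the models disagree
```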