Dyna-style off-policy model-based reinforcement learning (DMBRL) algorithms are a family of techniques that generate synthetic state transition data to improve the sample efficiency of off-policy RL algorithms. This paper identifies and investigates a surprising performance gap observed when applying DMBRL algorithms across different benchmark environments with proprioceptive observations. We show that, while DMBRL algorithms perform well in OpenAI Gym, their performance can drop significantly in the DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process -- the backbone of Dyna-style algorithms -- significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, as in many fields of optimization, there is no free lunch when evaluating performance across diverse benchmarks in RL.
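To make the mechanism under study concrete, the following is a minimal sketch of one Dyna-style training step: short rollouts from a learned dynamics model, branched off real states, are appended to the agent's training data alongside real transitions. All interfaces here (`env`, `agent`, `model`, the buffer class, rollout horizon, and batch ratio) are illustrative assumptions for exposition, not the specific algorithms or hyperparameters evaluated in the paper.

```python
import numpy as np

class ReplayBuffer:
    """Stores (s, a, r, s') transitions from real or synthetic experience."""
    def __init__(self):
        self.transitions = []

    def add(self, s, a, r, s_next):
        self.transitions.append((s, a, r, s_next))

    def sample(self, n):
        # Sample n transitions uniformly with replacement.
        idx = np.random.randint(len(self.transitions), size=n)
        return [self.transitions[i] for i in idx]

def dyna_step(env, agent, model, real_buffer, synth_buffer,
              rollout_horizon=5, n_rollouts=32):
    """One hypothetical Dyna-style iteration: collect real data, update the
    dynamics model, generate synthetic rollouts, and train the agent on a
    mixture of real and synthetic transitions."""
    # 1. Collect one real transition and store it.
    s = env.observation()
    a = agent.act(s)
    r, s_next = env.step(a)          # hypothetical (reward, next state) API
    real_buffer.add(s, a, r, s_next)

    # 2. Fit / update the dynamics model on real data only.
    model.update(real_buffer)

    # 3. Branch short synthetic rollouts from previously visited real states.
    for (s0, _, _, _) in real_buffer.sample(n_rollouts):
        s_hat = s0
        for _ in range(rollout_horizon):
            a_hat = agent.act(s_hat)
            r_hat, s_hat_next = model.predict(s_hat, a_hat)
            synth_buffer.add(s_hat, a_hat, r_hat, s_hat_next)
            s_hat = s_hat_next

    # 4. Train the off-policy agent on a mix of real and synthetic data
    #    (the 64/192 split is an arbitrary illustrative ratio).
    batch = real_buffer.sample(64) + synth_buffer.sample(192)
    agent.update(batch)
```

The synthetic-to-real data ratio and the rollout horizon in step 3 govern how heavily training leans on the learned model; the augmentation in steps 3-4 is precisely the component whose benefit the paper finds to be environment-dependent.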