Synthetic data is a core component of data-efficient Dyna-style model-based reinforcement learning, yet it can also degrade performance. We study when it helps, where it fails, and why, and we show that addressing the resulting failure modes enables policy improvement that was previously unattainable. We focus on Model-Based Policy Optimization (MBPO), which performs actor and critic updates using synthetic action counterfactuals. Despite reports of strong and generalizable sample-efficiency gains in OpenAI Gym, recent work shows that MBPO often underperforms its model-free counterpart, Soft Actor-Critic (SAC), in the DeepMind Control Suite (DMC). Although both suites involve continuous control with proprioceptive robots, this shift leads to sharp performance losses across seven challenging DMC tasks, with MBPO failing in cases where claims of generalization from Gym would imply success. This reveals how environment-specific assumptions can become implicitly encoded into algorithm design when evaluation is limited. We identify two coupled issues behind these failures: scale mismatches between dynamics and reward models that induce critic underestimation and hinder policy improvement during model-policy coevolution, and a poor choice of target representation that inflates model variance and produces error-prone rollouts. Addressing these failure modes enables policy improvement where none was previously possible, allowing MBPO to outperform SAC in five of seven tasks while preserving the strong performance previously reported in OpenAI Gym. Rather than aiming only for incremental average gains, we hope our findings motivate the community to develop taxonomies that tie MDP task- and environment-level structure to algorithmic failure modes, pursue unified solutions where possible, and clarify how benchmark choices ultimately shape the conditions under which algorithms generalize.
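To make the two failure modes concrete, here is a minimal sketch that is not the paper's code: a toy one-step model fit by ridge regression, with hypothetical names (e.g. `NormalizedDeltaModel`), illustrating (1) per-dimension target normalization so the dynamics and reward heads share scale, and (2) predicting the state delta rather than the absolute next state as the target representation.

```python
# Minimal sketch (assumed, not the paper's implementation) of a one-step
# dynamics/reward model with two design choices highlighted in the abstract:
#   (1) standardize every target dimension so the reward head's scale cannot
#       dominate or be dominated by the dynamics head,
#   (2) regress onto the state delta (s' - s) instead of the absolute next
#       state, which typically lowers target variance for smooth dynamics.
import numpy as np


class NormalizedDeltaModel:
    """Toy one-step model fit by ridge regression on (s, a) -> [delta_s, r]."""

    def __init__(self, ridge: float = 1e-3):
        self.ridge = ridge
        self.W = None
        self.target_mean = None
        self.target_std = None

    def fit(self, states, actions, next_states, rewards):
        X = np.hstack([states, actions, np.ones((len(states), 1))])
        # Choice (2): delta-state targets instead of absolute next states.
        targets = np.hstack([next_states - states, rewards[:, None]])
        # Choice (1): per-dimension standardization across dynamics and reward.
        self.target_mean = targets.mean(axis=0)
        self.target_std = targets.std(axis=0) + 1e-8
        Y = (targets - self.target_mean) / self.target_std
        A = X.T @ X + self.ridge * np.eye(X.shape[1])
        self.W = np.linalg.solve(A, X.T @ Y)
        return self

    def predict(self, states, actions):
        X = np.hstack([states, actions, np.ones((len(states), 1))])
        Y = X @ self.W * self.target_std + self.target_mean
        next_states = states + Y[:, :-1]  # undo the delta parameterization
        rewards = Y[:, -1]
        return next_states, rewards


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.normal(size=(256, 3))
    a = rng.normal(size=(256, 1))
    s_next = s + 0.1 * np.tanh(a)            # smooth toy dynamics
    r = -np.linalg.norm(s, axis=1) * 100.0   # reward on a much larger scale
    model = NormalizedDeltaModel().fit(s, a, s_next, r)
    pred_s, pred_r = model.predict(s[:5], a[:5])
    print(np.abs(pred_s - s_next[:5]).max(), np.abs(pred_r - r[:5]).max())
```

The toy reward is deliberately two orders of magnitude larger than the state targets; without the shared standardization, a joint regression loss would be dominated by the reward dimension, which is the kind of scale mismatch the abstract attributes to critic underestimation during model-policy coevolution.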