Model-based reinforcement learning (MBRL) seeks to enhance data efficiency by learning a model of the environment and generating synthetic rollouts from it. However, accumulated model errors during these rollouts can distort the data distribution, negatively impacting policy learning and hindering long-term planning. Thus, the accumulation of model errors is a key bottleneck in current MBRL methods. We propose Infoprop, a model-based rollout mechanism that separates aleatoric from epistemic model uncertainty and reduces the influence of the latter on the data distribution. Further, Infoprop keeps track of accumulated model errors along a model rollout and provides termination criteria to limit data corruption. We demonstrate the capabilities of Infoprop in the Infoprop-Dyna algorithm, reporting state-of-the-art performance in Dyna-style MBRL on common MuJoCo benchmark tasks while substantially increasing rollout length and data quality.
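To make the core idea concrete, here is a minimal, hypothetical sketch of how an ensemble of probabilistic dynamics models can separate epistemic uncertainty (disagreement between ensemble members) from aleatoric uncertainty (the noise each member predicts), inject only the aleatoric part into the rollout, and terminate once accumulated epistemic uncertainty exceeds a budget. The function name, the `eps_budget` parameter, and the termination rule are illustrative assumptions, not the actual Infoprop mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_with_termination(models, s0, policy, horizon, eps_budget=1.0):
    """Illustrative rollout. Each model in `models` maps (s, a) -> (mean, var).
    Epistemic uncertainty is estimated from ensemble disagreement; only
    aleatoric noise is injected into the next state, and the rollout stops
    once accumulated epistemic uncertainty exceeds `eps_budget`."""
    s, acc_epistemic, traj = s0, 0.0, [s0]
    for _ in range(horizon):
        a = policy(s)
        preds = [m(s, a) for m in models]                      # [(mean, var), ...]
        means = np.stack([mu for mu, _ in preds])
        alea_var = np.mean([var for _, var in preds], axis=0)  # avg predicted noise
        epi_var = means.var(axis=0)                            # ensemble disagreement
        acc_epistemic += float(epi_var.sum())
        if acc_epistemic > eps_budget:                         # termination criterion
            break
        # Propagate the ensemble-mean prediction plus aleatoric noise only,
        # so epistemic model error does not corrupt the synthetic data.
        s = means.mean(axis=0) + rng.normal(0.0, np.sqrt(alea_var))
        traj.append(s)
    return traj
```

A toy usage: with three slightly disagreeing linear models, a generous budget lets the rollout run to the full horizon, while a tiny budget triggers immediate termination before any corrupted transition is collected.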