While reinforcement learning (RL) methods that learn an internal model of the environment have the potential to be more sample efficient than their model-free counterparts, learning to model raw observations from high-dimensional sensors can be challenging. Prior work has addressed this challenge by learning low-dimensional representations of observations through auxiliary objectives, such as reconstruction or value prediction. However, the alignment between these auxiliary objectives and the RL objective is often unclear. In this work, we propose a single objective that jointly optimizes a latent-space model and a policy to achieve high returns while remaining self-consistent. This objective is a lower bound on expected returns. Unlike prior bounds for model-based RL, which concern policy exploration or model guarantees, our bound is directly on the overall RL objective. We demonstrate that the resulting algorithm matches or improves on the sample efficiency of the best prior model-based and model-free RL methods. While sample-efficient methods are typically computationally demanding, our method attains the performance of SAC in about 50% less wall-clock time.