Model-based reinforcement learning (MBRL) is a sample-efficient technique for obtaining control policies, yet unavoidable modeling errors often lead to performance deterioration. The model in MBRL is often fitted solely to reconstruct dynamics, state observations in particular, while the impact of model error on the policy is not captured by the training objective. This leads to a mismatch between the intended goal of MBRL, enabling good policy and value learning, and the target of the loss function employed in practice, future state prediction. Naive intuition would suggest that value-aware model learning would fix this problem and, indeed, several solutions to this objective mismatch problem have been proposed based on theoretical analysis. However, they tend to be inferior in practice to commonly used maximum likelihood estimation (MLE) based approaches. In this paper we propose Value-gradient weighted Model Learning (VaGraM), a novel method for value-aware model learning that improves the performance of MBRL in challenging settings, such as small model capacity and the presence of distracting state dimensions. We analyze both MLE and value-aware approaches, demonstrate how they fail to account for exploration and the behavior of function approximation when learning value-aware models, and highlight the additional goals that must be met to stabilize optimization in the deep learning setting. We verify our analysis by showing that our loss function achieves high returns on the MuJoCo benchmark suite while being more robust than maximum likelihood based approaches.
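To make the objective mismatch concrete, the following is a minimal sketch of how a state-prediction loss and a value-gradient weighted loss can rank two models differently. The abstract does not give the exact form of the VaGraM loss, so the weighting used here (each dimension's prediction error scaled by the value gradient at the true next state) is an assumption made for illustration. The value function `V(s) = -s[0]**2`, the two-dimensional state with a distracting second dimension, and all function and variable names are hypothetical.

```python
def mse_loss(pred, target):
    """Plain squared-error loss: every state dimension counts equally."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def value_gradient_weighted_loss(pred, target, value_grad):
    """Squared error with each dimension scaled by the value gradient.

    Assumed form for illustration only; not taken from the abstract.
    """
    return sum((g * (p - t)) ** 2 for p, t, g in zip(pred, target, value_grad))


def toy_value_grad(state):
    """Gradient of the hypothetical value function V(s) = -s[0]**2.

    The second dimension does not affect the value, so it is a distractor.
    """
    return [-2.0 * state[0], 0.0]


# True next state: dimension 0 matters for the value, dimension 1 does not.
target = [1.0, 3.0]

# Model A is off by 0.5 on the relevant dimension, exact on the distractor.
model_a = [1.5, 3.0]
# Model B is exact on the relevant dimension, off by 1.0 on the distractor.
model_b = [1.0, 4.0]

grad = toy_value_grad(target)

mse_a = mse_loss(model_a, target)
mse_b = mse_loss(model_b, target)
vg_a = value_gradient_weighted_loss(model_a, target, grad)
vg_b = value_gradient_weighted_loss(model_b, target, grad)

print("MSE:           ", mse_a, mse_b)  # state prediction prefers model A
print("Value-weighted:", vg_a, vg_b)    # value-aware loss prefers model B
```

In this toy setting the state-prediction loss favors model A, which errs on the dimension that actually drives the value, while the value-gradient weighted loss favors model B, whose error is confined to the distracting dimension.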