We propose and theoretically analyze an approach for planning with an approximate model in reinforcement learning that can reduce the adverse impact of model error. If the model is accurate enough, the approach also accelerates convergence to the true value function. One of its key components is the MaxEnt Model Correction (MoCo) procedure, which corrects the model's next-state distributions using a Maximum Entropy density estimation formulation. Based on MoCo, we introduce the Model Correcting Value Iteration (MoCoVI) algorithm and its sample-based variant, MoCoDyna. We show that MoCoVI and MoCoDyna can converge much faster than conventional model-free algorithms. Unlike traditional model-based algorithms, MoCoVI and MoCoDyna effectively utilize an approximate model and still converge to the correct value function.
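To make the idea of correcting a next-state distribution concrete, the following is a minimal sketch of a generic maximum-entropy style correction on a finite state space. It is an illustration under stated assumptions, not the formulation used by MoCo itself: we assume the correction is the distribution closest in KL divergence to the model's distribution whose expectations of a few basis functions match their true values, and we solve the dual by plain gradient descent. The function name `maxent_correct`, the step size, the iteration count, and the toy numbers are all hypothetical.

```python
import numpy as np

def maxent_correct(p_hat, phi, targets, lr=0.5, steps=5000):
    """Illustrative correction of a model distribution (hypothetical helper).

    p_hat:   (n,) model's next-state distribution, all entries > 0
    phi:     (n, d) values of d basis functions at each next state
    targets: (d,) expectations of the basis functions under the true dynamics

    Returns q proportional to p_hat * exp(phi @ lam), where lam minimizes the
    convex dual  log(sum(p_hat * exp(phi @ lam))) - lam @ targets.
    """
    lam = np.zeros(phi.shape[1])
    for _ in range(steps):
        logits = np.log(p_hat) + phi @ lam
        logits -= logits.max()          # subtract max for numerical stability
        q = np.exp(logits)
        q /= q.sum()
        grad = phi.T @ q - targets      # dual gradient: E_q[phi] - targets
        lam -= lr * grad
    return q

# Toy example (assumed numbers): four next states, one basis function.
p_true = np.array([0.1, 0.2, 0.3, 0.4])    # unknown true distribution
p_model = np.array([0.4, 0.3, 0.2, 0.1])   # inaccurate model
values = np.array([0.0, 1.0, 2.0, 3.0])    # basis function, e.g. a value estimate

phi = values.reshape(-1, 1)
targets = np.array([p_true @ values])      # true expectation of the basis function

corrected = maxent_correct(p_model, phi, targets)
model_expectation = p_model @ values
corrected_expectation = corrected @ values
```

In this sketch the corrected distribution reproduces the true expectation of the chosen basis function even though the model's own distribution does not, which is the property a planner would rely on when backing up values through the corrected model.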