Equipped with a trained environmental dynamics model, model-based offline reinforcement learning (RL) algorithms can often learn good policies from fixed-size datasets, even from datasets of poor quality. However, there is no guarantee that the samples generated by the trained dynamics model are reliable (e.g., some synthetic samples may lie outside the support region of the static dataset). To address this issue, we propose Trajectory Truncation with Uncertainty (TATU), which adaptively truncates a synthetic trajectory if the accumulated uncertainty along the trajectory is too large. We derive a performance bound for TATU to justify its benefits. To show the advantages of TATU empirically, we first combine it with two classical model-based offline RL algorithms, MOPO and COMBO. We then integrate TATU with several off-the-shelf model-free offline RL algorithms, e.g., BCQ. Experimental results on the D4RL benchmark show that TATU significantly improves their performance, often by a large margin. Code is available here.
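To make the truncation idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes scalar states and actions, and it assumes a hypothetical uncertainty proxy (the spread of predictions across an ensemble of learned models) and a hypothetical fixed `threshold`; the abstract specifies neither. The function name `rollout_with_truncation` and the choice to discard the transition that crosses the threshold are likewise illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple

# (state, action, next_state); scalars are used only to keep the sketch short.
Transition = Tuple[float, float, float]


def rollout_with_truncation(
    start_state: float,
    policy: Callable[[float], float],
    ensemble: Sequence[Callable[[float, float], float]],
    horizon: int,
    threshold: float,
) -> List[Transition]:
    """Roll out a learned model and stop once accumulated uncertainty exceeds threshold."""
    kept: List[Transition] = []
    state = start_state
    accumulated = 0.0
    for _ in range(horizon):
        action = policy(state)
        predictions = [model(state, action) for model in ensemble]
        # Assumed uncertainty proxy: spread (max - min) of the ensemble predictions.
        accumulated += max(predictions) - min(predictions)
        if accumulated > threshold:
            # Truncate: drop this transition and everything after it.
            break
        next_state = sum(predictions) / len(predictions)
        kept.append((state, action, next_state))
        state = next_state
    return kept


# Toy usage: two hypothetical models whose disagreement grows with |state|.
toy_ensemble = [
    lambda s, a: s + a,
    lambda s, a: s + a + 0.1 * abs(s),
]
kept = rollout_with_truncation(
    start_state=0.0,
    policy=lambda s: 1.0,
    ensemble=toy_ensemble,
    horizon=5,
    threshold=0.2,
)
print(len(kept))  # 2 transitions survive before the budget is exceeded
```

In this toy run the accumulated spread is roughly 0.0, 0.1, and 0.305 after the first three steps, so the rollout is cut short at the third step even though the horizon is 5. Only the retained transitions would then be added to the synthetic buffer used by the downstream offline RL algorithm.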