Equipped with a trained environmental dynamics model, model-based offline reinforcement learning (RL) algorithms can often learn good policies from fixed-size datasets, even from some datasets of poor quality. However, there is no guarantee that the samples generated by the trained dynamics model are reliable (e.g., some synthetic samples may lie outside the support region of the static dataset). To address this issue, we propose Trajectory Truncation with Uncertainty (TATU), which adaptively truncates a synthetic trajectory if the uncertainty accumulated along the trajectory is too large. We theoretically establish a performance bound for TATU to justify its benefits. To show the advantages of TATU empirically, we first combine it with two classical model-based offline RL algorithms, MOPO and COMBO. Furthermore, we integrate TATU with several off-the-shelf model-free offline RL algorithms, e.g., BCQ. Experimental results on the D4RL benchmark show that TATU significantly improves their performance, often by a large margin.
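The truncation idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how uncertainty is measured or how the threshold is chosen, so everything below is an assumption made for illustration: uncertainty is taken to be the disagreement within an ensemble of dynamics models, and `rollout_with_truncation`, `policy`, `ensemble`, `horizon`, and `threshold` are hypothetical names, not the authors' implementation.

```python
import numpy as np


def rollout_with_truncation(start_state, policy, ensemble, horizon, threshold):
    """Roll out a synthetic trajectory and truncate it once the accumulated
    uncertainty exceeds `threshold`.

    Assumption (not from the abstract): per-step uncertainty is the largest
    deviation of any ensemble member's prediction from the ensemble mean.
    Returns the kept transitions and the accumulated uncertainty, which
    includes the step that triggered truncation, if any.
    """
    state = np.asarray(start_state, dtype=float)
    accumulated = 0.0
    transitions = []
    for _ in range(horizon):
        action = policy(state)
        # Each (hypothetical) ensemble member predicts the next state.
        predictions = np.stack([member(state, action) for member in ensemble])
        mean_prediction = predictions.mean(axis=0)
        # Ensemble disagreement as a stand-in uncertainty measure.
        deviations = np.linalg.norm(predictions - mean_prediction, axis=1)
        accumulated += float(deviations.max())
        if accumulated > threshold:
            break  # truncate: drop this transition and all later ones
        transitions.append((state, action, mean_prediction))
        state = mean_prediction
    return transitions, accumulated


# Toy setup: two 1-D models whose disagreement grows with the state magnitude.
def toy_policy(state):
    return np.ones_like(state)


toy_ensemble = [
    lambda s, a: s + a,
    lambda s, a: s + a + 0.1 * np.abs(s),
]

kept, total_uncertainty = rollout_with_truncation(
    start_state=[0.0],
    policy=toy_policy,
    ensemble=toy_ensemble,
    horizon=10,
    threshold=0.2,
)
print(len(kept), total_uncertainty)
```

In this toy setup the rollout stops well before the full horizon because the two models drift apart as the state grows. With a very large threshold the same call keeps all `horizon` steps, which corresponds to an ordinary untruncated model rollout.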