Model-based offline reinforcement learning (MORL) aims to learn a policy by exploiting a dynamics model learned from an existing dataset. By applying conservative quantification to the dynamics model, most existing MORL works generate trajectories that approximate the real data distribution to facilitate policy learning, using only current information (e.g., the state and action at time step $t$). However, these works neglect the impact of historical information on the environmental dynamics, so the generated trajectories may be unreliable and misaligned with the real data distribution. In this paper, we propose a new MORL algorithm, the \textbf{R}eliability-guaranteed \textbf{T}ransformer (RT), which eliminates unreliable trajectories by calculating the cumulative reliability of each generated trajectory (i.e., a weighted variational distance from the real data distribution). Moreover, by sampling candidate actions with high rewards, RT can efficiently generate high-return trajectories from the existing offline data. We theoretically establish performance guarantees for RT in policy learning, and empirically demonstrate its effectiveness against state-of-the-art model-based methods on several benchmark tasks.
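As a rough illustration (not the paper's exact definition), the cumulative reliability of a generated trajectory $\tau = (s_0, a_0, \dots, s_{H-1}, a_{H-1})$ might take the form of a weighted sum of per-step total-variation distances between the learned dynamics $\widehat{T}$ and the true dynamics $T$; the weights $w_t$ and the thresholding rule below are assumptions for illustration only:
% Hypothetical sketch: cumulative reliability as a weighted sum of
% per-step total-variation distances between learned and true dynamics.
% The weights w_t and the aggregation are illustrative assumptions,
% not the definition given in the paper.
\[
  R(\tau) \;=\; \sum_{t=0}^{H-1} w_t \,
    D_{\mathrm{TV}}\!\big(\widehat{T}(\cdot \mid s_t, a_t),\;
                          T(\cdot \mid s_t, a_t)\big),
\]
% Under this sketch, a trajectory would be deemed unreliable (and
% eliminated) when R(\tau) exceeds a chosen threshold \epsilon.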