Offline reinforcement learning (RL) algorithms can learn policies that outperform the behavior policy by stitching together suboptimal trajectories to derive better ones. Meanwhile, Decision Transformer (DT) casts RL as sequence modeling and achieves competitive performance on offline RL benchmarks. However, recent studies show that DT lacks stitching capability, so equipping DT with this capability is vital to further improving its performance. To this end, we formulate trajectory stitching as expert matching and introduce ContextFormer, which integrates contextual-information-based imitation learning (IL) with sequence modeling to stitch suboptimal trajectory fragments by emulating the representations of a limited number of expert trajectories. We validate our approach from two perspectives: 1) We conduct extensive experiments on D4RL benchmarks under IL settings, and the results show that ContextFormer achieves competitive performance across multiple IL settings. 2) More importantly, we compare ContextFormer with various competitive DT variants trained on identical datasets. In this comparison, ContextFormer outperforms all other variants.