Recent work has shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results. The Decision Transformer (DT) combines the conditional policy approach with a transformer architecture and achieves competitive performance on several benchmarks. However, DT lacks stitching ability, one of the critical abilities that allows offline RL to learn the optimal policy from sub-optimal trajectories. This issue becomes particularly significant when the offline dataset contains only sub-optimal trajectories. Conventional RL approaches based on Dynamic Programming (such as Q-learning) do not have this limitation; however, they suffer from unstable learning behaviours, especially when they rely on function approximation in an off-policy learning setting. In this paper, we propose the Q-learning Decision Transformer (QDT), which addresses the shortcomings of DT by leveraging the benefits of Dynamic Programming (Q-learning). QDT uses the Dynamic Programming results to relabel the return-to-go in the training data and then trains the DT on the relabelled data. Our approach efficiently exploits the benefits of both approaches, with each compensating for the other's shortcomings, to achieve better performance. We demonstrate this empirically in both simple toy environments and the more complex D4RL benchmark, where QDT shows competitive performance gains.
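To make the relabelling idea concrete, the following is a minimal sketch on a hypothetical two-trajectory toy dataset. It is not the authors' implementation: the tabular Q-learning fit, the rule of taking the larger of the observed return-to-go and the learned state value, and all names (`fit_q_values`, `relabel_returns_to_go`, `TRAJ_A`, `TRAJ_B`) are assumptions made for illustration. Both trajectories pass through the same state, but only one takes the better action there, so the relabelled return-to-go of the weaker trajectory reflects a return that neither logged trajectory starting from `s0` actually achieved.

```python
from collections import defaultdict

# Each transition is (state, action, reward, next_state, done).
# Hypothetical toy data: both trajectories visit "s1", but only TRAJ_B
# takes the better action there.
TRAJ_A = [("s0", "go", 0.0, "s1", False), ("s1", "left", 1.0, "end", True)]
TRAJ_B = [("s3", "go", 0.0, "s1", False), ("s1", "right", 5.0, "end", True)]
DATASET = [TRAJ_A, TRAJ_B]


def fit_q_values(trajectories, gamma=1.0, sweeps=10):
    """Tabular Q-learning over logged transitions (assumed setup).

    The max is taken only over actions seen in the data for each state,
    and the learning rate is 1 because the toy dynamics are deterministic.
    """
    q = defaultdict(float)
    actions = defaultdict(set)
    for traj in trajectories:
        for s, a, _, _, _ in traj:
            actions[s].add(a)
    for _ in range(sweeps):
        for traj in trajectories:
            for s, a, r, s_next, done in traj:
                if done or not actions[s_next]:
                    target = r
                else:
                    target = r + gamma * max(q[(s_next, b)] for b in actions[s_next])
                q[(s, a)] = target
    return q, actions


def returns_to_go(traj, gamma=1.0):
    """Observed return-to-go at every step of a logged trajectory."""
    rtg, running = [0.0] * len(traj), 0.0
    for t in reversed(range(len(traj))):
        running = traj[t][2] + gamma * running
        rtg[t] = running
    return rtg


def relabel_returns_to_go(traj, q, actions, gamma=1.0):
    """Walk backwards; keep the larger of observed RTG and learned V(s)."""
    rtg, next_rtg = [0.0] * len(traj), 0.0
    for t in reversed(range(len(traj))):
        s, _, r, _, _ = traj[t]
        observed = r + gamma * next_rtg
        value = max((q[(s, b)] for b in actions[s]), default=observed)
        rtg[t] = max(observed, value)
        next_rtg = rtg[t]  # the relabelled value propagates to earlier steps
    return rtg


q_table, seen_actions = fit_q_values(DATASET)
original_a = returns_to_go(TRAJ_A)
relabelled_a = relabel_returns_to_go(TRAJ_A, q_table, seen_actions)
print(original_a, relabelled_a)
```

In this sketch the return-to-go of `TRAJ_A` rises from 1 to 5 at every step, because the value learned at `s1` comes from the better action logged in `TRAJ_B`. A conditional policy trained on the relabelled data would then see `s0` paired with a return it could only reach by stitching the two trajectories together.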