Recent work has demonstrated the effectiveness of formulating decision making as a supervised learning problem on offline-collected trajectories. However, the benefits of performing sequence modeling on trajectory data are not yet clear. In this work, we investigate whether sequence modeling can condense trajectories into useful representations that contribute to policy learning. To this end, we adopt a two-stage framework that first summarizes trajectories with sequence modeling techniques, and then uses these representations, together with a desired goal, to learn a policy. This design allows many existing supervised offline RL methods to be considered as specific instances of our framework. Within this framework, we introduce Goal-Conditioned Predictive Coding (GCPC), an approach that yields powerful trajectory representations and leads to performant policies. We conduct extensive empirical evaluations on AntMaze, FrankaKitchen and Locomotion environments, and observe that sequence modeling has a significant impact on some decision making tasks. In addition, we demonstrate that GCPC learns a goal-conditioned latent representation about the future, which serves as an "implicit planner" and enables competitive performance on all three benchmarks.