Recent work has shown that sequence modeling can be used effectively to train reinforcement learning (RL) policies. However, applying existing sequence models to planning, in which we wish to obtain a trajectory of actions that reaches some goal, is less straightforward. The typical autoregressive generation procedure of sequence models precludes sequential refinement of earlier steps, which limits the effectiveness of a predicted plan. In this paper, we propose an approach to integrating planning with sequence models based on iterative energy minimization, and illustrate how such a procedure leads to improved RL performance across different tasks. We train a masked language model to capture an implicit energy function over trajectories of actions, and formulate planning as finding a trajectory of actions with minimum energy. We illustrate how this procedure improves performance over recent approaches in BabyAI and Atari environments. We further demonstrate unique benefits of our iterative optimization procedure, including generalization to new tasks, adaptation to test-time constraints, and the ability to compose plans together. Project website: https://hychen-naza.github.io/projects/LEAP
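To make the idea of planning by iterative energy minimization concrete, the following is a minimal sketch and not the implementation described in this paper. It assumes a toy one-dimensional world with three actions, and replaces the trained masked language model with a hand-written conditional, `toy_masked_conditional`. That function, along with `energy`, `plan`, and all parameters, is a hypothetical name introduced only for illustration. The energy of a plan is taken here to be the sum of negative log-probabilities of each action given the others, and a plan is refined by repeatedly masking one step, resampling it, and keeping the change only if the energy does not increase.

```python
import math
import random

ACTIONS = (-1, 0, 1)  # toy action set: step left, stay, step right


def toy_masked_conditional(actions, i, start, goal):
    """Hypothetical stand-in for a trained masked model.

    Returns p(a_i | all other actions) over ACTIONS, favoring the action
    that brings the final position closest to the goal.
    """
    rest = start + sum(actions) - actions[i]  # final position without step i
    logits = [-abs(rest + a - goal) for a in ACTIONS]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def energy(actions, start, goal):
    """Assumed energy: sum of negative log-conditionals over all steps."""
    value = 0.0
    for i, a in enumerate(actions):
        probs = toy_masked_conditional(actions, i, start, goal)
        value -= math.log(probs[ACTIONS.index(a)])
    return value


def plan(start, goal, horizon, iterations=200, seed=0):
    """Refine a random plan by masking and resampling one step at a time."""
    rng = random.Random(seed)
    actions = [rng.choice(ACTIONS) for _ in range(horizon)]
    best = energy(actions, start, goal)
    for _ in range(iterations):
        i = rng.randrange(horizon)  # any step may be revised, early or late
        probs = toy_masked_conditional(actions, i, start, goal)
        proposal = list(actions)
        proposal[i] = rng.choices(ACTIONS, weights=probs)[0]
        candidate = energy(proposal, start, goal)
        if candidate <= best:  # accept only non-increasing energy
            actions, best = proposal, candidate
    return actions, best


if __name__ == "__main__":
    final_plan, final_energy = plan(start=0, goal=3, horizon=6)
    print(final_plan, round(final_energy, 3))
```

Unlike left-to-right autoregressive decoding, the loop can revisit any step of the plan at any iteration, which is the property the abstract attributes to iterative refinement. The acceptance rule used here is a simple greedy choice made for this sketch; other sampling schemes could be substituted.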