Decision Transformer (DT), which uses expressive sequence modeling to generate actions, has emerged as a promising approach to offline policy optimization. However, DT generates actions conditioned on a desired future return, an approach with known weaknesses, such as susceptibility to environmental stochasticity. To overcome these weaknesses, we propose augmenting DT with dynamic programming. Our method comprises three steps. First, we employ in-sample value iteration to obtain approximate value functions, which involves dynamic programming over the MDP structure. Second, we evaluate action quality in context using estimated advantages. We introduce two types of advantage estimators, IAE and GAE, which suit different tasks. Third, we train an Advantage-Conditioned Transformer (ACT) to generate actions conditioned on the estimated advantages. At test time, ACT generates actions conditioned on a desired advantage. Our evaluation results show that, by leveraging dynamic programming, ACT achieves effective trajectory stitching and robust action generation despite environmental stochasticity, outperforming baseline methods across various benchmarks. Additionally, we conduct an in-depth analysis of ACT's design choices through ablation studies. Our code is available at https://github.com/LAMDA-RL/ACT.
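To make the second step concrete, here is a minimal Python sketch of an advantage estimator over a single trajectory. It assumes that GAE refers to generalized advantage estimation in its standard form (a discounted sum of one-step TD errors) and that the value estimates come from the preceding value-iteration step. The abstract gives no formulas, so the function name `gae_advantages`, its parameters `gamma` and `lam`, and the input layout are all hypothetical and are not taken from the released code.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Illustrative GAE-style advantages for one trajectory.

    rewards: list of T rewards r_0 .. r_{T-1}
    values:  list of T+1 value estimates V(s_0) .. V(s_T); the last entry
             is the bootstrap value of the final state (0.0 if terminal)
    gamma, lam: discount and trace parameters (assumed defaults)
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the accumulated future TD errors.
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    values = [0.0, 0.0, 0.0, 0.0]
    # lam=1 accumulates all future TD errors; lam=0 keeps only the one-step error.
    print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))
    print(gae_advantages(rewards, values, gamma=1.0, lam=0.0))
```

In this sketch, setting `lam=0` reduces the estimate to the one-step TD error, while larger values of `lam` mix in longer-horizon information from the trajectory. Under the pipeline described above, advantages of this kind would label each action in the dataset, and the transformer would then be trained to generate actions conditioned on them.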