Decision Transformer (DT), which uses expressive sequence modeling techniques to generate actions, has emerged as a promising approach to offline policy optimization. However, DT generates actions conditioned on a desired future return, which has known weaknesses, such as susceptibility to environmental stochasticity. To overcome these weaknesses, we propose to empower DT with dynamic programming. Our method comprises three steps. First, we employ in-sample value iteration to obtain approximate value functions, which involves dynamic programming over the MDP structure. Second, we evaluate action quality in context using estimated advantages. We introduce two types of advantage estimators, IAE and GAE, which suit different tasks. Third, we train an Advantage-Conditioned Transformer (ACT) to generate actions conditioned on the estimated advantages. At test time, ACT generates actions conditioned on a desired advantage. Our evaluation results show that, by leveraging dynamic programming, ACT achieves effective trajectory stitching and robust action generation despite environmental stochasticity, outperforming baseline methods across various benchmarks. Additionally, we analyze ACT's design choices in depth through ablation studies.
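To make the second step concrete, the following is a minimal sketch of how advantages might be computed from already-learned value estimates. It rests on assumptions the abstract does not confirm: that GAE refers to the standard generalized advantage estimator, and that IAE is a one-step advantage of the form Q(s, a) − V(s). The function names, the `gamma` and `lam` parameters, and the input layout are all hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
from typing import List


def one_step_advantage(q_values: List[float], values: List[float]) -> List[float]:
    """Assumed IAE-style estimate: A(s_t, a_t) = Q(s_t, a_t) - V(s_t)."""
    return [q - v for q, v in zip(q_values, values)]


def generalized_advantage(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Assumed GAE-style estimate over one trajectory.

    `values` must hold len(rewards) + 1 entries: V(s_0), ..., V(s_T), where the
    last entry is the bootstrap value of the final state (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum from later steps.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Under these assumptions, the resulting per-step advantages would take the place of the return-to-go tokens that DT conditions on, and a target advantage would be supplied at test time.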