Decision Transformers have recently emerged as a compelling paradigm for offline Reinforcement Learning (RL), casting policy learning as autoregressive trajectory completion. While improvements have been made to overcome their initial shortcomings, online finetuning of Decision Transformers remains surprisingly under-explored. The widely adopted state-of-the-art Online Decision Transformer (ODT) still struggles when pretrained on low-reward offline data. In this paper, we theoretically analyze the online finetuning of the Decision Transformer and show that the commonly used Return-To-Go (RTG), when far from the true expected return, hampers the online finetuning process. The value function and advantage estimates of standard RL algorithms, however, address this problem well. As suggested by our analysis, our experiments show that simply adding TD3 gradients to the finetuning process of ODT effectively improves its online finetuning performance, especially when ODT is pretrained on low-reward offline data. These findings suggest new directions for further improving Decision Transformers.
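To make the proposed combination concrete, the sketch below shows one plausible way to mix ODT's action-prediction objective with a TD3-style policy-gradient term. The function name, module interfaces (`odt_policy`, `critic`), and the weight `td3_weight` are illustrative assumptions rather than the paper's actual implementation; ODT's entropy regularizer and TD3's twin critics and delayed updates are omitted for brevity.

```python
def odt_td3_finetune_loss(odt_policy, critic, states, actions, rtgs,
                          timesteps, td3_weight=0.1):
    """Hypothetical combined finetuning loss: ODT action prediction
    plus a TD3-style policy-gradient term (a sketch, not the paper's code)."""
    # ODT objective: maximize the log-likelihood of the dataset actions under
    # the stochastic policy, conditioned on returns-to-go, states, and timesteps.
    # Assumes a PyTorch-style policy returning a torch.distributions object.
    action_dist = odt_policy(states, actions, rtgs, timesteps)
    odt_loss = -action_dist.log_prob(actions).mean()

    # TD3-style policy gradient: push sampled actions toward higher Q-values,
    # so improvement no longer hinges on the RTG conditioning being accurate.
    sampled_actions = action_dist.rsample()  # reparameterized, keeps gradients
    td3_loss = -critic(states, sampled_actions).mean()

    return odt_loss + td3_weight * td3_loss
```

The key design choice, under these assumptions, is that the critic-driven term supplies a learning signal even when the RTG token used for conditioning is far from the return actually achievable, which is precisely the failure mode the analysis identifies for low-reward pretraining data.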