In-context learning is a promising approach for applying offline reinforcement learning (RL) to online tasks, and it can be achieved by providing task prompts. Recent works have demonstrated that in-context RL can emerge with self-improvement in a trial-and-error manner when RL tasks are treated as an across-episodic sequential prediction problem. Although this self-improvement requires no gradient updates, current methods still suffer from high computational costs because the across-episodic sequence grows with the task horizon. To this end, we propose the In-context Decision Transformer (IDT), which achieves self-improvement in a high-level trial-and-error manner. Specifically, IDT is inspired by the efficient hierarchical structure of human decision-making and therefore reconstructs the sequence from high-level decisions rather than from the low-level actions that interact with environments. Because one high-level decision can guide multiple steps of low-level actions, IDT naturally avoids excessively long sequences and solves online tasks more efficiently. Experimental results show that IDT achieves state-of-the-art performance on long-horizon tasks compared with current in-context RL methods. In particular, the online evaluation of IDT is \textbf{36$\times$} faster than that of baselines on the D4RL benchmark and \textbf{27$\times$} faster on the Grid World benchmark.
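To make the sequence-length argument concrete, the following is a minimal back-of-the-envelope sketch, not the IDT implementation. It assumes, purely for illustration, that each environment step contributes a fixed number of tokens to the across-episodic context, and that a high-level decision covers a fixed number of low-level steps. The function names, the token counts, and the parameter `steps_per_decision` are hypothetical and do not come from the paper.

```python
import math


def low_level_context_len(episodes: int, horizon: int,
                          tokens_per_step: int = 3) -> int:
    """Context length when every environment step adds tokens.

    tokens_per_step is an illustrative assumption, not a value from the paper.
    """
    return episodes * horizon * tokens_per_step


def high_level_context_len(episodes: int, horizon: int,
                           steps_per_decision: int,
                           tokens_per_decision: int = 3) -> int:
    """Context length when one high-level decision covers several steps.

    steps_per_decision and tokens_per_decision are hypothetical parameters.
    """
    decisions_per_episode = math.ceil(horizon / steps_per_decision)
    return episodes * decisions_per_episode * tokens_per_decision


if __name__ == "__main__":
    low = low_level_context_len(episodes=4, horizon=1000)
    high = high_level_context_len(episodes=4, horizon=1000,
                                  steps_per_decision=10)
    # The ratio of context lengths equals steps_per_decision here.
    print(low, high, low / high)
```

Under these assumptions the context shrinks by the number of steps each decision covers, and since standard self-attention scales quadratically with sequence length, the per-pass attention cost shrinks faster still. The sketch does not attempt to reproduce the reported 36$\times$ and 27$\times$ speedups, which depend on the actual architecture and benchmarks.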