Harnessing large offline datasets is vital for training foundation models that can generalize across diverse tasks. Offline Reinforcement Learning (RL) offers a powerful framework for such settings, enabling optimal policies to be derived even from suboptimal data. The Prompting Decision Transformer (PDT) is a multi-task offline RL model that distinguishes tasks through stochastic trajectory prompts: task-specific tokens kept in context during rollouts. However, PDT samples these tokens uniformly at random from per-task demonstration datasets, ignoring differences in how informative individual tokens are, which can degrade performance. To address this limitation, we introduce a scalable bandit-based prompt-tuning method that dynamically learns to construct high-performance trajectory prompts. Our approach significantly improves downstream task performance without modifying the pre-trained Transformer backbone. Empirical results on benchmark tasks and a newly designed multi-task environment demonstrate the effectiveness of our method, which bridges general multi-task offline pre-training and task-specific online adaptation.
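To make the mechanism concrete, the following is a minimal, hypothetical sketch of bandit-based prompt selection, not the paper's actual algorithm: a UCB1 bandit treats candidate trajectory-prompt segments as arms and uses episodic return as the reward, leaving the pre-trained Transformer frozen. All names here (`PromptBandit`, `rollout_with_prompt`, the segment count) are illustrative assumptions.

```python
import math
import random


class PromptBandit:
    """Hypothetical UCB1 bandit over candidate trajectory-prompt segments.

    Each arm is one candidate segment drawn from a task's demonstration
    dataset; the reward is the episodic return achieved when that segment
    is used as the in-context prompt for the frozen PDT backbone.
    """

    def __init__(self, num_segments: int, exploration: float = 2.0):
        self.counts = [0] * num_segments    # pulls per segment
        self.values = [0.0] * num_segments  # running mean return per segment
        self.exploration = exploration

    def select(self) -> int:
        # Pull each arm once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            value + math.sqrt(self.exploration * math.log(total) / count)
            for value, count in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, episode_return: float) -> None:
        # Incremental update of the mean observed return for this arm.
        self.counts[arm] += 1
        self.values[arm] += (episode_return - self.values[arm]) / self.counts[arm]


def rollout_with_prompt(segment_id: int) -> float:
    """Stand-in for a real environment rollout that conditions the frozen
    PDT on the chosen prompt segment; returns synthetic noisy rewards."""
    return random.gauss(float(segment_id % 5), 1.0)


if __name__ == "__main__":
    bandit = PromptBandit(num_segments=32)
    for _ in range(200):
        arm = bandit.select()
        bandit.update(arm, rollout_with_prompt(arm))
    best = max(range(len(bandit.values)), key=bandit.values.__getitem__)
    print(f"best segment: {best}, estimated return: {bandit.values[best]:.2f}")
```

The actual method presumably constructs prompts from multiple segments rather than a single arm, but the loop above captures the core idea the abstract describes: only the prompt choice is tuned online, with downstream return as the feedback signal, while the Transformer weights stay fixed.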