Harnessing large offline datasets is vital for training foundation models that can generalize across diverse tasks. Offline Reinforcement Learning (RL) offers a powerful framework for these scenarios, enabling the derivation of optimal policies even from suboptimal data. The Prompting Decision Transformer (PDT) is a multi-task offline RL model that distinguishes tasks through stochastic trajectory prompts, which are task-specific tokens maintained in context during rollouts. However, PDT samples these tokens uniformly at random from per-task demonstration datasets, failing to account for differences in token informativeness and potentially leading to performance degradation. To address this limitation, we introduce a scalable bandit-based prompt-tuning method that dynamically learns to construct high-performance trajectory prompts. Our approach significantly enhances downstream task performance without modifying the pre-trained Transformer backbone. Empirical results on benchmark tasks and a newly designed multi-task environment demonstrate the effectiveness of our method, creating a seamless bridge between general multi-task offline pre-training and task-specific online adaptation.
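To make the bandit-based prompt-selection idea concrete, here is a minimal sketch of how a multi-armed bandit could learn which candidate trajectory prompt yields the highest downstream return. This is an illustrative UCB1 formulation, not the paper's actual algorithm: the prompt pool, the reward signal (episode return after conditioning the frozen Transformer on the chosen prompt), and the `UCB1PromptBandit` class are all assumptions made for the example.

```python
import math
import random


class UCB1PromptBandit:
    """Hypothetical UCB1 bandit over a fixed pool of candidate
    trajectory prompts; each arm is one candidate prompt."""

    def __init__(self, num_prompts):
        self.counts = [0] * num_prompts   # times each prompt was tried
        self.values = [0.0] * num_prompts # running mean return per prompt
        self.total = 0                    # total number of pulls

    def select(self):
        # Try every prompt once before applying the UCB rule.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        # Pick the prompt maximizing mean return + exploration bonus.
        return max(
            range(len(self.counts)),
            key=lambda i: self.values[i]
            + math.sqrt(2 * math.log(self.total) / self.counts[i]),
        )

    def update(self, arm, reward):
        # Incremental update of the running mean for the chosen prompt.
        self.counts[arm] += 1
        self.total += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Toy usage: three candidate prompts whose (unknown) mean episode
# returns differ; the bandit should concentrate on the best one.
random.seed(0)
bandit = UCB1PromptBandit(3)
true_means = [0.2, 0.5, 0.8]  # simulated per-prompt mean returns
for _ in range(500):
    arm = bandit.select()
    # In the real setting this reward would be the rollout return of the
    # frozen PDT conditioned on the selected prompt.
    bandit.update(arm, random.gauss(true_means[arm], 0.1))
best = max(range(3), key=lambda i: bandit.counts[i])
```

In this simulated setting the bandit pulls the highest-return prompt far more often than the others, which is the behavior the method relies on: prompt quality is learned online without touching the pre-trained backbone.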