Developing a generalist agent is a longstanding objective in artificial intelligence. Previous efforts that train on extensive offline datasets spanning many tasks achieve remarkable performance in multi-task reinforcement learning (RL). However, these works struggle to extend their capabilities to new tasks. Recent approaches integrate textual guidance or visual trajectories into decision networks to provide task-specific contextual cues, which is a promising direction. However, we observe that relying solely on textual guidance or on visual trajectories is insufficient for accurately conveying a task's contextual information. This paper explores enhanced forms of task guidance that enable agents to comprehend gameplay instructions, thereby providing a "read-to-play" capability. Drawing inspiration from the success of multimodal instruction tuning in visual tasks, we treat the visual-based RL task as a long-horizon vision task and construct a set of multimodal game instructions to incorporate instruction tuning into a decision transformer. Experimental results demonstrate that incorporating multimodal game instructions significantly enhances the decision transformer's multitasking and generalization capabilities.