The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm to train AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the framework in three separate domains: Robotics, Gaming AI, and Healthcare. In each, the model generates meaningful and contextually relevant outputs. The strength of our approach lies in its generality: it leverages a variety of data sources, such as robotics sequences, gameplay data, large-scale video datasets, and textual information, for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.
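To make the idea of unifying the three pre-training strategies concrete, the following is a minimal sketch of one way such objectives could be combined into a single training loss. The abstract does not say how the objectives are weighted or combined, so the weighted sum, the function names (`cross_entropy`, `masked_mse`, `joint_loss`), and the `weights` parameter are all illustrative assumptions rather than the authors' implementation.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def masked_mse(pred, truth, mask):
    """Mean squared error over masked positions only (mask[i] is truthy)."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, truth, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0


def joint_loss(lang, action, visual, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three objectives (the weighting is an assumption).

    lang, action: non-empty lists of (logits, target_index) pairs, standing
        in for language-modeling and next-action prediction targets.
    visual: (pred, truth, mask) for masked-patch reconstruction, standing
        in for the visual masked auto-encoder objective.
    """
    l_lang = sum(cross_entropy(lg, t) for lg, t in lang) / len(lang)
    l_act = sum(cross_entropy(lg, t) for lg, t in action) / len(action)
    l_mae = masked_mse(*visual)
    w_lang, w_act, w_mae = weights
    return w_lang * l_lang + w_act * l_act + w_mae * l_mae


# Toy usage: one language token, one action token, three visual patches
# of which only the middle one is masked.
loss = joint_loss(
    lang=[([0.0, 0.0], 0)],
    action=[([0.0, 0.0], 1)],
    visual=([1.0, 2.0, 3.0], [1.0, 0.0, 3.0], [0, 1, 0]),
)
```

A single scalar loss of this kind would let one model be trained on all three signals at once. In practice the relative weights and the per-domain data mixture would need tuning.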