The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across many domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains: Robotics, Gaming AI, and Healthcare. In each area, the model generates meaningful and contextually relevant outputs. The strength of our approach lies in its generality: it leverages a variety of data sources, such as robotics sequences, gameplay data, large-scale video datasets, and textual information, for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.