Collaboration is a cornerstone of society. In the real world, human teammates draw on multi-sensory data to tackle challenging tasks in ever-changing environments. Embodied agents collaborating in visually rich environments with dynamic interactions must likewise understand multi-modal observations and task specifications. To evaluate the performance of generalizable multi-modal collaborative agents, we present TeamCraft, a multi-modal multi-agent benchmark built on top of the open-world video game Minecraft. The benchmark features 55,000 task variants specified by multi-modal prompts, procedurally generated expert demonstrations for imitation learning, and carefully designed protocols to evaluate model generalization capabilities. We also perform extensive analyses to better understand the limitations and strengths of existing approaches. Our results indicate that existing models continue to face significant challenges in generalizing to novel goals, scenes, and unseen numbers of agents. These findings underscore the need for further research in this area. The TeamCraft platform and dataset are publicly available at https://github.com/teamcraft-bench/teamcraft.