Vision-Language Models (VLMs) hold promise for tasks requiring extensive collaboration, and traditional multi-agent simulators have facilitated rich explorations of interactive artificial societies that reflect collective behavior. However, existing simulators face significant limitations. First, they struggle to handle large numbers of agents because of high resource demands. Second, they often assume that agents possess perfect information and limitless capabilities, which undermines the ecological validity of simulated social interactions. To bridge this gap, we propose MineLand, a multi-agent Minecraft simulator that introduces three key features: large-scale scalability, limited multimodal senses, and physical needs. Our simulator supports 64 or more agents. Agents have limited visual, auditory, and environmental awareness, which forces them to actively communicate and collaborate to fulfill physical needs such as food and resources. Additionally, we introduce Alex, an AI agent framework inspired by multitasking theory that enables agents to handle intricate coordination and scheduling. Our experiments demonstrate that the simulator, the corresponding benchmark, and the AI agent framework contribute to more ecological and nuanced collective behavior. The source code of MineLand and Alex is openly available at https://github.com/cocacola-lab/MineLand.