Recent advances in foundation models have shown promising results toward generalist robots that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has focused mainly on indoor, household scenarios. In this work, we present SimWorld-Robotics~(SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination amid pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capabilities in realistic scenarios, including (1) multimodal instruction grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation among people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking the robust perception, reasoning, and planning abilities necessary for urban environments.