Vision-language-action (VLA) models have advanced robotic manipulation but remain constrained by their reliance on large, teleoperation-collected datasets dominated by static tabletop scenes. We propose a simulation-first framework to verify VLA architectures before real-world deployment and introduce MobileManiBench, a large-scale benchmark for mobile robotic manipulation. Built on NVIDIA Isaac Sim and powered by reinforcement learning, our pipeline autonomously generates diverse manipulation trajectories with rich annotations (language instructions, multi-view RGB-depth-segmentation images, and synchronized object/robot states and actions). MobileManiBench features 2 mobile platforms (parallel-gripper and dexterous-hand robots), 2 synchronized cameras (head and right wrist), 630 objects in 20 categories, and 5 skills (open, close, pull, push, pick) spanning over 100 tasks performed in 100 realistic scenes, yielding 300K trajectories. This design enables controlled, scalable studies of robot embodiments, sensing modalities, and policy architectures, accelerating research on data efficiency and generalization. We benchmark representative VLA models and report insights into perception, reasoning, and control in complex simulated environments.