Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives that mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, both open-source and closed-source, along with versions enhanced by chain-of-thought (CoT)-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.