Multimodal Large Language Models (MLLMs) excel at using digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite its importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning which tools to select and the order in which to use them, based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of the tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of the functional commonsense needed to map perceived tools onto task semantics. This pinpoints a critical bottleneck for the development of practical embodied AI.
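To make the two evaluation dimensions concrete, the following is a minimal sketch of how a recognition score and an end-to-end completion rate could be computed. The abstract does not specify the scoring formulas, so everything here is an assumption made for illustration: the `QueryResult` schema, pooling recognition over all queries, and exact-match comparison of plans are hypothetical choices, not the benchmark's actual protocol.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QueryResult:
    """One benchmark query: annotations vs. model output (hypothetical schema)."""
    scene_tools: set[str]      # all tools annotated in the scene
    predicted_tools: set[str]  # tools the model reported seeing
    gold_plan: list[str]       # reference tool-use sequence
    predicted_plan: list[str]  # tool-use sequence proposed by the model


def recognition_recall(results: list[QueryResult]) -> float:
    """Fraction of annotated tools the model identified, pooled over all queries."""
    found = sum(len(r.scene_tools & r.predicted_tools) for r in results)
    total = sum(len(r.scene_tools) for r in results)
    return found / total if total else 0.0


def end_to_end_rate(results: list[QueryResult]) -> float:
    """Fraction of queries whose predicted plan matches the reference exactly.

    Exact match is an assumed criterion; a real benchmark may accept
    several valid plans per query.
    """
    if not results:
        return 0.0
    return sum(r.predicted_plan == r.gold_plan for r in results) / len(results)


# Toy data, invented purely for illustration.
results = [
    QueryResult(
        scene_tools={"hammer", "wrench", "pliers"},
        predicted_tools={"hammer", "wrench"},
        gold_plan=["wrench"],
        predicted_plan=["wrench"],
    ),
    QueryResult(
        scene_tools={"scalpel"},
        predicted_tools={"scalpel", "forceps"},
        gold_plan=["scalpel"],
        predicted_plan=["forceps"],
    ),
]

recall = recognition_recall(results)  # 3 of 4 annotated tools found
e2e = end_to_end_rate(results)        # 1 of 2 plans matched
```

Under these assumptions, a model can score well on recognition while still failing end-to-end, which mirrors the gap between the two reported figures (58.7% versus 21.0%).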