Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatio-temporal coordination required for bimanual tasks like lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark evaluating MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates unique bimanual challenges, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that despite high-level reasoning proficiency, MLLMs struggle with dual-arm spatial grounding and control, frequently resulting in mutual interference and sequencing errors. These findings suggest the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research to focus on inter-arm collision-avoidance and fine-grained temporal sequencing.