Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain which actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive, physics-driven 3D testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that even top-performing models struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.