Implicit spatial relations and the deep semantic structures encoded in object attributes are crucial for procedural planning in embodied AI systems. However, existing approaches often over-rely on the reasoning capabilities of vision-language models (VLMs) themselves while overlooking the rich structured semantics that can be mined from multimodal inputs; as a result, these models struggle to understand functional spatial relationships in complex scenes. To fully exploit the implicit spatial relations and deep semantic structures in multimodal data, we propose GaLa, a vision-language framework for multimodal procedural planning. GaLa introduces a hypergraph-based representation in which object instances in the image are modeled as nodes and region-level hyperedges are constructed by aggregating objects according to their attributes and functional semantics. This design explicitly captures both the implicit semantic relations among objects and the hierarchical organization of functional regions. Furthermore, we design a TriView HyperGraph Encoder that enforces semantic consistency across the node view, the area view, and the node-area association view via contrastive learning, allowing hypergraph semantics to be injected more effectively into downstream VLM reasoning. Extensive experiments on the ActPlan1K and ALFRED benchmarks demonstrate that GaLa significantly outperforms existing methods in execution success rate, longest common subsequence (LCS), and planning correctness.
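To make the two core ideas concrete, the sketch below illustrates (1) a hypergraph over detected objects, where region-level hyperedges group objects by shared functional semantics, and (2) an InfoNCE-style tri-view contrastive consistency loss over node, area, and node-area association embeddings. This is a minimal illustration under stated assumptions, not the authors' implementation: the `FUNCTIONAL_REGIONS` grouping, the embedding dimension, and the temperature `tau` are all hypothetical placeholders.

```python
# Minimal sketch of GaLa's two core ideas (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

# --- (1) Hypergraph construction ------------------------------------------
# Hypothetical attribute -> functional-region grouping; in GaLa, hyperedges
# would be derived from object attributes and functional semantics.
FUNCTIONAL_REGIONS = {
    "cooking":  ["stove", "pan", "spatula"],
    "cleaning": ["sink", "sponge", "soap"],
}

def build_incidence(objects: list[str], regions: dict[str, list[str]]) -> torch.Tensor:
    """Binary incidence matrix H of shape (num_nodes, num_hyperedges):
    H[i, j] = 1 iff object i belongs to functional region (hyperedge) j."""
    H = torch.zeros(len(objects), len(regions))
    for j, members in enumerate(regions.values()):
        for i, obj in enumerate(objects):
            if obj in members:
                H[i, j] = 1.0
    return H

# --- (2) Tri-view contrastive consistency ----------------------------------
def triview_contrastive_loss(z_node: torch.Tensor,
                             z_area: torch.Tensor,
                             z_assoc: torch.Tensor,
                             tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling the three views of the same scene together
    and pushing apart views of different scenes in the batch.
    Each z_* has shape (batch, dim); row b of every view describes scene b."""
    def nce(a, b):
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / tau           # (batch, batch) cosine similarities
        targets = torch.arange(a.size(0))  # diagonal entries are the positives
        return F.cross_entropy(logits, targets)
    # symmetric pairwise consistency across the three views
    return (nce(z_node, z_area) + nce(z_area, z_assoc) + nce(z_node, z_assoc)) / 3

# Usage sketch: one 4-object scene and a batch of 4 scenes with 256-d views.
objects = ["stove", "pan", "sink", "sponge"]
H = build_incidence(objects, FUNCTIONAL_REGIONS)   # (4 nodes, 2 hyperedges)
z_n, z_a, z_s = (torch.randn(4, 256) for _ in range(3))
loss = triview_contrastive_loss(z_n, z_a, z_s)
```

The incidence matrix is the standard hypergraph encoding a hypergraph encoder would consume, and the symmetric pairwise InfoNCE terms are one common way to realize the cross-view semantic consistency the abstract describes.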