V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks

Learning long-horizon embodied behaviors from synthetic data remains challenging because generated scenes are often physically implausible, language-driven programs frequently "succeed" without satisfying task semantics, and high-level instructions require grounding into executable action sequences. To address these limitations, we introduce V-CAGE, a closed-loop framework for generating robust, semantically aligned manipulation datasets at scale. First, we propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. By dynamically maintaining a map of prohibited spatial areas as objects are placed, our system prevents interpenetration and ensures reachable, conflict-free configurations in cluttered environments. Second, to bridge the gap between abstract intent and low-level control, we employ a hierarchical instruction decomposition module. This decomposes high-level goals (e.g., "get ready for work") into compositional action primitives, facilitating coherent long-horizon planning. Crucially, we enforce semantic correctness through a VLM-based verification loop. Acting as a visual critic, the VLM performs rigorous rejection sampling after each subtask, filtering out "silent failures" where code executes but fails to achieve the visual goal. Experiments demonstrate that V-CAGE yields datasets with superior physical and semantic fidelity, significantly boosting the success rate and generalization of downstream policies compared to non-verified baselines.
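
As a rough illustration of the two filtering steps described above, the sketch below shows (a) object placement that consults a growing set of prohibited regions and (b) a per-subtask verification loop that rejects an episode when a critic does not confirm the goal. It is a minimal sketch under stated assumptions, not the V-CAGE implementation: the 2D axis-aligned footprints, the names `Rect`, `place_objects`, and `run_with_verification`, the retry budgets, and the `execute`/`verify` callables (the latter standing in for a VLM critic) are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned 2D footprint (hypothetical stand-in for object geometry)."""
    x: float
    y: float
    w: float
    h: float

    def overlaps(self, other: "Rect") -> bool:
        # True when the two rectangles share interior area.
        return (self.x < other.x + other.w and other.x < self.x + self.w
                and self.y < other.y + other.h and other.y < self.y + self.h)


def place_objects(sizes: Sequence[Tuple[float, float]], table: Rect,
                  rng: random.Random, max_tries: int = 200) -> Optional[List[Rect]]:
    """Place objects one by one, treating every placed footprint as prohibited."""
    prohibited: List[Rect] = []  # grows as objects are placed
    for w, h in sizes:
        for _ in range(max_tries):
            cand = Rect(rng.uniform(table.x, table.x + table.w - w),
                        rng.uniform(table.y, table.y + table.h - h), w, h)
            if not any(cand.overlaps(p) for p in prohibited):
                prohibited.append(cand)  # accept and mark the area as occupied
                break
        else:
            return None  # no conflict-free pose found: reject the whole scene
    return prohibited


def run_with_verification(subtasks: Sequence[str],
                          execute: Callable[[str], Any],
                          verify: Callable[[str, Any], bool],
                          max_retries: int = 3) -> bool:
    """Run each subtask and keep the episode only if the critic accepts every one."""
    for sub in subtasks:
        for _ in range(max_retries):
            observation = execute(sub)    # assumed to return e.g. a rendered image
            if verify(sub, observation):  # assumed VLM-style yes/no judgment
                break
        else:
            return False  # silent failure persisted: discard the episode
    return True
```

In this sketch a scene or episode is discarded outright once its retry budget is exhausted; whether the actual system retries, resamples, or discards at that point is not specified in the abstract.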
