Simulation to Rules: A Dual-VLM Framework for Formal Visual Planning

Vision Language Models (VLMs) show strong potential for visual planning but struggle with precise spatial and long-horizon reasoning, while Planning Domain Definition Language (PDDL) planners excel at formal long-horizon planning but cannot interpret visual inputs. Recent work combines these complementary strengths by translating visual problems into PDDL. However, while VLMs can generate PDDL problem files satisfactorily, accurately generating PDDL domain files, which encode the planning rules, remains challenging and typically requires human expertise or environment interaction. We propose VLMFP, a Dual-VLM-guided framework that autonomously generates both PDDL problem and domain files for formal visual planning. VLMFP combines a SimVLM, which simulates action consequences, with a GenVLM, which generates and iteratively refines PDDL files by aligning symbolic execution with the simulated outcomes. This design enables multiple levels of generalization across unseen instances, visual appearances, and game rules. We evaluate VLMFP on 6 grid-world domains and demonstrate its generalization capability. On average, SimVLM achieves 87.3% and 86.0% on scenario understanding and action simulation for seen and unseen appearances, respectively. With the guidance of SimVLM, VLMFP attains 70.0% and 54.1% planning success on unseen instances in seen and unseen appearances, respectively. We further demonstrate that VLMFP scales to complex long-horizon 3D planning tasks, including multi-robot collaboration and assembly scenarios with partial observability and diverse visual variations. Project page: https://sites.google.com/view/vlmfp.
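To make the idea of "aligning symbolic execution with simulated outcomes" concrete, the following is a minimal toy sketch, not the VLMFP implementation. It assumes a hypothetical 3x3 grid with one blocked cell, uses a hand-written function as a stand-in for SimVLM, and uses two hand-written step functions as stand-ins for executing two candidate domain files. All names (`simulated_outcome`, `make_symbolic_step`, `agreement`, `WALLS`, `SIZE`) are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]  # agent position as (row, col) on a toy grid

MOVES: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1),
}
WALLS = {(0, 1)}  # hypothetical blocked cell
SIZE = 3          # hypothetical 3x3 grid


def _in_bounds(cell: State) -> bool:
    return 0 <= cell[0] < SIZE and 0 <= cell[1] < SIZE


def simulated_outcome(state: State, action: str) -> State:
    """Stand-in for SimVLM: the next state the simulator predicts."""
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    return nxt if _in_bounds(nxt) and nxt not in WALLS else state


def make_symbolic_step(respect_walls: bool) -> Callable[[State, str], State]:
    """Stand-in for executing a candidate domain file.

    respect_walls=False models a draft domain that forgot the wall
    precondition; respect_walls=True models a refined domain.
    """
    def step(state: State, action: str) -> State:
        dr, dc = MOVES[action]
        nxt = (state[0] + dr, state[1] + dc)
        blocked = respect_walls and nxt in WALLS
        return nxt if _in_bounds(nxt) and not blocked else state
    return step


def agreement(step: Callable[[State, str], State],
              start: State, actions: List[str]) -> float:
    """Fraction of steps where symbolic execution matches the simulation."""
    state, matches = start, 0
    for action in actions:
        expected = simulated_outcome(state, action)
        if step(state, action) == expected:
            matches += 1
        state = expected  # follow the simulated trajectory
    return matches / len(actions)


if __name__ == "__main__":
    walk = ["right", "down", "up", "right"]
    print("draft domain  :", agreement(make_symbolic_step(False), (0, 0), walk))
    print("refined domain:", agreement(make_symbolic_step(True), (0, 0), walk))
```

In this sketch, a low agreement score is the signal that a candidate domain is missing a rule, which is the kind of feedback a refinement loop could act on. How GenVLM actually revises the PDDL files is not specified here.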
