Integrating VLM reasoning with symbolic planning has proven a promising approach to real-world robot task planning. Existing work such as UniDomain learns symbolic manipulation domains, described in the Planning Domain Definition Language (PDDL), from real-world demonstrations and has successfully applied them to real-world tasks. These domains, however, are restricted to tabletop manipulation. We propose UniPlan, a vision-language task planning system for long-horizon mobile manipulation in large-scale indoor environments that unifies scene topology, visuals, and robot capabilities into a holistic PDDL representation. UniPlan programmatically extends the learned tabletop domains from UniDomain to support navigation, door traversal, and bimanual coordination. It operates on a visual-topological map comprising navigation landmarks anchored with scene images. Given a language instruction, UniPlan retrieves task-relevant nodes from the map and uses a VLM to ground the anchored images into task-relevant objects and their PDDL states; next, it reconnects these nodes into a compressed, densely connected topological map, also represented in PDDL, with connectivity and costs derived from the original map; finally, a mobile-manipulation plan is generated using off-the-shelf PDDL solvers. Evaluated on human-posed tasks in a large-scale map with real-world imagery, UniPlan significantly outperforms VLM-based and LLM+PDDL planning baselines in success rate, plan quality, and computational efficiency.
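The map-compression step above can be sketched as follows. This is a minimal illustration, not UniPlan's actual implementation: the node names, the example graph, and the PDDL predicate/function names (`connected`, `move-cost`) are assumptions for the sketch. The idea is to densely reconnect the retrieved task-relevant nodes, inheriting each pairwise cost from the shortest path in the original visual-topological map.

```python
import heapq
from itertools import combinations


def shortest_path_costs(edges, sources):
    """Dijkstra from each source over the full (undirected) topological map."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))
    costs = {}
    for s in sources:
        dist = {s: 0.0}
        pq = [(0.0, s)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue  # stale queue entry
            for v, w in graph.get(u, []):
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(pq, (nd, v))
        costs[s] = dist
    return costs


def compress_to_pddl(edges, task_nodes):
    """Emit PDDL facts densely connecting the task-relevant nodes, with
    costs derived from shortest paths in the original map (hypothetical
    predicate names)."""
    costs = shortest_path_costs(edges, task_nodes)
    facts = []
    for a, b in combinations(sorted(task_nodes), 2):
        c = costs[a].get(b)
        if c is None:
            continue  # pair unreachable in the original map
        facts.append(f"(connected {a} {b})")
        facts.append(f"(= (move-cost {a} {b}) {c:.0f})")
    return "\n".join(facts)


# Toy map: office and pantry are only linked through intermediate landmarks.
edges = [("hall", "kitchen", 3), ("hall", "office", 4), ("kitchen", "pantry", 2)]
print(compress_to_pddl(edges, {"office", "pantry"}))
```

On this toy map the compressed graph connects `office` and `pantry` directly with cost 9 (the 4 + 3 + 2 shortest path through `hall` and `kitchen`), so the downstream PDDL solver can plan over only the task-relevant nodes.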