Recent advancements in vision-language models (VLMs) offer potential for robot task planning, but challenges remain due to VLMs' tendency to generate incorrect action sequences. To address these limitations, we propose VeriGraph, a novel framework that integrates VLMs for robotic planning while verifying action feasibility. VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement. The system generates a scene graph from input images and uses it to iteratively check and correct action sequences generated by an LLM-based task planner, ensuring constraints are respected and actions are executable. Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.
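The verify-and-refine loop described above can be sketched in miniature. Everything below is an illustrative assumption, not the paper's actual API: the scene graph is reduced to a dictionary mapping each object to its supporting surface, feasibility checking is a single "nothing is stacked on the object" constraint, and refinement is approximated by deferring infeasible actions rather than re-querying an LLM planner.

```python
# Minimal sketch of a VeriGraph-style verify-and-refine loop.
# All names and data structures here are illustrative assumptions.

def verify(scene_graph, action):
    """Check an action (obj, dest) against the scene graph.

    Assumed constraints: the object must exist, and nothing may be
    resting on top of it (i.e., it must be free to move).
    """
    obj, _dest = action
    if obj not in scene_graph:
        return False
    return all(support != obj for support in scene_graph.values())

def apply_action(scene_graph, action):
    """Return a new scene graph with the object moved to its destination."""
    obj, dest = action
    updated = dict(scene_graph)
    updated[obj] = dest
    return updated

def plan_with_verification(scene_graph, actions):
    """Execute only feasible actions, deferring infeasible ones.

    Deferral is a simple stand-in for the iterative LLM-based
    refinement described in the abstract.
    """
    executed, pending = [], list(actions)
    progress = True
    while pending and progress:
        progress = False
        for action in list(pending):
            if verify(scene_graph, action):
                scene_graph = apply_action(scene_graph, action)
                executed.append(action)
                pending.remove(action)
                progress = True
    return executed, scene_graph
```

For example, with `{"cup": "table", "book": "cup"}` and the plan `[("cup", "shelf"), ("book", "table")]`, moving the cup first is rejected because the book rests on it; the loop executes the book move, then retries and completes the cup move, yielding a corrected, executable ordering.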