Classical planning systems have made great advances in using rule-based human knowledge to compute accurate plans for service robots, but they are challenged by their strong assumptions of perfect perception and perfect action execution. One way to address these challenges is to connect the symbolic states and actions generated by classical planners to the robot's sensory observations, thus closing the perception-action loop. This research proposes a visually grounded planning framework, named TPVQA, which leverages Vision-Language Models (VLMs) to detect action failures and verify action affordances in order to enable successful plan execution. Quantitative experiments show that TPVQA achieves a higher task completion rate than competitive baselines from previous studies.
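To make the idea of closing the perception-action loop concrete, the following is a minimal sketch of how a symbolic plan might be executed with visual checks before and after each action. It is not TPVQA's actual implementation: the `Action` type, the `execute_plan` function, the yes/no question format, and the retry-then-abort policy are all assumptions introduced here for illustration, and the VLM is modeled simply as a callable that answers a yes/no question about the current camera image.

```python
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical action representation; the abstract does not specify one.
@dataclass
class Action:
    name: str
    precondition_question: str  # affordance check, e.g. "Is the cup reachable?"
    effect_question: str        # failure check, e.g. "Is the cup in the gripper?"


# Assumption: the VLM is wrapped as a function that answers a yes/no
# question about the robot's current observation.
VLMQuery = Callable[[str], bool]


def execute_plan(
    plan: List[Action],
    ask_vlm: VLMQuery,
    act: Callable[[Action], None],
    max_retries: int = 1,
) -> bool:
    """Execute a symbolic plan, grounding each step in visual observations.

    Returns True if every action was afforded and visually confirmed,
    False otherwise. A fuller system would replan instead of aborting.
    """
    for action in plan:
        # Verify the action affordance before acting.
        if not ask_vlm(action.precondition_question):
            return False
        # Act, then check the expected effect to detect failures.
        for _ in range(max_retries + 1):
            act(action)
            if ask_vlm(action.effect_question):
                break  # effect confirmed, move on to the next action
        else:
            return False  # effect never observed within the retry budget
    return True
```

In this sketch, a failed affordance check or an exhausted retry budget simply ends execution; replacing those `return False` branches with a call back into the planner is the natural extension, and the choice is left open here because the abstract does not describe the recovery strategy.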