Classical planning systems have made great advances in using rule-based human knowledge to compute accurate plans for service robots, but they are limited by their strong assumptions of perfect perception and action execution. One way to address this limitation is to connect the symbolic states and actions generated by classical planners to the robot's sensory observations, thus closing the perception-action loop. This research proposes a visually grounded planning framework, named TPVQA, which leverages Vision-Language Models (VLMs) to detect action failures and verify action affordances, with the aim of enabling successful plan execution. Quantitative experiments show that TPVQA achieves a higher task completion rate than competitive baselines from previous studies.
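To make the closed perception-action loop concrete, the following is a minimal sketch of how a plan executor might query a VLM before and after each symbolic action. It is an illustration under stated assumptions, not the TPVQA implementation: the abstract does not specify an interface, so `Action`, `execute_plan`, `ask_vlm`, the yes/no question strings, and the retry policy are all hypothetical. The VLM is abstracted as a callable that answers a yes/no question about the current camera image.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical VQA interface (assumption): takes a yes/no question about the
# current camera image and returns True for "yes". A real system would call a VLM.
AskVLM = Callable[[str], bool]


@dataclass
class Action:
    """A symbolic action paired with visual questions (hypothetical structure)."""
    name: str
    precondition_question: str  # e.g. "Is the cup on the table?"
    effect_question: str        # e.g. "Is the robot holding the cup?"


def execute_plan(
    plan: List[Action],
    ask_vlm: AskVLM,
    execute: Callable[[Action], None],
    max_retries: int = 1,
) -> List[Tuple[str, str]]:
    """Run a symbolic plan step by step, checking each step visually.

    Returns a log of (action name, outcome) pairs and stops at the first
    action that is not afforded or that still fails after all retries.
    """
    log: List[Tuple[str, str]] = []
    for action in plan:
        # Affordance verification: is the action feasible in the observed scene?
        if not ask_vlm(action.precondition_question):
            log.append((action.name, "not_afforded"))
            return log  # a full system would trigger replanning here

        for _ in range(max_retries + 1):
            execute(action)
            # Failure detection: did the expected effect actually occur?
            if ask_vlm(action.effect_question):
                log.append((action.name, "success"))
                break
        else:
            # The loop ended without a break, so every attempt failed.
            log.append((action.name, "failed"))
            return log
    return log
```

In this sketch, the precondition question plays the role of affordance verification and the effect question plays the role of failure detection; what happens after a detected failure (retrying, as here, or replanning) is a design choice that the abstract leaves open.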