Large language models (LLMs) are accelerating the development of language-guided robot planners. Meanwhile, symbolic planners offer the advantage of interpretability. This paper proposes a new task that bridges these two trends: multimodal planning problem specification. The aim is to generate a problem description (PD), a machine-readable file that a symbolic planner uses to find a plan. By generating PDs from a language instruction and a scene observation, we can drive symbolic planners in a language-guided framework. We propose the Vision-Language Interpreter (ViLaIn), a new framework that generates PDs using state-of-the-art LLMs and vision-language models. ViLaIn can refine generated PDs via error message feedback from the symbolic planner. Our aim is to answer the question: How accurately can ViLaIn and the symbolic planner generate valid robot plans? To evaluate ViLaIn, we introduce a novel dataset called the problem description generation (ProDG) dataset. The framework is evaluated with four new evaluation metrics. Experimental results show that ViLaIn can generate syntactically correct problems with more than 99% accuracy and valid plans with more than 58% accuracy.
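The refinement step, in which planner error messages are fed back to regenerate the PD, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `refine_until_valid`, `PlannerResult`, the `max_rounds` limit, and the two toy stand-ins for the generator and the planner are hypothetical, and the toy PD string is not a real problem description format.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class PlannerResult:
    plan: Optional[List[str]]  # None when the planner fails
    error: Optional[str]       # planner error message, if any


def refine_until_valid(
    generate_pd: Callable[[str, str, Optional[str]], str],
    run_planner: Callable[[str], PlannerResult],
    instruction: str,
    scene: str,
    max_rounds: int = 3,  # hypothetical limit, not from the paper
) -> Tuple[str, PlannerResult]:
    """Generate a PD, then regenerate it using planner error feedback."""
    feedback: Optional[str] = None
    pd = ""
    result = PlannerResult(plan=None, error="planner not run")
    for _ in range(max_rounds):
        pd = generate_pd(instruction, scene, feedback)
        result = run_planner(pd)
        if result.plan is not None:
            break  # the planner found a plan, so stop refining
        feedback = result.error  # pass the error message to the next round
    return pd, result


def toy_generate_pd(instruction: str, scene: str, feedback: Optional[str]) -> str:
    # Hypothetical stand-in for the LLM / vision-language models:
    # it omits the closing parenthesis until it receives feedback.
    text = f"(problem (goal {instruction}) (init {scene})"
    return text + ")" if feedback else text


def toy_run_planner(pd: str) -> PlannerResult:
    # Hypothetical stand-in for a symbolic planner: it only checks
    # that parentheses are balanced and returns a fixed plan.
    if pd.count("(") != pd.count(")"):
        return PlannerResult(plan=None, error="unbalanced parentheses")
    return PlannerResult(plan=["pick", "place"], error=None)


if __name__ == "__main__":
    pd, result = refine_until_valid(
        toy_generate_pd, toy_run_planner, "cup-on-table", "cup-on-shelf"
    )
    print(pd)
    print(result.plan)
```

Passing the generator and planner in as callables keeps the loop independent of any particular model or planner, so the same control flow applies whichever components are plugged in.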