Large language models (LLMs) are accelerating the development of language-guided robot planners. Meanwhile, symbolic planners offer the advantage of interpretability. This paper proposes a new task that bridges these two trends, namely, multimodal planning problem specification. The aim is to generate a problem description (PD), a machine-readable file that planners use to find a plan. By generating PDs from a language instruction and a scene observation, we can drive symbolic planners in a language-guided framework. We propose a Vision-Language Interpreter (ViLaIn), a new framework that generates PDs using state-of-the-art LLMs and vision-language models. ViLaIn can refine generated PDs via error message feedback from the symbolic planner. Our aim is to answer the question: How accurately can ViLaIn and the symbolic planner generate valid robot plans? To evaluate ViLaIn, we introduce a novel dataset called the problem description generation (ProDG) dataset. The framework is evaluated with four new evaluation metrics. Experimental results show that ViLaIn can generate syntactically correct problems with more than 99% accuracy and valid plans with more than 58% accuracy. Our code and dataset are available at https://github.com/omron-sinicx/ViLaIn.
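The refinement step described in the abstract, in which a generated PD is revised using error messages from the symbolic planner, can be pictured as a simple retry loop. The following is a minimal sketch under stated assumptions, not the ViLaIn implementation: `refine_pd`, `toy_generator`, `toy_planner`, and `max_rounds` are hypothetical names, the scene observation is reduced to a text string, and the PD is assumed to be a parenthesized, PDDL-like string. The toy generator and planner only stand in for the LLM/vision-language models and for a real planner.

```python
from typing import Callable, List, Optional, Tuple

# (plan, error): exactly one of the two is expected to be None.
PlannerOutput = Tuple[Optional[List[str]], Optional[str]]


def refine_pd(
    instruction: str,
    observation: str,
    generate_pd: Callable[[str, str, Optional[str]], str],
    run_planner: Callable[[str], PlannerOutput],
    max_rounds: int = 3,
) -> Tuple[str, Optional[List[str]]]:
    """Regenerate a PD until the planner accepts it or the budget runs out."""
    feedback: Optional[str] = None
    pd = ""
    for _ in range(max_rounds):
        pd = generate_pd(instruction, observation, feedback)
        plan, error = run_planner(pd)
        if error is None:
            return pd, plan
        feedback = error  # pass the planner's error message to the next attempt
    return pd, None


# Toy stand-ins, for illustration only (hypothetical, not real models or planners).
def toy_generator(instruction: str, observation: str, feedback: Optional[str]) -> str:
    body = "(define (problem demo) (:goal (on a b))"
    # The first attempt omits the final parenthesis; the retry adds it.
    return body if feedback is None else body + ")"


def toy_planner(pd: str) -> PlannerOutput:
    if pd.count("(") != pd.count(")"):
        return None, "syntax error: unbalanced parentheses"
    return ["(stack a b)"], None


pd, plan = refine_pd("put a on b", "a and b are on the table", toy_generator, toy_planner)
print(plan)
```

Passing the generator and planner as callables keeps the loop independent of any particular model or planner; a bounded `max_rounds` is one way to guarantee termination when the feedback never leads to a valid PD.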