Pre-trained visual-language models (VLMs) such as GPT-4V have attracted increasing attention in robotic task planning because of their strong scene understanding and reasoning capabilities. Compared with traditional task planning strategies, VLMs are strong at parsing multimodal information and generating code, and they plan efficiently. Despite this potential, VLMs suffer from hallucination, semantic complexity, and limited context. To address these issues, this paper proposes GameVLM, a multi-agent framework that enhances decision-making in robotic task planning. The framework uses VLM-based decision agents and an expert agent: the decision agents generate task plans, and the expert agent evaluates them. Zero-sum game theory is introduced to resolve inconsistencies among the agents and determine the optimal solution. Experiments on real robots demonstrate the efficacy of the proposed framework, with an average success rate of 83.3%.
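The abstract does not say how the zero-sum game is set up, so the following is only a minimal sketch of one way such a resolution step could look, not the GameVLM implementation. It assumes two decision agents that each propose a few candidate plans, and an expert agent whose evaluations are collected in a payoff matrix where entry (i, j) is the gain of agent A's plan i against agent B's plan j (and therefore the loss of agent B). The matrix values, the function names, and the restriction to pure strategies are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]

# Hypothetical expert scores: PAYOFF[i][j] is agent A's gain when A proposes
# plan i and B proposes plan j. In a zero-sum game, B's gain is the negative.
PAYOFF: Matrix = [
    [0.2, -0.1, 0.4],
    [0.3, 0.1, 0.5],
    [-0.2, 0.0, 0.1],
]


def maximin_row(payoff: Matrix) -> Tuple[int, float]:
    """Row player (agent A): pick the plan whose worst-case payoff is largest."""
    worst_case = [min(row) for row in payoff]
    best = max(range(len(payoff)), key=lambda i: worst_case[i])
    return best, worst_case[best]


def minimax_col(payoff: Matrix) -> Tuple[int, float]:
    """Column player (agent B): pick the plan whose worst-case loss is smallest."""
    n_cols = len(payoff[0])
    worst_case = [max(row[j] for row in payoff) for j in range(n_cols)]
    best = min(range(n_cols), key=lambda j: worst_case[j])
    return best, worst_case[best]


def resolve(payoff: Matrix) -> Tuple[int, int, bool]:
    """Return both agents' security plans and whether they form a saddle point.

    A saddle point exists in pure strategies when the maximin value equals
    the minimax value; neither agent can then improve by switching plans.
    """
    row, lower = maximin_row(payoff)
    col, upper = minimax_col(payoff)
    return row, col, lower == upper


if __name__ == "__main__":
    row, col, is_saddle = resolve(PAYOFF)
    print(f"agent A plan {row}, agent B plan {col}, saddle point: {is_saddle}")
```

When no pure-strategy saddle point exists, a mixed-strategy solution (for example via linear programming) would be needed; that case is left out of this sketch.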