The rapid advancement of Large Language Models (LLMs) has marked a significant breakthrough in Artificial Intelligence (AI), ushering in a new era of Human-centered Artificial Intelligence (HAI). HAI aims to better serve human welfare and needs, thereby placing higher demands on the intelligence of robots, particularly in natural language interaction, complex task planning, and task execution. Intelligent agents powered by LLMs have opened new pathways toward realizing HAI. However, existing LLM-based embodied agents often lack the ability to plan and execute complex natural language control tasks online. This paper explores the implementation of intelligent robotic manipulation agents based on Vision-Language Models (VLMs) in the physical world. We propose a novel embodied agent framework for robots, comprising a human-robot voice interaction module, a vision-language agent module, and an action execution module. The vision-language agent itself includes a vision-based task planner, a natural language instruction converter, and a task performance feedback evaluator. Experimental results demonstrate that our agent achieves a 28\% higher average task success rate in both simulated and real environments compared to approaches relying solely on LLM+CLIP, significantly improving the execution success rate of high-level natural language instruction tasks.