Enabling robots to understand language instructions and respond appropriately based on visual perception has been a long-standing goal in the robotics research community. Achieving this goal requires advances in natural language processing, computer vision, and robotics engineering. This paper therefore investigates the potential of integrating recent Large Language Models (LLMs) with existing visual grounding and robotic grasping systems to make human-robot interaction more effective. We introduce WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration. The system uses ChatGPT as its LLM to summarize the user's preferred object into a target instruction through multi-round interactive dialogue. The target instruction is then forwarded to a visual grounding system for object pose and size estimation, after which the robot grasps the object accordingly. We deploy this LLM-empowered system on a physical robot to provide a more user-friendly interface for the instruction-guided grasping task. Experimental results in various real-world scenarios demonstrate the feasibility and efficacy of the proposed framework. See the project website at: https://star-uu-wang.github.io/WALL-E/