Enabling robots to understand language instructions and respond appropriately to visual perception has been a long-standing goal in the robotics research community. Achieving this goal requires advances in natural language processing, computer vision, and robotics engineering. This paper investigates the potential of integrating recent Large Language Models (LLMs) with existing visual grounding and robotic grasping systems to enhance the effectiveness of human-robot interaction. We introduce WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration. The system uses ChatGPT as its LLM to summarize the user's preferred object into a target instruction through multi-round interactive dialogue. The target instruction is then forwarded to a visual grounding system, which estimates the pose and size of the object, and the robot grasps the object accordingly. We deploy this LLM-empowered system on a physical robot to provide a more user-friendly interface for the instruction-guided grasping task. Experimental results in various real-world scenarios demonstrate the feasibility and efficacy of the proposed framework.
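The three stages described in the abstract (dialogue summarization, visual grounding, grasping) can be sketched as a simple pipeline. This is only an illustrative outline under our own assumptions, not the implementation used in WALL-E: the names `ObjectEstimate`, `run_pipeline`, and the `stub_*` functions are hypothetical, the stubs stand in for the LLM, the grounding model, and the robot controller, and the 0.08 m gripper opening is an arbitrary placeholder value.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ObjectEstimate:
    """Pose and size of the grounded object (hypothetical fields)."""
    position_m: Tuple[float, float, float]  # object centre in the robot frame
    size_m: Tuple[float, float, float]      # width, depth, height


# Hypothetical stage signatures; each would wrap a real component.
Summarizer = Callable[[List[str]], str]      # dialogue turns -> target instruction
Grounder = Callable[[str], ObjectEstimate]   # target instruction -> pose and size
Grasper = Callable[[ObjectEstimate], bool]   # pose and size -> grasp success


def run_pipeline(dialogue: List[str], summarize: Summarizer,
                 ground: Grounder, grasp: Grasper) -> Tuple[str, bool]:
    """Chain the three stages in the order given in the abstract."""
    instruction = summarize(dialogue)
    estimate = ground(instruction)
    return instruction, grasp(estimate)


def stub_summarize(dialogue: List[str]) -> str:
    # Stand-in for the LLM: treat the last user turn as the target instruction.
    return dialogue[-1].strip()


def stub_ground(instruction: str) -> ObjectEstimate:
    # Stand-in for visual grounding: return a fixed estimate.
    return ObjectEstimate(position_m=(0.4, 0.0, 0.1), size_m=(0.06, 0.06, 0.12))


def stub_grasp(estimate: ObjectEstimate) -> bool:
    # Stand-in for the grasp controller: succeed if the object footprint
    # fits an assumed 0.08 m gripper opening.
    return min(estimate.size_m[:2]) <= 0.08


instruction, ok = run_pipeline(
    ["I am thirsty.", " the red cup "], stub_summarize, stub_ground, stub_grasp
)
```

Passing the stages in as callables keeps the sketch independent of any particular LLM, perception model, or robot, so each stub could be replaced by a real component without changing `run_pipeline`.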