The development of large language models (LLMs) and vision-language models (VLMs) has led to the increasing use of robotic systems in various fields. However, effectively integrating these models into real-world robotic tasks remains a key challenge. We developed a versatile robotic system, SuctionPrompt, that combines VLM prompting techniques with 3D detection to perform product-picking tasks in diverse and dynamic environments. Our method highlights the importance of integrating 3D spatial information with adaptive action planning to enable robots to approach and manipulate objects in novel environments. In validation experiments, the system selected suction points with 75.4% accuracy and achieved a 65.0% success rate in picking common items. This study demonstrates the effectiveness of VLMs in robotic manipulation tasks, even with simple 3D processing.