Remarkable progress has been made in recent years in the fields of vision, language, and robotics. We now have vision models capable of recognizing objects based on language queries, navigation systems that can effectively control mobile platforms, and grasping models that can handle a wide range of objects. Despite these advancements, general-purpose robotic applications still lag behind, even though they rely on these fundamental capabilities of recognition, navigation, and grasping. In this paper, we adopt a systems-first approach to develop a new Open Knowledge-based robotics framework called OK-Robot. By combining Vision-Language Models (VLMs) for object detection, navigation primitives for movement, and grasping primitives for object manipulation, OK-Robot offers an integrated solution for pick-and-drop operations without requiring any training. To evaluate its performance, we run OK-Robot in 10 real-world home environments. The results demonstrate that OK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks, representing a new state-of-the-art in Open Vocabulary Mobile Manipulation (OVMM) with nearly 1.8x the performance of prior work. In cleaner, uncluttered environments, OK-Robot's performance increases to 82%. However, the most important insight gained from OK-Robot is the critical role of nuanced details when combining Open Knowledge systems like VLMs with robotic modules. Videos of our experiments are available on our website: https://ok-robot.github.io
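To make the modular composition described above concrete, the following is a minimal Python sketch of a training-free pick-and-drop loop built from the three primitives the abstract names. All class names, method signatures, and stub behaviors here (VLMDetector, Navigator, Grasper, locate, go_to, pick, drop) are hypothetical stand-ins for illustration, not OK-Robot's actual interfaces.

```python
# Hypothetical sketch of a VLM-detection + navigation + grasping pipeline,
# assuming stand-in module interfaces (not OK-Robot's real API).

from dataclasses import dataclass


@dataclass
class Pose:
    """A 3D position in the robot's map frame."""
    x: float
    y: float
    z: float


class VLMDetector:
    """Stand-in for a VLM-based open-vocabulary detector over a scene map."""

    def locate(self, query: str) -> Pose:
        # A real system would query a semantic map built from VLM features;
        # here we return a fixed dummy pose.
        print(f"[detect] locating '{query}'")
        return Pose(1.0, 2.0, 0.8)


class Navigator:
    """Stand-in for a navigation primitive controlling the mobile base."""

    def go_to(self, target: Pose) -> None:
        print(f"[navigate] moving base near ({target.x}, {target.y})")


class Grasper:
    """Stand-in for a grasping primitive on the arm."""

    def pick(self, target: Pose) -> bool:
        print(f"[grasp] picking object at height {target.z}")
        return True

    def drop(self) -> None:
        print("[grasp] releasing object")


def pick_and_drop(detector: VLMDetector, navigator: Navigator,
                  grasper: Grasper, object_query: str, place_query: str) -> bool:
    """Training-free pick-and-drop: detect, navigate, grasp, navigate, release."""
    object_pose = detector.locate(object_query)
    navigator.go_to(object_pose)
    if not grasper.pick(object_pose):
        return False
    place_pose = detector.locate(place_query)
    navigator.go_to(place_pose)
    grasper.drop()
    return True


if __name__ == "__main__":
    pick_and_drop(VLMDetector(), Navigator(), Grasper(),
                  object_query="the blue mug", place_query="the kitchen sink")
```

The point of the sketch is the composition: each module is an off-the-shelf, open-knowledge component behind a narrow interface, so the pipeline itself requires no task-specific training.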