Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We present ThinkGrasp, a plug-and-play vision-language grasping system that leverages GPT-4o's advanced contextual reasoning to plan grasping strategies in heavy clutter. ThinkGrasp can effectively identify and generate grasp poses for target objects, even when they are heavily occluded or nearly invisible, by using goal-oriented language to guide the removal of obstructing objects. This approach progressively uncovers the target object and ultimately grasps it in a few steps with a high success rate. In both simulated and real-world experiments, ThinkGrasp achieved high success rates and significantly outperformed state-of-the-art methods in heavily cluttered environments and with diverse unseen objects, demonstrating strong generalization capabilities.
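The iterative strategy described above, grasp the target directly if it is reachable, otherwise remove an obstructing object and re-observe, can be sketched as a simple loop. This is a hypothetical illustration, not ThinkGrasp's actual implementation: the names `choose_object` and `grasp_until_target` are invented, and a naive heuristic stands in for GPT-4o's reasoning over the scene image and language goal.

```python
# Hypothetical sketch of a language-guided clutter-clearing loop.
# The VLM reasoning step is mocked; all names here are illustrative.

def choose_object(visible, target):
    """Stand-in for the vision-language reasoning step: grasp the
    target if it is visible, otherwise pick an obstruction to remove."""
    if target in visible:
        return target
    return visible[0]  # naive heuristic in place of GPT-4o reasoning


def grasp_until_target(scene, target, max_steps=10):
    """Progressively remove obstructions until the target is grasped.
    `scene` maps each object to the set of objects occluding it."""
    steps = []
    for _ in range(max_steps):
        # An object is "visible" once nothing still in the scene occludes it.
        visible = [obj for obj, occluders in scene.items()
                   if not occluders & scene.keys()]
        pick = choose_object(visible, target)
        steps.append(pick)
        del scene[pick]  # a real system would execute the grasp here
        if pick == target:
            return steps
    return steps


# Toy scene: the cup is hidden under a box, which sits under a towel.
scene = {"towel": set(), "box": {"towel"}, "cup": {"box"}}
print(grasp_until_target(scene, "cup"))  # removes towel, box, then cup
```

Each iteration re-derives what is visible, mirroring how the system re-observes the scene after every grasp so that newly uncovered objects can be considered.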