ORACLE-Grasp: Zero-Shot Affordance-Aligned Robotic Grasping using Large Multimodal Models

Grasping unknown objects in unstructured environments is a critical challenge for service robots, which must operate in dynamic, real-world settings such as homes, hospitals, and warehouses. Success in these environments requires both semantic understanding and spatial reasoning. Traditional methods often rely on dense training datasets or detailed geometric modeling, which demand extensive data collection and generalize poorly to novel objects or affordances. We present ORACLE-Grasp, a zero-shot framework that leverages Large Multimodal Models (LMMs) as semantic oracles to guide affordance-aligned grasp selection, without requiring task-specific training or manual input. The system reformulates grasp prediction as a structured, iterative decision process, using a dual-prompt tool-calling strategy: the first prompt extracts high-level object semantics, while the second identifies graspable regions aligned with the object's function. To address the spatial limitations of LMMs, ORACLE-Grasp discretizes the image into candidate regions and reasons over them to produce human-like, context-sensitive grasp suggestions. When depth data is available, a depth-based refinement step improves grasp reliability, and an early-stopping mechanism reduces computational cost. We evaluate ORACLE-Grasp on a diverse set of RGB and RGB-D images featuring both everyday and AI-generated objects. The results show that our method produces physically feasible and semantically appropriate grasps that align closely with human annotations, achieving high success rates in real-world pick-up tasks. Our findings highlight the potential of LMMs to enable flexible and generalizable grasping strategies in autonomous service robots without object-specific models or extensive training.
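To make the discretize-and-iterate idea concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not state the grid size, the number of iterations, or the stopping criterion, so `rows`, `cols`, `max_iters`, and `min_side` are illustrative assumptions. The `choose_cell` callback is a hypothetical stand-in for the second LMM prompt (the one that picks a graspable region); here it is replaced by a stub that simply picks the cell closest to a known target pixel.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; x1, y1 exclusive


def grid_cells(box: Box, rows: int, cols: int) -> List[Box]:
    """Split a box into rows x cols candidate regions in row-major order."""
    x0, y0, x1, y1 = box
    xs = [x0 + (x1 - x0) * c // cols for c in range(cols + 1)]
    ys = [y0 + (y1 - y0) * r // rows for r in range(rows + 1)]
    return [(xs[c], ys[r], xs[c + 1], ys[r + 1])
            for r in range(rows) for c in range(cols)]


def select_grasp_point(
    image_size: Tuple[int, int],
    choose_cell: Callable[[List[Box]], int],  # hypothetical LMM stand-in
    rows: int = 3,          # assumed grid size, not from the paper
    cols: int = 3,
    max_iters: int = 4,     # assumed iteration budget
    min_side: int = 16,     # assumed early-stopping threshold in pixels
) -> Tuple[int, int]:
    """Iteratively narrow the image to one region and return its center."""
    width, height = image_size
    box: Box = (0, 0, width, height)
    for _ in range(max_iters):
        x0, y0, x1, y1 = box
        # Early stop: the region is already small enough to act on.
        if min(x1 - x0, y1 - y0) <= min_side:
            break
        cells = grid_cells(box, rows, cols)
        box = cells[choose_cell(cells)]  # zoom into the chosen region
    x0, y0, x1, y1 = box
    return ((x0 + x1) // 2, (y0 + y1) // 2)


def make_stub_oracle(target: Tuple[int, int]) -> Callable[[List[Box]], int]:
    """Stub oracle: picks the cell whose center is nearest to `target`."""
    tx, ty = target

    def choose(cells: List[Box]) -> int:
        def dist2(b: Box) -> float:
            cx, cy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
            return (cx - tx) ** 2 + (cy - ty) ** 2
        return min(range(len(cells)), key=lambda i: dist2(cells[i]))

    return choose


if __name__ == "__main__":
    # Pretend the functional grasp region (e.g. a handle) is near (500, 120).
    point = select_grasp_point((640, 480), make_stub_oracle((500, 120)))
    print("grasp pixel:", point)
```

In a real system the stub would be replaced by a call that shows the LMM the image with the candidate regions marked and parses the selected index from its tool-call response. If depth data is available, the returned pixel could then be refined against the depth map, a step this sketch leaves out.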
