Grasping an object by a specific part is often crucial for safety and for executing downstream tasks. Yet learning-based grasp planners lack this behavior unless they are trained on object-part data, which makes it difficult to scale to diverse objects. Instead, we propose LERF-TOGO, Language Embedded Radiance Fields for Task-Oriented Grasping of Objects, which uses vision-language models zero-shot to output a grasp distribution over an object given a natural language query. To accomplish this, we first reconstruct a LERF of the scene, which distills CLIP embeddings into a multi-scale 3D language field that can be queried with text. However, LERF has no sense of objectness, so its relevancy outputs often contain incomplete activations over an object, which are insufficient for subsequent part queries. LERF-TOGO mitigates this lack of spatial grouping by extracting a 3D object mask via DINO features and then conditionally querying LERF on this mask to obtain a semantic distribution over the object, which is used to rank grasps from an off-the-shelf grasp planner. We evaluate LERF-TOGO's ability to grasp task-oriented object parts on 31 different physical objects and find that it selects grasps on the correct part in 81% of all trials and grasps successfully in 69%. See the project website at: lerftogo.github.io
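The final ranking step described above (restrict the language relevancy to the object mask, then use it to score candidate grasps) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Grasp`, `masked_relevancy`, `semantic_score`, and `rank_grasps` are hypothetical, the point cloud, relevancy values, and mask are assumed to be precomputed inputs, and multiplying the semantic score by the planner's quality is only one plausible way to combine the two signals.

```python
import math
from dataclasses import dataclass


@dataclass
class Grasp:
    # Hypothetical container for one candidate from a grasp planner.
    position: tuple  # grasp center (x, y, z) in meters
    quality: float   # planner confidence, assumed to lie in [0, 1]


def masked_relevancy(relevancy, object_mask):
    """Zero out relevancy outside the object mask and rescale so the peak is 1."""
    masked = [r if m else 0.0 for r, m in zip(relevancy, object_mask)]
    peak = max(masked, default=0.0)
    if peak <= 0.0:
        return masked
    return [r / peak for r in masked]


def semantic_score(grasp, points, weights, radius=0.02):
    """Mean weight of the points lying within `radius` of the grasp center."""
    nearby = [
        w for p, w in zip(points, weights)
        if math.dist(p, grasp.position) <= radius
    ]
    return sum(nearby) / len(nearby) if nearby else 0.0


def rank_grasps(grasps, points, relevancy, object_mask, radius=0.02):
    """Return (score, grasp) pairs sorted from best to worst.

    Assumption: score = semantic score * planner quality.
    """
    weights = masked_relevancy(relevancy, object_mask)
    scored = [
        (semantic_score(g, points, weights, radius) * g.quality, g)
        for g in grasps
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In this sketch, a point with high relevancy that lies outside the object mask contributes nothing, so a grasp near a distractor cannot outrank a grasp on the queried part of the masked object, even if the planner rates the distractor grasp as more stable.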