Humans show an innate ability to identify tools that support specific actions. The association between object parts and the actions they facilitate is usually called affordance. Segmenting object parts according to the tasks they afford is crucial for enabling intelligent robots to use everyday objects. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support only a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP to overcome these limitations by leveraging the implicit affordance knowledge embedded in large pre-trained vision-language models such as CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning in models.