We introduce One-shot Open Affordance Learning (OOAL), in which a model is trained with just one example per base object category but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes, they often struggle with finer-grained concepts such as affordances. To address this, we conduct a comprehensive analysis of existing foundation models to explore their inherent understanding of affordances and to assess their potential for data-limited affordance learning. We then propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models while using less than 1% of the full training data, and exhibits reasonable generalization to unseen objects and affordances.
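To make the idea of aligning visual features with affordance text embeddings concrete, the following is a minimal sketch of one common way such alignment can be turned into a segmentation map: cosine similarity between per-patch visual features and per-affordance text embeddings, followed by a softmax over affordances. This is an illustration under stated assumptions, not the framework proposed in the abstract. The function name `affordance_map`, the `temperature` value, and the random stand-in features are all hypothetical; in practice the features would come from a vision-language foundation model.

```python
import numpy as np

def affordance_map(patch_feats, text_embeds, grid_hw, temperature=0.07):
    """Assign each image patch to the affordance whose text embedding it matches best.

    patch_feats: (H*W, D) visual features, one row per patch (assumed input).
    text_embeds: (K, D) text embeddings, one row per affordance label (assumed input).
    grid_hw:     (H, W) patch grid shape.
    Returns per-patch probabilities of shape (H, W, K) and labels of shape (H, W).
    """
    h, w = grid_hw
    # L2-normalize both sides so the dot product is a cosine similarity.
    v = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature
    # Numerically stable softmax over the affordance axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    labels = probs.argmax(axis=1).reshape(h, w)
    return probs.reshape(h, w, -1), labels

# Stand-in data: 3 affordance embeddings and a 2x2 grid of patches,
# each patch placed close to one of the text embeddings.
rng = np.random.default_rng(0)
text_embeds = rng.normal(size=(3, 16))
true_labels = np.array([0, 1, 2, 1])
patch_feats = text_embeds[true_labels] + 0.01 * rng.normal(size=(4, 16))

probs, labels = affordance_map(patch_feats, text_embeds, grid_hw=(2, 2))
print(labels)
```

Because the text side is just a list of label embeddings, an unseen affordance can be queried by appending its embedding to `text_embeds` without retraining; how well this works in practice depends on how well the underlying features are aligned, which is the question the abstract raises.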