Affordance grounding is the task of locating the region of an object with which one can interact. It is a fundamental but challenging task, because a successful solution requires a comprehensive understanding of the scene along several dimensions: detecting, localizing, and recognizing objects and their parts; the geo-spatial configuration and layout of the scene; 3D shapes and physics; and the functionality and potential interactions of objects and humans. Much of this knowledge is hidden, lying beyond what the image content and the supervised labels of a limited training set can provide. In this paper, we attempt to improve the generalization capability of current affordance grounding by taking advantage of the rich world, abstract, and human-object-interaction knowledge in pretrained large-scale vision-language models. On the AGD20K benchmark, our proposed model demonstrates a significant performance gain over competing methods for in-the-wild object affordance grounding. We further demonstrate that it can ground affordances for objects in random Internet images, even when both the objects and the actions are unseen during training. Project site: https://jasonqsy.github.io/AffordanceLLM/