Affordance grounding is the task of finding the region of an object with which one can interact. It is fundamental but challenging, because a successful solution requires a comprehensive understanding of the scene in several respects: detecting, localizing, and recognizing objects and their parts; the geo-spatial configuration and layout of the scene; 3D shape and physics; and the functionality of objects and their potential interactions with humans. Much of this knowledge is hidden: it lies beyond the image content and the supervised labels of a limited training set. In this paper, we attempt to improve the generalization capability of current affordance grounding by taking advantage of the rich world, abstract, and human-object-interaction knowledge in pretrained large-scale vision-language models. On the AGD20K benchmark, our proposed model demonstrates a significant performance gain over competing methods for in-the-wild object affordance grounding. We further demonstrate that it can ground affordances for objects in random Internet images, even when both the objects and the actions are unseen during training. Project site: https://jasonqsy.github.io/AffordanceLLM/