Grounding 3D object affordance seeks to locate the "action possibility" regions of objects in 3D space, serving as a link between perception and operation for embodied agents. Existing studies primarily focus on connecting visual affordances with geometric structures, e.g., relying on annotations to mark interactive regions of interest on an object and establishing a mapping between those regions and affordances. However, the essence of learning an object's affordance is understanding how to use the object, and approaches that detach affordance from interaction are limited in generalization. Humans, by contrast, can perceive object affordances in the physical world through demonstration images or videos. Motivated by this, we introduce a novel task setting: grounding 3D object affordance from 2D interactions in images, which faces the challenge of anticipating affordance through interactions from different sources. To address this problem, we devise a novel Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region features of objects from different sources and models the interactive contexts for 3D object affordance grounding. In addition, we collect a Point-Image Affordance Dataset (PIAD) to support the proposed task. Comprehensive experiments on PIAD demonstrate the reliability of the proposed task and the superiority of our method. The project is available at https://github.com/yyvhang/IAGNet.
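To make the input and output of the task setting concrete, the following is a minimal sketch, not the IAG architecture itself. It assumes that per-point features and image region features have already been extracted (random arrays stand in for them here), and it uses plain scaled dot-product cross-attention as a stand-in for the alignment step. The function name `ground_affordance`, the feature sizes, and the weight vector `w_out` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def ground_affordance(point_feats, image_feats, w_out):
    """Illustrative stand-in for 2D-to-3D affordance grounding.

    point_feats: (N, D) features, one per 3D point (assumed precomputed).
    image_feats: (M, D) features, one per image region (assumed precomputed).
    w_out:       (D,) hypothetical projection to a single affordance logit.
    Returns an (N,) array of per-point affordance scores in [0, 1].
    """
    d = point_feats.shape[1]
    # Each point attends over the image regions (scaled dot-product attention).
    attn = softmax(point_feats @ image_feats.T / np.sqrt(d), axis=1)  # (N, M)
    # Interaction context gathered from the image for every point.
    context = attn @ image_feats  # (N, D)
    # Fuse geometry and interaction context, then score each point.
    fused = point_feats + context
    logits = fused @ w_out
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid


# Random placeholders; a real pipeline would use learned feature extractors.
rng = np.random.default_rng(0)
point_feats = rng.normal(size=(2048, 64))   # 2048 points, 64-dim features
image_feats = rng.normal(size=(49, 64))     # 7x7 grid of image regions
w_out = rng.normal(size=64) * 0.1

scores = ground_affordance(point_feats, image_feats, w_out)
print(scores.shape)  # one score per point
```

The point of the sketch is the shape of the problem: the image supplies interaction evidence, the point cloud supplies geometry, and the output is a heatmap over the 3D points rather than a class label.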