3D affordance grounding aims to highlight actionable regions on 3D objects, a capability crucial for robotic manipulation. Prior research has primarily learned affordance knowledge from static cues such as language and images, which struggle to provide the dynamic interaction context needed to reveal temporal and causal cues. To alleviate this predicament, we collect a comprehensive video-based 3D affordance dataset, \textit{VIDA}, which contains 38K human-object-interaction (HOI) videos covering 16 affordance types and 38 object categories, paired with 22K point clouds. Based on \textit{VIDA}, we propose a strong baseline, VideoAfford, which equips multimodal large language models with affordance segmentation capabilities, enabling both world-knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding, we leverage a latent action encoder to extract dynamic interaction priors from HOI videos. Moreover, we introduce a \textit{spatial-aware} loss function that endows VideoAfford with comprehensive 3D spatial knowledge. Extensive experiments demonstrate that our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities. All datasets and code will be publicly released to advance research in this area.