Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open, unstructured real-world environments. Yet building an affordance foundation model that not only understands where and how an interaction should happen but also generalizes across diverse environments, objects, and tasks remains a long-standing research challenge. Existing methods typically address only part of this challenge: they either localize task-relevant regions without specifying executable motion, or predict motion with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects: for affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a hit-rate gain of 12.7% to 61.3% over the best baseline; and for 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
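The abstract reports segmentation gains in gIoU and cIoU without defining them. As a minimal sketch, assuming the definitions commonly used in the reasoning-segmentation literature (gIoU as the mean of per-sample IoUs, cIoU as cumulative intersection over cumulative union), the two metrics could be computed as follows. The function names `iou_counts` and `giou_ciou`, and the handling of empty masks, are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = int(np.logical_and(pred, gt).sum())
    union = int(np.logical_or(pred, gt).sum())
    return inter, union


def giou_ciou(preds, gts):
    """Compute (gIoU, cIoU) over paired lists of binary masks.

    gIoU: mean of the per-sample IoU values.
    cIoU: summed intersection divided by summed union over all samples.
    """
    per_sample = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = iou_counts(pred, gt)
        # Assumption: an empty prediction on an empty target is a perfect match.
        per_sample.append(1.0 if union == 0 else inter / union)
        total_inter += inter
        total_union += union
    giou = float(np.mean(per_sample))
    ciou = total_inter / total_union if total_union > 0 else 1.0
    return giou, ciou


# Toy example: one half-correct mask and one perfect mask.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
giou, ciou = giou_ciou(preds, gts)
print(f"gIoU={giou:.4f}, cIoU={ciou:.4f}")
```

The sketch returns fractions in [0, 1]; the gains quoted in the abstract appear to be on a 0 to 100 scale. Because cIoU pools pixels across samples, it weights large masks more heavily than gIoU does, which is presumably why both are reported.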