AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open, unstructured real-world environments. Yet building an affordance foundation model that not only understands where and how an interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge: they either localize task-relevant regions without specifying executable motion, or predict motion with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditioned functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN on three aspects. For affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3. For contact-point prediction, it predicts substantially more accurate points, with a hit-rate gain of 12.7% to 61.3% over the best baseline. For 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for the robot embodiment or using task-specific heuristics, demonstrating its ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
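The abstract reports gIoU, cIoU, and contact-point hit rate without defining them. The sketch below shows one common reading of these metrics and is not taken from the paper: it assumes gIoU is the mean of per-sample IoU, cIoU is cumulative intersection over cumulative union across the test set (the convention used in language-conditioned segmentation work), and a contact point counts as a hit when it falls inside the ground-truth mask. The function names and the nested-list mask format are hypothetical, chosen only for illustration.

```python
# Illustrative metric sketch. Definitions are assumptions, not the paper's code.
# Masks are nested lists of 0/1 values; points are (row, col) pixel indices.

def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def giou_ciou(preds, gts):
    """Return (gIoU, cIoU) over a list of predicted and ground-truth masks."""
    counts = [iou_counts(p, g) for p, g in zip(preds, gts)]
    # gIoU: average the per-sample IoU. An empty prediction against an empty
    # target is scored as 1.0 here, which is an assumption.
    per_sample = [i / u if u > 0 else 1.0 for i, u in counts]
    giou = sum(per_sample) / len(per_sample)
    # cIoU: pool intersections and unions over the whole set, then divide.
    total_inter = sum(i for i, _ in counts)
    total_union = sum(u for _, u in counts)
    ciou = total_inter / total_union if total_union > 0 else 1.0
    return giou, ciou


def hit_rate(points, gts):
    """Fraction of predicted contact points that land inside the target mask."""
    hits = [1 if gt[r][c] else 0 for (r, c), gt in zip(points, gts)]
    return sum(hits) / len(hits)


if __name__ == "__main__":
    pred_a, gt_a = [[1, 1], [0, 0]], [[1, 0], [0, 0]]
    pred_b, gt_b = [[1, 1], [1, 1]], [[1, 1], [1, 1]]
    print(giou_ciou([pred_a, pred_b], [gt_a, gt_b]))
    print(hit_rate([(0, 1), (1, 1)], [gt_a, gt_b]))
```

Under these assumptions gIoU weights every sample equally, while cIoU is dominated by samples with large masks, which is why results of this kind are usually reported as a pair.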

