This paper addresses the problem of affordance grounding from RGBD images of an object, which aims to localize the surface regions corresponding to a text query describing an action on the object. Whereas existing methods predict affordance regions only on visible surfaces, we propose Affostruction, a generative framework that reconstructs complete geometry from partial observations and grounds affordances on the full shape, including unobserved regions. We make three core contributions: generative multi-view reconstruction via sparse voxel fusion, which extrapolates unseen geometry while maintaining constant token complexity; flow-based affordance grounding, which captures the inherent ambiguity of affordance distributions; and affordance-driven active view selection, which leverages predicted affordances for intelligent viewpoint sampling. Affostruction achieves 19.1 aIoU on affordance grounding (a 40.4\% improvement) and 32.67 IoU on 3D reconstruction (a 67.7\% improvement), enabling accurate affordance prediction on complete shapes.