Affordance segmentation aims to parse 3D objects into functionally distinct parts, bridging recognition and interaction for applications in robotic manipulation, embodied AI, and AR. While recent studies leverage visual or textual prompts to guide this process, they often rely on point cloud encoders as generic feature extractors, overlooking the intrinsic challenges of 3D data such as sparsity, noise, and geometric ambiguity. As a result, 3D features learned in isolation frequently lack clear and semantically consistent functional boundaries. To address this bottleneck, we propose a semantic-grounded learning paradigm that transfers rich semantic knowledge from large-scale 2D Vision Foundation Models (VFMs) into the 3D domain. Specifically, we introduce Cross-Modal Affinity Transfer (CMAT), a pre-training strategy that aligns a 3D encoder with lifted 2D semantics and jointly optimizes reconstruction, affinity, and diversity objectives to yield semantically organized representations. Building on this backbone, we further design the Cross-modal Affordance Segmentation Transformer (CAST), which integrates multi-modal prompts with CMAT-pretrained features to generate precise, prompt-aware segmentation maps. Extensive experiments on standard benchmarks demonstrate that our framework establishes new state-of-the-art results for 3D affordance segmentation.
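As a rough illustration of the joint optimization mentioned above (not an equation taken from the paper), the CMAT pre-training objective can be viewed as a weighted sum of the three terms; the weights $\lambda_{\text{aff}}$ and $\lambda_{\text{div}}$ are hypothetical symbols introduced here only to make the combination explicit:
\[
\mathcal{L}_{\text{CMAT}} \;=\; \mathcal{L}_{\text{recon}} \;+\; \lambda_{\text{aff}}\,\mathcal{L}_{\text{affinity}} \;+\; \lambda_{\text{div}}\,\mathcal{L}_{\text{diversity}},
\]
where the reconstruction term anchors the 3D encoder to the input geometry, the affinity term aligns its features with the lifted 2D VFM semantics, and the diversity term discourages collapsed, redundant representations.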