CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

When told to "cut the apple," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.

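The abstract describes Instance-bounded Cross Injection (ICI) only at a high level: language-geometry alignment is constrained to stay within object boundaries. As a rough illustration of that idea, and not the authors' implementation, the sketch below normalizes point-to-text similarity separately inside each object instance, so that scores on one object cannot influence the weights on another. The function name `instance_bounded_attention`, the use of a single sentence embedding, and the per-instance softmax are all assumptions made for this example; the abstract does not specify the actual mechanism.

```python
import numpy as np


def instance_bounded_attention(point_feats, text_feat, instance_ids):
    """Softmax over point-text similarity, computed separately per instance.

    Hypothetical helper for illustration only (not from the paper).

    point_feats:  (N, D) array of per-point features.
    text_feat:    (D,) embedding of the intent instruction.
    instance_ids: (N,) integer object id for every point.
    Returns an (N,) array of weights that sum to 1 within each instance.
    """
    # Scaled dot-product similarity between every point and the instruction.
    scores = point_feats @ text_feat / np.sqrt(point_feats.shape[1])
    weights = np.zeros_like(scores, dtype=float)
    for inst in np.unique(instance_ids):
        mask = instance_ids == inst
        s = scores[mask]
        # Numerically stable softmax restricted to the points of one object.
        e = np.exp(s - s.max())
        weights[mask] = e / e.sum()
    return weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(6, 8))      # six points, 8-dim features (toy data)
    text = rng.normal(size=8)            # toy instruction embedding
    ids = np.array([0, 0, 0, 1, 1, 1])   # e.g. knife points vs. scissors points
    print(instance_bounded_attention(feats, text, ids))
```

Because the normalization never crosses an instance boundary, changing the features of the confusable object leaves the weights on the target object unchanged, which is the "no cross-object leakage" property the abstract attributes to ICI.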